Here is a code I have been developing: GitHub - CaNS-World/CaNS: A code for fast, massively-parallel direct numerical simulations (DNS) of canonical flows. It supports GPU offloading on NVIDIA hardware.
However, it is considerably more complex than the code in the link above. It runs on many GPUs and can use MPI as a backend (in addition to other NVIDIA-specific backends for distributed-memory calculations, such as NCCL and NVSHMEM, via the cuDecomp library). Feel free to use it for benchmarking; if you run into any issues or need help finding a representative problem, I'm here to help. Otherwise, I am sorry, but I do not have the time to come up with simpler examples for your benchmarks.