I’ve just come across the following article, which compares various GPUs extensively across different programs and might be of interest for simulations.
It is interesting that the speedup from the GTX 980 to the RTX 5090 is about 10× for FluidX3D (a lattice-Boltzmann CFD code), but only 4-5× for NAMD MD calculations on 0.3-million-atom (ATPase) and 1-million-atom (mosaic virus) systems. I guess the latter may be harder to scale because of the irregular nature of the particle data.
(BTW, I used a GTX 980 for MD simulations around 2018, which was about $500 at the time. Recent GPU models seem very expensive, though the performance is also great… )
Timing results for a small run of the HipFT code on various CPUs and GPUs. The code uses standard Fortran “do concurrent” to run in parallel on CPUs and to offload to GPUs. For NVIDIA GPUs, unified memory is used, while for Intel GPUs, target directives are needed for data movement only. A neat result here is the new $250 Intel Arc B580 GPU with its FP64 cores performing right about where it should based on its memory bandwidth (HipFT is memory bandwidth bound).
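For anyone who hasn’t used it, here is roughly what that looks like: a minimal sketch (not actual HipFT code; the subroutine name, array shapes, and stencil are made up) of a diffusion-style update written with standard Fortran do concurrent. With nvfortran, `-stdpar=gpu` offloads the loop to the GPU via unified memory, while `-stdpar=multicore` runs it in parallel on the CPU.

```fortran
! Minimal do concurrent sketch (illustrative only, not the HipFT kernel):
! one explicit diffusion step over the interior of a 2-D field.
subroutine diffuse(f, fnew, n, m, c)
  implicit none
  integer, intent(in)  :: n, m
  real(8), intent(in)  :: f(n,m), c
  real(8), intent(out) :: fnew(n,m)
  integer :: i, j
  ! The same loop runs in parallel on CPU or GPU depending on compiler flags.
  do concurrent (j = 2:m-1, i = 2:n-1)
    fnew(i,j) = f(i,j) + c*(f(i-1,j) + f(i+1,j) + f(i,j-1) + f(i,j+1) &
                            - 4.0d0*f(i,j))
  end do
end subroutine diffuse
```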
Something interesting to note for these benchmarks is the performance limitation of the algorithm, i.e. is the algorithm memory bound or compute bound?
If your algorithm is compute bound, you’ll see a super duper increase in performance as you move across GPUs. Basically, take the example of a DGEMM: I can bet my monthly salary that if I get a 1080, 2080, 3080, 4080, 5080, V100, A100, and H100 and run a DGEMM on each, I am going to see a beautiful trend of performance nearly doubling every architecture.
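As a rough sketch of that experiment (the matrix size and setup are my own; link against any BLAS, or a cuBLAS wrapper on the GPUs): a DGEMM does about 2n³ flops while touching only about 3n²·8 bytes of data, so its arithmetic intensity grows with n and large matrices end up squarely compute bound.

```fortran
! DGEMM bench sketch: ~2n^3 flops over ~3n^2 * 8 bytes of data,
! so for large n the run is limited by compute, not bandwidth.
program dgemm_bench
  implicit none
  integer, parameter :: n = 4096
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  integer(8) :: t0, t1, rate
  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a); call random_number(b); c = 0.0d0
  call system_clock(t0, rate)
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  call system_clock(t1)
  print '(a,f10.1,a)', 'DGEMM: ', &
        2.0d0*real(n,8)**3/(real(t1-t0,8)/real(rate,8))/1.0d9, ' GFLOP/s'
end program dgemm_bench
```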
If the code is memory bound, the improvement will depend on the innovations in memory speeds, throughput, caches, etc. FluidX3D, being a CFD code, will have a memory limitation, so that benchmark is a very effective test of memory!
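To make the memory-bound case concrete, here is a minimal STREAM-style triad sketch (array size and names are my own): each iteration moves 24 bytes for 2 flops, so the GB/s number it prints tracks sustained memory bandwidth, not FLOPs, on essentially any modern chip.

```fortran
! STREAM-style triad sketch: 2 flops per 24 bytes moved, so the
! measured rate is set by memory bandwidth.
program triad
  implicit none
  integer, parameter :: n = 100000000
  real(8), allocatable :: a(:), b(:), c(:)
  integer(8) :: t0, t1, rate
  integer :: i
  allocate(a(n), b(n), c(n))
  b = 1.0d0; c = 2.0d0
  call system_clock(t0, rate)
  do concurrent (i = 1:n)
    a(i) = b(i) + 3.0d0*c(i)
  end do
  call system_clock(t1)
  ! 3 arrays x 8 bytes per element moved each iteration
  print '(a,f8.1,a)', 'Triad: ', &
        24.0d0*real(n,8)/(real(t1-t0,8)/real(rate,8))/1.0d9, ' GB/s'
end program triad
```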
For MD that’s interesting; it is a weird algorithm overall, and depending on how you calculate the forces you could be compute bound. So it does not really surprise me that the speedup is not as big, whereas the other app naturally benefits a lot from memory improvements.
The code is highly memory bandwidth bound.
However, since the code uses “do concurrent”, it is hard to implement custom caching, as that is mostly in the compiler’s hands.
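For contrast, here is an illustrative sketch (a hypothetical kernel, not from HipFT) of the kind of explicit caching you can write when you do have control, e.g. staging a 1-D smoothing stencil through shared memory in CUDA Fortran; standard do concurrent has no portable way to express this staging, so it is left to the compiler.

```fortran
! Hypothetical CUDA Fortran kernel (not from HipFT): explicit staging of
! a 1-D smoothing stencil through shared memory.
module smooth_mod
  use cudafor
contains
  attributes(global) subroutine smooth(a, b, n)
    real(8) :: a(n), b(n)
    integer, value :: n
    real(8), shared :: tile(0:257)        ! assumes 256-thread blocks + halo
    integer :: i, t
    t = threadIdx%x                       ! 1-based thread index
    i = (blockIdx%x - 1)*blockDim%x + t
    if (i <= n) tile(t) = a(i)            ! each thread caches one element
    if (t == 1 .and. i > 1) tile(0) = a(i-1)              ! left halo
    if (t == blockDim%x .and. i < n) tile(t+1) = a(i+1)   ! right halo
    call syncthreads()
    if (i > 1 .and. i < n) &
      b(i) = 0.25d0*(tile(t-1) + 2.0d0*tile(t) + tile(t+1))
  end subroutine smooth
end module smooth_mod
```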