Just curious, in your research/simulations, have you tried using a GPU?
If so, how large is the GPU speedup compared with a Fortran CPU code?
Of course it depends on the structure of the simulation.
My understanding is that as long as the code can be highly vectorized, for example by using matrix operations instead of loops, a GPU can speed up such operations.
But I am curious to see for what kinds of simulations a GPU has a big advantage over Fortran code on a CPU.
Can I say that if a GPU costs $2000 and a CPU costs $500, then for some simulations I can roughly expect the GPU to be 2000/500 = 4 times faster?
Generally speaking, GPUs are good at workloads with contiguous memory access and a large amount of computation. The measure of the amount of computation is flop/byte, that is, the number of floating-point operations performed per byte of data read from memory.
Of course, it is critical to be able to compute in parallel.
In computational fluid dynamics, I often see studies that implement lattice Boltzmann methods on GPUs because these methods can be computed entirely in parallel.
Also, the phase-field method based on the Cahn-Hilliard equation is well suited to GPUs (easy to speed up relative to a CPU) because it has a high flop/byte ratio.
I implemented a program that computes compressible flows using the compact finite difference method on a GPU. It is roughly 40 times faster than the CPU version, which uses Intel MKL.
We cannot estimate performance differences between a CPU and a GPU from their prices, but estimating how fast a GPU can be relative to a CPU from their specifications is simple.
Consider a GPU with a theoretical peak performance of 1030 GFLOPS (single-precision floating-point operations per second) and a memory bandwidth of 148 GB/s, and a CPU with a peak performance of 70 GFLOPS and a bandwidth of 32 GB/s. If the program is memory-bound, the GPU is about five times faster than the CPU (148/32 ≈ 4.6). If the program is compute-bound, the GPU is about 15 times faster (1030/70 ≈ 14.7). Those are the theoretical limits for the same algorithm.
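The estimate above can be reproduced with a few lines of Fortran. This is only a sketch of the usual "roofline" bound, attainable GFLOPS = min(peak, bandwidth × flop/byte). The hardware numbers are the ones quoted in the post. The vector-add kernel and its intensity of 1 flop per 12 bytes are an illustrative assumption, not something measured.

```fortran
program roofline_estimate
  implicit none
  ! Hardware figures quoted in the post (single precision).
  real, parameter :: gpu_gflops = 1030.0, gpu_gbs = 148.0
  real, parameter :: cpu_gflops = 70.0,   cpu_gbs = 32.0
  ! Illustrative kernel (an assumption): c(i) = a(i) + b(i) in single
  ! precision does 1 flop while moving 12 bytes (two loads, one store).
  real, parameter :: intensity = 1.0 / 12.0   ! flop/byte

  print '(a,f7.2)', 'memory-bound speedup  (bandwidth ratio): ', gpu_gbs / cpu_gbs
  print '(a,f7.2)', 'compute-bound speedup (FLOPS ratio):     ', gpu_gflops / cpu_gflops
  ! Kernels below this flop/byte value are limited by memory bandwidth.
  print '(a,f7.2)', 'GPU balance point [flop/byte]: ', gpu_gflops / gpu_gbs
  print '(a,f7.2)', 'CPU balance point [flop/byte]: ', cpu_gflops / cpu_gbs
  print '(a,f7.2)', 'vector add bound on GPU [GFLOPS]: ', &
       attainable(gpu_gflops, gpu_gbs, intensity)
  print '(a,f7.2)', 'vector add bound on CPU [GFLOPS]: ', &
       attainable(cpu_gflops, cpu_gbs, intensity)
contains
  ! Upper bound on performance for a kernel with intensity ai [flop/byte].
  pure real function attainable(peak, bw, ai)
    real, intent(in) :: peak, bw, ai
    attainable = min(peak, bw * ai)
  end function attainable
end program roofline_estimate
```

For a low-intensity kernel such as the vector add, both bounds come from the bandwidth term, so the ratio between them is the bandwidth ratio rather than the FLOPS ratio.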
I used GPUs from 2008 to 2016. Around 2010 there were many reports claiming that GPUs were 100 times faster than CPUs. But around 2015, direct comparisons between GPUs and CPUs were no longer being made. Instead, the focus was on how much performance was achieved relative to the peak performance or bandwidth.
In the case I mentioned in my post above, Intel MKL was not suitable for the algorithm of the compact finite difference method. So I implemented a suitable algorithm on a GPU and achieved that speedup.
On average, a GPU's bandwidth could be 5-10 times larger than that of, say, DDR4-3200, which is PC4-25600, so 3200 MT/s × 8 bytes = 25.6 GB/s per channel.
I am mostly interested in trying Monte Carlo (MC) simulations on a GPU. MC uses a lot of walkers to represent a distribution, which is typically well suited to MPI. I am just not sure whether, and by how much, a GPU can accelerate MC.
May I ask, when you do GPU computing, do you have to try your best to write the code using matrix operations (like matrix multiplications) as much as possible, instead of using loops?
@CRquantum Here’s some interesting data on the Summit supercomputer at ORNL (taken from my master’s thesis).
The Summit supercomputer is designed using the “fat node” paradigm – each Summit
compute node is equipped with two 22-core IBM POWER9 CPUs and six NVIDIA “Volta”
V100 GPUs. An important fact is that the performance of the system mainly comes
from the NVIDIA GPUs, which each contribute ∼7 TFLOP/s, compared to the CPUs,
which each contribute only ∼500 GFLOP/s*. This means ∼98% of the
performance on Summit comes from its GPUs.
*This number is rounded from the theoretical maximum, using 8 FLOPs per core per clock cycle, 21 usable cores (1 core is reserved for system processes), and a 3.07 GHz base clock frequency.
That just tells you how much more raw performance the GPUs deliver compared to the CPUs.
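As a quick check of the arithmetic in the quoted passage, here is a small sketch that recomputes the per-CPU peak and the GPU share of one node from the figures given there (8 FLOPs per cycle, 21 cores, 3.07 GHz, six GPUs at ∼7 TFLOP/s, two CPUs at ∼0.5 TFLOP/s). The variable names are my own.

```fortran
program summit_node_peak
  implicit none
  ! Figures taken from the quoted passage and its footnote.
  real, parameter :: flops_per_cycle = 8.0   ! per core
  real, parameter :: usable_cores    = 21.0
  real, parameter :: clock_ghz       = 3.07
  real, parameter :: gpu_tflops      = 7.0   ! per GPU, rounded
  real, parameter :: cpu_tflops      = 0.5   ! per CPU, rounded
  integer, parameter :: gpus_per_node = 6, cpus_per_node = 2
  real :: cpu_peak_gflops, node_gpu, node_cpu

  ! Theoretical per-CPU peak before rounding to ~500 GFLOP/s.
  cpu_peak_gflops = flops_per_cycle * usable_cores * clock_ghz
  node_gpu = gpus_per_node * gpu_tflops
  node_cpu = cpus_per_node * cpu_tflops

  print '(a,f8.2)', 'per-CPU peak [GFLOP/s]:     ', cpu_peak_gflops
  print '(a,f8.2)', 'GPU peak per node [TFLOP/s]:', node_gpu
  print '(a,f8.2)', 'CPU peak per node [TFLOP/s]:', node_cpu
  print '(a,f8.2)', 'GPU share of node peak [%]: ', &
       100.0 * node_gpu / (node_gpu + node_cpu)
end program summit_node_peak
```

The per-CPU value comes out at roughly 516 GFLOP/s and the GPU share at roughly 97.7%, consistent with the rounded ∼500 GFLOP/s and ∼98% in the text.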
By the way, I use OpenACC to offload Fortran code to the GPU. I also plan to start learning OpenMP target offload in depth in the near future.
Depends on the platform. For NVIDIA GPUs I use nvfortran (the newest version available), and for AMD GPUs I use gfortran 10 or newer, since flang can’t process OpenACC yet. I heard the Cray compiler also supports OpenACC, but I haven’t used it so I can’t say much.
If your code can be written with Fortran’s intrinsic procedures or BLAS subroutines, you can use the intrinsic procedures overloaded by nvfortran or the device functions (functions that run on GPUs) provided by cuBLAS. In addition, you can use libraries provided by NVIDIA or others.
When using OpenACC, you can write code with array operations like c = a + b instead of a loop.
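A minimal sketch of such an array operation offloaded with OpenACC is given here. The program name, array size, and data clauses are my own choices for illustration. Because the directives are comments, the same file also builds and runs on the CPU with any Fortran compiler. To offload, compile with OpenACC enabled, e.g. nvfortran -acc or gfortran -fopenacc.

```fortran
program acc_vecadd
  implicit none
  integer, parameter :: n = 1000000
  real, allocatable :: a(:), b(:), c(:)

  allocate(a(n), b(n), c(n))
  a = 1.0
  b = 2.0

  ! Copy a and b to the device, evaluate the array expression there,
  ! and copy c back to the host at the end of the region.
  !$acc kernels copyin(a, b) copyout(c)
  c = a + b
  !$acc end kernels

  ! 1.0 + 2.0 is exact in floating point, so an equality check is safe.
  if (any(c /= 3.0)) error stop 'unexpected result'
  print *, 'c(1) =', c(1), ' c(n) =', c(n)
end program acc_vecadd
```

Note that for a single vector add the host-device copies will likely dominate the run time. In a real code you would keep the arrays on the device across many such operations, for example with an enclosing data region.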
If you want to write kernels (procedures that run on GPUs) using CUDA Fortran, a Fortran binding of CUDA provided by nvfortran, you have to think in terms of loops. This is because a CUDA Fortran kernel contains the operations executed by a single thread, i.e. the body of one loop iteration, like below:
i = (blockIdx%x-1)*blockDim%x + threadIdx%x
! skip threads whose index falls past the end of the arrays
if (i <= N) c(i) = a(i) + b(i)
The code above is the CUDA Fortran version of the following Fortran code:
do i = 1, N
    c(i) = a(i) + b(i)
end do
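For completeness, here is how the two kernel lines might sit inside a full CUDA Fortran program. This is a sketch only: the module and program names, the array size, and the block size of 256 threads are assumptions of mine, and it needs nvfortran with CUDA Fortran enabled (a .cuf file extension or the -cuda flag).

```fortran
module vecadd_kernels
  use cudafor
  implicit none
contains
  ! Kernel: each thread handles one element of the arrays.
  attributes(global) subroutine vecadd(a, b, c, n)
    integer, value    :: n
    real, intent(in)  :: a(n), b(n)
    real, intent(out) :: c(n)
    integer :: i
    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    ! The last block may contain threads beyond n; they do nothing.
    if (i <= n) c(i) = a(i) + b(i)
  end subroutine vecadd
end module vecadd_kernels

program cuf_vecadd
  use cudafor
  use vecadd_kernels
  implicit none
  integer, parameter :: n = 1000000, threads = 256
  real, allocatable         :: a(:), b(:), c(:)        ! host arrays
  real, device, allocatable :: a_d(:), b_d(:), c_d(:)  ! device arrays
  integer :: blocks

  allocate(a(n), b(n), c(n))
  allocate(a_d(n), b_d(n), c_d(n))
  a = 1.0
  b = 2.0

  ! Assignment between host and device arrays performs the transfer.
  a_d = a
  b_d = b

  ! Enough blocks to cover all n elements.
  blocks = (n + threads - 1) / threads
  call vecadd<<<blocks, threads>>>(a_d, b_d, c_d, n)

  c = c_d
  if (any(c /= 3.0)) error stop 'unexpected result'
  print *, 'c(1) =', c(1), ' c(n) =', c(n)
end program cuf_vecadd
```

Compared with the OpenACC route, the data movement, the launch configuration, and the bounds check are all written by hand here.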
I don’t recommend writing kernels by hand like this unless the purpose of your research or business is to parallelize the code on the GPU.