How much can a GPU speed up a Fortran CPU code?

Hey guys,

Just curious: in your research/simulations, have you tried using a GPU?
If so, how much speedup did the GPU give compared with a Fortran CPU code?
Of course it depends on the structure of the simulation.
My understanding is that as long as the code can be highly vectorized, like using matrix operations instead of loops, then a GPU can boost such operations.
But I am just curious to see for what kinds of simulations a GPU can have a big advantage over Fortran code running on a CPU.

Can I say that if I have a GPU that costs $2000 and a CPU that costs $500, then for some simulations I may roughly expect the GPU to be 2000/500 = 4 times faster?

3 Likes

@CRquantum
Generally speaking, GPUs are good at processing with sequential memory access and a large amount of computation. The measure of the amount of computation is flop/byte, that is, how many floating-point operations are performed for each byte of data fetched from memory.
Of course, it is also critical that the computation can be done in parallel.
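
To make the flop/byte idea concrete, here is a rough illustration (a sketch of my own, not from any particular code) of why a simple axpy-style loop is memory-bound:

! Estimating flop/byte for a double-precision axpy-style update:
!   flops per element : 2    (one multiply + one add)
!   bytes per element : 24   (read x(i), read y(i), write y(i); 8 bytes each)
!   flop/byte ~ 2/24 ~ 0.08  -> strongly memory-bound
subroutine axpy_like(n, a, x, y)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: a, x(n)
  real(8), intent(inout) :: y(n)
  integer :: i
  do i = 1, n
     y(i) = a*x(i) + y(i)
  end do
end subroutine axpy_like

High-flop/byte methods, like the ones below, reuse each loaded byte in many operations, which is where GPUs shine.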

In Computational Fluid Dynamics, I often see research using lattice Boltzmann methods implemented on GPUs because the method can be computed entirely in parallel.
Also, the Phase-Field method based on the Cahn-Hilliard equation is well suited to GPUs (easy to speed up relative to a CPU) because it has a high flop/byte ratio.

I implemented a program computing compressible flows using the compact finite difference method on a GPU. The program is roughly 40 times faster than the CPU version that uses Intel MKL.

4 Likes

@CRquantum
We cannot simply estimate the performance difference between a CPU and a GPU from their prices, but estimating how fast a GPU can be relative to a CPU is simple.

When comparing a GPU with a theoretical peak performance (single-precision floating-point operations per second, FLOPS) of 1030 GFLOPS and a memory bandwidth of 148 GB/s against a CPU with 70 GFLOPS and 32 GB/s: if the program is memory-bound, the GPU is about five times faster than the CPU; if the program is compute-bound, the GPU is about 15 times faster. Those are the theoretically achievable goals under the same algorithm.
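
Writing the same estimate out as ratios of the hardware numbers above (just arithmetic, rounded):

$$
\text{memory-bound: } \frac{148\ \text{GB/s}}{32\ \text{GB/s}} \approx 4.6\times,\qquad
\text{compute-bound: } \frac{1030\ \text{GFLOPS}}{70\ \text{GFLOPS}} \approx 14.7\times
$$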

I used GPUs from 2008 to 2016. Around 2010 there were many reports claiming GPUs were 100 times faster than CPUs, but by around 2015 direct comparisons between GPUs and CPUs had largely disappeared. Instead of such comparisons, the focus was on how much of the peak performance or bandwidth was actually achieved.

In the case I mentioned in my post above, Intel MKL was not well suited to the algorithm of the compact finite difference method, so I implemented a suitable algorithm on a GPU and achieved that speedup.

5 Likes

@tomohirodegawa Thank you very much indeed!

Your explanation is clear.
I just checked CPU performance,
https://setiathome.berkeley.edu/cpu_list.php
and GPU performance.

On paper, a high-end GPU's double-precision performance could be 4-5 times faster than a CPU's.

Below is the memory bandwidth table of GPUs.

On average, a GPU's bandwidth could be 5-10 times larger than that of, say, DDR4 3200 MHz, which is PC-25600, so 25.6 GB/s.

I am mostly interested in trying Monte Carlo (MC) simulations on a GPU. Since MC uses a lot of walkers to represent a distribution, it is typically well suited to MPI. I am just not sure whether, and by how much, a GPU can accelerate MC.

May I ask: when you do GPU computing, do you have to try your best to write the code using matrix operations (like matrix multiplication) as much as possible, instead of using loops?

1 Like

@CRquantum Here's some interesting data on the Summit supercomputer at ORNL (taken from my master's thesis).

The Summit supercomputer is designed using the "fat node" paradigm: each Summit compute node is equipped with two 22-core IBM POWER9 CPUs and six NVIDIA "Volta" V100 GPUs. An important fact is that the performance of the system mainly comes from the NVIDIA GPUs, which contribute ~7 TFLOP/s each, compared to the CPUs, which contribute only ~500 GFLOP/s each*. This means about 98% of the performance on Summit comes from its GPUs.

*This number is rounded from the theoretical maximum, using 8 floating-point operations per core per clock cycle, 21 usable cores (1 core is reserved for system processes), and a 3.07 GHz base clock frequency: 8 × 21 × 3.07 GHz ≈ 516 GFLOP/s.

That just tells you how efficient the GPU is compared to the CPU, in terms of raw performance.

By the way, I use OpenACC to offload Fortran code to the GPU. I also plan to start learning OpenMP target offload in depth in the near future.

2 Likes

Nice. What Fortran compilers do you use with OpenACC?

1 Like

Depends on the platform. For NVIDIA GPUs I use nvfortran (the newest version available), and for AMD GPUs I use gfortran 10 or newer, since flang can't process OpenACC yet. I heard the Cray compiler also supports OpenACC, but I haven't used it so I can't say much.

3 Likes

Can GFortran offload OpenACC to AMD GPUs? Do you have an example? That's great news.

1 Like

The answer to your question is a bit complicated.

If your code can be written in terms of Fortran's intrinsic procedures or BLAS subroutines, you can use the intrinsic procedures overloaded by nvfortran or the device functions (functions that run on GPUs) provided by cuBLAS. In addition, you can use some libraries provided by NVIDIA or others.

When using OpenACC, you can write code with array operations like c = a + b instead of loops.
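
For example, here is a minimal OpenACC sketch (illustrative only; the array names and sizes are made up) where the whole array expression is offloaded without writing a loop:

program acc_array_add
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n), c(n)
  a = 1.0
  b = 2.0
  ! the kernels region lets the compiler generate GPU code
  ! for the array expression; data movement is handled implicitly
  !$acc kernels
  c = a + b
  !$acc end kernels
  print *, c(1), c(n)
end program acc_array_add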

If you want to write procedures, called kernels, that run on GPUs using CUDA Fortran (a Fortran binding of CUDA provided by nvfortran), it is necessary to write the code in terms of loops. This is because, in a CUDA Fortran kernel, we need to write the operations executed by one thread, like below:

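! compute this thread's 1-based global index, then update one element: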
i = (blockIdx%x-1)*blockDim%x + threadIdx%x
c(i) = a(i) + b(i)

The code above is the CUDA Fortran version of the following Fortran code:

do i = 1, N
    c(i) = a(i) + b(i)
end do
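
For completeness, here is a minimal sketch of what a whole CUDA Fortran kernel and its launch look like (illustrative only; the names vecadd, a_d, etc. are made up):

module vecadd_m
  use cudafor
  implicit none
contains
  attributes(global) subroutine vecadd(a, b, c, n)
    real, device :: a(*), b(*), c(*)
    integer, value :: n
    integer :: i
    ! each thread handles one element
    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    if (i <= n) c(i) = a(i) + b(i)
  end subroutine vecadd
end module vecadd_m

program main
  use cudafor
  use vecadd_m
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n), c(n)
  real, device :: a_d(n), b_d(n), c_d(n)   ! arrays living in GPU memory
  a = 1.0
  b = 2.0
  a_d = a                                  ! host-to-device copies via assignment
  b_d = b
  ! launch enough 256-thread blocks to cover all n elements
  call vecadd<<<(n + 255)/256, 256>>>(a_d, b_d, c_d, n)
  c = c_d                                  ! device-to-host copy
  print *, c(1), c(n)
end program main

The <<<grid, block>>> chevrons choose how many thread blocks and threads per block to launch; each thread then computes its index i exactly as in the two-line fragment above.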

I don't recommend this unless your research/business purpose is to parallelize the code on the GPU.

1 Like

Can GFortran offload OpenACC to AMD GPUs?

Yes, indeed it can, if GCC was configured according to this link on the GCC wiki.

Fortunately, on Ubuntu 20.04 (focal) and Debian 11 (bullseye) you don't have to compile GCC from source, and you can simply do

$ sudo apt install gcc-10-offload-amdgcn

Now, the tricky part is figuring out the GPU target codename. For instance, with my (former) AMD Radeon RX Vega 64, the target codename is gfx900, so the relevant additional compile flags are:

-fopenacc -foffload=amdgcn-amdhsa="-march=gfx900"

Also, from a recent AMD webinar I attended: if you happen to have access to an MI100, the target codename is gfx908. For all other AMD GPUs, this page in the LLVM docs might help.
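
To sanity-check that offloading works, a tiny test like this can be built with those flags (a sketch; the file name, the gfortran-10 binary name, and the example itself are my assumptions):

! saxpy_acc.f90 -- tiny OpenACC offload test
! compile, e.g.:
!   gfortran-10 -O2 -fopenacc -foffload=amdgcn-amdhsa="-march=gfx900" saxpy_acc.f90
program saxpy_acc
  implicit none
  integer, parameter :: n = 1000000
  real :: x(n), y(n)
  integer :: i
  x = 1.0
  y = 2.0
  !$acc parallel loop copyin(x) copy(y)
  do i = 1, n
     y(i) = 2.0*x(i) + y(i)
  end do
  print *, 'y(1) =', y(1)   ! expect 4.0
end program saxpy_acc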

2 Likes