(This is my first post so first of all, hi everyone and thank you for this amazing initiative for the Fortran community)
Since this seems to be a more general question, let me share my personal experience with Fortran and GPUs. There are many ways you can port your Fortran code to GPUs. From higher to lower level (or more to less portable), you could use:
- a library that is already ported to GPUs, e.g. cuBLAS, cuFFT, etc.
- directive-based approaches like OpenACC or OpenMP. These let you “describe” the code you want to port to GPUs, and the compiler then does the work for you.
- “low-level” APIs like CUDA (mainly targeting NVIDIA GPUs) or the AMD equivalent, HIP (part of ROCm). Here, you have fine-grained control over what you are doing on the GPU.
With the first option, you need little to no understanding of how GPUs work, whereas as you go down the list you will need a deeper understanding to make your code run efficiently. If you are interested, I can try to list a few pros and cons for each approach.
Now, let me add a few comments more directly related to your question. The CUDA Fortran compiler, nvfortran, is based on the PGI one (actually, I think they rebranded it), so I would say that yes, it is stable/fast enough.
In principle, you can use CUDA for Fortran. You can very easily code small examples that work great! However, as is unfortunately often the case with Fortran, I found that there is not the same level of support as you could find in C/C++, for example. By default, nvfortran is not included in the CUDA toolkit and you need to download the NVIDIA HPC SDK. If you use HPC clusters, you may run into some problems. For example, the NVIDIA HPC SDK comes with its own openmpi, which was not compatible with the configuration of the machine I was using. One easy workaround for all these drawbacks would be to use the Fortran/C interface to call CUDA C code (which I agree is a bit cumbersome).
To summarize a bit, I would first look for existing libraries that do what you are trying to do. If none are available and you don’t specifically need CUDA, try using OpenACC or OpenMP. They yield very acceptable performance gains, are “easy” to implement in the sense that they are based on an incremental approach, and allow you to keep the same source code for CPU and GPU (the directives are treated as comments by the compiler unless instructed otherwise). Finally, if you really want control over what you do on the GPU, or you are really trying to achieve the best performance, then CUDA (or HIP, on AMD's ROCm) is the best option, but I would strongly encourage you to first have a look at your dependencies and how well you can compile/combine them with CUDA.
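To make the "same source for CPU and GPU" point a bit more concrete, here is a minimal sketch of the directive-based approach using OpenACC. The program name, array size and values are just made up for illustration; the only GPU-specific lines are the two starting with `!$acc`.

```fortran
program saxpy_acc
  implicit none
  integer, parameter :: n = 1000000   ! arbitrary size, for illustration only
  real, allocatable :: x(:), y(:)
  real :: a
  integer :: i

  allocate(x(n), y(n))
  a = 2.0
  x = 1.0
  y = 3.0

  ! With OpenACC enabled, the loop below is offloaded to the GPU:
  ! x is copied to the device, y is copied in and back out.
  ! Without OpenACC, the !$acc lines are plain Fortran comments
  ! and the loop simply runs on the CPU.
  !$acc parallel loop copyin(x) copy(y)
  do i = 1, n
     y(i) = a * x(i) + y(i)
  end do
  !$acc end parallel loop

  ! Every element should now be 2*1 + 3 = 5.
  print *, 'max error:', maxval(abs(y - 5.0))

  deallocate(x, y)
end program saxpy_acc
```

As far as I know, you would build this with something like `nvfortran -acc saxpy_acc.f90` to get the GPU version (or `gfortran -fopenacc`, if your GCC was built with offloading support), and with no extra flag at all to get the plain CPU version. The OpenMP route looks very similar, with `!$omp target` directives instead of `!$acc`.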