GPU offloading in Fortran

From what I’ve read on this forum, Fortraners enthusiastically support GPU offloading. What I don’t understand is why baking GPU stuff directly into core Fortran is considered a good idea :slight_smile:

First, since NVIDIA (CUDA) and AMD (ROCm/HIP) each have their own APIs and runtimes, getting all Fortran compilers to handle GPU offloading sounds nearly impossible.

Second, why not just mix tools? NVFortran isn’t standard Fortran - it’s a specialized language for specific hardware, even if it keeps Fortran-like syntax. In the C++/C world, mixing NVCC with host compilers is routine (just set it up with CMake). Yes, it’s done through a C API, but that works fine. I haven’t seen this explained here.
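For what it’s worth, the CMake setup I mean is roughly this (project, target, and file names are placeholders, not any real project):

```cmake
# Sketch of mixing NVCC with a host compiler via CMake
# (project/target/file names are hypothetical).
cmake_minimum_required(VERSION 3.18)
project(mixed LANGUAGES CXX CUDA)

# NVCC compiles the .cu files; the host compiler handles the rest.
add_library(kernels STATIC kernels.cu)
add_executable(app main.cpp)
target_link_libraries(app PRIVATE kernels)
```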

Third, GPU architectures are very specific, so efficient kernels are not normal Fortran anyway.

Finally, the lack of features like the standard library or templates seems far more important, at least for me.

I was referring to the fact that, starting with Fortran 90 (it was 30+ years ago, not 40+, sorry), compilers started handling Fortran subprogram arguments and results in different ways. They also emitted incompatible *.mod files.

The bind(C) suffix/attribute inhibits that behavior and makes things work in ways compatible with the C “companion processor”, but with C interoperability you lose certain things (like OOP).
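As a minimal sketch of what that looks like (the C symbol saxpy_c is a made-up placeholder, not a real library entry point), note that only interoperable types can cross this boundary, which is exactly where the OOP features get lost:

```fortran
! Minimal sketch of a bind(C) interface; "saxpy_c" is a
! hypothetical C wrapper, not a real library entry point.
module c_glue
  use iso_c_binding, only: c_int, c_float
  implicit none
  interface
    subroutine saxpy_c(n, a, x, y) bind(C, name="saxpy_c")
      import :: c_int, c_float
      integer(c_int), value :: n      ! pass-by-value, like a C int
      real(c_float),  value :: a
      real(c_float)         :: x(*)   ! plain arrays only: no class(*),
      real(c_float)         :: y(*)   ! no type-bound procedures
    end subroutine
  end interface
end module
```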

I’m not sure about ISO_Fortran_binding.h, but I think even that is somewhat incompatible among compiler vendors.

On the C side, on the other hand, essentially only stddef.h differs among compilers, so mixing things is less complicated.

1 Like

I’ve done a lot of GPU programming, albeit in C/C++, and I am very curious about how to do it in Fortran without being limited to the compiler of my choosing.

The stdlib project aims to provide a standard library for Fortran, it is a very active project and it is advancing quickly. I am using it in some projects of my own and have been very happy with it.

Templates are a funny thing; currently the best way to approximate them is to use fypp to do some pre-processing and generate code for different datatypes. It is a bit annoying to rely on a Python tool, but hey, Fortran is moving in the right direction.

Now, to go back to GPUs. If you write CUDA/HIP kernels and call them from Fortran, you’re limited to exposing them via an extern "C" interface, which limits things quite a bit: C has no concept of classes, so objects are out of the question. This forces your Fortran to adapt and limits which of Fortran’s cool features you can use.

An additional thing is that right now CUDA Fortran and hipfort are weird: CUDA Fortran is only in the nvfortran compiler, which is nice but still a bit… wobbly, missing a couple of screws. Hipfort is best if you build it on the fly with your project; I struggled a lot to get it working without building it myself.

Probably the easiest way to use Fortran + GPUs is offloading matrix-algebra routines via accelerated libraries, since for this you can just write some bind(C) interfaces to the cublas/hipblas/magma_gpu calls. However, if you want access to the nice features for allocating memory etc., you need bind(C) routines for those too, and there is no single big project that provides them.
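Such an interface can be fairly short. A sketch for the legacy cuBLAS saxpy might look like this (signature from memory of the legacy cublas.h API; double-check it against your toolkit version before relying on it):

```fortran
! Sketch of a bind(C) interface to the legacy cuBLAS saxpy;
! verify the signature against your toolkit's cublas.h.
! x and y must already be device arrays (e.g. via OpenACC host_data).
interface
  subroutine cublassaxpy(n, alpha, x, incx, y, incy) bind(C, name="cublasSaxpy")
    use iso_c_binding, only: c_int, c_float
    integer(c_int), value :: n
    real(c_float),  value :: alpha
    real(c_float)         :: x(*), y(*)
    integer(c_int), value :: incx, incy
  end subroutine
end interface
```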

Then you have directive-driven offloading via OpenMP or OpenACC. This is 100% compiler-dependent and makes things, in my experience, very annoying.

Standard parallelism using DO CONCURRENT is another good option, but it is likewise limited by the compiler.
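To make the comparison concrete, here is the same saxpy loop in the three flavors just mentioned; each needs its own compiler support and flags (e.g. nvfortran’s -acc, -mp=gpu, and -stdpar=gpu respectively), which is exactly the portability problem:

```fortran
!$acc parallel loop            ! OpenACC directive
do i = 1, n
  y(i) = y(i) + a * x(i)
end do

!$omp target teams distribute parallel do   ! OpenMP offload directive
do i = 1, n
  y(i) = y(i) + a * x(i)
end do

do concurrent (i = 1:n)        ! standard Fortran, no directives
  y(i) = y(i) + a * x(i)
end do
```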

If I were to choose something, I would write a set of glue subroutines that allow me to call explicit kernels written in CUDA/HIP, but it would need a lot of glue.

I don’t think that is correct. My understanding of NVFortran is that it’s based on “classic” flang along with some of the proprietary PGI software that Nvidia acquired when it absorbed the Portland Group. Several US (mostly DOE) labs funded part of the original development of “classic” flang. However, NVFortran was around for a few years before offloading was added. I think you are confusing CUDA Fortran with NVFortran: NVFortran supports CUDA Fortran, but to the best of my knowledge they aren’t the same thing. If it supports the F2003 standard, it is for all intents and purposes a “standard-conforming compiler”, since as far as I know only the now late, lamented Intel ifort compiler came close to being fully F08 and F18 compliant. I only use NVFortran to check whether it will compile things I develop with ifort (and now ifx), but I treat it just like a normal Fortran compiler. You don’t need special hardware to get it to compile normal (non-GPU) applications.

As to the topic of GPU offloading, my belief is that the only way to make it useful in Fortran (or any other language) is to implement it in a way that is hardware/vendor agnostic. I doubt that will ever happen without someone with the purchasing clout of government labs in the US and elsewhere demanding it. The only other option I see is for Intel to find the courage to release a commodity GPU under US$500 that can do, at a minimum, sustained 20 TFLOPS of FP64 calculations. Again, just my humble opinion, but if Intel is serious about taking down Nvidia, this is one of the strategies I would look at if I had a say in how Intel was run.

… is English your native language? How would you feel if someone here in the forum required you to reply in, say, French, Spanish, German, or any other language? Even if you know it, a language that is not your native/day-to-day one just adds a layer of complexity… you might even like that language, but you would want to use it by choice… Putting that aside:

There are many layers to peel here… The compiler is not the language: ifort, gfortran, nvfortran, NAG, LFortran, etc. (there are many) are all “processors” that transform the programming language into machine code. Each compiler decides which parts of the standard to implement, or implements its own non-standard extensions. Nvfortran is the next generation of the PGI compilers; you can find more info here: Compilers — Fortran Programming Language. I think you are confusing CUDA Fortran, which is indeed a language extension, with Nvfortran, which is the compiler. With Nvfortran you can:

  • Compile CUDA Fortran code,
  • Use the OpenACC standard, which calls CUDA without the programmer’s intervention,
  • Use the OpenMP standard, which in nvfortran’s case uses OpenACC behind the curtain for offloading.

So no, it is not a “specialized language”. It is a compiler in the full sense of the word.

Nvfortran has been working toward enabling use of standard Fortran semantics for GPU offloading.

Intel makes its own claims as well.

So, maybe you are a little bit baffled about this being a topic, but as was mentioned by @PierU and @jwmwalrus, the whole problem in the Fortran world is the incompatibility between the binaries that each compiler creates, which has historical reasons. So the issue is: if you want to delve into GPU offloading with Fortran as of today, you have to let yourself get kidnapped by one compiler+hardware combination.

Interestingly, if you stay in C++, you could use Intel’s oneAPI SYCL extension to develop for offloading on Nvidia GPUs https://www.intel.com/content/www/us/en/developer/videos/developing-nvidia-gpus-using-sycl-oneapi-part-2.html but there is no equivalent in Fortran, because of incompatibilities between the OpenMP implementations.

In order for GPU programming to take off within the Fortran world, it has to be natively accessible within the language. Otherwise it will remain a niche within an already small niche.

2 Likes

The Intel Arc A770 might be an interesting choice in the entry-level range, though from what I’ve read it does not support FP64; still, it seems to be competitive with the RTX 4060.

20 TFLOPS is a lot, haha. An A100 is 19.3 if I’m not mistaken.

And to be honest, there’s nothing better than the vendors seeing the need for this. Otherwise, nothing is going to happen.

Actually, no, it’s not a lot. Techpowerup rates the FP32 performance of the RTX 5080 at 56.28 TFLOPS. If Nvidia didn’t intentionally cripple the chip to run FP64 at 1:64th of FP32 (if you can actually get it to run in FP64 at all, even though Techpowerup shows an FP64 performance of 879 GFLOPS, i.e. 56.28/64) and instead let FP64 run at half the FP32 rate, you would have your 20 TFLOPS (peak, though not sustained). I have a 3-year-old Nvidia RTX A4500 that is rated at 23 TFLOPS in FP32; I got it for US$900 to do machine learning. Nvidia could easily build and sell this class of card with FP64 enabled for around US$500. If I remember correctly, one of the earlier Nvidia Titan (Black) GPUs had FP64 support turned on. The only reason I can see that we don’t have usable FP64 in a commodity card is that it would cut into Nvidia’s profits from selling overpriced GPUs for large HPC systems.

“a commodity GPU that can do at a minimum sustained 20 Tflop FP64 calculations”

I am wondering what commodity CPU and memory could sustain this computation rate, and also what type of computation.
I do structural finite-element analysis, and the memory bandwidth required to sustain 20 TFLOPS of FP64 calculations would probably need to be 100x what I have available on the CPU and DDR memory I am using.
However, these GPUs are used for other types of calculations on this hardware, clearly ones with much smaller memory-bandwidth requirements.

I see a pathway via Fortran + OpenMP, but clearly some standardisation of the GPU instruction set would be helpful.
The OpenMP offloading specification (which I have not used) appears to be very complex, and it is difficult to identify which compilers support it and with what hardware.

I absolutely hate this: I have so many if(CMAKE_Fortran_COMPILER_ID MATCHES XYZ) blocks in my code to enable and disable OMP offloading. It is painful and very annoying.
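The kind of dispatch I mean looks roughly like this (a sketch; the flag values are from memory and the target name is a placeholder, so verify against each compiler’s docs):

```cmake
# Sketch of per-compiler offload flags; values are from memory,
# and "mylib" is a hypothetical target.
if(CMAKE_Fortran_COMPILER_ID MATCHES "NVHPC")
  set(OFFLOAD_FLAGS "-mp=gpu")
elseif(CMAKE_Fortran_COMPILER_ID MATCHES "IntelLLVM")
  set(OFFLOAD_FLAGS "-fiopenmp;-fopenmp-targets=spir64")
elseif(CMAKE_Fortran_COMPILER_ID MATCHES "GNU")
  set(OFFLOAD_FLAGS "-fopenmp;-foffload=nvptx-none")
endif()
target_compile_options(mylib PRIVATE ${OFFLOAD_FLAGS})
```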

However, these things are necessary if we want to see progress… the hard reality is that we’re never going to get close to the peak performance a GPU has to offer via offloading directives; no compiler is smart enough.

My inclination for the tasks I have in mind is either do concurrent or OpenMP. To me, these seem to have the highest likelihood of being portable across different systems.

Intel (ifx) supports both (for Intel GPUs), nvfortran supports them for Nvidia GPUs, GNU supports OpenMP offloading for AMD, and AMD flang may yet add support for AMD GPUs.

I’ve played around with these a bit, but hope to take them up in full from April or so.

My biggest obstacle with nvfortran is actually that it doesn’t support F2008 features such as coarrays (even just the syntax on a single image would suffice).

1 Like

I think once OpenMP offloading is standard across all compilers it will be nice; then those that need that extra oomph will find a way to use CUDA or HIP.

3 Likes

“GPU” appears nowhere in the Fortran standard and I’m pretty sure that’s intentional. The late Dan Nagle, who chaired the US arm of the Fortran committee, said to me, “The philosophy of Fortran is to give the programmer the ability to communicate properties of their code rather than to mandate what the compiler do to exploit those properties.” I wasn’t on the committee when do concurrent was developed, but my understanding is that the committee developed it with GPUs in mind, but I doubt the point was “baking GPU stuff directly into Fortran.”

A do loop is inherently a sequential construct. It explicitly tells the compiler “do these iterations in this order.” That ordering is essential when it comes to doing things like time advancement, wherein the calculations must respect causality to be correct. But because parallel programming was necessary long before parallel programming languages went mainstream and developers understandably couldn’t wait, we developed a pattern of first telling the compiler explicitly to do something sequentially and then undoing that sequential ordering with directives. One of the worst outcomes of this pattern is that we sometimes end up with more directives than program statements – all in the name of undoing what we did! It seems to me much clearer to just tell the compiler what we mean: these iterations can be done in any order you choose. That’s the purpose of do concurrent, and fortunately there are at least four compilers that can now parallelize do concurrent on CPUs or GPUs: compilers from NVIDIA, Intel, HPE (Cray), and LLVM, in approximately chronological order of how long each compiler has had this capability. For an example of do concurrent achieving essentially the same performance as OpenMP when compiling with LLVM Flang and running on a CPU, see the slides from my “Just Write Fortran” talk at the 2024 Parallel Applications Workshop – Alternatives to MPI+X. That work is based on AMD’s ROCm fork of LLVM Flang, where I believe there is also already a branch that offers experimental support for offloading do concurrent to a GPU.

I’m old enough to remember floating-point co-processors in the 1990s. These days, when I mention floating-point co-processors to anyone under 40, they usually haven’t even heard of them because those devices eventually got absorbed into the CPU. I suspect we’re already seeing the early stages of a similar trend with GPUs, which I suspect is why the committee never intended to explicitly address GPUs in the language. I often wonder whether young developers in future decades will even know the term GPU and will be discussing whether to bake some new form of accelerator into the language.

4 Likes

Sorry to hear that Dan has passed away. I worked with Dan in the early days of the old DoD HPCMP Programming Environments and Training (PET) program. I helped him teach a couple of parallel programming classes for some Air Force people as well as some of the submarine folks at Electric Boat Company in Connecticut. Very nice guy.

I’m old enough to remember when it became apparent that Cray was going to make a lot of money with vector processing, and companies like Floating Point Systems (if I remember correctly) would sell you hardware and software that you could scab onto a VAX 11/780 to do vector processing. Later in the 80s, IBM would sell you a vector-processing box you could add to a 3090. Now we have processors with “vector processing” built in, and other than some compiler options and some knowledge of the best way to structure loops to achieve optimum vectorization, it’s mostly transparent to users. I think (hope) this will be the path for GPUs: using them should NOT require hardware-specific instructions, other than something like do concurrent. The reason we have compilers in the first place is to relieve programmers of the burden of writing everything in machine code by hand.

1 Like

Has anyone tried the following: OpenACC/GFortran Interoperability with CUDA libraries - GCC Wiki? It seems like there is an open door for using CUDA through GFortran’s OpenACC interoperability. Curious about the possibilities and limitations.

This looks very nice; basically it is what nvfortran provides to interface with cublas. You just need to make sure that the directives are supported when building GCC, which I haven’t tried when building my own GCC. Does anyone have a suggested ./configure line for this?
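For reference, the calling pattern in that wiki example looks roughly like this (a sketch from memory; check the wiki page for the exact code). OpenACC manages the device copies, and host_data hands the device addresses to the cuBLAS call:

```fortran
! OpenACC creates/copies the device arrays; host_data exposes
! their device addresses so cuBLAS operates on device memory.
!$acc data copyin(x) copy(y)
!$acc host_data use_device(x, y)
call cublassaxpy(n, a, x, 1, y, 1)
!$acc end host_data
!$acc end data
```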

So I gave it a spin. Before this I had to build GCC with offloading capabilities; I used this tool and it worked out of the box. I then took the main.f90 from your link and compiled the module I needed from the interfaces, in this case cublas.

I simply did gfortran -fopenacc cublas.f90 -c and then gfortran -fopenacc main.f90 -lcublas. This worked; I ran it with nvprof and got a call to the GPU. Easy. I then thought to set up an FPM repo for this:

name = "test"
version = "0.1.0"
license = "license"
author = "Jorge"
maintainer = "jg4@iastate.edu"
copyright = "Copyright 2025, Jorge"
[build]
link = ["cublas"]
auto-executables = false
auto-tests = false
auto-examples = false
module-naming = false
[[executable]]
name = "app-name"
link = ["cublas"]
[install]
library = false
[fortran]
implicit-typing = true
implicit-external = true
source-form = "free"

I was even very defensive and used: fpm build --flag="-fopenacc" --link-flag="-fopenacc -lcublas" --verbose

The verbose log looks like:

 <INFO> BUILD_NAME: build/gfortran
 <INFO> COMPILER:  /home/jorge/install/ompoffload/gcc//bin/gfortran
 <INFO> C COMPILER:  gcc
 <INFO> CXX COMPILER: g++
 <INFO> COMPILER OPTIONS:   -fopenacc
 <INFO> C COMPILER OPTIONS:
 <INFO> CXX COMPILER OPTIONS:
 <INFO> LINKER OPTIONS:   -fopenacc -lcublas
 <INFO> INCLUDE DIRECTORIES:  []
 + mkdir -p build/gfortran_6EAE69F332523106
 + mkdir -p build/gfortran_45175DAEA518FD85
 + mkdir -p build/gfortran_4871D01AF1A03C0E
[  0%]                     cublas.f90
 + mkdir -p build/gfortran_45175DAEA518FD85/test/
 + /home/jorge/install/ompoffload/gcc//bin/gfortran -c ././src/cublas.f90   -fopenacc -ffree-form -J build/gfortran_45175DAEA518FD85 -Ibuild/gfortran_45175DAEA518FD85 -o build/gfortran_45175DAEA518FD85/test/src_cublas.f90.o
[ 25%]                     cublas.f90  done.
[ 25%]                      libtest.a
 + mkdir -p build/gfortran_6EAE69F332523106/test/
 + ar -rs build/gfortran_6EAE69F332523106/test/libtest.a build/gfortran_45175DAEA518FD85/test/src_cublas.f90.o
ar: creating build/gfortran_6EAE69F332523106/test/libtest.a
[ 50%]                      libtest.a  done.
[ 50%]                       main.f90
 + /home/jorge/install/ompoffload/gcc//bin/gfortran -c app/main.f90   -fopenacc -ffree-form -J build/gfortran_45175DAEA518FD85 -Ibuild/gfortran_45175DAEA518FD85 -o build/gfortran_45175DAEA518FD85/test/app_main.f90.o
[ 75%]                       main.f90  done.
[ 75%]                       app-name
 + mkdir -p build/gfortran_4871D01AF1A03C0E/app/
 + /home/jorge/install/ompoffload/gcc//bin/gfortran    -fopenacc -fimplicit-none -Werror=implicit-interface  -fopenacc -lcublas build/gfortran_45175DAEA518FD85/test/app_main.f90.o -lcublas build/gfortran_6EAE69F332523106/test/libtest.a -lcublas -o build/gfortran_4871D01AF1A03C0E/app/app-name
[100%]                       app-name  done.
[100%] Project compiled successfully.

Now the failure comes at runtime. If I do fpm run it fails: it does not call the cublassaxpy routine at all, so the check that the results match fails. However, if I do:

./build/gfortran_4871D01AF1A03C0E/app/app-name
   11.0000000         11.0000000
   19.0000000         19.0000000
   27.0000000         27.0000000
   35.0000000         35.0000000
   43.0000000         43.0000000
   51.0000000         51.0000000
   59.0000000         59.0000000
   67.0000000         67.0000000
   75.0000000         75.0000000
   83.0000000         83.0000000

Boom! Correct answer. How is FPM failing to do things here? @FedericoPerini, do you have any insight into why fpm run might be failing? Here’s a repo with a reproducer (you will need to build a GCC though): GitHub - JorgeG94/fpm-openacc-bug_questionmark: A simple reproducible of a bug in the FPM

1 Like

Oh ha! If I do:

FPM_FC=$GCC_DIR/bin/gfortran fpm run --flag="-fopenacc" --link-flag="-fopenacc -lcublas" --verbose
 <INFO> BUILD_NAME: build/gfortran
 <INFO> COMPILER:  /home/jorge/install/ompoffload/gcc//bin/gfortran
 <INFO> C COMPILER:  gcc
 <INFO> CXX COMPILER: g++
 <INFO> COMPILER OPTIONS:   -fopenacc
 <INFO> C COMPILER OPTIONS:
 <INFO> CXX COMPILER OPTIONS:
 <INFO> LINKER OPTIONS:   -fopenacc -lcublas
 <INFO> INCLUDE DIRECTORIES:  []
[100%] Project compiled successfully.
 + build/gfortran_4871D01AF1A03C0E/app/app-name
   11.0000000         11.0000000
   19.0000000         19.0000000
   27.0000000         27.0000000
   35.0000000         35.0000000
   43.0000000         43.0000000
   51.0000000         51.0000000
   59.0000000         59.0000000
   67.0000000         67.0000000
   75.0000000         75.0000000
   83.0000000         83.0000000

It works…

1 Like

Internally fpm just wraps execute_command_line.

So all the environment variables from the parent shell where you’re running fpm should be passed over to the child process. But there is a similar issue for macOS+flang; I would look into environment variables.