Fortran Programmers : How do you want to offload to GPU accelerators in the next 5 years?

fluidnumerics_joe · August 3, 2020, 11:14pm

There are a number of new directions for GPU acceleration coming our way, with multiple vendors now stepping into the GPUs-for-HPC arena.

Here are a few options that are currently available for GPU acceleration in Fortran:

Directive based approaches
OpenACC : Supported by PGI, Flang, and GNU compilers (mileage varies with each). Does anyone know the story of AMD GPU support for OpenACC implementations?

OpenMP 5.0 : AMD’s llvm fork has OpenMP 5.0 support for Nvidia and AMD GPUs https://github.com/ROCm-Developer-Tools/llvm-project

Kernel Based Approaches
CUDA-Fortran : PGI Compilers only, Nvidia GPUs only
hipfort : C-Fortran Interface for HIP, AMD and Nvidia GPUs
FortranCL : C-Fortran Interface for a number of common OpenCL routines. It’s not clear it’s supported anymore, but it’s out there.

I’m curious to know how this community wants to program GPUs and what your pros and cons are for the current implementations available.

pcosta · August 3, 2020, 11:45pm

I find GPU programming quite hard and tricky, and it gets even harder if one wants to do distributed-memory calculations on many GPUs. Right now I use PGI’s CUDA Fortran and (CUDA-aware) MPI (I got some help from very good people), making extensive use of CUF kernels. In some instances I had to compromise a bit on efficiency to use available kernels, because writing efficient custom kernels is hard, at least for me.

It is my understanding that there are some emerging frameworks, like Kokkos or Stanford Legion, that allow for neater and more abstract implementations (by requiring the user to specify things like tasks, data layout and how data should be accessed, I think), while ensuring performance portability across different architectures. I would love to exploit something similar for Fortran.

lkedward · August 4, 2020, 9:51am

I’d like to add my own Fortran OpenCL abstraction library Focal to the list of available options, see here for the slides I presented at FortranCon. Unlike fortrancl and clfortran, Focal presents a ‘Fortranic’ interface that abstracts away a lot of the low-level C library.

My personal preference is strongly towards Kernel-based approaches since they are more explicit and give more control over memory management and synchronisation. By contrast, in Directives-based approaches you usually need to infer what is going on ‘behind the scenes’ to understand performance implications and then add additional directives to constrain the compiler, resulting in messy code.

For example, adding directives to completely specify variable locality for a parallel loop essentially amounts to writing a kernel interface, albeit very verbosely - so you may as well use a kernel-based approach with more control.
Ultimately, I think GPU directives attempt to perform a code abstraction that isn’t actually useful since it is better to retain control over the specifics of execution on GPUs.
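To make the comparison concrete, here is a minimal sketch (not code from anyone in this thread) of a loop whose data movement and variable locality are fully spelled out with OpenMP target directives. The program name, the arrays x and y, the scalar a and the SAXPY-style update are all illustrative assumptions. Note how the clauses list each variable and its direction of transfer, much as a kernel argument list would.

```fortran
program directive_locality_sketch
  implicit none
  integer, parameter :: n = 1024
  real :: x(n), y(n), a
  integer :: i

  a = 2.0
  x = 1.0
  y = 3.0

  ! Each variable's role is stated explicitly:
  !   x is read-only on the device, y is read and written,
  !   a is a per-thread copy initialised from the host value.
  ! The loop index i is private by default.
  !$omp target teams distribute parallel do &
  !$omp&   map(to: x) map(tofrom: y) firstprivate(a)
  do i = 1, n
    y(i) = a * x(i) + y(i)
  end do
  !$omp end target teams distribute parallel do

  ! Self-check: every element should be 2*1 + 3 = 5
  if (any(abs(y - 5.0) > 1.0e-6)) error stop 'unexpected result'
  print '(a)', 'ok'
end program directive_locality_sketch
```

Without OpenMP enabled the directives are plain comments and the loop runs serially, so the sketch compiles with any Fortran compiler. Whether it actually offloads depends on the compiler and how it was built.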

Some disadvantages of existing Fortran options (IMO):

CUDA-Fortran is proprietary, non-portable and results in hardware vendor lock-in
OpenACC & OpenMP: degree of implementation varies, not mature and also require extra work to enable compiler support
HIP / OpenCL: kernels (currently) need to be written in another language

The Fortran language already has a number of abstractions, particularly for arrays, that would be immensely useful when writing accelerator kernels; my preference would be for a language keyword like kernel that would optionally allow a subroutine to be compiled to any GPU backend. Unlike elemental, this would support various hierarchical and fine-grain parallelism features such as execution blocks and thread synchronisation. See initial discussion here.

fluidnumerics_joe · August 4, 2020, 2:25pm

Do you have any interest in participating in hackathons to help gain community experience with Focal?

lkedward · August 4, 2020, 3:17pm

Yes, absolutely, great idea! What did you have in mind for how this would work?

certik · August 4, 2020, 4:12pm

Regarding Focal, it looks like the weakest point is that you still have to write the kernels in C. That is where I want to help with LFortran: we have been making great progress lately on being able to translate Fortran to C++. I just posted at Fortran to C++ translation to continue this particular work.

fluidnumerics_joe · August 4, 2020, 4:14pm

@certik will the Fortran to C++ translation feature eventually become a part of the main llvm-project?

certik · August 4, 2020, 4:18pm

Probably not, because LLVM already has Flang as the Fortran front-end, which uses a different design than LFortran. What would be the advantage of merging it into LLVM itself? I think LLVM should concentrate on LLVM IR, and also if MLIR matures (as part of LLVM), then both Flang and LFortran can target it. Finally, the Fortran to C++ translation does not go via LLVM. It is a separate backend in LFortran, next to the LLVM backend.

fluidnumerics_joe · August 4, 2020, 4:20pm

First, we need to identify individuals or teams that would be able and willing to set aside a few days for experimentation. Once connected with teams, we need to go through a process of profiling (hotspot analysis) and dependency analysis (generating a call graph) to establish a 1, 3, or 5 day sprint plan to port routines to the GPU with Focal. Your role here would be to assist developers in using Focal as they work through the porting process.

We could do this all virtually, using a combination of Slack (or RocketChat, or…) and Google Meet, with daily virtual standups and open times for screen-sharing interactions.

fluidnumerics_joe · August 4, 2020, 4:27pm

My thinking is that support across multiple compilers for a feature that allows GPU kernels to be written in Fortran syntax would help this effort survive in the long run. I recall that PGI-only support for CUDA-Fortran was a sticking point for some folks. Extrapolating from this, I suspect making such a feature available in only one compiler will limit the number of users.

certik · August 4, 2020, 4:41pm

We have two options as I see it:

We give up on writing kernels and targeting GPUs in Fortran, and we simply use C++ and then we can use Kokkos, or OpenCL. We can wrap it into Fortran using Focal. But quite frankly, why not just move over to C++ completely? In my experience, it is typically more maintenance to have two languages, and to have to manually transform from one to the other as you move code from kernels to the main program and vice versa. I don’t think that’s worth it. In fact, that is the number one reason I see large codes around me moving to C++ and doing new development in C++ only.
We want to write everything in Fortran. So then the question is how to achieve it. In Fortran, this really needs to become part of the language and compilers must be able to use it. So you start with one compiler and get it done. To avoid vendor lock-in, LFortran will be able to translate your source, so that you can use any Fortran compiler to actually compile it (if GFortran does not support offloading to GPU, LFortran can translate the kernels to C). So people can use LFortran as a pre-processor if they want to avoid lock-in. Some other people will use it as a full compiler. After this works, we can start talking about standardizing any potentially new features in the language itself (I am part of the Standards committee, etc.) — but that is a long process and we should have prior implementations first. So if this works, we can try to contribute this to Flang. Then we have two compilers. And so on.

milancurcic · August 4, 2020, 8:08pm

I have almost no experience with running stuff on GPUs. I tried playing with it in the past, every time unsuccessfully. I have a basic idea of what directive- and kernel-based approaches look like.

So, without knowing anything more, here’s my naive idea of how I’d like to run things on GPUs:

The compiler can detect the GPU for me and figure out the details.
I can compile a Fortran function or subroutine, perhaps with a special --gpu flag, or perhaps with a directive as a procedure decorator. The compiler will deal with any GPU-related details under the hood. I don’t want to see it or touch it. For example, lfortran -c --gpu my_gpu_procedure.f90 emits a binary object with machine instructions for the GPU.
From the client code, I can call my_gpu_procedure(a, b, c) and the compiler will take care of copying the data to and from GPU in an optimal way. my_gpu_procedure() runs on the GPU and returns the results back to the CPU.
In a sane default, I shouldn’t even know if I have a GPU or not. The compiler could emit both GPU and CPU object binaries, and decide which to call at run-time.
In other words, I want to state in the code what I want to calculate, not how to calculate it. The compiler should find an optimal way to do the calculation.

ivanpribec · August 4, 2020, 8:08pm

Very interesting discussion! I would be interested in a Focal hackathon too, although it will be hard to find a fitting block of time.

Does anyone have plans to use coarray syntax to exploit GPUs? My understanding was that the goal of coarrays was to avoid a tight coupling between parallel algorithms/memory hierarchy and the underlying hardware.

I found a recent article by Michael Wolfe on the topic of accelerators very interesting: Burying The OpenMP Versus OpenACC Hatchet (I might have learned about it at FortranCon). His opinion is:

It should be a goal of all HPC compiler developers that over time programmers are able to use fewer directives, either because of automation where the compiler becomes better at making decisions than the typical programmer, or because the parallel annotations become part of the underlying languages themselves.

So how will OpenMP and OpenACC finally bury the hatchet? I predict that the Fortran and C++ language standards will do the job for us, as they should.

ivanpribec · August 4, 2020, 8:19pm

This sounds very much like the Hybrid Fortran project. It uses a “Python-style” @ directive to indicate that a region should/can use acceleration, and a preprocessor translates this to OpenMP or CUDA-Fortran. It was used to speed up Japan’s meteorological weather prediction code. The description says they are planning to port WRF too.

certik · August 4, 2020, 8:40pm

What @milancurcic described is precisely how I think most users / scientists would like and expect to program GPUs. So in my mind, it is clear I want to program them in Fortran itself.

lkedward · August 5, 2020, 7:44am

I would be happy to provide help in a Focal Hackathon as well as some training materials for general GPU programming and getting started with Focal. This would be a great way to learn about GPU programming.

I agree with @certik: LFortran presents the opportunity to trial new language features, of which GPU programming is only one of many. Being able to quickly prototype and prove the efficacy of new language features can lead to a tried-and-tested standard that other compilers can adopt and that eventually enters the Fortran standard.

I’m not overly familiar with coarrays, but my understanding is that they are mainly targeted at coarse-grain parallelism, so I’m not sure what they would look like for thread-level parallelism.

certik · August 5, 2020, 2:02pm

@kargl I am still getting more experienced with do concurrent. What are the potential problems with it that you mentioned in your post above?

certik · August 5, 2020, 4:06pm

Thanks. I am very familiar with that discussion (I participated in it too), but it is unclear to me whether there is an issue after all, or not.

certik · August 5, 2020, 5:11pm

@pmk, I think so. It’s hard to tell, because we don’t yet have experience with do concurrent reaching its full potential. But assuming do concurrent could do what we talked about in this and other threads, I think it would definitely be my preferred method, and if default(shared) is needed, that seems like a simple extension to the language, and later we can get it standardized if it has wide usage.
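For readers who have not used them, here is a minimal sketch of the locality specifiers that Fortran 2018 added to do concurrent. The standard provides local, local_init, shared and default(none). It has no default(shared), which is why that would be an extension. The program name, variable names and arithmetic below are illustrative assumptions, and the example needs a compiler that implements Fortran 2018 locality specifiers (not all versions of all compilers do).

```fortran
program do_concurrent_locality_sketch
  implicit none
  integer, parameter :: n = 8
  real :: x(n), y(n), a, t
  integer :: i

  a = 2.0
  x = [(real(i), i = 1, n)]
  y = 1.0

  ! default(none) forces every variable used in the loop body to be
  ! given an explicit locality: t is a per-iteration temporary,
  ! while a, x and y are shared between iterations.
  do concurrent (i = 1:n) default(none) local(t) shared(a, x, y)
    t = a * x(i)
    y(i) = t + y(i)
  end do

  ! Self-check: y(i) should now be 2*i + 1
  if (any(abs(y - [(2.0 * real(i) + 1.0, i = 1, n)]) > 1.0e-6)) &
    error stop 'unexpected result'
  print '(a)', 'ok'
end program do_concurrent_locality_sketch
```

The construct only asserts that the iterations are independent. Whether a compiler runs them in parallel, on a GPU or serially is up to the implementation and its flags.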

certik · August 5, 2020, 5:26pm

@pmk excellent question. It seems I might want both: sometimes I don’t want to parallelize and sometimes I do — sort of like combining do concurrent with regular do, depending on which loop you want to parallelize. These are precisely the kind of things we need more experience with. But I can say one thing for sure — I want such array operations to be as fast or faster than a manual serial loop, which is not always the case with some compilers.

What do others think?
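As a rough illustration of mixing the two loop forms, the sketch below computes a matrix-vector product with a do concurrent outer loop (independent rows) and a regular do inner loop (a sequential sum), then compares it with the matmul intrinsic. The names, sizes and data are illustrative assumptions, and nothing here says which form a given compiler will run faster.

```fortran
program mixed_loops_sketch
  implicit none
  integer, parameter :: m = 4, n = 3
  real :: a(m, n), x(n), y_loop(m), y_array(m)
  integer :: i, j

  ! Small integer-valued data so both results are exact
  do j = 1, n
    x(j) = real(j)
    do i = 1, m
      a(i, j) = real(i + j)
    end do
  end do

  ! Outer loop: rows are independent, so do concurrent is valid.
  ! Inner loop: an ordinary sequential accumulation.
  do concurrent (i = 1:m)
    block
      integer :: k
      real :: acc
      acc = 0.0
      do k = 1, n
        acc = acc + a(i, k) * x(k)
      end do
      y_loop(i) = acc
    end block
  end do

  ! The same operation written as an array expression
  y_array = matmul(a, x)

  if (any(abs(y_loop - y_array) > 1.0e-5)) error stop 'results differ'
  print '(a)', 'ok'
end program mixed_loops_sketch
```

The block construct keeps k and acc local to each iteration using only Fortran 2008 features, so no locality specifiers are needed.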
