Fortran projects running on GPUs in production

Are there any projects using Fortran with GPUs in production?

1 Like

There must be a way.

At least for Intel: the oneAPI name suggests as much, and they are clearly committed to the GPU business. I am pretty sure they will not give up on GPUs, since GPU computing is attracting more and more attention for its potential in large matrix operations, such as solving partial differential equations, fluid dynamics, heat transfer, image processing (convolution matrices), cellular automata, etc. Workloads like these used to be a good fit for Xeon Phi (slower cores, but many threads and high-bandwidth memory); Xeon Phi has now basically been replaced by GPUs.

In short, I see no reason why Intel Fortran cannot offload OpenMP work from an Intel CPU to an Intel GPU. They have some material on this at the links below:

https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-cpp-fortran-compiler-openmp/top.html

https://www.intel.com/content/www/us/en/developer/videos/three-quick-practical-examples-openmp-offload-gpus.html#gs.vnh4qj

I have a similar thread, but I am not sure how to actually offload to an Intel GPU. I think I am almost there but still need one more step to make it work. The examples are all in oneAPI’s examples folder; it would be great if the instructions on how to offload OpenMP to the GPU were a little clearer (especially on Windows).

I think gfortran or LFortran can also offload OpenMP to GPUs; if not now, it should be possible in the near future.

But I guess that if the application needs to frequently exchange data between the CPU and the GPU through main memory, the GPU speedup will be ruined by the slow memory bandwidth. For example, typical DDR4-2666 memory has a theoretical peak bandwidth of only about 21 GB/s per channel (about 43 GB/s in dual-channel mode). A GPU’s GDDR5 or similar memory can be roughly 10x faster.
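The DDR4 figure can be sanity-checked with a little arithmetic: theoretical peak bandwidth is the transfer rate times the bus width times the number of channels. The sketch below is only a back-of-the-envelope estimate. The function names, the 8 GB array, and the "10x faster" GPU memory are illustrative assumptions, not measurements.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, bus_width_bits=64, channels=1):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


def transfer_time_s(n_bytes, bandwidth_gb_s):
    """Best-case time to stream n_bytes once at the given bandwidth."""
    return n_bytes / (bandwidth_gb_s * 1e9)


# DDR4-2666: 2666 MT/s on a 64-bit channel (nominal figures, assumed here).
single = peak_bandwidth_gb_s(2666)
dual = peak_bandwidth_gb_s(2666, channels=2)

# Hypothetical workload: one pass over an 8 GB array.
array_bytes = 8e9
for label, bw in [("DDR4-2666, 1 channel", single),
                  ("DDR4-2666, 2 channels", dual),
                  ("10x faster GPU memory (assumed)", 10 * dual)]:
    print(f"{label}: {bw:.1f} GB/s, "
          f"{transfer_time_s(array_bytes, bw):.3f} s per pass")
```

Sustained bandwidth in practice is lower than these peak numbers, so the estimate is an upper bound on how fast data can move, not a prediction.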

In fact, I suspect that one important reason Apple’s M1 chips achieve such high performance is their high memory bandwidth, around 200 - 400 GB/s on the M1 Pro and M1 Max if I remember correctly. That is almost the same speed as L2 or L3 cache (though cache should still have at least 10x lower latency than main memory). That also explains why M1 Macs are very good at image/video processing, as those applications typically benefit from fast memory (images are just big arrays in memory).

Amber Molecular Dynamics is one of the most popular Molecular Dynamics packages whose kernel is modern Fortran with GPU support.

2 Likes

Quantum ESPRESSO toward the exascale

Abstract: Quantum ESPRESSO is an open-source distribution of computer codes for quantum-mechanical materials modeling, based on density-functional theory, pseudopotentials, and plane waves, and renowned for its performance on a wide range of hardware architectures, from laptops to massively parallel computers, as well as for the breadth of its applications. In this paper, we present a motivation and brief review of the ongoing effort to port Quantum ESPRESSO onto heterogeneous architectures based on hardware accelerators, which will overcome the energy constraints that are currently hindering the way toward exascale computing.

Section III C describes the current status, capabilities, and some benchmarks for Quantum ESPRESSO on NVIDIA graphics processing units (GPUs), one of the leading candidate architectures for the future “exascale computer.” The benchmarks are designed for relatively small machines and do not aim at showing performances on large-sized systems. They aim instead at pointing out bottlenecks, inefficiencies, and the minimum size of calculations that saturate the computational power of a GPU.

Taken from the reference: J. Chem. Phys. 152, 154105 (2020).

I would say about 99% of the codebase is Fortran 90 or Fortran 200x. For more details, see the main website: https://www.quantum-espresso.org/

3 Likes

IBM’s GRAF is a proprietary fork of MPAS that runs on GPUs. I heard that it uses OpenACC directives but I don’t know this for a fact.

1 Like

Peter Ukkonen runs a fork of neural-fortran on GPUs using OpenACC directives and delegating matmul calls to cuBLAS (Paper).

This is academic research, so probably not what you’d call true production, but these projects often have the end-goal of being used in production a few years down the road.

2 Likes

Here is a list of HPC codes that all run on GPUs: https://www.nvidia.com/content/intersect-360-HPC-application-support.pdf. Most of the top 10 codes are in Fortran.

2 Likes

I don’t think this qualifies as a Fortran project, but PETSc, which you can call from Fortran, is adding/has added GPU support.

2 Likes

Not a direct pointer, but many of the applications that use OpenACC for GPU acceleration are in Fortran as well. When I was last at NVIDIA, there were over 100 GPU-accelerated production apps with OpenACC. Most of those were primarily in Fortran, but I can’t lay my hands on a definitive list at the moment.

4 Likes

I am also interested in OpenACC, which I got to know from a recent tutorial seminar and seems relatively simple to use. VASP (an electronic structure program) also seems to support GPU via OpenACC.

2 Likes

I think both the Abaqus and LS-DYNA FEM codes have some level of GPU support and both are (mostly) Fortran.

2 Likes

I think the link I posted above might be exactly the list that you have in mind.

Yours was from November 2017. :slight_smile:

I was thinking of an OpenACC list that was more recent. I thought they had a definitive OpenACC list on the openacc.org web site, but didn’t see it.

1 Like

If you find a more recent list, please share it! These documents are hard to find, but they are very helpful for seeing the status of Fortran. I did a big search around 2017, and that document was the best I was able to find.

CaNS runs on many GPUs powered by OpenACC directives and an NVIDIA pencil domain decomposition library.

1 Like

A case-insensitive grep (actually Windows findstr) of my list of Fortran codes on GitHub for “gpu” gives the following. If my description of a project does not include “gpu”, as it did not for CaNS, the project will be missed.

Nbody6++GPU - Beijing version: N-body star cluster simulation code, by Rainer Spurzem and team. It is an offspring of Sverre Aarseth’s direct N-body codes.

POT3D: High Performance Potential Field Solver: computes potential field solutions to approximate the solar coronal magnetic field using observed photospheric magnetic fields as a boundary condition. A version of POT3D that includes GPU-acceleration with both MPI+OpenACC and MPI+OpenMP was released as part of the Standard Performance Evaluation Corporation’s (SPEC) beta version of the SPEChpc™ 2021 benchmark suites.

Fluid Transport Accelerated Solver (FluTAS): modular, multiphysics code for multiphase fluid dynamics simulations. The code is written following a “functional programming” approach and aims to accommodate various independent modules. One of the main purposes of the project is to provide an efficient framework able to run both on many-CPUs (MPI) and many-GPUs (MPI+OpenACC+CUDA-Fortran).

IMEXLB-1.0: Lattice Boltzmann Method (LBM) proxy application code-suite for heterogeneous platforms (such as ThetaGPU). A ProxyApp, by definition, is a proxy for a full-fledged application code that simulates a wider array of problems.

MGLC: multi-GPU parallel implementation of LBM (Lattice Boltzmann Method), using OpenACC to accelerate codes on a single GPU and MPI for inter-GPU communication.

fft-overlap: efficient implementations of FFTs on multiple GPUs and across multiple nodes, by dappelha. Overlapping data transfer on multiple levels.

arrayfire-fortran: Fortran wrapper for ArrayFire, a general purpose GPU library.

Eigensolver_gpu: generalized eigensolver for symmetric/Hermitian-definite eigenproblems with functionality similar to the DSYGVD/X or ZHEGVD/X functions available within LAPACK/MAGMA, by Josh Romero et al. This solver has fewer dependencies on CPU computation than comparable implementations within MAGMA, which may be of benefit to systems with limited CPU resources or to users without access to high-performing CPU LAPACK libraries.

GraSPH: Smoothed-particle Hydrodynamics (SPH) program originally intended for simulations of bulk granular material as well as fluids, by Edward Yang. Src_CAF contains code intended to run in a multi-core configuration using the Fortran 2008 coarray features, and src_GPU contains code intended to run on a CUDA-enabled GPU.

CUDA Fortran: Fortran programming on GPU: a complete introduction for beginners by Koushik Naskar

ExaTENSOR: basic numerical tensor algebra library for distributed HPC systems equipped with multicore CPU and NVIDIA (or AMD) GPU, by Dmitry I. Lyakh. The hierarchical task-based parallel runtime of ExaTENSOR is based on the virtual tensor algebra processor architecture, i.e. a software processor specialized to numerical tensor algebra workloads on heterogeneous HPC systems (multicore/KNL, NVIDIA or AMD GPU).

FGPU: code examples focusing on porting FORTRAN codes to run on DOE heterogeneous-architecture CPU+GPU machines, from LLNL. The purpose of these is to provide both learning aids for developers and OpenMP and CUDA code examples for testing vendor compiler capabilities.

GPU programming with OpenMP offloading: exercises and other material for a course, by Jussi Enkovaara et al.

gpu-tips: Fortran examples of CUDA and directives tips and tricks for IBM Power + NVIDIA systems, by dappelha

nbody-ifx-do-concurrent: N-body Fortran code port to test ifx (Intel Fortran) GPU offload of do concurrent, by Saroj Adhikari

Tensor Algebra Library Routines for Shared Memory Systems: Nodes equipped with multicore CPU, NVIDIA GPU, AMD GPU, and Intel Xeon Phi (TAL_SH): implements basic tensor algebra operations with interfaces to C, C++11, and Fortran 90+, by Dmitry I. Lyakh

BDpack: GPU-enabled Brownian dynamics package for simulation of polymeric solutions, by Amir Saadat. An associated paper is Computationally efficient algorithms for incorporation of hydrodynamic and excluded volume interactions in Brownian dynamics simulations: A comparative study of the Krylov subspace and Chebyshev based techniques, A. Saadat and B. Khomami, J. Chem. Phys., 140, 184903 (2014).

A Fortran Electronic Structure Programme (AFESP): project based on the Crawford Group’s C++ Programming Tutorial in Chemistry, but written in Fortran, by Kirk Pearce et al. The end goal of this project will be performing HF, MP2, CCSD, and CCSD(T), as per the original tutorial, but with additional support for multicore processors (modern CPUs, GPUs).

qe-gpu: GPU-accelerated Quantum ESPRESSO using CUDA Fortran, by Filippo Spiga
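The search described at the top of this post amounts to a case-insensitive substring filter over one-line project descriptions. A minimal sketch in Python, for illustration only: the actual search used Windows findstr, the function name `find_keyword` is made up, and the CaNS description below is a placeholder rather than the real entry.

```python
def find_keyword(lines, keyword="gpu"):
    """Return the lines that contain keyword, ignoring case."""
    needle = keyword.lower()
    return [line for line in lines if needle in line.lower()]


# Two descriptions taken from the list above, plus a placeholder CaNS
# entry whose (hypothetical) description never mentions the keyword.
projects = [
    "qe-gpu: GPU-accelerated Quantum ESPRESSO using CUDA Fortran",
    "CaNS: (description that never mentions the keyword)",
    "arrayfire-fortran: Fortran wrapper for ArrayFire, a general purpose GPU library.",
]

for hit in find_keyword(projects):
    print(hit)
```

The placeholder entry is skipped, which is exactly how a project like CaNS gets missed when its description does not contain the search term.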

2 Likes

ICON, the global model developed in Germany to simulate weather and climate, is used on GPUs.