Research articles using Fortran

Recent paper in macroeconomics about wealth taxation, published in the Quarterly Journal of Economics (this also shows that Fortran is a popular choice among quantitative economists):

Title: “Use It Or Lose It: Efficiency Gains from Wealth Taxation”

Authors: Fatih Guvenen, Gueorgui Kambourov, Burhanettin Kuruscu, Sergio Ocampo-Diaz & Daphne Chen

Link to published paper:

Link to Fortran replication code:


Thanks @shahmoradi for finding this one. A colleague of mine, Salvatore Cielo, has done the painstaking work of porting ECHO to SYCL:

Here is a performance graph of the SYCL version:

(Image Source: https://doi.org/10.1145/3585341.3585382)

The cool thing is that the same SYCL code runs on GPUs from all three vendors. SYCL support on NVIDIA and AMD cards is provided via the open-source oneAPI DPC++ compiler (the upstream version of the Intel oneAPI DPC++ compiler). AFAIK, the NVIDIA and AMD support was developed by Codeplay (recently acquired by Intel).

Anyway, the authors of the newly published Fortran article state:

The highest peak performance using the four GPUs of a single LEONARDO node is 2.2 × 10^8 cells updated each iteration per second, reached at both 512^3 and 640^3 resolutions for the second-order version of the code.

The 2.2E8 updates per second would sit in the top-right corner of the plot, so it’s in the same ballpark as the SYCL port. However, LEONARDO nodes have 4 A100 GPUs, and I think the measurements for the SYCL version are from single GPUs, which would mean a factor-of-4 difference, but I will have to verify this. I’m also not familiar enough to say whether this was the same test case and precisely the same method.
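As a quick back-of-envelope check of the quoted figure (my own arithmetic, not from either paper):

```python
# Sanity-check the quoted throughput: 2.2e8 cells/s on a 4-GPU node
# at 512^3 resolution. Values taken from the quote above.
cells = 512 ** 3          # grid cells at 512^3 resolution
rate_node = 2.2e8         # cells updated per second, full node
gpus = 4                  # A100 GPUs per LEONARDO node

print(cells / rate_node)  # seconds per iteration on a full node, ~0.61 s
print(rate_node / gpus)   # implied per-GPU rate, 5.5e7 cells/s
```

So if the SYCL numbers really are single-GPU, the per-device comparison point would be around 5.5e7 cells/s, not 2.2e8.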

The nice thing about the new work is that it was achieved rather easily by modifying the original version of the ECHO code:

… we have demonstrated how a state-of-the-art numerical code for relativistic magnetohydrodynamics, ECHO has been very easily ported to GPU-based systems (in particular, on NVIDIA Volta and Ampere GPUs).
[…]
The version presented here is thus basically the original one, and now the very same code can run indifferently on a laptop, on multiple CPU-based cores, or on GPU accelerated devices. [emphasis added]

Since the Intel ifx compiler also supports off-loading do concurrent, it would be interesting to see how SYCL and do concurrent compare. In the new article the authors used OpenACC pragmas for data movement. From the Intel article, The Case for OpenMP* Target Offloading, I wasn’t able to tell whether one could combine OpenMP data-movement directives with do concurrent. I later found a discussion on LinkedIn about this, where Henry Gabb from Intel stated:

On Intel platforms, I still need the OpenMP directives to control host-device data transfers.
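For reference, the combination being discussed looks roughly like this. This is my own untested sketch, not code from either article; the exact directive spellings and offload flags differ between compilers:

```fortran
program dc_offload
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n)
  integer :: i
  b = 1.0
  ! The OpenMP target-data directives control the host-device transfers,
  ! while the DC loop itself expresses the parallelism to be offloaded.
  !$omp target data map(to: b) map(from: a)
  do concurrent (i = 1:n)
    a(i) = 2.0 * b(i)
  end do
  !$omp end target data
  print *, a(1)
end program dc_offload
```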

@sumseq also left a comment about his experiences: Early Results Using Fortran’s Do Concurrent Standard Parallelism on Intel GPUs with the IFX Compiler


Here is an excerpt from the paper on parallel programming paradigms.

Luckily, we are currently on the verge of the so-called exascale era of High Performing Computing (HPC), and numerical codes previously written to work on standard (CPU-based) architectures are being ported on the much faster (and more energetically efficient) Graphics Processing Units (GPUs). This is true for all kinds of research fields where fluid and plasma turbulence play a relevant role, from engineering applications of standard hydrodynamic flows [35–40], to relativistic MHD and GRMHD [41–44]. Unfortunately, the process of porting an original code previously engineered to work on CPU-based systems can be very long and it usually requires a complete rewriting of the code in the CUDA language (created by NVIDIA, available for both C and Fortran HPC languages) or using meta-programming libraries like KOKKOS [e.g. 45] or SYCL / Data Parallel C++ (recently successfully applied to a reduced and independent version of our code, see https://www.intel.com/content/www/us/en/developer/articles/news/parallel-universe-magazine-issue-51-january-2023.html, accessed on 27/11/2023).

Possibly slightly less computationally efficient, but offering a huge increase in portability, longevity, and ease of programming, is to use a directive-based programming paradigm like OpenACC. Similarly to the OpenMP API (Application Programming Interface) for exploiting parallelism through multicore or multithreading programming on CPUs, the OpenACC API (which may also work for multicore platforms) is designed especially for GPU accelerated devices. Specific directives, treated as simple comments by the compiler when the API is not invoked, must be placed on top of the main computationally intensive loops, in the routines called inside them (if these can be executed concurrently), in the definition of arrays that will be stored in the device’s memory, and in general in all kernels that are expected to be offloaded to the device, avoiding as much as possible copying data from and to CPUs. In any case, the original code is not affected when these interfaces are not activated (via compiling flags). An example of a Fortran code successfully accelerated on GPUs with OpenACC directives is [40] (a solver for Navier-Stokes equations).

A last possibility is the use of standard language parallelism for accelerated computing
and it would be the dream of all scientists involved in software development: to be able to
use a single version of their proprietary code, written in a standard programming language
for HPC applications (C, C++, Fortran), portable to any computing platform. The task of
code parallelization and/or acceleration on multicore CPUs or GPUs (even both, in the
case of heterogeneous architectures) should be simply left to the compiler. An effort in
this direction has been made with Standard Parallel Algorithms for C++, but it seems that
modern Fortran can really make it in a very natural way. The NVIDIA nvfortran compiler
fully exploits the parallelization potential of ISO Fortran standards, for instance the do
concurrent (DC, since 2008) construct for loops, that in addition to the Unified Memory
paradigm for straightforward data management, allows for really easy scientific coding
[46,47]. The last Intel Fortran compiler ifx, with the support of the OpenMP backend,
should also allow for the offload of DC loops to Intel GPU devices.


You can check out all the papers using YALES2 (public page: www.coria-cfd.fr); the name stands for Yet Another Large Eddy Simulation. It’s a modern Fortran CFD solver.
Here are some examples:
YALES2BIO: A Computational Fluid Dynamics Software Dedicated to the Prediction of Blood Flows in Biomedical Devices | SpringerLink

or in the biomedical field


Avoiding breakdown in incomplete factorizations in low precision arithmetic
by Jennifer Scott and Miroslav Tůma
arXiv February 1, 2024

Abstract
The emergence of low precision floating-point arithmetic in computer
hardware has led to a resurgence of interest in the use of mixed
precision numerical linear algebra. For linear systems of equations,
there has been renewed enthusiasm for mixed precision variants of
iterative refinement. We consider the iterative solution of large
sparse systems using incomplete factorization preconditioners. The
focus is on the robust computation of such preconditioners in half
precision arithmetic and employing them to solve symmetric positive
definite systems to higher precision accuracy; however, the proposed
ideas can be applied more generally. Even for well-conditioned
problems, incomplete factorizations can break down when small entries
occur on the diagonal during the factorization. When using half
precision arithmetic, overflows are an additional possible source of
breakdown. We examine how breakdowns can be avoided and we implement
our strategies within new half precision Fortran sparse incomplete
Cholesky factorization software. Results are reported for a range of
problems from practical applications. These demonstrate that, even for
highly ill-conditioned problems, half precision preconditioners can
potentially replace double precision preconditioners, although
unsurprisingly this may be at the cost of additional iterations of a
Krylov solver.
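To make the breakdown issue concrete, here is a toy dense sketch of the classic shift-and-restart safeguard the abstract alludes to. This is my own illustration, not the authors' half-precision software, and the function name, tolerance, and shift schedule are all assumptions:

```python
import math

def shifted_cholesky(A, tol=1e-8, shift0=1e-4):
    """Dense Cholesky with a diagonal-shift safeguard: if a computed
    pivot is too small (breakdown), add a shift and refactorize."""
    n = len(A)
    shift = 0.0
    while True:
        L = [[0.0] * n for _ in range(n)]
        ok = True
        for j in range(n):
            d = A[j][j] + shift - sum(L[j][k] ** 2 for k in range(j))
            if d <= tol:                          # breakdown detected
                shift = shift0 if shift == 0.0 else 10.0 * shift
                ok = False
                break
            L[j][j] = math.sqrt(d)
            for i in range(j + 1, n):
                s = sum(L[i][k] * L[j][k] for k in range(j))
                L[i][j] = (A[i][j] - s) / L[j][j]
        if ok:
            return L, shift

L, shift = shifted_cholesky([[1e-12, 0.0], [0.0, 1.0]])
print(shift)    # 0.0001: the near-singular case succeeds after one shift
L2, shift2 = shifted_cholesky([[4.0, 2.0], [2.0, 3.0]])
print(shift2)   # 0.0: the well-conditioned case needs no shift
```

In half precision the same idea has to guard against overflow as well as small pivots, which is the extra wrinkle the paper addresses.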


Our software is written in Fortran and compiled using the NAG compiler (Version
7.1, Build 7118). This is currently the only Fortran compiler that supports the use of fp16

I was quite surprised by the title, as so far I hadn’t found any formal reference to half precision, except for its introduction in the latest 2023 standard. Now it makes sense.

Flang has it too: Implementation of Intrinsic types in f18 — The Flang Compiler

REAL(KIND=2) 16-bit IEEE 754 binary16 (5e11m)
REAL(KIND=3) 16-bit upper half of 32-bit IEEE 754 binary32 (8e8m)
REAL(KIND=4) 32-bit IEEE 754 binary32 (8e24m)
REAL(KIND=8) 64-bit IEEE 754 binary64 (11e53m)
REAL(KIND=10) 80-bit extended precision with explicit normalization bit (15e64m)
REAL(KIND=16) 128-bit IEEE 754 binary128 (15e113m)

Also, nvfortran has half precision, including on GPU devices. In both cases the kind constant is kind=2.


What is

REAL(KIND=3) 16-bit upper half of 32-bit IEEE 754 binary32 (8e8m)

and when would it be useful?

Development and convergence analysis of an effective and robust implicit Euler solver for 3D unstructured grids – D.F. Cavalca et al.

DOI: 10.1016/j.jcp.2018.04.005
Link: https://www.sciencedirect.com/science/article/abs/pii/S0021999118302195

The paper’s authors used Fortran.

It’s not a very good paper: I don’t like their choice of an analytical Jacobian, since it cannot be reused for other schemes, and I don’t like their choice of linear solvers either.

But even then, I like this paper a lot for being easy to understand and follow.

I have implemented my experimental CFD solver somewhat based on this paper, using C++, and am now writing the commercial “production” version of it in Fortran 2008.

Which flang was @ivanpribec using? The one I have access to is AMD flang (or should I call it AOCC flang?), which does not have a 16-bit real kind. I think there are at least two other compilers also called flang.

I was referring to the documentation of the “new” flang. But it appears that the implementation is not complete. With the instance available on Compiler Explorer:

flang-new version 19.0.0git (https://github.com/llvm/llvm-project.git 820f244aa92f11292e59440c9bc5afbdec395b20)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/compiler-explorer/clang-llvmflang-trunk-20240203/bin
Compiler returned: 0

the following program

integer(3) :: a
print *, storage_size(a)
end

yields

error: Semantic errors in /app/example.f90
/app/example.f90:1:1: error: INTEGER(KIND=3) is not a supported type
  integer(3) :: a
  ^^^^^^^^^^^^^^^
Compiler returned: 1

Machine learning and reducing bandwidth requirements in CFD/LBM are the only two applications I know of.

What is interesting to me is how this paper ever got published in JCP. There is nothing in this paper that wasn’t already mature and established science 25 years ago. I guess the JCP editors in 2018 needed something to fill space. Almost everything in it is covered in Juri Blazek’s book.

Just to add: I don’t know of any definitive comparison between the benefits of FP16 (IEEE 754) and BF16 (Google Brain format) for machine learning. BF16 has more dynamic range and less precision, and is easier to convert 16-bit ↔ 32-bit because it’s just a matter of bit shifting. FP16 has intrinsics for conversion and math on many devices, though, and has better precision. In machine learning, it seems to be a matter of preference which one is used for training and model representation, if indeed 16-bit floats are used. I have personally only ever worked with FP16, so at least in my bubble it’s more common.
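The trade-off can be seen directly at the bit level. A small sketch (the helper names are mine; BF16 truncation is shown, while real hardware usually rounds to nearest even):

```python
import struct

def f32_bits(x):
    """Bit pattern of x as an IEEE 754 binary32 value."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def bits_f32(b):
    """Float from a binary32 bit pattern."""
    return struct.unpack('<f', struct.pack('<I', b))[0]

def to_bf16(x):
    # BF16 is simply the upper 16 bits of binary32: a shift is enough.
    return bits_f32((f32_bits(x) >> 16) << 16)

def to_fp16(x):
    # FP16 needs re-encoding; Python's struct supports binary16 via 'e'.
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(to_bf16(1 / 3))    # 0.33203125      (8 significand bits)
print(to_fp16(1 / 3))    # 0.333251953125  (11 significand bits: more precise)
print(to_bf16(70000.0))  # 69632.0: representable, BF16 keeps FP32's exponent range
try:
    struct.pack('<e', 70000.0)
except OverflowError:
    print("FP16 overflow")  # FP16's largest finite value is 65504
```

The same pair of effects shows up in the half-precision preconditioner paper above: binary16 gains precision over bfloat16 but pays for it with overflow risk.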

As an aside, for machine learning there are many additional quantized formats coming out (FP8, but also 4-bit, 5-bit, etc.) to fit bigger models in memory. It would be great to eventually have some of these formats supported.

FYI, the flang in LLVM has REAL(3) but your program was using INTEGER(3). The program works if you change INTEGER to REAL but I don’t know if REAL(3) is fully supported yet.


Highlights

  • Integration of Intel Coarray Fortran with Nvidia CUDA Fortran and OpenMP allows parallel computing without extensive code redesign.
  • Coarray Fortran offers comparable performance to the Message Passing Interface (MPI) for distributed memory parallelism.
  • This hybrid configuration shows near-linear scaling across different hardware setups.
  • CPU-GPU affinity can be achieved using this hybrid method and affects performance.

Scheduling and Performance of Asynchronous Tasks in Fortran 2018 with FEATS
Published: 28 March 2024
SN Computer Science
by Brad Richardson, Damian Rouson, Harris Snyder & Robert Singleterry
Preprint here, GitHub FEATS repo here.

Abstract
Most parallel scientific programs contain compiler directives
(pragmas) such as those from OpenMP (Hermanns in Parallel programming
in Fortran 95 using openMP, 2002. School of Aeronautical Engineering,
Universidad Politécnica de Madrid, España, 2011), explicit calls to
runtime library procedures such as those implementing the Message
Passing Interface (MPI) (in A message-passing interface standard
version 4.0, 2021.
https://www.mpi-forum.org/docs/mpi-4.0/mpi40-report.pdf), or
compiler-specific language extensions such as those provided by CUDA
(Ruetsch and Fatica in CUDA Fortran for scientists and engineers: best
practices for efficient CUDA Fortran programming, Elsevier, 2013). By
contrast, the recent Fortran standards empower developers to express
parallel algorithms without directly referencing lower-level parallel
programming models (Numrich in Parallel programming with co-arrays,
CRC Press, 2018, and Curcic in Modern Fortran: building efficient
parallel applications, Manning Publications, 2020). Fortran’s parallel
features place the language within the Partitioned Global Address
Space (PGAS) class of programming models. When writing programs that
exploit data parallelism, application developers often find it
straightforward to develop custom parallel algorithms. Problems
involving complex, heterogeneous, staged calculations, however, pose
much greater challenges. Such applications require careful
coordination of tasks in a manner that respects dependencies
prescribed by a directed acyclic graph. When rolling one’s own
solution proves difficult, extending a customizable framework becomes
attractive. The paper presents the design, implementation, and use of
the Framework for Extensible Asynchronous Task Scheduling (FEATS),
which we believe to be the first task scheduling tool written in
modern Fortran. We describe the benefits and compromises associated
with choosing Fortran as the implementation language, and we propose
ways in which future Fortran standards can best support the use case
in this paper.
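The core scheduling idea in the abstract, running tasks in an order that respects a dependency DAG, can be illustrated with a toy sequential sketch (this is my own example using Kahn's algorithm, not FEATS code; FEATS distributes the ready tasks across Fortran images):

```python
from collections import deque

def schedule(tasks, deps):
    """Return an execution order for tasks whose dependencies are given
    as deps: task -> set of tasks it depends on (assumed acyclic)."""
    indeg = {t: len(deps.get(t, ())) for t in tasks}
    users = {t: [] for t in tasks}
    for t, ds in deps.items():
        for d in ds:
            users[d].append(t)
    ready = deque(t for t in tasks if indeg[t] == 0)
    order = []
    while ready:  # a parallel scheduler would run all ready tasks at once
        t = ready.popleft()
        order.append(t)
        for u in users[t]:
            indeg[u] -= 1
            if indeg[u] == 0:
                ready.append(u)
    return order

print(schedule(["mesh", "solve", "post"],
               {"solve": {"mesh"}, "post": {"solve"}}))
# → ['mesh', 'solve', 'post']
```

The interesting part, which the paper tackles, is doing this coordination with only standard Fortran PGAS features (images, events, collectives) instead of an external runtime.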
