An evaluation of risks associated with relying on Fortran for mission critical codes for the next 15 years

This should probably be split into a dedicated thread.

Why can’t you just write matmul(transpose(A), B) and let the compiler generate the equivalent of cblas_sgemm(CblasColMajor, CblasTrans, CblasNoTrans, m, n, k, 1.0, A, k, B, k, 0.0, C, m)? I know that compilers often can’t do that yet, but I think they should. Is there some reason why this is not a good design?
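For concreteness, here are the two forms side by side; the sgemm call is just the Fortran 77 BLAS spelling of the cblas_sgemm call above, for C of shape m-by-n, A of shape k-by-m and B of shape k-by-n:

C = matmul(transpose(A), B)

call sgemm('T', 'N', m, n, k, 1.0, A, k, B, k, 0.0, C, m)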

4 Likes

My feeling is that the Fortran community has run after other languages with a “me too” attitude for this very reason, instead of focusing on the core aspects of Fortran.

I do not use even 10% of the OOP capabilities of Fortran, and each time I tried to investigate further I could not see what the real benefit would be. In contrast, I am highly bothered by the lack of any genericity.

Fortran is/was a language targeted at scientists (in a broad sense). These people are mostly not software engineers/architects; those are different jobs, and writing good OOP code is a job for the latter. I recently had to dig into a C++ code written by the former (who were no longer in the company), with OOP poured in a bit everywhere. It was painful, first because, as a matter of fact, I’m not skilled in C++, but also because many things in this code were close to nonsense. As a result, the code was extremely rigid, and a change that I initially thought would be relatively easy, based on my experience with more classical codes, proved to be quite complex.

2 Likes

This is where the Fortran Standard Library (https://stdlib.fortran-lang.org/) could be hugely useful.

For example, matrix multiplication could be called as:

call stdlib_matmul(A=A, B=B, C=C)
call stdlib_matmul(OpA='T', A=A, B=B, C=C)
call stdlib_matmul(A=A, B=B, C=A)   ! for in-place multiplication
call stdlib_matmul(alpha=alpha, beta=beta, A=A, B=B, C=C)

It would even handle the case given by @RonShepard with

call stdlib_matmul(A=A(i,j,:,:), B=B(k,l,:,:), C=C)

A matrix type could be defined which holds all the information about the arrays, such as type, dimension, sparsity, etc. For example:

A%n=100
A%m=100
A%field='complex'
A%precision='double'
A%structure='hermitian'
A%packed=.true.
A%upper=.true.
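A hypothetical derived type along those lines might look like the following (purely a sketch; the type and component names are illustrative, not an actual stdlib proposal):

type :: matrix_t
   integer :: n = 0, m = 0                     ! dimensions
   character(:), allocatable :: field          ! 'real' or 'complex'
   character(:), allocatable :: precision      ! 'single' or 'double'
   character(:), allocatable :: structure      ! 'general', 'hermitian', ...
   logical :: packed = .false.                 ! packed storage
   logical :: upper  = .false.                 ! upper triangle stored
end type matrix_t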

The types and interfaces would be easily extendible without breaking backward compatibility.

Most of the work done by the Fortran Standard Library would be to provide a smart interface to BLAS. This seems to be ideally suited to Fortran with its optional arguments and meta-data passed when calling subroutines with explicit interfaces.
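As a rough sketch of how the optional arguments could map onto a BLAS call underneath (the wrapper name and argument list are speculative, following the examples above; single precision only, no in-place handling):

subroutine stdlib_matmul(A, B, C, opA, alpha, beta)
   real, intent(in), contiguous :: A(:,:), B(:,:)
   real, intent(inout), contiguous :: C(:,:)
   character, intent(in), optional :: opA
   real, intent(in), optional :: alpha, beta
   character :: ta
   real :: al, be
   integer :: k
   external :: sgemm
   ta = 'N'; if (present(opA))   ta = opA
   al = 1.0; if (present(alpha)) al = alpha
   be = 0.0; if (present(beta))  be = beta
   k = merge(size(A,1), size(A,2), ta == 'T')   ! inner dimension of op(A)
   call sgemm(ta, 'N', size(C,1), size(C,2), k, al, A, size(A,1), &
              B, size(B,1), be, C, size(C,1))
end subroutine stdlib_matmul

Called as stdlib_matmul(A=A, B=B, C=C), this would default to alpha=1 and beta=0, i.e. plain C = matmul(A, B).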

But why stop there? How about a smart interface to LAPACK, FFTW or the GSL?

For example again:

call stdlib_fft(a_in=z, direction='forward')

For an in-place, complex FFT. All the relevant information such as dimensions is passed by the interface. Or how about

call stdlib_fft(a_in=z, a_out=r, direction='backward', scale=.false.)

… for an out-of-place, complex to real FFT without scaling.
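As a hypothetical body for the complex-to-complex case, built on FFTW’s legacy Fortran interface (stdlib_fft itself is speculative; the dfftw_* routines and the fftw3.f constants are FFTW 3’s documented Fortran 77 API):

subroutine stdlib_fft(a_in, a_out, direction)
   complex(8), intent(inout), contiguous :: a_in(:)
   complex(8), intent(out), contiguous, optional :: a_out(:)
   character(*), intent(in) :: direction
   include 'fftw3.f'          ! defines FFTW_FORWARD, FFTW_BACKWARD, FFTW_ESTIMATE
   integer(8) :: plan
   integer :: sgn
   sgn = merge(FFTW_FORWARD, FFTW_BACKWARD, direction == 'forward')
   if (present(a_out)) then   ! out-of-place transform
      call dfftw_plan_dft_1d(plan, size(a_in), a_in, a_out, sgn, FFTW_ESTIMATE)
      call dfftw_execute_dft(plan, a_in, a_out)
   else                       ! in-place transform
      call dfftw_plan_dft_1d(plan, size(a_in), a_in, a_in, sgn, FFTW_ESTIMATE)
      call dfftw_execute_dft(plan, a_in, a_in)
   end if
   call dfftw_destroy_plan(plan)
end subroutine stdlib_fft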

Eigenvalues and eigenvectors are another good example:

call stdlib_eigen(a=a, v=v, w=w, range=[0.d0,100.d0])
call stdlib_eigen(a=a, w=w, num_threads=32)
call stdlib_eigen(a=a, w=w, target_device='GPU')

I think this is a much better way of adding truly useful functionality to Fortran without overburdening the compiler writers and ending up with ambiguous and tangled syntax.

(Just to be clear, the examples above are pure speculation as to how an stdlib interface might look)

10 Likes

@jkd2022 I think you are onto something here, very good ideas. I think the compiler has to cooperate with this. For example, when I see:

call stdlib_matmul(alpha=alpha, beta=beta, A=A, B=B, C=C)

I have no idea what exactly it does. I am guessing it is the same as this:

C = alpha*matmul(A, B) + beta*C

So the compiler should take the second and create an equivalent of the first. The second is a lot more readable.

For eigenvalues it would be helpful to be able to return a tuple, something like:

w = eig(A, num_threads=32)
w, v = eig(A, target_device='GPU')

Until then we have to return them as arguments, as you showed.
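For reference, the argument-based way of doing this today through LAPACK looks roughly like the following (dsyev is the standard real symmetric eigensolver; the lwork = -1 workspace query is part of its documented interface):

real(8), allocatable :: a(:,:), w(:), work(:)
real(8) :: wquery(1)
integer :: n, lwork, info

n = 100
allocate(a(n,n), w(n))
! ... fill a with a symmetric matrix ...
call dsyev('V', 'U', n, a, n, w, wquery, -1, info)    ! workspace query
lwork = int(wquery(1))
allocate(work(lwork))
call dsyev('V', 'U', n, a, n, w, work, lwork, info)   ! eigenvalues in w, eigenvectors overwrite a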

We already agreed (back in 1990) that matmul is so fundamental that it is part of the language. So in that case let’s go all in and ensure that compilers truly provide the maximum performance, no compromises, while keeping the nice, readable, high-level mathematical syntax.

2 Likes

@certik This is so well said that I cannot agree more.

It is strange that a facility as fundamental as ‘matmul’ is still allowed to segfault by default with some reputable compilers, let alone deliver good performance. I don’t think many people would continue to learn Fortran if they were told about this on their very first day of learning.

I hope it will not take another thirty years for what you said about ‘matmul’ to come true. For eigenvalue calculations, I hesitate to wish for anything, because it sounds too good given the history.

1 Like

DO CONCURRENT, even if offloaded very efficiently, is way too little. One needs to be able to offload many more types of algorithms, and also to control memory transfers. If the data had to be moved to and from the GPU for every do loop, it would often be very slow.

1 Like

Actually, I have not found DO CONCURRENT very useful at all. I see it talked about quite often, but I do not get the hype. A normal DO loop + OpenMP is more general.

1 Like

If they are also told why at the very same time, it can be understood.

One source of misunderstanding is that Fortran (starting with F90) is array-oriented, not specifically matrix-oriented, and these are not exactly the same thing.

That is not an accurate description of the NVIDIA implementation, at least. When GPU-enabled DO CONCURRENT is used, allocatable arrays become CUDA managed memory, which migrates between the GPU and CPU. It starts on the GPU by default, so if you only access an allocatable array with DO CONCURRENT, then it will never leave the GPU. The locality specifiers behave the same as their OpenMP equivalents, when used.
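For example (a minimal sketch; under nvfortran -stdpar=gpu the allocatable arrays below become CUDA managed memory, and the F2018 locality specifiers play roughly the role of OpenMP’s private/shared clauses):

real, allocatable :: a(:), b(:)
real :: t
integer :: i, n

n = 10**6
allocate(a(n), b(n))
b = 1.0
do concurrent (i = 1:n) local(t) shared(a, b)
   t = 2.0*b(i)
   a(i) = t + 1.0
end do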

If you want more explicit control over memory locality, you can use OpenACC directives, as @sumseq does in [2110.10151] Can Fortran's 'do concurrent' replace directives for accelerated computing?. This allows the code to remain completely standard-conforming Fortran while also supporting lower-level, GPU-specific control. The Intel and Fujitsu compilers, for example, will simply ignore the OpenACC directives as comments.

2 Likes

If you want to use DO+OpenMP, you have to write one version of the code for the CPU and another for the GPU, unless you plan to treat your CPU as a target offload device.
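For instance, the same loop typically needs two different directive sets under OpenMP (illustrative; the teams/distribute and map spellings follow OpenMP 4.5+):

!$omp parallel do                                  ! CPU threading
do i = 1, n
   y(i) = a*x(i) + y(i)
end do

!$omp target teams distribute parallel do map(to: x) map(tofrom: y)   ! GPU offload
do i = 1, n
   y(i) = a*x(i) + y(i)
end do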

But in general, you are right that standards lack low-level controls of HPC-specific models. MPI has a lot of features that aren’t found in coarrays, such as nonblocking collectives. That doesn’t mean it’s not useful to have a fully standard language-based parallel model.

2 Likes

I have a general question about this memory transfer. Newer GPUs are now advertising “unified” memory models. I think this means that the CPU and GPU share the same memory address space. Does that also mean that they share the same physical memory? If they do share the same physical memory, does that mean that all such transfers are eliminated (e.g. they are just shallow copies)?

Where can I learn more about how nvfortran does this? I found the following links:

But they don’t seem to immediately answer the first question in my mind: if I use “do concurrent” inside a function, I assume the arrays it uses will use CUDA managed memory. What about the code that calls my function? Does this property of “CUDA managed memory” propagate in the call chain?

When NVIDIA unified managed memory is enabled through the nvfortran compiler flag, all allocatables are allocated as CUDA managed allocations under the hood. They are then paged back and forth between the GPU and CPU as they are used throughout the code. Only when the code uses regular DO loops (with no OpenMP offload, OpenACC, or DO CONCURRENT), or passes the arrays to an external library that is not GPU-pointer compatible (such as I/O), does the data page back to the CPU.

For a recent example with a production code, see our upcoming paper on arXiv:

3 Likes

Thanks @sumseq. Good article, and thanks for pushing the boundary of getting a regular Fortran code to run well on all GPUs without directives. This is one point where I disagree with the report, which states “It is likely that codes that rely on Fortran will have poor performance for GPU technologies.” as well as “It is very likely that Fortran will preclude effective use of important advances in computing technology.” In my opinion it is true today that standard Fortran doesn’t run well on GPUs (and your article seems to confirm it), but better compilers will fix that, and work is ongoing on multiple fronts to get such compilers into production (which your article also seems to suggest).

There is one virtual address space and only one copy of the data at any time. If there are two physical memories, data migrates between them, either at page granularity (in the case of PCIe-based systems) or finer granularity if possible. Or, if the bus connection is fast enough, data will load directly from the “other” memory, e.g. GPU loads straight from CPU memory, or vice versa.

There are implementations of migration based on the GPU driver and on Linux HMM. When the driver is in charge, special allocators must be used. With Linux HMM, one can get migration of malloc’d memory.

The behaviors of x86+NVIDIA, Power9+NVIDIA, NVIDIA Grace Hopper, and AMD platforms are all different, and some platforms support more than one mode simultaneously, so I can’t be more specific about the implementation details without picking a specific platform.

2 Likes

Code that executes on the GPU needs data that it can access. As long as the data was allocated with something that calls the CUDA device or managed allocators, it’s fine. You can get that with Fortran ALLOCATE and -stdpar=gpu, with non-standard features from CUDA Fortran, OpenMP or OpenACC, or even from something that does this in C or C++.
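For example (illustrative; the first declaration is standard Fortran and relies on the nvfortran -stdpar=gpu behaviour described above, while the managed attribute is CUDA Fortran, not standard Fortran):

real, allocatable :: a(:)            ! becomes CUDA managed memory under -stdpar=gpu
real, allocatable, managed :: b(:)   ! explicitly managed, via CUDA Fortran
allocate(a(n), b(n))                 ! both are then accessible from GPU code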

The above is necessary for portability to PCIe-based GPU systems where the GPU driver needs to be involved in memory allocation. On platforms like Power9+V100 and Grace Hopper, the GPU can access OS-managed memory and it’s not necessary to use special allocators, although there are in some cases performance benefits to doing so.

One way to get an idea of what’s happening is to profile the code with Nsight and see what the memory is doing. You’ll see the allocator calls. You won’t see managed-memory migration directly, but you can infer it from timings.
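For example (typical Nsight Systems command lines; the exact reports and output depend on the installed version):

nsys profile -o gpu_run ./my_app
nsys stats gpu_run.nsys-rep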

2 Likes

@JeffH Could you elaborate on the different GPU types and the implications for how the data is moved (or not moved) to and from a dedicated physical memory or memory area? It’s still a bit unclear to me… There are:

  • The dedicated graphics cards with their own memory, which require the data to be moved back and forth for CPU or GPU processing
  • The integrated GPUs, such as the Intel ones in all Core i* processors. It’s generally said that a certain amount of RAM is reserved for the GPU: is it just a quota, or a partition of the address space? Does the data have to be moved to a special area in order to be accessed by the GPU?
  • The unified memory, as in Apple M* chips: I understand that here the GPU can directly access any data in RAM, just like the CPU does, without any need to move it. Is that correct?

The standard platforms are:

  1. CPU + memory <== PCIe ==> GPU + memory (high-end PC and server)
  2. CPU + GPU + memory (low-end PC)

You can look up the speeds of the various PCIe generations to understand the capability of 1 (most GPUs use an x16 connection). PCIe Gen 5 is more featureful than prior generations, but how that plays out depends on a lot of things, including the software. Whether and how well a GPU supports page faults has a huge impact on use cases, independent of the PCIe bus speed.

Even though 2 is integrated, not all hardware and software support direct load/store, either because the GPU can’t fault pages and thus needs explicit driver mapping, or because the software builds a wall to force the programmer to comply with a specification like OpenCL 2. I also know of cases where the GPU doesn’t support CPU cache coherence, so direct access requires the memory to be marked uncacheable by the OS, which hurts performance in exactly the workloads where one would be trying to take advantage of a physically unified memory.

NVIDIA Grace Hopper is a third case:

  3. CPU + memory <== NVLink ==> GPU + memory

NVLink has significantly higher bandwidth than PCIe and has better support for load/store, etc. https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/ has the details. It’s hardware coherent, so you can think of it like heterogeneous NUMA.

I won’t discuss Apple GPUs here because Apple doesn’t document their GPU platform very well (at least from an HPC user perspective) and provides no Fortran compiler, so their platform is not usable by this community except as a fantastic AArch64 CPU with gfortran.

I chose not to comment on AMD platforms because I work for NVIDIA so any erroneous comments could be interpreted as malicious, and frankly, I don’t have any expertise in Infinity Fabric. @bcornille might be able to help.

3 Likes

This is very useful information @JeffH – thanks.

I have a related question. Currently our electronic structure code is CPU only and typically runs on thousands of cores via OpenMP + MPI, usually with over 90% parallelism.

Over the past year or so we’ve been converting some parts to single precision in anticipation of adding GPU-specific optimizations.

On our cluster we have some nodes with 72 IceLake cores and 4 Nvidia A100-SXM4 cards.

Here’s the situation: I have to use the CPUs for task parallelism, because the task each thread performs is very complex. Each task is equivalent to the others; it is just performed on different data.

Suppose each thread reaches a point where GPU parallelism might be useful (for example, matrix multiplication or many dot products). There are 72 CPU cores but only 4 GPUs: what happens if each CPU thread requests GPU resources nearly simultaneously, say with OpenMP target or OpenACC? Would the GPUs be shared and used concurrently, or would the CPU threads have to wait their turn?

I am excited to reply to this on a new thread :slight_smile:

1 Like