Fortran for P3HPC

Hi all,

At a time when we are witnessing the emergence of new high-performance computing technologies (GPUs, many-core processors like Knights Landing) and, along with them, new programming tools and libraries (CUDA, OpenACC, OpenMP, OpenCL, …), how ready is Fortran to cope?

Of course, we can always do it the hard way: combine our codes with code or directives for any of the above-mentioned libraries, plus traditional MPI, compile everything with the proper compiler and flags, and run. But that clearly raises portability concerns. You may spend many months learning CUDA and tuning your code for Nvidia GPUs, but it will all be in vain if the next supercomputer available to you is not Nvidia-based.

The question of Performance, Portability, and Productivity in HPC (P3HPC) naturally arises, and it has been addressed for some time now by libraries such as Kokkos or RAJA, which deliver portable performance through abstraction - but these are geared towards C++ codes. There are even new languages such as Julia, with native support for parallelism on a plethora of hardware architectures. But Julia is not Fortran.

So I wonder: how is Fortran going to cope with these emerging technologies? I am not aware of a Fortran model for performance and portability across the various emerging computer architectures. Any thoughts on this? (I might be missing something obvious. Do coarrays, for example - which I have never used, I am all into MPI - hold the potential to abstract away hybrid HPC platforms?)

Cheers

7 Likes

IMO, coarrays, do concurrent, elemental procedures, and array statements are a very natural way to express parallel computations without specifying a particular implementation. I would hope compilers and runtimes start to take advantage of these more enthusiastically, so we can stop tying ourselves to OpenMP directives or CUDA specifically.
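
For instance, a toy sketch of what that looks like in standard Fortran (the names are made up for the example; how each form gets parallelized is left entirely to the compiler):

```fortran
! Toy sketch: three equivalent ways to express data-parallel work
! without committing to any particular parallel implementation.
program parallel_styles
   implicit none
   real :: x(1000), y(1000)
   integer :: i

   call random_number(x)

   y = 2.0*x + 1.0                  ! array statement

   y = f(x)                         ! elemental procedure applied to a whole array

   do concurrent (i = 1:size(x))    ! loop with no cross-iteration dependences
      y(i) = 2.0*x(i) + 1.0
   end do

contains

   elemental real function f(v)
      real, intent(in) :: v
      f = 2.0*v + 1.0
   end function f

end program parallel_styles
```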

8 Likes

Excellent questions. I started a new compiler to address precisely these issues: https://lfortran.org/. It is not production-ready yet, but we are making excellent progress.

My answer is that with a good compiler, a lot could be done already with the current language. For example, do concurrent maps nicely to Kokkos; the C++ translation backend in LFortran already generates Kokkos parallel constructs for do concurrent.

Finally, perhaps some extensions to Fortran as a language are needed. My plan is to investigate and prototype such extensions in LFortran. I think we’ll be ready for this in probably a year. Our immediate goal is to get LFortran to compile existing codes.

I think Fortran as a language is fine and can map nicely to heterogeneous hardware. But compilers must do a better job.

6 Likes

I’ll second what @everythingfunctional and @certik have said, and add my own thoughts:

There are two issues: the language, and compilers. On the language front, Fortran is actually doing pretty well, as pointed out above. do concurrent, pure elemental procedures, array statements, and coarrays are most of what is needed to support future HPC scenarios. I would argue, though, that one more language feature may be needed: some way to specify data locality. When doing complex GPU programming involving several different kernels working on the same data, managing data movement to and from the GPU tends to be a big part of the battle. OpenMP gives you a way to tell the compiler that a given set of data needs to be moved to or from the device at a given point in the code. Fortran doesn’t have any such feature right now. Technically a very smart compiler could probably figure that all out, but as someone who does GPU programming professionally, I typically find that it’s beneficial or even necessary to manually control data movement.
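
To make that concrete, here is roughly what explicit data movement looks like with OpenMP target offload in Fortran (a minimal sketch; the subroutine and its arguments are invented for the example):

```fortran
! Minimal sketch: with OpenMP target offload the programmer states
! explicitly when data moves between host and device.
subroutine scale_on_device(a, s)
   implicit none
   real, intent(inout) :: a(:)
   real, intent(in)    :: s
   integer :: i

   !$omp target data map(tofrom: a)        ! move a to the device once, copy it back at the end
   !$omp target teams distribute parallel do
   do i = 1, size(a)
      a(i) = s * a(i)
   end do
   !$omp end target teams distribute parallel do
   !$omp end target data
end subroutine scale_on_device
```

Nothing comparable exists in the base language today: a do concurrent loop expresses the parallelism, but says nothing about where the data lives.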

On the compiler side, things aren’t quite as nice, as the posts above have mentioned (specifically when it comes to GPUs). Coarray support is pretty good, meaning that writing programs that scale to many nodes in a cluster is fairly easy, but in practice, do concurrent implementations fall far short of what they could achieve. I think LFortran has a bright future in this regard - it has a very flexible design that allows for the easy creation of multiple backends.

This is a bit of a tangent, but I think that HPC in general may be suffering from an excess of GPU-hype right now. Since hardware is cheap compared to the time of a skilled programmer, and since abstractions tend to be leaky, the hardware I’m most excited by is the A64FX from Fujitsu, or the upcoming Sapphire Rapids Xeon processors from Intel (specifically the ones that have HBM memory on-chip). I’d change my tune on this if there was a culture among accelerator vendors of supporting a common programming model and contributing implementations to open source compilers, but right now it looks like every vendor wants you to use their particular SDK.

7 Likes

What CUDA/GPUs and “regular” CPU architectures have in common is the existence of separate memory layers. @hsnyder is right; a 10x or greater speedup can easily be obtained when the programmer is more explicit about the memory-layer management of variables/arrays.

The CUDA shared-memory framework essentially lets the programmer control part of the L1 cache, i.e. perform the memory management at that level. I think a language-level construct for that capability would be welcome, if possible, on both the CPU and GPU fronts: the ability to flag (parts of) arrays to be kept resident in the L1 cache would be helpful for both kinds of programming, if it were possible on CPUs.
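
Lacking such a feature, the closest a standard-Fortran programmer gets on CPUs today is manual blocking/tiling, which keeps a working set small enough to stay cache-resident. A rough sketch (tile size and names are purely illustrative):

```fortran
! Illustrative sketch only: manual cache blocking, the closest standard-Fortran
! analogue today to controlling which working set stays resident in L1/L2.
subroutine blocked_matmul(a, b, c, n)
   implicit none
   integer, intent(in) :: n
   real, intent(in)    :: a(n,n), b(n,n)
   real, intent(inout) :: c(n,n)               ! caller initializes c
   integer, parameter  :: nb = 64              ! tile size chosen to fit the cache (illustrative)
   integer :: ii, jj, kk, i, j, k

   do jj = 1, n, nb
      do kk = 1, n, nb
         do ii = 1, n, nb
            do j = jj, min(jj+nb-1, n)
               do k = kk, min(kk+nb-1, n)
                  do i = ii, min(ii+nb-1, n)
                     c(i,j) = c(i,j) + a(i,k) * b(k,j)
                  end do
               end do
            end do
         end do
      end do
   end do
end subroutine blocked_matmul
```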

To be fair on the GPU hype, both FLOPS/watt and FLOPS/dollar for the device itself are much higher for GPUs. A two-node Fujitsu server is something like $39,000 and achieves around 10 TFLOPS double precision, from what I remember. An A6000 gets 40 TFLOPS single precision, for around a $9,000 price tag for a workstation with one. If you demand double-precision equivalence, the A100 delivers about 10 TFLOPS double precision for an approximately $15,000 price tag, with the added benefit that you could get up to 156 TFLOPS single precision using the tensor cores.

It may be possible to map the usual GPU model onto teams and images. For example, NVIDIA GPUs execute SIMT instructions 32 threads at a time, since a warp executes all at once. A warp could correspond to a team (or perhaps a thread block would map to a team), and the images would comprise all the GPU cores. Maybe it would actually be possible to take inspiration from the CUDA programming model for an abstraction of distributed+shared-memory computing: a kernel is something that is executed N warps/thread blocks (teams?) at a time (N = number of nodes or streaming multiprocessors), and the team executing the kernel can share memory between its members. Different teams would have to make (slower) memory accesses to communicate information across teams, and this could be done either synchronously or asynchronously.
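
A rough sketch of that grouping in Fortran 2018 terms (the 32-images-per-team "warp" analogy is purely illustrative):

```fortran
! Rough sketch (illustrative only): grouping images into teams the way
! CUDA groups threads into warps/thread blocks.
program team_sketch
   use, intrinsic :: iso_fortran_env, only: team_type
   implicit none
   type(team_type) :: block_team
   integer :: color

   ! 32 images per team, loosely mirroring a 32-thread warp
   color = (this_image() - 1) / 32 + 1
   form team (color, block_team)

   change team (block_team)
      ! inside the team, this_image()/num_images() are team-local;
      ! team members could share data via coarrays, analogous to shared memory
   end team
end program team_sketch
```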

2 Likes

Welcome to the forum, @adenchfi! Thanks for your comments. Interesting idea regarding using the coarray model to emulate the GPU model. My instinct so far has been that do concurrent was the way to go for that sort of thing, with coarrays reserved for node-to-node communication, but I had never thought to use teams in that way.

Re: GPU hype - what you’re saying is largely true, but in almost all the software that I’ve worked on, memory bandwidth is a bottleneck to such an extent that the theoretical peak FLOP numbers are meaningless. If you do have one of the use cases where you can actually have that many mathematical operations per load or store, then GPUs are going to take the crown, no question. However, like I said, in most of the applications I’ve seen, memory bandwidth is a much better metric. In that context, an A64FX node performs about as well as a Tesla V100 (which was the incumbent GPU at the time of the A64FX release, IIRC). That shrinks the price gap considerably. If you have to account for developer time to port an application to CUDA, for example, the price gap is gone (likely reversed).

EDIT: That said, take my comments with a grain of salt - I’ve never actually run on an A64FX, I’ve just read papers from people who have. Maybe I’m guilty of anti-GPU hype, haha! Ultimately I just want people to be aware of these alternatives so they can do their own investigation, and make informed decisions about the balance of hardware cost, development cost, and likely performance.

3 Likes

Thanks @hsnyder for the welcome! The CUDA model at least is very reminiscent of a cluster, which is what inspired my comments on teams: N streaming multiprocessors <=> N nodes in a cluster.

It’s very true that GPUs still have memory-bandwidth issues. A good CUDA programmer can mitigate those issues a lot; I don’t know about the AMD (or other GPU vendor) side. Anything GPU-accelerated via OpenACC, OpenMP, or possibly do concurrent, where the programmer is not doing the memory management, is usually largely memory-bound, because these simple directives often don’t optimally re-use data. Proper use of shared memory (the GPU L1 cache) mitigates the bandwidth issue a fair amount because it massively aids data re-use, but that programming has to be redone on a kernel-by-kernel basis. The GPU hype alone isn’t enough for me to encourage adopting CUDA Fortran-like subroutines/definitions into the Fortran standard, but if we could also finely control the L1 cache of CPUs at the language level, such constructs would extend to CPU programming as well.

2 Likes

I do not know much about GPU stuff.
But my experience doing Monte Carlo is that Xeon Phi is very slow (each core at least 20 times slower than a typical CPU core), and Intel does not seem to plan any new products in the Xeon Phi family. I use Xeon Phi just for fun - like, wow, I am using 5000+ cores, that’s cool! - but in practice 200 regular cores perform better than 5000+ Xeon Phi cores.

Perhaps Xeon Phi or GPUs are better suited to certain matrix operations, e.g. splitting a big matrix into smaller pieces, computing each piece, and then summing them up, or something like that.

The original FORTRAN compiler had a FREQUENCY statement and performed a Monte Carlo simulation based on the numbers provided before code generation.

The supreme goal of this decade must be to massively reduce power consumption while at the same time massively increasing the computational power available for executing full programs. This may require executing full programs on new types of highly energy-efficient accelerators, and also using parallel programming as a standard way of developing software.

Some predict that Explicit Data Graph Execution (EDGE) based accelerators will become a real game changer in the near future. The Intel CSA is such an accelerator: https://en.wikichip.org/wiki/intel/configurable_spatial_accelerator.

Somebody (presumably from Intel) described the CSA as an evolution of an FPGA and called it an FPPA (field-programmable PE array), and also described Intel’s acquisition of Altera as a foundation for this (presumably to get access to the reconfigurable wiring technology).

It is certainly not too early to focus on that accelerator and to point to an upcoming new era of reconfigurable computing. The CSA has (or will have) very fast reconfiguration times for the buses between PEs. The dataflow graphs of full programs (i.e. the actual workload of a program) will determine the configuration.

As far as I understand it, this could also be the workload of serial programs, but to unleash the full potential of a CSA, parallel programming is a requirement. Then, by developing and applying new kinds of parallel programming models, one could influence the configuration of a CSA at a more abstract level.

I expect Intel’s evolving LLVM-based ifx Fortran compiler to become an EDGE compiler to run full Fortran programs on the CSA.

Fortran 2018, eventually and thanks to coarray teams, completed the shift from a serial-based programming language to a parallel programming language (of course, Fortran 2018 supports both programming approaches). Coarray Fortran appears not only as a kind of parallel programming language; it also provides enough low-level features for the programmer to extend the base language through customization and - that’s the really important point - it allows the required new types of parallel programming models to be implemented with it. Parallel programming without a plan will not lead anywhere if you want to use it for general-purpose computing on upcoming reconfigurables.

Cheers

2 Likes

My above comment was obviously the wrong side of the coin to adequately answer the OP’s question, so here’s another try:

The more adequate answer lies in the underlying PGAS model itself. While Fortran 2008 was SPMD, Fortran 2018 is fully qualified APGAS. (And only as an aside here, F18 can easily be adapted by the Fortran programmer to be used for implicit parallelism.)

F18 does support spawning threads (in fork-join fashion) through coarray teams. In Fortran, spawning is only local, but remote spawning (as in X10) isn’t a requirement to qualify as APGAS anyway.

With my above comment I focused on the consequences for general-purpose parallel programming, but the APGAS model is by its nature also an answer to programming complex heterogeneous hardware architectures.

As an introduction to PGAS theory (explaining and defining it), and to better understand today’s Fortran, I would point you to two papers:

Both papers are outdated with regard to Fortran, since even the most advanced (A)PGAS topics now apply directly to Fortran 2018, or can (easily) be adopted by the Fortran programmer.

Regards

1 Like