The AI era is pulling FP64 hardware away from scientific HPC

Hmmm. Makes me want to convert some of the arbitrary-precision libraries for Fortran to work on a GPU and see just what speed I can get out of them. The pendulum swung from custom liquid-cooled HPC machines to commodity chips, perhaps largely due to gaming, and now there is a new commodity (i.e. AI) driving the chips. HPC seems to have been along for the ride for quite some time now. Of course, in this case maybe the resulting AI platforms will just design some quantum machines for the HPC crowd to keep us fat, dumb, and happy while they evolve organic GPUs and start growing themselves improved designs until they are the ones asking the questions and not us.

There was a question to the panel on HPC at last year’s conference in my field (which requires HPC to do most things these days). The NVIDIA rep (who I did not think much of) said much the same as has been reported, i.e. that FP64 isn’t going away.

One of the other panellists suggested that we have a long history of mixed precision solvers and that was a likely path ahead. I don’t know that we know how well that works for FP4, but certainly I’ve seen FP16 work well. The downside of this approach is that you need to tune more parameters, but I have/had colleagues who have done interesting work in automating that process in a data-driven fashion.
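To make the mixed-precision idea concrete, here is a minimal sketch of classic iterative refinement: the linear solve is done in low precision and only the residual is computed in FP64. Everything in it is an assumption for illustration: the helper names (`f32`, `solve_f32`, `residual_f64`, `refine`) are hypothetical, binary32 is emulated by rounding Python floats through `struct`, and the small tridiagonal test matrix is made up. A real solver would factor the matrix once and reuse the factors rather than re-running the elimination for every correction.

```python
import struct

def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def solve_f32(A, b):
    """Gaussian elimination with partial pivoting; every operation is rounded to binary32."""
    n = len(b)
    # Augmented matrix [A | b], stored in emulated single precision.
    M = [[f32(a) for a in row] + [f32(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(m * M[k][j]))
    # Back substitution, also in emulated single precision.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x

def residual_f64(A, x, b):
    """Residual r = b - A x, computed in ordinary double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]

def refine(A, b, iterations=5):
    """Low-precision solve, then repeated low-precision corrections of the FP64 residual."""
    x = solve_f32(A, b)
    for _ in range(iterations):
        r = residual_f64(A, x, b)
        d = solve_f32(A, r)                      # correction from the cheap solver
        x = [xi + di for xi, di in zip(x, d)]    # update accumulated in FP64
    return x

# Made-up, well-conditioned test problem.
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
x_true = [1.0 / 3.0, -2.0 / 7.0, 0.123456789]
b = [sum(aij * xj for aij, xj in zip(row, x_true)) for row in A]

err_single = max(abs(u - v) for u, v in zip(solve_f32(A, b), x_true))
err_refined = max(abs(u - v) for u, v in zip(refine(A, b), x_true))
print(err_single, err_refined)
```

On a well-conditioned problem like this one, the refined answer should get close to FP64 accuracy even though every solve ran in single precision. The number of refinement steps and the choice of which parts run in which precision are exactly the kind of extra parameters that need tuning.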

Yes, but only because today’s CPUs incorporate vector processing, which was arguably Cray’s point. The jury is still out on those FP8 “chickens”. We’ll see how they do.

I have bigger fears on the software side. As you already mentioned, compiler progress is slow, and not all vendors show the same interest in supporting standards like Fortran and OpenMP.

For machine learning workloads, the top kernels nowadays are written in “tile languages”, the most notable example being Triton, which many companies use (I know AMD and Microsoft do, and I guess there are others too).

In response to this development, Nvidia has launched CUDA Tile, including the cuTile Python dialect and CUDA Tile C++. Nvidia also provides access to the Tile IR for users who want to target the tile dialect themselves. I’ve seen that both Julia and Rust versions are already available:

(Tangential, but Nvidia now also provides an experimental GPU compiler for Rust named cuda-oxide.)

Fortran seems like a suitable language for a tile dialect, as the array syntax and array intrinsics also map well to tiled operations. I believe dialects like this already existed in the past; Connection Machine Fortran is one case that springs to mind. But it looks like there is no user demand for this, as most machine learning work seems to flow through PyTorch, JAX, and similar libraries, in one way or another.

With agentic coding, I fear that Fortran may sink even deeper into the “software swamp”, overtaken by faster-growing languages and compilers with better sponsorship.

(The flip side of the coin is that agentic coding may also breathe new life into Fortran software.)


Edit: the tile programming concept is older than these frameworks, as publications suggest, but it has found new application in AI workloads, giving rise to the new frameworks I mentioned.

I think I was conflating two ideas, in a way that I think is not unlikely to happen in the future but perhaps isn’t happening yet.

The work I was thinking of is this: [2505.14399] Accelerating multigrid with streaming chiral SVD for Wilson fermions in lattice QCD, which basically selects the prolongation/restriction bases for the multigrid optimally.

Mixed-precision solvers are common practice, of course.

What I was thinking is that if you go to a small enough multigrid lattice (i.e. potentially many layers of multigrid), then a low precision solver would probably work. I don’t know of work in that direction exactly though.

I agree. But if the hardware starts to deliver multiple-TFLOPS performance for 8-bit floating point, I’m confident that scientists and engineers will figure out a way to use it.

Another observation (something I’ve stated here before) is that, of all the languages being used now, I think Fortran with its KINDs facility is the best positioned to quickly take advantage of these new floating point formats. It seems like the vast majority of prototyping and exploratory development should be done in Fortran now, not in other languages, just because of that feature and the way that new FP formats can be incorporated seamlessly into the language.

Just a naive question: if lower-precision formats like FP8 can emulate higher-precision ones like FP64 rather well, does that mean FP32 could also emulate FP64 with little performance hit…? (Or is such emulation already used routinely?)

Certainly that first part is true and has been studied. Techniques like iterative refinement and Kahan summation are examples of that approach. The last part, “with little performance hit”, is a separate issue. The reason is that performance depends on the hardware, and hardware is constantly changing. For example, on many modern CPUs Kahan summation in real32 arithmetic is less efficient than accumulation into a real64 value. That happened because the hardware made the real32-to-real64 conversion steps and the subsequent real64 addition step faster. Now consider GPU hardware that supports only real32 floating point but executes 30x faster. There the best approach shifts back to compensated summation. If the real16 and real8 floating point formats end up being useful, they will go through that same kind of process.
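As a rough illustration of the accuracy side of that trade-off (it says nothing about speed), the sketch below compares three ways of summing real32 data: a plain real32 accumulator, Kahan (compensated) summation carried out entirely in real32, and accumulation into a real64 value. The function names and the test data are made up for this example, and real32 arithmetic is emulated by rounding each Python float operation through `struct`, so this is only a model of what the hardware would do.

```python
import struct

def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def naive_sum_f32(values):
    """Plain accumulation; every addition is rounded to binary32."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total

def kahan_sum_f32(values):
    """Kahan summation with all intermediates rounded to binary32."""
    total = 0.0
    comp = 0.0                            # running estimate of the lost low-order bits
    for v in values:
        y = f32(f32(v) - comp)            # apply the correction to the next term
        t = f32(total + y)                # low-order bits of y are lost here
        comp = f32(f32(t - total) - y)    # recover what was lost
        total = t
    return total

def wide_sum(values):
    """binary32 inputs accumulated in binary64, rounded to binary32 at the end."""
    total = 0.0
    for v in values:
        total += f32(v)
    return f32(total)

values = [0.1] * 100_000
# Reference: f32(0.1) * 100000 is exactly representable in binary64.
exact = f32(0.1) * 100_000

for name, fn in [("naive real32", naive_sum_f32),
                 ("Kahan real32", kahan_sum_f32),
                 ("real64 accumulator", wide_sum)]:
    print(name, abs(fn(values) - exact))
```

The compensated version costs several real32 operations per element instead of one, which is why it only pays off when real32 throughput is much higher than real64 throughput. The same structure would apply to a real16 or real8 accumulator, assuming the format has enough range for the data.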