Global Ocean Modeling With GPU Acceleration in Python

Fast, Cheap, and Turbulent—Global Ocean Modeling With GPU Acceleration in Python

by Dion Häfner, Roman Nuterman, and Markus Jochum

Journal of Advances in Modeling Earth Systems
First published: 02 December 2021

Abstract

Even to this date, most earth system models are coded in Fortran, especially those used at the largest compute scales. Our ocean model Veros takes a different approach: it is implemented using the high-level programming language Python. Besides numerous usability advantages, this allows us to leverage modern high-performance frameworks that emerged in tandem with the machine learning boom. By interfacing with the JAX library, Veros is able to run high-performance simulations on both central processing units (CPU) and graphics processing units (GPU) through the same code base, with full support for distributed architectures. On CPU, Veros is able to match the performance of a Fortran reference, both on a single process and on hundreds of CPU cores. On GPU, we find that each device can replace dozens to hundreds of CPU cores, at a fraction of the energy consumption. We demonstrate the viability of using GPUs for earth system modeling by integrating a global 0.1° eddy-resolving setup in single precision, where we achieve 1.3 model years per day on a single compute instance with 16 GPUs, comparable to over 2,000 Fortran processes.

Related blog post

4 Likes

Unfortunately, Fortran is treated here as a synonym for “old-fashioned” and “to-be-replaced” (again).

What is the remedy for this widespread misconception?

The authors did not make general criticisms of Fortran; they asserted that, on the hardware available to them, a Python/JAX solution was faster. Based on this paper, it needs to be easier for Fortran to exploit GPUs.

2 Likes

They use the JAX library (GitHub - google/jax: Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more) to compile Python to very fast native code using XLA. If XLA can optimize array code that well, we could write an XLA backend for LFortran also.

4 Likes

This is the modernized equivalent of their Fortran code in Figure 2,

! Where the surface TKE is negative: store the correction term and clamp TKE to zero
do concurrent(j=js_pe:je_pe, i=is_pe:ie_pe, tke(i,j,nz,taup1) < 0.)
    tke_surf_corr(i,j) = -tke(i,j,nz,taup1) * 0.5 * dzw(nz) / dt_tke
    tke(i,j,nz,taup1) = 0.
end do

This code does not even need extra jit decorators to run on GPUs. Compile it with the NVIDIA compilers to get GPU parallelism. Want CPU parallelism? Compile and run it with Intel ifort instead. Just compare the elegance of this code with the Python equivalent in their figure.
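For reference, here is a rough sketch (my own reconstruction, not the actual figure from the paper) of how such a masked update tends to look in JAX; the array names, the halo width (2:-2), and the time index taup1 are assumptions mirroring the Fortran snippet above:

import jax
import jax.numpy as jnp

@jax.jit
def clamp_surface_tke(tke, dzw, dt_tke, taup1):
    # Illustrative only: where the surface TKE is negative, record a
    # correction term and clamp the TKE value to zero.
    surf = tke[2:-2, 2:-2, -1, taup1]
    mask = surf >= 0.0  # True where TKE is kept, False where it is clamped
    tke_surf_corr = jnp.where(mask, 0.0, -surf * 0.5 * dzw[-1] / dt_tke)
    tke = tke.at[2:-2, 2:-2, -1, taup1].multiply(mask)
    return tke, tke_surf_corr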

The authors further state, “When going from Fortran to Python, explicit loops are replaced with array operations”. Why? Is that not possible in Fortran? The entire idea of vectorization in NumPy is inherited from Fortran 90.

11 Likes

:wave: Hey everybody, author of the article here. Happy to discuss this with you. To address some of your points:

Yes! Unfortunately, hardware-agnostic code doesn’t mix very well with explicit loops, so unless you restrict yourself to vectorized operations, I don’t have much hope for Fortran here (I would love to be wrong, though).
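To make that concrete, here is a toy sketch (my own example, nothing from Veros) of the difference: the element-wise loop ties you to serial execution unless a compiler can recover the array semantics, whereas the whole-array form maps directly onto a few kernels on CPU or GPU.

import jax.numpy as jnp

def diffuse_loop(u, alpha):
    # explicit loop: one tiny operation per element
    out = [u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
           for i in range(1, u.shape[0] - 1)]
    return jnp.stack(out)

def diffuse_vectorized(u, alpha):
    # whole-array formulation: what JAX/XLA can fuse and offload
    return u[1:-1] + alpha * (u[:-2] - 2.0 * u[1:-1] + u[2:])

Both give the same result for a 1-D array u; the second is the form that array frameworks (and Fortran array syntax, for that matter) can treat as a single whole-array operation.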

That would be amazing, but see the caveat above: XLA can only accelerate array code, so you would have to stick to a subset of Fortran (or do some serious compiler magic).

That looks great and very clean! I wouldn’t say that Python code is inherently more readable than Fortran code, and the inverse can often be true. On the other hand, Python gives you all the tools you need to build your own abstractions, so once you have a working + fast implementation you can start iterating towards beauty (this is next up on our roadmap). I discuss this in a different blog post here.

This sounds good on paper but is not (yet) flexible enough for complex applications like ours (see my response to your comment on the article). Also, we’re typically not interested in thread-based parallelism since we run on multiple nodes anyway.

This is out of necessity, not virtue.

7 Likes

This actually feels cleaner to me, particularly the last line, compared to:

tke = tke.at[2:-2,2:-2,-1,taup1].multiply(mask)

in the Python + JAX version. It took me half a minute to understand that, since the mask is an array of zeros and ones, the elements multiplied by 1 will be left “untouched”, while the rest will be set to zero.
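A tiny standalone example of that idiom (not code from the paper), for anyone who also needs the half minute:

import jax.numpy as jnp

x = jnp.array([1.0, -2.0, 3.0, -4.0])
mask = (x >= 0).astype(x.dtype)      # ones where values are kept, zeros elsewhere
print(x * mask)                      # [1. 0. 3. 0.] -- negatives zeroed out
print(jnp.where(x >= 0, x, 0.0))     # same result, arguably closer to a Fortran where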

2 Likes

Welcome @dionhaefner, and thank you for posting here!

3 Likes

The current situation with Python libraries for multi-dimensional arrays, including NumPy, Tensorflow, PyTorch, Dask, JAX, CuPy, MXNet, Xarray and others, reminds me strongly of early Fortran implementations. The inconsistencies between these Python libraries have led to the launch of a Python Array API Standard.

While Python is often applauded for the high productivity of programmers, it seems to suffer from the same amount of “reinventing the wheel” as any other programming language.

I’ve read your blog post and agree with most of the takeaways. One nitpick I have is the term “modern Fortran”, which is actually fairly old and has been in use since Fortran 90, if not earlier (I suppose that’s why you added the quotation marks).

The abstraction “separate numerics from physics” is something I’ve encountered before in the book by Damian Rouson (@rouson), Scientific Software Design: The Object-Oriented Way. Some authors also call this “coordinate-free programming”; there are several papers on the topic as well.

6 Likes

First of all, thanks @dionhaefner for writing here!

Would you have time for a quick video call to discuss this further? I would love to learn from your experience.

What I had in mind is to combine LLVM with XLA, or whatever else is needed to get it running. I don’t see how LFortran is fundamentally different from JAX: the code from your article that uses the @jax.jit decorator could be written equivalently in Fortran, and LFortran could then, in principle, do whatever JAX is doing. However, the devil is in the details, and I would love to hear about your experience and brainstorm this more.
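For anyone following along, here is roughly what such a jitted array function looks like (a generic illustration, not the code from the article):

import jax
import jax.numpy as jnp

@jax.jit  # trace once, compile via XLA, then reuse the compiled executable
def smooth(u):
    # three-point average as a stand-in stencil; equally expressible in Fortran array syntax
    return 0.25 * u[:-2] + 0.5 * u[1:-1] + 0.25 * u[2:]

u = jnp.linspace(0.0, 1.0, 1024)
v = smooth(u)  # first call compiles for the available backend (CPU/GPU/TPU); later calls reuse it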

I completely agree with you that Fortran has a ways to go in order to be as usable as Python, and that is precisely why we are trying to improve the whole Fortran ecosystem. As a language, Fortran seems pretty nice and I don’t see any hard roadblock; it’s “just” that the compilers and the rest of the tooling must improve, and that is exactly what we are working on.

2 Likes

Sure. I couldn’t figure out how to send a private message here, but if you just send me an e-mail at dion.haefner@nbi.ku.dk we can bang something out.

I agree, and I hope we’ll have those community standards for accelerator frameworks soon.

Thanks for the pointers! These will come in handy when we get deeper into this discussion. I also quite like the abstractions used by jax-cfd, but if you take it too far you’re tossing code reuse out the window, so there’s always a trade-off.

2 Likes

Thanks! I sent you an email.

Thanks for the clarifications. I have not been able to read your GitHub response yet. I wish the new Fortran features were also discussed in this article. Now, people who read the paper may (implicitly) get the (incorrect) impression that Fortran fundamentally lacks cross-platform syntax.

I would have mentioned the NVIDIA SDK if I had known about it, but that doesn’t change the core message.

Fortran does lack cross-platform syntax. Running your code through a vendor-specific compiler that can only handle the simplest cases of embarrassingly parallel workloads is not the same as having access to a fully integrated GPU framework.

I don’t think denying the problems of the language / ecosystem does anyone a favor. I’m the first one to admit that Python has many problems that are not present in other languages (including Fortran), but in this case the reverse is true, and the earlier we agree on that the earlier we can change it.

2 Likes

I’m afraid that’s not right. Please be careful to speak the truth when you speak of an entire language ecosystem (that is all I am asking of the authors of this article, too). Fortran does have cross-platform, cross-architecture syntax. Do you know about Fortran Coarrays and the parallelism paradigm they support? Fortran is the pioneer among languages in this regard: Coarrays were already a parallelism paradigm three decades ago, when no other programming language had any notion of parallelism. A separate issue is whether the compilers can use this simple yet powerful syntax of Fortran parallelism to implement cross-architecture parallelism. Some compilers, like Intel ifort, and the OpenCoarrays library convert the Coarray syntax to RMA MPI communication for distributed parallelism; others, like NAG, convert the same Coarray code to shared-memory parallelism; and NVIDIA is working on converting the same code to GPU parallelism.

2 Likes

If this bears fruit I’m happy to change my mind, but until then, the point stands.

That only works if you only care about one type of parallelism. What happens if you want multi-threaded code that also uses the GPU? For an example, look at AlphaZero/Lc0, which combines a multi-threaded MCTS search with GPU evaluation of neural networks. How would you deal with this in Fortran?

I am not familiar with AlphaZero, neural networks, or MCTS, but I just wanted to mention FEATS, which is a Fortran 2018 framework for parallel asynchronous task scheduling. As far as I understand, the concept of Fortran coarray images is supposed to be agnostic to the type of underlying hardware. In principle you could have one coarray team consisting of multi-threaded CPUs and a second coarray team of GPUs. Granted, this is not yet something we see in practice, so from a pragmatic point of view Dion’s point is valid.

I’m hoping that the gfortran implementation of coarrays using OpenMP might be able to leverage both types of compute platforms in the future. Similar to the abstraction “separate numerics from the physics”, Fortran’s goal has always been to “separate numerics from the hardware”.

Following this line of thought, I am a bit critical of the title of the paper, as it implies that Python and GPU acceleration are somehow needed if you want to do state-of-the-art ocean modelling, even though the equations solved in this paper are practically the same as in the rest of the models. I think there is a crucial distinction between an ocean model (the set of equations, boundary conditions, and parameters) and an ocean model implementation such as Veros. Conflating the two can be a source of misunderstanding. In a few decades the hardware will change again, and we’ll all have the opportunity to write a bunch of new papers, “Modeling X with Y Hardware in Z Language”.

3 Likes

In publish or perish academia, that sounds like a feature, not a bug.

@certik wrote:

    They use the JAX library (GitHub - google/jax: Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more) to compile Python to very fast native code using XLA. If XLA can optimize array code that well, we could write an XLA backend for LFortran also.

From XLA website:

    XLA provides an alternative mode of running models: it compiles the TensorFlow graph into a sequence of computation kernels generated specifically for the given model. Because these kernels are unique to the model, they can exploit model-specific information for optimization.

Correct me if I’m wrong, but this seemed like a paper describing a compiler optimization for array algorithms. The only thing keeping Fortran from implementing this is resources.
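To illustrate what that looks like on the JAX side (a toy example of my own, not tied to the paper), jax.make_jaxpr prints the array-level intermediate representation that JAX builds before lowering it to XLA; the point is that it contains nothing but array operations, with no Python left in it.

import jax
import jax.numpy as jnp

def energy(u):
    # toy array computation: a reduction over elementwise operations
    return jnp.sum(0.5 * u * u)

u = jnp.ones(8)
print(jax.make_jaxpr(energy)(u))  # the traced IR that XLA then compiles into fused kernels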

I think that is correct. I would like LFortran to be able to use XLA as an option.