Global Ocean Modeling With GPU Acceleration in Python

The details of the algorithm don’t really matter, but the TL;DR is that it has two parts. The first is a multi-threaded tree traversal that finds positions. The second is the evaluation of an expensive, vectorizable function (the neural net) on a batch of data (the positions). The results are then used to update the tree, and the process repeats. Can this type of algorithm currently be implemented effectively in Fortran?
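For concreteness, here is a minimal, self-contained sketch of that two-phase pattern (Python is used purely for illustration; all names and the toy “tree” structure are hypothetical stand-ins, not code from the original poster’s project):

import numpy as np

def traverse(tree, batch_size):
    """Phase 1: the (in reality multi-threaded, branchy) tree traversal.
    Here it simply picks a batch of leaf positions to evaluate."""
    return np.array(tree["frontier"][:batch_size])

def neural_net(positions):
    """Phase 2: the expensive, vectorizable function (stand-in for the net),
    evaluated on the whole batch at once."""
    return np.tanh(positions).sum(axis=1)

def update(tree, positions, values):
    """Feed the batched results back into the tree, then repeat."""
    tree["values"].extend(values.tolist())

tree = {"frontier": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], "values": []}
positions = traverse(tree, batch_size=2)
values = neural_net(positions)
update(tree, positions, values)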

One thing that I think the Fortran coarray approach gets wrong is that, while the numerics are not the hardware, the hardware isn’t just an implementation detail either. The hardware you have changes which algorithms are most efficient, so a system that gives the user no way to tell the program what hardware to use will be inherently inefficient.

Maybe I’m missing some nuance here, but it seems to me that the “system” here is a compiler or library, not the language. Python the language doesn’t let you specify hardware, or even how to run in parallel beyond shared memory.

This is the fundamental problem of computer science. Every language has this problem. Some languages restrict you from using hardware-specific instructions in their aim to ensure portability. For those that don’t, once you use hardware-specific instructions your software is no longer portable to systems that lack that hardware.

That’s always the tradeoff. If you want to take advantage of hardware that the compiler/language you’re using can’t utilize automatically, then you have to write non-portable code. Fortran has taken the approach that the language should define constructs that are optimizable on most hardware, that it’s up to the compilers to take advantage of them, and that the language shouldn’t add anything hardware-specific. There are extensions, but they aren’t standards-conforming.

@dionhaefner Nice paper, it’s well written. I skimmed through it when it first came out, and have now read it in more detail.

I actually didn’t find any concrete or strong suggestion in the paper that Fortran syntax is inadequate for GPU computing (and it certainly is not, IMO). The benchmarks don’t seem unfavorable to Fortran+MPI. JAX seems like a powerful tool. The “Fast, cheap, turbulent” part of the title reads a bit clickbaity; I’m surprised that it was accepted in that form in JAMES. It could be an opportunity for a Veros v2 paper to be titled “Faster, cheaper, and just as turbulent”. :slight_smile:

Today, GPUs are the industry standard devices to train artificial neural networks. This trend has also impacted the design of modern compute facilities; for example, out of the 8 upcoming supercomputers in the EuroHPC Joint Undertaking, 7 are going to provide GPU resources, typically making up around 10% of the total compute power (see EuroHPC, 2021). These resources would be unusable with traditional Fortran models without considerable additional effort, such as a complete re-implementation using CUDA Fortran or by using a framework like OpenACC (Wienke et al., 2012), which requires compiler directives for every loop (see also Norman et al., 2015).

I get your point, but I’m skeptical that writing an ocean model from scratch in Python plus a framework is a smaller effort than GPU-ifying (whether through OpenACC, OpenMP, CUDA, or fine-tuning for nvfortran) an existing ocean model. It seems like a tradeoff: what you get for free with JAX is native code generation for various architectures; what you get for free with an existing ocean model is decades’ worth of battle-testing of the dynamical core numerics and subgrid-scale algorithms that you don’t otherwise have.

While the flexibility and rich library ecosystem of Python is a strong asset, there are also some notable obstacles when choosing Python over Fortran. Decades of real-world usage and the relative simplicity of the Fortran language have led to an established community standard of model development. As a consequence, most Fortran models read similarly to each other. This is currently not the case in Python development, where the chosen abstraction and library stack have a huge influence on the structure of the model code. This calls for a collective effort to formalize a common interface for the development of high-performance models in Python. We are confident that this can and will happen should this approach gain the required momentum.

I think this is a fair assessment. I look forward to seeing the developments on both fronts: improved tooling and multi-architecture frameworks and compilers in Fortran, as well as more unified and stable APIs in the Python ecosystem. I think diversity is good in general, and that holds here too. There should be all kinds of scientific numerical models implemented in various languages and frameworks; only then can we really understand the pros and cons of the different approaches. And a science Ph.D. is difficult enough on its own: a student should be able to program in the language they most enjoy, and I can appreciate and understand that many people do enjoy Python.

Anyhow, great work, and congrats @dionhaefner, I look forward to reading more.


That’s always the tradeoff. If you want to take advantage of hardware that the compiler/language you’re using can’t utilize automatically, then you have to write non-portable code

This is false. Julia (and to some extent Python) lets you write algorithms that are hardware agnostic and then call them in ways that explicitly choose the hardware to run on.

For a simple example of this, I can write

function square(x)
    return x .^ 2
end

in Julia. This is a generic function. If I call it as square(rand(1000)), it will use the CPU to perform the operation. If I call it with square(CuArray(rand(1000))) it will use the GPU to perform the operation.

Just because Fortran doesn’t let you write generic code that works on different types of hardware doesn’t mean it’s impossible.

If you gain another order of magnitude in energy efficiency by doing so, would that be a bad thing? :slight_smile:

Yes, it’s not a matter of syntax; that was the wrong term. Python certainly doesn’t have syntax for GPUs. The missing ingredient is first-class GPU support, be it through a third-party library or the compiler. No doubt Fortran is still the king on CPU (albeit by an increasingly slim margin), and we don’t try to hide this in the paper.
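For what it’s worth, here is a minimal sketch (my own, not code from the paper) of what that third-party route looks like in JAX: the array code itself contains nothing device-specific, and placement is decided at the call site or by what hardware JAX finds.

import jax
import jax.numpy as jnp

@jax.jit
def square(x):
    # Nothing device-specific here; XLA compiles this for whichever
    # backend the input arrays live on (CPU, GPU, or TPU).
    return x ** 2

x = jnp.arange(1000.0)
print(square(x))  # runs on an accelerator if JAX can see one, else on the CPU

# Placement can also be made explicit (assuming a GPU is actually present):
# x_gpu = jax.device_put(x, jax.devices("gpu")[0])
# square(x_gpu)

This mirrors the Julia square example above: the generic function stays the same, and the array’s device carries the hardware choice.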

Well, we’d probably like to increase the resolution even more, so we’d settle on “2 Fast 2 Turbulent”. Or do you want clickbait? “We wrote this Python ocean model. What happened next is unbelievable!”

I’m skeptical, too. On the other hand, if you do a 1:1 translation like ours, you can test automatically against the Fortran reference and verify that the results are identical (like we do). In that case the battle-testedness should carry over, but it’s still a huge effort. All I know is that a manual rewrite in CUDA would be insane, so we need something better. In the end all we can contribute is one data point, so let’s wait and see what people make of it (if anything).
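As an aside, a regression test of that kind can stay very simple. A hedged sketch (the file names, variable names, and tolerances below are made up for illustration and are not the actual Veros test suite):

import numpy as np

# Hypothetical snapshots: one written by the Fortran reference model and one
# by the Python port after running the same setup for the same number of steps.
reference = np.load("fortran_reference_snapshot.npz")
candidate = np.load("python_port_snapshot.npz")

for name in ("u", "v", "temp", "salt"):  # illustrative variable names
    np.testing.assert_allclose(
        candidate[name], reference[name],
        rtol=1e-7, atol=1e-10,
        err_msg=f"Field '{name}' diverged from the Fortran reference",
    )
print("Python port matches the Fortran reference within tolerance.")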

Thanks for your review & kind words! Glad you liked it.


Python has come up with many great tools to compete with Julia. Numba and JAX are huge game-changers and make Julia’s edge mostly disappear. I remember many people in my field used to be really hyped about Julia; as Numba has matured, that hype has died down. See, e.g., this blog:

Unlike Julia, Fortran is not competing with Python. Fortran will continue to integrate with Python by supplying numerics packages.


Numba and JAX are definitely great when they apply, but they aren’t full solutions.

Numba has a partly undocumented set of differences in semantics from normal Python code (see the “Deviations from Python Semantics” page in the Numba documentation), and it prevents you from using almost all of the dynamic features that make Python friendly.
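One documented example of such a deviation is that, in nopython mode, Numba freezes global variables as compile-time constants, so rebinding a global later is silently ignored. A small toy illustration of my own:

import numba

FACTOR = 2

@numba.njit
def scale(x):
    # FACTOR is baked into the compiled code on the first call.
    return FACTOR * x

print(scale(10))  # 20
FACTOR = 3        # rebinding the global has no effect on the compiled function
print(scale(10))  # still 20 under Numba; plain Python would now give 30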

JAX only works on quasi-static problems, which isn’t a major problem for neural nets, but is for algorithms that require iterating to a tolerance (or anything else with inherently non-static structure). Also, JAX requires a lot of C++ boilerplate under the hood, which is probably fine if you’re Google, but is a major disincentive if you, as an individual, want to implement JAX-like passes for Python.
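To illustrate the “iterating to tolerance” point: under jit, a plain Python while loop over traced values doesn’t work, and data-dependent iteration has to be expressed through structured primitives such as jax.lax.while_loop. A toy fixed-point example (my own, not from the paper):

import jax
import jax.numpy as jnp
from jax import lax

@jax.jit
def newton_sqrt(a, tol=1e-6):
    # A plain `while err > tol:` over traced values would fail under jit;
    # the loop has to go through lax.while_loop instead.
    def cond(state):
        x, err = state
        return err > tol

    def body(state):
        x, _ = state
        x_new = 0.5 * (x + a / x)
        return x_new, jnp.abs(x_new - x)

    x, _ = lax.while_loop(cond, body, (a, jnp.array(jnp.inf)))
    return x

print(newton_sqrt(jnp.float32(2.0)))  # approximately 1.4142135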


I think we are all in agreement now: Fortran has cross-platform syntax (in the sense that there is nothing intrinsic in the language that prevents it from running on a GPU), but it does not have cross-platform tooling to take advantage of this. (Yes, for NVIDIA GPUs you can use CUDA Fortran and their compilers, and their compilers can sometimes also parallelize do concurrent and other intrinsic Fortran features, but that is not a cross-platform solution, at least not yet.)


These differences seem really minor to me, and you would almost always prefer the Numba behavior over the Python behavior anyway. It also seems, at least in principle, that the Numba compiler could warn about them properly, to ensure robustness and avoid surprises.

I agree with your other points.

You just moved where the hardware-specific stuff is defined; that statement is still hardware-specific. I’ll admit it’s a pretty neat feature, and a technique Julia can use to make generic code hardware-specific. I’d be curious whether a similar kind of technique could be developed in Fortran.


I don’t know whether it is technically possible, but something like offload(device='device_type') could be added to do concurrent(), where device_type can be cpu, gpu, tpu, asic, fpga, vect_eng (vector engine), etc. My line of thinking is that even laptops can have two GPUs, and then there are desktops and supercomputers. Taking the example of a laptop with two GPUs, I would like to utilise both of them in some way (if that is possible at all :sweat_smile:). So maybe something like:

do concurrent ( ... ) offload(device='igpu')
  ! do something
end do

!
! other parts go here
!

! the computationally expensive part goes to the discrete GPU
! present in the laptop
do concurrent ( ... ) offload(device='dgpu')
  ! do something
end do

What if we have a module with some procedures defined in it, like the one below, and then use the above construct to offload particular parts of the program?

module calc
  !
  ! lots of stuff here
  !
contains

  subroutine calculate( ... )
    !
    ! data definitions
    !
    do concurrent ( ... )
      ! calculation
    end do
  end subroutine calculate

  ! other stuff
end module calc

and then use it like this:

program test
  use calc
  implicit none

  offload(device='gpu') call calculate( ... )
  ! or maybe
  ! offload(device='tpu') call calculate( ... )
end program test

If the offloading fails, it would automatically fall back to the CPU instead.
Note: I just read about the problem with do concurrent here. Would it be possible to also add the remedies prescribed there?

Moving where the hardware-specific stuff is defined is actually really important. Fortran does a good job of separating the math from the code, but a bad job of separating the code from the hardware. The separation Julia makes here means that libraries can write generic code and let the end user choose to run it on whatever hardware they have. It also means that future hardware will work with current models, as long as it can implement the same interface.

I think Fortran does a good job too, but there it is the job of the compiler to ensure the code runs on any future hardware. In Julia, it seems one can support new hardware through Julia libraries, and it will work with existing Julia numerical code.


The key difference here is that if you need a different compiler to support new hardware, using a mix of hardware gets really complicated. Do you have to use multiple compilers to compile a project that uses both GPUs and multi-threading? That seems like a really annoying build process to configure.

How about using compiler flags instead of different compilers?

Compiler flags are still global. If you want different parts of your program to use different types of acceleration, that would still require compiling different parts of your program separately (and the associated build process mess).

Interesting consideration. At least in theory, I think it should be possible to put the parts to be accelerated by different hardware components in different modules and compile them separately. To relieve the pain of building such a Frankenstein project correctly and automatically, a tool like fpm (with suitably enhanced capabilities) could help.

I would much prefer to go down this route than to bloat and pollute the language with hardware-specific features. I am grateful that the standards committee is moving in the opposite direction, abstracting away from the hardware (e.g., do concurrent).


Right. The Fortran approach has been to put parallel (hardware-independent) features in the language: across-node parallelism (coarrays), on-node parallelism (do concurrent), and I believe we should also add task-based parallelism. The way you choose how to run, say, a do concurrent loop (on a CPU or a GPU) is via compiler flags, possibly compiler pragmas, or some external config file, but the Fortran code itself remains hardware-independent.


Fortran doesn’t have (or at least the standard doesn’t define) any way to specify hardware-specific things, so Fortran perfectly separates the code from the hardware. Clearly there is some desire to be able to specify some hardware-specific things; the question then is how (but first whether) such specifications should be defined by the language standard. Fortran has tried hard to remain a completely hardware-agnostic language, so the goal would be to design any way of defining hardware-specific things in a completely hardware-agnostic way. That sounds quite challenging, but perhaps there is a way. I’m hopeful the generics facility currently being designed will give us some avenues to explore.