Scientific Fortran Lectures

Hej,

I stumbled on the fortran-lang/benchmarks repo earlier this week which seems to take the dust. There is a long and fascinating discussion here about whether we should actually benchmark Fortran against other languages, what benchmarking even means in this context, etc. While I don’t want to re-open the heated discussion, I still believe that this could be an awesome pedagogical resources. To some extent, it could serve the same purpose as the Scientific Python Lectures: carefully crafted tutorials for standard scientific computing in Fortran (with and without stdlib) where the influence of some of the key constructs (e.g. nested do loops, do concurrent, array syntax, etc) on the actual implementation of a given algorithm as well its computational performances could be discussed along with some guidelines about options for standard compilers (think gfortran and ifort). Given this point of view, instead of renaming the repo to idioms as suggested here, we could actually rename it Scientific Fortran Lectures or something along these lines. This wouldn’t prevent us from including implementations in other languages (e.g. c, Julia, Python, etc) for the sake of comparison without putting too much emphasis on the actual languages performances.

I’m currently teaching some numerical analysis class for undergrads. As many of my colleagues, at the moment we’re using numpy because Python. A Scientific Fortran Lectures repo could however become the go-to online resources for educators and serve as a really good advocate for putting Fortran back in the limelight and in the classroom. This is even more true now that a lot of us are using conda as package manager which makes installing compilers really easy on pretty much any system. Likewise, it has never been easier to create Fortran projects now that fpm has reached a certain level of maturity and similarly with the on-going initiative by @FedericoPerini and others regarding the linear algebra module in stdlib. Finally, the work by @certik and colleagues on LFortran and the associated jupyter kernel wouldn’t lead to a dramatic change in the workflow of students who, for most of them, are already familiar with jupyter notebooks.

With this in mind, I’ve started to port some of my lecture material to Fortran out of curiosity, beginning with the Hello World! of partial differential equations: solving the 2D Poisson equation on a unit-square discretized with 2nd order accurate central finite differences. I’ve started with the simplest iterative solvers, i.e. the Jacobi and Gauss-Seidel iterations, and I do have a few technical questions about compiler stuff. The examples below are adapted from what is currently available in the fortran-lang/benchmarks repo.

Jacobi solver

Here is the simplest (yet reasonably high-performing) fortran implementation of the Jacobi method to solve the 2D Poisson equation.

    ! Iterative solver.
    do while (residual > tolerance)
        ! Update iteration counter.
        iteration = iteration + 1
        ! Reset residual.
        residual = 0.0_dp
        ! Jacobi iteration.
        do j = 2, m-1
            do i = 2, m-1
                ! Jacobi update.
                phi_prime(i,j) = (phi(i+1,j)+phi(i-1,j)+phi(i,j+1)+phi(i,j-1))/4 + rhs(i, j)
                ! On-the-fly residual computation.
                residual = max(residual, abs(phi_prime(i,j) - phi(i,j)))
            end do
        end do
        ! Update solution
        phi = phi_prime
    end do

Below are some timings obtained by compiling the code with gfortran 13.2.0 and various optimization level.

  • gfortran jacobi.f90 -O0 -o program && ./program – 197 seconds
  • gfortran jacobi.f90 -O1 -o program && ./program – 73 seconds
  • gfortran jacobi.f90 -O2 -o program && ./program – 24 seconds
  • gfortran jacobi.f90 -O3 -o program && ./program – 31 seconds
  • gfortran jacobi.f90 -Ofast -o program && ./program – 31 seconds

In all cases, the Jacobi solver run for 376 000 iterations (+/- a few dozens depending on the optimization level). Interestingly, it seems like -02 leads to significantly better performances than higher optimization levels.

Gauss-Seidel solver

Given the Jacobi solver, it requires only a very limited number of modifications to transform it into a Gauss-Seidel one : phi_prime no longer is an array but a simple real(dp) to store the current value phi(i, j) before we directly update it while still being able to compute the residual on the fly.

    ! Iterative solver.
    do while (residual > tolerance)
        ! Update iteration counter.
        iteration = iteration + 1
        ! Reset residual.
        residual = 0.0_dp
        ! Gauss-Seidel iteration.
        do j = 2, m-1
            do i = 2, m-1
                ! Store old value for residual computation.
                phi_prime = phi(i, j)
                ! Gauss-Seidel update.
                phi(i,j) = (phi(i+1,j)+phi(i-1,j)+phi(i,j+1)+phi(i,j-1))/4 + rhs(i, j)
                ! On-the-fly residual computation.
                residual = max(residual, abs(phi_prime - phi(i,j)))
            end do
        end do
    end do

Below are some timings obtained by compiling the code with gfortran 13.2.0 and various optimization level.

  • gfortran gauss_seidel.f90 -O0 -o program && ./program – 134 seconds
  • gfortran gauss_seidel.f90 -O1 -o program && ./program – 91 seconds
  • gfortran gauss_seidel.f90 -O2 -o program && ./program – 74 seconds
  • gfortran gauss_seidel.f90 -O3 -o program && ./program – 75 seconds
  • gfortran gauss_seidel.f90 -Ofast -o program && ./program – 62 seconds

In all cases, the Gauss-Seidel solver runs for 197 000 iterations (+/- a few dozens depending on the optimization level). As expected from the math, the Gauss-Seidel solver needs roughly half the number of iterations to converge compared to Jacobi.

The plot twist…

It is fairly easy to show mathematically that the Gauss-Seidel solver should takes roughly half the number of steps needed by Jacobi for convergence. This is indeed the case here (iteration = 376000 for Jacobi and iteration = 197000 for Gauss-Seidel). For this particular problem, proving it is actually a classical exercise in an undergrad numerical analysis class.

When it comes to the numerical implementation, math is not however all there is. Given the Jacobi update rule

phi_prime(i, j) = (phi(i+1,j)+phi(i-1,j)+phi(i,j+1)+phi(i,j-1))/4 + rhs(i, j)

and the Gauss-Seidel one

phi(i, j) = (phi(i+1,j)+phi(i-1,j)+phi(i,j+1)+phi(i,j-1))/4 + rhs(i, j)

the number of operations per update is basically the same. Naively, I would have expected that if Gauss-Seidel needs half as many iterations to converge than Jacobi, and each iteration has roughly the same computational cost, then Gauss-Seidel would actually take half the computational time of Jacobi as well. While this is somewhat the case for -00 (albeit we don’t quite have a ratio of 0.5 for the time), it certainly is not when a higher optimization level is considered. In practice, while Jacobi performs twice as many iterations as Gauss-Seidel to converge, it can actually be more than twice as fast!

  • gfortran jacobi.f90 -O2 -o program && ./program – 24 seconds
  • gfortran gauss_seidel.f90 -Ofast -o program && ./program – 62 seconds

I’m pretty sure this has been observed many times over the years but I can hardly find this observation mentioned in the literature. I suspect moreover that the reason for this is some compiler witchcraft related to the independence of the iterations in the Jacobi solver combined with automatic loop unrolling and/or cache-blocking.

Note that the Gauss-Seidel implementation here is the simplest one there is. The 2nd order central finite-difference approximation of the Laplacian leads to a consistently ordered matrix representation for which a red/black ordering of the nodes leads to two subproblems with loop indepedence (i.e. Jacobi-like). I haven’t had time to code it up yet, but I suspect it’ll be significantly faster than this vanilla Gauss-Seidel implementation because of the same exact reasons that make the Jacobi implementation faster than the vanilla Gauss-Seidel one.

While being more than familiar with the mathematical aspects of things (I actually teach this), I’m not really familiar with these compiler technical details. Any hindsight is more than appreciated!

14 Likes

Gauss-Seidel looks similar to Jacobi at first glance in terms of stencil structure and
data transfer volumes. A difference apparent at once is that SIMD vectorization is
ruled out because of the recursive structure on the central line. The in-place update
prevents optimal pipelining and the use of non-temporal stores as well. Thus GaussSeidel performance is inferior to Jacobi despite comparable data transfer volumes and less computations (Fig. 3 and 4). Additionally there is no substantial drop between in-cache and memory performance. This is especially remarkable on the Core 2 processor and clearly indicates that pipelining problems limit the achievable performance. To overcome these problems, our optimized kernel interleaves two updates in order to break up register dependencies and partially hide the recursion.

For more details see: Treibig, J., Wellein, G., & Hager, G. (2011). Efficient multicore-aware parallelization strategies for iterative stencil computations. Journal of Computational Science , 2 (2), 130-137.

Note that with gfortran and -O2 only, you are compiling for the generic x86-64 instruction set without SIMD extensions (maybe recent versions assume at least SSE2 is available).

8 Likes

Thanks a lot! It confirms what I had intuited. Really interesting. I’m always surprised by the way by how precise, varied and relevant all your citations are. You either have a fantastic memory or an incredible system to manage all your references I’d like to know more about :slight_smile:

6 Likes

For anyone interested, I’ve just created a repo here illustrating what a Scientific Fortran Lecture for computational fluid dynamics could look like. It is mostly empty at the moment and includes only the README.md with a tentative plan of what the lesson could be and of the material that can be covered.

It is admittedly pretty ambitious but, if we can pull it off, it could be an excellent poster child of how easy it is now to use Fortran in a standard undergraduate-level course in scientific computing. Compared with Python, it comes not only with better off-the-shelf computational performances (which are often good enough for most students’ projects) but also offer the possibility for those more inclined toward technicalities to discuss the interplay between a mathematical algorithm and its actual implementation alongside a high-level discussion of the optimizations enabled by current compilers.

Everything is obviously open to discussion and anyone whose willing to participate (even just discussing the overall idea and not necessarily implementing anything) is more than welcome!

9 Likes

Good luck with your lectures. An analogous project for quantum chemistry is

(This morning I wrote a very long reply, then the browser crashed, I have lost it and I’m still mourning. So I’ll try to summarize.)

Your idea is great, and I really think that the fortran community would shine if it was capable of providing learning material for Fortran, for algorithm design and computational physics/math at once. To give you an idea why this is important, I invite you to search online how to perform a forward-backward substitution with a sparse matrix stored in CSR or whatsoever: you’ll find plenty of libraries in all languages, but not so many descriptions on how the algorithm works if you want to write it from scratch starting from a reference. This might be regarded as just a stupid example, but I think it is indicative of the current tendency of scientific computing for which there is a frontend user easily calling routines and a backend code which is extremely complicated and unaccessible to most users.

So I started doing my homework and found that you can use “mini books”, and I think that would be a good format for your Scientific Fortran Lecture for computational fluid dynamics and I will try writing something such as Scientific Fortran Lecture for sparse linear algebra to see if it works nicely. What I don’t like about it is that you cannot write easily LaTeX pseudocodes in markdown, which is quite annoying. However, behind this there should also be a page (that in my mind should be similar to a rosetta code or wikipedia/algopedia page) to reference as a go to for all the details in one place, as long as pages where all the Fortran intrinsics (but I prefer the one-page-per-intrisic format) and features are well explained. An environment with all this would be the best possible way to teach.

PS: I have a HelloWorld-Poisson solver with Finite Elements (PCG with IC0) which is fairly well documented even if it has a naïve implementation, if you want to give it a look for your lectures.