Using reserved words as variables

F predated g95, contrary to what Wikipedia says (now fixed). A 1998 article by Michael Metcalf gives an overview of F and says

Their F compilers have been built using technology from NAG, Fujitsu, Salford Software and Absoft, and are available on a wide variety of unix, PC and Mac platforms … A version for Linux is even free!

1 Like

@jacobwilliams, @certik, et al., can you comment on the specific challenges of getting other communities such as SciPy to switch quickly to modernized versions of codebases such as slsqp by @jacobwilliams? Is it mainly time and resources, i.e. the code changes, testing, validation, and acceptance? Or are there fundamental issues with the APIs, the modern Fortran language, or the performance of the modernized versions that might hinder adoption?

2 Likes

I often say that one of the major strengths of Fortran is its compatibility with the vast amount of existing code out there, going back to the 1960s. Yes, the Fortran standard has deleted features, but compilers continue to support them (and this is why I am not in favor of deleting features.)

Quite a few of the newer languages make incompatible changes from version to version, often requiring rewriting existing code. Fortran tries very hard to not do this.

2 Likes

There are obvious upsides to the approach taken by Fortran, at least from a business point of view. “We can pay to have this code written once and it’ll just keep working? Great!” Except, that’s not really how it works for any project that is under continued (often perpetual) development.

I would be surprised to find any code written before 2010 that has survived bug-free since then, and is still able to extract maximum performance on the platforms of today (giving the benefit of the doubt that it did so when originally written).

Performant code requires regular maintenance as the environment around it changes, even if the algorithm to be executed remains the same. If we do not care about performance, then why bother with a statically typed, compiled language? Developer velocity is much higher in dynamically typed languages; just look at the spread of Python and JavaScript into every nook and cranny of everything on earth.

2 Likes

You keep repeating this but it doesn’t work like that. You’re posting in a public topic so you’re addressing everyone participating. If you want to address only a specific user please DM them.

All, going forward please stay strictly on topic and technical discussions only.

3 Likes

Actually, the original FFTPACK from 1983 still runs at a very high speed on any modern hardware with a good compiler, such as Intel. There are more such examples.

I think we want all of the following:

  • performance
  • backwards compatibility (not breaking things)
  • modern tooling and features
  • use modern Fortran in libraries like SciPy
4 Likes

Do you have a link to this? I’d love to take a look and see. I see this FFTPACK, but netlib also hosts LAPACK — Linear Algebra PACKage, which is not exactly what gets used in practice… (https://www.openblas.net/ , MKL, AMD’s version, surely Nvidia ships something, etc.) Looking at it, I think rfftf1.f is the main thing… But I’m unsure. Sadly this is exactly the kind of code people are talking about struggling with…

      SUBROUTINE RFFTF1 (N,C,CH,WA,IFAC)
      DIMENSION       CH(1)      ,C(1)       ,WA(1)      ,IFAC(1)
      NF = IFAC(2)
...

I guess it’s all implicit typing. Strangely, ifac is dimension(1), yet the first line to execute says nf = ifac(2). This interests me now, lol. I will look deeper into it and try to reason about what’s going on.
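(For reference: the DIMENSION IFAC(1) spelling is a pre-FORTRAN-77 idiom. The dummy argument only carries the base address of the actual argument, so indexing past the declared extent of 1 works in practice as long as bounds checking is off. A minimal sketch of how the same interface would be spelled today, using assumed-size arrays; the subroutine name is mine, not FFTPACK’s:)

subroutine rfftf1_modern(n, c, ch, wa, ifac)
    implicit none
    integer, intent(in) :: n
    real :: c(*), ch(*), wa(*)   ! assumed-size: the F77 replacement for X(1)
    integer :: ifac(*)
    integer :: nf
    nf = ifac(2)   ! now plainly legal: the extent comes from the caller
    ! ...
end subroutine rfftf1_modern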

The SciPy maintainer group has a real struggle on their hands: trying to provide a library to many platforms using code that mixes multiple (3? maybe more?) languages, all needing their own build tools… That’s a real nightmare, and I understand wanting to simplify as much as possible.

Fortran tooling absolutely needs to make integration easier. Discussions about fpm emitting CMakeLists.txt or whatever files Meson wants are on the right track; that would make fpm-ized Fortran libraries much easier for external consumers to build.

I agree on performance. Maximum performance is kind of the bare minimum when a language wants to give up so much to be so focused on numerics.

1 Like

FFTW and MKL are both faster, but last time I tried it about 8 years ago, FFTPACK sometimes ran even faster than the default installed FFTW on a cluster. It turns out the FFTW was not properly compiled (they forgot to enable vectorization). I also rewrote the routines to modern Fortran here:

https://github.com/certik/hfsolver/blob/b4c50c1979fb7e468b1852b144ba756f5a51788d/src/fourier.f90#L167

and there is also a modern version that we maintain here:

The original code is not bad – it might use implicit typing and the “C(1)” way of declaring arrays, but it does not use common blocks or any other such features. It’s very cleanly written; each function is pure.

In a way I guess that was your point — you have to maintain the code anyway. But we only did the modern fftpack version about a year ago. It was very nice that Fortran compilers could compile the original code without problems and maintain high performance.

2 Likes

Why we switched from fftpack to pocketfft

1 Like

I have some codes that I wrote 20 years ago or more, and performance-wise, for the most part, I would rewrite them the same way today.

The remaining tiny part is because I am (hopefully) more skilled as a developer (although it’s not my primary job) than I was 20 years ago, and maybe there are some parts that I would write differently. But it has nothing to do with language features or whatever, a more skilled developer at that time would have written a code that would be still a very good one today.

Computers are much more powerful than they used to be, but the principles have not changed that much, and performance-wise a good code from 40 years ago is generally still a good code today: compilers are there to exploit hardware improvements (such as vector instructions, etc.).

There are some exceptions to that: codes written for the vector machines of the 80s-90s era may not perform well on scalar machines (and vice versa). It was possible to write codes in a way that performed well on both architectures, but sometimes it was better to have two versions. And of course, GPUs often require rewriting codes from scratch (in which case even regular maintenance would not have helped).

Netlib’s LAPACK is a so-called “reference implementation”. As such, it does not pretend (and has never pretended) to give the highest possible performance on every possible machine. MKL has some critical parts written in assembly, and nothing can beat that.

In fact Lapack is another great example — in the past I often just used reference Lapack (from GitHub - Reference-LAPACK/lapack: LAPACK development repository) compiled with all optimizations on, and the performance was “good enough”. Slower than MKL or OpenBLAS, but easy to install and often even within a factor of 2x. I believe FFTPACK was also within a factor of 2x sometimes.

This brings me to another topic: I think it should be possible for the compiler to match OpenBLAS and MKL, at least on single core. Julia’s LoopVectorization package (Matrix Multiplication · LoopVectorization.jl) does exactly that. You write your matrix operations like you would do in high school and you get maximum performance.

So it seems to me it should be possible to write such Fortran compilers so that one can maintain Lapack in pure Fortran and it will run as fast as OpenBLAS (which is implemented in assembly at the lowest level).

I will be looking into this later.

1 Like

That may be true in isolated examples, but in general there are many useful features in fortran that were not available prior to f2003. Those important features include such simple things as allocatable arrays that are dummy arguments, allocatable scalars, allocatable components of a derived type, and all of the C interop features. Those are all language features that would affect how code is written and also the efficiency of the resulting code. And if you were concerned about writing portable code that ran on multiple compilers using those features, then you still had to wait until around 2005 or 2006 for compiler support before you could freely use them. So making this kind of statement about 20+ year old code is less credible than making it about 10 year old code.
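To make that concrete, here is a minimal sketch (mine, not from the post) of two of those f2003 features, allocatable dummy arguments and allocatable derived-type components:

module alloc_demo
    implicit none
    type :: workspace
        real, allocatable :: buffer(:)   ! allocatable component (f2003 / TR 15581)
    end type workspace
contains
    subroutine resize(a, n)
        ! allocatable dummy argument (f2003): the callee may (re)allocate it
        real, allocatable, intent(inout) :: a(:)
        integer, intent(in) :: n
        if (allocated(a)) deallocate(a)
        allocate(a(n))
    end subroutine resize
end module alloc_demo

Unlike pointer dummies, allocatables cannot alias other objects, which matters for optimization.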

I wrote a lot of code in the 1980s, and while this was often the case, there were also many cases where, after a rewrite for a vector machine (a Cray, a Cyber 205, an ETA-10, a Convex, an FPS processor, etc.), that code would then also perform better on a scalar machine than it did originally. In hindsight, one could usually explain why it performed better, but in some cases it was simply a mystery.

However, there were, and still are, situations where machine-dependent code is necessary for optimal performance. For just one example, consider a matrix-matrix product written explicitly with do loops, as sketched below. On a Cray (and several other vector machines), the optimal loop order has a SAXPY-type innermost do loop, while on most scalar and even RISC machines, the optimal loop order has an SDOT-type innermost loop. A sophisticated compiler might recognize that loop structure and rearrange it if necessary, but that was not really typical. That is why programmers of that era relied so much on conditional compilation and preprocessors, and why it was so frustrating that the standard committee failed to incorporate that capability into fortran all through the 1980s and also failed even in the f90 revision. Even today, as you mentioned, with the various types and generations of vector hardware and GPUs available, there is still the need to write machine-dependent code.
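For readers unfamiliar with the terminology, a sketch of the two loop orders (illustrative code, not from the post):

program loop_orders
    implicit none
    integer, parameter :: n = 256
    real :: a(n,n), b(n,n), c(n,n), s
    integer :: i, j, k

    call random_number(a)
    call random_number(b)

    ! SAXPY-form innermost loop: unit-stride updates of a column of c,
    ! the ordering that vectorized well on Cray-class machines.
    c = 0.0
    do j = 1, n
        do k = 1, n
            do i = 1, n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end do
        end do
    end do

    ! SDOT-form innermost loop: a dot product accumulated in a scalar,
    ! which tends to suit scalar and RISC machines.
    do j = 1, n
        do i = 1, n
            s = 0.0
            do k = 1, n
                s = s + a(i,k) * b(k,j)
            end do
            c(i,j) = s
        end do
    end do

    print *, c(1,1), c(n,n)   ! keep the results live
end program loop_orders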

LAPACK is designed so that its efficiency derives from the underlying BLAS operations, not so much from the optimization of the high-level code itself. For example, if you compile the fortran reference LAPACK codes and then link to an efficient BLAS library (hand-coded assembler, or tuned fortran/C), you will get optimal performance, far beyond that available with the fortran reference BLAS. When it applies, that is a good general model for other programming tasks, but it doesn’t always apply to every application. At one time, LAPACK did not even include high-level optimizations such as OpenMP directives, although I think that is no longer true for the latest LAPACK versions.

[edit: link added in reference to multithreaded LAPACK:
Does LAPACK/BLAS automatically use multi cores or threads? - #27 by JeffH]

I agree with this in general, but I thought OpenBLAS was written in C, not assembler, and I thought the tuning/optimization was done automatically, not manually. Whatever C does, I think that code could be written in fortran in principle, but fortran lacks a standard preprocessor, so such tuning tasks are easier in C for those superficial reasons.

1 Like

OpenBLAS is C with assembly kernels. I think in principle a compiler should be able to generate optimized assembly, but it will always be architecture specific. Perhaps the architecture could be described parametrically, at least for now until there is some major shift in microprocessor design. Major platforms like x86-64 and ARM should be targetable with some minimal parametric description for different micro-architectures.

You might be thinking of ATLAS (https://math-atlas.sourceforge.net/), which was auto-tuned and, I think, all in C. OpenBLAS has handwritten assembly kernels for each platform, and above that it uses Lapack in C and Fortran (both are included in the sources; I am not sure which is used by default), and I think it is not auto-tuned.

In OpenBLAS, a subset of the most important LAPACK functions are reimplemented for better performance/threading. This is why they have /lapack-netlib (reference version) and /lapack folders. It looks like they re-implement the LU and Cholesky factorizations, but not the other algorithms.

There is also ReLAPACK which uses recursive (or cache-oblivious) instead of blocked algorithms. At the time of publishing in 2016, it improved on the results of MKL and OpenBLAS.

There is also a Julia package RecursiveFactorization.jl that improves the LU factorization performance compared to OpenBLAS.

libFLAME is another library that improves upon the performance of LAPACK. They also offer excellent education material on their linear algebra methodology: http://www.ulaff.net/

I’m not aware how much of these algorithmic advances trickle down into Reference LAPACK. The problem is that the architectures keep changing, so optimal performance is a moving target. It also appears to be hard to optimize for all sizes. At some point linear algebra could become part of the compiler, but it doesn’t appear we are there yet.

We’ve read complaints here on Discourse about how BLAS and LAPACK don’t handle small sizes efficiently, hence the Julia/Python/Mojo communities are motivated to write their own libraries. On the other side you have the DOE and industry using MAGMA / cusolver / rocSOLVER / oneMKL (DPC++) to tackle large problems on GPUs.

Tim Mattson from Intel once showed a nice slide on why we have so many parallel “standards” for the same thing, but the point he makes extends to scientific computing languages, linear algebra libraries, and machine learning frameworks alike:

(Source: Tim Mattson, An Introduction to Parallel programming with OpenMP, PDF, 13.9 MB)


The Julia community is very proud of their benchmarks: SciMLBenchmarks.jl: Benchmarks for Scientific Machine Learning (SciML) and Equation Solvers · The SciML Benchmarks, and hats off, the improvements they have made in some (if not most) areas are enormous. But sometimes they get over-excited. Let’s take as an example the root-finding example: Simple Interval Rootfinding (NonlinearSolve.jl vs Roots.jl vs MATLAB) · The SciML Benchmarks (archived version on Sep 4 2023 here)

The claim made is:

MATLAB 2022a reportedly achieves 1.66s. Try this code yourself: we receive ~0.05 seconds, or a 33x speedup.

What is not emphasized is that the MATLAB version was run on an Intel Xeon E5-1650 v4 @ 3.60 GHz 6-core CPU (2016, Broadwell architecture), whereas the Julia code ran on an AMD EPYC 7502 @ 2.5 GHz 32-core processor (2019, Zen 2 architecture). So we should expect some differences just because of the hardware and clock speeds.

To make things worse, the 33x comparison is between MATLAB, which is using a variant of Dekker’s algorithm, whereas the Julia number comes from a Newton-Raphson solver. So the comparison is already meaningless and misleading. (I can highly recommend the video from Georg Hager, Lecture on Experiments and Data Presentation in High Performance Computing, as a source of good practices for doing performance measurements.)

I wrote my own little test in Fortran, using a wrapper of the classic zeroin routine from Forsythe, Malcolm, and Moler (1977):

! Declarations for the snippet; pzeroin (the zeroin wrapper), myfun (the
! test function), and timestamp() come from the full program linked below,
! and n and tol are set there as well.
integer, parameter :: dp = kind(1.0d0)
real(dp), allocatable :: levels(:), out(:)
real(dp) :: s, runtime, tol
integer :: n, niter, i, k

allocate(levels(n), out(n))

call random_number(levels)
levels = 1.5_dp * levels

! Double the repetition count until the measured time exceeds 0.2 s,
! so that timer resolution and startup noise are amortized.
niter = 1
do
    s = timestamp()
    do k = 1, niter
        do i = 1, n
            out(i) = pzeroin(0.0_dp, 2.0_dp, myfun, tol, levels(i))
        end do
    end do
    runtime = timestamp() - s
    if (runtime > 0.2_dp) exit
    niter = niter * 2
end do

print *, runtime/niter

Note how the number of outer iterations is increased to make sure the measured runtime exceeds 0.2 seconds. The results I get are the following:

[figure: zeroin benchmark results]

Results are shown for:

  • gcc version 13.1.0 (Homebrew GCC 13.1.0)
  • Intel(R) Core™ i7-9750H CPU @ 2.60GHz (6 cores, Coffee Lake architecture)
  • Python v3.9.6 [Clang 14.0.0 (clang-1400.0.29.202)] on darwin; Scipy v1.10.1

The programs and scripts can be found here: GitHub - ivan-pi/zerobench: Benchmark of zeroin

The naive conclusion would be that Scipy is c. 20 times slower than Fortran. In reality the Scipy wrapper does a lot more work, checking for sign correctness of the initial bracket, checking for NaN values, and other stuff that adds overhead. The core root finding algorithm is actually implemented in C, and can be found in the file brentq.c:

extern double brentq(callback_type f, double xa, double xb, double xtol,
                     double rtol, int iter, void *func_data_param,
                     scipy_zeros_info *solver_stats);
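For reference, a hedged sketch of what an iso_c_binding interface for this routine could look like (the scipy_zeros_info layout is my reading of the SciPy sources; verify against the zeros.h header before using it):

! Assumed C-interop binding for SciPy's brentq; names follow the C
! declaration above, but treat this as a sketch, not vetted code.
module brentq_binding
    use iso_c_binding
    implicit none
    type, bind(c) :: scipy_zeros_info
        integer(c_int) :: funcalls
        integer(c_int) :: iterations
        integer(c_int) :: error_num
    end type scipy_zeros_info
    interface
        function brentq(f, xa, xb, xtol, rtol, iter, func_data_param, &
                        solver_stats) bind(c, name="brentq") result(root)
            import :: c_double, c_int, c_ptr, c_funptr, scipy_zeros_info
            type(c_funptr), value :: f             ! callback_type
            real(c_double), value :: xa, xb, xtol, rtol
            integer(c_int), value :: iter
            type(c_ptr), value :: func_data_param
            type(scipy_zeros_info) :: solver_stats ! passed by reference
            real(c_double) :: root
        end function brentq
    end interface
end module brentq_binding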

We can call this routine from Fortran, with the following results:

[figure: benchmark results with brentq called from Fortran]

For some more variety, I added the implementations from the PORT Mathematical Software Library (1978) and NAPACK (1988, codes associated with the book of W. W. Hager, Applied Numerical Linear Algebra). As you can see we are in the same order of magnitude, with differences of at most 2x (the C brentq routine has more elaborate error reporting compared to zeroin). Keeping in mind I’m using different hardware, the classic zeroin subroutine is about 4x-5x faster than the Julia benchmark.

Btw, the original version of Brent’s routine can be found in his book Algorithms for Minimization without Derivatives (1973). A comment in the source code says it was tested on the IBM 360/91; here is one photographed at NASA Goddard Space Flight Center:

[photo: IBM System/360 Model 91 at NASA Goddard Space Flight Center]

Granted, compilers and software practices have come a long way since then. But I think it’s pretty cool that code from 50 years ago still compiles and runs.

In conclusion:

  • the SciML benchmark pertaining to root-finding is misleading; it ignores hardware and algorithmic differences.
  • the SciML benchmark chooses a poor baseline; arguably, if the 1.66 seconds that MATLAB takes were the bottleneck, a MATLAB user could drop down into a MEX-file and benefit from the speed of compiled languages.
  • the performance of SciPy’s brentq (in C) is similar to Fortran, once the Python interpreter and safety nets are taken out of the picture.
7 Likes

Personally, I think any benchmark that compares runs across multiple different platforms, while trying to benchmark anything other than those platforms themselves, is extremely disingenuous.

I would also try to avoid benchmark results that are obtained in under 2 seconds. Every modern CPU has some manner of clock-speed boosting capability, and it’s all automatically managed in hardware based on power/thermal headroom, electrical load, etc. Even in an ideal scenario where all tests can be performed on the same machine, using the same set of compilers, with the same compilation flags, and any other configuration, the tests themselves need to run long enough that you are measuring “average” hardware performance.

This was an excellent post though, thank you for the references and additional reading material. As the Fortran community, we should strive for extreme performance, and we should be representing it fairly, with some non-zero amount of scientific rigor before coming to conclusions.

2 Likes

This is precisely why Hager advocates showing boxplots, because they capture the magnitude of fluctuations. Here is an example of how things can go wrong:

[figure: boxplots of repeated measurements, including a fluke -O2 run]

Each boxplot is 10 repeated measurements. The thin orange line is the median. In this particular run, 3 out of 10 measurements with the executable compiled with -O2 were around 100 ms for no obvious reason, while the rest were below 50 ms.

I repeated my measurements with the minimum time set to 1.0 second instead of 0.2, but the graph does not change.

1 Like

Great post @ivanpribec! How to benchmark properly is a whole other topic. I get good results by running a longer loop and plotting clock counts per array element. For complex tasks, I usually use the fastest number, but I run the benchmark many times and see how often I get it. Plotting all the results (say, in a boxplot) is another idea. I have a feeling that benchmarking well is a little bit of an art.
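A minimal sketch (mine, not from the post) of the clocks-per-element idea: repeat the kernel many times and report timer ticks per array element.

program tick_bench
    implicit none
    integer, parameter :: n = 10**6, nrep = 100
    real :: x(n)
    integer(8) :: t0, t1, rate
    integer :: rep
    call random_number(x)
    call system_clock(t0, rate)      ! rate = ticks per second
    do rep = 1, nrep
        x = x * 0.999 + 0.001        ! stand-in kernel under test
    end do
    call system_clock(t1)
    print *, sum(x)                  ! keep the result live so the loop is not elided
    print *, 'ticks per element:', real(t1 - t0) / (real(nrep) * real(n))
end program tick_bench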

Modern features help writing codes that are more structured, more readable, simpler, and all of these are good things. But still, codes written 20+ years ago can be as fast as today’s codes.

For instance, you are citing allocatable components instead of pointer components: but 20 years ago I simply avoided using pointers, and thus derived types, in the parts of the code where the highest performance was wanted.

It depends on the application, of course, but 20+ years ago I had that same kind of choice (between pointers and fixed arrays of max dimension) and I used pointers. Pointers introduce the possibility of aliasing, and in algorithms with tight loops that has important performance consequences. When allocatables were allowed (as dummy arguments and derived type components), I switched, picked up as much as a 2x speed improvement, and never looked back. I even had some recursive data structures implemented with pointers, like linked lists and trees, and I switched those over to use allocatables. So for me, allocatables were a big deal in f2003. They allowed the minimal memory footprint and contiguous arrays without the aliasing problems.
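For illustration, a hedged sketch of such a structure (recursive allocatable components were formally standardized in f2008, though some compilers accepted them earlier):

module list_demo
    implicit none
    type :: node
        integer :: value
        type(node), allocatable :: next   ! allocatable instead of pointer: no aliasing,
                                          ! and deallocation happens automatically
    end type node
contains
    subroutine push_front(head, v)
        type(node), allocatable, intent(inout) :: head
        integer, intent(in) :: v
        type(node), allocatable :: tmp
        call move_alloc(head, tmp)        ! detach the old list head
        allocate(head)
        head%value = v
        call move_alloc(tmp, head%next)   ! reattach it as the tail
    end subroutine push_front
end module list_demo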

I did not mention it before, but f2003 was also when the object oriented features and parameterized derived types were added to the language. These could have been a significant addition to the language at that time too, with some potentially significant performance consequences, but their slow and buggy inclusion in compilers has slowed adoption by programmers. I still avoid using some of those features even now due to those problems.