Achieving OpenBLAS DGEMM performance with Fortran vs C intrinsics: why is Fortran slower?

I have tested the code on a Linux machine with a Xeon Gold 6348.

First, with gfortran 10. Although I’ve left “OpenBLAS” in the report, the code is actually linked against the default BLAS installation, which is probably the plain Netlib reference implementation (so clearly not performant). The Fortran code is about 30% slower than the C+AVX code:

% gcc -c -Ofast -march=native -mtune=native -flto cfile.c \
&& gfortran -Ofast -march=native -mtune=native ffile.f90 cfile.o -flto -lblas \
&& ./a.out
==================================================
Matrix size:                1000
Fortran(s):                 0.072000
C-Intrinsics(s):            0.056000
OpenBLAS(s):                0.793000
Speedup (OpenBLAS):         0.090794
Speedup (C-Intrinsics):     1.285714
Error (Fortran):            0.000000
Error (C):                  0.000000
==================================================
Matrix size:                1500
Fortran(s):                 0.235000
C-Intrinsics(s):            0.175000
OpenBLAS(s):                4.009000
Speedup (OpenBLAS):         0.058618
Speedup (C-Intrinsics):     1.342857
Error (Fortran):            0.000000
Error (C):                  0.000000
==================================================
Matrix size:                2000
Fortran(s):                 0.563000
C-Intrinsics(s):            0.473000
OpenBLAS(s):                9.680000
Speedup (OpenBLAS):         0.058161
Speedup (C-Intrinsics):     1.190275
Error (Fortran):            0.000000
Error (C):                  0.000000

And now with the Intel 21 compilers, linked against MKL (again, I’ve left “OpenBLAS” in the report, but it is MKL). MKL is about 2x faster than the C+AVX code, and the Fortran is about 5x slower than it. ifort performs particularly poorly on this example. Note that the C+AVX version has about the same timing as with gcc.

% icc -c -Ofast -march=native -mtune=native -ipo cfile.c \
&& ifort -Ofast -march=native -mtune=native ffile.f90 cfile.o -qmkl=sequential -L/opt/intel/21/mkl/lib/intel64 -ipo \
&& ./a.out
cfile.c(15): warning #266: function "aligned_alloc" declared implicitly
      return aligned_alloc(alignment, (size_t)size);
             ^

==================================================
Matrix size:                1000
Fortran(s):                 0.280900
C-Intrinsics(s):            0.052900
OpenBLAS(s):                0.026900
Speedup (OpenBLAS):        10.442379
Speedup (C-Intrinsics):     5.310019
Error (Fortran):            0.000000
Error (C):                  0.000000
==================================================
Matrix size:                1500
Fortran(s):                 0.927600
C-Intrinsics(s):            0.175800
OpenBLAS(s):                0.084000
Speedup (OpenBLAS):        11.042857
Speedup (C-Intrinsics):     5.276451
Error (Fortran):            0.000000
Error (C):                  0.000000
==================================================
Matrix size:                2000
Fortran(s):                 2.214400
C-Intrinsics(s):            0.446800
OpenBLAS(s):                0.204900
Speedup (OpenBLAS):        10.807223
Speedup (C-Intrinsics):     4.956132
Error (Fortran):            0.000000
Error (C):                  0.000000

Note that I had to comment out the deallocate in dgemm_c_intrinsics: it is not legal to deallocate, on the Fortran side, a pointer that was allocated on the C side (it only works by chance with gcc+gfortran).
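For reference, a minimal sketch of how such a buffer could be released cleanly, assuming the C side allocates it with malloc or aligned_alloc: expose the C runtime’s free() to Fortran through an explicit interface and call it on the saved c_ptr, rather than using DEALLOCATE on the Fortran pointer obtained from c_f_pointer. Module and procedure names here are illustrative, not the ones in the actual code.

module c_memory
   use iso_c_binding, only: c_ptr
   implicit none
   interface
      ! void free(void *ptr); from the C standard library
      subroutine c_free(ptr) bind(c, name="free")
         import :: c_ptr
         type(c_ptr), value :: ptr
      end subroutine c_free
   end interface
end module c_memory

The caller keeps the c_ptr returned by the C allocator, maps it to a Fortran array pointer with c_f_pointer for the computation, and passes the saved c_ptr to c_free when done.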


Nice! The Fortran version is slower, but still better than what I obtained in terms of speedup. I wonder if the performance difference is due to the use of different compiler versions.

The performance of the MKL implementation can be attributed to its use of AVX-512 instructions, which are not supported on my computer, so I did not include them in the C kernel.


Using MKL 2025.2 with GCC 11.4.0, I get the same performance from MKL as from the C+AVX kernel.

gcc -Ofast -march=native -mtune=native -flto -c dgemm_kernel.c -o dgemm_kernel.o && \
gfortran -Ofast -march=native -mtune=native -flto -o compare_dgemm compare_dgemm.f90 dgemm_kernel.o -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 -Wl,--start-group -lmkl_gf_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group -lpthread -lm -ldl && \
OMP_NUM_THREADS=1 ./compare_dgemm

==================================================
Matrix size:                1000
MATMUL(s):                  0.134000
Fortran(s):                 0.098000
C-Intrinsics(s):            0.044000
MKL-BLAS(s):                0.045000
Speedup (FORTRAN):          1.367347
Speedup (MKL-BLAS):         2.977778
Speedup (C-Intrinsics):     3.045455
Error (Fortran):            0.000000
Error (C):                  0.000000
Error (BLAS):               0.000000
==================================================
Matrix size:                1500
MATMUL(s):                  0.448000
Fortran(s):                 0.330000
C-Intrinsics(s):            0.150000
MKL-BLAS(s):                0.150000
Speedup (FORTRAN):          1.357576
Speedup (MKL-BLAS):         2.986667
Speedup (C-Intrinsics):     2.986667
Error (Fortran):            0.000000
Error (C):                  0.000000
Error (BLAS):               0.000000
==================================================
Matrix size:                2000
MATMUL(s):                  1.067000
Fortran(s):                 0.772000
C-Intrinsics(s):            0.357000
MKL-BLAS(s):                0.359000
Speedup (FORTRAN):          1.382124
Speedup (MKL-BLAS):         2.972145
Speedup (C-Intrinsics):     2.988796
Error (Fortran):            0.000000
Error (C):                  0.000000
Error (BLAS):               0.000000

For the Fortran code, I wonder whether it would not be better, in the end, to avoid micro-optimizing it with all these 4-element arrays/columns.


It is generally better to perform the accumulation in a temporary array and then copy the result into the global CCC array. But simplifying the algorithm might also allow the compiler to optimize it more effectively.
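As a hedged sketch of that first point (illustrative names and block sizes, not the actual kernel from this thread), the accumulation can be done in a local temporary and written back to the result array in a single pass:

subroutine block_accumulate(a, b, c, n, i0, j0, bi, bj)
   implicit none
   integer, intent(in) :: n, i0, j0, bi, bj
   real(8), intent(in)    :: a(n,n), b(n,n)
   real(8), intent(inout) :: c(n,n)
   real(8) :: tmp(bi, bj)              ! local temporary (automatic array)
   integer :: i, j, k
   tmp = 0.0d0
   do j = 1, bj
      do k = 1, n
         do i = 1, bi
            tmp(i,j) = tmp(i,j) + a(i0+i-1, k) * b(k, j0+j-1)
         end do
      end do
   end do
   ! single write-back of the finished block into the global result array
   c(i0:i0+bi-1, j0:j0+bj-1) = c(i0:i0+bi-1, j0:j0+bj-1) + tmp
end subroutine block_accumulate

Because tmp is small and contiguous, the compiler can keep it in registers or L1 cache, and the global array is only touched once per block.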

Reading through this topic, I started to wonder: is there any serious reason why MATMUL in the Fortran “standard library” (meaning the intrinsic procedures, not the stdlib project) is not implemented internally using C intrinsics, assembly, or whatever it takes to make it super-fast? Does the Standard require all intrinsics to be written in pure Fortran? I doubt that.
Surely that would require an effort from the compiler/implementation builders. But with open-source projects like LFortran, gfortran, and maybe also flang, that should be doable.


No, of course not. Historically intrinsics would have been written in assembler.

One might expect allocations to be a problem in the fortran timings. I’m not sure how C does its allocations, but in fortran allocatable arrays almost always use the heap, while automatic arrays use the stack. Heap allocation is much more expensive than stack allocation. It looks like the allocation/deallocation is within the timing loop; if so, then just changing those work arrays in the fortran code from allocatable to automatic might make a big difference. If you think they are small enough, then they could just be statically allocated (i.e. fixed dimensions with SAVE). I don’t know if the C code would need to be changed accordingly. Another approach often used in fortran production codes where performance is critical is to find a way to preallocate all the work arrays outside of the timing loops; then it doesn’t matter so much whether heap or stack is used.
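As a hedged illustration (not the thread’s actual code), here is the same work array declared the three ways mentioned above; only the first version touches the heap on every call:

subroutine kernel_allocatable(n)      ! heap: allocate/deallocate on every call
   implicit none
   integer, intent(in) :: n
   real(8), allocatable :: work(:)
   allocate(work(n))
   work = 0.0d0                       ! ... use work ...
   deallocate(work)
end subroutine kernel_allocatable

subroutine kernel_automatic(n)        ! stack: automatic array, no heap traffic
   implicit none
   integer, intent(in) :: n
   real(8) :: work(n)
   work = 0.0d0                       ! ... use work ...
end subroutine kernel_automatic

subroutine kernel_static()            ! fixed dimensions with SAVE: allocated once
   implicit none
   integer, parameter :: nmax = 1024
   real(8), save :: work(nmax)
   work = 0.0d0                       ! ... use work ...
end subroutine kernel_static

Alternatively, moving the allocate/deallocate pair outside the timed region achieves the same effect while keeping the allocatable declaration.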

I’m just speculating, but I assume the main reason is that -march=native and friends shouldn’t be taken into account by default —i.e., matmul should be “generically portable”, not tied to the capabilities of one machine or CPU.

The additional effort from compiler vendors could come from checking if a native-like flag was provided for the compilation step, and offering a different matmul default for those cases (but they already do that through other flags).

Different versions of a CPU might have different numbers of registers, different sizes of cache, different memory bandwidth, different vector and gpu hardware, and maybe even some different instructions. An optimal MATMUL might be expected to take advantage of all of those things. This can be done using run time tests to determine which branches of code should be executed, how matrices should be subblocked, which SIMD instructions to use, and so on. These things can be determined the first time MATMUL is called, and that state can be saved so that all of the subsequent calls to MATMUL have minimal overhead. All of that does require a lot of effort from the compiler writers, library writers, and so on.
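For a concrete flavor of that one-time selection, here is a hedged sketch in plain Fortran using a procedure pointer. The capability query is a stub standing in for a real CPUID/feature test, and both kernels are placeholders; a real library would plug in tuned, blocked, SIMD kernels.

module gemm_dispatch
   implicit none
   private
   public :: gemm
   abstract interface
      subroutine kernel_t(a, b, c, n)
         integer, intent(in) :: n
         real(8), intent(in)    :: a(n,n), b(n,n)
         real(8), intent(inout) :: c(n,n)
      end subroutine kernel_t
   end interface
   procedure(kernel_t), pointer :: impl => null()   ! cached choice (implicit SAVE)
contains
   logical function has_wide_simd()
      has_wide_simd = .false.            ! stand-in for a real CPUID/feature query
   end function has_wide_simd
   subroutine kernel_generic(a, b, c, n)
      integer, intent(in) :: n
      real(8), intent(in)    :: a(n,n), b(n,n)
      real(8), intent(inout) :: c(n,n)
      c = matmul(a, b)                   ! portable fallback
   end subroutine kernel_generic
   subroutine kernel_simd(a, b, c, n)
      integer, intent(in) :: n
      real(8), intent(in)    :: a(n,n), b(n,n)
      real(8), intent(inout) :: c(n,n)
      c = matmul(a, b)                   ! placeholder for a blocked/SIMD kernel
   end subroutine kernel_simd
   subroutine gemm(a, b, c, n)           ! public entry point
      integer, intent(in) :: n
      real(8), intent(in)    :: a(n,n), b(n,n)
      real(8), intent(inout) :: c(n,n)
      if (.not. associated(impl)) then   ! one-time selection, cached afterwards
         if (has_wide_simd()) then
            impl => kernel_simd
         else
            impl => kernel_generic
         end if
      end if
      call impl(a, b, c, n)
   end subroutine gemm
end module gemm_dispatch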

Apple somehow has managed to do this same kind of thing on a larger scale several times in its history, as they changed from motorola architectures to powerpc architectures in the 1990s, then from powerpc to intel in 2005, and more recently from intel to arm64 in 2020. They have “fat binaries” that work with whatever hardware is executing at the moment. So making “fat” versions of MATMUL that accommodate different cpu generations, different memory bandwidths, different cache sizes, and so on seems like a rather minor task compared to what Apple has done with its OS.

[edit] I wanted to add this comment here too. For the last 25 years, it has been popular to build small- to mid-scale parallel machines using off-the-shelf hardware components. The first version of the machine might have been, say, 16 to 32 nodes, each configured with whatever the fastest CPUs were at the time and with the fastest and largest memory that you could afford. Then after a couple of years, new nodes were added to account for new computational demands. But now there is a new generation of CPU, with larger caches and maybe a new generation of SIMD vector instructions, maybe more cores per node, and a new generation of memory that is cheaper and faster. So now you have a mixed environment where not all the nodes are the same. Another few years pass, and some of the original nodes begin to fail and are replaced with yet newer versions of CPUs, memory, hard disks, SSDs, and so on. The cluster grows, evolves, and continues on like this over the decades until the present.

In the beginning, when the nodes were all homogeneous, the programmer would use -march=native type options to squeeze out every bit of performance that he could, and he would tune his code to account for the cache sizes, memory bandwidth, memory capacity, network bandwidth, and so on of those nodes. But over time the nodes become more heterogeneous, so he stops using -march=native so that his code can run on whichever set of nodes is available at any moment. Then he digs into his algorithm and makes the formerly static programming choices (loop unrolling, matrix subblocking, etc.) more dynamic, so that the code uses whatever computer resources are available at run time more efficiently.

That pretty much describes what has happened in the last 25 to 30 years with small-scale parallelism and commodity hardware, and that is where we are now with personal, group-level, and department-level machines. The high-end supercomputers have evolved during that time in a similar way, with more cores per node, more GPU capability, new network configurations, and so on. So when matmul is invoked on such a machine, how much should the programmer expect it to do? One can argue that it should do at least as much as the programmer has invested into his own codes and his own algorithms to dynamically adjust and to take advantage of whatever hardware is available.

But Apple is a hardware company (that just happens to make their own operating system(s)), so the fat binary has limited scope in terms of hardware.

The problem with a “fat matmul” is making it a default, instead of a compilation flag as I mentioned —e.g., if my code only uses matmul two or three times, then the overhead of checking for capabilities can be significant.

Otherwise, one might end up with code that avoids invoking matmul, pretty much like the do concurrent situation now, where it should be avoided with some compilers unless the OpenMP overhead becomes negligible.

(Or maybe I think too much about redistributable code, rather than single-use code? :slightly_smiling_face:)

Certainly there are some wrong ways to do this. If a code only does matmul two or three times with small matrices, then yes, the overhead might be relatively large. But in that case, who really cares? If you do a billion small matrix products, then that one-time overhead gets amortized over those billion calls and becomes negligible. If you do matmul with larger matrices, even if you only do a few of them, then the one-time configuration overhead becomes negligible compared to the large number of floating-point operations. So if implemented well, it seems like this could be a useful approach.
