Achieving OpenBLAS DGEMM performance with Fortran vs C intrinsics: why is Fortran slower?

I have tested the code on a Linux machine with a Xeon Gold 6348.

First, with gfortran 10. Although I’ve left “OpenBLAS” in the report, the code is actually linked against the default BLAS installation, which is probably the plain Netlib reference implementation (so clearly not performant). The Fortran code is about 30% slower than the C+AVX code:

% gcc -c -Ofast -march=native -mtune=native -flto cfile.c \
&& gfortran -Ofast -march=native -mtune=native ffile.f90 cfile.o -flto -lblas \
&& ./a.out
==================================================
Matrix size:                1000
Fortran(s):                 0.072000
C-Intrinsics(s):            0.056000
OpenBLAS(s):                0.793000
Speedup (OpenBLAS):         0.090794
Speedup (C-Intrinsics):     1.285714
Error (Fortran):            0.000000
Error (C):                  0.000000
==================================================
Matrix size:                1500
Fortran(s):                 0.235000
C-Intrinsics(s):            0.175000
OpenBLAS(s):                4.009000
Speedup (OpenBLAS):         0.058618
Speedup (C-Intrinsics):     1.342857
Error (Fortran):            0.000000
Error (C):                  0.000000
==================================================
Matrix size:                2000
Fortran(s):                 0.563000
C-Intrinsics(s):            0.473000
OpenBLAS(s):                9.680000
Speedup (OpenBLAS):         0.058161
Speedup (C-Intrinsics):     1.190275
Error (Fortran):            0.000000
Error (C):                  0.000000

And now with the Intel 21 compilers, linked against MKL (again, I’ve left “OpenBLAS” in the report, but it is MKL). MKL is about 2x faster than the C+AVX code, and the Fortran is about 5x slower than it. ifort performs particularly poorly on this example. Note that the C+AVX version has about the same timing as with gcc.

% icc -c -Ofast -march=native -mtune=native -ipo cfile.c \
&& ifort -Ofast -march=native -mtune=native ffile.f90 cfile.o -qmkl=sequential -L/opt/intel/21/mkl/lib/intel64 -ipo \
&& ./a.out
cfile.c(15): warning #266: function "aligned_alloc" declared implicitly
      return aligned_alloc(alignment, (size_t)size);
             ^

==================================================
Matrix size:                1000
Fortran(s):                 0.280900
C-Intrinsics(s):            0.052900
OpenBLAS(s):                0.026900
Speedup (OpenBLAS):        10.442379
Speedup (C-Intrinsics):     5.310019
Error (Fortran):            0.000000
Error (C):                  0.000000
==================================================
Matrix size:                1500
Fortran(s):                 0.927600
C-Intrinsics(s):            0.175800
OpenBLAS(s):                0.084000
Speedup (OpenBLAS):        11.042857
Speedup (C-Intrinsics):     5.276451
Error (Fortran):            0.000000
Error (C):                  0.000000
==================================================
Matrix size:                2000
Fortran(s):                 2.214400
C-Intrinsics(s):            0.446800
OpenBLAS(s):                0.204900
Speedup (OpenBLAS):        10.807223
Speedup (C-Intrinsics):     4.956132
Error (Fortran):            0.000000
Error (C):                  0.000000

Note that I had to comment out the deallocate in dgemm_c_intrinsics: it is not legal to deallocate, on the Fortran side, a pointer that was allocated on the C side (it only works by chance with gcc+gfortran).
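For reference, a minimal sketch of how such a buffer could be released cleanly, assuming the C side allocates it with malloc or aligned_alloc: expose the C runtime’s free() to Fortran through an explicit interface and call it on the saved c_ptr, rather than using DEALLOCATE on the Fortran pointer obtained from c_f_pointer. Module and procedure names here are illustrative, not the ones in the actual code.

module c_memory
   use iso_c_binding, only: c_ptr
   implicit none
   interface
      ! void free(void *ptr); from the C standard library
      subroutine c_free(ptr) bind(c, name="free")
         import :: c_ptr
         type(c_ptr), value :: ptr
      end subroutine c_free
   end interface
end module c_memory

The caller keeps the c_ptr returned by the C allocator, maps it to a Fortran array pointer with c_f_pointer for the computation, and passes the saved c_ptr to c_free when done.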


Nice! The Fortran version is slower, but still better than what I obtained in terms of speedup. I wonder if the performance difference is due to the use of different compiler versions.

The performance of the MKL implementation can be attributed to its use of AVX-512 instructions, which are not supported on my computer, so I did not include them in the C kernel.


Using MKL 2025.2 with GCC 11.4.0, I get the same performance from MKL as from the C+AVX kernel.

gcc -Ofast -march=native -mtune=native -flto -c dgemm_kernel.c -o dgemm_kernel.o && \
gfortran -Ofast -march=native -mtune=native -flto -o compare_dgemm compare_dgemm.f90 dgemm_kernel.o -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 -Wl,--start-group -lmkl_gf_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group -lpthread -lm -ldl && \
OMP_NUM_THREADS=1 ./compare_dgemm

==================================================
Matrix size:                1000
MATMUL(s):                  0.134000
Fortran(s):                 0.098000
C-Intrinsics(s):            0.044000
MKL-BLAS(s):                0.045000
Speedup (FORTRAN):          1.367347
Speedup (MKL-BLAS):         2.977778
Speedup (C-Intrinsics):     3.045455
Error (Fortran):            0.000000
Error (C):                  0.000000
Error (BLAS):               0.000000
==================================================
Matrix size:                1500
MATMUL(s):                  0.448000
Fortran(s):                 0.330000
C-Intrinsics(s):            0.150000
MKL-BLAS(s):                0.150000
Speedup (FORTRAN):          1.357576
Speedup (MKL-BLAS):         2.986667
Speedup (C-Intrinsics):     2.986667
Error (Fortran):            0.000000
Error (C):                  0.000000
Error (BLAS):               0.000000
==================================================
Matrix size:                2000
MATMUL(s):                  1.067000
Fortran(s):                 0.772000
C-Intrinsics(s):            0.357000
MKL-BLAS(s):                0.359000
Speedup (FORTRAN):          1.382124
Speedup (MKL-BLAS):         2.972145
Speedup (C-Intrinsics):     2.988796
Error (Fortran):            0.000000
Error (C):                  0.000000
Error (BLAS):               0.000000

For the Fortran code, I wonder whether it would not be better, in the end, to avoid micro-optimizing it with all these 4-element arrays/columns.


It is generally better to perform the accumulation in a temporary array and then copy the result into the global CCC array. But simplifying the algorithm might also allow the compiler to optimize it more effectively.
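As a hedged sketch of that first point (illustrative names and block sizes, not the actual kernel from this thread), the accumulation can be done in a local temporary and written back to the result array in a single pass:

subroutine block_accumulate(a, b, c, n, i0, j0, bi, bj)
   implicit none
   integer, intent(in) :: n, i0, j0, bi, bj
   real(8), intent(in)    :: a(n,n), b(n,n)
   real(8), intent(inout) :: c(n,n)
   real(8) :: tmp(bi, bj)              ! local temporary (automatic array)
   integer :: i, j, k
   tmp = 0.0d0
   do j = 1, bj
      do k = 1, n
         do i = 1, bi
            tmp(i,j) = tmp(i,j) + a(i0+i-1, k) * b(k, j0+j-1)
         end do
      end do
   end do
   ! single write-back of the finished block into the global result array
   c(i0:i0+bi-1, j0:j0+bj-1) = c(i0:i0+bi-1, j0:j0+bj-1) + tmp
end subroutine block_accumulate

Because tmp is small and contiguous, the compiler can keep it in registers or L1 cache, and the global array is only touched once per block.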

Reading through this topic, I started to wonder: is there any serious reason why MATMUL in the Fortran “standard library” (meaning the intrinsic procedures, not the stdlib project) is not implemented internally using C intrinsics, assembly, or whatever it takes to make it super-fast? Does the Standard require all intrinsics to be written in pure Fortran? I doubt that.
Surely that would require an effort from the compiler/implementation builders. But with open-source projects like LFortran, gfortran, and maybe also flang, that should be doable.


No, of course not. Historically intrinsics would have been written in assembler.

One might expect allocations to be a problem in the fortran timings. I’m not sure how C does its allocations, but in fortran allocatable arrays almost always use the heap, while automatic arrays use the stack. Heap allocation is much more expensive than stack allocation. It looks like the allocation/deallocation is within the timing loop; if so, then just changing those work arrays in the fortran code from allocatable to automatic might make a big difference. If you think they are small enough, then they could just be statically allocated (i.e. fixed dimensions with SAVE). I don’t know if the C code would need to be changed accordingly. Another approach often used in fortran production codes where performance is critical is to find a way to preallocate all the work arrays outside of the timing loops; then it doesn’t matter so much whether heap or stack is used.
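As a hedged illustration (not the thread’s actual code), here is the same work array declared the three ways mentioned above; only the first version touches the heap on every call:

subroutine kernel_allocatable(n)      ! heap: allocate/deallocate on every call
   implicit none
   integer, intent(in) :: n
   real(8), allocatable :: work(:)
   allocate(work(n))
   work = 0.0d0                       ! ... use work ...
   deallocate(work)
end subroutine kernel_allocatable

subroutine kernel_automatic(n)        ! stack: automatic array, no heap traffic
   implicit none
   integer, intent(in) :: n
   real(8) :: work(n)
   work = 0.0d0                       ! ... use work ...
end subroutine kernel_automatic

subroutine kernel_static()            ! fixed dimensions with SAVE: allocated once
   implicit none
   integer, parameter :: nmax = 1024
   real(8), save :: work(nmax)
   work = 0.0d0                       ! ... use work ...
end subroutine kernel_static

Alternatively, moving the allocate/deallocate pair outside the timed region achieves the same effect while keeping the allocatable declaration.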

I’m just speculating, but I assume the main reason is that -march=native and friends shouldn’t be taken into account by default —i.e., matmul should be “generically portable”, not tied to the capabilities of one machine or CPU.

The additional effort from compiler vendors could come from checking if a native-like flag was provided for the compilation step, and offering a different matmul default for those cases (but they already do that through other flags).

Different versions of a CPU might have different numbers of registers, different sizes of cache, different memory bandwidth, different vector and gpu hardware, and maybe even some different instructions. An optimal MATMUL might be expected to take advantage of all of those things. This can be done using run time tests to determine which branches of code should be executed, how matrices should be subblocked, which SIMD instructions to use, and so on. These things can be determined the first time MATMUL is called, and that state can be saved so that all of the subsequent calls to MATMUL have minimal overhead. All of that does require a lot of effort from the compiler writers, library writers, and so on.
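For a concrete flavor of that one-time selection, here is a hedged sketch in plain Fortran using a procedure pointer. The capability query is a stub standing in for a real CPUID/feature test, and both kernels are placeholders; a real library would plug in tuned, blocked, SIMD kernels.

module gemm_dispatch
   implicit none
   private
   public :: gemm
   abstract interface
      subroutine kernel_t(a, b, c, n)
         integer, intent(in) :: n
         real(8), intent(in)    :: a(n,n), b(n,n)
         real(8), intent(inout) :: c(n,n)
      end subroutine kernel_t
   end interface
   procedure(kernel_t), pointer :: impl => null()   ! cached choice (implicit SAVE)
contains
   logical function has_wide_simd()
      has_wide_simd = .false.            ! stand-in for a real CPUID/feature query
   end function has_wide_simd
   subroutine kernel_generic(a, b, c, n)
      integer, intent(in) :: n
      real(8), intent(in)    :: a(n,n), b(n,n)
      real(8), intent(inout) :: c(n,n)
      c = matmul(a, b)                   ! portable fallback
   end subroutine kernel_generic
   subroutine kernel_simd(a, b, c, n)
      integer, intent(in) :: n
      real(8), intent(in)    :: a(n,n), b(n,n)
      real(8), intent(inout) :: c(n,n)
      c = matmul(a, b)                   ! placeholder for a blocked/SIMD kernel
   end subroutine kernel_simd
   subroutine gemm(a, b, c, n)           ! public entry point
      integer, intent(in) :: n
      real(8), intent(in)    :: a(n,n), b(n,n)
      real(8), intent(inout) :: c(n,n)
      if (.not. associated(impl)) then   ! one-time selection, cached afterwards
         if (has_wide_simd()) then
            impl => kernel_simd
         else
            impl => kernel_generic
         end if
      end if
      call impl(a, b, c, n)
   end subroutine gemm
end module gemm_dispatch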

Apple somehow has managed to do this same kind of thing on a larger scale several times in its history, as they changed from motorola architectures to powerpc architectures in the 1990s, then from powerpc to intel in 2005, and more recently from intel to arm64 in 2020. They have “fat binaries” that work with whatever hardware is executing at the moment. So making “fat” versions of MATMUL that accommodate different cpu generations, different memory bandwidths, different cache sizes, and so on seems like a rather minor task compared to what Apple has done with its OS.

[edit] I wanted to add this comment here too. For the last 25 years, it has been popular to build small- to mid-scale parallel machines using off-the-shelf hardware components. The first version of the machine might have been, say, 16 to 32 nodes, each configured with whatever the fastest CPUs were at the time and with the fastest and largest memory that you could afford. Then after a couple of years, new nodes were added to account for new computational demands. But now there is a new generation of CPU, with larger caches and maybe a new generation of SIMD vector instructions, maybe more cores per node, and a new generation of memory that is cheaper and faster. So now you have a mixed environment where not all the nodes are the same. Another few years pass, and some of the original nodes begin to fail and are replaced with yet newer versions of CPUs, memory, hard disks, SSDs, and so on. The cluster grows, evolves, and continues on like this over the decades until the present.

In the beginning, when the nodes were all homogeneous, the programmer would use -march=native type options to squeeze out every bit of performance that he could, and he would tune his code to account for the cache sizes, memory bandwidth, memory capacity, network bandwidth, and so on of those nodes. But over time the nodes become more heterogeneous, so he stops using -march=native so that his code can run on whichever set of nodes is available at any moment. Then he digs into his algorithm and makes the formerly static programming choices (loop unrolling, matrix subblocking, etc.) more dynamic, so that the code uses whatever computer resources are available at run time more efficiently.

That pretty much describes what has happened in the last 25 to 30 years with small-scale parallelism and commodity hardware, and that is where we are now with personal, group-level, and department-level machines. The high-end supercomputers have evolved during that time in a similar way, with more cores per node, more GPU capability, new network configurations, and so on. So when matmul is invoked on such a machine, how much should the programmer expect it to do? One can argue that it should do at least as much as the programmer has invested into his own codes and his own algorithms to dynamically adjust and to take advantage of whatever hardware is available.

But Apple is a hardware company (that just happens to make their own operating system(s)), so the fat binary has limited scope in terms of hardware.

The problem with a “fat matmul” is making it a default, instead of a compilation flag as I mentioned —e.g., if my code only uses matmul two or three times, then the overhead of checking for capabilities can be significant.

Otherwise, one might end up with code that avoids invoking matmul, pretty much like the do concurrent situation now, where it should be avoided with some compilers unless the OpenMP overhead becomes negligible.

(Or maybe I think too much about redistributable code, rather than single-use code? :slightly_smiling_face:)

Certainly there are some wrong ways to do this. If a code only does matmul two or three times with small matrices, then yes, the overhead might be relatively large. But in that case, who really cares? If you do a billion small matrix products, then that one-time overhead gets amortized over those billion calls and becomes negligible. If you do matmul with larger matrices, even if you only do a few of them, then the one-time configuration overhead becomes negligible compared to the large number of floating-point operations. So if implemented well, it seems like this could be a useful approach.
