I have tested the code on a Linux machine with a Xeon Gold 6348.
First, with gfortran 10. Although the report still says “OpenBLAS”, the code is linked against the default BLAS installation, which is probably the plain Netlib reference implementation (so clearly not performant). The Fortran code takes about 20 to 35% longer than the C+AVX code:
% gcc -c -Ofast -march=native -mtune=native -flto cfile.c \
&& gfortran -Ofast -march=native -mtune=native ffile.f90 cfile.o -flto -lblas \
&& ./a.out
==================================================
Matrix size: 1000
Fortran(s): 0.072000
C-Intrinsics(s): 0.056000
OpenBLAS(s): 0.793000
Speedup (OpenBLAS): 0.090794
Speedup (C-Intrinsics): 1.285714
Error (Fortran): 0.000000
Error (C): 0.000000
==================================================
Matrix size: 1500
Fortran(s): 0.235000
C-Intrinsics(s): 0.175000
OpenBLAS(s): 4.009000
Speedup (OpenBLAS): 0.058618
Speedup (C-Intrinsics): 1.342857
Error (Fortran): 0.000000
Error (C): 0.000000
==================================================
Matrix size: 2000
Fortran(s): 0.563000
C-Intrinsics(s): 0.473000
OpenBLAS(s): 9.680000
Speedup (OpenBLAS): 0.058161
Speedup (C-Intrinsics): 1.190275
Error (Fortran): 0.000000
Error (C): 0.000000
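A note on reading the reports: the speedup figures are consistent with the Fortran time divided by the time of the other implementation (for size 1000, 0.072 / 0.056 = 1.285714), so a value below 1 means that implementation is slower than the Fortran code. Here is a minimal sketch of that calculation. The function names are hypothetical, and the actual benchmark may compute the ratio differently (for instance from unrounded timings):

```c
#include <stdio.h>

/* Hypothetical helper: speedup relative to the Fortran timing.
 * A result > 1 means the other implementation is faster than Fortran,
 * a result < 1 means it is slower. */
double speedup(double t_fortran, double t_other)
{
    return t_fortran / t_other;
}

/* Hypothetical helper: print one report line in the style used above. */
void print_speedup(const char *label, double t_fortran, double t_other)
{
    printf("Speedup (%s): %f\n", label, speedup(t_fortran, t_other));
}
```

With the size-1000 timings above, `speedup(0.072, 0.793)` is about 0.09, i.e. the reference BLAS is roughly 11x slower than the Fortran code.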
And with the Intel 21 compilers, linked with MKL (again, the report still says OpenBLAS, but it is MKL). MKL is about 2x faster than the C+AVX code, and the Fortran code is about 5x slower than the C+AVX code: ifort performs particularly poorly on this example. Note that the C+AVX version has about the same timings as when compiled with gcc.
% icc -c -Ofast -march=native -mtune=native -ipo cfile.c \
&& ifort -Ofast -march=native -mtune=native ffile.f90 cfile.o -qmkl=sequential -L/opt/intel/21/mkl/lib/intel64 -ipo \
&& ./a.out
cfile.c(15): warning #266: function "aligned_alloc" declared implicitly
return aligned_alloc(alignment, (size_t)size);
^
==================================================
Matrix size: 1000
Fortran(s): 0.280900
C-Intrinsics(s): 0.052900
OpenBLAS(s): 0.026900
Speedup (OpenBLAS): 10.442379
Speedup (C-Intrinsics): 5.310019
Error (Fortran): 0.000000
Error (C): 0.000000
==================================================
Matrix size: 1500
Fortran(s): 0.927600
C-Intrinsics(s): 0.175800
OpenBLAS(s): 0.084000
Speedup (OpenBLAS): 11.042857
Speedup (C-Intrinsics): 5.276451
Error (Fortran): 0.000000
Error (C): 0.000000
==================================================
Matrix size: 2000
Fortran(s): 2.214400
C-Intrinsics(s): 0.446800
OpenBLAS(s): 0.204900
Speedup (OpenBLAS): 10.807223
Speedup (C-Intrinsics): 4.956132
Error (Fortran): 0.000000
Error (C): 0.000000
Note that I had to comment out the deallocate in dgemm_c_intrinsics: it is not legal to deallocate on the Fortran side a pointer that has been allocated on the C side (it only works by chance with gcc+gfortran).
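One way to avoid this is to free the memory on the side that allocated it. Below is a minimal sketch, assuming the C file exposes a matching pair of functions. The names c_aligned_alloc and c_aligned_free are hypothetical and not taken from the original code. Including stdlib.h (and compiling in C11 mode or later) should also take care of the icc warning about aligned_alloc being declared implicitly, although I have not checked that with icc 21:

```c
#include <stddef.h>
#include <stdlib.h>   /* declares aligned_alloc (C11) and free */

/* Hypothetical wrapper: allocate at least `size` bytes aligned to
 * `alignment`. C11 asks for the size to be a multiple of the alignment,
 * so the size is rounded up here. Returns NULL on failure. */
void *c_aligned_alloc(size_t alignment, size_t size)
{
    if (alignment == 0) {
        return NULL;
    }
    size_t rounded = (size + alignment - 1) / alignment * alignment;
    return aligned_alloc(alignment, rounded);
}

/* Hypothetical wrapper: release memory obtained from c_aligned_alloc.
 * Memory returned by aligned_alloc is released with free. */
void c_aligned_free(void *ptr)
{
    free(ptr);
}
```

On the Fortran side, c_aligned_free would then be declared in an interface block with bind(C) and a type(c_ptr), value argument, and called in place of the deallocate.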