Achieving OpenBLAS DGEMM performance with Fortran vs C intrinsics: why is Fortran slower?

Perhaps using !$omp simd could provide some extra control? (It might just be a rabbit hole that never ends.) It depends on whether you still count that as pure Fortran; at least Intel Fortran and gfortran have the -qopenmp-simd and -fopenmp-simd flags respectively, which enable the SIMD directives without linking against the OpenMP runtime. The newer loop transformation constructs !$omp tile and !$omp unroll (added in OpenMP 5.1) might also help, although YMMV due to implementation differences among compilers, not to mention how the directives interact with each compiler's own optimization passes.
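To make the suggestion concrete, here is a minimal sketch of what I mean, not a tuned kernel. It is a naive C = C + A*B update with the !$omp simd hint on the unit-stride inner loop, checked against matmul. The program name, matrix sizes, tolerance, and build lines are my own illustrative assumptions; nothing here comes from OpenBLAS.

```fortran
! Minimal sketch (assumed example): naive DGEMM-style update with an
! !$omp simd hint on the innermost, unit-stride loop.
! Assumed build lines:
!   gfortran -O3 -march=native -fopenmp-simd simd_gemm.f90
!   ifx      -O3 -xHost        -qopenmp-simd simd_gemm.f90
program simd_gemm_sketch
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
  integer, parameter :: m = 64, n = 48, k = 32   ! illustrative sizes
  real(dp) :: a(m,k), b(k,n), c(m,n), c_ref(m,n)
  integer :: i, j, p

  call random_number(a)
  call random_number(b)
  c = 0.0_dp
  c_ref = matmul(a, b)          ! reference result from the intrinsic

  do j = 1, n
    do p = 1, k
      ! Ask the compiler to vectorize over i, which is contiguous
      ! in memory for column-major arrays.
      !$omp simd
      do i = 1, m
        c(i,j) = c(i,j) + a(i,p) * b(p,j)
      end do
    end do
  end do

  ! Compare against matmul with a loose tolerance (assumed value).
  if (maxval(abs(c - c_ref)) < 1.0e-10_dp) then
    print '(a)', 'ok'
  else
    print '(a)', 'mismatch'
  end if
end program simd_gemm_sketch
```

Without the -fopenmp-simd or -qopenmp-simd flag the directive is just a comment, so the same source still compiles as plain Fortran. Whether the hint changes the generated code at all is something you would have to check in the assembly or the compiler's vectorization report.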

A similar challenge was discussed in the thread "C++ Standard Library dense linear algebra interface" (see the posts from @tyranids, starting at #22).
