Achieving OpenBLAS DGEMM performance with Fortran vs C intrinsics: why is Fortran slower?

In a somewhat similar thread, already mentioned by @ivanpribec in post #6 above, there was a discussion about including BLAS etc. in the C++ standard library, with some guys prophesying the end of Fortran (a.k.a. EOF :grinning_face_with_smiling_eyes:). But they will struggle with the very same problem. There is no single superfast C/C++ function using inline assembly that would work for a general-purpose libstdc++ or any other library, since they all have to support a huge variety of CPU hardware. So IMHO the only way is to provide several versions of each function, built for different ‘-march’ or equivalent options.
That is not much different from what a Fortran implementation would have to provide for intrinsics like matmul to make them really fast.
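
To make the "several functions for different ‘-march’ options" idea concrete, here is a minimal sketch in C of how a library could ship two builds of the same kernel and pick one at run time. It assumes GCC or Clang on x86 (the `target` attribute and `__builtin_cpu_supports` are compiler extensions, not standard C). The names `matmul_fn`, `matmul_generic`, `matmul_avx2` and `select_matmul` are made up for illustration, and the naive loop nest is nowhere near what OpenBLAS actually does.

```c
#include <stddef.h>

/* Hypothetical kernel signature: C += A*B for n x n row-major matrices. */
typedef void (*matmul_fn)(size_t n, const double *a, const double *b, double *c);

/* Baseline kernel, compiled for whatever the default target is. */
void matmul_generic(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < n; ++k) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_AVX2_KERNEL 1
/* Same source, but the compiler is allowed to use AVX2 for this function only
   (assumption: GCC/Clang 'target' attribute on x86). */
__attribute__((target("avx2")))
void matmul_avx2(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < n; ++k) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
}
#endif

/* Pick a kernel based on the CPU the program is actually running on. */
matmul_fn select_matmul(void)
{
#ifdef HAVE_AVX2_KERNEL
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return matmul_avx2;
#endif
    return matmul_generic;
}
```

A caller would do `matmul_fn f = select_matmul();` once and then call `f(n, a, b, c)`. A compiler's runtime library could in principle do the same kind of selection behind an intrinsic like matmul, but whether and how any given compiler does so is implementation-specific.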