Is this matrix-matrix multiplication? That is compute bound: the cost is dominated by the multiplications and additions, not by memory traffic. For large enough matrices you can reach 100% of the theoretical peak performance in Fortran. I did that about 5 years ago and measured it, and I believe it is the same speed as OpenBLAS.

For small matrices OpenBLAS is faster, because then you have to hide the latency of memory reads and writes, and it gets complicated. However, if you have many small matrices to multiply, you can hide this cost. I have done that for matrix-vector multiply: if you have many vectors to multiply with the same matrix, you can vectorize efficiently and get very close to the theoretical peak performance, in Fortran.

But if you have just one matrix and one vector to multiply, you have to write very complicated assembly code that carefully balances the latency of the reads against the multiplies and additions. You can look at how OpenBLAS does it; it’s not easy. Unfortunately, you can’t do it from Fortran or C.
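To illustrate the many-vectors case, here is a rough sketch in C. It is not the code that was measured, only the general loop structure, and reaching close to peak would still need blocking and tuning on top of it. The function name `matvec_batch` and the storage layout (vector index contiguous in `x` and `y`) are assumptions made for the example. The point is that each matrix element is loaded once and reused across all vectors, so the innermost loop is unit stride and the compiler can vectorize it.

```c
#include <stddef.h>

/* Hypothetical helper: y = A * x for a batch of nvec vectors at once.
 *   a: m x n matrix, row-major, a[i*n + j]
 *   x: n x nvec, x[j*nvec + v] is element j of vector v
 *   y: m x nvec, y[i*nvec + v] is element i of result v
 * The layout (vector index contiguous) is an assumption chosen so that
 * the innermost loop runs over consecutive memory. */
void matvec_batch(size_t m, size_t n, size_t nvec,
                  const double *restrict a,
                  const double *restrict x,
                  double *restrict y)
{
    for (size_t i = 0; i < m; i++) {
        double *restrict yi = y + i * nvec;

        /* Clear the output row for all vectors. */
        for (size_t v = 0; v < nvec; v++)
            yi[v] = 0.0;

        for (size_t j = 0; j < n; j++) {
            /* One scalar load of A(i,j), reused for every vector. */
            const double aij = a[i * n + j];
            const double *restrict xj = x + j * nvec;

            /* Unit-stride multiply-add over the batch: vectorizable. */
            for (size_t v = 0; v < nvec; v++)
                yi[v] += aij * xj[v];
        }
    }
}
```

With a single vector (`nvec = 1`) the inner loop has nothing to vectorize over and every element of `a` is used exactly once, which is why that case depends so heavily on hiding memory latency.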
Can you post C code that is faster? It’s the same issue in C, so I don’t think there is any advantage. Typically one has to go down to assembly to hide the latency cost, if that is the issue. I have done that, and I can show how it is done if there is interest.