I was recently timing different implementations of some linear algebra code. I pinned down that a particularly “hot” line in my subroutine was a matrix-matrix multiplication

```fortran
B = eye(n) - matmul(transpose(A), A)/2
```

where the arrays have dimensions `A(m,n)` and `B(n,n)`, and `eye(n)` is an n-by-n identity matrix. I compiled using gfortran, linking against OpenBLAS, with the options

```
-O3 -funroll-loops -Wimplicit-interface -fPIC -fmax-errors=1 -fcoarray=single -lopenblas
```
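In full, the compile-and-link command looks roughly like this (the source file name here is just a placeholder):

```
gfortran -O3 -funroll-loops -Wimplicit-interface -fPIC -fmax-errors=1 -fcoarray=single mysub.f90 -lopenblas -o mysub
```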
I was a little surprised to find a huge performance boost if I instead wrote this expression with an explicit BLAS call

```fortran
B = eye(n)
call dgemm('t', 'n', n, n, m, -0.5d0, A, m, A, m, 1.d0, B, n) ! B <-- B - 0.5 A^T A
```
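For reference, the kind of comparison I’m running looks roughly like the sketch below. It is simplified: the sizes `m` and `n` are made up, `eye` is just a small helper here, and I’m only timing with `cpu_time`, but it shows the two variants side by side.

```fortran
! Simplified sketch, not my actual subroutine: the array sizes are arbitrary
! and dgemm is assumed to come from the OpenBLAS linked with -lopenblas.
program matmul_vs_dgemm
    implicit none
    integer, parameter :: m = 2000, n = 500
    double precision, allocatable :: A(:,:), B(:,:)
    double precision :: t0, t1

    allocate(A(m,n), B(n,n))
    call random_number(A)

    ! Version 1: intrinsic transpose/matmul
    call cpu_time(t0)
    B = eye(n) - matmul(transpose(A), A)/2
    call cpu_time(t1)
    print '(a,f8.3,a)', 'matmul/transpose: ', t1 - t0, ' s'

    ! Version 2: explicit BLAS call, B <-- B - 0.5 A^T A
    call cpu_time(t0)
    B = eye(n)
    call dgemm('t', 'n', n, n, m, -0.5d0, A, m, A, m, 1.d0, B, n)
    call cpu_time(t1)
    print '(a,f8.3,a)', 'dgemm:            ', t1 - t0, ' s'

contains

    ! n-by-n identity matrix, standing in for whatever eye() does in my code
    function eye(k) result(id)
        integer, intent(in) :: k
        double precision :: id(k,k)
        integer :: j
        id = 0.d0
        do j = 1, k
            id(j,j) = 1.d0
        end do
    end function eye

end program matmul_vs_dgemm
```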
In the past, I’d assumed (or hoped) that the original expression with the intrinsic `matmul` and `transpose` could be translated into the equivalent BLAS call automatically by the compiler. But when I sit down to think about it, I realize I have no sense of how difficult that would be in practice.
I really don’t know what’s reasonable to expect performance-wise, and I’m curious to hear about other people’s anecdotal experience using intrinsics like `transpose`, `matmul`, and `dot_product` versus calling BLAS directly. I’m also interested in how well different vendors handle this mapping from high-level intrinsics to lower-level BLAS calls. Maybe vendors that ship their own BLAS have the upper hand here?