Mapping matrix & vector arithmetic to BLAS calls

I was recently timing different implementations of some linear algebra code. I pinned down that a particularly “hot” line in my subroutine was a matrix-matrix multiplication

B = eye(n) - matmul(transpose(A), A)/2

where the arrays have dimensions A(m,n) and B(n,n), and eye(n) is an n-by-n identity matrix (a sketch of such a helper is shown after the flags below). I compiled using gfortran, linking against OpenBLAS, with the options

 -O3
 -funroll-loops
 -Wimplicit-interface
 -fPIC 
 -fmax-errors=1 
 -fcoarray=single
 -lopenblas
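
For context, eye is not a Fortran intrinsic; a minimal helper along the following lines is assumed:

  ! Assumed eye(n) helper (not shown in the original post):
  ! returns the n-by-n identity matrix.
  pure function eye(n) result(id)
    integer, intent(in) :: n
    double precision :: id(n,n)
    integer :: j
    id = 0d0
    do j = 1, n
       id(j,j) = 1d0
    end do
  end function eye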

I was a little surprised to find a huge performance boost if I instead wrote this expression with an explicit BLAS call

B = eye(n)
call dgemm('t', 'n', n, n, m, -0.5d0, A, m, A, m, 1.d0, B, n) ! B <-- B - 0.5 A^T A
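
For readers less familiar with the dgemm argument order, here is the same call again with every argument annotated (dgemm computes C <- alpha*op(X)*op(Y) + beta*C):

! Same call as above, just annotated.
call dgemm('t', 'n', & ! op(X) = A^T, op(Y) = A as-is
           n, n, m,  & ! C is n-by-n; the summed (inner) dimension is m
           -0.5d0,   & ! alpha = -1/2
           A, m,     & ! X = A, with leading dimension m
           A, m,     & ! Y = A, with leading dimension m
           1.d0,     & ! beta = 1 keeps the identity already stored in B
           B, n)       ! C = B, with leading dimension n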

In the past, I’d assumed (or hoped) that the original expression with the intrinsic matmul and transpose could be translated into the equivalent BLAS call automatically by the compiler. But when I sit down to think about it, I realize I have no sense of how difficult that would be in practice.

I really don’t know what’s reasonable to expect performance-wise, and I’m curious to hear about other people’s anecdotal experience using intrinsics like transpose, matmul, and dot_product versus calling BLAS directly. I’m also interested in how well different compilers handle this mapping from high-level intrinsics to lower-level BLAS calls. Maybe vendors that ship their own BLAS have the upper hand here?


Here’s a minimal program you can try yourself:

program main
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none

  interface
     subroutine dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
       character, intent(in) :: transa, transb
       integer, intent(in) :: m, n, k, lda, ldb, ldc
       double precision, intent(in) :: alpha, a(lda,*), beta, b(ldb,*)
       double precision, intent(in out) :: c(ldc,*)
     end subroutine dgemm
  end interface

  integer :: n, i
  double precision, allocatable :: A(:,:), C(:,:)
  integer(int64) :: tick, tock, times(2)

  write (*, '(a4,2(2x,a10),2x,a10)') 'n', 'intrinsic', 'blas', 'speedup'
  do i = 1, 12
     if (allocated(A)) deallocate (A)
     if (allocated(C)) deallocate (C)

     n = 2**i
     allocate (A(n,n), C(n,n))
     call random_number(A)
     C = 0d0
     
     call system_clock(tick)
     C = matmul(transpose(A), A)
     call system_clock(tock)
     times(1) = tock - tick

     call system_clock(tick)
     call dgemm('t', 'n', n, n, n, 1d0, A, n, A, n, 0d0, C, n)
     call system_clock(tock)
     times(2) = tock - tick

     write (*, '(i4,2(2x,i10),2x,f10.6)') n, times, times(1)/dble(times(2))
  end do

end program main
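
For what it’s worth, a quick sanity check inside the timing loop confirms that both paths compute the same product. This is a hypothetical addition, not part of the program above (C_ref would be one more n-by-n allocatable, mirroring C):

     ! Hypothetical check: save the intrinsic result, then compare against dgemm's.
     C_ref = C                          ! result of matmul(transpose(A), A)
     call dgemm('t', 'n', n, n, n, 1d0, A, n, A, n, 0d0, C, n)
     print *, n, 'max |difference| =', maxval(abs(C - C_ref))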

Output on Windows 10, without much attention paid to optimization:

C:\Users\nsha\source\matmul>gfortran -O2 main.f90 -lopenblas
C:\Users\nsha\source\matmul>a.exe
   n   intrinsic        blas     speedup
   2           1         302    0.003311
   4           2           9    0.222222
   8           4          10    0.400000
  16          18          12    1.500000
  32         191         103    1.854369
  64         616         212    2.905660
 128        4765        3232    1.474319
 256       41461        5588    7.419649
 512      334616       13865   24.133862
1024     3009317       90192   33.365675
2048    33927631      692799   48.971824
4096   280117754     5318665   52.666929

Note that for small-ish matrices (n < 500), the exact clock counts and speedups can vary a lot from run to run, but the last few rows are fairly consistent on my system.


I am using Intel Fortran linked with MKL, so I changed the 'blas' column label to 'mkl'. I also used John Burkardt’s wtime function to measure the time, so I modified your code slightly:

program main
  implicit none
  integer :: n, i
  double precision, allocatable :: A(:,:), C(:,:)
  double precision :: tick, tock, times(2)
  
  write (6, '(a10,2(2x,a15),2x,a15)') 'n', 'intrinsic', 'mkl', 'speedup'
  do i = 1, 13
     if (allocated(A)) deallocate (A)
     if (allocated(C)) deallocate (C)
     n = 2**i
     allocate (A(n,n), C(n,n))
     call random_number(A)
     C = 0d0
     
     tick = wtime()
     C = matmul(transpose(A), A)
     tock = wtime()
     times(1) = tock - tick

     tick = wtime()
     call dgemm('t', 'n', n, n, n, 1d0, A, n, A, n, 0d0, C, n)
     tock = wtime()
     times(2) = tock - tick

     write (6, '(i10,2(2x,f15.7),2x,f15.7)') n, times, times(1)/dble(times(2))
  end do
  contains
    function wtime ( )

!*****************************************************************************80
!! WTIME returns a reading of the wall clock time; it is from John Burkardt.
  implicit none
  integer, parameter :: rk = kind ( 1.0D+00 )
  integer clock_max
  integer clock_rate
  integer clock_reading
  real ( kind = rk ) wtime
  call system_clock ( clock_reading, clock_rate, clock_max )
  wtime = real ( clock_reading, kind = rk ) &
        / real ( clock_rate, kind = rk )
  return
  end function wtime
end program main

My results are:

         n        intrinsic              mkl          speedup
         2        0.0000000        0.0000000              NaN
         4        0.0000000        0.0000000              NaN
         8        0.0000000        0.0000000              NaN
        16        0.0310000        0.0000000         Infinity
        32        0.0000000        0.0000000              NaN
        64        0.0000000        0.0000000              NaN
       128        0.0000000        0.0000000              NaN
       256        0.0000000        0.0000000              NaN
       512        0.0000000        0.0160000        0.0000000
      1024        0.0390000        0.0750000        0.5200000
      2048        0.2230000        0.3730000        0.5978552
      4096        2.3240000        2.3370000        0.9944373
      8192       45.5000000       18.3510000        2.4794289

My flags are basically -O3 -xHost -qparallel -qmkl=cluster.
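
As an aside, the zeros (and hence the NaN/Infinity ratios) for small n are largely a timer-resolution artifact: with default-integer arguments, system_clock often ticks only about once per millisecond. A higher-resolution variant of wtime might look like the sketch below (the count rate obtained with int64 arguments is compiler dependent):

  function wtime() result(t)
    use, intrinsic :: iso_fortran_env, only: int64
    double precision :: t
    integer(int64) :: clock_reading, clock_rate
    ! With int64 arguments, compilers typically report a much finer
    ! count rate (microsecond resolution or better).
    call system_clock(clock_reading, clock_rate)
    t = real(clock_reading, kind(t)) / real(clock_rate, kind(t))
  end function wtime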

I think @kargl’s answer hits the nail on the head! Using BLAS for small matrices will not be as performant as you might expect; there is a non-trivial overhead with each call.


Here’s a relevant talk by Thomas König, “FortranCon2020 [SP]: Front-end optimization in gfortran”, starting at 22:22, where he talks about inlining matrix multiplication.


I was not aware of this! What a great tip. Indeed, when I inform gfortran about the external BLAS, it seems perfectly able to make the expected translation.

Updated results of my dgemm test
C:\Users\nsha\source\matmul>gfortran -O2 -fexternal-blas -fblas-matmul-limit=500 main.f90 -lopenblas
C:\Users\nsha\source\matmul>a.exe
   n   intrinsic        blas     speedup
   2           1         263    0.003802
   4           2           9    0.222222
   8           4          10    0.400000
  16          21          11    1.909091
  32         292         261    1.118774
  64        1929         213    9.056338
 128       14743        3006    4.904524
 256      139220        4536   30.692240
 512       15250       14578    1.046097
1024       90168       94684    0.952305
2048      708058      687204    1.030346
4096     5519320     5357747    1.030157

Thanks for remembering Thomas’s talk! I remember it for his great presentation of loop reordering and had forgotten the matmul discussion at the end.

If anyone hasn’t watched that talk, I highly recommend it!


For very small matrices on Intel processors/accelerators there’s also libxsmm. I’m not sure how it compares to gfortran’s inlining capabilities. Digging around, I also found this custom small matrix-matrix multiplication kernel from Nek5000: mxm_std.f

I found this SO reply quite relevant: