I’m using the Homebrew version of gfortran. For direct dgemm() calls, just add -framework accelerate
to the load step. For the indirect approach, add the -fexternal-blas
flag to the gfortran compile step, and it will convert the matmul() calls to dgemm() calls for you. The timings differ a little for small matrices, but for anything larger than about 100x100, the two approaches seem to perform about the same.
Here is a little sample program:
program kxk
   implicit none
   integer, parameter :: wp=selected_real_kind(14), n=1000
   real(wp) :: a(n,n), b(n,n), c(n,n)
   real(wp) :: cpu1, cpu0
   external :: dgemm   ! BLAS routine; linked from the Accelerate framework

   ! Fill a and b with random values in (-0.5, 0.5).
   call random_number( a ); a = a - 0.5_wp
   call random_number( b ); b = b - 0.5_wp
   c = 0.0_wp

   ! Time matmul(); with -fexternal-blas this is converted to a dgemm() call.
   call cpu_time(cpu0)
   c = matmul( a, b )
   call cpu_time(cpu1)
   write(*,*) 'c11=', c(1,1), 'cpu_time=', (cpu1-cpu0), ' GFLOPS=', 2*real(n,kind=wp)**3/(cpu1-cpu0)/1.e9_wp

   ! Time the direct dgemm() call: c = 1.0*a*b + 0.0*c.
   call cpu_time(cpu0)
   call dgemm( 'N', 'N', n, n, n, 1.0_wp, a, n, b, n, 0.0_wp, c, n )
   call cpu_time(cpu1)
   write(*,*) 'c11=', c(1,1), 'cpu_time=', (cpu1-cpu0), ' GFLOPS=', 2*real(n,kind=wp)**3/(cpu1-cpu0)/1.e9_wp
end program kxk
You can compile it with:
gfortran -O -fexternal-blas -framework accelerate -o kxk.x kxk.F90
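If you build in two steps instead, the flags split the way described above: -fexternal-blas belongs to the compile step and -framework accelerate to the load step. Something like:

gfortran -O -fexternal-blas -c kxk.F90
gfortran -framework accelerate -o kxk.x kxk.o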
When I run the code, I see
c11= 0.39803384021638449 cpu_time= 1.2389999999999998E-002 GFLOPS= 161.42050040355127
c11= 0.39803384021638449 cpu_time= 9.0930000000000004E-003 GFLOPS= 219.94941163532386
There is 10-15% or so variation in the times, I’m guessing because of timer resolution (which seems to be 1e-6 second) and maybe cache effects. If the operations were put in a loop, then the timer noise would be smaller.
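For example, something along these lines (a sketch to slot into the program above; nrep is an arbitrary repetition count I made up, and the declarations go with the others) would average the timer noise down:

integer, parameter :: nrep = 10   ! arbitrary; more repetitions, less timer noise
integer :: i

call cpu_time(cpu0)
do i = 1, nrep
   c = matmul( a, b )   ! or the direct dgemm() call
end do
call cpu_time(cpu1)
write(*,*) 'avg cpu_time=', (cpu1-cpu0)/nrep, ' GFLOPS=', 2*real(n,kind=wp)**3*nrep/(cpu1-cpu0)/1.e9_wp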