I found one problem, it seems

```
call cpu_time()
```

does not measure the time correctly when multiple threads are involved.

When using Intel OneAPI with MKL, I use John Burkardt’s `wtime()`

instead, so I run the code below,

```
program kxk
integer, parameter :: wp=selected_real_kind(14), n=5000
real(wp) :: a(n,n), b(n,n), c(n,n)
real(wp) :: cpu1, cpu0
call random_number( a ); a = a - 0.5_wp
call random_number( b ); b = b - 0.5_wp
c = 0.0_wp
cpu0 = wtime()
c = matmul( a, b )
cpu1 = wtime()
write(*,*) 'c11=', c(1,1), 'cpu_time=', (cpu1-cpu0), ' GFLOPS=', 2*real(n,kind=wp)**3/(cpu1-cpu0)/1.e9_wp
cpu0 = wtime()
call dgemm( 'N', 'N', n, n, n, 1.0_wp, a, n, b, n, 0.0_wp, c, n )
cpu1 = wtime()
write(*,*) 'c11=', c(1,1), 'cpu_time=', (cpu1-cpu0), ' GFLOPS=', 2*real(n,kind=wp)**3/(cpu1-cpu0)/1.e9_wp
contains
function wtime ( )
!*****************************************************************************80
!
!! WTIME returns a reading of the wall clock time.
!
! Discussion:
!
! To get the elapsed wall clock time, call WTIME before and after a given
! operation, and subtract the first reading from the second.
!
! This function is meant to suggest the similar routines:
!
! "omp_get_wtime ( )" in OpenMP,
! "MPI_Wtime ( )" in MPI,
! and "tic" and "toc" in MATLAB.
!
! Licensing:
!
! This code is distributed under the GNU LGPL license.
!
! Modified:
!
! 27 April 2009
!
! Author:
!
! John Burkardt
!
! Parameters:
!
! Output, real ( kind = rk ) WTIME, the wall clock reading, in seconds.
!
implicit none
integer, parameter :: rk = kind ( 1.0D+00 )
integer clock_max
integer clock_rate
integer clock_reading
real ( kind = rk ) wtime
call system_clock ( clock_reading, clock_rate, clock_max )
wtime = real ( clock_reading, kind = rk ) &
/ real ( clock_rate, kind = rk )
return
end function wtime
end program kxk
```

With Intel MKL’s matmul,

when n=5000, what I got is,

```
c11= 2.90899748439952 cpu_time= 1.50899999999092 GFLOPS=
165.672630882375
c11= 2.90899748439952 cpu_time= 4.14900000000489 GFLOPS=
60.2554832489046
```

Looks like MKL automatically parallelize `matmul`

, but not for `dgemm`

.