Does LAPACK/BLAS automatically use multi cores or threads?

I found one problem, it seems

call cpu_time()

does not measure the time correctly when multiple threads are involved.
When using Intel OneAPI with MKL, I use John Burkardt’s wtime() instead, so I run the code below,

program kxk
   integer, parameter :: wp=selected_real_kind(14), n=5000
   real(wp) :: a(n,n), b(n,n), c(n,n)
   real(wp) :: cpu1, cpu0

   call random_number( a ); a = a - 0.5_wp
   call random_number( b ); b = b - 0.5_wp
   c = 0.0_wp

   cpu0 = wtime()
   c = matmul( a, b )
   cpu1 = wtime()
   write(*,*) 'c11=', c(1,1), 'cpu_time=', (cpu1-cpu0), ' GFLOPS=', 2*real(n,kind=wp)**3/(cpu1-cpu0)/1.e9_wp

   cpu0 = wtime()
   call dgemm( 'N', 'N', n, n, n, 1.0_wp, a, n, b, n, 0.0_wp, c, n )
   cpu1 = wtime()
   write(*,*) 'c11=', c(1,1), 'cpu_time=', (cpu1-cpu0), ' GFLOPS=', 2*real(n,kind=wp)**3/(cpu1-cpu0)/1.e9_wp

contains
    function wtime ( )

!*****************************************************************************80
!
!! WTIME returns a reading of the wall clock time.
!
!  Discussion:
!
!    To get the elapsed wall clock time, call WTIME before and after a given
!    operation, and subtract the first reading from the second.
!
!    This function is meant to suggest the similar routines:
!
!      "omp_get_wtime ( )" in OpenMP,
!      "MPI_Wtime ( )" in MPI,
!      and "tic" and "toc" in MATLAB.
!
!  Licensing:
!
!    This code is distributed under the GNU LGPL license. 
!
!  Modified:
!
!    27 April 2009
!
!  Author:
!
!    John Burkardt
!
!  Parameters:
!
!    Output, real ( kind = rk ) WTIME, the wall clock reading, in seconds.
!
  implicit none

  integer, parameter :: rk = kind ( 1.0D+00 )

  integer clock_max
  integer clock_rate
  integer clock_reading
  real ( kind = rk ) wtime

  call system_clock ( clock_reading, clock_rate, clock_max )

  wtime = real ( clock_reading, kind = rk ) &
        / real ( clock_rate, kind = rk )

  return
  end function wtime  
end program kxk

With Intel MKL’s matmul,

when n=5000, what I got is,

 c11=   2.90899748439952      cpu_time=   1.50899999999092       GFLOPS=
   165.672630882375
 c11=   2.90899748439952      cpu_time=   4.14900000000489       GFLOPS=
   60.2554832489046

Looks like MKL automatically parallelize matmul, but not for dgemm.