Does LAPACK/BLAS automatically use multi cores or threads?

CRquantum · July 28, 2022, 10:37pm

I found one problem: it seems that

call cpu_time()

does not measure elapsed time correctly when multiple threads are involved (it typically reports CPU time summed over all threads rather than wall-clock time).
With Intel OneAPI and MKL, I use John Burkardt’s wtime() instead, so I ran the code below:

program kxk
   implicit none
   integer, parameter :: wp=selected_real_kind(14), n=5000
   real(wp) :: a(n,n), b(n,n), c(n,n)
   real(wp) :: cpu1, cpu0   ! despite the names, these hold wall-clock readings from wtime()

   call random_number( a ); a = a - 0.5_wp
   call random_number( b ); b = b - 0.5_wp
   c = 0.0_wp

   cpu0 = wtime()
   c = matmul( a, b )
   cpu1 = wtime()
   write(*,*) 'c11=', c(1,1), 'cpu_time=', (cpu1-cpu0), ' GFLOPS=', 2*real(n,kind=wp)**3/(cpu1-cpu0)/1.e9_wp

   cpu0 = wtime()
   call dgemm( 'N', 'N', n, n, n, 1.0_wp, a, n, b, n, 0.0_wp, c, n )
   cpu1 = wtime()
   write(*,*) 'c11=', c(1,1), 'cpu_time=', (cpu1-cpu0), ' GFLOPS=', 2*real(n,kind=wp)**3/(cpu1-cpu0)/1.e9_wp

contains
    function wtime ( )

!*****************************************************************************80
!
!! WTIME returns a reading of the wall clock time.
!
!  Discussion:
!
!    To get the elapsed wall clock time, call WTIME before and after a given
!    operation, and subtract the first reading from the second.
!
!    This function is meant to suggest the similar routines:
!
!      "omp_get_wtime ( )" in OpenMP,
!      "MPI_Wtime ( )" in MPI,
!      and "tic" and "toc" in MATLAB.
!
!  Licensing:
!
!    This code is distributed under the GNU LGPL license. 
!
!  Modified:
!
!    27 April 2009
!
!  Author:
!
!    John Burkardt
!
!  Parameters:
!
!    Output, real ( kind = rk ) WTIME, the wall clock reading, in seconds.
!
  implicit none

  integer, parameter :: rk = kind ( 1.0D+00 )

  integer clock_max
  integer clock_rate
  integer clock_reading
  real ( kind = rk ) wtime

  call system_clock ( clock_reading, clock_rate, clock_max )

  wtime = real ( clock_reading, kind = rk ) &
        / real ( clock_rate, kind = rk )

  return
  end function wtime  
end program kxk

With Intel MKL, for n=5000 (matmul first, then dgemm), I got:

 c11=   2.90899748439952      cpu_time=   1.50899999999092       GFLOPS=
   165.672630882375
 c11=   2.90899748439952      cpu_time=   4.14900000000489       GFLOPS=
   60.2554832489046

It looks like MKL automatically parallelizes matmul, but not dgemm.
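
To make the cpu_time() versus wall-clock difference easier to see in isolation, here is a minimal sketch that times the same region both ways. It is not part of the benchmark above: the program name, the loop, and the variable names are made up for illustration. It assumes the compiler accepts 64-bit integer arguments to system_clock (usually giving a finer tick than default integers), and the OpenMP directive only takes effect if OpenMP is enabled at compile time (for example -fopenmp with gfortran or -qopenmp with ifx); otherwise it is an ordinary comment and the loop runs serially.

```fortran
program timing_sketch
   use, intrinsic :: iso_fortran_env, only: int64, real64
   implicit none
   integer, parameter :: n = 2000000          ! illustrative problem size
   integer(int64) :: count0, count1, count_rate
   real(real64)   :: cpu0, cpu1, wall, s
   integer        :: i

   ! Take both readings immediately before the timed region.
   call system_clock(count0, count_rate)
   call cpu_time(cpu0)

   s = 0.0_real64
   ! Plain comment unless the code is compiled with OpenMP enabled.
   !$omp parallel do reduction(+:s)
   do i = 1, n
      s = s + sin(real(i, real64))**2
   end do
   !$omp end parallel do

   ! Take both readings again right after the timed region.
   call cpu_time(cpu1)
   call system_clock(count1)

   ! Wall-clock seconds = tick difference / ticks per second.
   ! (Clock wrap-around is ignored in this sketch.)
   wall = real(count1 - count0, real64) / real(count_rate, real64)

   write(*,*) 's             =', s
   write(*,*) 'cpu_time  (s) =', cpu1 - cpu0
   write(*,*) 'wall time (s) =', wall
end program timing_sketch
```

In a serial build the two timings should be close to each other. With several threads active, cpu_time() is processor dependent but commonly adds up the time spent by all threads, so it can come out several times larger than the wall time, which would make a GFLOPS figure computed from it look far too low.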
