Does LAPACK/BLAS automatically use multi cores or threads?

When I run this on my previous 2013 MacBook Pro with an intel i7 CPU (with framework accelerate), I typically see numbers in the 15-20 GFLOPS range. Gfortran has the same -fexternal-blas switch there, and it behaves the same way, lower numbers with the default matmul() and higher numbers with the compiler switch. I also have the intel compiler on that machine, so I will rerun it there to see what MKL does. I expect newer intel CPUs should perform better.

1 Like

Thanks @RonShepard .
In the kxk code, changed the

n=1000

to

n=5000

Since it seems n=1000 is too small to give accurate results on my laptop.

For n=5000, with Intel OneAPI, below is what I got,

c11=   2.90899748439952      cpu_time=   8.60937500000000       GFLOPS=
   29.0381125226860
 c11=   2.90899748439952      cpu_time=   5.17187500000000       GFLOPS=
   48.3383685800604

The default matmul() times look pretty good there. The intel compiler must share the same code as dgemm() without requiring any compiler flags. Was this the “classic” intel fortran compiler or the new llvm-based compiler?

1 Like

Below is the Intel Fortran version, 2021.6.0, Classic ifort,

It is interesting that if I choose to use Intel MKL’s matmul as matmul as show below, the option is
“Enable Matrix Multiply Library Call”, /Qopt-matmul,
I got slower results,

which is

 c11=   2.90899748439952      cpu_time=   9.15625000000000       GFLOPS=
   27.3037542662116
 c11=   2.90899748439952      cpu_time=   5.09375000000000       GFLOPS=
   49.0797546012270

However, if I use the default matmul as below, I seem to got better results,

which is

 c11=   2.90899748439952      cpu_time=   5.68750000000000       GFLOPS=
   43.9560439560440
 c11=   2.90899748439952      cpu_time=   4.09375000000000       GFLOPS=
   61.0687022900763

In both cases, intel MKL is enabled as below,

I have seen strange things before when tuning and benchmarking codes. I’ve seen computers that were faster when the leading matrix dimension was odd rather than even. That was related to memory interleaving features.

If you notice in the little benchmark code, I print out c(1,1). That is to avoid the subprogram calls getting entirely optimized away. I’ve seen things like that happen with benchmark codes. A “smart” compiler might figure out that c(1,1) could be computed just by itself as a dot product with 2*N operations rather than the full 2*N**3 operations. That apparently isn’t happening here.

1 Like

In C++ Eigen, matrix multiplication will use all threads when the OpenMP compilation flag is used.

I found one problem, it seems

call cpu_time()

does not measure the time correctly when multiple threads are involved.
When using Intel OneAPI with MKL, I use John Burkardt’s wtime() instead, so I run the code below,

program kxk
   integer, parameter :: wp=selected_real_kind(14), n=5000
   real(wp) :: a(n,n), b(n,n), c(n,n)
   real(wp) :: cpu1, cpu0

   call random_number( a ); a = a - 0.5_wp
   call random_number( b ); b = b - 0.5_wp
   c = 0.0_wp

   cpu0 = wtime()
   c = matmul( a, b )
   cpu1 = wtime()
   write(*,*) 'c11=', c(1,1), 'cpu_time=', (cpu1-cpu0), ' GFLOPS=', 2*real(n,kind=wp)**3/(cpu1-cpu0)/1.e9_wp

   cpu0 = wtime()
   call dgemm( 'N', 'N', n, n, n, 1.0_wp, a, n, b, n, 0.0_wp, c, n )
   cpu1 = wtime()
   write(*,*) 'c11=', c(1,1), 'cpu_time=', (cpu1-cpu0), ' GFLOPS=', 2*real(n,kind=wp)**3/(cpu1-cpu0)/1.e9_wp

contains
    function wtime ( )

!*****************************************************************************80
!
!! WTIME returns a reading of the wall clock time.
!
!  Discussion:
!
!    To get the elapsed wall clock time, call WTIME before and after a given
!    operation, and subtract the first reading from the second.
!
!    This function is meant to suggest the similar routines:
!
!      "omp_get_wtime ( )" in OpenMP,
!      "MPI_Wtime ( )" in MPI,
!      and "tic" and "toc" in MATLAB.
!
!  Licensing:
!
!    This code is distributed under the GNU LGPL license. 
!
!  Modified:
!
!    27 April 2009
!
!  Author:
!
!    John Burkardt
!
!  Parameters:
!
!    Output, real ( kind = rk ) WTIME, the wall clock reading, in seconds.
!
  implicit none

  integer, parameter :: rk = kind ( 1.0D+00 )

  integer clock_max
  integer clock_rate
  integer clock_reading
  real ( kind = rk ) wtime

  call system_clock ( clock_reading, clock_rate, clock_max )

  wtime = real ( clock_reading, kind = rk ) &
        / real ( clock_rate, kind = rk )

  return
  end function wtime  
end program kxk

With Intel MKL’s matmul,

when n=5000, what I got is,

 c11=   2.90899748439952      cpu_time=   1.50899999999092       GFLOPS=
   165.672630882375
 c11=   2.90899748439952      cpu_time=   4.14900000000489       GFLOPS=
   60.2554832489046

Looks like MKL automatically parallelize matmul, but not for dgemm.

The final numbers look good, but beware of the values of the clock_reading, clock_rate, and clock_max integers. Those are 32-bit values. The clock_rate value, if it is really the CPU clock rate, likely overflows, and when timings exceed a couple of seconds, the clock_reading value will wrap one or more times. Of course, the returned values might not be the CPU clock rate, in which case everything is alright, but it is something to watch for.

A possible workaround, if default integer size is a problem and if it is supported, is to use 64-bit integers.

1 Like

Here are the Apple M1 numbers without the blas substitution flag with the (1000,1000) matrices, and with the products put inside of a loop that does 10 passes. Thus the cache load effects should be minimized. I think this cpu has 24 MB of shared cache, so all three matrices should fit once they are loaded.

$ gfortran -O -framework accelerate  kxk.F90 && a.out
 c11=   1.4732006660396533      cpu_time=  0.48454799999999998       GFLOPS=   41.275580541040313     
 c11=   1.4732006660396513      cpu_time=   7.0429000000000019E-002  GFLOPS=   283.97393119311641

I think that first number is a 2-thread single-core timing, but as others have explained previously, it does multiple operation dispatches per clock cycle in each thread.

1 Like

I wanted to add one more comment about parallel linear algebra codes. LAPACK does not typically do anything explicitly parallel. That is, there are no OpenMP or MPI calls in LAPACK. However, as shown in this thread, the underlying BLAS routines used by LAPACK do sometimes employ multiple threads and multiple dispatch of functional units. Also, many LAPACK routines use a WORK(*) argument, and internally they divide tasks into smaller blocks based on the size of that workspace array, so there is some indirect tuning that can occur through the LAPACK subroutine arguments that affect eventually how the underlying parallel BLAS routines perform.

For a parallel linear algebra library that does explicitly use OpenMP and MPI, look for example at Scalapack: ScaLAPACK — Scalable Linear Algebra PACKage

1 Like

LAPACK is explicitly threaded now, at least in the better implementations. MKL threads at the LAPACK level, not just with BLAS calls, since that’s not efficient. Even reference LAPACK has some threading now, e.g. lapack/zhetrd_hb2st.F at 2614d23900915235f9652a5a2ad3e1873e7ac9f0 · Reference-LAPACK/lapack · GitHub, and even using OpenMP tasking (lapack/dsytrd_sb2st.F at 2614d23900915235f9652a5a2ad3e1873e7ac9f0 · Reference-LAPACK/lapack · GitHub).

3 Likes

Those interesting in Apple M1 DGEMM performance should read Benchmarking the Apple M1 Max. Accelerate uses AMX. Other BLAS libraries using Neon will be limited to the Neon peak, which is ~200 GF/s FP64 (~3 GHz * 8 cores * 8 flop/clock). The CPU floating-point pipeline is described as 4 FADD and 4 FMUL per cycle (Apple's Humongous CPU Microarchitecture - Apple Announces The Apple Silicon M1: Ditching x86 - What to Expect, Based on A14). OpenBLAS hits ~160 GF/s in DGEMM for m=n=k=8000, which is consistent with this.

3 Likes

Hi Jeff. That is good to know. I have never seen those OpenMP directives before. Do you know when they were added? Also, do you know about any BLAS libraries that use GPUs, either on intel or apple CPUs?

1 Like

Just curious, why not using -O3?

For that little program, the optimization does not affect the times. It shows the same results with -O0, -O3, or anything else. Apparently that is not the case with the MKL library, although I don’t think I had ever noticed that. The -O optimization level is the default in my makefile for gfortran, so that is what I cut and pasted into the earlier reply.

1 Like

@RonShepard I can’t tell from the git history, but it was some time before 2019 (Make "OMP task depend" sections conditional on OpenMP4, not just OpenMP · Reference-LAPACK/lapack@3e5c803 · GitHub), since that is when they updated the preprocessor guards on that code.

As for BLAS libraries that use GPUs, there are a few. CUBLAS is a GPU BLAS library that uses different symbols, and has asynchronous calls that operate on device (or managed) memory. The NVBLAS wrapper (NVBLAS :: CUDA Toolkit Documentation) intercepts the standard BLAS symbols, which makes it easy to use, but it’s possible that some use cases won’t run optimally, because the GPU executed BLAS calls will do CPU<->GPU data movement internally. It’s probably a good idea to use managed memory in such an application, e.g. by use F90 allocatable arrays and the NVFortran flag -gpu=managed.

I think Intel MKL is trying to do some GPU BLAS calls using the standard symbols, but with OpenMP target directives to control execution and data movement.

AMD has a GPU BLAS library but I don’t know if it supports anything other than classic asynchronous usage on device memory with C/C++ symbols (HIP/ROC BLAS).

I don’t know if Apple BLAS uses the GPU, but since Apple doesn’t support Fortran, and their GPU doesn’t do FP64, I’m not sure it matters much.

I wonder what it would take for Apple to more actively support fortran? In the past, they have relied on, for example, IBM to support fortran on PowerPC hardware and on intel to support fortran on intel hardware, but now that they are making their own CPU chips, it seems like they should now step up and do it themselves. There is an active llvm fortran project, and Apple does support other llvm-based compilers. How could fortran programmers, perhaps through this discussion group, encourage Apple to do that?

Apple does not care about Fortran because the market associated with it is irrelevant to the Mac business unit. In 2021, Apple made more money on iPads alone than the entire Intel Data Center Group business. There is absolutely nothing anyone can do to make Apple do anything with Fortran. The best thing for the Fortran community can do for MacOS users is to continue to support the GCC Fortran effort that supports MacOS today, and to increase support for LLVM Fortran, since that would leverage Apple’s existing (substantial) investment into LLVM.

I’ll note that non-zero Apple employees care about Fortran, and one of my friends there has been helpful regarding GCC Fortran, but that’s a long, long way from Apple having a Fortran compiler. They don’t even enable OpenMP in Clang.

1 Like

Why should apple care about Fortran? Should they also have their own compilers for Java, Go, and PHP (all of which are more common)? If the Fortran community isn’t large enough to support the second most common desktop OS, why should Apple invest in it? Similarly, if Fortran requires a special compiler from every hardware manufacturer to be performance competitive, is that a vote of confidence in it’s ability to endure?

Sounds like good advice. I will admit that so far I’m pretty happy with the way gfortran works on the Apple arm64 hardware. I know there are other fortran compilers available too, but gfortran is the only one I’ve tried so far.