Does LAPACK/BLAS automatically use multiple cores or threads?

Dear all,

Just a quick question: I know NumPy in Python uses LAPACK/BLAS, which is why it is fast.

According to a reply elsewhere, the Anaconda version of Python seems to use a parallelized, precompiled BLAS/LAPACK.

I wonder, in Fortran, if we use subroutines/functions from LAPACK/BLAS, do we need additional compile and/or link flags so that LAPACK/BLAS can use multiple cores/threads?

Or do flags like -O3 -march=native or Intel’s -O3 -xHost already make LAPACK/BLAS parallel?

I know Intel Fortran has flags like /libs:static /threads /Qmkl:cluster when compiling (on Windows; on Linux the flags are similar), so that it uses MKL’s parallel version of LAPACK/BLAS.
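If I understand the documentation correctly, the Linux equivalents look roughly like this (just a sketch; older ifort versions spell the option -mkl= instead of -qmkl=):

ifort -O3 -xHost -qmkl=parallel prog.f90      # threaded MKL BLAS/LAPACK
ifort -O3 -xHost -qmkl=sequential prog.f90    # single-threaded MKL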

Does gfortran have similar options?

Thanks much in advance!

You do know that LAPACK/BLAS are libraries, so the behavior depends on how a particular implementation is coded; there is no single source for them. You’re aware that Intel MKL does this automatically, but MKL doesn’t care how you compile your Fortran source.

1 Like

These flags aren’t required; you can link MKL in the normal way using the libraries themselves (which is what the Link Line Advisor for Intel® oneAPI Math Kernel Library does, and what I personally prefer). If you look at the options at that link, there are options for gfortran and a certain green compiler.
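For reference, a typical line the advisor produces for gfortran with the threaded (GNU OpenMP) MKL on 64-bit Linux looks roughly like the following; treat it as a sketch, since the exact library names depend on your MKL version and the interface/threading layers you pick:

gfortran prog.f90 -m64 -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl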

As for what other implementations of BLAS/LAPACK do? You’d have to consult their documentation.
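For instance, OpenBLAS builds are usually threaded out of the box; a rough sketch of using it from gfortran (assuming the library is installed where the linker can find it) would be:

gfortran -O2 prog.f90 -lopenblas
export OPENBLAS_NUM_THREADS=4    # or OMP_NUM_THREADS, depending on how OpenBLAS was built
./a.out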

1 Like

Here is an odd thing I just noticed over the weekend. I just got a new MacBook Pro computer. It has an Apple M1 CPU. I’m just now installing compilers and stuff. For the last 25 years, I have done all of my code development on my laptop computers, so I have a lot of legacy stuff to port over.

One step in that process is to tune some library routines that do matrix products. I have about 20 different algorithms implemented, including explicit loops and mixtures of loops with dot_product() and daxpy(), and so on. And there are also options like gfortran’s -fexternal-blas that come into play in this process.

So this benchmark dates back to the early 1980s. When the level-2 and level-3 BLAS routines were developed later, I added those too. In the early 1980s, computers were typically less than 1 MFLOPS, up to maybe 10 MFLOPS. By the late 1980s, typical timings were in the 100 to 500 MFLOPS range. By 2000, they were up to 1 GFLOPS or so. My last laptop was a 2013 MacBook Pro that was about 20 GFLOPS peak. This new computer is turning in numbers like 270 GFLOPS.

So how exactly is that happening? I don’t really know. The clock speed is variable on this machine, in the 2 to 3 GHz range depending on the load. So even with one floating-point operation per cycle, that would only be, say, 2 GFLOPS per core. But there are 10 cores, so if the compiler or the library routine (DGEMM, DGEMV, etc.) uses multiple cores, that would get me up to around 20 GFLOPS. The hardware also has GPUs, so to get to 270 GFLOPS, the compiler and/or the library routines must be using the GPUs. The theoretical peak performance for this laptop is supposed to be something like 4 TFLOPS, so I’m only getting a small fraction of that right now, but these timings are remarkable nonetheless.

At first I did not believe those numbers. I had assumed that my timer routine was off, or the compiler was optimizing away a do loop or something (those things have happened in the past). But I’ve checked now, and those numbers look legit. It is remarkable. So my conclusion is that the mathematical library routines on this machine must be both running in parallel over the cores and also using the GPU hardware.

There is one other notable thing. My first laptops, in the mid 1990s, were Apple PowerBooks using PowerPC CPUs. Then in 2005 Apple switched to Intel CPUs. I had some old unformatted files sitting around during that switch, and when I tried to use them, to my surprise they worked. Both CPUs used 32-bit two’s-complement integers and IEEE floating point, and I guess the compilers I was using chose the same unformatted file formats, so everything worked. Well, I tried those same unformatted files on this new machine, and they still work. So I have unformatted files from about 25 years ago that have been used by multiple compilers across multiple CPU architectures and still work.

2 Likes

The thing you’re missing is that 1 flop per cycle is really bad performance. Modern CPUs have FMA (fused multiply-add), which counts as 2 flops in one instruction; the M1 has a 128-bit vector width (so 2 double-precision FMAs at a time), and it can dispatch 4 of these vector FMAs per clock cycle. This gets you to a theoretical performance of 32-48 GFLOPS per core.
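Spelling that out with rough numbers (my arithmetic, assuming double precision and the 8 performance cores mentioned elsewhere in this thread):

2 flops/FMA x 2 doubles per 128-bit vector x 4 FMAs/cycle = 16 flops/cycle per core
16 flops/cycle x 2-3 GHz = 32-48 GFLOPS per core
32-48 GFLOPS/core x 8 performance cores = roughly 250-380 GFLOPS from the CPU cores alone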

1 Like

In these benchmark timings, I see some numbers in the 3-5 GFLOPS range. These are presumably single-core timings of code that issues a single FMA per clock cycle; this is what gfortran does with three nested do loops. Then there are other numbers in the 8-10 GFLOPS range, which might be single-core results with multiple FMAs at a time. The remaining results are up above 100 GFLOPS, so those must be the multiple-core, multiple-dispatch results. Gfortran doesn’t produce those directly; it is only through the -fexternal-blas option or a direct call to dgemm() that I see those numbers. So it looks like the M1 chip is capable of 300 GFLOPS or more without using the GPUs.

Do you have any idea how much dgemm() is memory-bandwidth limited? The daxpy() inner-loop timings are usually better than the dot_product() inner-loop timings. This is the way the Cray used to be, unlike most other CPUs. That suggests memory bandwidth is not the bottleneck limiting the single-core results, but it could still be the limit for the multicore results.

BTW, the dgemm() I’m using here is the one Apple supplies with -framework accelerate.

1 Like

Generally, single-threaded matmul isn’t memory-bottlenecked (unless the other cores are doing things that eat memory bandwidth). On almost all modern CPUs, multithreaded matmul is memory-bottlenecked, but the M1 has ridiculous memory bandwidth, so it’s possible that on an M1 it’s CPU-bottlenecked even with multiple cores.

1 Like

I believe the Accelerate framework on the M1 uses the AMX2 coprocessor. Each cluster of 4 performance cores has one. You mention 10 cores, which would be the M1 Max with 8 performance cores and thus two AMX2 blocks. The Accelerate framework will use two threads in that case.

2 Likes

Wow, this is amazing. Can I ask what compiler you are using? I have a Homebrew gfortran build on my M1 and would like to give it a try, but I am not sure it can link against Apple’s frameworks directly.

1 Like

I’m using the Homebrew version of gfortran. Just add -framework accelerate to the link step for direct dgemm() calls. For the indirect approach, just add the -fexternal-blas flag to the gfortran compile step, and it will convert the matmul() calls to dgemm() calls for you. The timings are a little different for small matrices, but for anything larger than about 100x100, the two approaches seem to perform about the same.

Here is a little sample program:

program kxk
   integer, parameter :: wp=selected_real_kind(14), n=1000
   real(wp) :: a(n,n), b(n,n), c(n,n)
   real(wp) :: cpu1, cpu0

   call random_number( a ); a = a - 0.5_wp
   call random_number( b ); b = b - 0.5_wp
   c = 0.0_wp

   call cpu_time(cpu0)
   c = matmul( a, b )
   call cpu_time(cpu1)
   write(*,*) 'c11=', c(1,1), 'cpu_time=', (cpu1-cpu0), ' GFLOPS=', 2*real(n,kind=wp)**3/(cpu1-cpu0)/1.e9_wp

   call cpu_time(cpu0)
   call dgemm( 'N', 'N', n, n, n, 1.0_wp, a, n, b, n, 0.0_wp, c, n )
   call cpu_time(cpu1)
   write(*,*) 'c11=', c(1,1), 'cpu_time=', (cpu1-cpu0), ' GFLOPS=', 2*real(n,kind=wp)**3/(cpu1-cpu0)/1.e9_wp

end program kxk

You can compile it with:

gfortran -O -fexternal-blas -framework accelerate -o kxk.x kxk.F90

When I run the code, I see

 c11=  0.39803384021638449      cpu_time=   1.2389999999999998E-002  GFLOPS=   161.42050040355127     
 c11=  0.39803384021638449      cpu_time=   9.0930000000000004E-003  GFLOPS=   219.94941163532386

There are 10-15% or so variations in the times, I’m guessing because of timer resolution (which seems to be 1e-6 seconds) and maybe cache effects. If the operations were put in a loop, the timer noise would be smaller.
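For example, a minimal sketch of what I mean by putting the operation in a loop (the nudge to a(1,1) is just there to keep the compiler from treating the repeated matmul as redundant):

program kxk_loop
   integer, parameter :: wp=selected_real_kind(14), n=1000, nrep=10
   real(wp) :: a(n,n), b(n,n), c(n,n)
   real(wp) :: cpu1, cpu0
   integer :: i

   call random_number( a ); a = a - 0.5_wp
   call random_number( b ); b = b - 0.5_wp

   call cpu_time(cpu0)
   do i = 1, nrep
      c = matmul( a, b )
      a(1,1) = a(1,1) + 1.e-30_wp * c(1,1)   ! make each iteration depend on the previous one
   end do
   call cpu_time(cpu1)
   write(*,*) 'average GFLOPS=', nrep*2*real(n,kind=wp)**3/(cpu1-cpu0)/1.e9_wp

end program kxk_loop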

2 Likes

So the CPU has AMX2 coprocessors in addition to GPUs? Do you know if there is any way to access the GPUs for things like dgemm()?

Yes, I do see two threads with top when these benchmarks are running. But I don’t see 800%-type numbers for CPU usage, only 100%-type numbers.

1 Like

Yes, the AMX blocks are separate from the GPUs. I have no idea how to access the GPUs, let alone whether that would help in speeding up dgemm.

1 Like

The answer may vary depending on configuration. Here is a URL that provides useful information.
Parallel linear algebra for multicore system - Stack Overflow.

As mentioned by @larsmans, with (say) MKL you still use the LAPACK + BLAS interfaces; you just find a tuned, multithreaded version for your platform. MKL is great, but expensive. Other open-source options include:

  • OpenBLAS / GotoBLAS: the Nehalem support should work OK, but there is no tuned support yet for Westmere. Does multithreading very well.
  • ATLAS: automatically tunes to your architecture at installation time; probably slower for “typical” matrices (e.g., square SGEMM) but can be faster for odd cases, and for Westmere it may even beat OpenBLAS/GotoBLAS (I haven’t tested this myself). Mostly optimized for the serial case, but does include parallel multithreaded routines.
  • PLASMA: a LAPACK implementation designed specifically for multicore.

I’d also agree with Mark’s comment; depending on what LAPACK routines you’re using, the distributed memory stuff with MPI might actually be faster than the multithreaded. That’s unlikely to be the case with BLAS routines, but for something more complicated (say the eigenvalue/vector routines in LAPACK) it’s worth testing. While it’s true that MPI function calls are an overhead, doing things in a distributed-memory mode means you don’t have to worry so much about false sharing, synchronizing access to shared variables, etc.
"

1 Like

Wow, that runs almost as fast on my M1 Pro with -framework Accelerate:

federico@Federicos-MacBook-Pro ~ % ./kxk.x
 c11=   2.6598316725956797      cpu_time=   1.3627999999999998E-002  GFLOPS=   146.75667742882302     
 c11=   2.6598316725956797      cpu_time=   1.1785999999999998E-002  GFLOPS=   169.69285593076535     

and there’s a speed-up factor of more than 4x w.r.t. the intrinsic matmul, unbelievable.

1 Like

Thanks all!

@RonShepard @FedericoPerini:
I ran the kxk code using Intel Fortran with -O3 -xHost on my laptop (Xeon 2186M, Windows 10). I believe I am already using MKL. I am currently running other programs, so the result is not accurate.

Anyway, it seems the speed I got is much, much slower than the ones you listed:

 c11=  -2.31481358685203      cpu_time=  0.328125000000000       GFLOPS=  6.09523809523809
 c11=  -2.31481358685203      cpu_time=  0.390625000000000       GFLOPS=  5.12000000000000

I am a little puzzled.
Has anyone gotten similar results on a non-Mac-M1 machine?
Is there a way to improve the result on a non-M1 machine?
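One thing I still need to check (if I understand the MKL documentation correctly) is whether the parallel MKL is really being linked and allowed to use all the cores, e.g. something like

set MKL_NUM_THREADS=12
ifort /O3 /QxHost /Qmkl:parallel kxk.f90

on Windows, or

export MKL_NUM_THREADS=12
ifort -O3 -xHost -qmkl=parallel kxk.f90

on Linux.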

I think that 4x is the multiple dispatch that others have already commented on. I’m just now learning about the features of this chip, so it is nice to see some of that explained here.

If you switch the order of the two operations, putting dgemm() first and matmul() last, the second one seems to perform better than the first. This is probably a cache effect: the first operation fills the various cache levels with the data, and the second can reuse some of that effort. Also, if you put loops around the operations (within the timer steps), the data lives in cache longer and the timings are pretty consistently improved.

This is just an artificial benchmark, but this knowledge of multiple dispatches and cache behavior is how actual application codes are tuned to improve performance too.

1 Like

When I run this on my previous 2013 MacBook Pro with an intel i7 CPU (with framework accelerate), I typically see numbers in the 15-20 GFLOPS range. Gfortran has the same -fexternal-blas switch there, and it behaves the same way, lower numbers with the default matmul() and higher numbers with the compiler switch. I also have the intel compiler on that machine, so I will rerun it there to see what MKL does. I expect newer intel CPUs should perform better.

1 Like

Thanks @RonShepard.
In the kxk code, I changed

n=1000

to

n=5000

since it seems n=1000 is too small to give accurate results on my laptop.

For n=5000, with Intel oneAPI, below is what I got:

 c11=   2.90899748439952      cpu_time=   8.60937500000000       GFLOPS=   29.0381125226860
 c11=   2.90899748439952      cpu_time=   5.17187500000000       GFLOPS=   48.3383685800604

The default matmul() times look pretty good there. The Intel compiler must share the same code as dgemm() without requiring any compiler flags. Was this the “classic” Intel Fortran compiler or the new LLVM-based compiler?

1 Like

Below is the Intel Fortran version: 2021.6.0, classic ifort.

It is interesting that if I choose to have matmul call Intel MKL’s matrix-multiply routine, via the option “Enable Matrix Multiply Library Call” (/Qopt-matmul) shown below, I get slower results:

 c11=   2.90899748439952      cpu_time=   9.15625000000000       GFLOPS=   27.3037542662116
 c11=   2.90899748439952      cpu_time=   5.09375000000000       GFLOPS=   49.0797546012270

However, if I use the default matmul, as shown below, I seem to get better results:

 c11=   2.90899748439952      cpu_time=   5.68750000000000       GFLOPS=   43.9560439560440
 c11=   2.90899748439952      cpu_time=   4.09375000000000       GFLOPS=   61.0687022900763

In both cases, Intel MKL is enabled, as shown below.