Matmul benchmark

I want to compare the elapsed time for matrix multiplication using the following methods:

  • The intrinsic matmul function
  • The dgemm function
  • A coarray version of matmul (utilizing the intrinsic matmul)
  • A coarray version of matmul (utilizing dgemm)

As the number of images increases, the elapsed time for the plain matmul and dgemm calls also increases. I would expect it to stay roughly constant regardless of the number of images.

How can I make a fair comparison of the elapsed time for these four approaches?

Results for -coarray-num-images=1
 Elapsed time (matmul):  1.645 [s]
 Elapsed time (dgemm):  0.768 [s]
 Elapsed time (coarray with matmul) image= 1:  1.696 [s]
 Elapsed time (coarray with dgemm) image= 1:  0.823 [s]
Results for -coarray-num-images=2
 Elapsed time (matmul):  1.687 [s] ??
 Elapsed time (dgemm):  0.783 [s] ??
 Elapsed time (coarray with matmul) image= 1:  0.899 [s]
 Elapsed time (coarray with matmul) image= 2:  0.899 [s]
 Elapsed time (coarray with dgemm) image= 2:  0.447 [s]
 Elapsed time (coarray with dgemm) image= 1:  0.447 [s]
Results for -coarray-num-images=4
 Elapsed time (matmul):  2.173 [s] ??
 Elapsed time (dgemm):  0.927 [s] ??
 Elapsed time (coarray with matmul) image= 3:  0.553 [s]
 Elapsed time (coarray with matmul) image= 4:  0.553 [s]
 Elapsed time (coarray with matmul) image= 1:  0.553 [s]
 Elapsed time (coarray with matmul) image= 2:  0.553 [s]
 Elapsed time (coarray with dgemm) image= 3:  0.287 [s]
 Elapsed time (coarray with dgemm) image= 4:  0.287 [s]
 Elapsed time (coarray with dgemm) image= 1:  0.287 [s]
 Elapsed time (coarray with dgemm) image= 2:  0.287 [s]
Results for -coarray-num-images=8
 Elapsed time (matmul):  2.355 [s] ??
 Elapsed time (dgemm):  1.238 [s] ??
 Elapsed time (coarray with matmul) image= 3:  0.373 [s]
 Elapsed time (coarray with matmul) image= 1:  0.373 [s]
 Elapsed time (coarray with matmul) image= 6:  0.373 [s]
 Elapsed time (coarray with matmul) image= 4:  0.374 [s]
 Elapsed time (coarray with matmul) image= 7:  0.374 [s]
 Elapsed time (coarray with matmul) image= 5:  0.374 [s]
 Elapsed time (coarray with matmul) image= 8:  0.374 [s]
 Elapsed time (coarray with matmul) image= 2:  0.374 [s]
 Elapsed time (coarray with dgemm) image= 3:  0.232 [s]
 Elapsed time (coarray with dgemm) image= 1:  0.232 [s]
 Elapsed time (coarray with dgemm) image= 4:  0.232 [s]
 Elapsed time (coarray with dgemm) image= 2:  0.232 [s]
 Elapsed time (coarray with dgemm) image= 7:  0.232 [s]
 Elapsed time (coarray with dgemm) image= 6:  0.232 [s]
 Elapsed time (coarray with dgemm) image= 5:  0.232 [s]
 Elapsed time (coarray with dgemm) image= 8:  0.232 [s]
Code
program benchmark1

   use kinds,     only: rk
   use fortime,   only: timer
   use formatmul, only: matmul, matmul_blas

   implicit none

   real(rk), allocatable :: A(:,:), B(:,:)
   real(rk), allocatable :: C(:,:)
   type(timer)           :: t
   integer               :: m, n, o, i, l
   character(2) :: im

   ! C(m,o) = A(m,n).B(n,o)
   m = 4000
   n = 3000
   o = 2000

   l = 5

   allocate(A(m,n),B(n,o))
   call random_number(A)
   call random_number(B)

   sync all

   if (this_image() == 1) call t%timer_start()
   do i = 1,l
      C = matmul(A,B) ! intrinsic matmul
   end do
   if (this_image() == 1) call t%timer_stop(message=' Elapsed time (matmul):',nloops=l)

   sync all

   if (this_image() == 1) call t%timer_start()
   do i = 1,l
      C = matmul_blas(A,B) ! dgemm
   end do
   if (this_image() == 1) call t%timer_stop(message=' Elapsed time (dgemm):',nloops=l)

   sync all

   call t%timer_start()
   do i = 1,l
      C = matmul(A,B,'coarray','matmul') ! coarray version of matmul (utilizes intrinsic matmul)
   end do
   write (im, "(I2)") this_image()
   call t%timer_stop(message=' Elapsed time (coarray with matmul) image='//trim(im)//':',nloops=l)

   sync all

   call t%timer_start()
   do i = 1,l
      C = matmul(A,B,'coarray','blas') ! coarray version of matmul (utilizes dgemm).
   end do
   call t%timer_stop(message=' Elapsed time (coarray with dgemm) image='//trim(im)//':',nloops=l)

end program benchmark1

On which machine do you measure the time? Does it have a fixed clock rate? If not, your CPU may be getting slower as it heats up during the benchmark.

The clock rate obtained from call system_clock(count_rate=clock_rate) is fixed at 10000. Here is the output of lscpu on my Linux system:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            13
    CPU(s) scaling MHz:  94%
    CPU max MHz:         5000,0000
    CPU min MHz:         800,0000
    BogoMIPS:            4800,00
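
For reference, a minimal sketch of the query mentioned above; note that this count rate is the tick frequency of the timer behind system_clock, not the clock frequency of the CPU core:

program clock_rate
   implicit none
   integer :: rate
   call system_clock(count_rate=rate)
   print *, 'system_clock count_rate =', rate   ! 10000 on the system above
end program clock_rate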

I’m not 100% sure, but this count rate seems to be something other than the variable clock of the core, which runs with turbo boost at the beginning and throttles after a while. Have you tried repeating the tests in a different order?

I don’t know exactly how to obtain the clock of the core.

Here are the results for 4 images, with the order changed (ifort 2021.10.0 20230609):

 Elapsed time (matmul):  2.353 [s]
 Elapsed time (coarray with dgemm) image= 1:  0.309 [s]
 Elapsed time (coarray with dgemm) image= 3:  0.309 [s]
 Elapsed time (coarray with dgemm) image= 4:  0.309 [s]
 Elapsed time (coarray with dgemm) image= 2:  0.309 [s]
 Elapsed time (coarray with matmul) image= 1:  0.541 [s]
 Elapsed time (coarray with matmul) image= 3:  0.541 [s]
 Elapsed time (coarray with matmul) image= 4:  0.541 [s]
 Elapsed time (coarray with matmul) image= 2:  0.541 [s]
 Elapsed time (dgemm):  1.009 [s]

This is interesting. When the intrinsic matmul is performed last, it takes significantly more time (ifort 2021.10.0 20230609):

 Elapsed time (coarray with dgemm) image= 2:  0.295 [s]
 Elapsed time (coarray with dgemm) image= 3:  0.295 [s]
 Elapsed time (coarray with dgemm) image= 4:  0.295 [s]
 Elapsed time (coarray with dgemm) image= 1:  0.296 [s]
 Elapsed time (coarray with matmul) image= 2:  0.545 [s]
 Elapsed time (coarray with matmul) image= 3:  0.545 [s]
 Elapsed time (coarray with matmul) image= 4:  0.545 [s]
 Elapsed time (coarray with matmul) image= 1:  0.546 [s]
 Elapsed time (dgemm):  1.014 [s]
 Elapsed time (matmul): 57.047 [s] ??

Here are the additional results (ifort 2021.10.0 20230609):

 Elapsed time (dgemm):  1.041 [s]
 Elapsed time (coarray with dgemm) image= 2:  0.316 [s]
 Elapsed time (coarray with dgemm) image= 4:  0.316 [s]
 Elapsed time (coarray with dgemm) image= 3:  0.316 [s]
 Elapsed time (coarray with dgemm) image= 1:  0.316 [s]
 Elapsed time (coarray with matmul) image= 2:  0.582 [s]
 Elapsed time (coarray with matmul) image= 4:  0.582 [s]
 Elapsed time (coarray with matmul) image= 3:  0.582 [s]
 Elapsed time (coarray with matmul) image= 1:  0.582 [s]
 Elapsed time (matmul): 61.321 [s] ??

However, using ifx 2023.2.0 20230622:

 Elapsed time (dgemm):  0.863 [s]
 Elapsed time (coarray with dgemm) image= 1:  0.276 [s]
 Elapsed time (coarray with dgemm) image= 4:  0.276 [s]
 Elapsed time (coarray with dgemm) image= 2:  0.276 [s]
 Elapsed time (coarray with dgemm) image= 3:  0.276 [s]
 Elapsed time (coarray with matmul) image= 1:  0.959 [s]
 Elapsed time (coarray with matmul) image= 4:  0.959 [s]
 Elapsed time (coarray with matmul) image= 2:  0.960 [s]
 Elapsed time (coarray with matmul) image= 3:  0.960 [s]
 Elapsed time (matmul):  3.366 [s] ! ok

Something seems to be wrong with IFORT!

You can use the LIKWID tools to measure the actual clock as well, provided the specific CPU, in your case an Intel Core i9, is supported. The majority of Intel server CPUs are supported, but Core CPUs may or may not be.

On server CPUs, a job running on a single core may run at a high clock (if the frequency is not pinned manually), but when running on all cores it is less likely that every core stays at a high clock. The same may hold for Intel Core CPUs.


Here is another simple example in which the intrinsic matmul is performed, followed by a transpose operation, and then the intrinsic matmul again. In this case, no coarrays are involved.

program benchmark2

   use kinds,     only: rk
   use fortime,   only: timer

   implicit none

   real(rk), allocatable :: A(:,:), B(:,:)
   real(rk), allocatable :: C(:,:)
   type(timer)           :: t
   integer               :: m, n, o, i, l
   character(2) :: im

   ! C(m,o) = A(m,n).B(n,o)
   m = 2000
   n = 2000
   o = 2000

   l = 4

   allocate(A(m,n),B(n,o))
   call random_number(A)
   call random_number(B)

   sync all

   if (this_image() == 1) call t%timer_start()
   do i = 1,l
      C = matmul(A,B)
   end do
   if (this_image() == 1) call t%timer_stop(message=' Elapsed time (matmul):',nloops=l)

   sync all

   if (this_image() == 1) call t%timer_start()
   do i = 1,l
      A = transpose(A) ! in-place transpose (A is square here: m = n)
   end do
   if (this_image() == 1) call t%timer_stop(message=' Elapsed time (transpose):',nloops=l)

   sync all

   if (this_image() == 1) call t%timer_start()
   do i = 1,l
      C = matmul(A,B)
   end do
   if (this_image() == 1) call t%timer_stop(message=' Elapsed time (matmul):',nloops=l)

end program benchmark2

When using ifort, the elapsed time increases:

 Elapsed time (matmul):  0.669 [s]
 Elapsed time (transpose):  0.027 [s]
 Elapsed time (matmul): 16.256 [s] ???

However, when using ifx, the elapsed time doesn’t change:

 Elapsed time (matmul):  1.388 [s]
 Elapsed time (transpose):  0.029 [s]
 Elapsed time (matmul):  1.413 [s]

It appears again that something might be wrong with ifort.

There may be a few points at play:

  1. The clock rate that was mentioned by Carltoffel

  2. Matrix-matrix multiplication, if written in a simple (naive) way, is a memory-bound algorithm. However, it is possible to rewrite it so that it becomes compute-bound, and I do not know how much effort the developers of the Intel Fortran compilers have put into fully optimizing the intrinsic (see the blocking sketch after this list). When it is memory-bound, an increase in run time as the number of images grows is no surprise, because it is a hardware limit: with one image running there is more memory bandwidth available per core than when all cores on the socket are busy. In contrast, if it is written in a fully optimized, compute-bound way, only the clock rate matters, and with a fixed clock rate there should be no increase as the number of images increases.

  3. ifort has been developed for years, while ifx has been under heavy development only in recent years. You may see large differences in performance between different versions of the ifx compiler.

  4. In a code like

   do i = 1,l
      C = matmul(A,B)
   end do

the compiler will most likely notice that the result does not depend on the loop iteration, drop the loop, and do the work once. Usually one has to fool the compiler so that it cannot detect this (see the sketch after this list).
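
Regarding point 2, here is a hypothetical sketch (my own illustration, not any compiler's or library's actual implementation) of how blocking/tiling turns the naive, memory-bound triple loop into a kernel with much better cache reuse:

program blocked_matmul
   implicit none
   integer, parameter :: rk = kind(1.0d0), n = 512, nb = 64
   real(rk) :: A(n,n), B(n,n), C(n,n)
   integer  :: i, j, k, ii, jj, kk

   call random_number(A); call random_number(B)
   C = 0.0_rk
   do jj = 1, n, nb                        ! blocks of columns of C and B
      do kk = 1, n, nb                     ! blocks of the contraction index
         do ii = 1, n, nb                  ! blocks of rows of C and A
            do j = jj, min(jj+nb-1, n)
               do k = kk, min(kk+nb-1, n)
                  do i = ii, min(ii+nb-1, n)   ! unit stride (column-major)
                     C(i,j) = C(i,j) + A(i,k)*B(k,j)
                  end do
               end do
            end do
         end do
      end do
   end do
   print *, 'C(1,1) =', C(1,1)
end program blocked_matmul

With nb chosen so that a few nb-by-nb tiles fit in cache, each loaded element of A and B is reused many times before being evicted, which moves the kernel from memory-bound towards compute-bound.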
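
Regarding point 4, here is a minimal sketch (hypothetical, not the benchmark code above) of the usual trick: make the input of each iteration slightly different and consume the output, so the compiler cannot hoist or drop the timed loop:

program keep_loop
   implicit none
   integer, parameter :: rk = kind(1.0d0)
   real(rk), allocatable :: A(:,:), B(:,:), C(:,:)
   real(rk) :: s
   integer  :: i, l

   l = 5
   allocate(A(500,500), B(500,500))
   call random_number(A); call random_number(B)
   s = 0.0_rk
   do i = 1, l
      A(1,1) = A(1,1) + 1.0e-12_rk*i   ! perturb an input so iterations differ
      C = matmul(A, B)                 ! C is (re)allocated on assignment
      s = s + C(1,1)                   ! consume the result of every pass
   end do
   print *, 'checksum =', s            ! observable side effect keeps the loop
end program keep_loop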

In the second example, as opposed to the first one, I haven't changed the number of images; it stays at four. The primary distinction here comes from using different compilers: Intel Fortran Compiler Classic (ifort) and the LLVM-based Intel Fortran Compiler (ifx). Furthermore, tests with only a single image, for both small and large matrices, consistently produce the same outcome: the second matmul call takes much more time with ifort.

If point 4 applied in this case, it should apply to all three loops: the first matmul, the transpose, and the second matmul. However, that is not what happens, and since I'm averaging the time over the l loop iterations, this is easy to show.

For an algorithm like matmul that can be optimized heavily, it is not surprising to observe different performance between the legacy Intel compiler and the modern LLVM-based ifx; they are very different compilers anyway.

Concerning the loop, you can try to generate the assembly file with "-S", or use

to see that deleting the loop does not change the generated code, because the compiler detects the redundancy; in this case it is easy for the compiler to spot. Also, the size of the binary file does not change when I delete the loop lines.

One more point: when the memory access pattern is not good, for example when the arrays are not accessed with unit stride, there is a huge impact on performance. For small arrays that fit into cache the problem is cache-line misses; for large arrays it is memory bandwidth and latency.
I do not know what matmul does when the arrays do not have a proper stride. Does it transpose the data first? It would be good if someone from the Intel or GFortran community could answer this.
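
To make the concern concrete, here is a hypothetical sketch (not taken from the benchmarks above) of a call where the first operand is a non-contiguous array section; depending on the compiler, matmul may pack it into a contiguous temporary or fall back to a slower strided kernel:

program stride_example
   implicit none
   integer, parameter :: rk = kind(1.0d0)
   real(rk), allocatable :: A(:,:), B(:,:), C(:,:)

   allocate(A(2000,1000), B(1000,1000))
   call random_number(A); call random_number(B)
   ! A(1:2000:2, :) is a strided view: consecutive elements of a column
   ! are two storage units apart, so the section is not contiguous
   C = matmul(A(1:2000:2, :), B)
   print *, 'C(1,1) =', C(1,1)
end program stride_example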