Testing the performance of `matmul` under default compiler settings

Further to the discussion under Automatic arrays and intrinsic array operations: to use or not to use?, I did some more tests on the performance of matmul under the default settings of the compilers. Here, matmul is only an example of Fortran’s intrinsic matrix/vector operation. There are many others. They are core strengths of Fortran, and I hope they are robust and performant by default.

(Or, how to argue that we want these intrinsic functions to perform poorly by default? How to explain such a poor default performance as an advantage? For me, this is hard to do.)

Admittedly, the tests should be more systematic, with randomized data and nice plots. However, I did not do it because the intrinsic matmul simply takes too much time to run under the default settings.

Code:

module maprod_mod
implicit none
private
public :: maprod, RP
integer, parameter :: RP = kind(0.0D0)

contains

function maprod(X, Y) result(Z)
real(RP), intent(in) :: X(:, :), Y(:, :)
real(RP), allocatable :: Z(:, :)
integer :: i, j, k

if (size(X, 2) /= size(Y, 1)) then
    write (*, *) "Wrong size"
    stop
end if

allocate (Z(size(X, 1), size(Y, 2)))
Z = 0.0_RP
do j = 1, size(Y, 2)
    do i = 1, size(X, 1)
        do k = 1, size(X, 2)
            Z(i, j) = Z(i, j) + X(i, k) * Y(k, j)
        end do
    end do
end do
end function maprod

end module maprod_mod

program test_matmul
use, non_intrinsic :: maprod_mod, only : maprod, RP
implicit none
integer, parameter :: n = 2000
real(RP), allocatable :: A(:, :), B(:, :)
real :: start, finish

write (*, *) 'Initialize'
allocate (A(n, n), B(n, n))
A = 0.0_RP

write (*, *) 'MATMUL'
!call cpu_time(start)
B = matmul(A, A)
!call cpu_time(finish)
write (*, *) 'Succeed'
!write (*, *) 'Time in seconds:', finish - start  ! The use of `cpu_time` is questionable. Just ignore it. 

! Uncomment the following code if you would like to compare `matmul` and `maprod`. 
!call cpu_time(start)
!B = maprod(A, A)
!call cpu_time(finish)
!write (*, *) 'Succeed'
!write (*, *) 'Time in seconds:', finish - start 

end program test_matmul

Hardware and OS:

Lenovo Thinkpad X1 Carbon gen 8, Ubuntu 20.04, click to see the details
$ uname -a
Linux zX18 5.11.15-051115-generic (Ubuntu 20.04)
$ lscpu | grep mod
Nom de modèle :                         Intel(R) Core(TM) i7-10610U CPU @ 1.80GHz
$ lsmem 
RANGE                                  SIZE    STATE REMOVABLE  BLOCK
0x0000000000000000-0x000000006fffffff  1.8G en ligne       oui   0-13
0x0000000078000000-0x000000007fffffff  128M en ligne       oui     15
0x0000000100000000-0x000000047fffffff   14G en ligne       oui 32-143
Taille du bloc mémoire :  128M
Mémoire partagée totale : 15.9G
$ ulimit  -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 62293
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 62293
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Software:

  • gfortran: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
  • nvfortran: nvfortran 22.5-0 64-bit target on x86-64 Linux -tp haswell
  • Classic flang: clang version 7.0.1
  • AOCC flang: AMD clang version 13.0.0 (CLANG: AOCC_3.2.0-Build#128 2021_11_12) (based on LLVM Mirror.Version.13.0.0)
  • sunf95: Oracle Developer Studio 12.6
  • g95: gcc version 4.0.3 (g95 0.94!) Jan 17 2013 (this is a dinosaur, but it is interesting to compare it with ifort and ifx)
  • nagfor: NAG Fortran Compiler Release 7.0(Yurakucho) Build 7074
  • af95: Absoft 64-bit Pro Fortran 22.0.2
  • ifort and ifx: Intel OneAPI 2022.1.0

Compiler option: -g .

Results:

  1. ifort and ifx encountered a segfault when n = 2000 (I did not try to find the smallest n). For more information about the crash, see my post under Automatic arrays and intrinsic array operations: to use or not to use. It is a well-known and well-understood issue. Let’s not repeat how to circumvent it by compiler options. Instead, let’s discuss why it is not a problem that should worry the Fortran community and how to explain this to the Fortran beginners and Fortran criticizers.

  2. af95 from encountered a segfault when n = 10000 (I did not try to find the smallest n).

  3. Up to n = 20000, none of gfortran , nvfortran, AOCC flang, classic flang, Oracle sunf95, nagfor, and g95 crashed before I terminated them. Well done!

  4. With n = 20000

  • gfortran, nvfortran, AOCC flang, classic flang, and Oracle sunf95 took about 2000–3000 wall-clock seconds to finish the job;
  • nagfor took more than 10 wall-clock hours and did not finish before I terminated it;
  • g95 took hours and did not finish before I terminated it.
  1. Wtih n = 20000 and randomized initialization of A, the following MATLAB code took about 200–300 wall-clock seconds. No segfault, no crash, no tuning needed, no long waiting needed. Just run.
 A = rand(20000,20000); tic; B = A*A; toc

What is your opinion?

Sure, MATLAB would be a total failure without the high-performance Fortran code behind the job. That’s for sure. It is a fact known quite well. I totally agree. Let’s not repeat this. Instead, let us discuss why Fortran should not optimize its intrinsic matrix/vector operations in the same way (MATLAB is optimizing its intrinsics using Fortran!) and how to explain this to Fortran beginners and Fortran criticizers.

4 Likes

I haven’t followed the other thread, but is the issue basically about use of an efficient blas implementation – which needs to be prescribed in Fortran?

If I name your code fortran_matmul.f90, set n=10000, and compile like this:
gfortran -Ofast fortran_matmul.f90 -o fortran_matmul
then it executes slowly on my machine (about 115s) .

But if I make it use the system blas then its much faster (around 7.5s) :
gfortran -Ofast -fexternal-blas fortran_matmul.f90 -o fortran_matmul -lblas

Using top I can see that the slow version is only using 1 core, while the fast version is using all cores (12 on my machine).

As a user, I would not necessarily want the fortran version to use all of my cores by default – it’s something I’d like to control. Conversely I think your point is that new users won’t know how to optimize this – agree that’s an issue.

By the way, the use of CPU_time here is a bit misleading, because in parallel it does not correspond to the wall time, but rather the total time of all cores.

2 Likes

I just ran your code using Intel OneAPI, no problem at all.
It took less than 1 second.

For intel Fortran, you probably need to make sure you did,

-heap-arrays

Here maybe another relevant thread,

gfortran should run just fine without adding a flag I think. Although you can add some flags such as
-mcmodel=medium or -frecursive or other flags to heap arrays,

2 Likes

Thank you for pointing out this.

1 Like

If I set n=10000, it took 15 seconds, (note that call cpu_time is not accurate when multiple cores are involved, here it actually took 15 seconds, but call cpu_time shows 60 seconds or so)

1 Like

Thank you for pointing our the inaccuracy in cpu_time. However, when comparing 3000 with 300, the accuracy does not matter so much.

So if you have 10 cores and are running in parallel, then I think that CPU time will return about 10x the wall time.

2 Likes

Even though cpu_time was included in the code, the time reported in my post is indeed wall-clock time. It was not that precise, but the magnitude was correct.

1 Like

That is because I used Intel MKL’s matmul which automatically uses multiple cores. So call cpu_time() will not show real wall time.

About showing true wall time, you may check,

1 Like

I hope the intrinsic matmul will have a comparable performance by default without tuning needed. Before the tuning finished, half of the users would have turned to our languages, if not more.

1 Like

So we know that by linking to an optimal blas, Fortran can perform very well.

I think you really want to focus on why this doesn’t always happen by default:

Blockquote
let us discuss why Fortran should not optimize its intrinsic matrix/vector operations in the same way (MATLAB is optimizing its intrinsics using Fortran!) and how to explain this to Fortran beginners and Fortran criticizers.

Not my expertise. Presumably it isn’t trivial to do this across the full range of platforms (?anyone informed about this?). No doubt more progress would be possible if more programmers choose to contribute to open source compilers, and/or more funding was available for compilers across the board.

2 Likes

I have not even a tiny doubt about it.

This is exactly my point.

2 Likes

I used the intrinsic matmul which only uses one core, it took 54 seconds when n=10000, and call cpu_time() in this case is accurate.
So, in general, call cpu_time() does return the total time, not the wall time. I mean, for this same code, if I use 10 cores, call cpu_time() will still return like 60 seconds which is total time, but the actual wall time is smaller than that.
I mean call cpu_time() can give you the estimation of total core hours you need.

Sorry for naïve/lazy question, once I installed gfortran, can I just run the above command? I mean do I need to install some blas?

I’m running Ubuntu where blas is already installed, so I didn’t need to do anything extra.

On other platforms, I’m not sure.

1 Like

Another example of widely-used scientific computing infrastructure, that doesn’t default to an optimized blas, is the R interpreter.

They show how to do it here, and discuss the various trade-offs.

On my machine, the default R interpreter took 713s to do a matrix multiplication with n=10000 (single core), vs the above results of 115s with gfortran’s default matmul (also single core) and 8s using gfortran with the external blas (6 cores with hyperthreading).

2 Likes

The choice of algorithm for the various fortran intrinsic procedures is a general issue, not specific to matmul. There are two extreme possibilities.

  1. The fortran intrinsic should return the most accurate result that is practical using robust algorithms and memory allocation that fails as seldom as possible. Then if the user wants to use a different algorithm to improve performance, then the user should know his application enough to choose the appropriate algorithm with the correct tradeoffs between accuracy and speed.

  2. The fortran intrinsic should always return the fastest result, even if accuracy or robustness is compromised. The user is then responsible for using a different algorithm to improve the accuracy to be appropriate for his application.

Think about trig functions, where accurate results require much effort to reduce the input argument, then followed by the appropriate taylor expansion, or the appropriate pade approximate, or continued fraction, or whatever. Or think about random numbers, where the fast ones might not satisfy all of the exotic tests for randomness, while the good ones with long periods that do pass the tests can be more expensive. Or in the case of matmul, the single-core version is almost certainly fastest for small matrices, while for larger matrices there are various choices of multiple threads, multiple cores, AVX hardware, gpu hardware, and so on. The use of fast stack allocation (which can sometimes fail) vs. slower heap allocation (which is more robust) of intermediate work arrays is also an issue.

So which of these choices should fortran do? And does the answer for matmul() apply also to the other intrinsic procedures, such as random_number(), sin(), cos(), sqrt(), etc.? Should this be selectable with compiler options or with compiler directives within the code?

For the stack/heap allocation issue, the choices are stack allocation, heap allocation, or a conditional approach where stack allocation is first tested, and if that fails then heap allocation is invoked.

4 Likes

Ron,
Your conditional stack approach is so sensible, but rarely adopted.
Actually, I don’t know any operating system that has !!

Imagine a world where this conditional approach was the general case. The forum Stack Overflow would probably never have been started !! as stack overflow errors would not be so common.

I did a google search of “what is the stack size limit for Windows 64 bit”. It is very worrying to see some of the answers that google identifies, such as 1MB

The existing approach for the stack is really not suitable for a 64-bit OS. My understanding is that code + stack is limited to 2 Gbytes. I recall @slionel suggesting a limit for combined code + stack, which appears to be based on 32-bit Windows constraints. This approach was such a lost opportunity for a new 64-bit OS.

The only recent improvement to stack design, that I have noticed, is with OpenMP where each thread is given it’s own stack.
It is a shame that each thread was not also given it’s own heap address. At present 64-bit OS can address a 48-bit memory address space, which is much larger than available physical memory. Seperate heap addreses should be a possibility.
What I would like to have is for each heap private array to start on a new memory page to limit memory conflicts between threads.

2 Likes

Take matmul as an example. Thanks to the development of numerical analysis and scientific computing for so many years, there is no real need for tradeoffs between speed and accuracy, unless the matrices under consideration are huge (in a modern sense) or the requirement for speed/accuracy is extraordinarily high, in which case the user should consider specially designed algorithms rather than intrinsic functions anyway. For “daily-life” usage such as multiplying 20000-by-20000 matrices, we know quite well how to do it sufficiently fast with good accuracy.

I totally agree. Indeed, this seems to be the only reasonable approach. How could we make it happen at least for Fortran?