Further to the discussion under Automatic arrays and intrinsic array operations: to use or not to use?, I did some more tests on the performance of
matmul under the default settings of the compilers. Here,
matmul is only an example of Fortran’s intrinsic matrix/vector operation. There are many others. They are core strengths of Fortran, and I hope they are robust and performant by default.
(Or, how to argue that we want these intrinsic functions to perform poorly by default? How to explain such a poor default performance as an advantage? For me, this is hard to do.)
Admittedly, the tests should be more systematic, with randomized data and nice plots. However, I did not do it because the intrinsic
matmul simply takes too much time to run under the default settings.
module maprod_mod implicit none private public :: maprod, RP integer, parameter :: RP = kind(0.0D0) contains function maprod(X, Y) result(Z) real(RP), intent(in) :: X(:, :), Y(:, :) real(RP), allocatable :: Z(:, :) integer :: i, j, k if (size(X, 2) /= size(Y, 1)) then write (*, *) "Wrong size" stop end if allocate (Z(size(X, 1), size(Y, 2))) Z = 0.0_RP do j = 1, size(Y, 2) do i = 1, size(X, 1) do k = 1, size(X, 2) Z(i, j) = Z(i, j) + X(i, k) * Y(k, j) end do end do end do end function maprod end module maprod_mod program test_matmul use, non_intrinsic :: maprod_mod, only : maprod, RP implicit none integer, parameter :: n = 2000 real(RP), allocatable :: A(:, :), B(:, :) real :: start, finish write (*, *) 'Initialize' allocate (A(n, n), B(n, n)) A = 0.0_RP write (*, *) 'MATMUL' !call cpu_time(start) B = matmul(A, A) !call cpu_time(finish) write (*, *) 'Succeed' !write (*, *) 'Time in seconds:', finish - start ! The use of `cpu_time` is questionable. Just ignore it. ! Uncomment the following code if you would like to compare `matmul` and `maprod`. !call cpu_time(start) !B = maprod(A, A) !call cpu_time(finish) !write (*, *) 'Succeed' !write (*, *) 'Time in seconds:', finish - start end program test_matmul
Hardware and OS:
Lenovo Thinkpad X1 Carbon gen 8, Ubuntu 20.04, click to see the details
$ uname -a Linux zX18 5.11.15-051115-generic (Ubuntu 20.04) $ lscpu | grep mod Nom de modèle : Intel(R) Core(TM) i7-10610U CPU @ 1.80GHz $ lsmem RANGE SIZE STATE REMOVABLE BLOCK 0x0000000000000000-0x000000006fffffff 1.8G en ligne oui 0-13 0x0000000078000000-0x000000007fffffff 128M en ligne oui 15 0x0000000100000000-0x000000047fffffff 14G en ligne oui 32-143 Taille du bloc mémoire : 128M Mémoire partagée totale : 15.9G $ ulimit -a core file size (blocks, -c) unlimited data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 62293 max locked memory (kbytes, -l) 65536 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 62293 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited
gfortran: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
nvfortran: nvfortran 22.5-0 64-bit target on x86-64 Linux -tp haswell
flang: clang version 7.0.1
flang: AMD clang version 13.0.0 (CLANG: AOCC_3.2.0-Build#128 2021_11_12) (based on LLVM Mirror.Version.13.0.0)
sunf95: Oracle Developer Studio 12.6
g95: gcc version 4.0.3 (g95 0.94!) Jan 17 2013 (this is a dinosaur, but it is interesting to compare it with
nagfor: NAG Fortran Compiler Release 7.0(Yurakucho) Build 7074
af95: Absoft 64-bit Pro Fortran 22.0.2
ifx: Intel OneAPI 2022.1.0
ifxencountered a segfault when
n = 2000(I did not try to find the smallest n). For more information about the crash, see my post under Automatic arrays and intrinsic array operations: to use or not to use. It is a well-known and well-understood issue. Let’s not repeat how to circumvent it by compiler options. Instead, let’s discuss why it is not a problem that should worry the Fortran community and how to explain this to the Fortran beginners and Fortran criticizers.
af95from encountered a segfault when
n = 10000(I did not try to find the smallest n).
n = 20000, none of
g95crashed before I terminated them. Well done!
n = 20000
flang, and Oracle
sunf95took about 2000–3000 wall-clock seconds to finish the job;
nagfortook more than 10 wall-clock hours and did not finish before I terminated it;
g95took hours and did not finish before I terminated it.
n = 20000and randomized initialization of
A, the following MATLAB code took about 200–300 wall-clock seconds. No segfault, no crash, no tuning needed, no long waiting needed. Just run.
A = rand(20000,20000); tic; B = A*A; toc
What is your opinion?
Sure, MATLAB would be a total failure without the high-performance Fortran code behind the job. That’s for sure. It is a fact known quite well. I totally agree. Let’s not repeat this. Instead, let us discuss why Fortran should not optimize its intrinsic matrix/vector operations in the same way (MATLAB is optimizing its intrinsics using Fortran!) and how to explain this to Fortran beginners and Fortran criticizers.