Further to the discussion under Automatic arrays and intrinsic array operations: to use or not to use?, I did some more tests on the performance of matmul
under the default settings of the compilers. Here, matmul
is only an example of Fortran’s intrinsic matrix/vector operation. There are many others. They are core strengths of Fortran, and I hope they are robust and performant by default.
(Or, how to argue that we want these intrinsic functions to perform poorly by default? How to explain such a poor default performance as an advantage? For me, this is hard to do.)
Admittedly, the tests should be more systematic, with randomized data and nice plots. However, I did not do it because the intrinsic matmul
simply takes too much time to run under the default settings.
Code:
module maprod_mod
implicit none
private
public :: maprod, RP
integer, parameter :: RP = kind(0.0D0)
contains
function maprod(X, Y) result(Z)
real(RP), intent(in) :: X(:, :), Y(:, :)
real(RP), allocatable :: Z(:, :)
integer :: i, j, k
if (size(X, 2) /= size(Y, 1)) then
write (*, *) "Wrong size"
stop
end if
allocate (Z(size(X, 1), size(Y, 2)))
Z = 0.0_RP
do j = 1, size(Y, 2)
do i = 1, size(X, 1)
do k = 1, size(X, 2)
Z(i, j) = Z(i, j) + X(i, k) * Y(k, j)
end do
end do
end do
end function maprod
end module maprod_mod
program test_matmul
use, non_intrinsic :: maprod_mod, only : maprod, RP
implicit none
integer, parameter :: n = 2000
real(RP), allocatable :: A(:, :), B(:, :)
real :: start, finish
write (*, *) 'Initialize'
allocate (A(n, n), B(n, n))
A = 0.0_RP
write (*, *) 'MATMUL'
!call cpu_time(start)
B = matmul(A, A)
!call cpu_time(finish)
write (*, *) 'Succeed'
!write (*, *) 'Time in seconds:', finish - start ! The use of `cpu_time` is questionable. Just ignore it.
! Uncomment the following code if you would like to compare `matmul` and `maprod`.
!call cpu_time(start)
!B = maprod(A, A)
!call cpu_time(finish)
!write (*, *) 'Succeed'
!write (*, *) 'Time in seconds:', finish - start
end program test_matmul
Hardware and OS:
Lenovo Thinkpad X1 Carbon gen 8, Ubuntu 20.04, click to see the details
$ uname -a
Linux zX18 5.11.15-051115-generic (Ubuntu 20.04)
$ lscpu | grep mod
Nom de modèle : Intel(R) Core(TM) i7-10610U CPU @ 1.80GHz
$ lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x000000006fffffff 1.8G en ligne oui 0-13
0x0000000078000000-0x000000007fffffff 128M en ligne oui 15
0x0000000100000000-0x000000047fffffff 14G en ligne oui 32-143
Taille du bloc mémoire : 128M
Mémoire partagée totale : 15.9G
$ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 62293
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 62293
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Software:
-
gfortran
: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1) -
nvfortran
: nvfortran 22.5-0 64-bit target on x86-64 Linux -tp haswell - Classic
flang
: clang version 7.0.1 - AOCC
flang
: AMD clang version 13.0.0 (CLANG: AOCC_3.2.0-Build#128 2021_11_12) (based on LLVM Mirror.Version.13.0.0) -
sunf95
: Oracle Developer Studio 12.6 -
g95
: gcc version 4.0.3 (g95 0.94!) Jan 17 2013 (this is a dinosaur, but it is interesting to compare it withifort
andifx
) -
nagfor
: NAG Fortran Compiler Release 7.0(Yurakucho) Build 7074 -
af95
: Absoft 64-bit Pro Fortran 22.0.2 -
ifort
andifx
: Intel OneAPI 2022.1.0
Compiler option: -g
.
Results:
-
ifort
andifx
encountered a segfault whenn = 2000
(I did not try to find the smallest n). For more information about the crash, see my post under Automatic arrays and intrinsic array operations: to use or not to use. It is a well-known and well-understood issue. Let’s not repeat how to circumvent it by compiler options. Instead, let’s discuss why it is not a problem that should worry the Fortran community and how to explain this to the Fortran beginners and Fortran criticizers. -
af95
from encountered a segfault whenn = 10000
(I did not try to find the smallest n). -
Up to
n = 20000
, none ofgfortran
,nvfortran
, AOCCflang
, classicflang
, Oraclesunf95
,nagfor
, andg95
crashed before I terminated them. Well done! -
With
n = 20000
-
gfortran
,nvfortran
, AOCCflang
, classicflang
, and Oraclesunf95
took about 2000–3000 wall-clock seconds to finish the job; -
nagfor
took more than 10 wall-clock hours and did not finish before I terminated it; -
g95
took hours and did not finish before I terminated it.
- Wtih
n = 20000
and randomized initialization ofA
, the following MATLAB code took about 200–300 wall-clock seconds. No segfault, no crash, no tuning needed, no long waiting needed. Just run.
A = rand(20000,20000); tic; B = A*A; toc
What is your opinion?
Sure, MATLAB would be a total failure without the high-performance Fortran code behind the job. That’s for sure. It is a fact known quite well. I totally agree. Let’s not repeat this. Instead, let us discuss why Fortran should not optimize its intrinsic matrix/vector operations in the same way (MATLAB is optimizing its intrinsics using Fortran!) and how to explain this to Fortran beginners and Fortran criticizers.