I cleaned up the program a bit further and merged the OpenMP and DO CONCURRENT variants into one program to make it simpler to test different variants. This forum’s inline code is not too good, so I put my version in a GitHub Gist.
Then I ran this program with all the compilers I have access to on my workstation. My workstation has an Intel i9-13900K. This CPU has 8 P-cores (performance) and 16 E-cores (efficiency). The operating system is Linux Mint 21.2 (built on top of Ubuntu 22.04). The compilers I tested were:
- NAG `nagfor` 7.2 build 7214
- GNU `gfortran` 12.3
- Intel `ifx` 2024.2.1
- Intel `ifort` 2021.13.1
- Nvidia `nvfortran` 24.7
- LLVM `flang-new`, GitHub main branch commit 0c1500ef (yesterday)
The results turned into quite a big document, so I uploaded them as a second Gist:
The results have three columns: the first is the total time spent in the time loop, the second is the time spent in the first nested loop (`seca`), and the third is the time spent in the second nested loop (`secb`).
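For readers who have not opened the Gist, here is a minimal sketch of how three such columns could be collected with `system_clock`. It is not the benchmark program itself: the array sizes, the loop bodies and all variable names other than `seca` and `secb` are placeholders I made up for illustration.

```fortran
program timing_sketch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: n = 200, nsteps = 10   ! hypothetical sizes
  real(real64) :: u(n, n), v(n, n)
  real(real64) :: total, seca, secb
  integer(int64) :: t_start, t_a0, t_a1, t_b1, t_end, rate
  integer :: i, j, step

  call random_number(u)
  v = 0.0_real64
  seca = 0.0_real64
  secb = 0.0_real64

  call system_clock(count_rate=rate)
  call system_clock(t_start)
  do step = 1, nsteps                 ! the "time loop"
    call system_clock(t_a0)
    do j = 1, n                       ! stand-in for the first nested loop
      do i = 1, n
        v(i, j) = 0.5_real64 * u(i, j)
      end do
    end do
    call system_clock(t_a1)
    do j = 1, n                       ! stand-in for the second nested loop
      do i = 1, n
        u(i, j) = u(i, j) + v(i, j)
      end do
    end do
    call system_clock(t_b1)
    ! Accumulate the per-loop times over all time steps
    seca = seca + real(t_a1 - t_a0, real64) / real(rate, real64)
    secb = secb + real(t_b1 - t_a1, real64) / real(rate, real64)
  end do
  call system_clock(t_end)
  total = real(t_end - t_start, real64) / real(rate, real64)

  ! Three columns: total time, first nested loop, second nested loop
  print '(3f12.6)', total, seca, secb
end program timing_sketch
```

With this layout the first column is always at least the sum of the other two, since it also includes the timing calls and anything else inside the time loop.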
I will not interpret the results too much here, but I do have some remarks:
- Only `nvfortran` can parallelize the DO CONCURRENT loops; none of the other compilers make use of more than one thread for this variant. Therefore I would not use DO CONCURRENT in any program where I want to use threads.
- `gfortran` will not compile the DO CONCURRENT variant since it does not understand the locality specification (bugzilla).
- `nvfortran`, `ifx` and `ifort` create incredibly fast executables when not using threads/OpenMP. None of the other compilers are even close in single-thread performance.
- `ifort` creates the fastest non-threaded executable.
- Most compilers create a slower program with OpenMP (or DO CONCURRENT) enabled when that program is run on a single thread (`OMP_NUM_THREADS=1` or `ACC_NUM_CORES=1`).
- Nearly all compilers “converge” at the same runtime (except `ifort`), between 20 and 25 seconds for the entire time loop, given enough cores. I think this means that I have reached a state where the CPU is no longer the bottleneck, but the memory transfer is.