Nvfortran comparison of do concurrent vs OpenMP code

I cleaned up the program a bit further and merged the OpenMP and DO CONCURRENT variants into one program to make it simpler to test different variants. This forum’s inline code is not too good, so I put my version in a GitHub Gist.

Then I executed this program with all the compilers I had access to on my workstation. My workstation has an Intel i9-13900K. This CPU has 8 P-cores (performance) and 16 E-cores (efficiency). The operating system is Linux Mint 21.2 (built on top of Ubuntu 22.04). The compilers I tested were:

  • NAG nagfor 7.2 build 7214
  • GNU gfortran 12.3
  • Intel ifx 2024.2.1
  • Intel ifort 2021.13.1
  • Nvidia nvfortran 24.7
  • LLVM flang-new, Github main branch commit 0c1500ef (yesterday)

The results became quite a big document, so I uploaded them as a second Gist:

The results have three columns: the first is the total time spent in the time loop, the second is the time spent in the first nested loop (seca) and the third is the time in the second nested loop (secb).
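To make the three columns concrete, here is a minimal sketch of how such timers could be accumulated around a time loop with two nested loops. It is not the code from the Gist: the program name, array sizes, loop bodies and the `work` array are all made up for illustration; only the names `seca`, `secb` and `phi` come from this thread.

```fortran
program timing_sketch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  ! Hypothetical sizes, chosen only to keep the sketch small.
  integer, parameter :: n = 200, nsteps = 50
  real(real64) :: phi(n, n), work(n, n)
  real(real64) :: total, seca, secb
  integer(int64) :: t_start, t_end, t1, t2, t3, rate
  integer :: step, i, j

  call system_clock(count_rate=rate)
  phi = 1.0_real64
  work = 0.0_real64
  seca = 0.0_real64
  secb = 0.0_real64

  call system_clock(t_start)
  do step = 1, nsteps
    ! First nested loop: its time is accumulated in seca.
    call system_clock(t1)
    do j = 2, n - 1
      do i = 2, n - 1
        work(i, j) = 0.25_real64 * (phi(i-1, j) + phi(i+1, j) &
                                  + phi(i, j-1) + phi(i, j+1))
      end do
    end do
    call system_clock(t2)
    seca = seca + real(t2 - t1, real64) / real(rate, real64)

    ! Second nested loop: its time is accumulated in secb.
    do j = 2, n - 1
      do i = 2, n - 1
        phi(i, j) = work(i, j)
      end do
    end do
    call system_clock(t3)
    secb = secb + real(t3 - t2, real64) / real(rate, real64)
  end do
  call system_clock(t_end)

  ! Total time spent in the whole time loop.
  total = real(t_end - t_start, real64) / real(rate, real64)
  print '(3(a, f10.6))', ' total=', total, ' seca=', seca, ' secb=', secb
end program timing_sketch
```

With this layout the total is slightly larger than seca + secb, since it also includes the overhead of the timer calls themselves.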

I will not interpret the results too much here, but rather make some remarks:

  • Only nvfortran can parallelize the DO CONCURRENT loops; none of the other compilers uses more than one thread for this variant. Therefore I would not use DO CONCURRENT in any program where I want to use threads.
  • gfortran will not compile the DO CONCURRENT since it does not understand the locality specification (bugzilla).
  • nvfortran, ifx and ifort create incredibly fast executables when not using threads/OpenMP. None of the other compilers are even close in single-thread performance.
  • ifort creates the fastest non-threaded executable
  • Most compilers create a slower program with OpenMP (or DO CONCURRENT) enabled when that program is run with a single thread (OMP_NUM_THREADS=1 or ACC_NUM_CORES=1).
  • Nearly all compilers (except ifort) “converge” to the same runtime, between 20 and 25 seconds for the entire time loop, given enough cores. I think this means that I have reached a state where the CPU is no longer the bottleneck, but the memory transfer is.
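For readers who have not seen the two variants side by side, the following is a minimal sketch of the same loop written once with DO CONCURRENT and a locality specification, and once with OpenMP. It is an illustration only, not the program from the Gist: the array names `a` and `b`, the temporary `tmp`, the grid size and the stencil are assumptions of mine. The `local(...)`/`shared(...)` clause is the Fortran 2018 locality specification mentioned in the gfortran remark above.

```fortran
program dc_vs_omp_sketch
  implicit none
  ! Hypothetical grid size and arrays, for illustration only.
  integer, parameter :: n = 64
  real :: phi(n, n), a(n, n), b(n, n)
  real :: tmp
  integer :: i, j

  call random_number(phi)
  a = phi
  b = phi

  ! Variant 1: DO CONCURRENT with Fortran 2018 locality specifiers.
  ! tmp is declared local so every iteration gets its own copy.
  do concurrent (j = 2:n-1, i = 2:n-1) local(tmp) shared(phi, a)
    tmp = phi(i-1, j) + phi(i+1, j) + phi(i, j-1) + phi(i, j+1)
    a(i, j) = 0.25 * tmp
  end do

  ! Variant 2: the same loop nest with an OpenMP directive.
  ! Without OpenMP enabled the directive is just a comment and the
  ! loops run serially.
  !$omp parallel do collapse(2) private(tmp) shared(phi, b)
  do j = 2, n - 1
    do i = 2, n - 1
      tmp = phi(i-1, j) + phi(i+1, j) + phi(i, j-1) + phi(i, j+1)
      b(i, j) = 0.25 * tmp
    end do
  end do
  !$omp end parallel do

  ! Both variants perform the same arithmetic on the same input.
  print *, 'max difference between variants:', maxval(abs(a - b))
end program dc_vs_omp_sketch
```

Whether the first loop actually runs on more than one thread depends entirely on the compiler and its options, which is the point of the first remark above.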

ifx did not produce correct results when I tested it.

I checked the results of ifx, gfortran and nvfortran. The final value of the phi array is plotted below, for these three compilers with and without using OpenMP/DO CONCURRENT.

The results in the right column, with OpenMP/DO CONCURRENT, were all produced running with 8 threads. To me these results seem to be identical. If there are differences I haven’t spotted, feel free to shout out.

I only used ifx on Windows 10. There the code does not run in parallel. Maybe running on Linux shows parallel performance; I did not check it there!

A few months back, ifx did not produce correct results. Now it just does not run in parallel on Windows 10.