Nvfortran comparison of do concurrent vs OpenMP code

hakostra · September 6, 2024, 11:38am

I cleaned up the program a bit further and merged the OpenMP and DO CONCURRENT variants into one program to make it simpler to test different variants. This forum’s inline code is not too good, so I put my version in a GitHub Gist:

https://gist.github.com/hakostra/ac5ed9279136fe1e7f6f217f2561f08e

omp.F90

! Code from:
! https://fortran-lang.discourse.group/t/nvfortran-comparison-of-do-concurrent-vs-openmp-code/8552/18
!
! Adapted version with OpenMP and DO CONCURRENT in the same program file
! and removed disk write in the end.
!
MODULE all_data
    USE iso_fortran_env, ONLY: int64, i4 => int32, r8 => real64
    IMPLICIT NONE

This file has been truncated; the full source is in the Gist linked above.

Then I ran this program with all the compilers I had access to on my workstation, an Intel i9-13900K. This CPU has 8 P-cores (performance) and 16 E-cores (efficiency). The operating system is Linux Mint 21.2 (built on top of Ubuntu 22.04). The compilers I tested were:

NAG nagfor 7.2 build 7214
GNU gfortran 12.3
Intel ifx 2024.2.1
Intel ifort 2021.13.1
Nvidia nvfortran 24.7
LLVM flang-new, GitHub main branch commit 0c1500ef (yesterday)

The results became quite a big document, so I uploaded them as a second Gist:

https://gist.github.com/hakostra/87a043e8436efb898b63b192638392d2

omp-dc-test-results.txt

NAG nagfor 7.2 build 7214
=========================
Without threads:
  -O4 -target=broadwell -not_openmp

                          163.0    94.5    68.6

With OpenMP threads:
  -O4 -target=broadwell -openmp

This file has been truncated; the full results are in the Gist linked above.

The results have three columns: the first is the total time spent in the time loop, the second is the time spent in the first nested loop (seca) and the third is the time in the second nested loop (secb).

I will not interpret the results too much here, but rather make some remarks:

Only nvfortran can parallelize the DO CONCURRENT loops; none of the other compilers makes use of more than one thread for this variant. Therefore I would not use DO CONCURRENT in any program where I want to use threads.
gfortran will not compile the DO CONCURRENT variant, since it does not understand the locality specification (bugzilla).
nvfortran, ifx and ifort create incredibly fast executables when not using threads/OpenMP. None of the other compilers is even close in single-thread performance.
ifort creates the fastest non-threaded executable.
Most compilers create a slower-running program when OpenMP (or DO CONCURRENT parallelization) is enabled and that program is then run on a single thread (OMP_NUM_THREADS=1 or ACC_NUM_CORES=1).
Nearly all compilers (except ifort) “converge” on the same runtime, between 20 and 25 seconds for the entire time loop, given enough cores. I think this means that I have reached a state where the CPU is no longer the bottleneck, but the memory transfer is.
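
For readers who have not opened the Gist, here is a minimal sketch of what the two loop variants being compared can look like. It is not the code from the Gist: the module name sweep_demo, the routines smooth_omp and smooth_dc, and the four-point average are all made up for illustration. The local(s) clause is the kind of Fortran 2018 locality specification that gfortran 12.3 rejects according to the remarks above, so this sketch will presumably fail to compile with that version too.

```fortran
! Hypothetical illustration only; names and the stencil are assumptions,
! not taken from the Gist.
module sweep_demo
    use iso_fortran_env, only: i4 => int32, r8 => real64
    implicit none
contains

    ! OpenMP variant: iterations of the outer loop are shared among threads
    ! when OpenMP is enabled; otherwise the directives are plain comments.
    subroutine smooth_omp(phi, phin)
        real(r8), intent(in)  :: phi(:, :)
        real(r8), intent(out) :: phin(:, :)
        integer(i4) :: i, j
        real(r8) :: s

        phin = phi   ! boundary values are copied unchanged
        !$omp parallel do private(i, s)
        do j = 2, size(phi, 2) - 1
            do i = 2, size(phi, 1) - 1
                s = phi(i-1, j) + phi(i+1, j) + phi(i, j-1) + phi(i, j+1)
                phin(i, j) = 0.25_r8 * s
            end do
        end do
        !$omp end parallel do
    end subroutine smooth_omp

    ! DO CONCURRENT variant: the same loop body, with the temporary s
    ! declared iteration-local through a locality specifier.
    subroutine smooth_dc(phi, phin)
        real(r8), intent(in)  :: phi(:, :)
        real(r8), intent(out) :: phin(:, :)
        integer(i4) :: i, j
        real(r8) :: s

        phin = phi   ! boundary values are copied unchanged
        do concurrent (j = 2:size(phi, 2) - 1, i = 2:size(phi, 1) - 1) local(s)
            s = phi(i-1, j) + phi(i+1, j) + phi(i, j-1) + phi(i, j+1)
            phin(i, j) = 0.25_r8 * s
        end do
    end subroutine smooth_dc

end module sweep_demo
```

Both routines take the same arguments, so a driver program can call either one inside its time loop and compare timings and results. Whether the DO CONCURRENT version actually runs on more than one thread depends entirely on the compiler and its flags, which is the point of the comparison above.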

Shahid · September 6, 2024, 4:15pm

ifx did not produce correct results when I tested it.

hakostra · September 9, 2024, 6:46am

I checked the results of ifx, gfortran and nvfortran. The final value of the phi array is plotted below, for these three compilers with and without using OpenMP/DO CONCURRENT.

The right column, with OpenMP/DO CONCURRENT, shows results produced with 8 threads. To me these results seem to be identical. If there are differences I haven’t spotted, feel free to shout out.

Shahid · September 9, 2024, 9:25am

I just used Windows 10 with ifx. The code does not run in parallel there. Maybe running on Linux shows parallel performance; I did not check it there.

Shahid · September 9, 2024, 9:26am

A few months back, using ifx did not produce correct results. Now it just does not run in parallel on Windows 10.
