Nvfortran comparison of do concurrent vs OpenMP code

@ivanpribec
This is an interesting approach, but it ignores a very important consideration: the unit of memory transfer is effectively the memory page (4 Kbytes), not the individual value. Unfortunately this makes the analysis more complex.
The problem with an analysis that counts only the bytes used is that it underestimates the total memory traffic.
For contiguous access the two estimates agree, but if you are randomly accessing 8-byte values, you are effectively pulling in 4 Kbytes of memory for each access, i.e. 512 times as much memory as the byte count suggests.
(This is why the inner loop index should be i, the first subscript, which has been emphasised since I first tried to write optimised Fortran calculations.)
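
A minimal sketch of the effect (my own toy example, not from the code under discussion): the same reduction over a large real(8) array, once with the first subscript i innermost (unit stride) and once with j innermost (a stride of Nx*8 bytes, so successive accesses land on different pages):

    program loop_order
        implicit none
        integer, parameter :: nx = 4096, ny = 4096
        real(8), allocatable :: a(:,:)
        real(8) :: s, t0, t1, t2
        integer :: i, j

        allocate ( a(nx,ny) )
        a = 1.0d0
        s = 0.0d0

        call cpu_time (t0)
        do j = 1, ny              ! i innermost : contiguous, pages swept sequentially
            do i = 1, nx
                s = s + a(i,j)
            end do
        end do
        call cpu_time (t1)
        do i = 1, nx              ! j innermost : 32-Kbyte stride, poor page/cache reuse
            do j = 1, ny
                s = s + a(i,j)
            end do
        end do
        call cpu_time (t2)

        print *, 'i-inner', t1-t0, '  j-inner', t2-t1, '  sum', s
    end program loop_order

Both versions touch the same number of 8-byte values, but the second sweeps the whole array with a large stride, so almost every access lands on a different page and the bytes-only estimate badly understates the traffic.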

Examples of this issue include:

  1. I changed the order of the following loop nest (the swapped form is sketched just after this list), although it is outside the main calculation loops:
    do i = 1, Nx
        do j = 1, Ny
            if ( (i - Nx/two)*(i - Nx/two) + (j - Ny/two)*(j - Ny/two) < seed ) then
                phi(i,j) = one
            end if
        end do
    end do
  2. In this set of statements, the first two require three pages of memory, for columns jm, j and jp;
    the third requires only one page and the fourth two pages. Each page is then processed over i, so sequentially, but the first two statements need a 3x memory footprint in L1 and L2 cache (which is another complexity), as well as 3x in L3 cache.
                lap_phi(i,j)   = ( phi(ip,j) + phi(im,j) + phi(i,jm) + phi(i,jp) - four*phi(i,j)) / ( dx*dy )
                lap_tempr(i,j) = ( tempr(ip,j) + tempr(im,j) + tempr(i,jm) + tempr(i,jp) - four*tempr(i,j)) / ( dx*dy )
                
                !======
                
                phidx(i,j) = ( phi(ip,j) - phi(im,j) ) / dx
                phidy(i,j) = ( phi(i,jp) - phi(i,jm) ) / dy
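
For completeness, this is presumably the form the seed-initialisation loop in item 1 ends up in, with i innermost so that phi(i,j) is written with unit stride (phi, Nx, Ny, seed, one and two as defined in the original program):

    do j = 1, Ny
        do i = 1, Nx
            if ( (i - Nx/two)*(i - Nx/two) + (j - Ny/two)*(j - Ny/two) < seed ) then
                phi(i,j) = one
            end if
        end do
    end do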

Unfortunately, trying to manage L1, L2 and L3 cache efficiency when there is a memory access bottleneck is very difficult, especially when the only controls we have are Fortran statements.

Another complexity to understand for OpenMP is the use of -ffast-math. It has been shown to be very effective for single-threaded (well-behaved) calculations, but once memory bandwidth becomes the bottleneck the advantage erodes.
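
For reference, the kind of Gfortran build line being compared (the source file name is just a placeholder; the flags are standard gfortran options):

    gfortran -O3 -march=native -ffast-math -fopenmp phase_field.f90 -o phase_field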

Finally, I am not an expert on Nvfortran or the two Intel Fortrans, but these do appear to multi-thread DO CONCURRENT, while the versions of Gfortran I use do not. (All would require appropriate compiler options?)
I would observe that the extra detail required to make DO CONCURRENT multi-threaded is about the same as using !$OMP in Gfortran (which more clearly documents the intent).
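
As a generic illustration of that point (my own toy loop, not from the phase-field code), here are the two spellings side by side. As far as I can tell, Nvfortran threads the first form with -stdpar=multicore and Gfortran only with -ftree-parallelize-loops=<n>, while both thread the second form with their OpenMP flags (-mp and -fopenmp respectively); I have not checked the Intel options.

    program dc_vs_omp
        implicit none
        integer, parameter :: n = 10000000
        real(8), allocatable :: a(:), b(:)
        integer :: i

        allocate ( a(n), b(n) )
        b = 1.0d0

        ! DO CONCURRENT spelling of the loop
        do concurrent (i = 1:n)
            a(i) = 2.0d0*b(i) + 1.0d0
        end do

        ! !$OMP spelling of the same loop
        !$omp parallel do
        do i = 1, n
            a(i) = 2.0d0*b(i) + 1.0d0
        end do
        !$omp end parallel do

        print *, a(1), a(n)
    end program dc_vs_omp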

So the important result is to make sure the statements document the programmer’s intent.
I achieve that with Gfortran.

Although there are many disadvantages to using OpenMP, for me the advantages dominate.