Nvfortran comparison of do concurrent vs OpenMP code

@ivanpribec
This is an interesting approach, but it ignores a very important consideration: the unit of memory transfer is effectively the memory page (4 Kbytes), not the individual value. Unfortunately this makes the analysis more complex.
The problem with an analysis that counts only the bytes used is that it underestimates the total memory traffic.
For contiguous access the two estimates agree, but if you are randomly accessing 8-byte values, you are effectively pulling in 4 Kbytes of memory for each access, i.e. 512 times as much memory as the byte count suggests.
(This is why the inner loop index should be i, the first subscript, which has been emphasised since I first tried to write optimised Fortran calculations.)
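
A minimal sketch of the effect (my own toy example, not from the code under discussion): the same reduction over a large real(8) array, once with the first subscript i innermost (unit stride) and once with j innermost (a stride of Nx*8 bytes, so successive accesses land on different pages):

    program loop_order
        implicit none
        integer, parameter :: nx = 4096, ny = 4096
        real(8), allocatable :: a(:,:)
        real(8) :: s, t0, t1, t2
        integer :: i, j

        allocate ( a(nx,ny) )
        a = 1.0d0
        s = 0.0d0

        call cpu_time (t0)
        do j = 1, ny              ! i innermost : contiguous, pages swept sequentially
            do i = 1, nx
                s = s + a(i,j)
            end do
        end do
        call cpu_time (t1)
        do i = 1, nx              ! j innermost : 32-Kbyte stride, poor page/cache reuse
            do j = 1, ny
                s = s + a(i,j)
            end do
        end do
        call cpu_time (t2)

        print *, 'i-inner', t1-t0, '  j-inner', t2-t1, '  sum', s
    end program loop_order

Both versions touch the same number of 8-byte values, but the second sweeps the whole array with a large stride, so almost every access lands on a different page and the bytes-only estimate badly understates the traffic.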

Examples of this issue include:

  1. I changed the order of the following loop nest (the swapped form is sketched just after this list), although it is outside the main calculation loops:
    do i = 1, Nx
        do j = 1, Ny
            if ( (i - Nx/two)*(i - Nx/two) + (j - Ny/two)*(j - Ny/two) < seed ) then
                phi(i,j) = one
            end if
        end do
    end do
  2. In this set of statements, the first two require three pages of memory, for columns jm, j and jp;
    the third requires only one page and the fourth two pages. Each page is then processed over i, so sequentially, but the first two statements need a 3x memory footprint in L1 and L2 cache (which is another complexity), as well as 3x in L3 cache.
                lap_phi(i,j)   = ( phi(ip,j) + phi(im,j) + phi(i,jm) + phi(i,jp) - four*phi(i,j)) / ( dx*dy )
                lap_tempr(i,j) = ( tempr(ip,j) + tempr(im,j) + tempr(i,jm) + tempr(i,jp) - four*tempr(i,j)) / ( dx*dy )
                
                !======
                
                phidx(i,j) = ( phi(ip,j) - phi(im,j) ) / dx
                phidy(i,j) = ( phi(i,jp) - phi(i,jm) ) / dy
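
For completeness, this is presumably the form the seed-initialisation loop in item 1 ends up in, with i innermost so that phi(i,j) is written with unit stride (phi, Nx, Ny, seed, one and two as defined in the original program):

    do j = 1, Ny
        do i = 1, Nx
            if ( (i - Nx/two)*(i - Nx/two) + (j - Ny/two)*(j - Ny/two) < seed ) then
                phi(i,j) = one
            end if
        end do
    end do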

Unfortunately, trying to manage L1, L2 and L3 cache efficiency when there is a memory access bottleneck is very difficult, especially when the only controls we have are Fortran statements.

Another complexity to understand for OpenMP is the use of -ffast-math. It has been shown to be very effective for single-threaded (well-behaved) calculations, but once memory bandwidth becomes the bottleneck the advantage erodes.
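
For reference, the kind of Gfortran build line being compared (the source file name is just a placeholder; the flags are standard gfortran options):

    gfortran -O3 -march=native -ffast-math -fopenmp phase_field.f90 -o phase_field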

Finally, I am not an expert on Nvfortran or the two Intel Fortrans, but these do appear to multi-thread DO CONCURRENT, while the versions of Gfortran I use do not. (All would require appropriate compiler options?)
I would observe that the extra detail required to make DO CONCURRENT multi-threaded is about the same as using !$OMP in Gfortran (which more clearly documents the intent).
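
As a generic illustration of that point (my own toy loop, not from the phase-field code), here are the two spellings side by side. As far as I can tell, Nvfortran threads the first form with -stdpar=multicore and Gfortran only with -ftree-parallelize-loops=<n>, while both thread the second form with their OpenMP flags (-mp and -fopenmp respectively); I have not checked the Intel options.

    program dc_vs_omp
        implicit none
        integer, parameter :: n = 10000000
        real(8), allocatable :: a(:), b(:)
        integer :: i

        allocate ( a(n), b(n) )
        b = 1.0d0

        ! DO CONCURRENT spelling of the loop
        do concurrent (i = 1:n)
            a(i) = 2.0d0*b(i) + 1.0d0
        end do

        ! !$OMP spelling of the same loop
        !$omp parallel do
        do i = 1, n
            a(i) = 2.0d0*b(i) + 1.0d0
        end do
        !$omp end parallel do

        print *, a(1), a(n)
    end program dc_vs_omp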

So the important result is to make sure the statements document the programmer’s intent.
I achieve that with Gfortran.

Although there are many disadvantages to using OpenMP, for me the advantages dominate.