I run my code using both do concurrent and OpenMP. On Windows 10 with the /Qopenmp compiler flag and the Intel classic compiler, I do not see any performance difference, but on Ubuntu 20.04 with nvfortran there is a large difference. For do concurrent I use the compiler option -stdpar=multicore, following the link: Accelerating Fortran DO CONCURRENT with GPUs and the NVIDIA HPC SDK | NVIDIA Technical Blog. For the OpenMP code I use the -mp compiler flag. The performance difference is about 1/3; do concurrent seems faster. So my question is: which other compiler flags can I use to compare CPU performance with nvfortran? Should the performance not be the same for both versions?
OpenMP has the disadvantage of a significant startup overhead of about 20,000 processor cycles (gfortran on Windows, although this may vary with OS), so there can be little benefit from multi-threading small kernels like the saxpy example in the NVIDIA link.
Perhaps a multi-threaded DO CONCURRENT can achieve better results if its implementation has a lower startup overhead.
I did not see any reference to AVX/SIMD instructions, which typically give a performance uplift for these types of calculations, where there is only a small amount of work per iteration.
Perhaps your nvfortran test obtained an improvement from AVX rather than from multi-threading?
With calculations like saxpy, I do not get any direct benefit from OpenMP, but I do from AVX.
Using saxpy with large vectors also runs into memory bandwidth limitations (cache efficiency), so testing can be difficult to control when comparing to a reference case.
It is interesting that the NVIDIA link presents saxpy as a candidate for multi-threading.
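For reference, a minimal saxpy sketch with an OpenMP SIMD directive (the subroutine name and kinds are just for illustration); on x86 the vectorised loop is typically mapped to AVX instructions when suitable architecture flags such as -march=native are used:
! Minimal saxpy sketch: "parallel do simd" distributes iterations across
! threads and asks for vectorisation of each thread's chunk (typically
! AVX on x86). Subroutine name and kinds are illustrative only.
subroutine saxpy_simd ( n, a, x, y )
implicit none
integer, intent(in) :: n
real ( kind = 8 ), intent(in) :: a, x(n)
real ( kind = 8 ), intent(inout) :: y(n)
integer :: i
!$omp parallel do simd
do i = 1, n
y(i) = a*x(i) + y(i)
end do
!$omp end parallel do simd
end subroutine saxpy_simd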
I used the following two versions of the code. The OpenMP implementation is the same as in: OpenMP code implementation in gfortran and intel - Help - Fortran Discourse (fortran-lang.discourse.group)
1) OpenMP
! OpenMP version of the code
program openmp_test
use iso_fortran_env
implicit none
!=============
integer ( kind = 4 ), parameter :: Nx = 2000
integer ( kind = 4 ), parameter :: Ny = 2000
real ( kind = 8 ) :: dx = 0.03d0
real ( kind = 8 ) :: dy = 0.03d0
real ( kind = 8 ), parameter :: one = 1.0
real ( kind = 8 ), parameter :: two = 2.0
real ( kind = 8 ), parameter :: four = 4.0
real ( kind = 8 ), parameter :: half = 0.5
!=============
integer (kind = 4 ) :: nsteps = 2000
integer (kind = 4 ) :: nprint = 100
integer (kind = 4 ) :: tsteps
real ( kind = 8 ) :: dtime = 1.0d-4
!===============
real ( kind = 8 ) :: tau = 0.0003d0
real ( kind = 8 ) :: epsilonb = 0.01d0
real ( kind = 8 ) :: kappa = 1.8d0
real ( kind = 8 ) :: delta = 0.02d0
real ( kind = 8 ) :: aniso = 6.0d0
real ( kind = 8 ) :: alpha = 0.9d0
real ( kind = 8 ) :: gama = 10.0d0
real ( kind = 8 ) :: teq = 1.0d0
real ( kind = 8 ) :: theta0= 0.2d0
real ( kind = 8 ) :: seed = 5.0d0
real ( kind = 8 ) :: pix = four*atan(one)
!===============
real ( kind = 8 ) , dimension( Nx, Ny ) :: phi, tempr
real ( kind = 8 ) , dimension( Nx, Ny ) :: lap_phi, lap_tempr
real ( kind = 8 ) , dimension( Nx, Ny ) :: phidx, phidy
real ( kind = 8 ) , dimension( Nx, Ny ) :: epsil, epsilon_deriv
real ( kind = 8 ) :: phi_old, term1, term2
real ( kind = 8 ) :: theta, m
integer ( kind = 4 ) :: i, j, ip, im, jp, jm
!============================================================
write (*,*) 'Vern : ',compiler_version ()
write (*,*) 'Opts : ',compiler_options ()
call timer ( 'start' )
phi = 0.0
tempr = 0.0
do i = 1, Nx
do j = 1, Ny
if ( (i - Nx/two)*(i - Nx/two) + (j - Ny/two)*(j - Ny/two) < seed ) then
phi(i,j) = one
end if
end do
end do
call timer ( 'initialise phi' )
!============================================================
time_loop: do tsteps = 1, nsteps
!$omp parallel do default ( none ) &
!$omp& private ( i,j, ip,im, jp,jm, theta ) &
!$omp& shared ( dx, dy, lap_phi, lap_tempr, phi, tempr, phidx, phidy, epsil, epsilon_deriv, &
!$omp& theta0, aniso, delta, epsilonb )
do j = 1, Ny
do i =1, Nx
jp = j + 1
jm = j - 1
ip = i + 1
im = i - 1
if ( im == 0 ) im = Nx
if ( ip == ( Nx + 1) ) ip = 1
if ( jm == 0 ) jm = Ny
if ( jp == ( Ny + 1) ) jp = 1
!=====
lap_phi(i,j) = ( phi(ip,j) + phi(im,j) + phi(i,jm) + phi(i,jp) - four*phi(i,j)) / ( dx*dy )
lap_tempr(i,j) = ( tempr(ip,j) + tempr(im,j) + tempr(i,jm) + tempr(i,jp) - four*tempr(i,j)) / ( dx*dy )
!======
phidx(i,j) = ( phi(ip,j) - phi(im,j) ) / dx
phidy(i,j) = ( phi(i,jp) - phi(i,jm) ) / dy
!======
theta = atan2( phidy(i,j),phidx(i,j) )
!======
epsil(i,j) = epsilonb *( one + delta * cos ( aniso*( theta - theta0 ) ) )
epsilon_deriv(i,j) = -epsilonb * aniso * delta * sin ( aniso*( theta - theta0 ) )
end do
end do
!$omp end parallel do
!$omp parallel do default ( none ) &
!$omp& private ( i,j,ip,im,jp,jm, phi_old, term1, term2, m ) &
!$omp& shared ( dx, dy, phi, tempr, epsil, epsilon_deriv, phidx, phidy, lap_phi, lap_tempr, &
!$omp& alpha, pix, teq, gama, dtime, tau, kappa )
do j = 1, Ny
do i =1, Nx
jp = j + 1
jm = j - 1
ip = i + 1
im = i - 1
if ( im == 0 ) im = Nx
if ( ip == ( Nx + 1) ) ip = 1
if ( jm == 0 ) jm = Ny
if ( jp == ( Ny + 1) ) jp = 1
phi_old = phi(i,j)
!========
term1 = ( epsil(i,jp)*epsilon_deriv(i,jp)*phidx(i,jp) - epsil(i,jm)*epsilon_deriv(i,jm)*phidx(i,jm) ) / dy
term2 = -( epsil(ip,j)*epsilon_deriv(ip,j)*phidy(ip,j) - epsil(im,j)*epsilon_deriv(im,j)*phidy(im,j) ) / dx
!========
m = alpha/pix * atan ( gama*( teq - tempr(i,j) ) )
!========
phi(i,j) = phi(i,j) + ( dtime/tau ) * ( term1 + term2 + epsil(i,j)**2 * lap_phi(i,j) ) &
+ phi_old * ( one - phi_old )*( phi_old - half + m )
tempr(i,j) = tempr(i,j) + dtime*lap_tempr(i,j) + kappa * ( phi(i,j) - phi_old )
end do
end do
!$omp end parallel do
! print steps
if ( mod ( tsteps, nprint ) .eq. 0 ) print *, 'Done steps = ', tsteps
end do time_loop
call timer ( 'time_loop' )
!============================================================================
! output file
open ( 11, file = "phi.dat" )
do i = 1, Nx
write( 11, 10 ) ( phi(i,j), j = 1, Ny )
end do
10 FORMAT (1000000F10.6)
close( 11 )
write (*,*) 'max phi =', maxval (phi)
call timer ( 'report phi' )
end program openmp_test
subroutine timer ( desc )
character desc*(*)
integer*8 :: clock, rate, last_clock = -1
real*8 :: sec
if ( last_clock < 0 ) then
call system_clock ( last_clock, rate )
sec = 0
write (*,11) sec, desc
else
call system_clock ( clock, rate )
sec = dble ( clock - last_clock ) / dble (rate)
write (*,11) sec, desc
11 format ( f10.3,2x,a )
last_clock = clock
end if
end subroutine timer
2) Do concurrent
! dc version of the main code
program dc_test
use iso_fortran_env
implicit none
!=============
integer ( kind = 4 ), parameter :: Nx = 2000
integer ( kind = 4 ), parameter :: Ny = 2000
real ( kind = 8 ) :: dx = 0.03d0
real ( kind = 8 ) :: dy = 0.03d0
real ( kind = 8 ), parameter :: one = 1.0
real ( kind = 8 ), parameter :: two = 2.0
real ( kind = 8 ), parameter :: four = 4.0
real ( kind = 8 ), parameter :: half = 0.5
!=============
integer (kind = 4 ) :: nsteps = 2000
integer (kind = 4 ) :: nprint = 100
integer (kind = 4 ) :: tsteps
real ( kind = 8 ) :: dtime = 1.0d-4
!===============
real ( kind = 8 ) :: tau = 0.0003d0
real ( kind = 8 ) :: epsilonb = 0.01d0
real ( kind = 8 ) :: kappa = 1.8d0
real ( kind = 8 ) :: delta = 0.02d0
real ( kind = 8 ) :: aniso = 6.0d0
real ( kind = 8 ) :: alpha = 0.9d0
real ( kind = 8 ) :: gama = 10.0d0
real ( kind = 8 ) :: teq = 1.0d0
real ( kind = 8 ) :: theta0= 0.2d0
real ( kind = 8 ) :: seed = 5.0d0
real ( kind = 8 ) :: pix = four*atan(one)
!===============
real ( kind = 8 ) , dimension( Nx, Ny ) :: phi, tempr
real ( kind = 8 ) , dimension( Nx, Ny ) :: lap_phi, lap_tempr
real ( kind = 8 ) , dimension( Nx, Ny ) :: phidx, phidy
real ( kind = 8 ) , dimension( Nx, Ny ) :: epsil, epsilon_deriv
real ( kind = 8 ) :: phi_old, term1, term2
real ( kind = 8 ) :: theta, m
integer ( kind = 4 ) :: i, j, ip, im, jp, jm
!============================================================
write (*,*) 'Vern : ',compiler_version ()
write (*,*) 'Opts : ',compiler_options ()
call timer ( 'start' )
phi = 0.0
tempr = 0.0
do i = 1, Nx
do j = 1, Ny
if ( (i - Nx/two)*(i - Nx/two) + (j - Ny/two)*(j - Ny/two) < seed ) then
phi(i,j) = one
end if
end do
end do
call timer ( 'initialise phi' )
!============================================================
time_loop: do tsteps = 1, nsteps
do concurrent (j=1:Ny,i=1:Nx) default (none) &
local ( ip,im, jp,jm, theta ) &
shared ( dx, dy, lap_phi, lap_tempr, phi, tempr, phidx, phidy, epsil, epsilon_deriv, &
theta0, aniso, delta, epsilonb )
jp = j + 1
jm = j - 1
ip = i + 1
im = i - 1
if ( im == 0 ) im = Nx
if ( ip == ( Nx + 1) ) ip = 1
if ( jm == 0 ) jm = Ny
if ( jp == ( Ny + 1) ) jp = 1
!=====
lap_phi(i,j) = ( phi(ip,j) + phi(im,j) + phi(i,jm) + phi(i,jp) - four*phi(i,j)) / ( dx*dy )
lap_tempr(i,j) = ( tempr(ip,j) + tempr(im,j) + tempr(i,jm) + tempr(i,jp) - four*tempr(i,j)) / ( dx*dy )
!======
phidx(i,j) = ( phi(ip,j) - phi(im,j) ) / dx
phidy(i,j) = ( phi(i,jp) - phi(i,jm) ) / dy
!======
theta = atan2( phidy(i,j),phidx(i,j) )
!======
epsil(i,j) = epsilonb *( one + delta * cos ( aniso*( theta - theta0 ) ) )
epsilon_deriv(i,j) = -epsilonb * aniso * delta * sin ( aniso*( theta - theta0 ) )
end do
do concurrent (j=1:Ny,i=1:Nx) default (none) &
local (ip,im,jp,jm, phi_old, term1, term2, m ) &
shared ( dx, dy, phi, tempr, epsil, epsilon_deriv, phidx, phidy, lap_phi, lap_tempr, &
alpha, pix, teq, gama, dtime, tau, kappa )
jp = j + 1
jm = j - 1
ip = i + 1
im = i - 1
if ( im == 0 ) im = Nx
if ( ip == ( Nx + 1) ) ip = 1
if ( jm == 0 ) jm = Ny
if ( jp == ( Ny + 1) ) jp = 1
phi_old = phi(i,j)
!========
term1 = ( epsil(i,jp)*epsilon_deriv(i,jp)*phidx(i,jp) - epsil(i,jm)*epsilon_deriv(i,jm)*phidx(i,jm) ) / dy
term2 = -( epsil(ip,j)*epsilon_deriv(ip,j)*phidy(ip,j) - epsil(im,j)*epsilon_deriv(im,j)*phidy(im,j) ) / dx
!========
m = alpha/pix * atan ( gama*( teq - tempr(i,j) ) )
!========
phi(i,j) = phi(i,j) + ( dtime/tau ) * ( term1 + term2 + epsil(i,j)**2 * lap_phi(i,j) ) &
+ phi_old * ( one - phi_old )*( phi_old - half + m )
tempr(i,j) = tempr(i,j) + dtime*lap_tempr(i,j) + kappa * ( phi(i,j) - phi_old )
end do
! print steps
if ( mod ( tsteps, nprint ) .eq. 0 ) print *, 'Done steps = ', tsteps
end do time_loop
call timer ( 'time_loop' )
!============================================================================
! output file
open ( 11, file = "phi.dat" )
do i = 1, Nx
write( 11, 10 ) ( phi(i,j), j = 1, Ny )
end do
10 FORMAT (1000000F10.6)
close( 11 )
write (*,*) 'max phi =', maxval (phi)
call timer ( 'report phi' )
end program dc_test
subroutine timer ( desc )
character desc*(*)
integer*8 :: clock, rate, last_clock = -1
real*8 :: sec
if ( last_clock < 0 ) then
call system_clock ( last_clock, rate )
sec = 0
write (*,11) sec, desc
else
call system_clock ( clock, rate )
sec = dble ( clock - last_clock ) / dble (rate)
write (*,11) sec, desc
11 format ( f10.3,2x,a )
last_clock = clock
end if
end subroutine timer
I run on Ubuntu 20.04 with the following commands.
1) OpenMP test
shahid@shahid-G3-3579:~$ nvfortran omp.f90 -o omp -mp -Minfo
openmp_test:
66, FMA (fused multiply-add) instruction(s) generated
83, !$omp parallel
104, FMA (fused multiply-add) instruction(s) generated
105, FMA (fused multiply-add) instruction(s) generated
118, FMA (fused multiply-add) instruction(s) generated
127, !$omp parallel
188, maxval reduction inlined
shahid@shahid-G3-3579:~$ ./omp
Vern : nvfortran 24.7-0
Opts : omp.f90 -o omp -mp -Minfo
0.000 start
0.038 initialise phi
Done steps = 100
Done steps = 200
Done steps = 300
Done steps = 400
Done steps = 500
Done steps = 600
Done steps = 700
Done steps = 800
Done steps = 900
Done steps = 1000
Done steps = 1100
Done steps = 1200
Done steps = 1300
Done steps = 1400
Done steps = 1500
Done steps = 1600
Done steps = 1700
Done steps = 1800
Done steps = 1900
Done steps = 2000
87.198 time_loop
max phi = 1.002807743736036
0.266 report phi
2) Do concurrent test
shahid@shahid-G3-3579:~$ nvfortran dc.f90 -o dc -stdpar=multicore -Minfo
dc_test:
62, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
63, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
67, FMA (fused multiply-add) instruction(s) generated
83, Generating Multicore code
83, Loop parallelized across CPU threads
101, FMA (fused multiply-add) instruction(s) generated
102, FMA (fused multiply-add) instruction(s) generated
115, FMA (fused multiply-add) instruction(s) generated
120, Generating Multicore code
120, Loop parallelized across CPU threads
140, FMA (fused multiply-add) instruction(s) generated
141, FMA (fused multiply-add) instruction(s) generated
149, FMA (fused multiply-add) instruction(s) generated
151, FMA (fused multiply-add) instruction(s) generated
176, maxval reduction inlined
shahid@shahid-G3-3579:~$ ./dc
Vern : nvfortran 24.7-0
Opts : dc.f90 -o dc -stdpar=multicore -Minfo
0.000 start
0.030 initialise phi
Done steps = 100
Done steps = 200
Done steps = 300
Done steps = 400
Done steps = 500
Done steps = 600
Done steps = 700
Done steps = 800
Done steps = 900
Done steps = 1000
Done steps = 1100
Done steps = 1200
Done steps = 1300
Done steps = 1400
Done steps = 1500
Done steps = 1600
Done steps = 1700
Done steps = 1800
Done steps = 1900
Done steps = 2000
71.349 time_loop
max phi = 1.002807743736036
0.257 report phi
It seems that do concurrent is faster, but only at the larger domain size, i.e. Nx=2000, Ny=2000. I tested with Nx=100 and Ny=100 and the computed times were the same.
FMA (fused multiply-add) instructions are generated by the NVIDIA compiler here. Are they the same as AVX instructions?
I think I have seen this program before. I am having trouble running it on Win 10 with gfortran 12.3, as it does not terminate (corrupt stack?).
I made some changes to the OpenMP version by:
- moving all array declarations to a module (I think there is a problem with the stack size)
- changing atan2 to zatan2, which checks for both arguments being zero
- introducing time accumulators to check the two OMP regions
- introducing call omp_set_num_threads (nth) to test a variable number of threads
- compiling with varying options, although I still can’t get the program to stop properly.
My batch file is
set prog=%1
set options=-fimplicit-none -fallow-argument-mismatch -O2 -march=native -o %prog%.exe
set options=-fimplicit-none -fallow-argument-mismatch -O2 -march=native -fopenmp -o %prog%.exe
now >> %prog%.log
del %prog%.exe
gfortran %prog%.f90 %options%
%prog% >> %prog%.log
type %prog%.log
The elapsed times for different thread counts are:
8 threads 56.3 sec
4 threads 87.0 sec
2 threads 143.0 sec
no Openmp 287 sec
So OpenMP definitely has some effect.
However, there could still be errors in the program that I have not identified, which makes timing tests a problem, for example floating point underflow or invalid atan2 arguments.
On 64-bit Windows, the default stack size is ridiculously small. (I will modify the bat file and check again.)
My latest test version is:
! OpenMP version of the code
!
! the original program does not run with OpenMP on Ver 12.2
! possible changes
! 1) patch atan2 error
! 2) use module
! 3) change stack size
! 4) try Gfortran ver 11
module All_data!
implicit none
integer, parameter :: i4 = selected_int_kind (12)
integer, parameter :: r8 = selected_real_kind (12)
!=============
! integer ( kind = i4 ), parameter :: Nx = 2000
! integer ( kind = i4 ), parameter :: Ny = 2000
integer ( kind = i4 ), parameter :: Nx = 2000
integer ( kind = i4 ), parameter :: Ny = 2000
real ( kind = r8 ) :: dx = 0.03d0
real ( kind = r8 ) :: dy = 0.03d0
real ( kind = r8 ), parameter :: one = 1.0
real ( kind = r8 ), parameter :: two = 2.0
real ( kind = r8 ), parameter :: four = 4.0
real ( kind = r8 ), parameter :: half = 0.5
!=============
integer (kind = i4 ) :: nsteps = 2000
integer (kind = i4 ) :: nprint = 100
integer (kind = i4 ) :: tsteps
real ( kind = r8 ) :: dtime = 1.0d-4
!===============
real ( kind = r8 ) :: tau = 0.0003d0
real ( kind = r8 ) :: epsilonb = 0.01d0
real ( kind = r8 ) :: kappa = 1.8d0
real ( kind = r8 ) :: delta = 0.02d0
real ( kind = r8 ) :: aniso = 6.0d0
real ( kind = r8 ) :: alpha = 0.9d0
real ( kind = r8 ) :: gama = 10.0d0
real ( kind = r8 ) :: teq = 1.0d0
real ( kind = r8 ) :: theta0= 0.2d0
real ( kind = r8 ) :: seed = 5.0d0
real ( kind = r8 ) :: pix = four*atan(one)
!===============
real ( kind = r8 ) , dimension( Nx, Ny ) :: phi, tempr
real ( kind = r8 ) , dimension( Nx, Ny ) :: lap_phi, lap_tempr
real ( kind = r8 ) , dimension( Nx, Ny ) :: phidx, phidy
real ( kind = r8 ) , dimension( Nx, Ny ) :: epsil, epsilon_deriv
real ( kind = r8 ) :: phi_old, term1, term2
real ( kind = r8 ) :: theta, m
integer ( kind = i4 ) :: i, j, ip, im, jp, jm
contains
real*8 function delta_sec ()
integer*8 :: clock, rate = 1, last_clock = -1
real*8 :: sec
if ( last_clock < 0 ) then
call system_clock ( last_clock, rate )
sec = 0
else
call system_clock ( clock )
sec = dble ( clock - last_clock ) / dble (rate)
last_clock = clock
end if
delta_sec = sec
end function delta_sec
end module all_data
program openmp_test
use iso_fortran_env
use all_data
real*8 :: secs, seca, secb
integer :: nth = 1
logical :: do_loops = .true.
write (*,*) 'start of program', i4,r8
!============================================================
write (*,*) 'Vern : ', compiler_version ()
write (*,*) 'Opts : ', compiler_options ()
!$ nth = 2
!$ call omp_set_num_threads (nth)
write (*,*) 'threads set to', nth
call timer ( 'start' )
phi = 0.0
tempr = 0.0
secs = 0
seca = 0
secb = 0
do j = 1, Ny
do i = 1, Nx
if ( (i - Nx/two)*(i - Nx/two) + (j - Ny/two)*(j - Ny/two) < seed ) then
phi(i,j) = one
end if
end do
end do
call timer ( 'initialise phi' )
!============================================================
time_loop: do tsteps = 1, nsteps
secs = secs + delta_sec ()
if ( do_loops ) then
!$omp parallel do default ( none ) &
!$omp& private ( i,j, ip,im, jp,jm, theta ) &
!$omp& shared ( dx, dy, lap_phi, lap_tempr, phi, tempr, phidx, phidy, epsil, epsilon_deriv, &
!$omp& theta0, aniso, delta, epsilonb )
do j = 1, Ny
do i =1, Nx
jp = j + 1
jm = j - 1
ip = i + 1
im = i - 1
if ( im == 0 ) im = Nx
if ( ip == ( Nx + 1) ) ip = 1
if ( jm == 0 ) jm = Ny
if ( jp == ( Ny + 1) ) jp = 1
!=====
lap_phi(i,j) = ( phi(ip,j) + phi(im,j) + phi(i,jm) + phi(i,jp) - four*phi(i,j)) / ( dx*dy )
lap_tempr(i,j) = ( tempr(ip,j) + tempr(im,j) + tempr(i,jm) + tempr(i,jp) - four*tempr(i,j)) / ( dx*dy )
!======
phidx(i,j) = ( phi(ip,j) - phi(im,j) ) / dx
phidy(i,j) = ( phi(i,jp) - phi(i,jm) ) / dy
!======
theta = zatan2 ( phidy(i,j), phidx(i,j) )
!======
epsil(i,j) = epsilonb *( one + delta * cos ( aniso*( theta - theta0 ) ) )
epsilon_deriv(i,j) = -epsilonb * aniso * delta * sin ( aniso*( theta - theta0 ) )
end do
end do
!$omp end parallel do
end if
seca = seca + delta_sec ()
if ( do_loops ) then
!$omp parallel do default ( none ) &
!$omp& private ( i,j,ip,im,jp,jm, phi_old, term1, term2, m ) &
!$omp& shared ( dx, dy, phi, tempr, epsil, epsilon_deriv, phidx, phidy, lap_phi, lap_tempr, &
!$omp& alpha, pix, teq, gama, dtime, tau, kappa )
do j = 1, Ny
do i =1, Nx
jp = j + 1
jm = j - 1
ip = i + 1
im = i - 1
if ( im == 0 ) im = Nx
if ( ip == ( Nx + 1) ) ip = 1
if ( jm == 0 ) jm = Ny
if ( jp == ( Ny + 1) ) jp = 1
phi_old = phi(i,j)
!========
term1 = ( epsil(i,jp)*epsilon_deriv(i,jp)*phidx(i,jp) - epsil(i,jm)*epsilon_deriv(i,jm)*phidx(i,jm) ) / dy
term2 = -( epsil(ip,j)*epsilon_deriv(ip,j)*phidy(ip,j) - epsil(im,j)*epsilon_deriv(im,j)*phidy(im,j) ) / dx
!========
m = alpha/pix * atan ( gama*( teq - tempr(i,j) ) )
!========
phi(i,j) = phi(i,j) + ( dtime/tau ) * ( term1 + term2 + epsil(i,j)**2 * lap_phi(i,j) ) &
+ phi_old * ( one - phi_old )*( phi_old - half + m )
tempr(i,j) = tempr(i,j) + dtime*lap_tempr(i,j) + kappa * ( phi(i,j) - phi_old )
end do
end do
!$omp end parallel do
end if
secb = secb + delta_sec ()
! print steps
if ( mod ( tsteps, nprint ) .eq. 0 ) print *, 'Done steps = ', tsteps
end do time_loop
call timer ( 'time_loop' )
write (*,*) 'stages', secs,seca,secb
!============================================================================
! output file
open ( 11, file = "phi.dat" )
do i = 1, Nx
write( 11, 10 ) ( phi(i,j), j = 1, Ny )
end do
10 FORMAT (1000000F10.6)
close( 11 )
write (*,*) 'max phi =', maxval (phi)
call timer ( 'report phi' )
!z call exit (0)
contains
real*8 function zatan2 ( y,x )
real*8 :: y,x, res
if ( abs (x) > 0 .or. abs (y) > 0 ) then
res = atan2 ( y, x )
else
res = 0
end if
zatan2 = res
end function zatan2
end program openmp_test
subroutine timer ( desc )
character desc*(*)
integer*8 :: clock, rate, last_clock = -1
real*8 :: sec
if ( last_clock < 0 ) then
call system_clock ( last_clock, rate )
sec = 0
write (*,11) sec, desc
else
call system_clock ( clock, rate )
sec = dble ( clock - last_clock ) / dble (rate)
write (*,11) sec, desc
11 format ( f10.3,2x,a )
last_clock = clock
end if
end subroutine timer
Yeah, it is the same code you modified last time. However, my concern is the performance comparison for nvfortran on Ubuntu 20.04. I found that the performance of do concurrent and OpenMP is the same if the domain size is small, i.e. Nx=100 and Ny=100, but if it is increased to Nx=2000 and Ny=2000 there is a big time difference, and the OpenMP performance is worse in that case.
Going from 100x100 to 2000x2000 is 400 times bigger so the increase in run time should not be unexpected.
The use of Nvidia off-loading with DO CONCURRENT looks to be a very interesting opportunity.
It appears to provide the basics of shared/private array nomination, which looks good.
It appears to be an “easy access” to large thread counts on GPUs.
I would expect that the same issues that apply to OpenMP would apply here, especially memory transfer bandwidth limitations.
Please update us on how you go with this approach and the pc configuration you are using.
Also, you should check your test example, as floating point underflow and use of sin/cos/atan with unusual values can distort the run times, which confuses the test interpretation.
Yeah, but at this moment I am interested only in the CPU, especially how do concurrent and OpenMP performance compare for nvfortran. For Intel ifort it appears to make no difference and both codes (OpenMP and do concurrent) have the same performance, but nvfortran behaves differently. Maybe it has to do with the OpenACC implementation behind do concurrent in the NVIDIA compiler. I am not familiar with ACC, so I cannot say much about it. I read that NVIDIA uses OpenACC under the hood for do concurrent and uses OpenMP for OpenMP; perhaps that gives some performance differences.
I am running on my laptop. It is a bit old: a Dell G3 with Ubuntu 20.04, an 8th-generation Intel Core i5 processor, and 16 GB of memory.
I checked the output with a graphics library. It looks fine.
The reason for the performance difference is that you are collapsing the loops in the DC case but not in the OpenMP case. You must add collapse(2) to achieve an equivalent semantic in the OpenMP DO statement.
Most of the other theories here are wrong. NVIDIA Fortran implements DC using the OpenACC back-end, which will generate multithreaded plus SIMD code in a similar way to OpenMP. The overheads of launching a thread pool should be similar. One should, of course, do at least one warmup iteration before starting the timer in order to get clean results.
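For illustration, a minimal sketch of the collapse clause applied to the first loop nest (the directives match the modified version posted further down the thread; the loop body is unchanged and elided here):
! Sketch: collapse(2) merges the j/i nest into one iteration space of
! Nx*Ny iterations for the OpenMP runtime to distribute, matching what
! do concurrent (j=1:Ny, i=1:Nx) expresses.
!$omp parallel do collapse(2) default ( none ) &
!$omp& private ( i,j, ip,im, jp,jm, theta ) &
!$omp& shared ( dx, dy, lap_phi, lap_tempr, phi, tempr, phidx, phidy, epsil, epsilon_deriv, &
!$omp& theta0, aniso, delta, epsilonb )
do j = 1, Ny
do i = 1, Nx
! ... same loop body as before ...
end do
end do
!$omp end parallel do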
For the two loops:
do j = 1, Ny
do i =1, Nx
where Ny is much greater than the number of available threads, I have not seen any improvement from the use of “collapse (2)”.
In this case, where Ny = 2000, collapse should make little difference.
I have now included a review of the array phi, using:
subroutine print_phi
use all_data
integer ( kind = i4 ) :: i, j, j1, k, nz, nt, nv, ns
character :: values(20)*11
open ( unit = 11, file = "phi.dat" )
nz = 0
nt = 0
ns = 0
nv = 0
do i = 1, Nx
write (11, 10 ) i, Ny
do j = 1,ny,20
do j1 = j,min(j+19,ny)
k = j1-j+1
if ( phi(i,j1) == 0 ) then
values(k) = ' 0.'
nz = nz+1
else
write ( values(k), 11) phi(i,j1)
if ( abs(phi(i,j1)) < 1.d-200 ) then
nt = nt+1
else if ( abs(phi(i,j1)) < 1.d-30 ) then
ns = ns+1
else
nv = nv+1
end if
end if
end do
write ( 11, 12 ) values(1:k)
end do
end do
write (11, *) 'max phi = ', maxval (phi)
write (11,14) 'zero values ', nz
write (11,14) 'values < 1.d-200', nt
write (11,14) 'values < 1.d-30 ', ns
write (11,14) 'values > 1.d-30 ', nv
write (*, *) 'max phi = ', maxval (phi)
write (*,14) 'zero values ', nz
write (*,14) 'values < 1.d-200', nt
write (*,14) 'values < 1.d-30 ', ns
write (*,14) 'values > 1.d-30 ', nv
close( unit = 11 )
10 format ( 2i6 )
11 FORMAT ( es11.3 )
12 format ( 20a )
14 format ( a, i8)
end subroutine print_phi
The results summary is now:
start of program 4 8
Vern : GCC version 12.3.0
Opts : ... -O3 -fimplicit-none -fopenmp
threads set to 10
0.000 start
0.016 initialise phi
Done steps = 100
Done steps = 200
...
Done steps = 1900
Done steps = 2000
49.309 time_loop
stages 0.0004 32.7309 16.5781
max phi = 1.0028077443407950
zero values 3319603 83.0%
values < 1.d-307 48158 1.2%
values < 1.d-200 297038 7.4%
values < 1.d-30 291174 7.3%
values > 1.d-30 44027 1.1%
1.273 report phi
test completed 10
This distribution of magnitude shows a significant number of very small values, which would suggest a large number of floating point underflows.
The program termination does report IEEE_UNDERFLOW_FLAG and IEEE_DENORMAL occurring.
Depending on how underflow is managed, this can produce significant variation in run times between different compilers.
The initial conditions for this numerical calculation should be revised to achieve results that are easier to compare between compilers.
Some tests I performed based on the above discussion. The system specs are:
| Processor | 11th Gen Intel(R) Core™ i5-11500 @ 2.70GHz 2.71 GHz |
|---|---|
| Installed RAM | 8.00 GB (7.83 GB usable) |
| System type | 64-bit operating system, x64-based processor |
The results are:
Test 1

| | Nx | Ny | Intel (2021.13.0) | gfortran (13.2) | No. of threads |
|---|---|---|---|---|---|
| No collapse clause | 200 | 200 | 0.588 | 5.681 | 2 |
| With collapse clause | 200 | 200 | 1.099 | 5.7 | 2 |

Test 2

| | Nx | Ny | Intel (2021.13.0) | gfortran (13.2) | No. of threads |
|---|---|---|---|---|---|
| No collapse clause | 2000 | 2000 | 98.173 | 522.845 | 2 |
| With collapse clause | 2000 | 2000 | 149.886 | 467.451 | 2 |

Test 3

| | Nx | Ny | Intel (2021.13.0) | gfortran (13.2) | No. of threads |
|---|---|---|---|---|---|
| No collapse clause | 200 | 200 | 0.3143 | 2.363 | 8 |
| With collapse clause | 200 | 200 | 0.531 | 2.362 | 8 |

Test 4

| | Nx | Ny | Intel (2021.13.0) | gfortran (13.2) | No. of threads |
|---|---|---|---|---|---|
| No collapse clause | 2000 | 2000 | 114.347 | 186.003 | 8 |
| With collapse clause | 2000 | 2000 | 179.237 | 186.576 | 8 |
My code version is
! OpenMP version of the code
!
! the original program does not run with OpenMP on Ver 12.2
! possible changes
! 1) patch atan2 error
! 2) use module
! 3) change stack size
! 4) try Gfortran ver 11
! Shahid version
! ===============
! 1) Put these declarations to the main program.
! integer ( kind = i4 ) :: i, j, ip, im, jp, jm
!
! intel error
! --------------
! error #6401: The attributes of this name conflict with those made accessible by a USE statement. [J]
! jp = j + 1
!
! 2) User where construct for phi. It limits the value of phi between 0.000001 and 0.999999.
!
! 3) print dimensions Nx and Ny
!
! 4) add subroutine print_phi in the file
!
module All_data!
implicit none
integer, parameter :: i4 = selected_int_kind (12)
integer, parameter :: r8 = selected_real_kind (12)
!=============
! integer ( kind = i4 ), parameter :: Nx = 2000
! integer ( kind = i4 ), parameter :: Ny = 2000
integer ( kind = i4 ), parameter :: Nx = 2000
integer ( kind = i4 ), parameter :: Ny = 2000
real ( kind = r8 ) :: dx = 0.03d0
real ( kind = r8 ) :: dy = 0.03d0
real ( kind = r8 ), parameter :: one = 1.0
real ( kind = r8 ), parameter :: two = 2.0
real ( kind = r8 ), parameter :: four = 4.0
real ( kind = r8 ), parameter :: half = 0.5
!=============
integer (kind = i4 ) :: nsteps = 2000
integer (kind = i4 ) :: nprint = 100
integer (kind = i4 ) :: tsteps
real ( kind = r8 ) :: dtime = 1.0d-4
!===============
real ( kind = r8 ) :: tau = 0.0003d0
real ( kind = r8 ) :: epsilonb = 0.01d0
real ( kind = r8 ) :: kappa = 1.8d0
real ( kind = r8 ) :: delta = 0.02d0
real ( kind = r8 ) :: aniso = 6.0d0
real ( kind = r8 ) :: alpha = 0.9d0
real ( kind = r8 ) :: gama = 10.0d0
real ( kind = r8 ) :: teq = 1.0d0
real ( kind = r8 ) :: theta0= 0.2d0
real ( kind = r8 ) :: seed = 5.0d0
real ( kind = r8 ) :: pix = four*atan(one)
!===============
real ( kind = r8 ) , dimension( Nx, Ny ) :: phi, tempr
real ( kind = r8 ) , dimension( Nx, Ny ) :: lap_phi, lap_tempr
real ( kind = r8 ) , dimension( Nx, Ny ) :: phidx, phidy
real ( kind = r8 ) , dimension( Nx, Ny ) :: epsil, epsilon_deriv
real ( kind = r8 ) :: phi_old, term1, term2
real ( kind = r8 ) :: theta, m
! integer ( kind = i4 ) :: i, j, ip, im, jp, jm
contains
real*8 function delta_sec ()
integer*8 :: clock, rate = 1, last_clock = -1
real*8 :: sec
if ( last_clock < 0 ) then
call system_clock ( last_clock, rate )
sec = 0
else
call system_clock ( clock )
sec = dble ( clock - last_clock ) / dble (rate)
last_clock = clock
end if
delta_sec = sec
end function delta_sec
end module all_data
program openmp_test
use iso_fortran_env
use all_data
integer ( kind = 4 ) :: i, j, ip, im, jp, jm
real*8 :: secs, seca, secb
integer :: nth = 1
logical :: do_loops = .true.
write (*,*) 'start of program', i4,r8
!============================================================
write (*,*) 'Vern : ', compiler_version ()
write (*,*) 'Opts : ', compiler_options ()
!$ nth = 8
!$ call omp_set_num_threads (nth)
write (*,*) 'threads set to', nth
call timer ( 'start' )
phi = 0.0
tempr = 0.0
secs = 0
seca = 0
secb = 0
do j = 1, Ny
do i = 1, Nx
if ( (i - Nx/two)*(i - Nx/two) + (j - Ny/two)*(j - Ny/two) < seed ) then
phi(i,j) = one
end if
end do
end do
call timer ( 'initialise phi' )
!============================================================
time_loop: do tsteps = 1, nsteps
secs = secs + delta_sec ()
if ( do_loops ) then
!collapse (2)
!$omp parallel do collapse (2) default ( none ) &
!$omp& private ( i,j, ip,im, jp,jm, theta ) &
!$omp& shared ( dx, dy, lap_phi, lap_tempr, phi, tempr, phidx, phidy, epsil, epsilon_deriv, &
!$omp& theta0, aniso, delta, epsilonb )
do j = 1, Ny
do i =1, Nx
jp = j + 1
jm = j - 1
ip = i + 1
im = i - 1
if ( im == 0 ) im = Nx
if ( ip == ( Nx + 1) ) ip = 1
if ( jm == 0 ) jm = Ny
if ( jp == ( Ny + 1) ) jp = 1
!=====
lap_phi(i,j) = ( phi(ip,j) + phi(im,j) + phi(i,jm) + phi(i,jp) - four*phi(i,j)) / ( dx*dy )
lap_tempr(i,j) = ( tempr(ip,j) + tempr(im,j) + tempr(i,jm) + tempr(i,jp) - four*tempr(i,j)) / ( dx*dy )
!======
phidx(i,j) = ( phi(ip,j) - phi(im,j) ) / dx
phidy(i,j) = ( phi(i,jp) - phi(i,jm) ) / dy
!======
theta = zatan2 ( phidy(i,j), phidx(i,j) )
!======
epsil(i,j) = epsilonb *( one + delta * cos ( aniso*( theta - theta0 ) ) )
epsilon_deriv(i,j) = -epsilonb * aniso * delta * sin ( aniso*( theta - theta0 ) )
end do
end do
!$omp end parallel do
end if
seca = seca + delta_sec ()
if ( do_loops ) then
!collapse (2)
!$omp parallel do collapse (2) default ( none ) &
!$omp& private ( i,j,ip,im,jp,jm, phi_old, term1, term2, m ) &
!$omp& shared ( dx, dy, phi, tempr, epsil, epsilon_deriv, phidx, phidy, lap_phi, lap_tempr, &
!$omp& alpha, pix, teq, gama, dtime, tau, kappa )
do j = 1, Ny
do i =1, Nx
jp = j + 1
jm = j - 1
ip = i + 1
im = i - 1
if ( im == 0 ) im = Nx
if ( ip == ( Nx + 1) ) ip = 1
if ( jm == 0 ) jm = Ny
if ( jp == ( Ny + 1) ) jp = 1
phi_old = phi(i,j)
!========
term1 = ( epsil(i,jp)*epsilon_deriv(i,jp)*phidx(i,jp) - epsil(i,jm)*epsilon_deriv(i,jm)*phidx(i,jm) ) / dy
term2 = -( epsil(ip,j)*epsilon_deriv(ip,j)*phidy(ip,j) - epsil(im,j)*epsilon_deriv(im,j)*phidy(im,j) ) / dx
!========
m = alpha/pix * atan ( gama*( teq - tempr(i,j) ) )
!========
phi(i,j) = phi(i,j) + ( dtime/tau ) * ( term1 + term2 + epsil(i,j)**2 * lap_phi(i,j) ) &
+ phi_old * ( one - phi_old )*( phi_old - half + m )
tempr(i,j) = tempr(i,j) + dtime*lap_tempr(i,j) + kappa * ( phi(i,j) - phi_old )
end do
end do
!$omp end parallel do
where (phi > 0.999999 ) phi = 0.999999
where (phi < 0.000001 ) phi = 0.000001
end if
secb = secb + delta_sec ()
! print steps
if ( mod ( tsteps, nprint ) .eq. 0 ) print *, 'Done steps = ', tsteps
end do time_loop
call timer ( 'time_loop' )
Print*, "Nx=", Nx ,"Ny=", Ny
write (*,*) 'stages', secs,seca,secb
!============================================================================
! output file
open ( 11, file = "phi.dat" )
do i = 1, Nx
write( 11, 10 ) ( phi(i,j), j = 1, Ny )
end do
10 FORMAT (1000000F10.6)
close( 11 )
write (*,*) 'max phi =', maxval (phi)
call timer ( 'report phi' )
call print_phi
!z call exit (0)
contains
real*8 function zatan2 ( y,x )
real*8 :: y,x, res
if ( abs (x) > 0 .or. abs (y) > 0 ) then
res = atan2 ( y, x )
else
res = 0
end if
zatan2 = res
end function zatan2
subroutine print_phi
use all_data
integer ( kind = i4 ) :: j1, k, nz, nt, nv, ns
character :: values(20)*11
open ( unit = 11, file = "phi.dat" )
nz = 0
nt = 0
ns = 0
nv = 0
do i = 1, Nx
write (11, 10 ) i, Ny
do j = 1,ny,20
do j1 = j,min(j+19,ny)
k = j1-j+1
if ( phi(i,j1) == 0 ) then
values(k) = ' 0.'
nz = nz+1
else
write ( values(k), 11) phi(i,j1)
if ( abs(phi(i,j1)) < 1.d-200 ) then
nt = nt+1
else if ( abs(phi(i,j1)) < 1.d-30 ) then
ns = ns+1
else
nv = nv+1
end if
end if
end do
write ( 11, 12 ) values(1:k)
end do
end do
write (11, *) 'max phi = ', maxval (phi)
write (11,14) 'zero values ', nz
write (11,14) 'values < 1.d-200', nt
write (11,14) 'values < 1.d-30 ', ns
write (11,14) 'values > 1.d-30 ', nv
write (*, *) 'max phi = ', maxval (phi)
write (*,14) 'zero values ', nz
write (*,14) 'values < 1.d-200', nt
write (*,14) 'values < 1.d-30 ', ns
write (*,14) 'values > 1.d-30 ', nv
close( unit = 11 )
10 format ( 2i6 )
11 FORMAT ( es11.3 )
12 format ( 20a )
14 format ( a, i8)
end subroutine print_phi
end program openmp_test
subroutine timer ( desc )
character desc*(*)
integer*8 :: clock, rate, last_clock = -1
real*8 :: sec
if ( last_clock < 0 ) then
call system_clock ( last_clock, rate )
sec = 0
write (*,11) sec, desc
else
call system_clock ( clock, rate )
sec = dble ( clock - last_clock ) / dble (rate)
write (*,11) sec, desc
11 format ( f10.3,2x,a )
last_clock = clock
end if
end subroutine timer
One of the outputs from gfortran is:
C:\Users\owner\Desktop>gfortran omp1.f90 -o omp1 -fopenmp
C:\Users\owner\Desktop>omp1
start of program 8 8
Vern : GCC version 13.2.0
Opts : -mtune=generic -march=x86-64 -mthreads -fopenmp
threads set to 8
0.000 start
0.025 initialise phi
Done steps = 100
Done steps = 200
Done steps = 300
Done steps = 400
Done steps = 500
Done steps = 600
Done steps = 700
Done steps = 800
Done steps = 900
Done steps = 1000
Done steps = 1100
Done steps = 1200
Done steps = 1300
Done steps = 1400
Done steps = 1500
Done steps = 1600
Done steps = 1700
Done steps = 1800
Done steps = 1900
Done steps = 2000
186.576 time_loop
Nx= 2000 Ny= 2000
stages 2.5224999999999497E-003 117.58536029999989 68.988256199999995
max phi = 0.99999898672103882
1.393 report phi
max phi = 0.99999898672103882
zero values 0
values < 1.d-200 0
values < 1.d-30 0
values > 1.d-30 4000000
Yes. Version 13.2.0 also does not terminate!
I tested Ver 11 and Ver 12, with the same result. There is a bug in gfortran -fopenmp, but this is the only program I have that exhibits this problem.
Your results are very mixed, especially for Intel! I have seen this poor performance in other tests with IEEE floating point underflow errors.
What time are you reporting here? Is it the section between 'initialise_phi' and 'time_loop'?
If you do some counting of the memory accesses, you can quickly estimate how well the code is performing.
First loop nest

| Variable | Write | Read |
|---|---|---|
| phi | 0 | 5 |
| tempr | 0 | 5 |
| lap_phi | 1 | 0 |
| lap_tempr | 1 | 0 |
| phi_dx | 1 | 0 |
| phi_dy | 1 | 0 |
| epsil | 1 | 0 |
| epsilon_deriv | 1 | 0 |
Second loop nest

| Variable | Write | Read |
|---|---|---|
| phi | 1 | 1 |
| epsil | 0 | 5 |
| epsilon_deriv | 0 | 4 |
| phi_dx | 0 | 2 |
| phi_dy | 0 | 2 |
| tempr | 1 | 1 |
| lap_phi | 0 | 1 |
If you sum these values, it gives you the amount of memory accessed to update one grid cell:
| | Write | Read |
|---|---|---|
| Total | 8 | 26 |

So a total of 34 elements per cell update, 8 bytes each, which gives 272 bytes per cell update. I ignored the where(phi > ...) clipping part, which is essentially a third and fourth loop nest, so take the following numbers with a grain of salt.
Now if you calculate the rate (cell updates per second), using the best number from Test 2:
rate = ((2000 * 2000) cells) * (2000 steps) / 98.173 s = 81.5 MUPS (mega cell updates per second)
If you multiply the rate by the memory balance, you get an effective bandwidth:
bw = (81.5 MUPS) * (272 bytes per cell update) = 22168 MB/s = 22.17 GB/s
If you look up the properties of your processor, you’ll find it has a peak memory bandwidth of 50 GB/s, so you are using about 40 % of it. (For the 2000^2 grid, 8 field variables, double precision, you need 256 MB of memory which exceeds the 12 MB L3 cache. The 200^2 case only needs 2.56 MB)
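The same arithmetic as a small Fortran sketch (the only inputs are the Test 2 timing and the 272 bytes per update from the tables above):
! Sketch of the rate/bandwidth estimate above, using the Test 2 timing
! and the 34 accesses x 8 bytes = 272 bytes per cell update from the tables.
program update_rate
implicit none
real ( kind = 8 ), parameter :: nx = 2000, ny = 2000, nsteps = 2000
real ( kind = 8 ), parameter :: elapsed = 98.173d0        ! seconds, Intel, Test 2
real ( kind = 8 ), parameter :: bytes_per_update = 272.0d0
real ( kind = 8 ) :: rate, bw
rate = nx*ny*nsteps / elapsed       ! cell updates per second
bw   = rate * bytes_per_update      ! bytes per second
print '(a,f8.1,a)', 'rate = ', rate/1.0d6, ' MUPS'
print '(a,f8.2,a)', 'bw   = ', bw/1.0d9, ' GB/s'
end program update_rate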
Assuming your code is bandwidth-limited (stencil codes typically are), this is not bad. I’m guessing you could probably go a little faster (say up to 60-70 % of the theoretical bandwidth). Your kernel uses trigonometric functions (sin, cos, atan), so perhaps that gives it a slightly higher arithmetic intensity, which balances out the memory load time. You could verify this by looking at the hardware performance counters with suitable tools. For your loop nests, introducing a layer of halo cells for the periodic boundary condition would make the loops easier to vectorize and perhaps speed them up a little.
Instead of guessing, I’d recommend using tools like the Intel Application Performance Snapshot, as I’ve described before here and also here. The tool essentially helps identify what is the bottleneck (at least from the processor point of view), for instance if it is using scalar or vector instructions, if the cache is stalling, if there is thread imbalance or overhead, etc… I find it really useful. If not obvious, despite being an Intel tool, you can also use it to profile executables produced by other compilers (gfortran, nvfortran), and also on other x86-64 machines (for instance if you have a CPU from AMD).
Another thing you can try is plotting the rate (cell updates per second) as a function of grid size. In this plot you should be able to clearly see the effect of going from cache to main memory. Here is an example of what it looks like when you hit the “memory wall”:
I’d expect you to see something similar.
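A minimal, self-contained sketch of such a scan, using a simplified 5-point stencil sweep as a stand-in for the real kernels (grid sizes and sweep count are arbitrary):
! Sketch: measure cell updates per second for a simple 5-point stencil
! sweep at several grid sizes, to expose the drop in rate when the
! working set no longer fits in cache.
program scan_sizes
implicit none
integer, parameter :: sizes(5) = [ 100, 200, 500, 1000, 2000 ]
integer, parameter :: nsweeps = 200
real ( kind = 8 ), allocatable :: a(:,:), b(:,:)
integer ( kind = 8 ) :: c0, c1, rate
integer :: n, k, s, i, j
real ( kind = 8 ) :: mups
do k = 1, size (sizes)
n = sizes(k)
allocate ( a(n,n), b(n,n) )
a = 1.0d0 ; b = 0.0d0
call system_clock ( c0, rate )
do s = 1, nsweeps
do j = 2, n-1
do i = 2, n-1
b(i,j) = a(i+1,j) + a(i-1,j) + a(i,j+1) + a(i,j-1) - 4.0d0*a(i,j)
end do
end do
a = b
end do
call system_clock ( c1 )
mups = dble (n-2)*dble (n-2)*nsweeps / ( dble (c1-c0)/dble (rate) ) / 1.0d6
print '(i6,f12.1,a)', n, mups, '  MUPS'
deallocate ( a, b )
end do
end program scan_sizes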
This is a nice idea, I will have a look at it.
I have never used these before, so they are on my list now.
I have heard a lot about it before but also never tried it.
There are two curves here. What do they represent? If it is total cell updates per second then it should be a single curve, I think. How do you get the vertical lines for L1 and L2?
I tried the -O3 compiler flag together with collapse(2) for OpenMP. This generates SIMD code for the loop, and I find the performance comparable with do concurrent. Without -O3 I do not see any improvement even when collapse(2) is used in the code.
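For what it’s worth, the SIMD request can also be made explicit in the directive rather than left to -O3 auto-vectorisation; a sketch of the combined construct (body elided, clauses trimmed), with no claim about how much it helps on a given compiler:
! Sketch: combined construct requesting both loop collapsing and SIMD
! vectorisation of the collapsed iteration space.
!$omp parallel do simd collapse(2) private ( ip, im, jp, jm, theta )
do j = 1, Ny
do i = 1, Nx
! ... same loop body as before ...
end do
end do
!$omp end parallel do simd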
Sorry about the rawness, I forgot to add labels. They are two different stencils, a 9-pt stencil and a 21-pt stencil. The colors are different runs. You can find the cache sizes by using the command lscpu. You can also find them listed on the internet: Intel Core i5-11500 Specs | TechPowerUp CPU Database. Then you need to find the corresponding number of array elements. Alternatively, you can put the amount of memory on the x-axis.
@ivanpribec
This is an interesting approach, but it ignores a very important consideration: the unit of memory access is based on memory pages (4 KB). Unfortunately this can make your analysis more complex.
The problem with looking only at bytes is that it underestimates the total memory transfers.
If you access contiguous memory, you get the same result, but if you randomly access 8-byte values, you are effectively touching 4 KB of memory for each access, i.e. 512 times as much memory.
(This is why the inner loop index should be i, which has been emphasised since I first tried to write optimised Fortran calculations.)
Examples of this issue include:
- I changed the following loop order, although it is outside the main calculation loops (the swapped version is sketched after this list)
do i = 1, Nx
do j = 1, Ny
if ( (i - Nx/two)*(i - Nx/two) + (j - Ny/two)*(j - Ny/two) < seed ) then
phi(i,j) = one
end if
end do
end do
- in this set of statements, the first two require 3 pages of memory (for jm, j and jp), the third only 1 page and the fourth 2 pages. Each page is then processed over i, so sequentially, but this requires a 3x memory footprint in L1 and L2 cache (which is another complexity) as well as 3x in L3 cache.
lap_phi(i,j) = ( phi(ip,j) + phi(im,j) + phi(i,jm) + phi(i,jp) - four*phi(i,j)) / ( dx*dy )
lap_tempr(i,j) = ( tempr(ip,j) + tempr(im,j) + tempr(i,jm) + tempr(i,jp) - four*tempr(i,j)) / ( dx*dy )
!======
phidx(i,j) = ( phi(ip,j) - phi(im,j) ) / dx
phidy(i,j) = ( phi(i,jp) - phi(i,jm) ) / dy
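For reference, a sketch of the swapped loop order mentioned in the first point above, with j outermost so that the inner index i runs contiguously through memory (Fortran arrays are column-major); this is also the order used in the later versions of the program:
! Same initialisation with the loop order swapped: the inner loop now
! walks phi(i,j) contiguously down a column.
do j = 1, Ny
do i = 1, Nx
if ( (i - Nx/two)*(i - Nx/two) + (j - Ny/two)*(j - Ny/two) < seed ) then
phi(i,j) = one
end if
end do
end do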
Unfortunately, trying to manage L1, L2 and L3 cache efficiency when there is a memory access bottleneck is very difficult, especially when the only controls we have are Fortran statements.
Another complexity to understand for OpenMP is the use of -ffast-math. This has been shown to be very effective for single thread (well behaved) calculations, but when memory bandwidth problems emerge, the advantage erodes.
Finally I am not an expert on Nvfortran or the two Intel Fortrans, but these do appear to utilise multi-thread for DO Concurrent, while the versions of Gfortran I use do not. (All would require appropriate compiler options?)
I would observe that the extra detail required to make DO Concurrent multi-threaded is about the same as using !$OMP in Gfortran (which more clearly documents the intent).
So the important result is to make sure the statements document the programmer’s intent.
I achieve that with Gfortran.
Although there are many disadvantages of using OpenMP, the advantages dominate.
You are absolutely right, the analysis is not infallible. It is based on many assumptions (you may also call them prerequisites), including:
- no cold cache effects
- “steady state” has been reached in the long loops
- the overall rate is limited by a single bottleneck
More generally, this type of analysis is part of the roofline model. This is a “naive” or “optimistic” performance model, which in its basic form ignores architectural aspects and, just as you say, underestimates the true amount of memory transfers. (We are also ignoring the fact that other processes have data cached, so the caches aren’t fully available, etc.)
It is still useful in my opinion, because measuring just the elapsed time doesn’t really tell us how well the processor is being exploited. This tutorial video explains the roofline model very well.
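In its simplest form, the roofline bound can be written as

attainable performance = min( peak compute rate, memory bandwidth x arithmetic intensity )

where the arithmetic intensity is the number of floating-point operations performed per byte moved to or from memory.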
Another important effect is that most CPUs, except the most expensive server ones, throttle their frequency when they get hot. And when using all cores for heavy numerical computations, they will get hot. So you can have a situation where a single thread runs at 100% of the nominal frequency, but when running all threads the frequency throttles down. The time spent at 100% load also matters here, making it more complicated to account for.