Nvfortran comparison of do concurrent vs OpenMP code

I run my codes using both do concurrent and OpenMP. On Windows 10 with the Intel classic compiler and the /Qopenmp flag, I do not see any performance difference. But on Ubuntu 20.04 with nvfortran there is a large performance difference. I use the compiler option -stdpar=multicore for do concurrent, following the link: Accelerating Fortran DO CONCURRENT with GPUs and the NVIDIA HPC SDK | NVIDIA Technical Blog. I use the -mp compiler flag for the OpenMP code. The performance difference is about 1/3; do concurrent seems faster. So my question is: which other compiler flags can I use to compare CPU performance with nvfortran? Should the performance not be the same for both versions?


OpenMP has the disadvantage of a significant startup overhead, of the order of 20,000 processor cycles (for Gfortran on Windows, although this may vary with the OS), so there can be little benefit from multi-threading small kernels like the saxpy example in the NVIDIA link.
Perhaps a multi-threaded DO CONCURRENT can achieve better results if its implementation has less startup overhead.
I did not see any reference to AVX / SIMD instructions, which typically give a performance uplift for these types of calculations, where there is only a small amount of computation per iteration.

Perhaps your nvfortran test obtained an improvement from AVX, rather than from multi-threading?

With calculations like saxpy, I do not get any direct benefit from OpenMP, but I do from AVX.
Using saxpy with large vectors also runs into memory bandwidth limitations (cache efficiency), so testing can be difficult to control when comparing to a reference case.

It is interesting that the NVIDIA link presents saxpy as a multi-threading candidate.
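
To separate the threading effect from the SIMD effect, one can time a simple saxpy kernel and toggle the directives. A minimal sketch (my own illustration, assuming OpenMP 4.0+ for the simd clause; this is not code from the NVIDIA blog):

!  Illustrative only: saxpy timed with a combined thread + SIMD directive.
!  Replace "parallel do simd" with just "simd" (or remove the directive)
!  to isolate the vectorisation gain from the threading gain.
program saxpy_timing
    use omp_lib, only : omp_get_wtime
    implicit none
    integer, parameter :: n = 50000000
    real ( kind = 8 ), allocatable :: x(:), y(:)
    real ( kind = 8 ) :: a, t0, t1
    integer :: i

    allocate ( x(n), y(n) )
    a = 2.0d0
    x = 1.0d0
    y = 0.0d0

    t0 = omp_get_wtime ()
!$omp parallel do simd
    do i = 1, n
        y(i) = y(i) + a*x(i)
    end do
!$omp end parallel do simd
    t1 = omp_get_wtime ()

    write (*,*) 'saxpy seconds =', t1 - t0, '  check y(1) =', y(1)
end program saxpy_timing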

I used the following two versions of the code. It is the same code as in Same OpenMP code implementation in gfortran and intel - Help - Fortran Discourse (fortran-lang.discourse.group).

1) OpenMP

!  OpenMP version of the code

program openmp_test
    use iso_fortran_env
    implicit none

    
    !=============
    
    integer ( kind = 4 ), parameter :: Nx = 2000
    integer ( kind = 4 ), parameter :: Ny = 2000
    real ( kind = 8 )               :: dx = 0.03d0
    real ( kind = 8 )               :: dy = 0.03d0
    real ( kind = 8 ),    parameter :: one  = 1.0
    real ( kind = 8 ),    parameter :: two  = 2.0
    real ( kind = 8 ),    parameter :: four = 4.0
    real ( kind = 8 ),    parameter :: half = 0.5
    
    !=============
    
    integer (kind = 4 ) :: nsteps = 2000
    integer (kind = 4 ) :: nprint = 100
    integer (kind = 4 ) :: tsteps 
    real ( kind = 8 )   :: dtime  = 1.0d-4
    
    !===============
    
    real ( kind = 8 )   :: tau   = 0.0003d0
    real ( kind = 8 )   :: epsilonb = 0.01d0
    real ( kind = 8 )   :: kappa = 1.8d0
    real ( kind = 8 )   :: delta = 0.02d0
    real ( kind = 8 )   :: aniso = 6.0d0
    real ( kind = 8 )   :: alpha = 0.9d0
    real ( kind = 8 )   :: gama  = 10.0d0
    real ( kind = 8 )   :: teq   = 1.0d0
    real ( kind = 8 )   :: theta0= 0.2d0
    real ( kind = 8 )   :: seed  = 5.0d0
    
    real ( kind = 8 )   :: pix   = four*atan(one)
    
    !===============
    
    real ( kind = 8 ) , dimension( Nx, Ny ) :: phi, tempr
    real ( kind = 8 ) , dimension( Nx, Ny ) :: lap_phi, lap_tempr
    real ( kind = 8 ) , dimension( Nx, Ny ) :: phidx, phidy
    real ( kind = 8 ) , dimension( Nx, Ny ) :: epsil, epsilon_deriv
    real ( kind = 8 )                       :: phi_old, term1, term2
    real ( kind = 8 )                       :: theta, m
    integer ( kind = 4 )                    :: i, j, ip, im, jp, jm
    
    
    
    !============================================================
    
    write (*,*) 'Vern : ',compiler_version ()
    write (*,*) 'Opts : ',compiler_options ()

    
    call timer ( 'start' )
    
    phi = 0.0
    tempr = 0.0
    
    do i = 1, Nx
        do j = 1, Ny
            if ( (i - Nx/two)*(i - Nx/two) + (j - Ny/two)*(j - Ny/two) < seed ) then
                phi(i,j) = one
            end if
        end do
    end do
    
    call timer ( 'initialise phi' )
    
    
    
    !============================================================
    
    
    
    time_loop: do tsteps = 1, nsteps
        
        
        !$omp  parallel do default ( none )     &
        !$omp&             private ( i,j, ip,im, jp,jm, theta )   &
        !$omp&             shared  ( dx, dy, lap_phi, lap_tempr, phi, tempr, phidx, phidy, epsil, epsilon_deriv,  &
        !$omp&                       theta0, aniso, delta, epsilonb )
        
        do j = 1, Ny
            do i =1, Nx                
                
                jp = j + 1
                jm = j - 1
                
                ip = i + 1
                im = i - 1
                
                if ( im == 0 ) im = Nx
                if ( ip == ( Nx + 1) ) ip = 1
                if ( jm == 0 ) jm = Ny
                if ( jp == ( Ny + 1) ) jp = 1
                
                !=====
                
                lap_phi(i,j)   = ( phi(ip,j) + phi(im,j) + phi(i,jm) + phi(i,jp) - four*phi(i,j)) / ( dx*dy )
                lap_tempr(i,j) = ( tempr(ip,j) + tempr(im,j) + tempr(i,jm) + tempr(i,jp) - four*tempr(i,j)) / ( dx*dy )
                
                !======
                
                phidx(i,j) = ( phi(ip,j) - phi(im,j) ) / dx
                phidy(i,j) = ( phi(i,jp) - phi(i,jm) ) / dy
                
                !======
                
                theta  = atan2( phidy(i,j),phidx(i,j) )
                
                !======
                
                epsil(i,j)         =  epsilonb *( one +  delta * cos ( aniso*( theta - theta0 ) ) )
                epsilon_deriv(i,j) = -epsilonb * aniso * delta * sin ( aniso*( theta - theta0 ) )
                
            end do
        end do
        
        !$omp end parallel do
        
        
        !$omp  parallel do default ( none )     &
        !$omp&             private ( i,j,ip,im,jp,jm, phi_old, term1, term2, m )  &
        !$omp&             shared  ( dx, dy, phi, tempr, epsil, epsilon_deriv, phidx, phidy, lap_phi, lap_tempr,  &
        !$omp&                       alpha, pix, teq, gama, dtime, tau, kappa )
        
        do j = 1, Ny
            do i =1, Nx 
                
                jp = j + 1
                jm = j - 1
                
                ip = i + 1
                im = i - 1
                
                if ( im == 0 ) im = Nx
                if ( ip == ( Nx + 1) ) ip = 1
                if ( jm == 0 ) jm = Ny
                if ( jp == ( Ny + 1) ) jp = 1
                
                phi_old = phi(i,j)
                
                !========
                
                term1 =  ( epsil(i,jp)*epsilon_deriv(i,jp)*phidx(i,jp) - epsil(i,jm)*epsilon_deriv(i,jm)*phidx(i,jm) ) / dy
                term2 = -( epsil(ip,j)*epsilon_deriv(ip,j)*phidy(ip,j) - epsil(im,j)*epsilon_deriv(im,j)*phidy(im,j) ) / dx
                
                !========
                
                m = alpha/pix * atan ( gama*( teq - tempr(i,j) ) )
                
                !========
                
                phi(i,j) = phi(i,j) + ( dtime/tau ) * ( term1 + term2 + epsil(i,j)**2 * lap_phi(i,j) ) &
                                    +  phi_old * ( one - phi_old )*( phi_old - half + m )
                tempr(i,j) = tempr(i,j) + dtime*lap_tempr(i,j) + kappa * ( phi(i,j) - phi_old )
                
            end do
        end do
        !$omp end parallel do
        
        
        ! print steps
        
        if ( mod ( tsteps, nprint ) .eq. 0 ) print *, 'Done steps  =  ', tsteps
        
        
        
    end do time_loop
    
    call timer ( 'time_loop' )
       
    !============================================================================
    ! output file
    
    open ( 11, file = "phi.dat" )
    do i = 1, Nx
        write( 11, 10 ) ( phi(i,j), j = 1, Ny )
    end do
    10 FORMAT (1000000F10.6)
    close( 11 )
    
    write (*,*) 'max phi =', maxval (phi)
    call timer ( 'report phi' )
    

end program openmp_test

  subroutine timer ( desc )
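   !  Prints the elapsed seconds since the previous call. Note that
   !  initialising last_clock gives it the implicit SAVE attribute,
   !  so the reference time persists between calls.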
   character desc*(*)
   integer*8 :: clock, rate, last_clock = -1
   real*8    :: sec

   if ( last_clock < 0 ) then
     call system_clock ( last_clock, rate )
     sec = 0
     write (*,11) sec, desc
   else
     call system_clock ( clock, rate )
     sec = dble ( clock - last_clock ) / dble (rate)
     write (*,11) sec, desc
  11 format ( f10.3,2x,a )
     last_clock = clock
   end if
  end subroutine timer

2) Do concurrent

!  dc version of the main code

program dc_test
    use iso_fortran_env
    implicit none

    
    !=============
    
    integer ( kind = 4 ), parameter :: Nx = 2000
    integer ( kind = 4 ), parameter :: Ny = 2000
    real ( kind = 8 )               :: dx = 0.03d0
    real ( kind = 8 )               :: dy = 0.03d0
    real ( kind = 8 ),    parameter :: one  = 1.0
    real ( kind = 8 ),    parameter :: two  = 2.0
    real ( kind = 8 ),    parameter :: four = 4.0
    real ( kind = 8 ),    parameter :: half = 0.5
    
    !=============
    
    integer (kind = 4 ) :: nsteps = 2000
    integer (kind = 4 ) :: nprint = 100
    integer (kind = 4 ) :: tsteps 
    real ( kind = 8 )   :: dtime  = 1.0d-4
    
    !===============
    
    real ( kind = 8 )   :: tau   = 0.0003d0
    real ( kind = 8 )   :: epsilonb = 0.01d0
    real ( kind = 8 )   :: kappa = 1.8d0
    real ( kind = 8 )   :: delta = 0.02d0
    real ( kind = 8 )   :: aniso = 6.0d0
    real ( kind = 8 )   :: alpha = 0.9d0
    real ( kind = 8 )   :: gama  = 10.0d0
    real ( kind = 8 )   :: teq   = 1.0d0
    real ( kind = 8 )   :: theta0= 0.2d0
    real ( kind = 8 )   :: seed  = 5.0d0
    
    real ( kind = 8 )   :: pix   = four*atan(one)
    
    !===============
    
    real ( kind = 8 ) , dimension( Nx, Ny ) :: phi, tempr
    real ( kind = 8 ) , dimension( Nx, Ny ) :: lap_phi, lap_tempr
    real ( kind = 8 ) , dimension( Nx, Ny ) :: phidx, phidy
    real ( kind = 8 ) , dimension( Nx, Ny ) :: epsil, epsilon_deriv
    real ( kind = 8 )                       :: phi_old, term1, term2
    real ( kind = 8 )                       :: theta, m
    integer ( kind = 4 )                    :: i, j, ip, im, jp, jm
    
    
    
    !============================================================
    
    write (*,*) 'Vern : ',compiler_version ()
    write (*,*) 'Opts : ',compiler_options ()

    
    call timer ( 'start' )
    
    phi = 0.0
    tempr = 0.0
    
    do i = 1, Nx
        do j = 1, Ny
            if ( (i - Nx/two)*(i - Nx/two) + (j - Ny/two)*(j - Ny/two) < seed ) then
                phi(i,j) = one
            end if
        end do
    end do
    
    call timer ( 'initialise phi' )
    
    
    
    !============================================================
    
    
    
    time_loop: do tsteps = 1, nsteps
        
        do concurrent (j=1:Ny,i=1:Nx) default (none)    &
           local ( ip,im, jp,jm, theta )   &
           shared  ( dx, dy, lap_phi, lap_tempr, phi, tempr, phidx, phidy, epsil, epsilon_deriv,  &
                          theta0, aniso, delta, epsilonb )             
                
                jp = j + 1
                jm = j - 1
                
                ip = i + 1
                im = i - 1
                
                if ( im == 0 ) im = Nx
                if ( ip == ( Nx + 1) ) ip = 1
                if ( jm == 0 ) jm = Ny
                if ( jp == ( Ny + 1) ) jp = 1
                
                !=====
                
                lap_phi(i,j)   = ( phi(ip,j) + phi(im,j) + phi(i,jm) + phi(i,jp) - four*phi(i,j)) / ( dx*dy )
                lap_tempr(i,j) = ( tempr(ip,j) + tempr(im,j) + tempr(i,jm) + tempr(i,jp) - four*tempr(i,j)) / ( dx*dy )
                
                !======
                
                phidx(i,j) = ( phi(ip,j) - phi(im,j) ) / dx
                phidy(i,j) = ( phi(i,jp) - phi(i,jm) ) / dy
                
                !======
                
                theta  = atan2( phidy(i,j),phidx(i,j) )
                
                !======
                
                epsil(i,j)         =  epsilonb *( one +  delta * cos ( aniso*( theta - theta0 ) ) )
                epsilon_deriv(i,j) = -epsilonb * aniso * delta * sin ( aniso*( theta - theta0 ) )
                
        end do
        
        do concurrent (j=1:Ny,i=1:Nx) default (none)    &
           local ( ip,im,jp,jm, phi_old, term1, term2, m )  &
           shared  ( dx, dy, phi, tempr, epsil, epsilon_deriv, phidx, phidy, lap_phi, lap_tempr,  &
                     alpha, pix, teq, gama, dtime, tau, kappa )

                jp = j + 1
                jm = j - 1
                
                ip = i + 1
                im = i - 1
                
                if ( im == 0 ) im = Nx
                if ( ip == ( Nx + 1) ) ip = 1
                if ( jm == 0 ) jm = Ny
                if ( jp == ( Ny + 1) ) jp = 1
                
                phi_old = phi(i,j)
                
                !========
                
                term1 =  ( epsil(i,jp)*epsilon_deriv(i,jp)*phidx(i,jp) - epsil(i,jm)*epsilon_deriv(i,jm)*phidx(i,jm) ) / dy
                term2 = -( epsil(ip,j)*epsilon_deriv(ip,j)*phidy(ip,j) - epsil(im,j)*epsilon_deriv(im,j)*phidy(im,j) ) / dx
                
                !========
                
                m = alpha/pix * atan ( gama*( teq - tempr(i,j) ) )
                
                !========
                
                phi(i,j) = phi(i,j) + ( dtime/tau ) * ( term1 + term2 + epsil(i,j)**2 * lap_phi(i,j) ) &
                                    +  phi_old * ( one - phi_old )*( phi_old - half + m )
                tempr(i,j) = tempr(i,j) + dtime*lap_tempr(i,j) + kappa * ( phi(i,j) - phi_old )
                
        end do
        
        
        ! print steps
        
        if ( mod ( tsteps, nprint ) .eq. 0 ) print *, 'Done steps  =  ', tsteps
        
        
        
    end do time_loop
    
    call timer ( 'time_loop' )
       
    !============================================================================
    ! output file
    
    open ( 11, file = "phi.dat" )
    do i = 1, Nx
        write( 11, 10 ) ( phi(i,j), j = 1, Ny )
    end do
    10 FORMAT (1000000F10.6)
    close( 11 )
    
    write (*,*) 'max phi =', maxval (phi)
    call timer ( 'report phi' )
    

end program dc_test

  subroutine timer ( desc )
   character desc*(*)
   integer*8 :: clock, rate, last_clock = -1
   real*8    :: sec

   if ( last_clock < 0 ) then
     call system_clock ( last_clock, rate )
     sec = 0
     write (*,11) sec, desc
   else
     call system_clock ( clock, rate )
     sec = dble ( clock - last_clock ) / dble (rate)
     write (*,11) sec, desc
  11 format ( f10.3,2x,a )
     last_clock = clock
   end if
  end subroutine timer

I run on Ubuntu 20.04 with the following commands

1) OpenMP test

shahid@shahid-G3-3579:~$ nvfortran omp.f90 -o omp -mp -Minfo
openmp_test:
     66, FMA (fused multiply-add) instruction(s) generated
     83, !$omp parallel
    104, FMA (fused multiply-add) instruction(s) generated
    105, FMA (fused multiply-add) instruction(s) generated
    118, FMA (fused multiply-add) instruction(s) generated
    127, !$omp parallel
    188, maxval reduction inlined
shahid@shahid-G3-3579:~$ ./omp
 Vern : nvfortran 24.7-0
 Opts : omp.f90 -o omp -mp -Minfo
     0.000  start
     0.038  initialise phi
 Done steps  =            100
 Done steps  =            200
 Done steps  =            300
 Done steps  =            400
 Done steps  =            500
 Done steps  =            600
 Done steps  =            700
 Done steps  =            800
 Done steps  =            900
 Done steps  =           1000
 Done steps  =           1100
 Done steps  =           1200
 Done steps  =           1300
 Done steps  =           1400
 Done steps  =           1500
 Done steps  =           1600
 Done steps  =           1700
 Done steps  =           1800
 Done steps  =           1900
 Done steps  =           2000
    87.198  time_loop
 max phi =    1.002807743736036     
     0.266  report phi
2) Do concurrent test
shahid@shahid-G3-3579:~$ nvfortran dc.f90 -o dc -stdpar=multicore -Minfo
dc_test:
     62, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
     63, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
     67, FMA (fused multiply-add) instruction(s) generated
     83, Generating Multicore code
         83, Loop parallelized across CPU threads
    101, FMA (fused multiply-add) instruction(s) generated
    102, FMA (fused multiply-add) instruction(s) generated
    115, FMA (fused multiply-add) instruction(s) generated
    120, Generating Multicore code
        120, Loop parallelized across CPU threads
    140, FMA (fused multiply-add) instruction(s) generated
    141, FMA (fused multiply-add) instruction(s) generated
    149, FMA (fused multiply-add) instruction(s) generated
    151, FMA (fused multiply-add) instruction(s) generated
    176, maxval reduction inlined
shahid@shahid-G3-3579:~$ ./dc
 Vern : nvfortran 24.7-0
 Opts : dc.f90 -o dc -stdpar=multicore -Minfo
     0.000  start
     0.030  initialise phi
 Done steps  =            100
 Done steps  =            200
 Done steps  =            300
 Done steps  =            400
 Done steps  =            500
 Done steps  =            600
 Done steps  =            700
 Done steps  =            800
 Done steps  =            900
 Done steps  =           1000
 Done steps  =           1100
 Done steps  =           1200
 Done steps  =           1300
 Done steps  =           1400
 Done steps  =           1500
 Done steps  =           1600
 Done steps  =           1700
 Done steps  =           1800
 Done steps  =           1900
 Done steps  =           2000
    71.349  time_loop
 max phi =    1.002807743736036     
     0.257  report phi

It seems that do concurrent is faster, but only at the larger domain size, i.e. Nx=2000, Ny=2000. I tested with Nx=100 and Ny=100 and the computed times were the same.

FMA (fused multiply-add) instructions are generated by the NVIDIA compiler here. Are they the same as AVX instructions?

I think I have seen this program before. I am having trouble running it on Win 10 with Gfortran 12.3, as it does not terminate (corrupt stack?).
I made the following changes to the OMP version:

  1. moved all array declarations to a module (I think there is a problem with the stack size)
  2. changed atan2 to zatan2, which checks for the case when both arguments are zero
  3. introduced time accumulators to check the two OMP regions
  4. introduced call omp_set_num_threads (nth) to test a variable number of threads
  5. compiled with varying options, although I still can’t get the program to stop properly

My batch file is

set prog=%1

set options=-fimplicit-none -fallow-argument-mismatch -O2 -march=native -o %prog%.exe
set options=-fimplicit-none -fallow-argument-mismatch -O2 -march=native -fopenmp -o %prog%.exe

now >> %prog%.log 

del %prog%.exe

gfortran %prog%.f90 %options% 

%prog% >> %prog%.log 

type %prog%.log

The elapsed times for different thread counts are:

  8 threads    56.3 sec
  4 threads    87.0 sec
  2 threads   143.0 sec
  no OpenMP   287   sec

So OpenMP definitely has some effect.
However, there could still be unidentified errors in the program that make timing tests a problem, for example floating point underflow or invalid atan2 arguments.
On 64-bit Windows, the default stack size is ridiculously small. (I will modify the bat file and check again.)
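
One way to raise the stack reserve with this toolchain is at link time; a sketch of the modified options line, assuming the MinGW linker (the --stack option applies to PE targets, and the 512 MB value here is arbitrary):

set options=-fimplicit-none -fallow-argument-mismatch -O2 -march=native -fopenmp -Wl,--stack,536870912 -o %prog%.exe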

My latest test version is:

!  OpenMP version of the code
!
!  the original program does not run with OpenMP on Ver 12.2
!   possible changes
!    1) patch atan2 error
!    2) use module
!    3) change stack size
!    4) try Gfortran ver 11

module All_data
    implicit none

    integer, parameter :: i4 = selected_int_kind (12)
    integer, parameter :: r8 = selected_real_kind (12)

    !=============
    
!    integer ( kind = i4 ), parameter :: Nx = 2000
!    integer ( kind = i4 ), parameter :: Ny = 2000
    integer ( kind = i4 ), parameter :: Nx = 2000
    integer ( kind = i4 ), parameter :: Ny = 2000
    real ( kind = r8 )               :: dx = 0.03d0
    real ( kind = r8 )               :: dy = 0.03d0
    real ( kind = r8 ),    parameter :: one  = 1.0
    real ( kind = r8 ),    parameter :: two  = 2.0
    real ( kind = r8 ),    parameter :: four = 4.0
    real ( kind = r8 ),    parameter :: half = 0.5
    
    !=============
    
    integer (kind = i4 ) :: nsteps = 2000
    integer (kind = i4 ) :: nprint = 100
    integer (kind = i4 ) :: tsteps 
    real ( kind = r8 )   :: dtime  = 1.0d-4
    
    !===============
    
    real ( kind = r8 )   :: tau   = 0.0003d0
    real ( kind = r8 )   :: epsilonb = 0.01d0
    real ( kind = r8 )   :: kappa = 1.8d0
    real ( kind = r8 )   :: delta = 0.02d0
    real ( kind = r8 )   :: aniso = 6.0d0
    real ( kind = r8 )   :: alpha = 0.9d0
    real ( kind = r8 )   :: gama  = 10.0d0
    real ( kind = r8 )   :: teq   = 1.0d0
    real ( kind = r8 )   :: theta0= 0.2d0
    real ( kind = r8 )   :: seed  = 5.0d0
    
    real ( kind = r8 )   :: pix   = four*atan(one)
    
    !===============
    
    real ( kind = r8 ) , dimension( Nx, Ny ) :: phi, tempr
    real ( kind = r8 ) , dimension( Nx, Ny ) :: lap_phi, lap_tempr
    real ( kind = r8 ) , dimension( Nx, Ny ) :: phidx, phidy
    real ( kind = r8 ) , dimension( Nx, Ny ) :: epsil, epsilon_deriv
    real ( kind = r8 )                       :: phi_old, term1, term2
    real ( kind = r8 )                       :: theta, m
    integer ( kind = i4 )                    :: i, j, ip, im, jp, jm

  contains

  real*8 function delta_sec ()
   integer*8 :: clock, rate = 1, last_clock = -1
   real*8    :: sec

   if ( last_clock < 0 ) then
     call system_clock ( last_clock, rate )
     sec = 0
   else
     call system_clock ( clock )
     sec = dble ( clock - last_clock ) / dble (rate)
     last_clock = clock
   end if
   delta_sec = sec
  end function delta_sec

end module all_data

program openmp_test
    use iso_fortran_env
    use all_data
    
    real*8  :: secs, seca, secb
    integer :: nth      = 1
    logical :: do_loops = .true.
    
    write (*,*) 'start of program', i4,r8
    
    !============================================================
    
    write (*,*) 'Vern : ', compiler_version ()
    write (*,*) 'Opts : ', compiler_options ()

!$    nth = 2
!$    call omp_set_num_threads (nth)
    write (*,*) 'threads set to', nth
    
    call timer ( 'start' )
    
    phi = 0.0
    tempr = 0.0
    secs = 0
    seca = 0
    secb = 0
    
    do j = 1, Ny
        do i = 1, Nx
            if ( (i - Nx/two)*(i - Nx/two) + (j - Ny/two)*(j - Ny/two) < seed ) then
                phi(i,j) = one
            end if
        end do
    end do
    
    call timer ( 'initialise phi' )
    
    
    !============================================================
    
    
    
    time_loop: do tsteps = 1, nsteps
        
        secs = secs + delta_sec ()
     if ( do_loops ) then
           
        !$omp  parallel do default ( none )     &
        !$omp&             private ( i,j, ip,im, jp,jm, theta )   &
        !$omp&             shared  ( dx, dy, lap_phi, lap_tempr, phi, tempr, phidx, phidy, epsil, epsilon_deriv,  &
        !$omp&                       theta0, aniso, delta, epsilonb )
        
        do j = 1, Ny
            do i =1, Nx                
                
                jp = j + 1
                jm = j - 1
                
                ip = i + 1
                im = i - 1
                
                if ( im == 0 ) im = Nx
                if ( ip == ( Nx + 1) ) ip = 1
                if ( jm == 0 ) jm = Ny
                if ( jp == ( Ny + 1) ) jp = 1
                
                !=====
                
                lap_phi(i,j)   = ( phi(ip,j) + phi(im,j) + phi(i,jm) + phi(i,jp) - four*phi(i,j)) / ( dx*dy )
                lap_tempr(i,j) = ( tempr(ip,j) + tempr(im,j) + tempr(i,jm) + tempr(i,jp) - four*tempr(i,j)) / ( dx*dy )
                
                !======
                
                phidx(i,j) = ( phi(ip,j) - phi(im,j) ) / dx
                phidy(i,j) = ( phi(i,jp) - phi(i,jm) ) / dy
                
                !======
                
                theta  = zatan2 ( phidy(i,j), phidx(i,j) )
                
                !======
                
                epsil(i,j)         =  epsilonb *( one +  delta * cos ( aniso*( theta - theta0 ) ) )
                epsilon_deriv(i,j) = -epsilonb * aniso * delta * sin ( aniso*( theta - theta0 ) )
                
            end do
        end do
        
        !$omp end parallel do
     end if

        seca = seca + delta_sec ()
        
     if ( do_loops ) then
        !$omp  parallel do default ( none )     &
        !$omp&             private ( i,j,ip,im,jp,jm, phi_old, term1, term2, m )  &
        !$omp&             shared  ( dx, dy, phi, tempr, epsil, epsilon_deriv, phidx, phidy, lap_phi, lap_tempr,  &
        !$omp&                       alpha, pix, teq, gama, dtime, tau, kappa )
        
        do j = 1, Ny
            do i =1, Nx 
                
                jp = j + 1
                jm = j - 1
                
                ip = i + 1
                im = i - 1
                
                if ( im == 0 ) im = Nx
                if ( ip == ( Nx + 1) ) ip = 1
                if ( jm == 0 ) jm = Ny
                if ( jp == ( Ny + 1) ) jp = 1
                
                phi_old = phi(i,j)
                
                !========
                
                term1 =  ( epsil(i,jp)*epsilon_deriv(i,jp)*phidx(i,jp) - epsil(i,jm)*epsilon_deriv(i,jm)*phidx(i,jm) ) / dy
                term2 = -( epsil(ip,j)*epsilon_deriv(ip,j)*phidy(ip,j) - epsil(im,j)*epsilon_deriv(im,j)*phidy(im,j) ) / dx
                
                !========
                
                m = alpha/pix * atan ( gama*( teq - tempr(i,j) ) )
                
                !========
                
                phi(i,j) = phi(i,j) + ( dtime/tau ) * ( term1 + term2 + epsil(i,j)**2 * lap_phi(i,j) ) &
                                    +  phi_old * ( one - phi_old )*( phi_old - half + m )
                tempr(i,j) = tempr(i,j) + dtime*lap_tempr(i,j) + kappa * ( phi(i,j) - phi_old )
                
            end do
        end do
        !$omp end parallel do
     end if
        
        secb = secb + delta_sec ()
        
        ! print steps
        
        if ( mod ( tsteps, nprint ) .eq. 0 ) print *, 'Done steps  =  ', tsteps
        
        
        
    end do time_loop
    
    call timer ( 'time_loop' )
       
    write (*,*) 'stages', secs,seca,secb
    
    !============================================================================
    ! output file
    
    open ( 11, file = "phi.dat" )
    do i = 1, Nx
        write( 11, 10 ) ( phi(i,j), j = 1, Ny )
    end do
    10 FORMAT (1000000F10.6)
    close( 11 )
    
    write (*,*) 'max phi =', maxval (phi)
    call timer ( 'report phi' )

!z    call exit (0)

   contains

    real*8 function zatan2 ( y,x )
      real*8 :: y,x, res
      if ( abs (x) > 0 .or. abs (y) > 0 ) then
        res = atan2 ( y, x )
      else
        res = 0
      end if
      zatan2 = res
    end function zatan2

end program openmp_test

  subroutine timer ( desc )
   character desc*(*)
   integer*8 :: clock, rate, last_clock = -1
   real*8    :: sec

   if ( last_clock < 0 ) then
     call system_clock ( last_clock, rate )
     sec = 0
     write (*,11) sec, desc
   else
     call system_clock ( clock, rate )
     sec = dble ( clock - last_clock ) / dble (rate)
     write (*,11) sec, desc
  11 format ( f10.3,2x,a )
     last_clock = clock
   end if

  end subroutine timer

Yeah, it is the same code you modified last time. However, my concern is the performance comparison for nvfortran on Ubuntu 20.04. I found that the performance of do concurrent and OpenMP is the same if the domain size is small, i.e. Nx=100 and Ny=100. But if it is increased to Nx=2000 and Ny=2000, there is a big time difference; the OpenMP performance is worse in that case.

Going from 100x100 to 2000x2000 is 400 times bigger so the increase in run time should not be unexpected.

The use of Nvidia off-loading with DO CONCURRENT looks to be a very interesting opportunity.
It appears to provide the basics of shared/private array nomination, which looks good.
It appears to offer “easy access” to large thread counts on GPUs.

I would expect that the same issues that apply to OpenMP would apply here, especially memory transfer bandwidth limitations.
Please update us on how you go with this approach and the PC configuration you are using.

Also, you should check your test example, as floating point underflow and use of sin/cos/atan with unusual values can distort the run times, which confuses the test interpretation.


Yeah, but at this moment I am interested only in the CPU, especially how do concurrent and OpenMP performance compare for nvfortran. For Intel ifort it appears to make no difference and both versions (OpenMP and do concurrent) have the same performance, but nvfortran seems to behave differently. Maybe it has to do with the OpenACC implementation of do concurrent in the nvidia compiler. I am not familiar with acc, so I cannot say much about it. I read that nvidia uses OpenACC under the hood for do concurrent, and OpenMP for the OpenMP code; perhaps that causes some of the performance differences.

I am running on my laptop, which is a bit old: a Dell G3 with Ubuntu 20.04 and an 8th generation Intel Core i5. The memory is 16 GB.

I checked the output with a graphics library. It looks fine.

The reason for the performance difference is that you are collapsing the loops in the DC case but not in the OpenMP case. You must add collapse(2) to achieve an equivalent semantic in the OpenMP DO statement.

Most of the other theories here are wrong. NVIDIA Fortran implements DC using the OpenACC back-end, which will generate multithreaded plus SIMD code in a similar way to OpenMP. The overheads of launching a thread pool should be similar. One should, of course, do at least one warmup iteration before starting the timer in order to get clean results.
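
Concretely, this means changing the first directive in the OpenMP program above to the following (and likewise for the second loop nest):

        !$omp  parallel do collapse (2) default ( none )     &
        !$omp&             private ( i,j, ip,im, jp,jm, theta )   &
        !$omp&             shared  ( dx, dy, lap_phi, lap_tempr, phi, tempr, phidx, phidy, epsil, epsilon_deriv,  &
        !$omp&                       theta0, aniso, delta, epsilonb )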


For the two loops:

do j = 1, Ny
do i = 1, Nx

where Ny is much greater than the number of available threads, I have not seen any improvement from the use of “collapse (2)”.
In this case, where Ny = 2000, collapse should make little difference.

I have now included a review of the array phi, using:

  subroutine print_phi
    use all_data
    
    integer ( kind = i4 ) :: i, j, j1, k, nz, nt, nv, ns
    character :: values(20)*11

    open ( unit = 11, file = "phi.dat" )
    nz = 0
    nt = 0
    ns = 0
    nv = 0
    do i = 1, Nx
       write (11, 10 ) i, Ny
       do j = 1,ny,20
          do j1 = j,min(j+19,ny)
             k = j1-j+1
             if ( phi(i,j1) == 0 ) then
                values(k) = '  0.'
                nz = nz+1
             else
                write ( values(k), 11) phi(i,j1)
                if ( abs(phi(i,j1)) < 1.d-200 ) then
                  nt = nt+1
                else if ( abs(phi(i,j1)) < 1.d-30 ) then
                  ns = ns+1
                else
                  nv = nv+1
                end if
             end if
          end do
          write ( 11, 12 ) values(1:k)
       end do
    end do
    
    write (11, *) 'max phi = ', maxval (phi)
    write (11,14) 'zero values     ', nz
    write (11,14) 'values < 1.d-200', nt
    write (11,14) 'values < 1.d-30 ', ns
    write (11,14) 'values > 1.d-30 ', nv

    write (*, *) 'max phi = ', maxval (phi)
    write (*,14) 'zero values     ', nz
    write (*,14) 'values < 1.d-200', nt
    write (*,14) 'values < 1.d-30 ', ns
    write (*,14) 'values > 1.d-30 ', nv

    close( unit = 11 )

    10 format ( 2i6 )
    11 FORMAT ( es11.3 )
    12 format ( 20a )
    14 format ( a, i8)

  end subroutine print_phi

The results summary is now:

 start of program           4           8
 Vern : GCC version 12.3.0
 Opts : ... -O3 -fimplicit-none -fopenmp
 threads set to          10
     0.000  start
     0.016  initialise phi
 Done steps  =           100
 Done steps  =           200
 ...
 Done steps  =          1900
 Done steps  =          2000
    49.309  time_loop
 stages    0.0004   32.7309   16.5781
 max phi =    1.0028077443407950    
 zero values        3319603   83.0%
 values < 1.d-307     48158    1.2%
 values < 1.d-200    297038    7.4%
 values < 1.d-30     291174    7.3%
 values > 1.d-30      44027    1.1%
     1.273  report phi
 test completed          10

This distribution of magnitudes shows a significant number of very small values, which suggests a large number of floating point underflows.
The program termination does report IEEE_UNDERFLOW_FLAG and IEEE_DENORMAL occurring.

Depending on how underflow is managed, this can produce significant variation in run times between different compilers.
The initial conditions for this numerical calculation should be revised to achieve results that are easier to compare between compilers.
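
For anyone wanting to confirm this, the intrinsic IEEE modules can query the underflow flag directly; a minimal sketch (standard Fortran 2003, assuming the compiler supports the IEEE modules):

program underflow_check
    use, intrinsic :: ieee_exceptions
    implicit none
    logical :: flag
    real ( kind = 8 ) :: small

    call ieee_set_flag ( ieee_underflow, .false. )   ! clear the flag
    small = 1.0d-300
    small = small * 1.0d-30                          ! result is denormal
    call ieee_get_flag ( ieee_underflow, flag )
    write (*,*) 'underflow raised :', flag, '  value =', small
end program underflow_check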

Some tests I performed based on the above discussion. The system specs are:

Processor 11th Gen Intel(R) Core™ i5-11500 @ 2.70GHz 2.71 GHz
Installed RAM 8.00 GB (7.83 GB usable)
System type 64-bit operating system, x64-based processor

The results (elapsed times in seconds) are:

Test 1 (Nx = 200, Ny = 200, 2 threads)
                       Intel (2021.13.0)   gfortran (13.2)
No collapse clause           0.588               5.681
With collapse clause         1.099               5.7

Test 2 (Nx = 2000, Ny = 2000, 2 threads)
                       Intel (2021.13.0)   gfortran (13.2)
No collapse clause          98.173             522.845
With collapse clause       149.886             467.451

Test 3 (Nx = 200, Ny = 200, 8 threads)
                       Intel (2021.13.0)   gfortran (13.2)
No collapse clause           0.3143              2.363
With collapse clause         0.531               2.362

Test 4 (Nx = 2000, Ny = 2000, 8 threads)
                       Intel (2021.13.0)   gfortran (13.2)
No collapse clause         114.347             186.003
With collapse clause       179.237             186.576

My code version is

!  OpenMP version of the code
!
!  the original program does not run with OpenMP on Ver 12.2
!   possible changes
!    1) patch atan2 error
!    2) use module
!    3) change stack size
!    4) try Gfortran ver 11



!                   Shahid version
!                  ===============

!  1)  Put these declarations to the main program. 
!   integer ( kind = i4 )                    :: i, j, ip, im, jp, jm
!
!   intel error
!   --------------
!   error #6401: The attributes of this name conflict with those made accessible by a USE statement.   [J]
!          jp = j + 1
!
!  2) Use a where construct for phi. It limits the value of phi between 0.000001 and 0.999999. 
!
!  3) print dimensions Nx and Ny
!   
!  4) add subroutine print_phi in the file
!
module All_data
  implicit none
  
  integer, parameter :: i4 = selected_int_kind (12)
  integer, parameter :: r8 = selected_real_kind (12)
  
  !=============
  
  !    integer ( kind = i4 ), parameter :: Nx = 2000
  !    integer ( kind = i4 ), parameter :: Ny = 2000
  integer ( kind = i4 ), parameter :: Nx = 2000
  integer ( kind = i4 ), parameter :: Ny = 2000
  real ( kind = r8 )               :: dx = 0.03d0
  real ( kind = r8 )               :: dy = 0.03d0
  real ( kind = r8 ),    parameter :: one  = 1.0
  real ( kind = r8 ),    parameter :: two  = 2.0
  real ( kind = r8 ),    parameter :: four = 4.0
  real ( kind = r8 ),    parameter :: half = 0.5
  
  !=============
  
  integer (kind = i4 ) :: nsteps = 2000
  integer (kind = i4 ) :: nprint = 100
  integer (kind = i4 ) :: tsteps 
  real ( kind = r8 )   :: dtime  = 1.0d-4
  
  !===============
  
  real ( kind = r8 )   :: tau   = 0.0003d0
  real ( kind = r8 )   :: epsilonb = 0.01d0
  real ( kind = r8 )   :: kappa = 1.8d0
  real ( kind = r8 )   :: delta = 0.02d0
  real ( kind = r8 )   :: aniso = 6.0d0
  real ( kind = r8 )   :: alpha = 0.9d0
  real ( kind = r8 )   :: gama  = 10.0d0
  real ( kind = r8 )   :: teq   = 1.0d0
  real ( kind = r8 )   :: theta0= 0.2d0
  real ( kind = r8 )   :: seed  = 5.0d0
  
  real ( kind = r8 )   :: pix   = four*atan(one)
  
  !===============
  
  real ( kind = r8 ) , dimension( Nx, Ny ) :: phi, tempr
  real ( kind = r8 ) , dimension( Nx, Ny ) :: lap_phi, lap_tempr
  real ( kind = r8 ) , dimension( Nx, Ny ) :: phidx, phidy
  real ( kind = r8 ) , dimension( Nx, Ny ) :: epsil, epsilon_deriv
  real ( kind = r8 )                       :: phi_old, term1, term2
  real ( kind = r8 )                       :: theta, m
!  integer ( kind = i4 )                    :: i, j, ip, im, jp, jm
  
  contains
  
  real*8 function delta_sec ()
    integer*8 :: clock, rate = 1, last_clock = -1
    real*8    :: sec
    
    if ( last_clock < 0 ) then
      call system_clock ( last_clock, rate )
    sec = 0
    else
      call system_clock ( clock )
      sec = dble ( clock - last_clock ) / dble (rate)
      last_clock = clock
    end if
    delta_sec = sec
  end function delta_sec
  
end module all_data

program openmp_test
  use iso_fortran_env
  use all_data
  
  integer ( kind = 4 )                    :: i, j, ip, im, jp, jm
  real*8  :: secs, seca, secb
  integer :: nth      = 1
  logical :: do_loops = .true.
  
  write (*,*) 'start of program', i4,r8
  
  !============================================================
  
  write (*,*) 'Vern : ', compiler_version ()
  write (*,*) 'Opts : ', compiler_options ()
  
  !$    nth = 8
  !$    call omp_set_num_threads (nth)
  write (*,*) 'threads set to', nth
  
  call timer ( 'start' )
  
  phi = 0.0
  tempr = 0.0
  secs = 0
  seca = 0
  secb = 0
  
  do j = 1, Ny
    do i = 1, Nx
      if ( (i - Nx/two)*(i - Nx/two) + (j - Ny/two)*(j - Ny/two) < seed ) then
        phi(i,j) = one
      end if
    end do
  end do
  
  call timer ( 'initialise phi' )
  
  
  !============================================================
  
  
  
  time_loop: do tsteps = 1, nsteps
    
    secs = secs + delta_sec ()
    if ( do_loops ) then
 
 !collapse (2)
      !$omp  parallel do collapse (2) default ( none )     &
      !$omp&             private ( i,j, ip,im, jp,jm, theta )   &
      !$omp&             shared  ( dx, dy, lap_phi, lap_tempr, phi, tempr, phidx, phidy, epsil, epsilon_deriv,  &
      !$omp&                       theta0, aniso, delta, epsilonb )
      
      do j = 1, Ny
        do i =1, Nx                
          
          jp = j + 1
          jm = j - 1
          
          ip = i + 1
          im = i - 1
          
          if ( im == 0 ) im = Nx
          if ( ip == ( Nx + 1) ) ip = 1
          if ( jm == 0 ) jm = Ny
          if ( jp == ( Ny + 1) ) jp = 1
          
          !=====
          
          lap_phi(i,j)   = ( phi(ip,j) + phi(im,j) + phi(i,jm) + phi(i,jp) - four*phi(i,j)) / ( dx*dy )
          lap_tempr(i,j) = ( tempr(ip,j) + tempr(im,j) + tempr(i,jm) + tempr(i,jp) - four*tempr(i,j)) / ( dx*dy )
          
          !======
          
          phidx(i,j) = ( phi(ip,j) - phi(im,j) ) / dx
          phidy(i,j) = ( phi(i,jp) - phi(i,jm) ) / dy
          
          !======
          
          theta  = zatan2 ( phidy(i,j), phidx(i,j) )
          
          !======
          
          epsil(i,j)         =  epsilonb *( one +  delta * cos ( aniso*( theta - theta0 ) ) )
          epsilon_deriv(i,j) = -epsilonb * aniso * delta * sin ( aniso*( theta - theta0 ) )
          
        end do
      end do
      
      !$omp end parallel do
    end if
    
    seca = seca + delta_sec ()
    
    if ( do_loops ) then
    
    !collapse (2)
      !$omp  parallel do collapse (2) default ( none )     &
      !$omp&             private ( i,j,ip,im,jp,jm, phi_old, term1, term2, m )  &
      !$omp&             shared  ( dx, dy, phi, tempr, epsil, epsilon_deriv, phidx, phidy, lap_phi, lap_tempr,  &
      !$omp&                       alpha, pix, teq, gama, dtime, tau, kappa )
      
      do j = 1, Ny
        do i =1, Nx 

          jp = j + 1
          jm = j - 1
          
          ip = i + 1
          im = i - 1
          
          if ( im == 0 ) im = Nx
          if ( ip == ( Nx + 1) ) ip = 1
          if ( jm == 0 ) jm = Ny
          if ( jp == ( Ny + 1) ) jp = 1
          
          phi_old = phi(i,j)
          
          !========
          
          term1 =  ( epsil(i,jp)*epsilon_deriv(i,jp)*phidx(i,jp) - epsil(i,jm)*epsilon_deriv(i,jm)*phidx(i,jm) ) / dy
          term2 = -( epsil(ip,j)*epsilon_deriv(ip,j)*phidy(ip,j) - epsil(im,j)*epsilon_deriv(im,j)*phidy(im,j) ) / dx
          
          !========
          
          m = alpha/pix * atan ( gama*( teq - tempr(i,j) ) )
          
          !========
          
          phi(i,j) = phi(i,j) + ( dtime/tau ) * ( term1 + term2 + epsil(i,j)**2 * lap_phi(i,j) ) &
          +  phi_old * ( one - phi_old )*( phi_old - half + m )
          tempr(i,j) = tempr(i,j) + dtime*lap_tempr(i,j) + kappa * ( phi(i,j) - phi_old )
          
        end do
      end do
      !$omp end parallel do
      
      where (phi > 0.999999 ) phi = 0.999999
      where (phi < 0.000001 ) phi = 0.000001
      
    end if
    
    secb = secb + delta_sec ()
    
    ! print steps
    
    if ( mod ( tsteps, nprint ) .eq. 0 ) print *, 'Done steps  =  ', tsteps
    
    
    
  end do time_loop
  
  call timer ( 'time_loop' )
  
  Print*, "Nx=", Nx ,"Ny=", Ny
  
  write (*,*) 'stages', secs,seca,secb
  
  !============================================================================
  ! output file
  
  open ( 11, file = "phi.dat" )
  do i = 1, Nx
    write( 11, 10 ) ( phi(i,j), j = 1, Ny )
  end do
  10 FORMAT (1000000F10.6)
  close( 11 )
  
  write (*,*) 'max phi =', maxval (phi)
  call timer ( 'report phi' )
  
  call print_phi
  
  !z    call exit (0)
  
  contains
  
  real*8 function zatan2 ( y,x )
    real*8 :: y,x, res
    if ( abs (x) > 0 .or. abs (y) > 0 ) then
    res = atan2 ( y, x )
    else
      res = 0
    end if
    zatan2 = res
  end function zatan2
  
  subroutine print_phi
    use all_data
    
    integer ( kind = i4 ) :: j1, k, nz, nt, nv, ns
    character :: values(20)*11
    
    open ( unit = 11, file = "phi.dat" )
    nz = 0
    nt = 0
    ns = 0
    nv = 0
    do i = 1, Nx
      write (11, 10 ) i, Ny
      do j = 1,ny,20
        do j1 = j,min(j+19,ny)
          k = j1-j+1
          if ( phi(i,j1) == 0 ) then
            values(k) = '  0.'
          nz = nz+1
          else
            write ( values(k), 11) phi(i,j1)
            if ( abs(phi(i,j1)) < 1.d-200 ) then
            nt = nt+1
            else if ( abs(phi(i,j1)) < 1.d-30 ) then
            ns = ns+1
            else
              nv = nv+1
            end if
          end if
        end do
        write ( 11, 12 ) values(1:k)
      end do
    end do
    
    write (11, *) 'max phi = ', maxval (phi)
    write (11,14) 'zero values     ', nz
    write (11,14) 'values < 1.d-200', nt
    write (11,14) 'values < 1.d-30 ', ns
    write (11,14) 'values > 1.d-30 ', nv
    
    write (*, *) 'max phi = ', maxval (phi)
    write (*,14) 'zero values     ', nz
    write (*,14) 'values < 1.d-200', nt
    write (*,14) 'values < 1.d-30 ', ns
    write (*,14) 'values > 1.d-30 ', nv
    
    close( unit = 11 )
    
    10 format ( 2i6 )
    11 FORMAT ( es11.3 )
    12 format ( 20a )
    14 format ( a, i8)
    
  end subroutine print_phi
  
end program openmp_test

subroutine timer ( desc )
  character desc*(*)
  integer*8 :: clock, rate, last_clock = -1
  real*8    :: sec
  
  if ( last_clock < 0 ) then
    call system_clock ( last_clock, rate )
    sec = 0
  write (*,11) sec, desc
  else
    call system_clock ( clock, rate )
    sec = dble ( clock - last_clock ) / dble (rate)
    write (*,11) sec, desc
    11 format ( f10.3,2x,a )
    last_clock = clock
  end if
  
end subroutine timer

One of the outputs from gfortran is:

C:\Users\owner\Desktop>gfortran omp1.f90 -o omp1 -fopenmp

C:\Users\owner\Desktop>omp1
 start of program           8           8
 Vern : GCC version 13.2.0
 Opts : -mtune=generic -march=x86-64 -mthreads -fopenmp
 threads set to           8
     0.000  start
     0.025  initialise phi
 Done steps  =                    100
 Done steps  =                    200
 Done steps  =                    300
 Done steps  =                    400
 Done steps  =                    500
 Done steps  =                    600
 Done steps  =                    700
 Done steps  =                    800
 Done steps  =                    900
 Done steps  =                   1000
 Done steps  =                   1100
 Done steps  =                   1200
 Done steps  =                   1300
 Done steps  =                   1400
 Done steps  =                   1500
 Done steps  =                   1600
 Done steps  =                   1700
 Done steps  =                   1800
 Done steps  =                   1900
 Done steps  =                   2000
   186.576  time_loop
 Nx=                 2000 Ny=                 2000
 stages   2.5224999999999497E-003   117.58536029999989        68.988256199999995
 max phi =  0.99999898672103882
     1.393  report phi
 max phi =   0.99999898672103882
zero values            0
values < 1.d-200       0
values < 1.d-30        0
values > 1.d-30  4000000

Yes. Version 13.2.0 also does not terminate!

I tested Ver 11 and Ver 12, with the same result. There is a bug in Gfortran with -fopenmp, but this is the only program I have that exhibits this problem.

Your results are very mixed, especially for Intel! I have seen this poor performance in other tests with IEEE floating point underflow errors.

What time are you reporting here? Is it the section between 'initialise phi' and 'time_loop'?

If you do some counting of the memory accesses, you can quickly estimate how well the code is performing.


First loop nest

Variable        Write   Read
phi               0       5
tempr             0       5
lap_phi           1       0
lap_tempr         1       0
phidx             1       0
phidy             1       0
epsil             1       0
epsilon_deriv     1       0

Second loop nest

Variable        Write   Read
phi               1       1
epsil             0       5
epsilon_deriv     0       4
phidx             0       2
phidy             0       2
tempr             1       1
lap_phi           0       1

If you sum these values, it gives you the amount of memory accessed to update one grid cell:

        Write   Read
Total     8      26

So a total of 34 elements per cell update, 8 bytes each, which gives 272 bytes per cell update. I ignored the where(phi > ...) ... clipping part, which is essentially a third and fourth loop nest, so take the following numbers with a grain of salt.

Now if you calculate the rate (cell updates per second), using the best number from Test 2:

rate = ((2000 * 2000) cells) * (2000 steps) / 98.173 s = 81.5 MUPS (mega cell updates per second)

If you multiply the rate by the memory balance, you get an effective bandwidth:

bw = (81.5 MUPS) * (272 bytes per cell update) = 22168 MB/s = 22.17 GB/s

If you look up the properties of your processor, you’ll find it has a peak memory bandwidth of 50 GB/s, so you are using about 40 % of it. (For the 2000^2 grid, 8 field variables, double precision, you need 256 MB of memory which exceeds the 12 MB L3 cache. The 200^2 case only needs 2.56 MB)
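
The same arithmetic as a tiny self-contained program, with the byte count and the Test 2 time hard-coded from above (a convenience sketch, not part of the benchmark):

program bw_estimate
    implicit none
    real ( kind = 8 ) :: cells, steps, seconds, bytes_per_update, rate, bw

    cells            = 2000.0d0 * 2000.0d0    ! grid size from Test 2
    steps            = 2000.0d0               ! time steps
    seconds          = 98.173d0               ! best Test 2 time
    bytes_per_update = ( 8 + 26 ) * 8.0d0     ! 8 writes + 26 reads of 8 bytes = 272

    rate = cells * steps / seconds            ! cell updates per second
    bw   = rate * bytes_per_update            ! effective bytes per second

    write (*,'(a,f8.1)') ' rate (MUPS) = ', rate / 1.0d6
    write (*,'(a,f8.2)') ' bw (GB/s)   = ', bw / 1.0d9
end program bw_estimate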

Assuming your code is bandwidth-limited (stencil codes typically are), this is not that bad. I’m guessing you could probably go a little faster (say up to 60-70 % of the theoretical bandwidth). Your kernel uses trigonometric functions (sin, cos, atan), so perhaps that gives it a slightly higher arithmetic intensity, which balances out the memory load time. You could verify this by looking at the hardware performance counters with profiling tools. For your loop nests, introducing a layer of halo cells for the periodic boundary condition would make the loops easier to vectorize and perhaps speed them up a little.
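
On the halo-cell suggestion, here is a minimal sketch of the idea, with names adapted from the programs above (my illustration, not the original code):

!  One extra layer of cells holds the periodic images, so the stencil loop
!  needs no wrap-around IF tests and vectorizes more easily.
real ( kind = 8 ), dimension( 0:Nx+1, 0:Ny+1 ) :: phi, lap_phi

!  refresh the halo once per time step (periodic boundary)
phi(0,1:Ny)    = phi(Nx,1:Ny)      ! left halo   <- right edge
phi(Nx+1,1:Ny) = phi(1,1:Ny)       ! right halo  <- left edge
phi(1:Nx,0)    = phi(1:Nx,Ny)      ! bottom halo <- top edge
phi(1:Nx,Ny+1) = phi(1:Nx,1)       ! top halo    <- bottom edge

!  the inner loop is now branch-free
do j = 1, Ny
    do i = 1, Nx
        lap_phi(i,j) = ( phi(i+1,j) + phi(i-1,j) + phi(i,j-1) + phi(i,j+1) &
                       - four*phi(i,j) ) / ( dx*dy )
    end do
end do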

Instead of guessing, I’d recommend using tools like the Intel Application Performance Snapshot, as I’ve described before here and also here. The tool essentially helps identify what is the bottleneck (at least from the processor point of view), for instance if it is using scalar or vector instructions, if the cache is stalling, if there is thread imbalance or overhead, etc… I find it really useful. If not obvious, despite being an Intel tool, you can also use it to profile executables produced by other compilers (gfortran, nvfortran), and also on other x86-64 machines (for instance if you have a CPU from AMD).

Another thing you can try doing is plotting the rate (cell updates per second) as a function of grid size. In this plot you should be able to clearly see the effect of going from cache to main memory. Here is an example of what it looks like when you hit the “memory wall”:

I’d expect you to see something similar.


This is a nice idea, I will have a look at it.

I never used these before, so they are on my list now.

I have heard a lot about it but never tried it.

There are two curves here. What do they represent? If it is total cell updates per second then it should be a single curve, I think. How do I get the vertical lines L1 and L2?

I tried the -O3 compiler flag together with collapse(2) for OpenMP. This generates SIMD code for the loop, and I find the performance comparable with do concurrent. Without -O3 I do not see any improvement even when collapse(2) is used in the code.
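
(Presumably with an invocation along the lines of the following, reconstructed from the flags mentioned in this thread:)

nvfortran omp.f90 -o omp -mp -O3 -Minfo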

Sorry about the rawness, I forgot to add labels. They are two different stencils, a 9-pt stencil and a 21-pt stencil; the colors are different runs. You can find the cache sizes by using the command lscpu. You can also find them listed on the internet: Intel Core i5-11500 Specs | TechPowerUp CPU Database. Then you need to find the corresponding number of array elements. Alternatively, you can put the amount of memory on the x-axis.


@ivanpribec
This is an interesting approach, but it ignores a very important consideration: the unit of memory access is based on memory pages (4 kbytes). Unfortunately this can make your analysis more complex.
The problem with looking only at bytes is that it underestimates the total memory transfers.
If you are accessing contiguous memory, you get the same result, but if you are randomly accessing 8-byte values, you are touching 4 kbytes of memory for each access, i.e. 512x as much memory.
(This is why the inner loop index should be i! This has been emphasised since I first tried to write optimised Fortran calculations.)

Examples of this issue include:

  1. I changed the following loop order, although it is outside the main calculation loops:
    do i = 1, Nx
        do j = 1, Ny
            if ( (i - Nx/two)*(i - Nx/two) + (j - Ny/two)*(j - Ny/two) < seed ) then
                phi(i,j) = one
            end if
        end do
    end do
  2. In this set of statements, the first two require 3 pages of memory (for jm, j and jp); the third requires only 1 page and the fourth 2 pages. Each page is then processed by i, so sequentially, but this requires a 3x memory footprint in L1 and L2 cache (which is another complexity) as well as 3x in L3 cache:
                lap_phi(i,j)   = ( phi(ip,j) + phi(im,j) + phi(i,jm) + phi(i,jp) - four*phi(i,j)) / ( dx*dy )
                lap_tempr(i,j) = ( tempr(ip,j) + tempr(im,j) + tempr(i,jm) + tempr(i,jp) - four*tempr(i,j)) / ( dx*dy )
                
                !======
                
                phidx(i,j) = ( phi(ip,j) - phi(im,j) ) / dx
                phidy(i,j) = ( phi(i,jp) - phi(i,jm) ) / dy

Unfortunately, trying to manage L1, L2 and L3 cache efficiency when there is a memory access bottleneck is very difficult, especially when the only controls we have are Fortran statements.

Another complexity to understand for OpenMP is the use of -ffast-math. This has been shown to be very effective for single-thread (well behaved) calculations, but when memory bandwidth problems emerge, the advantage erodes.

Finally, I am not an expert on Nvfortran or the two Intel Fortrans, but these do appear to utilise multiple threads for DO CONCURRENT, while the versions of Gfortran I use do not. (All would require appropriate compiler options?)
I would observe that the extra detail required to make DO CONCURRENT multi-threaded is about the same as using !$OMP in Gfortran (which documents the intent more clearly).

So the important result is to make sure the statements document the programmer’s intent.
I achieve that with Gfortran.

Although there are many disadvantages of using OpenMP, the advantages dominate.

You are absolutely right, the analysis is not infallible. It is based on many assumptions (you may also call them prerequisites), including:

  • no cold cache effects
  • “steady state” has been reached in the long loops
  • the overall rate is limited by a single bottleneck

More generally, this type of analysis is part of the roofline model. This is a “naive” or “optimistic” performance model which, in its basic form, ignores architectural aspects and, just as you say, underestimates the true amount of memory transfers. (We are also ignoring facts like other processes having data cached, so the caches aren’t fully available, etc.)
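
For reference, the basic roofline bound can be written as

$$ P = \min\left( P_{\text{peak}},\; I \cdot B \right) $$

where $P$ is the attainable performance (flop/s), $P_{\text{peak}}$ the peak compute rate, $I$ the arithmetic intensity of the kernel (flops per byte moved), and $B$ the peak memory bandwidth.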

It is still useful in my opinion, because measuring just the elapsed time doesn’t really tell us how well the processor is being exploited. This tutorial video explains the roofline model very well.

Another important effect is that most CPUs, except the most expensive server ones, throttle the CPU frequency when they get hot. And when using all cores for heavy numerical computations, they will get hot. So you can have a situation where a single thread runs at 100% frequency, but when running all threads the frequency throttles down. The time spent at 100% load also matters here, making it more complicated to account for.