OpenMP: efficiency of reduction on a large array

Dear all:

Is it possible to update elements of a large array in OpenMP without using a reduction? Since a reduction makes a temporary private copy of the large array for each thread, I would like to avoid it, and in fact the serial computation is faster than the OpenMP version with reduction.

I did notice the existing question OpenMP question: private vs shared work arrays for reduction, but it still does not solve my problem without using a reduction.

Here is a code example that reproduces the race condition I am running into; hopefully there is a solution for it!

program main

    use iso_Fortran_env, only: rk => real64, ik => int32
    use omp_lib
    implicit none

    integer(ik) :: i, j, t, ipr
    real(rk), dimension(10, 10, 3) :: a
    real(rk), dimension(10) :: jgrid, tgrid
    real(rk), dimension(3, 3) :: pii
    real(rk), dimension(3) :: igrid
    real(rk) :: time0, time1

    time0 = omp_get_wtime()
    a = 0.0_rk
    jgrid = 1.0_rk
    tgrid = 2.0_rk
    igrid = 3.0_rk
    pii = transpose(reshape([0.1_rk, 0.5_rk, 0.4_rk, &
                    0.3_rk, 0.6_rk, 0.1_rk, &
                    0.4_rk, 0.1_rk, 0.5_rk], [3, 3]))

    ! -------------------------------------------- !
    ! !$omp parallel do collapse(4) reduction(+:a) !
    ! -------------------------------------------- !
    do i = 1, 3, 1
        do j = 1, 10, 1
            do t = 1, 10, 1
                do ipr = 1, 3, 1
                    a(t, j, ipr) =  a(t, j, ipr) + igrid(i)**pii(i, ipr) + jgrid(j)*tgrid(t)
                enddo
            enddo
        enddo
    enddo

    write(*, *) sum(a)

    time1 = omp_get_wtime()
    write(*, '(a, F12.6, a)') 'elapsed time: ', time1 - time0, ' seconds.'

  
end program main

If the !$omp directive remains commented out, the serial execution time is:

   3123.9733500317811     
elapsed time:     0.000041 seconds.

If I uncomment the boxed directive to enable OpenMP with the reduction, here is the result:

   3123.9733500317811     
elapsed time:     0.001185 seconds.

And if I delete the reduction(+:a) clause, the result is different every time. I think this is the race condition.

Thank you so much!

The DO order you proposed is very unsuitable.
Collapse is probably ineffective.
With better DO order, reduction is not necessary.

As a first pass, I would:
  • improve the DO order
  • declare public/shared arrays (at least for clarity)
  • remove collapse, or use only collapse(2)

At the least, “i” then “t” should be taken out of collapse to remove any race condition.

The following appears to work for me

program main

    use iso_Fortran_env, only: rk => real64, ik => int32
    use omp_lib
    implicit none

    integer(ik) :: i, j, t, ipr
    real(rk), dimension(10, 10, 3) :: a
    real(rk), dimension(10) :: jgrid, tgrid
    real(rk), dimension(3, 3) :: pii
    real(rk), dimension(3) :: igrid
    real(rk) :: time0,time1
    
    time0 = omp_get_wtime()
    a = 0.0_rk
    jgrid = 1.0_rk
    tgrid = 2.0_rk
    igrid = 3.0_rk
    pii = transpose(reshape([0.1_rk, 0.5_rk, 0.4_rk, &
                    0.3_rk, 0.6_rk, 0.1_rk, &
                    0.4_rk, 0.1_rk, 0.5_rk], [3, 3]))

    ! -------------------------------------------- !
    ! !$omp parallel do collapse(4) reduction(+:a) !
    ! -------------------------------------------- !
  !$omp parallel do PRIVATE (ipr,j,t,i) SHARED (a,igrid,pii,jgrid,tgrid)
    do ipr = 1, 3, 1
        do j = 1, 10, 1
            do t = 1, 10, 1
                do i = 1, 3, 1
                    a(t, j, ipr) =  a(t, j, ipr) + igrid(i)**pii(i, ipr) + jgrid(j)*tgrid(t)
                enddo
            enddo
        enddo
    enddo
  !$omp end parallel do

    write(*, *) sum(a)

    time1 = omp_get_wtime()
    write(*, '(a, F12.6, a)') 'elapsed time: ', time1 - time0, ' seconds.'

  
end program main

Then see what can be improved.


Is the following equivalent to your calculation?

program main

    use iso_Fortran_env, only: rk => real64, ik => int32
    use omp_lib
    implicit none

    integer(ik) :: i, j, t, ipr
    real(rk), dimension(10, 10, 3) :: a
    real(rk), dimension(10) :: jgrid, tgrid, x
    real(rk), dimension(3, 3) :: pii
    real(rk), dimension(3) :: igrid
    real(rk) :: time0,time1, sec
    
    time0 = omp_get_wtime()
    sec = elapse_sec ()
    a     = 0.0_rk
    jgrid = 1.0_rk
    tgrid = 2.0_rk
    igrid = 3.0_rk
    pii = transpose(reshape([0.1_rk, 0.5_rk, 0.4_rk, &
                             0.3_rk, 0.6_rk, 0.1_rk, &
                             0.4_rk, 0.1_rk, 0.5_rk], [3, 3]))

    ! -------------------------------------------- !
    ! !$omp parallel do collapse(4) reduction(+:a) !
    ! -------------------------------------------- !
  !$omp parallel do PRIVATE (ipr,j,t,i,x) SHARED (a,igrid,pii,jgrid,tgrid)
     do ipr = 1, 3, 1
        do j = 1, 10, 1
            do t = 1, 10, 1
                x(t) = jgrid(j)*tgrid(t) * 3.
                do i = 1, 3, 1
!z                    a(t, j, ipr) =  a(t, j, ipr) + igrid(i)**pii(i, ipr) + jgrid(j)*tgrid(t)
                    x(t) = x(t) + igrid(i)**pii(i, ipr)
                enddo
            enddo
            a(:, j, ipr) =  x
        enddo
    enddo
  !$omp end parallel do

    write(*, *) sum(a)

    time1 = omp_get_wtime()
    sec = elapse_sec () - sec
    write(*, '(a, F12.6, a)') 'elapsed time: ', time1 - time0, ' seconds.'
    write (*,*) 'elapse', sec

  
   contains
     real*8 function elapse_sec ()
     integer*8 clock, rate
     call SYSTEM_CLOCK ( clock, rate )
       elapse_sec = dble(clock)/dble(rate)
     end function elapse_sec
end program main

omp_get_wtime is implemented poorly on the gfortran I use


The workload in your example is way too low to get any improvement with OpenMP. In your timing you are just observing the OpenMP overheads (creating the threads, synchronising at the end…). General rule: OpenMP (or, more generally, any kind of parallelism) is useless for low workloads.
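If you want to quantify that overhead on your own machine, a minimal sketch along these lines (an illustration only, not code from this thread) times repeated empty parallel regions, so that essentially only the fork/join cost is measured:

program omp_overhead
    ! Illustrative sketch: time empty parallel regions to estimate
    ! the fork/join overhead of OpenMP on a given machine.
    use omp_lib
    implicit none
    integer, parameter :: nrep = 1000
    integer :: k
    double precision :: t0, t1

    t0 = omp_get_wtime()
    do k = 1, nrep
!$omp parallel
        ! empty region: only thread start-up and synchronisation are timed
!$omp end parallel
    end do
    t1 = omp_get_wtime()

    write(*, '(a, es12.4, a)') 'average parallel region overhead: ', &
        (t1 - t0) / nrep, ' seconds'
end program omp_overhead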

Other general comments:

  • Collapsing all nested loops is not necessarily a good strategy when the inner loop can be vectorised by the compiler (which is the case here).

  • As far as possible, it’s better for the inner loops to correspond to the first dimensions of the arrays. So here ipr is best iterated by the outer loop, then j, t and i (as proposed by @JohnCampbell). The t and i loops can also be swapped, and array syntax can be used. No reduction is needed, as there is no race condition here.

     !$OMP PARALLEL DO PRIVATE (ipr,j,i) COLLAPSE(2)
     do ipr = 1, 3, 1
        do j = 1, 10, 1
            a(:, j, ipr) =  a(:, j, ipr) + 3*jgrid(j)*tgrid(:)
            do i = 1, 3, 1
                a(:, j, ipr) =  a(:, j, ipr) + igrid(i)**pii(i, ipr)
            enddo
        enddo
    enddo

But again, with such small dimensions there is no chance of observing any speed-up with OpenMP.


I looked at the test loops and found it does not take much of a size increase to get !$OMP efficiency.
The workload quickly overcomes the !$OMP overheads.

Changing to jgrid(20) and tgrid(20) provides enough calculation to achieve an OMP improvement of 40% without collapse and 100% with collapse(2).
Changing to jgrid(50) and tgrid(50) with collapse(2) provides a significant OMP improvement of 500%.
Other changes are listed below.
I assumed the dimensions of jgrid and tgrid would be problem dependent.

The most useful timing for OMP tests is elapsed time.

For this test, I placed the test code I previously posted into a subroutine and called it with differing sizes for jgrid and tgrid.
I also tested the OMP loops both with and without collapse(2); collapse(2) also showed an improvement when the outer two DO loops are “ipr” and “j”, mainly because size(pii,2) is fixed at 3.
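The structure of that test is roughly the following sketch (illustrative only: the routine name, the sizes and the output format are placeholders, not the actual omp2_test.f90 attached below):

program omp2_sketch
    ! Illustrative driver: call the timed kernel with increasing jgrid/tgrid sizes.
    implicit none
    call run_case(10, 10)
    call run_case(20, 20)
    call run_case(50, 50)
    call run_case(100, 100)
end program omp2_sketch

subroutine run_case(nj, nt)
    ! Kernel from the earlier post, with jgrid(nj) and tgrid(nt) now allocatable.
    use iso_fortran_env, only: rk => real64
    use omp_lib
    implicit none
    integer, intent(in) :: nj, nt
    integer :: i, j, t, ipr
    real(rk), allocatable :: a(:, :, :), jgrid(:), tgrid(:)
    real(rk) :: pii(3, 3), igrid(3), t0, t1

    allocate(a(nt, nj, 3), jgrid(nj), tgrid(nt))
    a = 0.0_rk ; jgrid = 1.0_rk ; tgrid = 2.0_rk ; igrid = 3.0_rk
    pii = transpose(reshape([0.1_rk, 0.5_rk, 0.4_rk, &
                             0.3_rk, 0.6_rk, 0.1_rk, &
                             0.4_rk, 0.1_rk, 0.5_rk], [3, 3]))

    t0 = omp_get_wtime()
!$omp parallel do collapse(2) private(ipr, j, t, i) shared(a, igrid, pii, jgrid, tgrid)
    do ipr = 1, 3
        do j = 1, nj
            do t = 1, nt
                do i = 1, 3
                    a(t, j, ipr) = a(t, j, ipr) + igrid(i)**pii(i, ipr) + jgrid(j)*tgrid(t)
                end do
            end do
        end do
    end do
!$omp end parallel do
    t1 = omp_get_wtime()

    write(*, '(a, i4, a, i4, a, f14.3, a, f10.6, a)') 'nj=', nj, ' nt=', nt, &
        '  Sum A =', sum(a), '  elapsed =', t1 - t0, ' s'
end subroutine run_case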

The comparison of loop performance is below; each “elapse” line lists the OMP time, the OMP collapse(2) time, the single-thread time, and then the speed-up of the two OMP variants relative to the single thread:

  It is now Saturday, 10 December 2022 at 22:31:55.800
Driving: gfortran omp2_test.f90 -v -fallow-argument-mismatch -O2 -fopenmp -o omp2_test.exe -l gfortran
Built by Equation Solution <http://www.Equation.com>.
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=c:/program\ files\ (x86)/gcc_eq/gcc_11.1.0/bin/../libexec/gcc/x86_64-w64-mingw32/11.1.0/lto-wrapper.exe
Target: x86_64-w64-mingw32
Configured with: ../gcc-11.1.0-mingw/configure --host=x86_64-w64-mingw32 ...
Thread model: win32
COLLECT_GCC_OPTIONS='-v' '-fallow-argument-mismatch' '-O2' '-fopenmp' '-o' 'omp2_test.exe' '-mtune=generic' '-march=x86-64' '-mthreads' '-dumpdir' 'omp2_test.'

OMP test  ni=   3 nj=  10 nt=  10 npi=   3
1 OMP Thread Test     1 Sum A = 3123.973
2 OMP CLP Thread Test 2 Sum A = 3123.973
3 Single Thread Test  3 Sum A = 3123.973
elapse   0.00037   0.00007   0.00004     0.116     0.582

OMP test  ni=   3 nj=  20 nt=  20 npi=   3
1 OMP Thread Test     1 Sum A = 12495.893
2 OMP CLP Thread Test 2 Sum A = 12495.893
3 Single Thread Test  3 Sum A = 12495.893
elapse   0.00012   0.00008   0.00016     1.398     2.007

OMP test  ni=   3 nj=  50 nt=  50 npi=   3
1 OMP Thread Test     1 Sum A = 78099.334
2 OMP CLP Thread Test 2 Sum A = 78099.334
3 Single Thread Test  3 Sum A = 78099.334
elapse   0.00047   0.00017   0.00100     2.102     5.981

OMP test  ni=   3 nj= 100 nt= 100 npi=   3
1 OMP Thread Test     1 Sum A = 312397.335
2 OMP CLP Thread Test 2 Sum A = 312397.335
3 Single Thread Test  3 Sum A = 312397.335
elapse   0.00173   0.00057   0.00390     2.252     6.857

OMP test  ni=   3 nj= 100 nt= 200 npi=   3
1 OMP Thread Test     1 Sum A = 624794.670
2 OMP CLP Thread Test 2 Sum A = 624794.670
3 Single Thread Test  3 Sum A = 624794.670
elapse   0.00343   0.00110   0.00818     2.385     7.472

OMP test  ni=   3 nj= 200 nt= 500 npi=   3
1 OMP Thread Test     1 Sum A = 3123973.350
2 OMP CLP Thread Test 2 Sum A = 3123973.350
3 Single Thread Test  3 Sum A = 3123973.350
elapse   0.01580   0.00439   0.03919     2.480     8.929

OMP test  ni=   3 nj= 500 nt= 800 npi=   3
1 OMP Thread Test     1 Sum A = 12495893.400
2 OMP CLP Thread Test 2 Sum A = 12495893.400
3 Single Thread Test  3 Sum A = 12495893.400
elapse   0.06302   0.01688   0.15156     2.405     8.977

I have linked the test program below, and the compile options are listed above.
omp2_test.f90 (3.4 KB)


Sorry, @JohnCampbell and @PierU, for the late reply; I was on a flight back to my home country. I will look into all of your suggestions and reply to all of you tomorrow!

Thank you so much!

I don’t think this is equivalent to my calculation. My actual code computes the steady-state distribution of firms (in economics), and the update of the distribution array, which corresponds to the a array in the original example, only happens whenever there’s mass at that (i, j, ipr).

That’s probably true. My senior told me that updating the stationary distribution is a serial task. Maybe I just shouldn’t try to parallelize this work.

As a bit of a detour, I am unsure about the differences between call cpu_time(), call system_clock(), and omp_get_wtime(). It seems to me that when I use OpenMP, cpu_time exaggerates the time spent, since I guess using more CPUs means more CPU time. What about the other two?

I don’t think the approach of making the ipr loop the outermost loop is applicable to my actual code, which is in this repo. The highlighted loop is the one I am trying to parallelize. Note that I haven’t updated it yet, and this version of the code still uses derived types, so it might not work properly with OpenMP.

But still, I think your answer is great! The comparison between collapsing only to the second level, the default multi-threaded version, and the single-threaded version is very useful. It gives me some insight into how to actually optimize my own code, and this structure is also a good benchmark for trying out different settings.

In your original calculation below, the statement a(t, j, ipr) = a(t, j, ipr) + formula(i, ipr, j, t) is evaluated 3 x 10 x 10 x 3 = 900 times. Changing the order of the DO loops only changes the order of the computation; the resulting values assembled in A(1:10,1:10,1:3) are still the same. The results I presented for “OMP test ni= 3 nj= 10 nt= 10 npi= 3” match your original post.
(This is important to understand, as it helps eliminate the race condition by swapping DO ipr with DO i.)

You state “only happens whenever there’s mass at that (i, j, ipr)”, but the code you posted does not have any logic for this restriction. Did you mean to have restrictions based on some (i, j, ipr) case?

    do i = 1, 3, 1
        do j = 1, 10, 1
            do t = 1, 10, 1
                do ipr = 1, 3, 1
                    a(t, j, ipr) =  a(t, j, ipr) + igrid(i)**pii(i, ipr) + jgrid(j)*tgrid(t)
                enddo
            enddo
        enddo
    enddo

Sorry for not being clear. Yes, in my example i, j and t index the current distribution mass, and I am trying to update to the next-period distribution by moving some of the mass along the future-i index ipr; in the next iteration the future i becomes today’s i, and this repeats until I find the fixed point.
Therefore, I am not sure whether it is possible to swap the order of i and ipr, since what I need to do is: given that today’s mass is at (i, j, t), determine where this mass goes tomorrow, (ipr, jpr, tpr), where jpr and tpr follow from some internal calculation.
So to be clear, my question is: is it possible to use OpenMP to improve efficiency given this reduction structure on a large array?

As the computation time is 0.001 seconds, I would ignore !$OMP and focus on “trying to update to the next-period distribution by moving some of the mass along the future-i index ipr”.

I don’t see any logic that identifies this “mass” shifting related to “i” or “ipr”.
This appears to be a sequential process where periods “i” and “ipr” are possibly related?
You need to clarify these relationships first and ignore computational efficiency.

omp_get_wtime() measures the elapsed time; that is, it simply reads a clock and returns the value. cpu_time() measures the time that each core spends executing a thread, summed over all of them. The cpu_time() of an OpenMP code that scales perfectly should be the same as the cpu_time() of the serial code, while the omp_get_wtime() should be inversely proportional to the number of threads.
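A minimal sketch of that difference (an illustration only; the loop body is just filler work):

program timer_compare
    ! Illustrative sketch: cpu_time() sums the time over all threads,
    ! while omp_get_wtime() reports elapsed (wall clock) time.
    use omp_lib
    implicit none
    integer :: i
    double precision :: s, cpu0, cpu1, wall0, wall1

    s = 0.0d0
    call cpu_time(cpu0)
    wall0 = omp_get_wtime()
!$omp parallel do reduction(+:s)
    do i = 1, 50000000
        s = s + sin(dble(i))          ! filler work
    end do
!$omp end parallel do
    wall1 = omp_get_wtime()
    call cpu_time(cpu1)

    ! With n threads and good scaling, the cpu time stays roughly constant
    ! while the wall time drops by about a factor of n.
    write(*, '(a, f8.3, a, f8.3, a)') 'cpu_time:', cpu1 - cpu0, &
        ' s   wall time:', wall1 - wall0, ' s'
end program timer_compare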

Regarding your code, before thinking about parallelizing it, you should make sure that the serial version really does what it is supposed to do. It’s not that clear at the moment.

Thanks @PierU and @JohnCampbell! I will look into my code and tidy it up!

[quote="fish830911, post:9, ttheir recquote]
To add some further clarification: system_clock() and omp_get_wtime() are estimates of the elapsed time or “wall clock time”. The problem with these measures is when their recorded value is updated and what accuracy is given to “rate”. Hence my use of INTEGER*8 arguments for SYSTEM_CLOCK, which provides a different “rate” than using INTEGER*4.
Some operating system implementations of omp_get_wtime(), SYSTEM_CLOCK and other elapsed-time reports are updated only 64 times per second. Most operating systems have different implementations.
The wall time obtained can also differ between threads, but this difference is usually much smaller than the interval at which the recorded value is updated, although the update can occur at different times for each thread. It is a “claimed problem” to be aware of when comparing wall clock times between threads, but it is not significant.
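A small sketch of the kind dependence (illustrative only; the actual rates are implementation dependent, e.g. gfortran typically reports a much finer rate with 64-bit arguments):

program clock_rates
    ! Illustrative sketch: SYSTEM_CLOCK can report a different count rate
    ! (i.e. a different timer resolution) depending on the integer kind used.
    use iso_fortran_env, only: int32, int64
    implicit none
    integer(int32) :: c32, r32
    integer(int64) :: c64, r64

    call system_clock(c32, r32)
    call system_clock(c64, r64)
    write(*, *) 'count rate with int32 arguments:', r32
    write(*, *) 'count rate with int64 arguments:', r64
end program clock_rates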

As discussed, CPU_TIME is the processor time accumulated over all “CPUs” associated with the process.
In general, CPU_TIME / clock time = number of threads used; a lesser value can often indicate inefficiency.

If the CPU is running multiple processes, as almost all CPUs these days do, then the CPU time for a single process will be less than the wall time. This has nothing to do with inefficiency; it just means that some other user got a time slice, or some background process had to wake up and do something (service a print queue, file an incoming mail, handle a network interrupt, mount or unmount a disk, etc.). On the machine I am typing on right now, I am running a browser, so the CPU needs to process my keystrokes, and when I click the reply button it sends the information off to the remote web site.

@RonShepard

I run my compute-intensive multi-threaded tasks defined with OpenMP on Windows with a multi-core Intel i7 or AMD Ryzen processor. They are very different from the multi-threaded tasks that the operating system appears to run (as shown in Task Manager).
If I am running multiple threads, CPU_TIME accumulates the “CPU” time for all associated threads.
Typically they run at 100% usage (doubled for hyper-threading?), but if CPU usage is not the bottleneck they can run at reduced per-core utilisation. The most common cause of such a reduction for these processes is a memory-access bottleneck, i.e. an excessive demand for data transfer between memory and cache; this is the inefficiency I referred to in my previous post.

This is a very different resource-allocation problem from the days of time-sharing, when there were very few cores available (typically one!) and many processes running. (I shut down our last Pr1me in 1992!)
A Windows PC actually has this environment (see Task Manager > Performance for the process and thread counts), but those threads are very different from the compute-intensive threads generated with OpenMP. It is a good idea to reduce your process’s thread count and leave 1 or 2 cores for the other, mostly suspended, processes.

My comments were limited to OpenMP-style tasks and how CPU_TIME reports their usage. How complex a description do you want?