Question on OpenMP reduction

Dear all:

I am wondering whether it is possible to efficiently parallelize the update of phi_new when the phi_new array is huge, e.g. 4-dimensional. To add some context, this code calculates the stationary distribution, so we have to follow this loop structure, i.e., ia, then is, then is_p.

Your code snippet appears to involve a number of arrays, some of which you suggest are large?

phi_new(??, is_p)
pi(is,is_p)
varphi(ia,is)
phi(ia,is)
ial(ia,is)
iar(ia,is)

A significant problem appears to be “do is_p = 1, NS” as the inner loop, which implies large memory strides when addressing pi and phi_new.
Perhaps transpose phi_new and pi?
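
For illustration, a minimal sketch of what the transposed layout could look like (phi_new_t and pi_t are hypothetical transposed copies of phi_new and pi, dimensioned (NS, 0:NA) and (NS, NS)), so that the innermost is_p loop walks through memory with stride 1:

do ia = 0, NA
  do is = 1, NS
    do is_p = 1, NS
      phi_new_t(is_p, ial(ia,is)) = phi_new_t(is_p, ial(ia,is)) + pi_t(is_p,is)*     varphi(ia,is) *phi(ia,is)
      phi_new_t(is_p, iar(ia,is)) = phi_new_t(is_p, iar(ia,is)) + pi_t(is_p,is)*(1d0-varphi(ia,is))*phi(ia,is)
    end do
  end do
end do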

You could apply OMP to the two outer loops (around line 221 of the linked code) using collapse(2)

ial and iar could pose efficiency problems through cache clashing between threads.

The problem may remain that you have large arrays but minimal calculation per iteration, which can be a problem for OMP efficiency.

Lots of unknowns given your brief description!

Assuming ial and iar are arrays and NA and NS are comparable, you could try the following.

!$OMP PARALLEL DO SHARED (ns,na, phi_new, pi, varphi, phi, ial, iar)  &
!$OMP&            PRIVATE (is_p,ia,is,la,ra)
  do is_p = 1, NS
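    ! each thread is assigned a distinct set of is_p values, i.e. distinct columns
    ! phi_new(:, is_p), so no two threads ever update the same element of phi_new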
    do is = 1, NS
      do ia = 0, NA
        la = ial(ia,is)
        ra = iar(ia,is)
        phi_new(la, is_p) = phi_new(la, is_p) + pi(is,is_p)*     varphi(ia,is) *phi(ia,is)
        phi_new(ra, is_p) = phi_new(ra, is_p) + pi(is,is_p)*(1d0-varphi(ia,is))*phi(ia,is)
      end do
    end do
  end do
!$OMP END PARALLEL DO

They are not. I can see in the code linked from the original post that NA=1000 and NS=7.

This also means that the total number of iterations for the 3 nested loops is about 49,000, which is quite a low count, and the inner operations are just 2 FMAs: the total workload is low, and the computations are probably memory bound. Taken together, there is little chance of getting an important speed-up with OpenMP here.

In addition, nothing says that ial(ia,is) and iar(ia,is) have unique values for different (ia,is) couples. If the values are not unique, then the loops on ia and is simply cannot be parallelized… unless a reduction on phi_new is declared, but that will make the code even more memory bound.

@PierU
Given what you say, collapse(2) looks unlikely to be worthwhile, but the approach I outlined above may still achieve a reasonable speed-up for num_threads <= NS, as parallelising over is_p will protect against any thread clashes.

I suggest you take a look at the checker PWR052: Consider applying multithreading parallelism to sparse reduction loop, which covers the parallelization of loops that compute a “sparse reduction” pattern (a.k.a. “histogram reductions”). This has been extensively studied in papers about parallelization with OpenMP/OpenACC for multithreading and offloading. Note: the example in the catalog is in C; we are working to improve the documentation for Fortran (see the Fortran example here).

The first thing to consider is protecting the sparse reduction array in order to avoid race conditions during the parallel execution and ensure the correctness of the code.

Next you can consider other aspects related to performance, which includes topics like the ones discussed in this thread related to cache efficiency, loop collapsing, etc.

We have conducted several training courses over the years using the Codee tools addressing this important sparse reduction pattern. In the materials of this Codee course at NERSC, take a look at the examples ATMUX and LULESHmk, which also compute sparse reduction patterns.

I hope this helps as a starting point, as efficiently parallelizing sparse reductions is not straightforward in general!

In the code example I suggested above, the use of is_p as the parallel DO loop clearly achieves this outcome.
Has this approach been tested in the complete code?

@manuel.arenaz, regarding “PWR052”, I did not see any guidance on how to use “nodes(nel)” to improve OpenMP performance, as this approach comes with problems related to cache-use efficiency. Has this issue been addressed? It is a major problem when using OpenMP for these types of problems.

You can parallelize the two outer loops (over assets a and the shock z today) and declare phi_new as a reduction variable.

Reducing on a large array is often not very efficient…

You can parallelize the code as follows.

! iterate until the distribution function converges
phi = 1d0/((NA+1)*NS) ! initial guess

do iter = 1, itermax

    phi_new = 0d0

    !$omp parallel if (par==1) default(shared) private(ia,is,is_p) reduction(+: phi_new)
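    ! reduction(+: phi_new): each thread accumulates into its own private copy
    ! of phi_new, and the copies are summed into the shared array at the end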
    !$omp do collapse(2)
    do ia = 0, NA
        do is = 1, NS
            do is_p = 1, NS
                phi_new(ial(ia,is), is_p) = phi_new(ial(ia,is), is_p) + &
                    pi(is,is_p)*varphi(ia,is)*phi(ia,is)
                phi_new(iar(ia,is), is_p) = phi_new(iar(ia,is), is_p) + &
                    pi(is,is_p)*(1d0-varphi(ia,is))*phi(ia,is)
            enddo
        enddo
    enddo
    !$omp end do
    !$omp end parallel
    
    con_lev = maxval(abs(phi_new(:, :) - phi(:, :))/max(abs(phi(:, :)), 1d-10))

    ! update distribution
    phi = phi_new
    
    write(*,*) "iter = ", iter, "con_lev = ", con_lev

    ! check for convergence
    if(con_lev < sig_in)then
        return
    endif
enddo !end iteration

It turns out, however, that parallelization slightly slows down the code (at least on my laptop). As mentioned by @PierU, reducing on a large array is often not efficient. It would be great if someone else could come up with better suggestions.

A few comments on the code:

  • It would be better to post a minimal working example. The link that you posted refers to a full implementation of a general equilibrium, heterogeneous agent model (the Aiyagari model), but your question only concerns the computation of the stationary distribution.
  • There are too many global variables. The subroutine get_distribution does not have inputs or outputs, so it's not clear which arrays it is going to modify, etc.
  • Before starting the fixed-point iteration, you have to set an initial guess for the array phi. Since it is a distribution, all elements must be non-negative and add up to one.
  • (I admit this is a matter of personal taste.) Why use zero-based array indexing?
  • Note that I added a flag, par, to control whether to run the code in parallel or not. You can define par either from a command-line argument (a sketch is shown below) or inside the code as
integer, parameter :: par = 1 ! 1: parallel, 0: serial
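
For the command-line option, a minimal sketch (par can then no longer be a parameter; the names below are just illustrative):

integer :: par, nargs, ios
character(len=16) :: arg

par = 0                                ! default: serial
nargs = command_argument_count()
if (nargs >= 1) then
    call get_command_argument(1, arg)
    read(arg, *, iostat=ios) par       ! e.g. "./main 1" to run the parallel version
    if (ios /= 0) par = 0
end if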

Hope this helps!

Dear @aledinola

Yes, that is also what I realized. It seems like a reduction on a large array such as the distribution is a difficult task for OpenMP to parallelize efficiently.

I think what I am asking applies to any heterogeneous agent model, and thus I thought it would be sufficient to post the distribution code that I found in the econ textbook as an example to illustrate the issue. As for the zero-based array indexing, that comes from the textbook code, so I don't know why they chose it.

Thank you for the par example! I didn't know that OpenMP was able to do that. It will make testing much easier!

Thanks! I will take a look at the materials that you attached!

@aledinola
As you have the code working, could you please try my suggestion of modifying the loop order, using

!$OMP PARALLEL DO SHARED (ns,na, phi_new, pi, varphi, phi, ial, iar)  &
!$OMP&            PRIVATE (is_p,ia,is,la,ra)
  do is_p = 1, NS

I have not seen the definition of the ial and iar arrays (or functions?), but I expect this revised approach has a better chance of improving performance, while still getting the same calculated result.
If the arrays are “large”, the loop order you tested would struggle with efficient cache usage.

As you are struggling to improve performance, your use of collapse(2) does not look to be helpful.

A reduction implies a private copy of the variable for each thread, plus an update of the original variable at the end of the parallel region, and this update cannot fully use all the threads. All of this has a cost.
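
Conceptually, reduction(+: phi_new) behaves roughly like the following hand-written version (a sketch only, not the literal compiler transformation; phi_local is a hypothetical thread-private work array):

real*8 :: phi_local(0:NA, NS)

!$omp parallel default(shared) private(ia, is, is_p, phi_local)
phi_local = 0d0                      ! each thread starts from a zeroed private copy
!$omp do collapse(2)
do ia = 0, NA
    do is = 1, NS
        do is_p = 1, NS
            phi_local(ial(ia,is), is_p) = phi_local(ial(ia,is), is_p) + &
                pi(is,is_p)*varphi(ia,is)*phi(ia,is)
            phi_local(iar(ia,is), is_p) = phi_local(iar(ia,is), is_p) + &
                pi(is,is_p)*(1d0-varphi(ia,is))*phi(ia,is)
        enddo
    enddo
enddo
!$omp end do
!$omp critical
phi_new = phi_new + phi_local        ! the merge step that cannot use all threads fully
!$omp end critical
!$omp end parallel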

@aledinola

I downloaded the code from the OP link and reproduced the inefficiency of using !$OMP.
I did the following tests using Gfortran 12.3.0 on Win 10.

The single-thread original code's elapsed time is 4.082 s using -O3, but 5.312 s using -O2.
With some optimising tweaks (mainly noting iaR(ia,is) = iaL(ia,is)+1), the elapsed time is 3.981 s (a minimal gain!).
Re-ordering the loops in order to use !$OMP, the elapsed time is 4.490 s.
So multi-threading is not effective.

There are a couple of contributors to this outcome:

  1. The !$OMP region is entered too often, as !$OMP is inside the DO loop “do iter = 1, itermax”, which is executed 23,636 times. Allowing 20 microseconds per OMP region entry, this gives 0.47 seconds of OMP startup overhead and little available OMP efficiency, as there is so little calculation per OMP loop cycle.

  2. It is interesting that, with -O3, changing the loop order for OMP removed the optimisation of not recalculating “varphi(ia,is)*phi(ia,is)” and “(1d0-varphi(ia,is))*phi(ia,is)” inside the “do is_p = 1, NS” loop, which gfortran's optimiser otherwise identifies.

Basically, -O3 is more effective than -fopenmp for this calculation.

The elapsed times for this calculation are also too small to warrant OMP support.

@JohnCampbell Thanks for testing the OpenMP code! I agree, there is too much overhead and OpenMP is not worth it. I think the best way is to rewrite the code a bit, as I did here (see the comments in the code):

! calculates the invariant distribution of households
subroutine get_distribution2()

    use toolbox
    use omp_lib

    implicit none
    integer :: ia, is, iter, is_p
    real*8 :: aplus, con_lev
    integer :: ial(0:NA, NS), iar(0:NA, NS)
    real*8 :: varphi(0:NA, NS)

    ! get the interpolation shares and points
    do ia = 0, NA
        do is = 1, NS

            ! calculate where this guy would go
            aplus = (1d0+r)*a(ia) + w*eta(is) - c(ia,is)
            aplus = min(max(aplus, a_l), a_u)

            ! determine the gridpoints in between this decision lies
            call linint_Grow(aplus, a_l, a_u, a_grow, NA, &
                ial(ia, is), iar(ia, is), varphi(ia, is))
        enddo
    enddo

    ! iterate until the distribution function converges
    phi =  1d0/((NA+1)*NS) ! set initial guess
    write(*,*) "sum(phi) = ", sum(phi)
    
    do iter = 1, itermax

        phi_new = 0d0

        ! First, calculate phi_new(a',s) using the law of motion for the 
        ! endogenous state variable a. 
        do ia = 0, NA
        do is = 1, NS
            phi_new(ial(ia,is),is)=phi_new(ial(ia,is),is)+varphi(ia,is)*phi(ia,is)
            phi_new(iar(ia,is),is)=phi_new(iar(ia,is),is)+(1d0-varphi(ia,is))*phi(ia,is)
        enddo
        enddo
        ! Second, calculate phi_new(a',s') using the transition matrix pi(s,s')
        ! phi_new(a',s') = phi_new(a',s)*pi(s,s'), where * is a matrix product
        phi_new = matmul(phi_new, pi)
        
        con_lev = maxval(abs(phi_new(:, :) - phi(:, :))/max(abs(phi(:, :)), 1d-10))

        ! update distribution
        phi = phi_new
        
        write(*,*) "iter = ", iter, "con_lev = ", con_lev

        ! check for convergence
        if(con_lev < sig_in)then
            return
        endif
    enddo !end iteration
    
    
    write(*,*)'Distribution Function Iteration did not converge'

end subroutine get_distribution2

The trick is the following:

  • First, update the endogenous state a, so that you loop over a and s only (but not over s'). In this step you have two nested loops instead of three.
  • In the second step, you update s, which you can do with a matrix multiplication.
  • I guess this can be further sped up by replacing matmul with a call to DGEMM (a possible sketch follows).
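
A possible (untested) sketch of the DGEMM replacement, assuming a BLAS library is linked; tmp is a hypothetical work array, needed because DGEMM's output must not overlap its inputs:

real*8 :: tmp(0:NA, NS)

! tmp = 1d0*matmul(phi_new, pi) + 0d0*tmp
call dgemm('N', 'N', NA+1, NS, NS, 1d0, phi_new, NA+1, pi, NS, 0d0, tmp, NA+1)
phi_new = tmp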

Also, note that to measure the timing it would be better to time the subroutine get_distribution only, because most of the time is spent doing the value function iteration and the general equilibrium loop over the interest rate. I created a repo on Github which contains a simplified version of OP’s code.

In the code on GitHub, I have two different versions of the distribution computation:
(1) get_distribution contains the original code, with the OpenMP reduction.
(2) get_distribution2 contains my version, where I split the update into two parts.

Hi @aledinola :

Thank you so much for your reply! I appreciate your idea of replacing the loop over pi(is, is_p) with a matrix multiplication. It does seem like the large array reduction on phi_new is unavoidable, and thus OpenMP isn't worth it.

Thanks a lot!

In the original code (shown below), you could potentially parallelize any of the 3 loops using OpenMP multithreading, and even collapse them in order to increase the number of iterations of the parallelized loop (as properly pointed out in this discussion). In any case, the key point is that the reduction array phi_new is read and written in each loop iteration, and that the array entry being updated is only known at run time, when the value of the index ial(ia,is) becomes known, as it depends on the contents of the array ial (the same holds for the array iar).

do ia = 0, NA
    do is = 1, NS
        do is_p = 1, NS
            phi_new(ial(ia,is), is_p) = phi_new(ial(ia,is), is_p) + &
                pi(is,is_p)*varphi(ia,is)*phi(ia,is)
            phi_new(iar(ia,is), is_p) = phi_new(iar(ia,is), is_p) + &
                pi(is,is_p)*(1d0-varphi(ia,is))*phi(ia,is)
        enddo
    enddo
enddo

The consequence is that, in order to preserve correctness (before any other performance optimization), the OpenMP multithreading version must also contain atomic directives before the statements updating the array phi_new. The code should be something like this (sorry, I did not have time to test this implementation myself before sending this reply):

!$OMP PARALLEL DO SHARED (ns,na, phi_new, pi, varphi, phi, ial, iar)  &
!$OMP&            PRIVATE (is_p,ia,is,la,ra)
  do is_p = 1, NS
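    ! the atomic updates below serialise conflicting writes to phi_new, keeping the
    ! result correct even when ial/iar map different (ia,is) pairs to the same index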
    do is = 1, NS
      do ia = 0, NA
        la = ial(ia,is)
        ra = iar(ia,is)
        !$OMP atomic update
        phi_new(la, is_p) = phi_new(la, is_p) + pi(is,is_p)*     varphi(ia,is) *phi(ia,is)
        !$OMP atomic update
        phi_new(ra, is_p) = phi_new(ra, is_p) + pi(is,is_p)*(1d0-varphi(ia,is))*phi(ia,is)
      end do
    end do
  end do
!$OMP END PARALLEL DO

Regarding performance, several considerations about this OpenMP multithreaded version:

  1. It uses only one array phi_new, in contrast to the several thread-private copies of the array created by the OpenMP reduction clause. Thus its memory overhead does not grow with the number of threads.

  2. It introduces parallelization overhead due to the atomic operations, and this overhead grows with the number of iterations of the loops (so it grows with the problem size of the code).

  3. The performance gain depends on the number of floating-point operations computed in the loop (e.g. multiplications, additions), which in this case is low. Benchmarking is needed in order to measure the actual performance achievable on the target processor.

In our experience, these “sparse reductions” (a.k.a. “histogram reductions” or “irregular reductions”) can be parallelized using OpenMP multithreading directives (see the codes ATMUX and LULESHmk in the Codee course at NERSC). The final performance gain greatly depends on synchronization overhead (e.g. atomic), computational workload (e.g. FLOPs) and memory efficiency (e.g. cache memory). I think further experimentation and benchmarking with OpenMP is needed.

I hope this helps!

This code example is not a good candidate for applying !$OMP, as this set of loops is a small contributor to the overall elapsed time.
The following is a modified trace of the code on my Ryzen 5900X:

  It is now Monday, 29 April 2024 at 11:24:32.526
       K/Y         r      diff_K
                                    Ni  6993 it    0.2250 sec
  300.0000    4.0000 -0.33753719  6993
                                    Ni  3137 it    0.1013 sec
  299.9999    4.0000 -0.33742600  3137
                                    Ni  5880 it    0.1893 sec
  299.7143    4.0114  0.00440964  5880
                                    Ni  4567 it    0.1472 sec
  299.7178    4.0113 -0.00003247  4567
                                    Ni  3059 it    0.0988 sec
  299.7177    4.0113 -0.00000087  3059

Time elapsed:   4.017 s

In an elapsed time of 4.017 seconds, the execution of “get_distribution (iter)” contributes only 0.762 seconds, or 19% of the elapsed time.
Basically there are 23,636 iterations of “do iter = 1, itermax”, which equates to 32 microseconds per entry into the OMP region. The expected initialisation of an OMP region is 15 to 20 microseconds, so there is little potential for OMP gains.

A code block that executes in 32 microseconds is too small to be a good candidate for OpenMP.

@aledinola, your use of !$OMP atomic is unnecessary, as using “do is_p = 1, NS” as the parallel loop provides sufficient thread separation for phi_new.
(In other cases I have attempted to adjust the size of phi_new(:,is_p) to be a multiple of memory pages for better cache separation, but any improvement is difficult to identify.)

The other unnecessary consideration is “large” arrays: the total array storage is only about 80 kbytes, which is much smaller than the L3 or L2 cache. Although the arrays are larger than most L1 caches, this is far from being a “large” memory problem.

In the original loop order, optimising compilers identify the main improvement by effectively converting the loops to:

            do is = 1, NS
              do ia = 0, NA
                 la = ial(ia,is)
                 fl =      varphi(ia,is) *phi(ia,is)
                 fr = (1d0-varphi(ia,is))*phi(ia,is)
                 do is_p = 1, NS
                   phi_new(la,   is_p) = phi_new(la,   is_p) +  pi(is,is_p)* fl   !      varphi(ia,is) *phi(ia,is)
                   phi_new(la+1, is_p) = phi_new(la+1, is_p) +  pi(is,is_p)* fr   ! (1d0-varphi(ia,is))*phi(ia,is)
                 end do
              end do
           end do

I also checked that, for this data set, “iar(ia,is) = ial(ia,is) + 1” in all cases, but exploiting this appears to achieve minimal SIMD advantage.

Not a good candidate for !$OMP !

Very interesting, thanks! Let me add that iar = ial + 1 holds by construction, because iar is the interpolating point on the right, while ial is the point on the left. The logic is that you need to find the location of a' on the grid for a, but a' is not necessarily a grid point, so you find the two points on the a grid that bracket a'. Likewise, varphi is the interpolation weight.
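
For concreteness, on a uniform grid the bracketing and the weight would look roughly as follows (linint_Grow itself works on a growth-spaced grid, so its index search is different, but ial, iar and varphi have the same meaning; the _u names are only for this sketch):

integer :: ial_u, iar_u
real*8  :: varphi_u

ial_u    = max(0, min(int((aplus - a_l)/(a_u - a_l)*dble(NA)), NA-1)) ! left bracketing point on a(0:NA)
iar_u    = ial_u + 1                                                  ! right bracketing point: always ial+1
varphi_u = (a(iar_u) - aplus)/(a(iar_u) - a(ial_u))                   ! weight assigned to the left point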

Btw, I was not suggesting the use of omp atomic… or are you referring to the reduction clause? In my example, without the reduction, you get wrong results.