# Question on OpenMP reduction

Dear all:

I am wondering if it is possible to efficiently parallelize the variable phi_new if the phi_new array is huge, like 4 dimensions. To add some more content, this code is to calculate the stationary distribution, so we have to follow this loop structure, i.e., ia and then ie and then is_p.

1 Like

Your code snippet appears to have a number of arrays, some you suggest are large ?

phi_new(??, is_p)
pi(is,is_p)
varphi(ia,is)
phi(ia,is)
ial(ia,is)
iar(ia,is)

A significant problem you appear to have is â€śdo is_p = 1, NSâ€ť as an inner loop, which implies significant memory strides for addressing pi and phi_new.
Perhaps transpose phi_new and pi ?

You could use OMP on lines 221 and 221 using collapse(2)

ial and iar could pose potential efficiency problems for cache clashing between threads ?

The problem may remain that you have large arrays but minimal calculation per itteration, which can be a problem for OMP efficiency.

lots of unknowns with your brief description !

1 Like

Assuming ial and iar are arrays and NA and NS are comparable, you could try the following.

!\$OMP PARALLEL DO SHARED (ns,na, phi_new, pi, varphi, phi, ial, iar)  &
!\$OMP&            PRIVATE (is_p,ia,is,la,ra)
do is_p = 1, NS
do is = 1, NS
do ia = 0, NA
la = ial(ia,is)
ra = iar(ia,is)
phi_new(la, is_p) = phi_new(la, is_p) + pi(is,is_p)*     varphi(ia,is) *phi(ia,is)
phi_new(ra, is_p) = phi_new(ra, is_p) + pi(is,is_p)*(1d0-varphi(ia,is))*phi(ia,is)
end do
end do
end do
!\$OMP END PARALLEL DO

They are not. I can see in the code that is linked from the original post that NA=1000 and NS=7

This also says that the total number of iterations for the 3 nested loops is 49000, which is quite a low count, and the inner operations are just 2 FMAs: the total workload is low, and the computations are probably memory bound. All set together, there is little chance to get a important speed-up with OpenMP here.

In addition, nothing says that ial(ia,is) and iar(ia,is) have unique values for different ia,is couples. If the values are not unique then the loops on ia and is can simply not be parallelizedâ€¦ Unless a reduction on phi_new is declared, but this will make the code even more memory bound.

1 Like

@PierU
Given what you say, it would be unlikely to consider collapse(2), but the approach I outlined above may achieve a reasonable speed-up for num_threads <= NS, as using is_p will protect from any thread clashes.

1 Like

I suggest you take a look at the checker PWR052: Consider applying multithreading parallelism to sparse reduction loop related to the parallelization of loops that compute a â€śsparse reductionâ€ť pattern (aka. â€śhistogram reductionsâ€ť). This has been extensively studied in papers about parallelization with OpenMP/OpenACC for multithreading and offloading. Note: the example in the catalog is in C, we are working to improve the documentation for Fortran (see Fortran example here).

The first thing to consider is protecting the sparse reduction array in order to avoid race conditions during the parallel execution and ensure the correctness of the code.

Next you can consider other aspects related to performance, which includes topics like the ones discussed in this thread related to cache efficiency, loop collapsing, etc.

We have conducted several training courses over the year using the Codee tools addressing this important sparse reduction pattern. In the materials of this Codee course at NERSC take a look at the examples ATMUX and LULESHmk which also compute sparse reduction patterns.

I hope this helps as a starting point as efficiently parallelizing sparse reductions is not straigthforward in general!

In the original code example, the use of is_p as the DO loop clearly achieves this outcome.
Has this approach been tested in the complete code ?

@manuel.arenaz, regarding â€śPWR052â€ť I did not see any guidance for how to use â€śnodes(nel)â€ť to improve OpenMP performance, as this approach comes with problems related to cache use efficiency. Has this issue been addressed, as this is a major problem when using OpenMP for these types of problems ?

You can parallelize the two outer loops (over assets a and the shock z today) and declare phi_new as a reduction variable.

Reducing on a large array is often not very efficientâ€¦

1 Like

You can parallelize the code as follows.

! iterate until the distribution function converges
phi = 1d0/((NA+1)*NS) ! initial guess

do iter = 1, itermax

phi_new = 0d0

!\$omp parallel if (par==1) default(shared) private(ia,is,is_p) reduction(+: phi_new)
!\$omp do collapse(2)
do ia = 0, NA
do is = 1, NS
do is_p = 1, NS
phi_new(ial(ia,is), is_p) = phi_new(ial(ia,is), is_p) + &
pi(is,is_p)*varphi(ia,is)*phi(ia,is)
phi_new(iar(ia,is), is_p) = phi_new(iar(ia,is), is_p) + &
pi(is,is_p)*(1d0-varphi(ia,is))*phi(ia,is)
enddo
enddo
enddo
!\$omp enddo
!\$omp end parallel

con_lev = maxval(abs(phi_new(:, :) - phi(:, :))/max(abs(phi(:, :)), 1d-10))

! update distribution
phi = phi_new

write(*,*) "iter = ", iter, "con_lev = ", con_lev

! check for convergence
if(con_lev < sig_in)then
return
endif
enddo !end iteration

It turns out however that parallelization slightly slows down the code (at least on my laptop). As mentioned by @PierU, reducing on a large array is often not efficient. It would be great if someone else could come up with better suggestions.

A few comments on the code:

• It would be better to post a minimum working example. The link that you posted refers to a full implementation of a general equilibrium, heterogeneous agent model (the Aiyagari model) but your question only refers to the computation of the stationary distribution.
• There are too many global variables. The subroutine get_distribution does not have inputs or outputs, so itâ€™s not clear what arrays is going to modify, etc.
• Before starting the fixed point iteration, you have to set an initial guess for the array phi. Since it is a distribution, all elements must be non-negative and add up to one.
• (I admit this is a matter of personal taste). Why using zero-based array indexing?
• Note that I added a flag, par, to control whether to run the code in parallel or not. You can define par as either a command line argument or inside the code as
integer, parameter :: par = 1 ! 1: parallel, 0: serial

Hope this helps!

Dear @aledinola

Yes, that is also what I realized. It seems like reduction on a large array like the distribution seems to be a difficult task for openmp to parallelize.

I think what I am wondering is applicable to any heterogeneous agent model, and thus I thought it is sufficient to post the distribution code that I found in the Econ textbook as a good example to illustrate. Therefore, I donâ€™t know why they chose zero-based array.

Thank you for the par example! I donâ€™t know that openmp is able to do that. That would make the testing become much easier!

1 Like

Thanks! I will take a look of the materials that you attached!

1 Like

@aledinola
As you have the code working, could you please try my suggestion of modifying the loop order and use

!\$OMP PARALLEL DO SHARED (ns,na, phi_new, pi, varphi, phi, ial, iar)  &
!\$OMP&            PRIVATE (is_p,ia,is,la,ra)
do is_p = 1, NS

I have not seen the definition of ial and iar arrays (or functions?) but I expect this revised approach has a better chance of improving performance, while still getting the same calculated result.
If the arrays are â€ślargeâ€ť, the loop order you tested would struggle with efficient cache usage.

As you are struggling to improve performance, your use of collapse(2) looks to not be helpful.

1 Like

A reduction implies a private copy of the variable for each thread, and at the end of the parallel region an update of the original variable, and this update cannot fully use all the threads. All of this has a cost.

1 Like

@aledinola

I did the following tests using Gfortran 12.3.0 on Win 10.

A single thread original code elapsed time is 4.082 s using -O3, but 5.312 s using -O2
With some optimising tweeks ( mainly noting iaR(ia,is) = iaL(ia,is)+1 ), elapsed time is 3.981 s (minimal!)
Re-ordering the loops for using !\$OMP, elapsed time is 4.490 s
So the multi-thread is not effective.

There are a couple of contributers to this outcome:

1. The !\$OMP region is entered too often as !\$OMP is inside the DO loop â€śdo iter = 1, itermaxâ€ť , which is executed 23,636 times. Allowing for 20 micro seconds per OMP region entry, this gives 0.47 seconds OMP startup overhead and little available OMP efficiency, as there is so little calculation per OMP loop cycle.

2. it is interesting that with -O3 and changing the loop order for OMP, this removed the efficiency of not calculating â€śvarphi(ia,is) *phi(ia,is)â€ť and â€ś(1d0-varphi(ia,is))*phi(ia,is)â€ť inside the â€śdo is_p = 1, NSâ€ť loop, which Gfortran optimising identifies.

Basically -O3 is more effective that -fopenmp for this calculation.

The elapsed times for this calculation are also relatively small to require OMP support.

1 Like

@JohnCampbell Thanks for testing the openMP code! I agree, there is too much overhead and openMP is not worth it. I think the best way is to rewrite a bit the code like I did here (see comments in the code):

! calculates the invariant distribution of households
subroutine get_distribution2()

use toolbox
use omp_lib

implicit none
integer :: ia, is, iter, is_p
real*8 :: aplus, con_lev
integer :: ial(0:NA, NS), iar(0:NA, NS)
real*8 :: varphi(0:NA, NS)

! get the interpolation shares and points
do ia = 0, NA
do is = 1, NS

! calculate where this guy would go
aplus = (1d0+r)*a(ia) + w*eta(is) - c(ia,is)
aplus = min(max(aplus, a_l), a_u)

! determine the gridpoints in between this decision lies
call linint_Grow(aplus, a_l, a_u, a_grow, NA, &
ial(ia, is), iar(ia, is), varphi(ia, is))
enddo
enddo

! iterate until the distribution function converges
phi =  1d0/((NA+1)*NS) ! set initial guess
write(*,*) "sum(phi) = ", sum(phi)

do iter = 1, itermax

phi_new = 0d0

! First, calculate phi_new(a',s) using the law of motion for the
! endogenous state variable a.
do ia = 0, NA
do is = 1, NS
phi_new(ial(ia,is),is)=phi_new(ial(ia,is),is)+varphi(ia,is)*phi(ia,is)
phi_new(iar(ia,is),is)=phi_new(iar(ia,is),is)+(1d0-varphi(ia,is))*phi(ia,is)
enddo
enddo
! Second, calculate phi_new(a',s') using the transition matrix pi(s,s')
! phi_new(a',s') = phi_new(a',s)*pi(s,s'), where * is a matrix product
phi_new = matmul(phi_new, pi)

con_lev = maxval(abs(phi_new(:, :) - phi(:, :))/max(abs(phi(:, :)), 1d-10))

! update distribution
phi = phi_new

write(*,*) "iter = ", iter, "con_lev = ", con_lev

! check for convergence
if(con_lev < sig_in)then
return
endif
enddo !end iteration

write(*,*)'Distribution Function Iteration did not converge'

end subroutine get_distribution2

The trick is the following:

• first the update of the endogenous state a, so that you loop over a and s only (but not over s'). So in this step you have two nested loops instead of three.
• In the second step, you update s, which you can do with a matrix multiplication
• I guess this can be further sped up replacing matmul with a call to DGEMM

Also, note that to measure the timing it would be better to time the subroutine get_distribution only, because most of the time is spent doing the value function iteration and the general equilibrium loop over the interest rate. I created a repo on Github which contains a simplified version of OPâ€™s code.

In the code on Github, I have two different versions of the distribution:
(1) get_distribution contains the original code, with the OpenMP reduction
(2) get_distribution2 contains my version where I split the update in two parts.

Hi @aledinola :

Thank you so much for your reply! I appreciate your innovation in replacing the multiplication of pi(is, is_p) with matrix multiplication. It does seem like the large array reduction on phi_new is unavoidable, and thus openmp doesnâ€™t worth it.

Thanks a lot!

1 Like

In the original code (shown below), you could potentially parallelize any of the 3 loops using OpenMP multithreading, and even collapse them in order to increase the number of iterations of the parallelized loop (as properly pointed out in this discussion). In any case, the key point is that the reduction array phi_new is read and written in each loop iteration, and that the array entry updated is only known at run-time, when the value of the array index ial(ia,is) is known as it depends on the values of array ial (same stands for array iar).

do ia = 0, NA
do is = 1, NS
do is_p = 1, NS
phi_new(ial(ia,is), is_p) = phi_new(ial(ia,is), is_p) + &
pi(is,is_p)*varphi(ia,is)*phi(ia,is)
phi_new(iar(ia,is), is_p) = phi_new(iar(ia,is), is_p) + &
pi(is,is_p)*(1d0-varphi(ia,is))*phi(ia,is)
enddo
enddo
enddo

The consequence is that, in order to preserve correctness (before any other performance optimization), the OpenMP multithreading version must also contain atomic directives before the statements updating the array phi_new. The code should be something like this (sorry, I did not have time to test this implementation myself before sending this reply):

!\$OMP PARALLEL DO SHARED (ns,na, phi_new, pi, varphi, phi, ial, iar)  &
!\$OMP&            PRIVATE (is_p,ia,is,la,ra)
do is_p = 1, NS
do is = 1, NS
do ia = 0, NA
la = ial(ia,is)
ra = iar(ia,is)
!\$OMP atomic update
phi_new(la, is_p) = phi_new(la, is_p) + pi(is,is_p)*     varphi(ia,is) *phi(ia,is)
!\$OMP atomic update
phi_new(ra, is_p) = phi_new(ra, is_p) + pi(is,is_p)*(1d0-varphi(ia,is))*phi(ia,is)
end do
end do
end do
!\$OMP END PARALLEL DO

1. It uses only one array phi_new, in contrast to several thread-private copies of the array in the OpenMP reduction clause. Thus its memory overhead does not grow with the number of threads.

2. It introduces parallelization overhead due to the atomic operations, and it grows with the number of iterations of the loops (so it grows with the problem size of the code).

3. The performance gain depends on the number of floating point operations computed in the the loop, which in this case is low (e.g. multiplications, additions). Benchmarking is needed in order to measure the actual performance achivable on the target processor.

From our experience, these â€śsparse reductionsâ€ť (aka â€śhistogram reductionsâ€ť, â€śirregular reductionsâ€ť) can be parallelized using OpenMP multithreading directives (see the codes ATMUX and LULESHmk in the Codee course at NERSC). The final performance gain obtained greatly depends on syncronization overhead (e.g. atomic), computational workload (e.g. FLOPs) and memory efficiency (e.g. cache memory). I think further experimentation and benchmarking is needed using OpenMP.

I hope this helps!

1 Like

This code example is not a good candidate for applying !\$OMP, as this set of loops is a small contributer to the overall elapsed time.
The following is a modified trace of the code on my Ryzen 5900x

It is now Monday, 29 April 2024 at 11:24:32.526
K/Y         r      diff_K
Ni  6993 it    0.2250 sec
300.0000    4.0000 -0.33753719  6993
Ni  3137 it    0.1013 sec
299.9999    4.0000 -0.33742600  3137
Ni  5880 it    0.1893 sec
299.7143    4.0114  0.00440964  5880
Ni  4567 it    0.1472 sec
299.7178    4.0113 -0.00003247  4567
Ni  3059 it    0.0988 sec
299.7177    4.0113 -0.00000087  3059

Time elapsed:   4.017 s

In an elapsed time of 4.017 seconds, the execution of â€śget_distribution (iter)â€ť contributes only 0.762 seconds or 19% of the elapsed time
Basically there are 23,636 itterations of â€śdo iter = 1, itermaxâ€ť, which equates to 32 micro-seconds per entering the OMP region. The expected initialisation of the OMP region is 15 to 20 micro-seconds, so there is little potential for OMP gains.

A code block that executes in 32 microsec is too small to be a candidate for OpenMP.

@aledinola, your use of !\$OMP atomic is unnecessary as using â€śdo is_p = 1, NSâ€ť as a thread seperator provides sufficient thread seperation for phi_new.
( In other cases I have attempted to adjust the size of phi_new(:,is_p) to be a multiple of memory pages for better cache seperation, but any improvement is difficult to identify)

The other unnecessary consideration is â€ślargeâ€ť arrays as the total array storage is only 80 kbytes, which is much smaller that L3 or L2 cache. Although they are larger than most L1 cache, this is far from being a â€ślargeâ€ť memory problem.

In the original loop order form, oprimising compilers identify the main improvement by effectively converting to:

do is = 1, NS
do ia = 0, NA
la = ial(ia,is)
fl =      varphi(ia,is) *phi(ia,is)
fr = (1d0-varphi(ia,is))*phi(ia,is)
do is_p = 1, NS
phi_new(la,   is_p) = phi_new(la,   is_p) +  pi(is,is_p)* fl   !      varphi(ia,is) *phi(ia,is)
phi_new(la+1, is_p) = phi_new(la+1, is_p) +  pi(is,is_p)* fr   ! (1d0-varphi(ia,is))*phi(ia,is)
end do
end do
end do

I also checked that for this data set, in all cases â€śiar(ia,is) = ial(ia,is) + 1â€ť, but this appears to achieve minimal simd advantage.

Not a good candidate for !\$OMP !

1 Like

Very interesting, thanks! Let me add that iar=ial+1 is by construction, because iar is the interpolating point on the right, while ial is the point on the left. The logic is that you need to find the location of aâ€™ on the grid for a, but aâ€™ is not necessarily a grid point. So you find the two points on the a grid that bracket aâ€™. Likewise varphi is the interpolation weight.

Btw, I was not suggesting to use omp atomicâ€¦ or are you referring to the reduction clause? In my example, without reduction, you get wrong results