OpenMP question: private vs shared work arrays for reduction

Hello,

I have been trying to parallelize some code using OpenMP, and I am now comparing different versions to get better performance. The threaded code needs to update an array of size N (say, arr(1:N)). For example, the serial code may look like this:

arr(:) = 0
do i = 1, N
   arr( i ) = arr( i ) + (... some calculation ...)
enddo

(Edit: In the actual code, there is some dependency between the left- and right-hand sides, which seems to require the use of temporary arrays (unless redundant computations are performed). A more practical example may look like this:

arr(:) = 0
do iterm = 1, Nterms
   i1 = some_list( 1, iterm )  !! the contents updated periodically during the run
   i2 = some_list( 2, iterm )
   ...
   arr( i1 ) = arr( i1 ) + (... some calculation using arr etc...)
   arr( i2 ) = arr( i2 ) + (... another calculation using arr etc...)
   ...
enddo

To parallelize this code, I could include arr in the reduction clause, like this:

!! Code(1) : Use the reduction clause.

arr(:) = 0
!$omp parallel do default(none) &
!$omp private(i,...) shared(N,...) reduction(+: arr)
do i = 1, N
   arr( i ) = arr( i ) + (... some calculation ...)
enddo
!$omp end parallel do

My current understanding is that a private copy of arr is created for each thread, and the per-thread copies are summed at the end of the loop (automatically added to the original arr).

On the other hand, I could also prepare a work array (say arr_para(:,:)) beforehand and use it as a shared variable, for example:

!! Code(2):
!! Allocate arr_para(N,nthreads) beforehand,
!! e.g. as a locally saved array or a type component.

ithre = 1   !! default for serial (non-OpenMP) builds; threads overwrite it below
arr_para(:,:) = 0

!$omp parallel do default(none) &
!$omp private(i,ithre,...) shared(N, arr_para, ...)
do i = 1, N
   !$ ithre = omp_get_thread_num() + 1
   arr_para( i, ithre ) = arr_para( i, ithre ) + (... some calculation ...)
enddo
!$omp end parallel do

!! Reduce arr_para later in serial.
do ithre = 1, nthreads
  arr(:) = arr(:) + arr_para( :, ithre )
enddo

I initially thought that Code (2) would be faster for not-too-small N, because Code (1) needs to create temporary arrays every time the parallel region is entered. But when I compare Code (1) and (2) for a test system with N ~1500, the speed is essentially the same… Because Code (1) is simpler, I would like to use it if possible, but my concern is that it may become slower with increasing N and nthreads (say >= 32 threads). I guess the comparison may also depend on compilers/options, machines, and the amount of calculation relative to the cost of creating temporary arrays(?). Another concern is that the temporary arrays created by OpenMP (in Code (1)) might cause trouble with stack memory for large N (not sure at all…). In my case, N is at most ~100000 (so only megabytes per array). The cost of “(… some calculation …)” in the above code varies depending on the loop to be parallelized (and also on the system size).

So I would really appreciate any comments or suggestions on the possible pros/cons of the above two approaches, so that I can decide which pattern to use in my codes (at least for the moment…).

As always, thanks very much!

Your concern about stack overflows in the first case is a valid one: if you scale up your system size, you will run into issues with both the system stack and the OpenMP stack. The user of your application has to set ulimit -s unlimited and provide a reasonably large value in the OMP_STACKSIZE environment variable (e.g. OMP_STACKSIZE=512M) to avoid stack overflows due to the temporary arrays created.

I have used both strategies in applications already, and have transitioned most of my OpenMP usage from (2) to (1). The code (2) strategy has repeatedly led to hard-to-debug OpenMP bugs for me, which is why I dropped it, but I’m certainly not happy with the stack overflow issues caused by the code (1) strategy and the support requests they generate on a regular basis.

I’m still searching for a third, hopefully better, answer to this problem.
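
One candidate is to do the reduction by hand with a private allocatable temporary, which lives on the heap rather than the stack; a minimal sketch (assuming tmp is declared as, e.g., real, allocatable :: tmp(:)):

arr(:) = 0
!$omp parallel default(none) shared(N, arr) private(i, tmp)
allocate( tmp(N) )   !! private allocatable: each thread gets its own heap buffer
tmp(:) = 0
!$omp do
do i = 1, N
   tmp( i ) = tmp( i ) + (... some calculation ...)
enddo
!$omp end do
!$omp critical
arr(:) = arr(:) + tmp(:)   !! merge the partial sums one thread at a time
!$omp end critical
deallocate( tmp )
!$omp end parallel

The critical section serializes only the final merge, so for loops with substantial work per iteration the overhead should be small, but it trades the convenience of the reduction clause for manual bookkeeping.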


Isn’t the reduction clause meant for scalar shared variables that would otherwise be overwritten by multiple threads at the same time? Like sum in

do i=1,N
  sum = sum+a(i)
enddo
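
which in parallel form becomes

sum = 0
!$omp parallel do reduction(+: sum)
do i=1,N
  sum = sum+a(i)
enddo
!$omp end parallel do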

I do not see the reason for making arr reduced in the original code, where each thread deals with its own set of arr(i) values.


As suggested by msz59, you don’t need the reduction clause.
It depends on the expression in “(… some calculation …)”: if no arr elements appear in “some calculation”, the calculation of arr(i) is in fact independent of the other arr elements. Therefore, you can use a simple shared arr(N) array rather than arr_para(N,nthreads). Furthermore, you don’t need to protect the arr array (with atomic or critical), since the indexes i are different for each thread, so each thread always accesses a different arr element.

arr(:) = 0
!$omp parallel do default(none) &
!$omp private(i,...) shared(N, arr, ...)
do i = 1, N
   arr( i ) = arr( i ) + (... some calculation ...)
enddo
!$omp end parallel do

I’m using such a structure in some of my codes and the OpenMP parallelisation is very efficient.


I was also thrown off by the reduction. Judging from the serial code, I don’t see the need for it.

Note that instead of loops, there is also the workshare construct:

The workshare construct divides the execution of the enclosed structured block into separate units of work, and causes the threads of the team to share the work such that each unit is executed only once by one thread, in the context of its implicit task.

For array expressions within each statement, including transformational array intrinsic functions that compute scalar values from arrays:

  • Evaluation of each element of the array expression, including any references to ELEMENTAL functions, is a unit of work.
  • Evaluation of transformational array intrinsic functions may be freely subdivided into any number of units of work.
  • For an array assignment statement, the assignment of each element is a unit of work.

For your example it would look as follows:

!$omp parallel workshare
arr(:) = arr(:) + (... some array expression with independent elements ...)
!$omp end parallel workshare

Yes, the OpenMP reduction clause does not apply in that case.
I used it in a ray-tracing algorithm: each thread computed the trajectories of many rays and accumulated some scalar values, such as the number of simulated rays or the optical power reflected by all rays in that thread. When the parallel loop ends, the reduction automatically sums the scalar values from each thread, in order to obtain the total values.
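
Schematically it looked like this (variable names made up for illustration):

nrays = 0
power = 0.0
!$omp parallel do reduction(+: nrays, power) private(iray)
do iray = 1, Ntotal
   !! ... trace one ray ...
   nrays = nrays + 1
   power = power + (... reflected power of this ray ...)
enddo
!$omp end parallel do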

Thanks, I have to try that as well.

Hi, thanks very much for the various comments! And I am sorry that my first post was not clear enough… (I agree that simply using shared should be sufficient for the first code above.) In my actual codes, I am trying to parallelize loops where a data race occurs if all threads write to arr simultaneously (*1). When I posted the question, I tried to make it as simple/concise as possible and so cut out all unnecessary details, but I now notice that I also cut out the essential part… (so I will edit the original post later.)

(*1) One typical case is the calculation of an energy gradient vector, where the energy terms/contributions are divided among threads and each thread writes its evaluated contribution to (some elements of) arr. Because those elements can overlap among different threads, I am using temporary arrays at the moment. One article that I read recently is the following (by the original author of LAMMPS); for example, the “atom (or force) decomposition algorithms” (A2 or F2 in the paper) are a typical case I tried to parallelize. (In the paper, the author uses MPI for distributed computing.)

(this site seems to provide a free draft)
– Fast parallel algorithms for short-range molecular dynamics (Technical Report) | OSTI.GOV
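
For concreteness, the colliding-update pattern could also be protected with atomic instead of temporary arrays; a sketch:

arr(:) = 0
!$omp parallel do default(none) private(iterm, i1, i2) shared(Nterms, arr, some_list)
do iterm = 1, Nterms
   i1 = some_list( 1, iterm )
   i2 = some_list( 2, iterm )
   !$omp atomic update
   arr( i1 ) = arr( i1 ) + (... some calculation ...)   !! safe even if i1 collides across threads
   !$omp atomic update
   arr( i2 ) = arr( i2 ) + (... another calculation ...)
enddo
!$omp end parallel do

But if “(… some calculation …)” also reads elements of arr that other threads may be updating at the same time, the atomic update alone is not enough, which is one reason I am using temporary arrays for now. Frequent collisions can also make atomics slow.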

Thanks very much for the info; I did not even know about OMP_STACKSIZE… There seem to be various books and tutorial slides, so I am now gathering more of them and playing with different patterns.

One thing I am still not sure about is whether OpenMP automatically takes care of the size of arr when making private copies. My current assumption is that it always uses stack memory for the private copies, but I am not very sure in practice… (it may depend on compilers etc.).