I have been trying to parallelize some code with OpenMP, and am now comparing different approaches to get better performance. The threaded code needs to update an array of size N (say, arr(1:N)). For example, the serial code may look like this:
```fortran
arr(:) = 0
do i = 1, N
   arr( i ) = arr( i ) + (... some calculation ...)
enddo
```
(Edit: In the actual code there is a dependency between the left- and right-hand sides, which seems to require the use of temporary arrays unless redundant computations are performed. A more practical example may look like this.)
```fortran
arr(:) = 0
do iterm = 1, Nterms
   i1 = some_list( 1, iterm )   !! the contents updated periodically during the run
   i2 = some_list( 2, iterm )
   ...
   arr( i1 ) = arr( i1 ) + (... some calculation using arr etc ...)
   arr( i2 ) = arr( i2 ) + (... another calculation using arr etc ...)
   ...
enddo
```
To parallelize this code, I could include arr in the reduction clause, like so:
```fortran
!! Code(1): Use the reduction clause.
arr(:) = 0
!$omp parallel do default(none) &
!$omp private(i,...) shared(N,...) reduction(+: arr)
do i = 1, N
   arr( i ) = arr( i ) + (... some calculation ...)
enddo
!$omp end parallel do
```
My current understanding is that a local temporary copy of arr is created for each thread, and at the end of the loop their contents are summed (automatically) into the original arr.
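To illustrate my understanding, I believe the reduction is conceptually equivalent to something like the following hand-rolled version (just a sketch of my mental model, not what the runtime necessarily does; arr_local is a hypothetical per-thread array):

```fortran
!! Conceptual equivalent of reduction(+: arr) -- not the actual implementation.
arr(:) = 0
!$omp parallel default(none) private(i, arr_local) shared(N, arr)
arr_local(:) = 0              !! per-thread private copy, initialized to the identity of (+)
!$omp do
do i = 1, N
   arr_local( i ) = arr_local( i ) + (... some calculation ...)
enddo
!$omp end do
!$omp critical
arr(:) = arr(:) + arr_local(:)   !! partial results combined into the original array
!$omp end critical
!$omp end parallel
```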
On the other hand, I could also prepare a work array (say, arr_para(:,:)) beforehand and use it as a shared variable, for example:
```fortran
!! Code(2):
!! Allocate arr_para(N,nthreads) beforehand,
!! e.g. as a locally saved array or a type component.
ithre = 1                 !! fallback index for serial (non-OpenMP) builds
arr_para(:,:) = 0
!$omp parallel do default(none) &
!$omp firstprivate(ithre) private(i,...) shared(N, arr_para, ...)
do i = 1, N
   !$ ithre = omp_get_thread_num() + 1
   arr_para( i, ithre ) = arr_para( i, ithre ) + (... some calculation ...)
enddo
!$omp end parallel do

!! Reduce arr_para later in serial.
do ithre = 1, nthreads
   arr(:) = arr(:) + arr_para( :, ithre )
enddo
```

(Note: ithre needs to be firstprivate here; leaving it shared would be a data race among threads.)
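For completeness, the same Code (2) pattern can also be written with the thread id obtained once per thread rather than once per iteration (a sketch only; I have not timed this variant):

```fortran
!! Variant of Code(2): fetch the thread id once per thread.
arr_para(:,:) = 0
!$omp parallel default(none) private(i, ithre) shared(N, arr_para, ...)
ithre = 1                             !! fallback for serial (non-OpenMP) builds
!$ ithre = omp_get_thread_num() + 1   !! evaluated once per thread, not per iteration
!$omp do
do i = 1, N
   arr_para( i, ithre ) = arr_para( i, ithre ) + (... some calculation ...)
enddo
!$omp end do
!$omp end parallel
```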
I initially thought that Code (2) would be faster for not-too-small N, because Code (1) needs to create the temporary arrays every time the parallel region is entered. But when I compared Code (1) and Code (2) on a test system with N ~ 1500, the speeds were essentially the same… Because Code (1) is simpler, I would like to use it if possible, but my concern is that it may become slower with increasing N and nthreads (say >= 32 threads). I guess the comparison may also depend on compilers/options and machines, and on the amount of calculation relative to the cost of creating the temporary arrays(?).

Another concern is that the temporary arrays created by OpenMP in Code (1) might cause some trouble with stack memory for large N (not sure at all…). In my case, N is at most ~100000, so only megabytes per array. The cost of "(… some calculation …)" in the above code varies depending on the loop to be parallelized (and also on the system size).
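On the stack-memory point: if an implementation does place the per-thread reduction copies on the thread stacks, the worker threads' stack size can be enlarged via the standard OMP_STACKSIZE environment variable (a hedged aside; where the copies actually live is implementation dependent, and my_program below is just a placeholder name):

```shell
# Enlarge each worker thread's stack (OpenMP-standard environment variable).
export OMP_STACKSIZE=512M
# The initial thread's stack is governed by the shell limit instead:
ulimit -s unlimited
./my_program   # placeholder for the actual executable
```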
So I would really appreciate any comments or suggestions on the possible pros/cons of the above two approaches, so that I can decide which pattern to use in my codes (for now, anyway…).

As always, thanks very much!