In the original code (shown below), you could potentially parallelize any of the three loops using OpenMP multithreading, and even collapse them to increase the number of iterations of the parallelized loop (as properly pointed out in this discussion). In any case, the key point is that the reduction array phi_new is both read and written in each loop iteration, and that the array entry being updated is only known at run-time, because the index ial(ia,is) depends on the values of the array ial (the same holds for the array iar). As a consequence, two different iterations may map to the same entry of phi_new, so executing them on different threads creates a potential write conflict.
do ia = 0, NA
  do is = 1, NS
    do is_p = 1, NS
      phi_new(ial(ia,is), is_p) = phi_new(ial(ia,is), is_p) + &
        pi(is,is_p)*varphi(ia,is)*phi(ia,is)
      phi_new(iar(ia,is), is_p) = phi_new(iar(ia,is), is_p) + &
        pi(is,is_p)*(1d0-varphi(ia,is))*phi(ia,is)
    enddo
  enddo
enddo
The consequence is that, to preserve correctness (before any other performance optimization), the OpenMP multithreaded version must also contain atomic directives before the statements that update the array phi_new. The code should look something like this (sorry, I did not have time to test this implementation myself before sending this reply):
!$OMP PARALLEL DO SHARED(ns, na, phi_new, pi, varphi, phi, ial, iar) &
!$OMP& PRIVATE(is_p, ia, is, la, ra)
do is_p = 1, NS
  do is = 1, NS
    do ia = 0, NA
      ! Cache the run-time indices read from ial/iar
      la = ial(ia,is)
      ra = iar(ia,is)
      ! Atomic updates prevent races when two iterations hit the same entry of phi_new
      !$OMP atomic update
      phi_new(la, is_p) = phi_new(la, is_p) + pi(is,is_p)*     varphi(ia,is) *phi(ia,is)
      !$OMP atomic update
      phi_new(ra, is_p) = phi_new(ra, is_p) + pi(is,is_p)*(1d0-varphi(ia,is))*phi(ia,is)
    end do
  end do
end do
!$OMP END PARALLEL DO
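Note that la and ra must be declared as integer variables in the enclosing scope, and the code has to be compiled with OpenMP enabled (e.g. -fopenmp with gfortran or -qopenmp with the Intel compilers).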
Regarding performance, several considerations apply to this OpenMP multithreaded version:
- It uses only one array phi_new, in contrast to the several thread-private copies of the array created by an OpenMP reduction clause (see the sketch after this list). Thus its memory overhead does not grow with the number of threads.
- It introduces parallelization overhead due to the atomic operations, and this overhead grows with the number of iterations of the loops (so it grows with the problem size of the code).
- The performance gain depends on the number of floating-point operations computed in the loop, which in this case is low (e.g. multiplications and additions). Benchmarking is needed to measure the actual performance achievable on the target processor.
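For comparison, below is a sketch of the alternative based on the OpenMP reduction clause, where each thread accumulates into its own zero-initialized private copy of phi_new and the copies are combined at the end of the parallel region, so no atomics are needed. It assumes a compiler that supports array reductions in Fortran (reductions on allocatable arrays require OpenMP 4.5 or later); again, I have not tested it myself:
!$OMP PARALLEL DO SHARED(ns, na, pi, varphi, phi, ial, iar) &
!$OMP& PRIVATE(is_p, ia, is, la, ra) &
!$OMP& REDUCTION(+:phi_new)
do is_p = 1, NS
  do is = 1, NS
    do ia = 0, NA
      la = ial(ia,is)
      ra = iar(ia,is)
      ! No atomics: each thread updates its private copy of phi_new,
      ! and the copies are summed into the shared array at the end
      phi_new(la, is_p) = phi_new(la, is_p) + pi(is,is_p)*     varphi(ia,is) *phi(ia,is)
      phi_new(ra, is_p) = phi_new(ra, is_p) + pi(is,is_p)*(1d0-varphi(ia,is))*phi(ia,is)
    end do
  end do
end do
!$OMP END PARALLEL DO
This version trades the synchronization overhead of the atomics for a memory footprint that grows with the number of threads, which is the trade-off discussed in the list above.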
From our experience, these “sparse reductions” (also known as “histogram reductions” or “irregular reductions”) can be parallelized using OpenMP multithreading directives (see the codes ATMUX and LULESHmk in the Codee course at NERSC). The final performance gain greatly depends on the synchronization overhead (e.g. atomics), the computational workload (e.g. FLOPs) and the memory efficiency (e.g. cache usage). I think further experimentation and benchmarking with OpenMP is needed.
I hope this helps!