Best practice for using OpenMP with sections containing do loops

I have a subroutine where I do (very similar) calculations along 2 axes. These 2 parts are independent of each other and contain nested loops, which can also be evaluated concurrently.
Is there a best practice or heuristic to choose how to approach this kind of problem with OpenMP? When should one use sections (which are incompatible with nested parallel do)? When should one prefer parallel loops?
In my case the times are quite similar…

! method with **sections**
fedges = 0
!$omp parallel
!$omp sections
!$omp section
do j = 1, nc(2)
   call myweno(1)%reconstruct(varray(:, j), vl1, vr1)
   do i = 1, nc(1) - 1
      fedges(i, j, 1) = godunov(flux1, vr1(i), vl1(i + 1), &
                                 [gx(1)%right(i), gx(2)%center(j)], t)
   end do
end do
!$omp section
do i = 1, nc(1)
   call myweno(2)%reconstruct(varray(i, :), vl2, vr2)
   do j = 1, nc(2) - 1
      fedges(i, j, 2) = godunov(flux2, vr2(j), vl2(j + 1), &
                                 [gx(1)%center(i), gx(2)%right(j)], t)
   end do
end do
!$omp end sections
!$omp end parallel

! method with **parallel do**
fedges = 0
!$omp parallel
!$omp do
do j = 1, nc(2)
   call myweno(1)%reconstruct(varray(:, j), vl1, vr1)
   do i = 1, nc(1) - 1
      fedges(i, j, 1) = godunov(flux1, vr1(i), vl1(i + 1), &
                                [gx(1)%right(i), gx(2)%center(j)], t)
   end do
end do
!$omp end do nowait
!$omp do
do i = 1, nc(1)
   call myweno(2)%reconstruct(varray(i, :), vl2, vr2)
   do j = 1, nc(2) - 1
      fedges(i, j, 2) = godunov(flux2, vr2(j), vl2(j + 1), &
                                [gx(1)%center(i), gx(2)%right(j)], t)
   end do
end do
!$omp end do
!$omp end parallel

It’s difficult to give a good answer without knowing more about your problem and the hardware you plan to run this on. From the looks of it, this is a finite-volume method on a 2D mesh, where you would typically use work-sharing constructs to map the domain to different threads along one or two spatial axes.

From the definition of !$omp section,

Summary The sections construct is a non-iterative worksharing construct that contains a set of structured blocks that are to be distributed among and executed by the threads in a team. Each structured block is executed once by one of the threads [emphasis added] in the team in the context of its implicit task.

I would say that sections are sub-optimal for this use case, as you are limited to two threads, although you probably have several more that could contribute to the update.

If your times are quite similar, it could mean the problem size is too small or that the parallelization didn’t work as expected. Using the compiler flags for optimization reports, e.g. -qopt-report with Intel compilers, can be useful to check whether parallelization occurs and, if not, what is obstructing it.

I concur with @ivanpribec that in such a case using a parallel loop looks more natural.

Then the usual questions:

  • what are the values of nc(:)?
  • is godunov() a function? If yes, does it perform a lot of computations?
  • have you checked that your program really runs on several threads?
  • how do you time the execution? Have you timed the serial version?

And one observation: the memory access pattern fedges(i, j, 2) in the second loop is (potentially) bad, because j is the inner loop index. It’s bad even for the serial version, but it’s even worse for the “parallel do” version, as it can result in “false sharing” (several threads writing to the same cache line, forcing a lot of cache synchronization between the cores).
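
To make the point concrete, here is a toy sketch (not the original code, and the array name is arbitrary) of the access pattern that plays nicely with Fortran’s column-major storage: the leftmost index varies in the innermost loop, so each thread writes contiguous memory and mostly stays on its own cache lines.

program layout_demo
   implicit none
   integer, parameter :: n = 250, m = 250
   real :: a(n, m)
   integer :: i, j

   !$omp parallel do
   do j = 1, m                ! columns distributed over the threads
      do i = 1, n             ! stride-1 writes: a(i,j), a(i+1,j), ...
         a(i, j) = real(i + j)
      end do
   end do
   !$omp end parallel do

   print *, sum(a)            ! keep the compiler from optimizing the loop away
end program layout_demo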

My problem is exactly as @ivanpribec described it. I work on a PC (and that is the intended target device) and currently with a 250x250 mesh, i.e. nc=[250, 250]. The “heavier” part is the WENO reconstruction; the evaluation of the numerical flux is less of a problem.

In serial mode, the total compute time is roughly 30 s with 1000 time steps, which boils down to 6 ms per du/dt evaluation and, thus, 3 ms per axis or ~0.01 ms per outer-loop iteration. Interestingly, although the index order of varray and fedges is not optimal for the second axis, there is no observable time penalty. (One can comment out one of the axes and see that the compute times are similar.)

When the max number of threads is 2, the compute time is reduced to about half, as one could expect for such an embarrassingly parallel problem. There is no difference between sections and parallel do.
As the max number of threads is increased to 4 or 6 (the max number of CPU cores in my case), there is no time reduction using sections (obvious, because there can be only one thread per section), and only a marginal time reduction using parallel do. I suppose the latter is because each outer loop (3 ms) does not have enough work to pay off the split over several cores.

Originally, I thought sections could contain nested parallel dos… something like what is depicted in Figure 1.2 of F95_OpenMPv1_v2.dvi. But that seems not to be the case.

Just out of curiosity, is there some OpenMP construct that would allow me to split N cores in two teams of N/2 cores (one for each axis) and then split the outer-loop over these N/2 cores?

It depends on the workload in godunov(), that’s also why I was asking about that.

250x250 is not a lot, and in the case where godunov performs only a few operations, two things can happen:

  • the OpenMP overheads hide any gain past 2 threads
  • the code quickly gets memory bound, and more threads can’t help in this case

This is indeed possible, but you have to enable nested parallelism and start a fully new parallel region for the do loop. In my experience, nested parallelism is often not efficient.
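
For illustration, a rough sketch of what that could look like on the loops above (untested; nhalf is a placeholder for the desired inner team size, the loop bodies are elided, and omp_lib must be used for the runtime calls):

! Sketch: outer team of 2 (one member per axis), each member opening its own
! inner team of nhalf threads. Nesting must be allowed for the inner regions
! to actually get more than one thread.
nhalf = max(omp_get_max_threads() / 2, 1)
call omp_set_max_active_levels(2)

!$omp parallel sections num_threads(2)
!$omp section
!$omp parallel do num_threads(nhalf) private(vl1, vr1)
do j = 1, nc(2)
   ! ... axis-1 reconstruction and fluxes, as in the code above ...
end do
!$omp end parallel do
!$omp section
!$omp parallel do num_threads(nhalf) private(vl2, vr2)
do i = 1, nc(1)
   ! ... axis-2 reconstruction and fluxes ...
end do
!$omp end parallel do
!$omp end parallel sections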

See above. Using tasks could also be a solution.
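
A hedged sketch of the task-based alternative on the same two loops (untested, bodies elided): one thread generates the tasks, and any thread in the team can execute tasks coming from either axis.

! Sketch: taskloop (OpenMP >= 4.5). "nogroup" on the first loop lets tasks
! from both axes be in flight at the same time; the implicit barrier at
! "end single" waits for all of them.
!$omp parallel
!$omp single
!$omp taskloop private(vl1, vr1) nogroup
do j = 1, nc(2)
   ! ... axis-1 reconstruction and fluxes ...
end do
!$omp taskloop private(vl2, vr2)
do i = 1, nc(1)
   ! ... axis-2 reconstruction and fluxes ...
end do
!$omp end single
!$omp end parallel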

But IMO you are overcomplicating things. If your code doesn’t scale well with simple parallel do’s, there’s almost no chance that it scales better with more complex OpenMP structures. In the end you will have to create N threads anyway, with their overheads, and if the code is memory bound with N threads, it will be memory bound as well with 2*(N/2) threads.

Looking at your method with parallel do,

  1. You do not indicate which variables are shared or private. This could cause problems, depending on what happens in myweno(:)%reconstruct. It looks probable that vl1, vr1, vl2 & vr2 should be private (see the sketch below).

  2. Also, why "nowait"?

  3. Would fedgesT(j, i, 2) help?

For timing, I hope you are considering elapsed time.
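
As a safety net for point 1, one option is default(none), which forces every variable to get an explicit data-sharing attribute, so the compiler flags anything that was forgotten. A sketch (the shared list is guessed from the snippet; any other variables used inside the loops, e.g. flux1/flux2 if they are variables rather than procedures, would have to be added):

! Sketch: explicit data-sharing attributes on the parallel region above.
!$omp parallel default(none) &
!$omp&   shared(nc, varray, fedges, gx, myweno, t) &
!$omp&   private(i, j, vl1, vr1, vl2, vr2)
! ... the two worksharing do loops exactly as above ...
!$omp end parallel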

Good point. I had assumed that the // code was correct and that there was only a performance issue, but indeed, if these variables are outputs of reconstruct, then there is a race condition.

Because the two loop blocks are independent of each other. So once some threads have finished with the first block, they can start with the second block without having to wait for the others.

That said, to really take advantage of this, a scheduling different from the default static one is probably needed.
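
A sketch of what that could look like on the two loops above (bodies elided; the chunk size of 8 is an arbitrary placeholder):

! Sketch: with nowait plus a dynamic schedule, threads that finish the first
! loop early immediately start grabbing iterations of the second one.
!$omp do schedule(dynamic, 8) private(vl1, vr1)
do j = 1, nc(2)
   ! ... axis-1 work ...
end do
!$omp end do nowait
!$omp do schedule(dynamic, 8) private(vl2, vr2)
do i = 1, nc(1)
   ! ... axis-2 work ...
end do
!$omp end do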

Thank you all. Based on the advice given here, I ended up implementing it like so:

 ...
! Fluxes along x1 and x2 at interior cell boundaries
!$omp parallel
!$omp do private(vl1, vr1)
do j = 1, nc(2)
   call myweno(1)%reconstruct(v(1+(j-1)*nc(1):j*nc(1)), vl1, vr1)
   do i = 1, nc(1) - 1
      fedges1(i, j) = godunov(flux1, vr1(i), vl1(i + 1), &
                              [gx(1)%right(i), gx(2)%center(j)], t)
   end do
end do
!$omp end do nowait
!$omp do private(vl2, vr2)
do i = 1, nc(1)
   call myweno(2)%reconstruct(v(i:i+(nc(2)-1)*nc(1):nc(1)), vl2, vr2)
   do j = 1, nc(2) - 1
      fedges2(j, i) = godunov(flux2, vr2(j), vl2(j + 1), &
                              [gx(1)%center(i), gx(2)%right(j)], t)
   end do
end do
!$omp end do
!$omp end parallel
...

As a conclusion: before starting to optimize the // performance, always check the correctness of the // version against the serial version :slight_smile:

It would be interesting for us to know whether, after the private fix, you get better performance with >2 threads. Not only does a race condition give wrong results, it can also hurt performance.

I did check the code with the sections construct, which was the one I (incorrectly) believed to be the best approach. With parallel do, I was (regrettably) only playing with the number of threads without really paying attention to the results. This may sound upside down :upside_down_face:, but solving the numerical problem was actually not the goal; it was rather just an excuse to take some first steps with OpenMP.

And I confess I had to learn more than I was expecting… For instance, it took me a while to realize that cpu_time can’t be used to measure elapsed time with OpenMP. And I also found a very weird problem with do concurrent and OpenMP, which I will post on a separate thread.

OpenMP is a very nice specification and quite easy to use. Still, there are some things to know and learn :wink:

The elapsed time can be measured with omp_get_wtime() (but the Fortran intrinsic system_clock() can also do the job, although the granularity is sometimes not fine enough).
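
For reference, a minimal self-contained sketch comparing the three timers (the parallel loop is just a dummy workload):

! Minimal sketch: cpu_time typically sums the CPU time over all threads,
! while omp_get_wtime and system_clock measure wall-clock (elapsed) time.
program timing_demo
   use omp_lib, only: omp_get_wtime
   implicit none
   integer :: count0, count1, count_rate, i
   real(8) :: t0, t1, tcpu0, tcpu1, s

   call cpu_time(tcpu0)
   call system_clock(count0, count_rate)
   t0 = omp_get_wtime()

   s = 0
   !$omp parallel do reduction(+:s)
   do i = 1, 100000000
      s = s + sqrt(real(i, 8))   ! dummy parallel workload
   end do
   !$omp end parallel do

   t1 = omp_get_wtime()
   call system_clock(count1)
   call cpu_time(tcpu1)

   print *, "checksum                    :", s
   print *, "wall time (omp_get_wtime)   :", t1 - t0
   print *, "wall time (system_clock)    :", real(count1 - count0, 8) / count_rate
   print *, "cpu time (sum over threads) :", tcpu1 - tcpu0
end program timing_demo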

The performance remains essentially the same with and without the private clause. On the other hand, the (wrong) result is kind of cool!
[image: example2d]