Best practice for using OpenMP with sections containing do loops

HugoMVale · April 30, 2023, 2:49pm

I have a subroutine where I do (very similar) calculations along 2 axes. These 2 parts are independent from each other and contain nested loops, which can also be evaluated concurrently.
Is there a best practice or heuristic to chose how to approach this kind of problem with OpenMP? When should one use sections (which are incompatible with nested parallel do) ? When should one prefer parallel loops.
In my case the times are quite similar…

! method with **sections**
fedges = 0
!$omp parallel
!$omp sections
!$omp section
do j = 1, nc(2)
   call myweno(1)%reconstruct(varray(:, j), vl1, vr1)
   do i = 1, nc(1) - 1
      fedges(i, j, 1) = godunov(flux1, vr1(i), vl1(i + 1), &
                                 [gx(1)%right(i), gx(2)%center(j)], t)
   end do
end do
!$omp section
do i = 1, nc(1)
   call myweno(2)%reconstruct(varray(i, :), vl2, vr2)
   do j = 1, nc(2) - 1
      fedges(i, j, 2) = godunov(flux2, vr2(j), vl2(j + 1), &
                                 [gx(1)%center(i), gx(2)%right(j)], t)
   end do
end do
!$omp end sections
!$omp end parallel

! method with **parallel do**
fedges = 0
!$omp parallel
!$omp do
do j = 1, nc(2)
   call myweno(1)%reconstruct(varray(:, j), vl1, vr1)
   do i = 1, nc(1) - 1
      fedges(i, j, 1) = godunov(flux1, vr1(i), vl1(i + 1), &
                                [gx(1)%right(i), gx(2)%center(j)], t)
   end do
end do
!$omp end do nowait
!$omp do
do i = 1, nc(1)
   call myweno(2)%reconstruct(varray(i, :), vl2, vr2)
   do j = 1, nc(2) - 1
      fedges(i, j, 2) = godunov(flux2, vr2(j), vl2(j + 1), &
                                [gx(1)%center(i), gx(2)%right(j)], t)
   end do
end do
!$omp end do
!$omp end parallel

ivanpribec · April 30, 2023, 3:38pm

It’s difficult to give a good answer without knowing more about your problem and hardware you plan to run this on. From the looks of it is a finite-volume on a 2D mesh, where you would typically use work-sharings construct to map the domain to different threads along either one or two spatial axes.

From the definition of !$omp section,

Summary The sections construct is a non-iterative worksharing construct that contains a set of structured blocks that are to be distributed among and executed by the threads in a team. Each structured block is executed once by one of the threads [emphasis added] in the team in the context of its implicit task.

I would say that sections are sub-optimal for this usage case, as you are limited to two threads, although you probably have several more of them that could contribute to the update.

If your times are quite similar it could mean the problem size is too small, or the parallelization didn’t work as expected. Using the compiler flags for optimization reports, e.g. -qopt-report with Intel compilers, could be useful to check if parallelization occurs, and if not, what is obstructing it.

PierU · April 30, 2023, 10:39pm

I concur with @ivanpribec that in such case using parallel loop looks more natural.

Then the usual questions:

what are the values of nc(:)?
is godunov() a function? If yes does it perform a lot of computations?
have you checked that your program really runs very several threads?
how do you time the execution? Have you timed the serial version?

And one observation: the memory access pattern fedges(i,j) is (potentially) bad, because j is the inner loop index. It’s bad even for the serial version, but it’s even worse for the “parallel do” version, as it results in “false sharing” (several threads that write to the same cache line, forcing a lot of cache synchronisation between the cores).

HugoMVale · May 1, 2023, 10:11am

My problem is exactly as @ivanpribec described it. I work on a PC (and that is the intended target device) and currently with a 250x250 mesh, i.e. nc=[250, 250]. The “heavier” part is the WENO reconstruction; the evaluation of the numerical flux is less of a problem.

In serial mode, the total compute time is roughly 30 s with 1000 time steps, which boils down to 6 ms per du/dt evaluation and, thus, 3 ms per axis or ~0.01 ms per outer-loop iteration. Interestingly, although the index order of varray and fedges is not optimal for the second axis, there is no observable time penalty. (One can comment out one of the axis and see that the compute times are similar.)

When the max number of threads is 2, the compute time is reduced to about half, as one could expect for such an embarrassingly parallel problem. There is no difference between sections and parallel do.
As the max number of threads is increased to 4 or 6 (max CPU cores in my case), there is no time reduction using sections (obvious, because there can be only one thread per section), and only a marginal time reduction using parallel do. I suppose the latter is because each outer loop (3 ms) does not have enough work to pay off the split over several cores.

Originally, I thought sections could contain nested parallel dos… some like depicted in Figure 1.2 of F95_OpenMPv1_v2.dvi. But that seems not to be the case.

Just out of curiosity, is there some OpenMP construct that would allow me to split N cores in two teams of N/2 cores (one for each axis) and then split the outer-loop over these N/2 cores?

PierU · May 1, 2023, 11:59am

It depends on the workload in godunov(), that’s also why I was asking about that.

250x250 is not a lot, and in the case where godunov performs only a few operations, two things can happen:

the openmp overheads hide any gain past 2 threads
the code quickly gets memory bound, and more threads can’t help in this case

This is indeed possible, but you have to enable the nested parallelism, and start a fully new parallel region for the do loop. In my exprience, the nested parallelism is often not efficient.

See above. Using tasks could also be a solution.

But IMO you are overcomplicating things. if your code doesn’t scale well with simple parallel do’s, there’s almost no chance that it scale better with more complex openmp structures. At the end you will have to create N threads anyway with their overheads, and if the code is memory bound with N threads, it will be memory bound as well with 2*(N/2) threads

JohnCampbell · May 1, 2023, 1:03pm

Looking at your method with parallel do,

You do not indicate which are shared or private. This could cause problems, depending what happens in myweno(:)%reconstruct. It looks probable that vl1, vr1, vl2 & vr2 should be private.
Also why “nowait” ?
Would fedgesT(j, i, 2) help ?

For timing, I hope you are considering elapsed time.

PierU · May 1, 2023, 2:08pm

Good point. I have assumed that the // code was correct and there was only a performance issue, but indeed if these variables are output from reconstruct, then there is a race condition.

Because the two loops blocks are independent from each other. So once some threads have finished with the first block, they can start with the second block without having to wait for the other ones.

PierU · May 1, 2023, 6:41pm

That said, to really take advantage of this, a scheduling different from the default static one is probably needed.

HugoMVale · May 1, 2023, 9:10pm

Thank you all. Based on the advice give here, I ended up implementing it like so:

 ...
! Fluxes along x1 and x2 at interior cell boundaries
!$omp parallel
!$omp do private(vl1, vr1)
do j = 1, nc(2)
   call myweno(1)%reconstruct(v(1+(j-1)*nc(1):j*nc(1)), vl1, vr1)
   do i = 1, nc(1) - 1
      fedges1(i, j) = godunov(flux1, vr1(i), vl1(i + 1), &
                              [gx(1)%right(i), gx(2)%center(j)], t)
   end do
end do
!$omp end do nowait
!$omp do private(vl2, vr2)
do i = 1, nc(1)
   call myweno(2)%reconstruct(v(i:i+(nc(2)-1)*nc(1):nc(1)), vl2, vr2)
   do j = 1, nc(2) - 1
      fedges2(j, i) = godunov(flux2, vr2(j), vl2(j + 1), &
                              [gx(1)%center(i), gx(2)%right(j)], t)
   end do
end do
!$omp end do
!$omp end parallel
...

PierU · May 2, 2023, 6:37am

As a conclusion, before starting optimizing the // performances, always check the correctness of the // version compared to the serial version

It would be interesting for us to know if after the private fix you get better performances with >2 threads? Not only a race condition gives wrong results, but it can also hit performances.

HugoMVale · May 2, 2023, 5:39pm

I did check the code with the sections construct, which was the one I (incorrectly) believed to be the best approach. With parallel do, I was (regrettably) only playing with the number of threads without really paying attention to the results. This may sound upside sown , but solving the numerical problem was actually not the goal, but rather just an excuse to take some first steps with OpenMP.

And I confess I had to learn more than I was expecting… For instance, it took me a while to realize that cpu_time can’t be used to measure elapsed time with OpenMP. And I also found a very weird problem with do concurrent and OpenMP, which I will post on a separate thread.

PierU · May 2, 2023, 8:02pm

OpenMP is a very nice specification and quite easy to use. Still, there are some things to know and learn

The elapse time can be measured with omp_get_wtime() (but the Fortran intrisic system_clock() can also do, although the granularity is sometimes not small enough).

HugoMVale · May 4, 2023, 10:04pm

The performance remains essentially the same with and without the private clause. On the other hand, the (wrong) result is kind of cool!
example2d

Topic		Replies	Views
OpenMP parallelization Help	1	434	March 5, 2022
Are 4 nested do loops suboptimal? Help	22	965	April 1, 2022
Using libraries in parallel code with OpenMp Help	2	405	February 28, 2021
Why the performance is poorer after using OpenMP? Help	20	5457	June 2, 2022
Fortran openMP - make code faster	4	477	August 31, 2022

Best practice for using OpenMP with sections containing do loops

Related topics