Why is the performance poorer after using OpenMP?

I’m writing a subroutine to calculate the Laplacian operator:

subroutine laplacian(nx, ny, var, dx, dy, lap)
    use omp_lib
    implicit none
    integer, intent(in) :: nx, ny
    real*8, intent(in) :: var(nx, ny)
    real*8, intent(in) :: dx, dy
    real*8, intent(out) :: lap(nx,ny)
    integer :: i, j, ip1, im1, jp1, jm1
    real*8 :: start_time, end_time
    start_time = omp_get_wtime()
    !$omp parallel do default(none) shared(lap, var, dx, dy, nx, ny) private(j, i, ip1, im1, jp1, jm1)
    do j = 1, ny
        do i = 1, nx
            ip1 = mod(i+1, nx)
            im1 = mod(i+nx-2, nx) + 1
            jp1 = mod(j+1, ny)
            jm1 = mod(j+ny-2, ny) + 1
            lap(i,j) = (var(ip1,j) - 2*var(i,j) + var(im1,j)) / (dx*dx)  &
                       + (var(i,jp1) - 2*var(i,j) + var(i,jm1)) / (dy*dy)
        end do
    end do
    !$omp end parallel do
    end_time = omp_get_wtime()
    write(*,*) "finish computation in ", end_time-start_time, " seconds"
end subroutine laplacian

To reach higher performance I also used OpenMP, but I found that performance actually dropped. More precisely, the OpenMP version of the subroutine is only faster when (nx, ny) is greater than about (2000, 2000):

smaller grid:  With / Without OpenMP  ->  0.0139 / 0.00303 sec
larger grid:   With / Without OpenMP  ->  0.0180 / 0.0719 sec

I have only come up with two possible reasons:
(1) the problem is too small, so the overhead of creating and managing threads is relatively expensive;
(2) false sharing.

I’m not sure whether false sharing occurs in my subroutine, but I don’t think it does. I also don’t think the problem is too simple for OpenMP to accelerate. Any suggestions as to why OpenMP fails here? Thanks!

Not surprising that OpenMP does not improve things for small nx, ny.
If each thread’s columns lap(:, j) sit on separate memory pages, there are fewer memory-consistency problems, whereas with small nx several threads touch the same page, which introduces inefficiency. To help with this, SCHEDULE(STATIC) is preferred; it should be the default, but it is better to specify it explicitly.
Also check ip1 and jp1: for i = nx-1, ip1 = 0, which looks wrong.
Try ip1 = i+1 ; if (i==nx) ip1 = 1

Would the following code, placed in the appropriate places, help, or are compilers too smart for this?

            fx = 1 / (dx*dx)
            fy = 1 / (dy*dy)
            f0 = -2*(fx+fy)

            jm1 = j-1 ; if ( j==1  ) jm1 = ny
            jp1 = j+1 ; if ( j==ny ) jp1 = 1
            im1 = i-1 ; if ( i==1  ) im1 = nx
            ip1 = i+1 ; if ( i==nx ) ip1 = 1

            lap(i,j) = f0*var(i,j) + fx*(var(im1,j) + var(ip1,j)) + fy*(var(i,jm1) + var(i,jp1))

Thank you for your suggestions!
I agree that OpenMP will not improve performance with small nx, ny, but I was wondering whether nx = ny = 200 is really still too small. There are about 10 cores on my machine, so OpenMP creates 10 threads by default and each thread is assigned 200/10 = 20 iterations of the outer loop. Is that still too little work to see any improvement? If so, are there other ways to accelerate the code for small nx, ny?

BTW, I originally wrote the function in C, where ip1 was int ip1 = (i+1) % nx and im1 was int im1 = (i+nx-1) % nx. After running into the OpenMP inefficiency I rewrote the function in Fortran in order to ask the question here, but I only translated im1 correctly and forgot to adjust ip1. Thanks for the correction!

OpenMP, like every other parallelization framework, adds some overhead in order to assign tasks to threads and manage memory. In most cases, however, poor performance comes from mistakes in the code (for example, in what is declared shared and what private).
It is also not uncommon for non-OpenMP code to be faster when compiled with, say, the -O2 flag, especially when the amount of work is not too big. With a 200x200 loop, as in your case, I am not surprised OpenMP doesn’t do a good job; it should shine when you have to deal with billions of iterations. With 10 threads, as in your case, I would expect at best roughly a 7.5x boost in performance, but again, it depends on the code at hand.

Some time ago, I did some experiments with test programs implemented in various parallelization frameworks. I even ported part of pthreads to Fortran, even though I suspect at least gfortran’s implementation of OpenMP is based on (“native” in GNU/Linux) pthreads. At least for the test codes I tried, I concluded that switching to another parallelization scheme wouldn’t improve performance; it all comes down to how exactly parallelization is done in the code and how much work the specific problem requires.


I tried running your code (with a bit of modification, see below). On my old Mac it gave some speed-up even for n = 200, but the result may differ on your machine + compiler. So, which compiler + options + machine (CPU spec) did you use? (For me: gfortran-10.2 -O3 -fopenmp and echo 200 | OMP_NUM_THREADS=4 ./a.out etc. I calculated the Laplacian 100 times and took the mean time.)

minimum working example code
module test_m
    use iso_fortran_env, only: dp => real64
    implicit none
contains

subroutine laplacian(nx, ny, var, dx, dy, lap)
    integer, intent(in) :: nx, ny
    real(dp), intent(in) :: var(nx, ny)
    real(dp), intent(in) :: dx, dy
    real(dp), intent(out) :: lap(nx,ny)
    integer :: i, j
    !$omp parallel do default(none) private(i,j) &
    !$omp shared(lap, var, dx, dy, nx, ny)
    do j = 2, ny - 1
    do i = 2, nx - 1
        lap(i,j) =   (var(i+1, j) - 2*var(i,j) + var(i-1, j)) / (dx*dx)  &
                   + (var(i, j+1) - 2*var(i,j) + var(i, j-1)) / (dy*dy)
    end do
    end do
    !$omp end parallel do
end subroutine
end module

program main
    use test_m
    use omp_lib
    implicit none
    integer :: n, ix, iy, loop
    real(dp) :: dx, start_time, end_time
    real(dp), allocatable :: var(:,:), lap(:,:)

    print *, "n = ?"; read *, n
    allocate( var(n,n), lap(n,n) )
    dx = 0.01_dp
    var = reshape( [( [(sin( ix * dx + iy * dx ), ix=1,n)], iy=1,n)], [n,n] )
    lap = 0

    start_time = omp_get_wtime()
    do loop = 1, 100
        call laplacian( n, n, var, dx, dx, lap )
    end do
    end_time = omp_get_wtime()

    print *, "var        = ", var(n/2, n/2)
    print *, "lap(num)   = ", lap(n/2, n/2)
    print *, "lap(exact) = ", -2 * var(n/2, n/2)
    print *, "time       = ", ( end_time - start_time ) / 100
end program main

If you want to parallelise nested loops, there is the collapse(n) clause, which collapses n loops into a single one. In some situations this can help, for example when the outer loop does not have enough iterations. In this case, however, it didn’t get better.


As correctly written in other answers, performance is not portable: certain patterns are fast on one architecture but slow on others. This question therefore cannot be answered without discussing hardware.
In a very simple model, every calculation requires the following steps

  1. Load the data from RAM into the CPU
  2. Perform the calculation
  3. Write the result into RAM

The performance of CPUs has increased dramatically over time; the speed of memory access has not kept pace. As a result, the speed of many operations does not depend on the actual speed of the CPU but on the time needed for memory operations. This situation is called memory-bound. The several levels of cache on modern CPUs exist exactly to mitigate this problem.

If your code is memory-bound, any shared-memory parallelization (OpenMP, pthreads) will not give a performance increase. It might even increase the runtime, either due to overhead or because memory access is not uniform among processors. When discussing the benefits of parallelization, one therefore needs to know the arithmetic intensity of the algorithm. Algorithms that perform many operations on the same data once it is loaded are generally better suited for parallel computing. The stencil discussed here has low arithmetic intensity, so it is not surprising that the speed-up is small or even nonexistent.