Why is the performance poorer after using OpenMP?

I’m writing a subroutine to calculate the Laplacian operator:

subroutine laplacian(nx, ny, var, dx, dy, lap)
    use omp_lib
    implicit none
    integer, intent(in) :: nx, ny
    real*8, intent(in) :: var(nx, ny)
    real*8, intent(in) :: dx, dy
    real*8, intent(out) :: lap(nx,ny)
    integer :: i, j, ip1, im1, jp1, jm1
    real*8 :: start_time, end_time
    
    start_time = omp_get_wtime()
    
    !$omp parallel do default(none) shared(lap, var, dx, dy, nx, ny) private(j, i, ip1, im1, jp1, jm1)
    do j = 1, ny
        do i = 1, nx
            ip1 = mod(i+1, nx)
            im1 = mod(i+nx-2, nx) + 1
            jp1 = mod(j+1, ny)
            jm1 = mod(j+ny-2, ny) + 1
            lap(i,j) = (var(ip1,j) - 2*var(i,j) + var(im1,j)) / (dx*dx)  &
                       + (var(i,jp1) - 2*var(i,j) + var(i,jm1)) / (dy*dy)
        end do
    end do
    !$omp end parallel do
    
    end_time = omp_get_wtime()
    write(*,*) "finish computation in ", end_time-start_time, " seconds"
    
    return
end subroutine laplacian

To reach higher performance I also use OpenMP, but I found that the performance is actually reduced with OpenMP.
More precisely, the OpenMP version of the subroutine is only faster when (nx, ny) is about (2000, 2000) or greater:

nx=ny=200:
With / Without OpenMP  ->  0.0139 / 0.00303 sec
nx=ny=2000:
With / Without OpenMP  ->  0.0180 / 0.0719 sec

I have only come up with two possible reasons:
(1) the problem is too simple so that the overhead of creating and managing threads is relatively expensive,
(2) false sharing.

I’m not sure whether false sharing occurs in my subroutine, but I don’t think it does. I also don’t think the problem is too simple to accelerate with OpenMP. Any suggestions as to why OpenMP fails here? Thanks!


It is not surprising that OpenMP does not improve things with small nx, ny.
If each lap(:,j_thread) is on a separate memory page, there will be fewer memory-consistency problems, whereas small nx and threads writing to the same page introduce inefficiency. To assist this, SCHEDULE(STATIC) is preferred; it should be the default, but it is better to specify it.
Also check ip1, jp1: for i = nx-1, ip1 = 0 looks wrong.
Try ip1 = i+1 ; if ( i==nx ) ip1 = 1

Would the following code, placed in appropriate places, help, or are compilers too smart for this?

            fx = 1 / (dx*dx)
            fy = 1 / (dy*dy)
            f0 = -2*(fx+fy)

            jm1 = j-1 ; if ( j==1  ) jm1 = ny
            jp1 = j+1 ; if ( j==ny ) jp1 = 1
            im1 = i-1 ; if ( i==1  ) im1 = nx
            ip1 = i+1 ; if ( i==nx ) ip1 = 1

            lap(i,j) = f0*var(i,j) + fx*( var(im1,j) + var(ip1,j) ) + fy*( var(i,jm1) + var(i,jp1) )
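
Putting those pieces together, a corrected version of the loop nest might look roughly like this (a sketch only; fx, fy, f0 would need to be declared real*8, and the clauses follow the original directive):

    fx = 1.0d0 / (dx*dx)
    fy = 1.0d0 / (dy*dy)
    f0 = -2.0d0*(fx+fy)

    !$omp parallel do default(none) shared(lap, var, fx, fy, f0, nx, ny) private(j, i, ip1, im1, jp1, jm1) schedule(static)
    do j = 1, ny
        ! wrap the j-neighbours once per column
        jm1 = j-1 ; if ( j==1  ) jm1 = ny
        jp1 = j+1 ; if ( j==ny ) jp1 = 1
        do i = 1, nx
            ! wrap the i-neighbours
            im1 = i-1 ; if ( i==1  ) im1 = nx
            ip1 = i+1 ; if ( i==nx ) ip1 = 1
            lap(i,j) = f0*var(i,j) + fx*( var(im1,j) + var(ip1,j) ) &
                                   + fy*( var(i,jm1) + var(i,jp1) )
        end do
    end do
    !$omp end parallel do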

Thank you for your suggestions!
I agree that OpenMP will not improve the performance with small nx, ny, but I was wondering whether nx=ny=200 is really too small. There are about 10 cores in my machine, so OpenMP will create 10 threads by default, and each thread will be assigned 200/10 = 20 iterations of the outer loop. Is that still too little work to see an improvement? If so, is there any other way to accelerate the code for small nx, ny?

BTW, I originally wrote the function in C, so ip1 was int ip1 = (i+1) % nx and im1 was int im1 = (i+nx-1) % nx. After running into the inefficiency of OpenMP, I rewrote the function in Fortran in order to ask the question here, but I only converted im1 and forgot to modify ip1. Thanks for the correction!
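
For the record, a direct translation of those 0-based C expressions into Fortran’s 1-based indexing would be:

    ip1 = mod(i, nx) + 1          ! C: (i+1) % nx, shifted to 1-based indexing
    im1 = mod(i+nx-2, nx) + 1     ! C: (i+nx-1) % nx, shifted to 1-based (what I already had)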


OpenMP, like every other parallelization framework, adds some overhead in order to assign tasks to threads and do the memory management. However, in most cases poor performance comes from mistakes in the code (like what should be shared and what shouldn’t).
It is also not uncommon for non-OpenMP code to be faster when compiled with, say, the -O2 flag, especially when the amount of work is not too big. With a 200x200 loop, as in your case, I am not surprised OpenMP doesn’t do a good job. It should shine when you have to deal with billions of iterations. With 10 threads, as in your case, I would expect roughly a 7.5x boost in performance at best, but again, it depends on the code at hand.

Some time ago, I did some experiments with test programs implemented in various parallelization frameworks. I even ported part of pthreads to Fortran, even though I suspect at least gfortran’s implementation of OpenMP is based on (“native” in GNU/Linux) pthreads. At least for the test codes I tried, I concluded that switching to another parallelization scheme wouldn’t improve performance; it is all about how exactly parallelization is done in the code, and how much work each specific problem requires.


If you want to parallelise nested loops, there is the collapse(n) clause, which collapses n loops into a single one. In some situations this can help, for example when the outer loop does not have enough iterations. However, in this case it didn’t get better; see the sketch below.
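
Applied to the Laplacian above (with the corrected index wrapping), it would look roughly like this; whether it helps is machine- and compiler-dependent, and note that with collapse(2) the two loops must be perfectly nested, so the neighbour indices have to be computed inside the inner loop:

    !$omp parallel do collapse(2) default(none) shared(lap, var, dx, dy, nx, ny) private(j, i, ip1, im1, jp1, jm1)
    do j = 1, ny
        do i = 1, nx
            ! periodic neighbours, computed per point because of collapse(2)
            jm1 = j-1 ; if ( j==1  ) jm1 = ny
            jp1 = j+1 ; if ( j==ny ) jp1 = 1
            im1 = i-1 ; if ( i==1  ) im1 = nx
            ip1 = i+1 ; if ( i==nx ) ip1 = 1
            lap(i,j) = (var(ip1,j) - 2*var(i,j) + var(im1,j)) / (dx*dx)  &
                       + (var(i,jp1) - 2*var(i,j) + var(i,jm1)) / (dy*dy)
        end do
    end do
    !$omp end parallel do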


As correctly written in other answers, performance is not portable, i.e. certain patterns are fast on one architecture but slow on others. Answering this question is therefore not possible without discussing hardware.
In a very simple model, every calculation requires the following steps

  1. Load the data from RAM into the CPU
  2. Perform the calculation
  3. Write the result into RAM

The performance of CPUs has dramatically increased over time, but the speed of memory access has not increased equivalently. That means the speed of many operations does not depend on the actual speed of the CPU but on the time for memory operations. This situation is called memory-bound. The several levels of cache on modern CPUs exist exactly to mitigate this problem. If your code is memory-bound, any shared-memory parallelization (OpenMP, pthreads) will not give a performance increase. It might even increase the runtime, either due to overhead or because memory access is not uniform among processors. When discussing the benefits of parallelization, one therefore needs to know the arithmetic intensity of an algorithm. Algorithms that do many operations on the same data once it is loaded are generally better suited for parallel computing. The algorithm discussed here is not arithmetically intensive, so it is not surprising that the speed-up is low or even non-existent.
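
As a rough, back-of-the-envelope illustration for the Laplacian subroutine above (numbers are only indicative): each grid point costs on the order of 10 floating-point operations, but even with perfect caching of the neighbouring rows it still moves at least 16 bytes per point through memory (one fresh element of var plus one store to lap), i.e. well under one flop per byte. A modern multi-core CPU can sustain several flops per byte of memory bandwidth, so this loop hits the memory-bandwidth ceiling long before it can keep all the cores busy.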


Hey. I am running the code below, which has temporal and spatial discretization. Does the cost of loading data apply here? Isn’t it supposed to run faster with OpenMP? I am getting a six-times slowdown when using OpenMP.

program OpenMP_Test
use omp_lib
implicit none

!
!Parameters
!

integer(kind = 4)                  :: steps 
real(kind = 8)                     :: t1,t2
integer(kind = 4)                  :: i,j
real(kind = 8), dimension(256,256) :: r,c

!
!OpenMP parameters checking
!

write (*,'(a,i3)') ' The number of threads available  = ', omp_get_max_threads ( )

!
!Random numbers
!

call random_number(r)

c = 0.45*(0.5-r)

t1 = omp_get_wtime()

!
!Iteration
!

do steps = 1, 40000
    
    !$OMP PARALLEL DO REDUCTION(+:C)
    
    do j = 1,256
        do i = 1,256
            !
            !update value
            !
            c(i,j) =  c(i,j) + 0.2*i  
            
        end do
    end do
    
    !$OMP END PARALLEL DO
    
end do

t2 = omp_get_wtime()

print*, 'Omp Time is', t2-t1

end program

Your code is:

  1. entering an OpenMP region 40,000 times,
  2. initialising a private copy of c(256,256), omp_get_max_threads() x 40,000 times,
  3. doing the calculation that the single thread performs, spread over multiple threads,
  4. combining each copy back into the master c(256,256), omp_get_max_threads() x 40,000 times.

Unfortunately all the OpenMP overheads swamp the gains of the threaded parts.
I would estimate each entry into !$OMP PARALLEL DO to take about 20.e-6 seconds, so the overhead of the reduction (allocate, initialise, then combine the multiple copies of c(256,256)) is significant.
Changing 256 to max_threads (so both the work and the reduction become negligible) should show roughly that 20.e-6 seconds per region.
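
If you want to measure that overhead on your own machine, a minimal sketch along these lines (the program name and repetition count are just illustrative) times an almost-empty parallel region:

program region_overhead
  use omp_lib
  implicit none
  integer, parameter :: nrep = 40000      ! same count as the steps loop above
  integer            :: k, i, nthreads
  real(kind = 8)     :: t1, t2, s

  nthreads = omp_get_max_threads ( )
  s = 0.0d0

  t1 = omp_get_wtime()
  do k = 1, nrep
     !$omp parallel do reduction(+:s)
     do i = 1, nthreads                   ! one trivial iteration per thread
        s = s + 1.0d0
     end do
     !$omp end parallel do
  end do
  t2 = omp_get_wtime()

  write(*,*) 'average cost per parallel region ~', (t2-t1)/nrep, ' seconds  (s =', s, ')'
end program region_overhead

On a typical desktop this prints somewhere between a few microseconds and a few tens of microseconds per region, which is the cost your 40,000 entries have to amortise.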


Does this mean I cannot use OpenMP for this code?

Where should I apply this implementation, or how should I modify the code? I am confused about it.


Basically, there is insufficient computation for each thread to justify the OpenMP overhead, both the initiation of the !$OMP region and, in this particular case, the REDUCTION (+:C) where C is a 0.5 MByte array.
For each element of the C array, all you are doing is c(i,j) = c(i,j) + 0.2*i, which is too trivial a calculation.
Try:

integer(kind = 4), parameter       :: num_steps = 40000
integer(kind = 4), parameter       :: nc        = 256 !  or 4
...
  !$OMP PARALLEL DO PRIVATE(i,j) REDUCTION(+:C)
    
  do steps = 1, num_steps
    do j = 1,nc
        do i = 1,nc
            !
            !update value
            !
            c(i,j) =  c(i,j) + 0.2*i  
            
        end do
    end do
  end do
    
  !$OMP END PARALLEL DO

So if I put it graphically, does it mean the main do work-sharing construct is for the steps do loop? It divides that loop into 4 parts (in the case where 4 threads are used); the inner do loops do the computation in each thread and update the c variable there with a reduction clause.


Yes, basically what you have drawn is correct, but can you start step 10,001 at the same time as step 1? This is not what you typically want.
Typically you do want the order as you initially posted, with load sharing between threads within each iteration. The penalty will be about 40,000 x 20.e-6 seconds (0.8 seconds), but the work in each iteration needs to be much more substantial than c(i,j) = c(i,j) + 0.2*i to justify the use of !$OMP.
Your 40,000 x REDUCTION(+:C) on a 0.5 MB array is taking most of the time, so an approach where this is changed to a shared array, say c(i,j,thread), with the combination managed manually, could help, hopefully without a race condition.

What your figure shows, and is useful to appreciate, is that between !$OMP PARALLEL and !$OMP DO there are 4 threads (depending on the environment) that all execute the same statements, with a mixture of shared and private variables. This can be useful for setting initial conditions of private variables, or it can be avoided by using the combined !$OMP PARALLEL DO.


I am trying to implement this shared-array concept (I have not read about it or implemented it before). Can you please share some material or links where this has been implemented or discussed with examples? I could not find it anywhere. Thanks.


This is a simple example of managing the REDUCTION manually:

program OpenMP_Test
  use omp_lib
  implicit none

!
!Parameters
!

  integer(kind = 4), parameter :: max_steps = 40000
  integer(kind = 4), parameter :: nc = 256
  integer(kind = 4)            :: steps 
  integer(kind = 4)            :: threads
  integer(kind = 4)            :: i,j, thread, new
  integer(kind = 4), allocatable, dimension(:)  :: thread_step
  real(kind = 8), allocatable, dimension(:,:)   :: r
  real(kind = 8), allocatable, dimension(:,:,:) :: c
  real(kind = 8)                     :: t0,t1,t2,t3

!
!OpenMP parameters checking
!
  t0 = delta_sec ()

  threads = omp_get_max_threads ( )
  write (*,'(a,i3)') ' The number of threads available  = ', threads

!  initialise arrays
  allocate ( r(nc,nc) )
  allocate ( c(nc,nc,threads) )
  allocate ( thread_step(0:max_steps+1) )
!
! Random numbers
  call random_number (r)

  c = 0
  c(:,:,1) = 0.45*( 0.5-r(:,:) )
  thread_step = 0

  t1 = delta_sec ()

!
! Iteration
!

 !$OMP PARALLEL DO PRIVATE(steps,thread,i,j) SHARED(C, thread_step) SCHEDULE(STATIC)
    do steps = 1, max_steps
      thread = omp_get_thread_num () + 1
      thread_step(steps) = thread

      do j = 1,nc
        do i = 1,nc

          !update value
            c(i,j,thread) =  c(i,j,thread) + 0.2*i  

        end do
      end do
    end do
    
  !$OMP END PARALLEL DO

  t2 = delta_sec ()

!  now combine results for each thread
    do thread = 2,threads
      do j = 1,nc
        do i = 1,nc
            c(i,j,1) =  c(i,j,1) + c(i,j,thread)
        end do
      end do
    end do
    write (*,*) 'maxval = ', maxval ( c(:,:,1) )

    new = 1
    do steps = 1, max_steps+1
      if ( thread_step(steps) == thread_step(new) ) cycle
      write ( *,11) 'step', new, ' :', steps-1,' : thread',thread_step(new)
      new = steps
    end do 
  11 format (a,i6,a,i6,a,i3)

  t3 = delta_sec ()

  write (*,12) ' Initialise', t1
  write (*,12) ' OMP step  ', t2
  write (*,12) ' Combine   ', t3
  12 format (a,f10.5)

  contains

    function delta_sec ()
      real(kind=8)    :: delta_sec
      integer(kind=8) :: tick, rate, last = 0
      call SYSTEM_CLOCK (tick, rate)
      delta_sec = dble(tick-last) / dble(rate)
      last = tick
    end function delta_sec

end program

Edit: to include thread usage info.


The following is a simple change to the previous example, to show that the !$OMP region can be moved inside each step, with a consequent increase in overhead of about 1 second for 40,000 !$OMP regions, but without the REDUCTION overhead, which was the main purpose of the manual reduction. (Hopefully no errors in this edit!)

program OpenMP_Test
  use omp_lib
  implicit none

!
!Parameters
!

  integer(kind = 4), parameter :: max_steps = 40000
  integer(kind = 4), parameter :: nc = 256
  integer(kind = 4)            :: steps 
  integer(kind = 4)            :: threads
  integer(kind = 4)            :: i,j, thread, new
  integer(kind = 4), allocatable, dimension(:)  :: thread_step
  real(kind = 8), allocatable, dimension(:,:)   :: r
  real(kind = 8), allocatable, dimension(:,:,:) :: c
  real(kind = 8)                     :: t0,t1,t2,t3

!
!OpenMP parameters checking
!
  t0 = delta_sec ()

  threads = omp_get_max_threads ( )
  write (*,'(a,i3)') ' The number of threads available  = ', threads

!  initialise arrays
  allocate ( r(nc,nc) )
  allocate ( c(nc,nc,threads) )
  allocate ( thread_step(0:max_steps+1) )
!
! Random numbers
  call random_number (r)

  c = 0
  c(:,:,1) = 0.45*( 0.5-r(:,:) )
  thread_step = 0

  t1 = delta_sec ()

!
! Iteration
!

   do steps = 1, max_steps

    !$OMP PARALLEL DO PRIVATE(thread,i,j) SHARED(steps, C, thread_step) SCHEDULE(STATIC)
      do j = 1,nc
        thread = omp_get_thread_num () + 1
        thread_step(j) = thread

        do i = 1,nc

          !update value
            c(i,j,thread) =  c(i,j,thread) + 0.2*i  

        end do
      end do

    !$OMP END PARALLEL DO

    end do

  t2 = delta_sec ()

!  now combine results for each thread
    do thread = 2,threads
      do j = 1,nc
        do i = 1,nc
            c(i,j,1) =  c(i,j,1) + c(i,j,thread)
        end do
      end do
    end do
    write (*,*) 'maxval = ', maxval ( c(:,:,1) )

    new = 1
    do steps = 1, max_steps+1   ! > nc+1
      if ( thread_step(steps) == thread_step(new) ) cycle
      write ( *,11) 'step', new, ' :', steps-1,' : thread',thread_step(new)
      new = steps
    end do 
  11 format (a,i6,a,i6,a,i3)

  t3 = delta_sec ()

  write (*,12) ' Initialise', t1
  write (*,12) ' OMP step  ', t2
  write (*,12) ' Combine   ', t3
  12 format (a,f10.5)

  contains

    function delta_sec ()
      real(kind=8)    :: delta_sec
      integer(kind=8) :: tick, rate, last = 0
      call SYSTEM_CLOCK (tick, rate)
      delta_sec = dble(tick-last) / dble(rate)
      last = tick
    end function delta_sec

end program

I revised the material once more and came to some conclusions:

!$OMP PARALLEL DO

will first create a parallel region

and will then distribute the load. The do-loop counter variable is distributed among the threads, as in the example; it is private by default!

So if we apply the same logic to the code you wrote, then

steps, i and j are distributed among the threads; to me, the distribution looks like the diagrams I drew.

Here the 2D mesh does not seem to be evaluated at each point. In the case of up to 8 iterations for i and j, for instance, this type of do work-sharing loop will not be able to evaluate the whole mesh: the empty cells are not covered in this distribution.

I just cannot get my head around how this whole work-sharing construct works here. Is there something missing in my graphical representation? Is there any link or material available to go deeper into this type of problem? Thanks.

@Shahid,
Your last two diagrams do not appear consistent with the sample code I wrote.
For the code as:

  !$OMP PARALLEL DO PRIVATE(steps,thread,i,j) SHARED(C) SCHEDULE(STATIC)
   do steps = 1, max_steps
      thread = omp_get_thread_num () + 1
      do j = 1,nc
         do i = 1,nc
            c(i,j,thread) =  c(i,j,thread) + 0.2*i  
         end do
      end do
   end do
  !$OMP END PARALLEL DO

In this case, “DO steps” is shared between threads as you have indicated.
thread_0 for steps 1 to 10000, thread_1 for 10001 to 20000, thread_2 for 20001 to 30000 and thread_3 for 30001 to 40000.
( This arrangement is ensured by SCHEDULE(STATIC), which improves efficiency by separating the memory pages of array C used by the threads to minimise “memory page conflicts”. There is no race condition in this sharing. )
Importantly, each thread performs the complete loops do j = 1,nc ; do i = 1,nc. This is NOT what your last diagram shows and should be corrected.
It is helpful to declare (steps, i, j and thread) as private variables that differ between threads, while array C is shared between all threads.

Often “do steps” must be done sequentially, so this loop must be outside the parallel region, as for the following case.

   do steps = 1, max_steps
    !$OMP PARALLEL DO PRIVATE(thread,i,j) SHARED(steps,C) SCHEDULE(STATIC)
      do j = 1,nc
         thread = omp_get_thread_num () + 1
         do i = 1,nc
            c(i,j,thread) =  c(i,j,thread) + 0.2*i  
         end do
      end do
    !$OMP END PARALLEL DO
   end do

In this case, we enter an OMP region for each step.
Now “j” is shared between the threads:
thread_0 for j = 1 to 64, thread_1 for 65 to 128, thread_2 for 129 to 192 and thread_3 for 193 to 256.
In every thread, i cycles from 1 to 256.

In both cases, we need to consider what is happening to array C. The results are spread between the thread “copies”.
It is important that the initial condition for C before entry is considered when accumulating the final value at the end of DO steps.
This accumulation can either be done manually (preferred for the second option) or via REDUCTION (+: C) (preferred for the first option).
C needs to be initialised (as has been done) then either accumulated via REDUCTION (+: C) or manually on exit.

Then there is the question of whether the accumulated/reduced value of C needs to be referenced at each step, which makes the example more complex by potentially introducing a race condition that would need to be managed. I think it has been shown that including REDUCTION in the second example slows the computation significantly, which may make OMP usage problematic.

!$OMP is used to improve performance, but in some cases this is not possible.

Your assumption that do j ; do i each use only a subset of values would apply only if the COLLAPSE clause were used. That might only be useful if the number of threads available is more than nc. However, where (typically) nc >> num_threads, I would recommend avoiding collapse.


It sounds very tricky!
Although I have only put part of the code here, the OpenMP implementation appears to be challenging.
My main code does involve a Laplacian evaluation at each grid point. Would it be evaluated on each thread like c?

do steps = 1, max_steps
     !$OMP PARALLEL DO PRIVATE(thread,i,j,im,ip,jm,jp) SHARED(steps,lap_c,C) SCHEDULE(STATIC)
        do j = 1,nc
            thread = omp_get_thread_num () + 1
            do i = 1,nc
            
                jp = j+1
                jm = j-1
                
                ip = i+1
                im = i-1
                
                if (im == 0) im = Nc
                if (ip == (Nc+1)) ip = 1
                if (jm == 0) jm = Nc                
                if (jp == (Nc+1)) jp = 1
                
                lap_c(i,j,thread) = c(ip,j,thread) + c(im,j,thread) + c(i,jm,thread) + c(i,jp,thread) - 4.0*c(i,j,thread)
                c(i,j,thread) =  c(i,j,thread) + 0.052*lap_c(i,j,thread) 
            end do
        end do
     !$OMP END PARALLEL DO
    end do

This code approach is just wrong, even as single-threaded code: the update of c(i,j) overwrites values that neighbouring points still need to read, so the stencil mixes old and new values, and with multiple threads the order of those updates becomes a “race condition”.
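
One way to avoid it (a sketch only, assuming nc, max_steps and the loop variables are declared as before, and that c and lap_c are plain nc x nc arrays; the extra thread dimension is not needed because each element is then written by exactly one thread) is to split each step into two passes, so the whole Laplacian is computed from the old c before any element of c is updated:

    do steps = 1, max_steps

       !$OMP PARALLEL DO PRIVATE(i,j,im,ip,jm,jp) SHARED(c,lap_c) SCHEDULE(STATIC)
       do j = 1, nc
          do i = 1, nc
             jp = j+1 ; if ( jp > nc ) jp = 1
             jm = j-1 ; if ( jm < 1  ) jm = nc
             ip = i+1 ; if ( ip > nc ) ip = 1
             im = i-1 ; if ( im < 1  ) im = nc
             ! pass 1: read only old values of c
             lap_c(i,j) = c(ip,j) + c(im,j) + c(i,jm) + c(i,jp) - 4.0d0*c(i,j)
          end do
       end do
       !$OMP END PARALLEL DO

       !$OMP PARALLEL DO PRIVATE(i,j) SHARED(c,lap_c) SCHEDULE(STATIC)
       do j = 1, nc
          do i = 1, nc
             ! pass 2: update c from the completed Laplacian
             c(i,j) = c(i,j) + 0.052d0*lap_c(i,j)
          end do
       end do
       !$OMP END PARALLEL DO

    end do

This keeps two !$OMP regions per step, so the per-region overhead discussed earlier still applies, but at least the result no longer depends on the order in which threads update c.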

This code works fine as a serial program. I do not understand how c(im,j) creates a race condition.