OpenMP Parallel Loop

Would appreciate how to use OpenMP to parallel the following code

  maxvab=100000 ; thispos=0  
!  jwork(random)  contains 1 or 0 in random
  do ivab=1,maxvab             
     if(jwork(ivab) == 0)cycle                 
     jWork(ivab)=thisPos
     thisPos=thisPos+1                     
  end do

This pair of lines mean that you cannot run the loop in parallel without getting different answers. If you do the iteration where ivab == maxvab first, jwork(maxvab) will get a different value than if you did it last. This is called a race condition.

you are right.
What would you do to solve this problem which appears all the time in sparse solver?

This looks like it needs a parallel scan operation. I.e.

pure function scan(arr, op, init) result(scanned)
  type(T), intent(in) :: arr(:)
  type(U), intent(in) :: init
  interface
    pure function op(lhs, rhs)
      type(T), intent(in) :: lhs
      type(U), intent(in) :: rhs
      type(U) :: op
    end function
  end interface
  type(U) :: scanned(size(arr))
end function

Then you can do

positions = scan(jWork, increment_if_positive, 0)
where (jWork > 0) jWork = positions

Writing that scan to work in parallel is hard™. The conditional assignment in parallel is easy. The compiler can probably do it for you.

i think a sort routine for jwork after the loop will work but would it perform better than the original serial code?

There are some workarounds, but with a much more complex code. And given the pretty low workload here (100000 iterations with that few instruction is a low workload), it will very likely be even slower than the simple serial version (at least on usual hardware).

That said, OpenMP 5 has introduced the scan directive. I’m not sure how it can be used here, but it could be worth investigating.

Typically this DO loop would be part of a larger OMP computation. This DO should be allocated to a single thread, while other “jwork” vectors are allocated to other threads.
Typically there are many vectors, which can be allocated dynamically to the next available thread.
You need to monitor the status of each jwork vector, as to if they are:

  • not yet processed,
  • being processed, or
  • have completed processing.

Often they need to be allocated sequentially, but during their processing, the other vectors need to be monitored.
Processing can not be completed until earlier vectors are completed, which can require waiting. This can be as simple as monitoring “last_vector_processed”, which can be updated as an atomic construct. This can still be very efficient, as the vector size is typically much greater than the number of active threads.

Dear John
Would be a very neat approach to solve this problem.
Can you share an example of the implementation of processed/being processed/completed ?

regards

The practical example I have is with an !$OMP skyline direct solver.
As equations must be reduced sequentially, I use "last_vector_processed”, and monitor this when reducing other equation columns in other threads. Again, as the bandwidth (vector size) is much larger than the number of threads, substantial computation can be done before the threads interact, waiting for "last_vector_processed” approaches this equation number.
I do use a function “wait_a_while” to suspend processing, but monitor how often and for how long this is used. It does not provide a significant delay to the calculation.
The following code appears to achieve what is wanted. it must be compiled with -O0 or -O1 to avoid being optimised away !

!   These routines cause a thread to wait if the equation "Jeq" is not complete
!   The delay must be short so as not to delay the restart of this column
!    and also not get into a tight loop in this thread
!   The call to QueryPerformance_tick > QueryPerformanceCounter is used to
!    hopefully prevent this occurring.
!
!   the performance of this routine varies with different processors
!
  subroutine wait_on_jeq ( Jeq, Last_eqn, op_wait )
!
      integer*4 Jeq                                   ! equation that is being waited for
      integer*4 Last_eqn                              ! last equation to be completed
      integer*8 op_wait                               ! count of equations that are waiting
!
!z      logical   NA_wait(*)  ! shared wait status for each equation : .false. when complete
!
      integer*4 iw                                    ! loop count
      integer*8 tick                                  ! dummy argument to overcome /O1
!
      iw = 0
      DO 
         if ( Last_eqn >= Jeq ) return
         call take_a_small_delay (iw, op_wait, tick)
       !$OMP FLUSH ( Last_eqn )
      END DO
!
  end subroutine wait_on_jeq

  subroutine take_a_small_delay (iw, op_wait, tick)
      integer*4 :: iw                                 ! count of delay calls for this equation
      integer*8 :: op_wait                            ! count of equations that are waiting
      integer*8 :: tick                               ! dummy argument to overcome /O1
!
      integer*8, external :: QueryPerformance_tick
!
!   The following code order does not work well on i7-4790K
!    I shall retest to see if I can reproduce this result
!    retest with different compile options -O1 helped
      if ( iw == 0 ) op_wait = op_wait + 1            ! first call
      if ( mod(iw,4) == 0 ) tick = QueryPerformance_tick ()
      iw = iw + 1
      if ( iw < 0 ) iw = 0
!
  end subroutine take_a_small_delay

I’m not sure about what you are doing, but it seems to me that the OpenMP task directive with the depend clause can handle this kind of problem.

I am solving large scale eigenvalue problems with Arpack Sparse solver on a 20 cores PC. Need to form three matrices in CSR format. Typical size of each matrix is 5millions x 5 millions. Typical outer loop is over a million of points and inner loop is over a million of cells.

The inner loop is as listed below which i have tried to parallelize

             do jelem=1,nelz ! ( size(nelz) >1 million))
                kount=0       
                do nod=1,lnomax ! (size(8))   ! hot spot
                   if(lnods(jelem,nod) == inode)kount=kount+1                        
                end do    
                if(kount == 0)cycle     
                do nod=1,lnomax
                   jnode=lnods(jelem,nod)
                   nf2=(jnode-1)*ndfmax    
                   do idof=1,ndfmax ! (size(ndfmax) = 6)
                      jWork(nf2+idof)=jWork(nf2+idof)+1
                   end do         
                end do                   
             end do

Thank you in advance for any suggestion.

            !$OMP PARALLEL DO PRIVATE(kount,nod,jnode,nf2) REDUCTION(+:jWork)
            do jelem=1,nelz ! ( size(nelz) >1 million))
                kount = count(lnods(jelem,1:lnomax) == inode)                            
                if(kount == 0)cycle     
                do nod=1,lnomax
                   jnode=lnods(jelem,nod)
                   nf2=(jnode-1)*ndfmax 
                   jWork(nf2+1:nf2+ndfmax) = jWork(nf2+1:nf2+ndfmax) + 1   
                end do                   
             end do

In the case jnode is guaranteed to be different for different jelem values, then the reduction clause can be dropped

1 Like

The use of reduction clause will create a copy of JWORK for each thread. The size of JWORK is the same as the number of equation and will cause stack overflow with larger problem. Would it be possible to parallel this section with critical, atomic and/or lock? If so, how and where would be the best location to give the best performance?

You can use the Intel Advisor to model threading designs based on the measured serial execution properties. See Model Threading Design | Intel® Advisor User Guide.

The idea is you instrument your code with special annotations:


      use advisor_annotate
      ...

             call annotate_site_begin("assembly")
             do jelem=1,nelz ! ( size(nelz) >1 million))
                call annotate_task_begin("count_and_reduce")
                kount=0       
                do nod=1,lnomax ! (size(8))   ! hot spot
                   if(lnods(jelem,nod) == inode)kount=kount+1                        
                end do    
                if(kount == 0)cycle     
                do nod=1,lnomax
                   jnode=lnods(jelem,nod)
                   nf2=(jnode-1)*ndfmax    
                   do idof=1,ndfmax ! (size(ndfmax) = 6)
                      call annotate_lock_acquire(0)
                      jWork(nf2+idof)=jWork(nf2+idof)+1
                      call annotate_lock_release(0)
                   end do         
                end do
                call annotate_task_end                   
             end do
             call annotate_site_end

The tool can then help you explore different threading models (for Fortran you’ll typically only look at OpenMP).

By the way, I’d recommend reading the following resources from the European Performance Optimization and Productivity (PoP) Center of Excellence:

The following picture that I borrowed from the Barcelona Supercomputing Centre facebook page illustrates the different parallelization concepts:

Addendum: here are a few more resources from the experts,

Ruud van der Pas and Christian Terboven are also authors of the book Using OpenMP - The Next Step: Affinity, Accelerators, Tasking, and SIMD, The MIT Press, 2017. (Caveat - the book focuses on OpenMP 4.5, but the multidependencies were only introduced in OpenMP 5.0.)

This could also be modified as:

if (any(lnods(jelem,1:lnomax) == inode)) then
  do nod = 1, lnomax 
  ! ...
  end do
end if

The discontigous access on the second dimension, since the first dimension of lnods is of size nelzs (over a million), is not ideal IMO. Hopefully the compiler recognizes it’s a constant offset at least, and then uses the cached result (the row of lnods(jelem,:)) in the second loop.

even with the use of any, i still can’t get my parallel region
to work correctly.

!$omp parallel &
!$omp private (jelem, inode, nod, jnode, nf2, idof) &
!$omp shared (nelz, lnods, lnomax, ndfmax, jwork)
!$omp do

             do jelem=1,nelz 
                if (any(lnods(jelem,1:lnomax) == inode)) then        
                     do nod=1,lnomax
                       jnode=lnods(jelem,nod)
                       nf2=(jnode-1)*ndfmax    
                       do idof=1,ndfmax
                          jWork(nf2+idof)=jWork(nf2+idof)+1
                       end do         
                    end do                    
                endif
             end do

!$omp end do
!$omp end parallel

thankyou for the reading recommendation

Try making the jwork update atomic.

already try and not luck.

the issue is several threads executing the if statement in parallel

               if (any(lnods(jelem,1:lnomax) == inode)) then 
                             ....
              endif