I would appreciate advice on how to use OpenMP to parallelize the following code:
maxvab = 100000 ; thispos = 0
! jwork contains 1s and 0s in random order
do ivab = 1, maxvab
   if (jwork(ivab) == 0) cycle
   jwork(ivab) = thispos
   thispos = thispos + 1
end do
These two assignments mean that you cannot run the loop in parallel without getting different answers: if the iteration with ivab == maxvab runs first, jwork(maxvab) gets a different value than if it runs last. This is called a race condition.
This looks like it needs a parallel scan operation, i.e. something like:
pure function scan(arr, op, init) result(scanned)
   type(T), intent(in) :: arr(:)
   type(U), intent(in) :: init
   interface
      pure function op(lhs, rhs)
         type(T), intent(in) :: lhs
         type(U), intent(in) :: rhs
         type(U) :: op
      end function
   end interface
   type(U) :: scanned(size(arr))
end function
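For the original loop, this scan specialises to an exclusive prefix count over the nonzero entries of jwork. Below is a minimal two-pass OpenMP sketch of that idea; the names pack_positions and chunk_count are my own, not from the original post.

subroutine pack_positions(jwork, maxvab)
   use omp_lib
   implicit none
   integer, intent(in)    :: maxvab
   integer, intent(inout) :: jwork(maxvab)
   integer, allocatable   :: chunk_count(:)
   integer :: nthreads, tid, lo, hi, i, base, chunk
   !$omp parallel private(tid, lo, hi, i, base, chunk)
   !$omp single
   nthreads = omp_get_num_threads()
   allocate(chunk_count(0:nthreads))
   chunk_count(0) = 0
   !$omp end single
   tid   = omp_get_thread_num()
   chunk = (maxvab + nthreads - 1) / nthreads
   lo    = tid*chunk + 1
   hi    = min((tid+1)*chunk, maxvab)
   ! pass 1: each thread counts the nonzeros in its own chunk
   chunk_count(tid+1) = count(jwork(lo:hi) /= 0)
   !$omp barrier
   ! serial prefix sum over the (few) per-thread counts
   !$omp single
   do i = 1, nthreads
      chunk_count(i) = chunk_count(i) + chunk_count(i-1)
   end do
   !$omp end single
   ! pass 2: each thread numbers its nonzeros starting from its chunk's
   ! offset, reproducing the serial exclusive count (positions start at 0)
   base = chunk_count(tid)
   do i = lo, hi
      if (jwork(i) == 0) cycle
      jwork(i) = base
      base = base + 1
   end do
   !$omp end parallel
   deallocate(chunk_count)
end subroutine pack_positions

Whether this pays off is another matter; as noted below, the workload per element here is tiny.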
There are some workarounds, but they require much more complex code. And given the pretty low workload here (100,000 iterations with only a few instructions each), a parallel version will very likely be even slower than the simple serial one (at least on typical hardware).
Typically this DO loop would be part of a larger OMP computation. This DO should be allocated to a single thread, while other “jwork” vectors are allocated to other threads.
Typically there are many vectors, which can be allocated dynamically to the next available thread.
You need to monitor the status of each jwork vector, as to whether it is:
not yet processed,
being processed, or
finished processing.
Often the vectors need to be allocated sequentially, but while one is being processed, the other vectors need to be monitored.
Processing cannot be completed until earlier vectors are complete, which can require waiting. This can be as simple as monitoring "last_vector_processed", which can be updated with an atomic construct. This can still be very efficient, as the vector size is typically much greater than the number of active threads.
The practical example I have is an !$OMP skyline direct solver.
As equations must be reduced sequentially, I use "last_vector_processed" and monitor it when reducing other equation columns in other threads. Again, as the bandwidth (vector size) is much larger than the number of threads, substantial computation can be done before the threads interact, waiting while "last_vector_processed" approaches this equation number.
I do use a function "wait_a_while" to suspend processing, but I monitor how often and for how long it is used. It does not add a significant delay to the calculation.
The following code appears to achieve what is wanted. It must be compiled with -O0 or -O1 to avoid being optimised away!
! These routines cause a thread to wait if the equation "Jeq" is not complete
! The delay must be short so as not to delay the restart of this column
! and also not get into a tight loop in this thread
! The call to QueryPerformance_tick (a wrapper around QueryPerformanceCounter)
! is used to hopefully prevent this occurring.
!
! the performance of this routine varies with different processors
!
subroutine wait_on_jeq ( Jeq, Last_eqn, op_wait )
!
integer*4 Jeq ! equation that is being waited for
integer*4 Last_eqn ! last equation to be completed
integer*8 op_wait ! count of equations that are waiting
!
!z logical NA_wait(*) ! shared wait status for each equation : .false. when complete
!
integer*4 iw ! loop count
integer*8 tick ! dummy argument to overcome /O1
!
iw = 0
DO
   if ( Last_eqn >= Jeq ) return
   call take_a_small_delay (iw, op_wait, tick)
   !$OMP FLUSH ( Last_eqn )
END DO
!
end subroutine wait_on_jeq
subroutine take_a_small_delay (iw, op_wait, tick)
integer*4 :: iw ! count of delay calls for this equation
integer*8 :: op_wait ! count of equations that are waiting
integer*8 :: tick ! dummy argument to overcome /O1
!
integer*8, external :: QueryPerformance_tick
!
! The following code order does not work well on an i7-4790K
! I shall retest to see if I can reproduce this result
! (retesting with different compile options: -O1 helped)
if ( iw == 0 ) op_wait = op_wait + 1 ! first call
if ( mod(iw,4) == 0 ) tick = QueryPerformance_tick ()
iw = iw + 1
if ( iw < 0 ) iw = 0 ! guard against integer overflow of the call counter
!
end subroutine take_a_small_delay
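For completeness, here is a hedged sketch of the publishing side that wait_on_jeq polls; the original post does not show it, so this is my assumption of how Last_eqn is updated:

! after this thread finishes reducing equation jeq, publish the new
! high-water mark; equations complete in order, so an atomic write of
! Last_eqn is enough for the waiting threads
!$OMP ATOMIC WRITE
Last_eqn = jeq
!$OMP FLUSH ( Last_eqn )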
I am solving large-scale eigenvalue problems with the ARPACK sparse solver on a 20-core PC. I need to form three matrices in CSR format. The typical size of each matrix is 5 million x 5 million. The typical outer loop is over a million points and the inner loop is over a million cells.
The inner loop, which I have tried to parallelize, is listed below:
do jelem=1,nelz                        ! nelz > 1 million
   kount=0
   do nod=1,lnomax                     ! lnomax = 8   ! hot spot
      if(lnods(jelem,nod) == inode)kount=kount+1
   end do
   if(kount == 0)cycle
   do nod=1,lnomax
      jnode=lnods(jelem,nod)
      nf2=(jnode-1)*ndfmax
      do idof=1,ndfmax                 ! ndfmax = 6
         jWork(nf2+idof)=jWork(nf2+idof)+1
      end do
   end do
end do
!$OMP PARALLEL DO PRIVATE(kount,nod,jnode,nf2) REDUCTION(+:jWork)
do jelem=1,nelz                        ! nelz > 1 million
   kount = count(lnods(jelem,1:lnomax) == inode)
   if(kount == 0)cycle
   do nod=1,lnomax
      jnode=lnods(jelem,nod)
      nf2=(jnode-1)*ndfmax
      jWork(nf2+1:nf2+ndfmax) = jWork(nf2+1:nf2+ndfmax) + 1
   end do
end do
If jnode is guaranteed to be different for different jelem values, then the reduction clause can be dropped.
The use of the reduction clause will create a copy of JWORK for each thread. The size of JWORK is the same as the number of equations, and it will cause a stack overflow with larger problems. Would it be possible to parallelize this section with critical, atomic and/or lock constructs? If so, how and where would be the best place to put them for the best performance?
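One possible answer, sketched under the assumption that nelz, lnods, inode, lnomax, ndfmax and jWork keep their meanings from the loop above: replace the array reduction with a per-element !$OMP ATOMIC update, so no thread-private copy of jWork is needed.

!$OMP PARALLEL DO PRIVATE(nod,jnode,nf2,idof)
do jelem=1,nelz
   ! skip elements that do not touch inode (equivalent to kount == 0)
   if (all(lnods(jelem,1:lnomax) /= inode)) cycle
   do nod=1,lnomax
      jnode=lnods(jelem,nod)
      nf2=(jnode-1)*ndfmax
      do idof=1,ndfmax
         !$OMP ATOMIC UPDATE
         jWork(nf2+idof) = jWork(nf2+idof) + 1
      end do
   end do
end do
!$OMP END PARALLEL DO

A CRITICAL section around the same update would serialize all threads and would likely be much slower; an atomic only serializes updates that collide on the same memory location, so whether it beats the reduction depends on how often different elements share nodes.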
The idea is that you instrument your code with special annotations (via Intel Advisor's advisor_annotate module):
use advisor_annotate
...
call annotate_site_begin("assembly")
do jelem=1,nelz                        ! nelz > 1 million
   call annotate_task_begin("count_and_reduce")
   kount=0
   do nod=1,lnomax                     ! lnomax = 8   ! hot spot
      if(lnods(jelem,nod) == inode)kount=kount+1
   end do
   if(kount == 0)then
      call annotate_task_end           ! close the task before skipping this element
      cycle
   end if
   do nod=1,lnomax
      jnode=lnods(jelem,nod)
      nf2=(jnode-1)*ndfmax
      do idof=1,ndfmax                 ! ndfmax = 6
         call annotate_lock_acquire(0)
         jWork(nf2+idof)=jWork(nf2+idof)+1
         call annotate_lock_release(0)
      end do
   end do
   call annotate_task_end
end do
call annotate_site_end
The tool can then help you explore different threading models (for Fortran you’ll typically only look at OpenMP).
By the way, I’d recommend reading the resources from the European Performance Optimisation and Productivity (POP) Centre of Excellence.
Since kount is only compared against zero, the counting loop can be replaced with any:
if (any(lnods(jelem,1:lnomax) == inode)) then
   do nod = 1, lnomax
      ! ...
   end do
end if
The discontiguous access on the second dimension is not ideal IMO, since the first dimension of lnods is of size nelz (over a million). Hopefully the compiler at least recognizes that the row offset is constant in the inner loops, and then reuses the cached row lnods(jelem,:) in the second loop.
do jelem=1,nelz
   if (any(lnods(jelem,1:lnomax) == inode)) then
      do nod=1,lnomax
         jnode=lnods(jelem,nod)
         nf2=(jnode-1)*ndfmax
         do idof=1,ndfmax
            jWork(nf2+idof)=jWork(nf2+idof)+1
         end do
      end do
   end if
end do
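If the discontiguous access really is the bottleneck, one option (a sketch; lnodsT is a hypothetical name, not from the original code) is to store the connectivity transposed, so that each element's node numbers are contiguous in memory:

integer, allocatable :: lnodsT(:,:)

! hypothetical transposed copy: element index last, so that
! lnodsT(:,jelem) is a contiguous run of lnomax node numbers
allocate(lnodsT(lnomax, nelz))
lnodsT = transpose(lnods(1:nelz, 1:lnomax))

do jelem=1,nelz
   if (any(lnodsT(1:lnomax, jelem) == inode)) then  ! contiguous read
      do nod=1,lnomax
         jnode=lnodsT(nod, jelem)
         nf2=(jnode-1)*ndfmax
         jWork(nf2+1:nf2+ndfmax) = jWork(nf2+1:nf2+ndfmax) + 1
      end do
   end if
end do

The one-off transpose costs memory and bandwidth, so it only pays if lnods is traversed this way many times (here, once per point of the outer loop).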