OpenMP Parallel Loop

inode MUST NOT be private

gfortran will handle that correctly (it generally uses the heap instead of the stack for arrays above a given size). Otherwise you can indeed protect the jWork update with an atomic directive, or allocate your own private copy of jWork in each thread and perform a manual reduction at the end.

    ! Option 1: protect the shared update with an atomic directive
    !$OMP PARALLEL DO PRIVATE(kount,nod,jnode,nf2,idof)
    do jelem=1,nelz   ! nelz > 1 million
       kount = count(lnods(jelem,1:lnomax) == inode)
       if (kount == 0) cycle
       do nod=1,lnomax
          jnode = lnods(jelem,nod)
          nf2 = (jnode-1)*ndfmax
          do idof = 1, ndfmax
             !$OMP ATOMIC UPDATE
             jWork(nf2+idof) = jWork(nf2+idof) + 1
          end do
       end do
    end do
    !$OMP END PARALLEL DO
    ! Option 2: per-thread copy of jWork, with a manual reduction at the end
    !$OMP PARALLEL PRIVATE(kount,nod,jnode,nf2,idof,jWork_local)
    allocate( jWork_local(size(jWork)), source=0 )
    !$OMP DO
    do jelem=1,nelz   ! nelz > 1 million
       kount = count(lnods(jelem,1:lnomax) == inode)
       if (kount == 0) cycle
       do nod=1,lnomax
          jnode = lnods(jelem,nod)
          nf2 = (jnode-1)*ndfmax
          do idof = 1, ndfmax
             jWork_local(nf2+idof) = jWork_local(nf2+idof) + 1
          end do
       end do
    end do
    !$OMP END DO
    !$OMP CRITICAL
    jWork(:) = jWork(:) + jWork_local(:)
    !$OMP END CRITICAL
    deallocate( jWork_local )
    !$OMP END PARALLEL

For long vectors, there might be a significant difference between these two tests. The count() intrinsic must scan the entire vector, while the any() intrinsic can exit as soon as it finds the first matching value.
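A minimal sketch of the two variants, reusing the names from the code above:

    ! count() examines all lnomax entries of the row before testing:
    kount = count(lnods(jelem,1:lnomax) == inode)
    if (kount == 0) cycle
    ! any() may stop at the first entry equal to inode:
    if (.not. any(lnods(jelem,1:lnomax) == inode)) cycle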


You need to investigate how to change the stack size for each thread. I use gfortran on Windows and set the stack size to 512 MBytes, which then sets each thread's stack size to 512 MBytes. (The default thread stack size can vary between compilers and operating systems.)

My build script is:

:: trace file and compiler
set tce=load_gf.tce
set load=gfortran

:: 536870912 bytes = 512 MBytes main stack, which also becomes each OpenMP thread's stack size
set stack_options=-Wl,-stack,536870912,-Map=program.map

del program.map
del program.exe

now >>%tce%

%load% @load_gf.txt -fopenmp -fstack-arrays %stack_options% >>%tce%  2>&1

dir program.* /od >>%tce%

notepad %tce%

You can very easily confirm this by using “write ( * , * ) loc(private_variable), loc(shared_variable)” in each thread.
A large stack only affects the virtual memory space, not the physical memory space, so on a 64-bit OS this approach should be the default!
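A minimal, self-contained sketch of that check (the variable names are just placeholders; loc() is a common extension supported by gfortran and ifort):

    program check_addresses
       use omp_lib
       implicit none
       integer :: shared_var, private_var
       shared_var = 0
       !$OMP PARALLEL PRIVATE(private_var)
       private_var = omp_get_thread_num()
       ! each thread prints a different address for its private copy
       ! (on its own thread stack) and the same address for the shared variable
       write (*,*) omp_get_thread_num(), loc(private_var), loc(shared_var)
       !$OMP END PARALLEL
    end program check_addresses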

I also expect that reducing such a large “jwork” array would be better avoided.

Reduction of jwork is not a good option for me.

You may be right, but please explain why it is not a good option, given the hints you have been given for avoiding the stack overflow issues.

By the way, we still don’t know whether jnode can take the same value in different iterations of the jelem loop. If it cannot, then no reduction or atomic directive is needed at all.
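A small illustrative sketch of why repeated jnode values matter (the connectivity here is hypothetical, chosen so that every iteration updates the same jWork entries):

    program race_demo
       implicit none
       integer, parameter :: nelz = 100000, lnomax = 2, ndfmax = 1
       integer, allocatable :: lnods(:,:), jWork(:)
       integer :: jelem, nod, jnode, nf2, idof
       allocate( lnods(nelz,lnomax), jWork(2*ndfmax) )
       lnods(:,1) = 1      ! every element references nodes 1 and 2,
       lnods(:,2) = 2      ! so all iterations touch the same jWork entries
       jWork = 0
       !$OMP PARALLEL DO PRIVATE(nod,jnode,nf2,idof)
       do jelem = 1, nelz
          do nod = 1, lnomax
             jnode = lnods(jelem,nod)
             nf2 = (jnode-1)*ndfmax
             do idof = 1, ndfmax
                ! unprotected shared update: a data race when jnode repeats,
                ! so the final counts are usually smaller than nelz
                jWork(nf2+idof) = jWork(nf2+idof) + 1
             end do
          end do
       end do
       !$OMP END PARALLEL DO
       write (*,*) 'expected', nelz, '  got', jWork
    end program race_demo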

I am using Intel oneAPI (2021) Fortran and Visual Studio 2017.
I haven’t figured out the stack setting part yet.

As said before, you can avoid the stack issues with a manual allocation and reduction. I am not saying that the reduction is the best option (probably the atomic option is better), just that it is a valid one.

And still:

jnode can have the same value in different iterations of the jelem loop.