I looked at the code free_vortices_2d.f90, as it appears to be stand-alone, and tried to clean up the OMP implementation by:
#moving arrays into a module, to avoid any possible stack problems (your arrays could be allocatable and smaller? see the sketch after the module below)
module vor_arrays
integer, parameter :: max_m = 100000
REAL, DIMENSION(max_m,3) :: vortices !-- array holds 2D vortices
REAL, DIMENSION(max_m,3) :: vor !-- temporary array holds 2D vortices
end module vor_arrays
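For reference, a minimal sketch of that allocatable alternative; the helper allocate_vortices is my invention, just to show the shape of the change:
module vor_arrays
integer :: max_m = 0
REAL, DIMENSION(:,:), ALLOCATABLE :: vortices !-- array holds 2D vortices
REAL, DIMENSION(:,:), ALLOCATABLE :: vor !-- temporary array holds 2D vortices
contains
subroutine allocate_vortices (m) !-- size the arrays once the vortex count is known
integer, intent(in) :: m
max_m = m
allocate ( vortices(max_m,3), vor(max_m,3) )
end subroutine allocate_vortices
end module vor_arrays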
#including the following code to control the number of threads and test alternatives; a sketch of one way to set num_threads follows it.
call omp_set_num_threads ( num_threads )
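As a sketch, num_threads could be taken from the first command-line argument, so thread counts can be compared without recompiling (the variable names here are my assumptions, not from your code):
use omp_lib
integer :: num_threads, stat
character(len=16) :: arg
num_threads = omp_get_max_threads () !-- default to the OpenMP limit
call get_command_argument ( 1, arg, status=stat )
if ( stat == 0 ) read (arg, *, iostat=stat) num_threads
call omp_set_num_threads ( num_threads )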
#moving the following code into subroutine rvelocity to clean up the OMP usage,
especially explicitly defining the SHARED and PRIVATE attributes of all variables.
I think this may have been a problem with your tests, especially array vor; a schematic of the loop the directive controls is shown below it.
!$OMP PARALLEL DO SHARED ( vortices, vor, M, DT, sigma ) &
!$OMP& PRIVATE (i, x,y,w, dx,dy, j, qx,qy,qw, pqx,pqy,rpq, gexp_value )
#Included functions elapse_seconds () and delta_seconds () to provide timing
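These are something like the following SYSTEM_CLOCK wrappers; a minimal sketch of the shape (the attached file has the exact versions):
real function elapse_seconds () !-- wall-clock seconds from SYSTEM_CLOCK
integer(8) :: tick, rate
call system_clock ( tick, rate )
elapse_seconds = real (tick) / real (rate)
end function elapse_seconds

real function delta_seconds () !-- seconds since the previous call
real, external :: elapse_seconds
real :: now
real, save :: last = -1.0
now = elapse_seconds ()
if ( last < 0.0 ) last = now
delta_seconds = now - last
last = now
end function delta_seconds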
#I did not like your file unit numbers, so I used lu_NIT = 11 for file access.
#At the end of the run I got the following message:
“Note : The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG”
This can occur with randomly generated data, and these exceptions can extend run times.
Changing to 8-byte reals may help, due to their extended exponent range.
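The change is mostly mechanical with a kind parameter; a minimal sketch:
module vor_arrays
integer, parameter :: wp = selected_real_kind (15) !-- 8-byte reals
integer, parameter :: max_m = 100000
REAL(wp), DIMENSION(max_m,3) :: vortices !-- array holds 2D vortices
REAL(wp), DIMENSION(max_m,3) :: vor !-- temporary array holds 2D vortices
end module vor_arrays
Real literals in the kernels then need the suffix as well (e.g. 2.0_wp), so the arithmetic stays in 8 bytes.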
The code I generated is:
free_vortices2d.f90 (5.2 KB)
My Windows build is:
set prog=free_vortices2d
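rem basic/vec and omps/omp are alternative option sets for testing; swap them into the gfortran line below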
set basic=-fimplicit-none -fallow-argument-mismatch -O2 -march=native -ffast-math
set vec=-fimplicit-none -fallow-argument-mismatch -O3 -march=native -ffast-math -funroll-loops --param max-unroll-times=2
set omps=-fopenmp -fstack-arrays
set omp=-fopenmp
del %prog%.exe
gfortran %prog%.f90 %basic% %omp% -v -o %prog%.exe
%prog%
The net result is that performance is closely related to the number of threads.
Some points may be helpful, while others may just be arbitrary style preferences.