Hello,
I recently ran into an issue with very slow thread creation for nested OpenMP parallel loops in code compiled with GFortran.
Here is a simplified form of the code:
program test
  implicit none
  integer l
!$OMP PARALLEL DO &
!$OMP NUM_THREADS(1)
  do l=1,1000
    call foo
  end do
!$OMP END PARALLEL DO
end program

subroutine foo
  implicit none
  integer, parameter :: l=200,m=100,n=10
  ! number of threads
  integer, parameter :: nthd=10
  integer i,j
  ! automatic arrays
  real(8) a(n,l),b(n,m),x(m)
  a(:,:)=2.d0
  b(:,:)=3.d0
  do i=1,l
!$OMP PARALLEL DO DEFAULT(SHARED) &
!$OMP NUM_THREADS(nthd)
    do j=1,m
      x(j)=dot_product(a(:,i),b(:,j))
    end do
!$OMP END PARALLEL DO
  end do
end subroutine
The wall-clock time is about 0.5 seconds when compiled with Intel or PGI Fortran. However, with GFortran, compiling with
gfortran -O3 -fopenmp test.f90
and running with OMP_NESTED set to true, the wall-clock time is about 70 seconds, or about 140 times slower. (The ‘dot_product’ can be removed from the loop; all the time is taken up by thread creation.)
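For anyone trying to reproduce this, a stripped-down timing sketch along the following lines may help isolate the thread-creation cost. It keeps the same loop structure but replaces the ‘dot_product’ with a trivial assignment and times the whole thing with omp_get_wtime. The program name time_nested and the variables nouter, ninner and nteam are just illustrative names, not part of the original test. The small probe at the start is only there to check that nesting is really active.

```fortran
program time_nested
  use omp_lib
  implicit none
  ! same counts as the original test: 1000 outer iterations, 200 inner regions each
  integer, parameter :: nouter=1000, ninner=200, m=100
  ! number of threads requested for each inner region
  integer, parameter :: nthd=10
  integer :: l, i, j, nteam
  real(8) :: x(m), t0, t1

  ! probe: record the team size actually obtained by a nested region
  nteam = 0
  !$OMP PARALLEL NUM_THREADS(1)
  !$OMP PARALLEL NUM_THREADS(nthd)
  !$OMP SINGLE
  nteam = omp_get_num_threads()
  !$OMP END SINGLE
  !$OMP END PARALLEL
  !$OMP END PARALLEL

  t0 = omp_get_wtime()
  !$OMP PARALLEL DO NUM_THREADS(1) PRIVATE(i)
  do l = 1, nouter
    do i = 1, ninner
      ! trivial body, so the cost is dominated by starting and joining the team
      !$OMP PARALLEL DO DEFAULT(SHARED) NUM_THREADS(nthd)
      do j = 1, m
        x(j) = 1.d0
      end do
      !$OMP END PARALLEL DO
    end do
  end do
  !$OMP END PARALLEL DO
  t1 = omp_get_wtime()

  print '(a,i0)', 'inner team size: ', nteam
  print '(a,f10.3,a)', 'wall-clock time: ', t1 - t0, ' s'
  print '(a,es12.4,a)', 'time per inner region: ', (t1 - t0)/real(nouter*ninner, 8), ' s'
end program
```

This should be compiled and run the same way as the test above (with -fopenmp and OMP_NESTED set to true). If the reported inner team size is 1, nesting is not active and the timing is not comparable.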
The slowdown only affects nested parallel loops; if the OMP directives are removed from the loop in the main program above, then GFortran is as fast as the other compilers. I’ve tried several versions of GFortran (from 7.5.0 to 12.1.0) on different machines and it’s slow on all of them.
It may be a problem with libgomp. If I replace the libgomp library with the one provided with the NVIDIA compiler (formerly PGI), then it’s as fast as the others.
I’d like to submit a bug report to GCC but I wondered if anyone on Fortran Discourse could reproduce the problem first.