I recently ran into an issue with very slow thread creation for nested loops in code compiled with GFortran.
Here is a simplified form of the code:
program test
implicit none
integer l
!$OMP PARALLEL DO &
!$OMP NUM_THREADS(1)
do l=1,1000
  call foo
end do
!$OMP END PARALLEL DO
end program

subroutine foo
implicit none
integer, parameter :: l=200,m=100,n=10
! number of threads
integer, parameter :: nthd=10
integer i,j
! automatic arrays
real(8) a(n,l),b(n,m),x(m)
a(:,:)=2.d0
b(:,:)=3.d0
do i=1,l
!$OMP PARALLEL DO DEFAULT(SHARED) &
!$OMP NUM_THREADS(nthd)
  do j=1,m
    x(j)=dot_product(a(:,i),b(:,j))
  end do
!$OMP END PARALLEL DO
end do
end subroutine
The wall-clock time is about 0.5 seconds when compiled with Intel or PGI Fortran. However, when compiled with GFortran using
gfortran -O3 -fopenmp test.f90
and OMP_NESTED set to true, the wall-clock time is about 70 seconds, or about 140 times slower. (The ‘dot_product’ can be removed from the loop without changing this: all the time is taken by thread creation.)
This only affects nested parallel loops; if the OMP directives are removed from the loop in the main program above, then GFortran is as fast as the other compilers. I’ve tried several different versions of GFortran (from 7.5.0 to 12.1.0) on different machines and it’s slow on all of them.
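For anyone who wants to check that the time goes into starting the inner thread teams rather than into the arithmetic, here is a rough timing sketch. It is not the code I originally ran: the names time_nested, nouter, ninner and cnt are made up for illustration, the loop counts simply mirror the example above, and the inner region does nothing except increment a per-thread counter. It assumes the same compile line and OMP_NESTED setting as above.

```fortran
program time_nested
use omp_lib
implicit none
! loop counts chosen to mirror the example above (illustrative names)
integer, parameter :: nouter=1000,ninner=200
! number of threads requested for the inner region
integer, parameter :: nthd=10
integer l,i
! one counter per inner thread, so the inner region is not empty
integer cnt(nthd)
real(8) t0,t1
cnt(:)=0
t0=omp_get_wtime()
!$OMP PARALLEL DO PRIVATE(i) &
!$OMP NUM_THREADS(1)
do l=1,nouter
  do i=1,ninner
! the inner region does almost no work, so the measured time is
! dominated by starting and joining the inner team
!$OMP PARALLEL DEFAULT(SHARED) &
!$OMP NUM_THREADS(nthd)
    cnt(omp_get_thread_num()+1)=cnt(omp_get_thread_num()+1)+1
!$OMP END PARALLEL
  end do
end do
!$OMP END PARALLEL DO
t1=omp_get_wtime()
print '(a,f10.3,a)','wall-clock time         : ',t1-t0,' s'
print '(a,es10.3,a)','time per inner region   : ',(t1-t0)/dble(nouter*ninner),' s'
! total number of inner-thread executions; this shows whether the
! nested regions really ran with more than one thread
print '(a,i10)','inner thread executions : ',sum(cnt)
end program
```

If nesting is not active, each inner region runs with a single thread and the last number is just nouter*ninner, so it doubles as a check that OMP_NESTED was picked up.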
It may be a problem with libgomp. If I replace libgomp with the OpenMP library provided with the NVIDIA compiler (formerly PGI), then it’s as fast as the others.
I’d like to submit a bug report to GCC but I wondered if anyone on Fortran Discourse could reproduce the problem first.