Slow thread creation with nested loops in GFortran

Hello,

I recently ran into an issue with very slow thread creation for nested loops in code compiled with GFortran.

Here is a simplified form of the code:

program test
implicit none
integer l
!$OMP PARALLEL DO &
!$OMP NUM_THREADS(1)
do l=1,1000
  call foo
end do
!$OMP END PARALLEL DO
end program

subroutine foo
implicit none
integer, parameter :: l=200,m=100,n=10
! number of threads
integer, parameter :: nthd=10
integer i,j
! automatic arrays
real(8) a(n,l),b(n,m),x(m)
a(:,:)=2.d0
b(:,:)=3.d0
do i=1,l
!$OMP PARALLEL DO DEFAULT(SHARED) &
!$OMP NUM_THREADS(nthd)
  do j=1,m
    x(j)=dot_product(a(:,i),b(:,j))
  end do
!$OMP END PARALLEL DO
end do
end subroutine

The wall-clock time is about 0.5 seconds when compiled with Intel or PGI Fortran. However, for GFortran compiled with

gfortran -O3 -fopenmp test.f90

and OMP_NESTED set to true, the wall-clock time is about 70 seconds, or about 140 times slower. (The `dot_product` can even be removed from the loop; essentially all the time is spent in thread creation.)

This only affects nested loops; if the OMP directives are removed from the loop in the program part in the code above then GFortran is as fast as the other compilers. I’ve tried several different versions of GFortran (from 7.5.0 to 12.1.0) on different machines and it’s slow on all of them.

It may be a problem with libgomp. If I substitute the libgomp library with the one provided with the NVIDIA compiler (formerly PGI), then it's as fast as the others.

I’d like to submit a bug report to GCC but I wondered if anyone on Fortran Discourse could reproduce the problem first.


I copied your code as posted and ran it on Win 10 64-bit using GFortran 11.1.0 on an i7-8700K with 32 GB of memory (which supports 12 threads).
My initial run omitted -fopenmp and ran in 0.05 seconds.

However, when including -fopenmp in the build, it ran for 64 seconds and then stopped with an error report:
“libgomp: Thread creation failed: Resource temporarily unavailable.”

I have reproduced your problem.

However, I am not familiar with using OMP_NESTED and wonder if your approach of limiting the thread count is working as required?

My build .bat file is:


set program=%1
set tce=%program%.log

del %program%.exe

set options=-v -fimplicit-none -fallow-argument-mismatch -O3 -march=native -ffast-math -fstack-arrays -fopenmp
set link_options=-Wl,-stack,32000000,-Map=%program%.map -o %program%.exe

set OMP_NESTED=true

gfortran %program%.f90 %options% %link_options% >> %tce% 2>&1

dir %program%.* /od >> %tce%

audit /start  >> %tce%
%program%     >> %tce%
audit /end    >> %tce%

notepad %tce%

The truncated output ( removing most of -v ) is

Driving: gfortran jdk2022.f90 -v -fimplicit-none -fallow-argument-mismatch -O3 -march=native -ffast-math -fstack-arrays -Wl,-stack,32000000,-Map=jdk2022.map -o jdk2022.exe -l gfortran
Built by Equation Solution <http://www.Equation.com>.
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_GCC_OPTIONS='-v' '-fimplicit-none' '-fallow-argument-mismatch' '-O3' '-march=native' '-ffast-math' '-fstack-arrays' '-o' 'jdk2022.exe' '-dumpdir' 'jdk2022.'
 Volume in drive C has no label.
 Volume Serial Number is 7443-5314

 Directory of C:\forum\memory

21/01/2023  11:51 AM               540 jdk2022.f90
21/01/2023  12:11 PM            17,405 jdk2022.log
21/01/2023  12:11 PM           739,569 jdk2022.map
21/01/2023  12:11 PM         1,755,203 jdk2022.exe
               4 File(s)      2,512,717 bytes
               0 Dir(s)  331,918,966,784 bytes free
[AUDIT Ver 1.21] Saturday, 21 January 2023 at 12:11:17.994
[AUDIT Ver 1.21] elapse        0.049 seconds: Saturday, 21 January 2023 at 12:11:18.041        0.953

Driving: gfortran jdk2022.f90 -v -fimplicit-none -fallow-argument-mismatch -O3 -march=native -ffast-math -fstack-arrays -fopenmp -Wl,-stack,32000000,-Map=jdk2022.map -o jdk2022.exe -l gfortran
Built by Equation Solution <http://www.Equation.com>.
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=c:/program\ files\ (x86)/gcc_eq/gcc_11.3.0/bin/../libexec/gcc/x86_64-w64-mingw32/11.3.0/lto-wrapper.exe
Target: x86_64-w64-mingw32
Configured with: ../gcc-11.3.0/configure --host=x86_64-w64-mingw32 --build=x86_64-unknown-linux-gnu --target=x86_64-w64-mingw32 --prefix=/home/gfortran/gcc-home/binary/mingw32/native/x86_64/gcc/11.3.0 --with-sysroot=/home/gfortran/gcc-home/binary/mingw32/cross/x86_64/gcc/12-20220403 --with-gcc --with-gnu-ld --with-gnu-as --with-ld64=no --with-gmp=/home/gfortran/gcc-home/binary/mingw32/native/x86_64/gmp --with-mpfr=/home/gfortran/gcc-home/binary/mingw32/native/x86_64/mpfr --with-mpc=/home/gfortran/gcc-home/binary/mingw32/native/x86_64/mpc --with-cloog=/home/gfortran/gcc-home/binary/mingw32/native/x86_64/cloog --with-libiconv-prefix=/home/gfortran/gcc-home/binary/mingw32/native/x86_64/libiconv --with-diagnostics-color=auto --enable-cloog-backend=isl --enable-targets=i686-w64-mingw32,x86_64-w64-mingw32 --enable-lto --enable-languages=c,c++,fortran --enable-threads=win32 --enable-static --enable-shared=lto-plugin --enable-plugins --enable-ld=yes --enable-libquadmath --enable-libquadmath-support --enable-libgomp --disable-checking --disable-nls --disable-tls --disable-win32-registry
Thread model: win32
Supported LTO compression algorithms: zlib
gcc version 11.3.0 (GCC) 
COLLECT_GCC_OPTIONS='-v' '-fimplicit-none' '-fallow-argument-mismatch' '-O3' '-march=native' '-ffast-math' '-fstack-arrays' '-fopenmp' '-o' 'jdk2022.exe' '-mthreads' '-pthread'
 c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../libexec/gcc/x86_64-w64-mingw32/11.3.0/f951.exe jdk2022.f90 -march=skylake -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mno-sse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mno-clwb -mno-clzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mhle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mno-rdpid -mrdrnd -mrdseed -mrtm -mno-serialize -msgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=12288 -mtune=skylake -quiet -dumpbase jdk2022.f90 -dumpbase-ext .f90 -mthreads -O3 -version -fimplicit-none -fallow-argument-mismatch -ffast-math -fstack-arrays -fopenmp -fintrinsic-modules-path c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/finclude -o C:\Users\John\AppData\Local\Temp\ccTq8oEh.s
GNU Fortran (GCC) version 11.3.0 (x86_64-w64-mingw32)
	compiled by GNU C version 12.0.1 20220401 (experimental), GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version none
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
GNU Fortran2008 (GCC) version 11.3.0 (x86_64-w64-mingw32)
	compiled by GNU C version 12.0.1 20220401 (experimental), GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version none
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
COLLECT_GCC_OPTIONS='-v' '-fimplicit-none' '-fallow-argument-mismatch' '-O3' '-march=native' '-ffast-math' '-fstack-arrays' '-fopenmp' '-o' 'jdk2022.exe' '-mthreads' '-pthread'
 c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/../../../../x86_64-w64-mingw32/bin/as.exe -v -o C:\Users\John\AppData\Local\Temp\ccwyhTot.o C:\Users\John\AppData\Local\Temp\ccTq8oEh.s
GNU assembler version 2.37 (x86_64-w64-mingw32) using BFD version (GNU Binutils) 2.37
Reading specs from c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/../../../../lib/libgfortran.spec
rename spec lib to liborig
COLLECT_GCC_OPTIONS='-v' '-fimplicit-none' '-fallow-argument-mismatch' '-O3' '-march=native' '-ffast-math' '-fstack-arrays' '-fopenmp' '-o' 'jdk2022.exe' '-mthreads' '-pthread'
COMPILER_PATH=c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../libexec/gcc/x86_64-w64-mingw32/11.3.0/;c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../libexec/gcc/;c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/../../../../x86_64-w64-mingw32/bin/
LIBRARY_PATH=c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/;c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/;c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/../../../../x86_64-w64-mingw32/lib/../lib/;c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/../../../../lib/;c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/../../../../x86_64-w64-mingw32/lib/;c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/../../../
Reading specs from c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/../../../../lib/libgomp.spec
COLLECT_GCC_OPTIONS='-v' '-fimplicit-none' '-fallow-argument-mismatch' '-O3' '-march=native' '-ffast-math' '-fstack-arrays' '-fopenmp' '-o' 'jdk2022.exe' '-mthreads' '-pthread' '-dumpdir' 'jdk2022.'
 c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../libexec/gcc/x86_64-w64-mingw32/11.3.0/collect2.exe -plugin c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../libexec/gcc/x86_64-w64-mingw32/11.3.0/liblto_plugin.dll -plugin-opt=c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../libexec/gcc/x86_64-w64-mingw32/11.3.0/lto-wrapper.exe -plugin-opt=-fresolution=C:\Users\John\AppData\Local\Temp\cc51OJxJ.res -plugin-opt=-pass-through=-lmingwthrd -plugin-opt=-pass-through=-lmingw32 -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lmoldname -plugin-opt=-pass-through=-lmingwex -plugin-opt=-pass-through=-lmsvcrt -plugin-opt=-pass-through=-lkernel32 -plugin-opt=-pass-through=-lquadmath -plugin-opt=-pass-through=-lm -plugin-opt=-pass-through=-lmingwthrd -plugin-opt=-pass-through=-lmingw32 -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lmoldname -plugin-opt=-pass-through=-lmingwex -plugin-opt=-pass-through=-lmsvcrt -plugin-opt=-pass-through=-lkernel32 -plugin-opt=-pass-through=-lpthread -plugin-opt=-pass-through=-ladvapi32 -plugin-opt=-pass-through=-lshell32 -plugin-opt=-pass-through=-luser32 -plugin-opt=-pass-through=-lkernel32 -plugin-opt=-pass-through=-lmingwthrd -plugin-opt=-pass-through=-lmingw32 -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lmoldname -plugin-opt=-pass-through=-lmingwex -plugin-opt=-pass-through=-lmsvcrt -plugin-opt=-pass-through=-lkernel32 --sysroot=/home/gfortran/gcc-home/binary/mingw32/cross/x86_64/gcc/12-20220403 -m i386pep -Bdynamic -o jdk2022.exe c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/../../../../x86_64-w64-mingw32/lib/../lib/crt2.o c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/crtbegin.o -Lc:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0 -Lc:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc -Lc:/program files 
(x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/../../../../x86_64-w64-mingw32/lib/../lib -Lc:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/../../../../lib -Lc:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/../../../../x86_64-w64-mingw32/lib -Lc:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/../../.. C:\Users\John\AppData\Local\Temp\ccwyhTot.o -stack 32000000 -Map=jdk2022.map -lgfortran -lgomp -ldl -lmingwthrd -lmingw32 -lgcc -lmoldname -lmingwex -lmsvcrt -lkernel32 -lquadmath -lm -lmingwthrd -lmingw32 -lgcc -lmoldname -lmingwex -lmsvcrt -lkernel32 -lpthread -ladvapi32 -lshell32 -luser32 -lkernel32 -lmingwthrd -lmingw32 -lgcc -lmoldname -lmingwex -lmsvcrt -lkernel32 c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/crtfastmath.o c:/program files (x86)/gcc_eq/gcc_11.3.0/bin/../lib/gcc/x86_64-w64-mingw32/11.3.0/crtend.o
COLLECT_GCC_OPTIONS='-v' '-fimplicit-none' '-fallow-argument-mismatch' '-O3' '-march=native' '-ffast-math' '-fstack-arrays' '-fopenmp' '-o' 'jdk2022.exe' '-mthreads' '-pthread' '-dumpdir' 'jdk2022.'
 Volume in drive C has no label.
 Volume Serial Number is 7443-5314

 Directory of C:\forum\memory

21/01/2023  11:51 AM               540 jdk2022.f90
21/01/2023  12:18 PM         1,010,116 jdk2022.map
21/01/2023  12:18 PM         2,744,994 jdk2022.exe
21/01/2023  12:18 PM            26,932 jdk2022.log
               4 File(s)      3,782,582 bytes
               0 Dir(s)  331,920,576,512 bytes free
[AUDIT Ver 1.21] Saturday, 21 January 2023 at 12:18:05.453
[AUDIT Ver 1.21] elapse       63.657 seconds: Saturday, 21 January 2023 at 12:19:09.111        1.000

It replicates with gfortran 11 on Ubuntu on several different boxes. strace shows it spawning clone() calls as the number of threads increases …

#!/bin/bash
exec 2>&1
gfortran -fbacktrace -fopenmp main.f90
time env OMP_THREAD_LIMIT=1 ./a.out
time env OMP_THREAD_LIMIT=2 ./a.out
time env OMP_THREAD_LIMIT=4 ./a.out
strace ./a.out 2>&1|head -n 1000|tail -10
#OMP_NESTED=FALSE OMP_NUM_THREADS=1 OMP_THREAD_LIMIT=1
exit
real    0m0.706s
user    0m0.658s
sys     0m0.049s

real    0m8.403s
user    0m4.464s
sys     0m8.916s

real    0m56.682s
user    2m23.910s
sys     0m23.988s

clone(child_stack=0x14a68f23cf30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2770746], tls=0x14a68f23d700, child_tidptr=0x14a68f23d9d0) = 2770746
clone(child_stack=0x14a68ea38f30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2770747], tls=0x14a68ea39700, child_tidptr=0x14a68ea399d0) = 2770747
clone(child_stack=0x14a68e837f30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2770748], tls=0x14a68e838700, child_tidptr=0x14a68e8389d0) = 2770748
clone(child_stack=0x14a68f03bf30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2770749], tls=0x14a68f03c700, child_tidptr=0x14a68f03c9d0) = 2770749
clone(child_stack=0x14a68e435f30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2770750], tls=0x14a68e436700, child_tidptr=0x14a68e4369d0) = 2770750
clone(child_stack=0x14a68e234f30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2770751], tls=0x14a68e235700, child_tidptr=0x14a68e2359d0) = 2770751
clone(child_stack=0x14a68ee3af30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2770752], tls=0x14a68ee3b700, child_tidptr=0x14a68ee3b9d0) = 2770752
clone(child_stack=0x14a68e636f30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[2770753], tls=0x14a68e637700, child_tidptr=0x14a68e6379d0) = 2770753
futex(0x55cb7f4fb0b4, FUTEX_WAKE_PRIVATE, 2147483647) = 9
futex(0x55cb7f4fb0b4, FUTEX_WAKE_PRIVATE, 2147483647) = 9

The following modified code does not exhibit the problem.

I defined x as PRIVATE, which I think is important. (I also used implicit none and DEFAULT(NONE)!)
I also reduced the loop and thread counts for practicality, but I don't think this is the issue.

program test
use omp_lib
implicit none
integer l
!$OMP PARALLEL DO &
!$OMP& NUM_THREADS(1)
do l=1,10 ! 1000
  write (*,*) 'Primary thread=',omp_get_thread_num(),l
  call foo
end do
!$OMP END PARALLEL DO
end program

subroutine foo
use omp_lib
implicit none
integer, parameter :: l=200,m=100,n=10
! number of threads
integer, parameter :: nthd=10
integer i,j,id
! automatic arrays
real(8) a(n,l),b(n,m),x(m)
integer thread_report(0:nthd)

a(:,:)=2.d0
b(:,:)=3.d0
thread_report = 0

do i=1,l
  if ( i <= 2 ) thread_report = 0
  !$OMP PARALLEL DO DEFAULT (none) SHARED(i,a,b,thread_report) PRIVATE(id,j,x) &
  !$OMP& NUM_THREADS(nthd)
    do j=1,m

      id = omp_get_thread_num()
      thread_report(id) = thread_report(id) + 1
      if ( thread_report(id) < 4 ) write (*,*) '  Secondary thread=',id,i,j

      x(j) = dot_product(a(:,i),b(:,j))

    end do
  !$OMP END PARALLEL DO
end do

end subroutine

Note: the code above has been updated to limit the reporting that verifies thread use.

I notice in my output that thread “0” is used as both a primary and a secondary thread. Could this cause a problem?
The following adaptation, with 2 primary threads in the main program, each with 5 secondary threads in foo, produces an apparent conflict between the thread id reported by omp_get_thread_num() and the alternate primary thread. I am not familiar with this nested operation, but it looks concerning.

program test
use omp_lib
implicit none
integer l,tm

!$OMP PARALLEL DO PRIVATE(l,tm) &
!$OMP& NUM_THREADS(2)
do l=1,10 ! 1000
  tm = omp_get_thread_num()
  write (*,*) 'Primary thread=',tm,l
  call foo (l,tm)
end do
!$OMP END PARALLEL DO

end program

subroutine foo (im,tm)
use omp_lib
implicit none
integer, parameter :: l=200,m=100,n=10
! number of threads
integer, parameter :: nthd=12
integer im,tm, i,j,id
! automatic arrays
real(8) a(n,l),b(n,m),x(m)
integer thread_report(0:nthd)

a(:,:)=2.d0
b(:,:)=3.d0
thread_report = 0

do i=1,l
  if ( i <= 2 ) thread_report = 0
  !$OMP PARALLEL DO DEFAULT (none) SHARED(i,a,b,thread_report,tm) PRIVATE(id,j,x) &
  !$OMP& NUM_THREADS(5)
    do j=1,m

      id = omp_get_thread_num()
      thread_report(id) = thread_report(id) + 1
      if ( thread_report(id) < 4 ) write (*,*) '  Secondary thread=',id,i,j,tm

      x(j) = dot_product(a(:,i),b(:,j))

    end do
  !$OMP END PARALLEL DO
end do
write (*,*) im,tm,' :',thread_report

end subroutine

Why would you want to do that? There’s no race condition on x(:) in the initial code.

Perhaps you are correct, although it did appear to remove the problem.

I am wondering what -O3 will do in foo with the do i / do j loops. Since x is computed but never used, a lot could be optimised out.

I also changed the program to
!$OMP PARALLEL DO &
!$OMP NUM_THREADS(2)
do l=1,1000

and subroutine foo to
!$OMP PARALLEL DO DEFAULT (none) SHARED(i,a,b,thread_report,tm) PRIVATE(id,j,x) &
!$OMP& NUM_THREADS(5)
do j=1,m
id = omp_get_thread_num()

I found that:
“program omp” used threads 0 & 1
“foo omp” used threads 0 to 4 (based on the value of id)
I find this result surprising, as it is not what I would expect from nested OMP.