The two main OpenMP loop scheduling options are static
and dynamic
. The latter can be more efficient if the workloads are not balanced between the iterations, but less efficient in the case of balanced workloads (because the scheduling itself is more expensive).
The nonmonotonic:dynamic
option that appeared in recent specifications is supposed to address this problem, by getting more or less the best of the two worlds: it starts like a static scheduling, but with a chunk size of 1 (by default, like the default chuck size of the dynamic
option), and a thread that completes early all of its pre-attributed threads can “steal” the non-processed chunks from the other threads (at least this is my understanding).
However, when testing it, it’s not really what happens. I have tested the following code with gfortran 13, ifort 18, 19, and 21 (unfortunately I don’t have access to more recent versions).
In short, it used 2 threads to execute 4 iterations. The first iteration is artificially delayed to see how the next iterations are attributed to the threads:
program foo
implicit none
integer, parameter :: N=4
integer :: i, t(N), o, order(N)
real :: x(N)
integer, external :: omp_get_thread_num
call omp_set_num_threads(2)
o = 0
!$OMP PARALLEL DO SCHEDULE(static)
do i = 1, N
call loopcontent(i,o,order,x,t)
end do
!$OMP END PARALLEL DO
call print_results("STATIC")
o = 0
!$OMP PARALLEL DO SCHEDULE(dynamic)
do i = 1, N
call loopcontent(i,o,order,x,t)
end do
!$OMP END PARALLEL DO
call print_results("DYNAMIC")
o = 0
!$OMP PARALLEL DO SCHEDULE(nonmonotonic:dynamic)
do i = 1, N
call loopcontent(i,o,order,x,t)
end do
!$OMP END PARALLEL DO
call print_results("NONMONOTONIC:DYNAMIC")
contains
subroutine loopcontent(i,o,order,x,t)
integer, intent(in) :: i
integer, intent(inout) :: o, order(:), t(:)
real, intent(inout) :: x(:)
integer :: j
!$OMP CRITICAL
o = o+1
order(i) = o
!$OMP END CRITICAL
x(i) = 0.0
if (i == 1) then
do j = 1, 10000000
x(i) = cos(x(i))
end do
end if
t(i) = omp_get_thread_num()
end subroutine
subroutine print_results(what)
character(*), intent(in) :: what
print*
print "(A)", "==============================="
print "(A)", what
print "(A)", " loop thread order"
do i = 1, N
print "(3I7,F10.3)", i, t(i), order(i), x(i)
end do
end subroutine
end
The 4 compilers behave the same (and as expected) or the static
and dynamic
cases, so I’m just pasting one output:
===============================
STATIC
loop thread order
1 0 1 0.739
2 0 4 0.000
3 1 2 0.000
4 1 3 0.000
===============================
DYNAMIC
loop thread order
1 0 1 0.739
2 1 2 0.000
3 1 3 0.000
4 1 4 0.000
Regarding the nonmonotonic:dynamic
option, gfortran 13 and ifort 18/21 behave the same, and the execution order is not different from the dynamic
option. Instead, I would expect the thread 1 to execute the iterations 2, 4, 3, in this order (the iteration 3 being “stolen” to the thread 0 once the iteration 2, 4 are completed):
===============================
NONMONOTONIC:DYNAMIC
loop thread order
1 0 1 0.739
2 1 2 0.000
3 1 3 0.000
4 1 4 0.000
And ifort 19 behaves like with the static
option, so still not as expected:
===============================
NONMONOTONIC:DYNAMIC
loop thread order
1 0 1 0.739
2 0 4 0.000
3 1 2 0.000
4 1 3 0.000
Is my understanding of nonmonotonic:dynamic
wrong, or is it an (non-)implementation issue (the compilers recognize the option, but just map it to static
or dynamic
)? Could someone test with a more recent ifx version?