How does work OpenMP "nonmonotonic:dynamic" schedule?

The two main OpenMP loop scheduling options are static and dynamic. The latter can be more efficient if the workloads are not balanced between the iterations, but less efficient in the case of balanced workloads (because the scheduling itself is more expensive).

The nonmonotonic:dynamic option that appeared in recent specifications is supposed to address this problem, by getting more or less the best of the two worlds: it starts like a static scheduling, but with a chunk size of 1 (by default, like the default chuck size of the dynamic option), and a thread that completes early all of its pre-attributed threads can “steal” the non-processed chunks from the other threads (at least this is my understanding).

However, when testing it, it’s not really what happens. I have tested the following code with gfortran 13, ifort 18, 19, and 21 (unfortunately I don’t have access to more recent versions).

In short, it used 2 threads to execute 4 iterations. The first iteration is artificially delayed to see how the next iterations are attributed to the threads:

program foo
implicit none

    integer, parameter :: N=4
    integer :: i, t(N), o, order(N)
    real :: x(N)
    integer, external :: omp_get_thread_num

    call omp_set_num_threads(2)

    o = 0
    !$OMP PARALLEL DO SCHEDULE(static)
    do i = 1, N
        call loopcontent(i,o,order,x,t)
    end do
    !$OMP END PARALLEL DO
    call print_results("STATIC")

    o = 0
    !$OMP PARALLEL DO SCHEDULE(dynamic)
    do i = 1, N
        call loopcontent(i,o,order,x,t)
    end do
    !$OMP END PARALLEL DO
    call print_results("DYNAMIC")

    o = 0
    !$OMP PARALLEL DO SCHEDULE(nonmonotonic:dynamic)
    do i = 1, N
        call loopcontent(i,o,order,x,t)
    end do
    !$OMP END PARALLEL DO
    call print_results("NONMONOTONIC:DYNAMIC") 

contains 

    subroutine loopcontent(i,o,order,x,t)
        integer, intent(in) :: i
        integer, intent(inout) :: o, order(:), t(:)
        real, intent(inout) :: x(:)

        integer :: j

        !$OMP CRITICAL
        o = o+1
        order(i) = o
        !$OMP END CRITICAL
        x(i) = 0.0
        if (i == 1) then
            do j = 1, 10000000
                x(i) = cos(x(i))
            end do
        end if
        t(i) = omp_get_thread_num()
    end subroutine

    subroutine print_results(what)
        character(*), intent(in) :: what

        print*
        print "(A)", "==============================="
        print "(A)", what
        print "(A)", "    loop  thread  order"
        do i = 1, N
            print "(3I7,F10.3)", i, t(i), order(i), x(i)
        end do
    end subroutine

end

The 4 compilers behave the same (and as expected) or the static and dynamic cases, so I’m just pasting one output:

===============================
STATIC
    loop  thread  order
      1      0      1     0.739
      2      0      4     0.000
      3      1      2     0.000
      4      1      3     0.000

===============================
DYNAMIC
    loop  thread  order
      1      0      1     0.739
      2      1      2     0.000
      3      1      3     0.000
      4      1      4     0.000

Regarding the nonmonotonic:dynamic option, gfortran 13 and ifort 18/21 behave the same, and the execution order is not different from the dynamic option. Instead, I would expect the thread 1 to execute the iterations 2, 4, 3, in this order (the iteration 3 being “stolen” to the thread 0 once the iteration 2, 4 are completed):

===============================
NONMONOTONIC:DYNAMIC
    loop  thread  order
      1      0      1     0.739
      2      1      2     0.000
      3      1      3     0.000
      4      1      4     0.000

And ifort 19 behaves like with the static option, so still not as expected:

===============================
NONMONOTONIC:DYNAMIC
    loop  thread  order
      1      0      1     0.739
      2      0      4     0.000
      3      1      2     0.000
      4      1      3     0.000

Is my understanding of nonmonotonic:dynamic wrong, or is it an (non-)implementation issue (the compilers recognize the option, but just map it to static or dynamic)? Could someone test with a more recent ifx version?