How does work OpenMP "nonmonotonic:dynamic" schedule?

The two main OpenMP loop scheduling options are static and dynamic. The latter can be more efficient if the workloads are not balanced between the iterations, but less efficient in the case of balanced workloads (because the scheduling itself is more expensive).

The nonmonotonic:dynamic option that appeared in recent specifications is supposed to address this problem, by getting more or less the best of the two worlds: it starts like a static scheduling, but with a chunk size of 1 (by default, like the default chuck size of the dynamic option), and a thread that completes early all of its pre-attributed threads can “steal” the non-processed chunks from the other threads (at least this is my understanding).

However, when testing it, it’s not really what happens. I have tested the following code with gfortran 13, ifort 18, 19, and 21 (unfortunately I don’t have access to more recent versions).

In short, it used 2 threads to execute 4 iterations. The first iteration is artificially delayed to see how the next iterations are attributed to the threads:

program foo
implicit none

    integer, parameter :: N=4
    integer :: i, t(N), o, order(N)
    real :: x(N)
    integer, external :: omp_get_thread_num

    call omp_set_num_threads(2)

    o = 0
    !$OMP PARALLEL DO SCHEDULE(static)
    do i = 1, N
        call loopcontent(i,o,order,x,t)
    end do
    !$OMP END PARALLEL DO
    call print_results("STATIC")

    o = 0
    !$OMP PARALLEL DO SCHEDULE(dynamic)
    do i = 1, N
        call loopcontent(i,o,order,x,t)
    end do
    !$OMP END PARALLEL DO
    call print_results("DYNAMIC")

    o = 0
    !$OMP PARALLEL DO SCHEDULE(nonmonotonic:dynamic)
    do i = 1, N
        call loopcontent(i,o,order,x,t)
    end do
    !$OMP END PARALLEL DO
    call print_results("NONMONOTONIC:DYNAMIC") 

contains 

    subroutine loopcontent(i,o,order,x,t)
        integer, intent(in) :: i
        integer, intent(inout) :: o, order(:), t(:)
        real, intent(inout) :: x(:)

        integer :: j

        !$OMP CRITICAL
        o = o+1
        order(i) = o
        !$OMP END CRITICAL
        x(i) = 0.0
        if (i == 1) then
            do j = 1, 10000000
                x(i) = cos(x(i))
            end do
        end if
        t(i) = omp_get_thread_num()
    end subroutine

    subroutine print_results(what)
        character(*), intent(in) :: what

        print*
        print "(A)", "==============================="
        print "(A)", what
        print "(A)", "    loop  thread  order"
        do i = 1, N
            print "(3I7,F10.3)", i, t(i), order(i), x(i)
        end do
    end subroutine

end

The 4 compilers behave the same (and as expected) or the static and dynamic cases, so I’m just pasting one output:

===============================
STATIC
    loop  thread  order
      1      0      1     0.739
      2      0      4     0.000
      3      1      2     0.000
      4      1      3     0.000

===============================
DYNAMIC
    loop  thread  order
      1      0      1     0.739
      2      1      2     0.000
      3      1      3     0.000
      4      1      4     0.000

Regarding the nonmonotonic:dynamic option, gfortran 13 and ifort 18/21 behave the same, and the execution order is not different from the dynamic option. Instead, I would expect the thread 1 to execute the iterations 2, 4, 3, in this order (the iteration 3 being “stolen” to the thread 0 once the iteration 2, 4 are completed):

===============================
NONMONOTONIC:DYNAMIC
    loop  thread  order
      1      0      1     0.739
      2      1      2     0.000
      3      1      3     0.000
      4      1      4     0.000

And ifort 19 behaves like with the static option, so still not as expected:

===============================
NONMONOTONIC:DYNAMIC
    loop  thread  order
      1      0      1     0.739
      2      0      4     0.000
      3      1      2     0.000
      4      1      3     0.000

Is my understanding of nonmonotonic:dynamic wrong, or is it an (non-)implementation issue (the compilers recognize the option, but just map it to static or dynamic)? Could someone test with a more recent ifx version?

I have slightly modified the test code to better see the behaviors:

  • n=8 iterations instead of 4 (still with nt=2 threads)
  • as before, only the first has a significant (artificial) workload to ensure it’s delayed and takes much more time than all other iterations, which basically do nothing
  • testing explicitly the monotonic:dynamic case, as nonmonotonic is actually now the default for the dynamic scheduling
  • testing the default chunk (which is 1 for the dynamic case), but also chunks of sizes 2 and 4

The tests are performed with gfortran 13 and Intel ifx 2024. For each test 4 columns are printed:

  • The iteration number
  • The thread number that executes the given iteration
  • The “order” of the given iteration (e.g. “3” means that this iterations starts executing after 2 other iterations)
  • a real number that is supposed to be 0.739 for the 1st iteration after some fake computations, and 0.000 for the other iterations.

The static and monotonic:dynamic schedule work as expected for both compilers:

  • In the static schedule, the chunk size is n/nt=4 iterations, the thread 0 executes the iterations 1-4 and the thread 1 executes the iterations 5-8. Because the iteration 1 is much longer than the other ones, the iterations 2-4 are executed after all the other iterations
  • In the monotonic:dynamic schedule, the chunk size is 1 iteration; the thread 0 takes the iteration 1 and spends the rest of the program execution on it, while the thread 1 takes the iterations 2-8 in that order.

For the nonmonotonic:dynamic schedule, the two compilers behave differently, though. gfortran 13 gives the same result than the monotonic:dynamic case: probably they didn’t have time to develop this new schedule and have just mapped nonmonotonic to monotonic; I believe this is authorized by the OpenMP specification for nonmonotonic, which just opens up the possibility for a thread to iterate backwards, without mandating it.

So, the interesting case is the Intel compiler, which apparently implements the new schedule (and did it at least from ifort 2021, as I got the same results with it than with ifx 2024):

  • The default chunk size is still 1, BUT in contrast to the monotonic case the thread 1 starts with the iteration 5 instead of 2. Then it executes the iterations 6, 7, 8. At this point the thread 0 is still occupied with iteration 1, and the thread 0 starts going backward by executing the iterations 4, 3, 2 in that order.
  • What happens:
    • the initial distribution of the iterations is the same as the static schedule with the default chunk size, regardless the chunk size 1, 2, or 4 (note that the initial ditribution would be different with a chunk size that would not be a divider of the number of iterations, though).
    • all the iterations are pre-attributed from the beginning, in constrast to the usual monotonic case where there is no pre-attributation (each thread simply dynamically takes the next unprocessed chunk).
    • once a thread has finished with the iterations that were pre-distributed to it, it starts “stealing” the unprocessed iterations that were pre-attributed to the other threads
    • as shown by the cases with chunk sizes 2 and 4, a thread won’t steal iterations from a chunk that another thread has started to process: the distribution remains chunk based (all the iterations of a sigle thread are processed by a single thread.

gfortran 13:

===============================
STATIC
    loop  thread  order
      1      0      1     0.739
      2      0      6     0.000
      3      0      7     0.000
      4      0      8     0.000
      5      1      2     0.000
      6      1      3     0.000
      7      1      4     0.000
      8      1      5     0.000

===============================
MONOTONIC:DYNAMIC
    loop  thread  order
      1      1      1     0.739
      2      0      2     0.000
      3      0      3     0.000
      4      0      4     0.000
      5      0      5     0.000
      6      0      6     0.000
      7      0      7     0.000
      8      0      8     0.000

===============================
NONMONOTONIC:DYNAMIC
    loop  thread  order
      1      1      1     0.739
      2      0      2     0.000
      3      0      3     0.000
      4      0      4     0.000
      5      0      5     0.000
      6      0      6     0.000
      7      0      7     0.000
      8      0      8     0.000

===============================
NONMONOTONIC:DYNAMIC,2
    loop  thread  order
      1      1      1     0.739
      2      1      8     0.000
      3      0      2     0.000
      4      0      3     0.000
      5      0      4     0.000
      6      0      5     0.000
      7      0      6     0.000
      8      0      7     0.000

===============================
NONMONOTONIC:DYNAMIC,4
    loop  thread  order
      1      1      1     0.739
      2      1      6     0.000
      3      1      7     0.000
      4      1      8     0.000
      5      0      2     0.000
      6      0      3     0.000
      7      0      4     0.000
      8      0      5     0.000

Intel ifx 2024

===============================
STATIC
    loop  thread  order
      1      0      1     0.739
      2      0      6     0.000
      3      0      7     0.000
      4      0      8     0.000
      5      1      2     0.000
      6      1      3     0.000
      7      1      4     0.000
      8      1      5     0.000
 
===============================
MONOTONIC:DYNAMIC
    loop  thread  order
      1      1      1     0.739
      2      0      2     0.000
      3      0      3     0.000
      4      0      4     0.000
      5      0      5     0.000
      6      0      6     0.000
      7      0      7     0.000
      8      0      8     0.000
 
===============================
NONMONOTONIC:DYNAMIC
    loop  thread  order
      1      0      1     0.739
      2      1      8     0.000
      3      1      7     0.000
      4      1      6     0.000
      5      1      2     0.000
      6      1      3     0.000
      7      1      4     0.000
      8      1      5     0.000
 
===============================
NONMONOTONIC:DYNAMIC,2
    loop  thread  order
      1      0      1     0.739
      2      0      8     0.000
      3      1      6     0.000
      4      1      7     0.000
      5      1      2     0.000
      6      1      3     0.000
      7      1      4     0.000
      8      1      5     0.000
 
===============================
NONMONOTONIC:DYNAMIC,4
    loop  thread  order
      1      0      1     0.739
      2      0      6     0.000
      3      0      7     0.000
      4      0      8     0.000
      5      1      2     0.000
      6      1      3     0.000
      7      1      4     0.000
      8      1      5     0.000

Code:

program foo
implicit none

    integer, parameter :: N=8
    integer :: i, t(N), o, order(N)
    real :: x(N)
    integer, external :: omp_get_thread_num

    call omp_set_num_threads(2)

    o = 0
    !$OMP PARALLEL DO SCHEDULE(static)
    do i = 1, N
        call loopcontent(i,o,order,x,t)
    end do
    !$OMP END PARALLEL DO
    call print_results("STATIC",order,x,t)

    o = 0
    !$OMP PARALLEL DO SCHEDULE(monotonic:dynamic)
    do i = 1, N
        call loopcontent(i,o,order,x,t)
    end do
    !$OMP END PARALLEL DO
    call print_results("MONOTONIC:DYNAMIC",order,x,t)

    o = 0
    !$OMP PARALLEL DO SCHEDULE(nonmonotonic:dynamic)
    do i = 1, N
        call loopcontent(i,o,order,x,t)
    end do
    !$OMP END PARALLEL DO
    call print_results("NONMONOTONIC:DYNAMIC",order,x,t) 

    o = 0
    !$OMP PARALLEL DO SCHEDULE(nonmonotonic:dynamic,2)
    do i = 1, N
        call loopcontent(i,o,order,x,t)
    end do
    !$OMP END PARALLEL DO
    call print_results("NONMONOTONIC:DYNAMIC,2",order,x,t) 

    o = 0
    !$OMP PARALLEL DO SCHEDULE(nonmonotonic:dynamic,4)
    do i = 1, N
        call loopcontent(i,o,order,x,t)
    end do
    !$OMP END PARALLEL DO
    call print_results("NONMONOTONIC:DYNAMIC,4",order,x,t) 

contains 

    subroutine loopcontent(i,o,order,x,t)
        integer, intent(in) :: i
        integer, intent(inout) :: o, order(:), t(:)
        real, intent(inout) :: x(:)

        integer :: j

        !$OMP CRITICAL
        o = o+1
        order(i) = o
        !$OMP END CRITICAL
        x(i) = 0.0
        if (i == 1) then
            do j = 1, 10000000
                x(i) = cos(x(i))
            end do
        end if
        t(i) = omp_get_thread_num()
    end subroutine

    subroutine print_results(what,order,x,t)
        character(*), intent(in) :: what
	     integer, intent(in) :: t(:), order(:)
	     real, intent(in) :: x(:)

        print*
        print "(A)", "==============================="
        print "(A)", what
        print "(A)", "    loop  thread  order"
        do i = 1, N
            print "(3I7,F10.3)", i, t(i), order(i), x(i)
        end do
    end subroutine

end