Coarray Fortran "critical" block

I’m trying to implement a work queue using the critical construct, but am getting results contrary to my expectations. Here’s a minimal working example:

program critbug
    implicit none
    integer :: next_task[*]   ! shared task counter; every image reads/updates image 1's copy
    integer :: my_task

    next_task = 0
    sync all 
    do 
        call get_next_task(my_task)
        if (my_task > 10) exit
        print *, 'image', this_image(), 'about to do task', my_task
        call sleep(3)
        print *, 'image', this_image(), 'done task', my_task
    end do

contains 
    subroutine get_next_task(next)
        integer,intent(out) :: next
        critical
            ! intended to give one image at a time exclusive access to next_task
            next = next_task[1]
            next_task[1] = next + 1
        end critical
    end subroutine
end program

My expectation was that critical/end critical acts like a mutex, i.e. each image blocks while another image is in the critical section, and once that image exits the critical section another image can enter it, and so on. This is not what happens in practice, with either the Intel compiler or gfortran/OpenCoarrays. With the latter, what actually happens is this:

$ cafrun -np 8 ./a.out
 image           1 about to do task           0
 image           1 done task           0
 image           7 about to do task           1
 image           8 about to do task           2
 image           2 about to do task           3
 image           3 about to do task           4
 image           1 about to do task           5
 image           7 done task           1
 image           8 done task           2
 image           2 done task           3
 image           3 done task           4
 image           1 done task           5
 image           4 about to do task           6
 image           6 about to do task           7
 image           8 about to do task           8
 image           3 about to do task           9
 image           5 about to do task          10
 image           4 done task           6
 image           6 done task           7
 image           8 done task           8
 image           3 done task           9
 image           5 done task          10

If I put a sync all statement at the end of get_next_task(), then I get the expected behaviour, except the program hangs at the end:

$ cafrun -np 8 ./a.out
 image           1 about to do task           0
 image           2 about to do task           3
 image           4 about to do task           1
 image           5 about to do task           4
 image           6 about to do task           6
 image           7 about to do task           7
 image           8 about to do task           2
 image           3 about to do task           5
 image           2 done task           3
 image           4 done task           1
 image           5 done task           4
 image           6 done task           6
 image           7 done task           7
 image           8 done task           2
 image           3 done task           5
 image           1 done task           0
 image           1 about to do task           8
 image           3 about to do task           9
 image           4 about to do task          10
 image           1 done task           8
 image           3 done task           9
 image           4 done task          10

Is this a bug, have I misunderstood what critical is supposed to do, or both?

Regardless, is there a correct way to implement a work queue in Fortran without resorting to ISO_C_BINDING and POSIX constructs? In my real application the tasks take different amounts of time to complete, so I do not want to sync all every time a new task is dealt out.

@rouson, might you have any insight on this one?

Thanks very much, everyone.

1 Like

@hsnyder I haven’t used critical much and wouldn’t recommend it unless there’s a shared resource (e.g., a file) that each image needs to access. It sounds like the shared resource in your case is the mutex variable next_task, in which case I think you’re essentially rolling your own event_type. If I’m right, then I’d recommend simply using event_type.

One of the things I most love about event_type is that it turns an old argument on its head. Many early attempts to apply object-oriented programming (OOP) to computational science hurt performance. With event_type, OOP actually aids performance by eliminating the need for synchronization, and it goes further by making a compelling case for OOP’s information-hiding in the form of private data. It can be tricky to get a mutex right on your own for subtle reasons. The greatest service event_type provides is to encapsulate a common pattern, a counting semaphore, while hiding the trickiest implementation detail (the atomic manipulation of the counter).
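To make that concrete, the post/wait pattern looks roughly like this (a minimal sketch with placeholder names, not code from any particular application): the image that waits owns the event variable, any other image may post to it, and the runtime manipulates the count atomically.

program event_sketch
    use, intrinsic :: iso_fortran_env, only: event_type
    implicit none
    type(event_type) :: work_ready[*]   ! the counter lives on the image that waits
    integer :: count

    if (num_images() > 1) then
        if (this_image() == 2) then
            ! post atomically increments the counter on image 1
            ! (event post may name a coindexed event variable)
            event post (work_ready[1])
        end if
        if (this_image() == 1) then
            ! wait blocks until the local counter is positive, then decrements it
            ! (event wait must name a non-coindexed event variable)
            event wait (work_ready)
            call event_query(work_ready, count)   ! non-blocking look at the remaining count
            print *, 'pending events:', count
        end if
    end if
end program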

If adding synchronization fixes a problem, it’s often a sign that the algorithm has a race condition. I suspect there’s a misunderstanding of critical. It gives you mutually exclusive, coherent access to next_task, but it doesn’t impose any segment ordering between entries to the critical block; that is what the additional sync all provides, so I suspect the compilers are behaving correctly here.

Sourcery is funded by NASA to co-develop open-source task-scheduling software: the Framework for Extensible Asynchronous Task Scheduling (FEATS). I invite you to join us in that effort. We are in the very early stages because we’ve been laying a lot of groundwork, including working on a directed acyclic graph (DAG) library to represent task dependencies. Right now we’ve just written code to handle the initial task distribution, and we’re working out what we expect users to write versus what services FEATS will provide. Next steps include designing a data structure and algorithms for images to retrieve data produced by the tasks on which a given image’s task depends.

3 Likes

@rouson thanks for your reply. I’m not sure that event_type actually does allow you to implement a counting semaphore, at least not a global one, because while you can post to events on other images, you can’t wait on them (no coindexing in event wait). This means you can only wait for someone to explicitly notify YOU, rather than wait for a global notification that another “spot” is available… Am I wrong?

I’ll respond separately regarding FEATS once I’ve had a chance to review the link you sent - it sounds interesting.

I would replace the critical section:

        critical 
            next = next_task[1]
            next_task[1] = next + 1
        end critical

with
call atomic_fetch_add(next_task[1], 1, next )

which will add 1 to next_task[1] and return the previous value as next.

Making the whole thing:

cat test.f90
program critbug
    use, intrinsic :: iso_fortran_env, only: atomic_int_kind
    implicit none
    integer(atomic_int_kind) :: next_task[*]
    integer(atomic_int_kind) :: my_task

    next_task = 0
    sync all
    do
        call get_next_task(my_task)
        if (my_task > 10) exit
        print *, 'image', this_image(), 'about to do task', my_task
        call sleep(3)
        print *, 'image', this_image(), 'done task', my_task
    end do

contains
    subroutine get_next_task(next)
        integer(atomic_int_kind), intent(out) :: next
        ! atomically add 1 to the counter on image 1 and return its previous value in next
        call atomic_fetch_add(next_task[1], 1, next)
    end subroutine get_next_task
end program

ftn test.f90
srun -n8 -CBW28 ./a.out
srun: job 2797323 queued and waiting for resources
srun: job 2797323 has been allocated resources
image 2 about to do task 4
image 2 done task 4
image 4 about to do task 3
image 4 done task 3
image 6 about to do task 5
image 6 done task 5
image 8 about to do task 6
image 8 done task 6
image 5 about to do task 2
image 5 done task 2
image 7 about to do task 7
image 7 done task 7
image 7 about to do task 10
image 7 done task 10
image 1 about to do task 0
image 1 done task 0
image 1 about to do task 8
image 1 done task 8
image 3 about to do task 1
image 3 done task 1
image 3 about to do task 9
image 3 done task 9

1 Like

@billlong Thanks. I tried something very close to that previously. It works with the Intel compiler, but with gfortran and OpenCoarrays there’s some sort of bug that scrambles the values, and you get results like the following, even if you use atomic_define instead of setting the initial value to zero with the assignment operator.

 image           5 about to do task -1454442688
 image           6 about to do task  -147290369
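To be concrete, the atomic_define initialization I’m referring to is along these lines (a sketch, not the exact code I ran):

    ! reconstruction of the atomic_define-based initialization (illustrative only)
    if (this_image() == 1) call atomic_define(next_task[1], 0)
    sync all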

What compiler are you using?

@hsnyder I’ve written counting semaphores using event_type many times and would be glad to demonstrate briefly offline. I generally find that the constraints in the standard are there to guide the developer toward better code either in terms of robustness or performance.

Waiting on a remote semaphore counter would have negative performance implications, both in the demands it would put on the network and in the cost of the wait itself. Getting details like this right is one of the main reasons I recommend against rolling your own solution when the language already provides one.

For robustness reasons, I also recommend against heading in the direction of atomic subroutine calls unless you have a lot of parallel programming experience. I’m reasonably confident that event_type can support your needs. If I’m correct, then event_type is a much better choice than anything that has been written in this thread.

It’s worth noting that in Modern Fortran Explained: Incorporating Fortran 2018 by Metcalf, Reid, and Cohen, the authors include atomic_define (and atomic_ref) in their “Deprecated features” appendix because they are just too hard to get right. The related pitfalls exemplify why writing this kind of low-level code is just not a good idea for the vast majority of users.

Do you know of a good guide to using event_type? I’ve googled it, but I’m unclear on exactly what it is or how to use it.

1 Like

I give a brief introduction starting at 16:20 in First Experiences with Parallel Application Development in Fortran 2018 - YouTube. Milan Curcic’s book Modern Fortran: Building efficient parallel applications has some material on event_type as does the aforementioned book by Metcalf, Reid & Cohen.

Also, @everythingfunctional and I teach 1-2 days of Fortran 2018 parallel programming, including event_type, in our tutorials. We don’t have any upcoming tutorials on the calendar yet, but would be glad to schedule if a site wants to host one – presumably virtually for the near future.

2 Likes

For anyone who might be looking for a solution to a similar problem (task queue), or for anyone who might be able to suggest improvements, here’s a working implementation of a task queue using events. I dislike it because of the need for a master image that sits there and deals out tasks, but it does work.

Basically the “tasks” in this application are identified by a single integer (we’re presuming that’s enough information for the worker image to know what to do), and I’m simulating the runtime of doing an actual task by just sleeping for a number of seconds equal to the task ID (which again is an integer).

I haven’t bothered with making the master image into another team, or encapsulating the queue in an object, though you definitely could do…

program tq
    use iso_fortran_env
    implicit none

    integer :: my_next_task[*], i, j

    type(event_type) :: task_needed_ev[*]
    logical, allocatable :: task_needed_ar(:)[:]
    logical :: done[*]

    type(event_type) :: task_assigned[*]

    allocate(task_needed_ar(num_images())[*])

    task_needed_ar = .false.   ! must be initialized: image 1 reads every entry below
    done = .false.

    if (this_image() == 1) then
        i = 0
        do
            ! wait until asked for a task
            event wait(task_needed_ev)

            ! if no more tasks to give out, tell all images to exit
            if (i > 10) then
                do j = 2, num_images()
                    done[j] = .true.
                    event post (task_assigned[j])
                end do
                exit
            end if

            ! if there are more tasks to give out
            ! find which image needs a task
            do j = 2, num_images()
                if (task_needed_ar(j)) then

                    task_needed_ar(j) = .false.

                    ! assign task
                    my_next_task[j] = i
                    event post (task_assigned[j])
                    i = i + 1
                    exit

                end if
            end do

        end do

    else ! if we're not image 1
        associate (me=>this_image())
        do
            task_needed_ar(me)[1] = .true.
            event post (task_needed_ev[1])
            event wait (task_assigned)
            if (done) exit

            print *, 'Image', me, 'doing task', my_next_task
            call sleep(my_next_task)
            print *, 'Image', me, 'done task', my_next_task

        end do
        end associate
    end if


end program
2 Likes

Sounds a bit like OpenMP tasks. If I recall correctly, with OpenMP the thread that creates the tasks only starts working on tasks itself once it has finished creating them, while the other threads are already executing tasks and aren’t yet finished.
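For reference, the usual OpenMP tasking pattern looks roughly like this (a sketch only, assuming a compiler with OpenMP enabled, e.g. gfortran -fopenmp):

program omp_tasks
    use omp_lib, only: omp_get_thread_num
    implicit none
    integer :: i

    !$omp parallel
    !$omp single
    ! one thread creates all the tasks...
    do i = 0, 10
        !$omp task firstprivate(i)
        print *, 'thread', omp_get_thread_num(), 'doing task', i
        !$omp end task
    end do
    ! ...and only joins in executing them at the implicit barrier ending the single construct
    !$omp end single
    !$omp end parallel
end program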

1 Like