Coarray code stalling in a Critical block

I have a large Fortran program that communicates with external software via TCP/IP. It launches the external software itself (via call system) and tells it which port number to use. I’m working on reconfiguring the program to use Coarray Fortran so it can run multiple simulations at once.

To support parallelization via coarrays, each image needs its own unique port number. To do this, I’m currently using code similar to that below. On the first pass, each image runs through tcpconnect inside a critical block to ensure it gets a unique port number (subsequent passes can reuse the same port number). The functions find_open_port() and client_start() are written in C and compiled together with the Fortran code.

The problem I’m encountering is that the program stalls at the critical block for a long time. It’ll typically get through 3-4 images right away, but it’ll then take several minutes to get through the next few (assuming 8 images total).

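subroutine tcpconnect
    implicit none
    ! C functions linked with the Fortran code (assumed to return default integers)
    integer, external :: find_open_port, client_start
    character(len=256) :: cmd
    integer, save :: port_number = -1   ! kept between calls; -1 means "not assigned yet"
    integer :: status

    ! Only the first call searches for a port; later calls reuse it
    if (port_number == -1) port_number = find_open_port()
    write (cmd,'(a,i0)') '/path/to/other ',port_number
    call system(trim(cmd))
    status = client_start(port_number)
    if (status /= 0) then
        ! bad
        error stop "CONNECTION FAILED"
    end if
end subroutine

subroutine io_manager
    implicit none
    logical, save :: pass1 = .true.     ! true only on this image's first call

    if (pass1) then
        pass1 = .false.
        ! First pass: one image at a time, so no two images pick the same port
        critical
            call tcpconnect
        end critical
    else
        call tcpconnect
    end if
end subroutine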

Everything runs fine if I replace port_number = find_open_port() with port_number = 2100 + this_image() and remove the critical block. This guarantees different port numbers but I don’t want to hardcode the port.

Could this be an issue with the compiler not knowing how to handle C code called inside a critical block?

Is there a better way to do this? Should I be using event_type instead of critical blocks?

Using ifort 2020.4.

I think I may have fixed it. This issue has been plaguing me for months, and the idea just suddenly came to me. As far as I can tell, adding a sync all after the critical block makes it work, as shown.

subroutine io_manager
    logical :: pass1 = .true.

    if (pass1) then
        pass1 = .false.
        critical
            call tcpconnect
        end critical
        sync all
    else
        call tcpconnect
    end if
end subroutine

Any ideas why this was necessary?

I would need more context and/or a more complete example to properly diagnose, but there is one thing that sticks out to me from your example. Why must call tcpconnect be in a critical block in one place, but not the other?

Also, can you think of any reason why all of the images must call tcpconnect the first time before any of them call it again? That is what the sync all is effectively doing. critical only says “one image at a time in here”. Once an image has completed the block, it is free to move on. sync all says “everybody has to get to here before anybody can move on”.
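
To make the distinction concrete, here is a minimal sketch that is separate from the code in the question. The program name and the arrivals counter are made up for illustration. Each image adds one to a counter on image 1 inside a critical block, and the final sync all is what guarantees image 1 sees every contribution before printing.

```fortran
program critical_vs_sync
    implicit none
    integer :: arrivals[*]   ! one counter per image; only image 1's copy is used

    arrivals = 0
    sync all                 ! image 1's counter is zeroed before anyone adds to it

    critical
        ! Only one image at a time executes this block, so the
        ! read-modify-write on image 1's counter cannot interleave.
        arrivals[1] = arrivals[1] + 1
    end critical
    ! An image that has left the block carries on immediately;
    ! it does not wait for images still queued at "critical".

    sync all                 ! barrier: nobody passes until every image has arrived

    if (this_image() == 1) then
        print '(a,i0,a,i0,a)', 'image 1 counted ', arrivals, ' of ', num_images(), ' images'
    end if
end program critical_vs_sync
```

Without the second sync all, image 1 could reach the print statement while other images are still waiting to enter the critical block. With ifort this would be built with the -coarray option.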

The first time each image goes through tcpconnect, it runs find_open_port() to get an open port number to use. It needs to be called in a critical block to avoid multiple images ending up getting the same port number (which is what happens otherwise).

Subsequent calls to tcpconnect don’t need to be critical because the port number has already been assigned and find_open_port doesn’t get called; it continues to use the same number in subsequent runs.

Side note: a nice side effect of the sync all solution is that it fixes a theoretical problem I thought I could encounter (but never did). With enough images, one image could end up running find_open_port during the brief period between runs when another image has already claimed a port but its connection is closed, so that port would look free. sync all should ensure that each image gets its port while all the other connections are active.
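
On the event_type question from the original post, one possible alternative to the critical block is to hand a "turn" from image to image, so that the first-pass connections happen strictly in image order. The following is only a sketch under assumptions: connect_in_turn and my_turn are hypothetical names, tcpconnect is the routine from the question, and the compiler is assumed to support Fortran 2018 events. It serializes the first pass, but nothing in this thread shows that it would avoid the stall.

```fortran
subroutine connect_in_turn()
    use, intrinsic :: iso_fortran_env, only: event_type
    implicit none
    ! An event variable must be a coarray; "save" is required for a
    ! non-allocatable coarray declared inside a procedure.
    type(event_type), save :: my_turn[*]
    integer :: me

    me = this_image()

    ! Image 1 goes first; every other image waits until the
    ! previous image posts to its event.
    if (me > 1) event wait (my_turn)

    call tcpconnect()   ! first-pass connection, one image at a time

    ! Pass the turn on to the next image, if there is one.
    if (me < num_images()) event post (my_turn[me + 1])
end subroutine connect_in_turn
```

This would be called once per image in place of the critical block on the first pass. Like critical, it lets an image move on as soon as its own connection is made, so a sync all would still be needed if every image must be connected before any of them continues.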

“call system()” is nonstandard. Could it be executing asynchronously? Try “call execute_command_line()” instead.
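
A sketch of what that replacement might look like, with the status arguments checked so a failed launch is not silent. The subroutine name launch_external is made up for illustration, and the command string is the placeholder from the question. wait, exitstat, cmdstat and cmdmsg are standard optional arguments of execute_command_line.

```fortran
subroutine launch_external(port_number)
    implicit none
    integer, intent(in) :: port_number
    character(len=256) :: cmd, msg
    integer :: exit_code, cmd_status

    write (cmd, '(a,i0)') '/path/to/other ', port_number
    exit_code = 0
    msg = ''

    ! wait=.true. makes the call synchronous: it returns only after
    ! the command has finished.
    call execute_command_line(trim(cmd), wait=.true., exitstat=exit_code, &
                              cmdstat=cmd_status, cmdmsg=msg)

    if (cmd_status /= 0) then
        ! The command could not be executed at all
        print '(2a)', 'could not run command: ', trim(msg)
        error stop "LAUNCH FAILED"
    else if (exit_code /= 0) then
        ! The command ran but reported a failure
        print '(a,i0)', 'command exited with status ', exit_code
        error stop "EXTERNAL PROGRAM FAILED"
    end if
end subroutine launch_external
```

Note that with wait=.true. the calling image blocks until the command returns. Whether that suits an external program that has to stay alive to accept the connection depends on how /path/to/other is launched, which the thread does not say.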

That sounded like a great guess, but it doesn’t seem to be the cause. Using execute_command_line(cmd, wait=.true.) without sync all still stalled in several images, the same as with the system call.

At least everything works with sync all. I was just hoping to find out why it seemed to be required to coax all the images to complete the block. My guess is that it has to do with the fact that external C code is being called.
