Force all images to crash if one crashes

bwe · March 16, 2022, 2:04pm

I have a parallelized Fortran program that, in serial, would run several simulations back to back. Its parallel version is very simple in that it uses multiple images to run multiple independent simulations simultaneously. Its basic implementation is:

program main
    implicit none
    integer, parameter :: num_simulations = 9 ! example value
    integer :: i, num_img

    num_img = num_images()

    do i = 1, num_simulations
        ! Skip runs not intended for this image
        if (mod(i-1, num_img) + 1 /= this_image()) cycle

        ! Do simulation
    end do

    sync all ! generate some timing stats
    ! Print out time of each image

end program

So, assuming you have 3 images and 9 simulations, image 1 does runs 1, 4, and 7; image 2 does runs 2, 5, and 8, and image 3 does runs 3, 6, and 9.
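
The mapping can be checked without a coarray-enabled compiler. The following is only a sketch: the program name show_assignment and the fixed values of 3 images and 9 simulations are assumptions taken from the example above, and the image index is emulated with an ordinary loop variable instead of this_image().

```fortran
program show_assignment
  implicit none
  integer, parameter :: num_simulations = 9   ! assumed example value
  integer, parameter :: num_img = 3           ! assumed example value
  integer :: img, i
  integer, allocatable :: runs(:)

  do img = 1, num_img
    ! Keep only the run indices that the round-robin rule assigns to img
    runs = pack([(i, i = 1, num_simulations)], &
                [(mod(i-1, num_img) + 1 == img, i = 1, num_simulations)])
    write (*,'(a,i0,a,*(1x,i0))') 'image ', img, ' runs:', runs
  end do
end program show_assignment
```

This prints "image 1 runs: 1 4 7", "image 2 runs: 2 5 8" and "image 3 runs: 3 6 9".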

This setup works great when all the runs work. The problem is that if any run crashes due to, e.g., a floating-point error, the executable just hangs forever and never terminates.

Two questions:

1. Is there a way to compile this to tell the program to terminate all images if any image crashes?
2. Is there a better way to terminate the program normally rather than the sync all at the end? Would it be more advantageous to pass data back to image 1 when each image is done (without sync all)? Would that allow images to terminate and release memory back if they finish before the rest?

I’m using ifort 19.1.3.304

Beliavsky · March 16, 2022, 2:19pm

Regarding your first question, one uses ERROR STOP for this, as opposed to STOP.

bwe · March 16, 2022, 2:27pm

Error stop will handle a caught error, but the issue still occurs for uncaught errors like unexpected floating overflows.

Beliavsky · March 16, 2022, 2:38pm

Calling ieee_set_halting_mode() as described here with code here to force floating point overflows to terminate a program may work, but I have not tried it in a parallel program.
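
A minimal single-image sketch of that idea, assuming the processor supports halting on overflow (the program name halt_on_overflow and the variable names are illustrative only):

```fortran
program halt_on_overflow
  use, intrinsic :: ieee_exceptions, only: ieee_overflow, ieee_support_halting, &
                                           ieee_set_halting_mode
  implicit none
  real, volatile :: small   ! volatile to discourage compile-time evaluation
  real :: myvalue

  ! Halting control is optional in the standard, so ask before enabling it
  if (ieee_support_halting(ieee_overflow)) then
    call ieee_set_halting_mode(ieee_overflow, .true.)
  else
    write (*,*) 'halting on overflow is not supported here'
  end if

  small = 0.01
  myvalue = huge(0.0) / small   ! overflows at run time
  write (*,*) 'only reached if halting is not in effect: ', myvalue
end program halt_on_overflow
```

How the resulting termination is reported, and what happens to the other images in a coarray run, is up to the compiler and its runtime.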

everythingfunctional · March 16, 2022, 3:35pm

As @Beliavsky pointed out, use error stop to cause all images to terminate for any errors you catch. For errors you can’t catch, add stat= to your sync all statement (I haven’t used it so you’ll have to double check on the exact syntax, but I think it should be sync all (stat=status_variable)). That should notice any images that have stopped instead of waiting forever for them to all reach that statement. Then you can use error stop again in that case. Relevant section of the standard is 11.6.11.

bwe · March 16, 2022, 4:04pm

Ah nice, thanks, that sounds like that’ll probably work! Thanks

Edit: Reading my Modern Fortran book, there’s a section on program termination that would seem to imply that one image initiating error termination would terminate all images anyway. (“An image initiates error termination if it executes a statement that would cause the termination of a single-image program but is not a STOP or END PROGRAM statement. This causes all other images that have not already initiated error termination to initiate error termination.”) Wonder if there’s another issue at play here as well.

sblionel · March 16, 2022, 4:23pm

Fortran 2018 has the concept of “failed images”, where you can detect that an image has failed (more likely as the number of images increases) and recover from it, perhaps by assigning a “spare” image to a team.
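
As a rough sketch of what detection could look like, the Fortran 2018 intrinsic image_status() can be queried for each image after a sync all with stat=. The program name report_image_status is illustrative, and whether a crashed (as opposed to deliberately failed) image is ever reported as failed depends on the compiler's runtime.

```fortran
program report_image_status
  use, intrinsic :: iso_fortran_env, only: stat_failed_image, stat_stopped_image
  implicit none
  integer :: img, status, mystat

  ! With stat= present, a failed or stopped image is reported in mystat
  ! rather than causing error termination at this statement
  sync all (stat=mystat)

  if (this_image() == 1) then
    do img = 1, num_images()
      status = image_status(img)
      if (status == stat_failed_image) then
        write (*,'(a,i0,a)') 'image ', img, ' has failed'
      else if (status == stat_stopped_image) then
        write (*,'(a,i0,a)') 'image ', img, ' has stopped'
      else
        write (*,'(a,i0,a)') 'image ', img, ' is active'
      end if
    end do
  end if
end program report_image_status
```

Images that have already reached end program by the time image 1 queries them may legitimately be reported as stopped.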

bwe · March 16, 2022, 5:12pm

Very odd. I replaced the final sync all with this:


print*,"HERE0"
sync all(stat=sync_stat)
print*,"HERE1"
if (sync_stat == stat_stopped_image) then
  write (*,*) "ERROR DETECTED ON AN IMAGE"
  ERROR STOP
end if

I force a floating overflow on image 2 and get this in the terminal:

forrtl: error (72): floating overflow
In coarray image 2
<stack trace>

HERE0

So it makes it to the sync statement, but then seems to terminate gracefully rather than via ERROR STOP, and doesn’t complete any of the print statements (skipping the additional HERE as well as both the error prints and normal prints) after the sync all.

So it didn’t hang all the processes, but it did silently fail, which is also bad (normally with long runs I won’t notice the forrtl error in the middle of a long output to the terminal).

ivanpribec · March 16, 2022, 5:28pm

Shouldn’t you be testing vs sync_stat instead of stat? Perhaps adding implicit none to your program would be a good idea.

bwe · March 16, 2022, 5:35pm

Good catch, but I was actually comparing against sync_stat (I def use implicit none). I’ve retyped the code to here from another machine by hand so it was just a transposition error. So the issue is still occurring. :\

EDIT: Just noticed there’s also a stat_failed_image in iso_fortran_env. I was using stat_stopped_image. I’ll have to give that a try tomorrow at work.

bwe · March 17, 2022, 12:30pm

Update: neither stat_stopped_image nor stat_failed_image worked. Here’s a minimum viable program that demonstrates what I’m seeing:

program main
    use, intrinsic :: iso_fortran_env, only: stat_failed_image, stat_stopped_image
    implicit none

    real :: myvalue
    integer :: mystat

    myvalue = this_image() * 10.0

    if (this_image() == 2) myvalue = huge(0.0) / tiny(0.0) ! crash image 2

    write (*,'(a,i0,a,f10.2)') 'IMAGE ',this_image(),' PRE-SYNC, VALUE: ', myvalue
    sync all (stat=mystat)
    if (mystat == stat_failed_image .or. mystat == stat_stopped_image) then
        write (*,*) 'ERROR OCCURRED ON AN IMAGE'
        error stop
    end if
    write (*,'(a,i0,a,i0)') 'IMAGE ',this_image(),' POST-SYNC, VALUE: ', myvalue

end program

Build with: ifort caftest.f90 -o caftest -fpe0 -coarray -coarray-num-images=4

Run with: ./caftest

Result:

IMAGE 1, PRE-SYNC, VALUE: 10.00
IMAGE 3, PRE-SYNC, VALUE: 30.00
IMAGE 4, PRE-SYNC, VALUE: 40.00
forrtl: error (72): floating overflow
In coarray image 2
<stacktrace>

It does exit cleanly without hanging indefinitely but the error message is never printed/the error isn’t handled.

Federchen · March 18, 2022, 3:36pm

The compilers are still at an early stage but the named constant stat_failed_image does already work if we use the fail image statement to raise image failure and the failed_images() function to return the failed image(s) (output is from ifort 2021.4.0):

program main
  use, intrinsic :: iso_fortran_env
  implicit none

  real :: myvalue
  integer :: mystat
  integer, allocatable, dimension(:) :: fail_imgs

  myvalue = this_image() * 10.0

  if (this_image() == 2) then
    fail image ! let the image fail
  end if

  sync all (stat=mystat)

  fail_imgs = failed_images() ! get the failed images

  write (*,'(a,i0,a,f10.2)') 'IMAGE ',this_image(),' PRE-SYNC, VALUE: ', myvalue
  sync all (stat=mystat)
  if (mystat == stat_failed_image .or. mystat == stat_stopped_image) then
      write (*,*) 'ERROR OCCURRED ON IMAGE', fail_imgs
      error stop
  end if
  write (*,'(a,i0,a,f10.2)') 'IMAGE ',this_image(),' POST-SYNC, VALUE: ', myvalue

end program

IMAGE 1 PRE-SYNC, VALUE:      10.00
IMAGE 3 PRE-SYNC, VALUE:      30.00
IMAGE 4 PRE-SYNC, VALUE:      40.00
 ERROR OCCURRED ON IMAGE           2
 ERROR OCCURRED ON IMAGE           2

msz59 · March 18, 2022, 8:44pm

A few comments on the above code compiled with gfortran v.11 + opencoarrays, tested on MacOS.

In the original form, with myvalue=huge(0.0)/tiny(0.0), gfortran fails to compile, reporting the arithmetic overflow, apparently treating the RHS as a constant expression.
When this statement is changed to a non-constant but still overflowing expression, the code runs smoothly, outputting Infinity in image 2.
When FP overflow is set to halt, using the ieee_exceptions intrinsic module and executing call ieee_set_halting_mode(ieee_overflow,.true.), the code exits but never reaches the error stop or the POST-SYNC output. The same happens with fail image instead of the forced overflow.

BTW, the OP snippet has another error in the last write format (i0 for myvalue).

Code modified for gfortran

program main
  use, intrinsic :: iso_fortran_env, only: stat_failed_image, stat_stopped_image
  use, intrinsic :: ieee_exceptions, only: ieee_overflow, ieee_set_halting_mode
  implicit none

  real :: myvalue, small=0.01, zero=0.0
  integer :: mystat

  call ieee_set_halting_mode(ieee_overflow,.true.)
  myvalue = this_image() * 10.0

  if (this_image() == 2) myvalue = huge(0.0) / small  ! crash image 2
!  if (this_image() == 2) fail image ! crash image 2

  write (*,'(a,i0,a,f10.2)') 'IMAGE ',this_image(),' PRE-SYNC, VALUE: ', myvalue
  sync all (stat=mystat)
  if (mystat == stat_failed_image .or. mystat == stat_stopped_image) then
    write (*,*) 'ERROR OCCURRED ON AN IMAGE'
    error stop
  end if
  write (*,'(a,i0,a,f10.2)') 'IMAGE ',this_image(),' POST-SYNC, VALUE: ', myvalue
end program main

Output

$ caf coarr_error.f90
$ cafrun -np 4 ./a.out

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
IMAGE 1 PRE-SYNC, VALUE:      10.00
IMAGE 3 PRE-SYNC, VALUE:      30.00
IMAGE 4 PRE-SYNC, VALUE:      40.00

Could not print backtrace: executable file is not an executable
#0  0x1056e0d3e
#1  0x1056dff4d
#2  0x7fff205fad7c
#3  0x10549b0e8
#4  0x10549b3c1
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node Michals-MBP-2 exited on signal 8 (Floating point exception: 8).
--------------------------------------------------------------------------
Error: Command:
   `/usr/local/bin/mpiexec -n 4 ./a.out`
failed to run.

bwe · March 18, 2022, 8:56pm

I was able to test with ifort 2021.4 today too and discovered the same thing–that failed_images() seems to work, but only for intentionally failing the image via calling fail image (as opposed to being able to catch an unintentionally failed/crashed image).

A thought I just had is that checking if (mystat /= 0) might work, since perhaps there’s some other non-zero status thrown for crashed image other than stat_failed_image or stat_stopped_image. The ifort on my Mac doesn’t support coarrays and I haven’t figured out how to compile with opencoarrays with gfortran, so I can’t test until Monday at work.

msz59 · March 18, 2022, 9:28pm

If you use Homebrew on Mac, install the opencoarrays and open-mpi formulae. OpenCoarrays includes two scripts, caf and cafrun, to compile and run coarray programs (see the ‘Output’ above for how to use them).

bwe · March 20, 2022, 6:09pm

Thanks, that worked and now I can run coarray stuff at home. Quite a bit clunkier than ifort’s implementation, where it just runs out of its own executable, but alas.

Checking if (mystat /= 0) also didn’t work. It may be the case that sync stat is only intended for checking images that were intentionally failed as opposed to failed due to an unexpected error (like floating overflow). Too bad.
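
One possible workaround, sketched here under assumptions rather than tested on the compilers discussed in this thread: leave overflow non-halting, check the IEEE overflow flag after the risky computation, and call error stop explicitly so that the overflow becomes a caught error. The program name catch_overflow and the variable names are illustrative. A build option such as -fpe0 enables halting, and whether the run-time call below overrides it may depend on the compiler.

```fortran
program catch_overflow
  use, intrinsic :: ieee_exceptions, only: ieee_overflow, ieee_get_flag, &
                                           ieee_set_flag, ieee_set_halting_mode, &
                                           ieee_support_halting
  implicit none
  real, volatile :: small   ! volatile to discourage compile-time evaluation
  real :: myvalue
  logical :: overflowed

  ! Ask for overflow to raise a flag only, not to halt the image
  if (ieee_support_halting(ieee_overflow)) then
    call ieee_set_halting_mode(ieee_overflow, .false.)
  end if
  call ieee_set_flag(ieee_overflow, .false.)   ! start from a clear flag

  small = 0.01
  myvalue = this_image() * 10.0
  if (this_image() == 2) myvalue = huge(0.0) / small   ! overflow on image 2

  ! Turn the silent overflow into an explicit error termination
  call ieee_get_flag(ieee_overflow, overflowed)
  if (overflowed) then
    write (*,'(a,i0)') 'OVERFLOW DETECTED ON IMAGE ', this_image()
    error stop 1
  end if

  sync all
  write (*,'(a,i0,a,f10.2)') 'IMAGE ', this_image(), ' POST-SYNC, VALUE: ', myvalue
end program catch_overflow
```

Per the standard, the error stop on image 2 should initiate error termination on all images, including those waiting in sync all. In a real simulation the flag check would go after each run, or after each risky kernel.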

At this point I can’t come up with a good dummy test code to reproduce the original problem for which I opened the thread: all images hanging if one image hangs. Perhaps I need to work on creating a minimum example code to reproduce that and bump the thread again in the future if I can.

Thanks for all the help so far everyone!

Federchen · March 20, 2022, 9:24pm

The fail image statement is of limited practical use for the programmer, as it can only fail the executing image. But it is already very useful for trying out failure recovery methods. (So far I have only tried out some basics, but I already have some ideas.)

The fail image statement would be more useful if it could also fail remote images, since the programmer can develop techniques that hint at whether a remote image is a candidate for image failure. (I can’t tell whether such a remote fail image would be technically possible, though.) Without such a feature, raising and detecting failed images is left solely to the run-time. (The compilers are still at an early stage and won’t do the job yet.) Thus, raising (remote) image failure is one aspect to which the Fortran programmer has no low-level access yet, as opposed to data transfer/synchronization and recovery methods at the coarray team level (as far as I can tell from some basic testing so far).

Moreover, failed images in Fortran, specifically with OpenCoarrays/gfortran, are still cutting-edge and ongoing research: Refining Fortran Failed Images.

sblionel · March 22, 2022, 12:57pm

That is exactly why it is there.

greenrongreen · March 22, 2022, 6:12pm

I can comment on Intel Fortran coarrays and failed image detection. The Intel Fortran coarray implementation is built on top of Intel MPI. Intel MPI is MPICH, MPI 3.1. iMPI has not moved to MPI 4, which is where better support for failed rank detection is found. Our MPI team provided a few non-standard functions to give us some rudimentary failed image detection, but it’s incomplete and non-MPI-standard. This is to say we’re hopeful iMPI will give us more capability for failed image detection in the future. But as our MPI architect cautioned, it’s a 2-way responsibility: the MPI or CAF programmer will have to add code to detect the failure AND devise a recovery scheme (if possible) in their code, or to fail “gracefully”.

Intel is working to fully support Fortran 2018, which includes the features of ISO/IEC TS 18508. Note that some of the features listed in the technical spec were modified somewhat in the integration into the Fortran standard. These differences are noted in the Fortran 2018 standard on page xv of the introduction, in the second bullet on that page. Ifort (and ifx) will support these features as specified in F2018, not in TS 18508 where there are differences. This work is on the ‘to-do’ list which is quite long these days so don’t expect something overnight or even this calendar year. It’ll be a multi-year journey.

Federchen · March 24, 2022, 10:08am

Thanks for sharing, I really appreciate this. Still, if I had to ask only a single question, it would be:

‘Is it possible to make the FAIL IMAGE statement so that it can fail remote images as well, i.e. to implement a new intrinsic function FAIL_IMAGE ( image ) that would do that job?’

From my current understanding, and from a Fortran programmer’s viewpoint, fault detection and recovery is a three-step process (step 2. is the reason for my demand here):

1. Detection of Failed Images
The Fortran 2018 programmer can already detect failed images through failing (missing or delayed) atomic data transfers. (This requires a programming model that constantly makes fine-grained switches between sending and receiving on the images.)

2. Marking Images as Failed Images
After detection of failed images the Fortran programmer should be able to (remotely) mark these images as failed images using something like a FAIL_IMAGE function. That is what I am asking for here because this alone would give enough low-level access to the programmer for implementing the recovery process, as it should allow to leave a CHANGE TEAM construct gracefully in the presence of failed images.

3. Recovery Process
Without the possibility for the Fortran programmer to mark failed images, we can only STOP (not ERROR STOP) all active images of a coarray team with (a) failed image(s). Thus, the main advantage of such failed images features would be to not lose more images than necessary in a recovery process.
