I have a parallelized Fortran program that, in serial, would run several simulations back to back. Its parallel version is very simple in that it uses multiple images to run multiple independent simulations simultaneously. Its basic implementation is:
program main
   implicit none
   integer :: i, num_img
   integer, parameter :: num_simulations = 9 ! example value
   num_img = num_images()
   do i = 1, num_simulations
      ! Skip runs not intended for this image
      if (mod(i-1,num_img)+1 /= this_image()) cycle
      ! Do simulation
   end do
   sync all ! generate some timing stats
   ! Print out time of each image
end program main
So, assuming you have 3 images and 9 simulations, image 1 does runs 1, 4, and 7; image 2 does runs 2, 5, and 8, and image 3 does runs 3, 6, and 9.
This setup works great when all the runs work. The problem is that if any run crashes due to, e.g., a floating-point error, the executable just hangs forever and never terminates.
Two questions:
Is there a way to compile this to tell the program to terminate all images if any image crashes?
Is there a better way to terminate the program normally rather than the sync all at the end? Would it be more advantageous to pass data back to image 1 when each image is done (without sync all)? Would that allow images to terminate and release memory back if they finish before the rest?
Calling ieee_set_halting_mode() as described here with code here to force floating point overflows to terminate a program may work, but I have not tried it in a parallel program.
As @Beliavsky pointed out, use error stop to cause all images to terminate for any errors you catch. For errors you can’t catch, add stat= to your sync all statement (I haven’t used it so you’ll have to double check on the exact syntax, but I think it should be sync all (stat=status_variable)). That should notice any images that have stopped instead of waiting forever for them to all reach that statement. Then you can use error stop again in that case. Relevant section of the standard is 11.6.11.
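For the "errors you catch" case, here is a minimal sketch of what that could look like for floating-point overflow. It assumes the processor supports the IEEE overflow flag and that halting on overflow is off (so the flag can be inspected after the run). The function run_simulation, the value of num_simulations, and the deliberately overflowing run 5 are hypothetical stand-ins, not part of the original program.

```fortran
program caught_error_sketch
   use, intrinsic :: ieee_exceptions, only: ieee_overflow, ieee_get_flag, ieee_set_flag
   implicit none
   integer, parameter :: num_simulations = 9   ! hypothetical value
   integer :: i
   real :: answer
   logical :: overflowed

   do i = 1, num_simulations
      ! Skip runs not intended for this image
      if (mod(i-1, num_images()) + 1 /= this_image()) cycle

      call ieee_set_flag(ieee_overflow, .false.)   ! clear the flag before the run
      answer = run_simulation(i)
      call ieee_get_flag(ieee_overflow, overflowed)

      ! error stop initiates error termination, which should bring down all images
      if (overflowed) error stop 'overflow detected in a simulation'
      write (*,'(a,i0,a,i0,a,es12.4)') 'image ', this_image(), ' run ', i, ' result ', answer
   end do

contains

   ! Hypothetical stand-in for the real simulation; run 5 overflows on purpose.
   function run_simulation(run) result(val)
      integer, intent(in) :: run
      real :: val
      real :: scale
      scale = merge(huge(1.0), 1.0, run == 5)
      val = scale * real(run)
   end function run_simulation

end program caught_error_sketch
```

This only covers errors the code checks for itself; a crash that kills an image outright never reaches the error stop.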
Ah nice, thanks, that sounds like that’ll probably work! Thanks
Edit: Reading my Modern Fortran book, there’s a section on program termination that would seem to imply that one image initiating error termination would terminate all images anyway. (“An image initiates error termination if it executes a statement that would cause the termination of a single-image program but is not a STOP or END PROGRAM statement. This causes all other images that have not already initiated error termination to initiate error termination.”) Wonder if there’s another issue at play here as well.
Fortran 2018 has the concept of “failed images”, where you can detect that an image has failed (more likely as the number of images increases) and recover from it, perhaps by assigning a “spare” image to a team.
Very odd. I replaced the final sync all with this:
print*,"HERE0"
sync all(stat=sync_stat)
print*,"HERE1"
if (sync_stat == stat_stopped_image) then
write (*,*) "ERROR DETECTED ON AN IMAGE"
ERROR STOP
end if
I force a floating overflow on image 2 and get this in the terminal:
So it makes it to the sync statement, but then seems to terminate gracefully rather than via ERROR STOP, and doesn’t complete any of the print statements (skipping the additional HERE as well as both the error prints and normal prints) after the sync all.
So it didn’t hang all the processes, but it did fail silently, which is also bad (normally with long runs I won’t notice the forrtl error in the middle of a long output to the terminal).
Good catch, but I was actually comparing against sync_stat (I def use implicit none). I’ve retyped the code to here from another machine by hand so it was just a transposition error. So the issue is still occurring. :\
EDIT: Just noticed there’s also a stat_failed_image in iso_fortran_env. I was using stat_stopped_image. I’ll have to give that a try tomorrow at work.
The compilers are still at an early stage but the named constant stat_failed_image does already work if we use the fail image statement to raise image failure and the failed_images() function to return the failed image(s) (output is from ifort 2021.4.0):
program main
use, intrinsic :: iso_fortran_env
implicit none
real :: myvalue
integer :: mystat
integer, allocatable, dimension(:) :: fail_imgs
myvalue = this_image() * 10.0
if (this_image() == 2) then
fail image ! let the image fail
end if
sync all (stat=mystat)
fail_imgs = failed_images() ! get the failed images
write (*,'(a,i0,a,f10.2)') 'IMAGE ',this_image(),' PRE-SYNC, VALUE: ', myvalue
sync all (stat=mystat)
if (mystat == stat_failed_image .or. mystat == stat_stopped_image) then
write (*,*) 'ERROR OCCURRED ON IMAGE', fail_imgs
error stop
end if
write (*,'(a,i0,a,i0)') 'IMAGE ',this_image(),' POST-SYNC, VALUE: ', myvalue
end program
A few comments on the above code compiled with gfortran v.11 + OpenCoarrays, tested on macOS.
In the original form, with myvalue=huge(0.0)/tiny(0.0), gfortran fails to compile, reporting the arithmetic overflow, apparently treating the RHS as a constant expression.
When I changed this statement to a non-constant but still overflowing expression, the code ran smoothly, outputting Infinity in image 2.
When I set FP overflow to halt, using the ieee_exceptions intrinsic module and executing call ieee_set_halting_mode(ieee_overflow,.true.), the code exits but never reaches the error stop or the POST-SYNC output. The same happens with fail image instead of a forced overflow.
BTW, the OP snippet has another error in the last write format (i0 for myvalue).
Code modified for gfortran
program main
use, intrinsic :: iso_fortran_env, only: stat_failed_image, stat_stopped_image
use, intrinsic :: ieee_exceptions, only: ieee_overflow, ieee_set_halting_mode
implicit none
real :: myvalue, small=0.01, zero=0.0
integer :: mystat
call ieee_set_halting_mode(ieee_overflow,.true.)
myvalue = this_image() * 10.0
if (this_image() == 2) myvalue = huge(0.0) / small ! crash image 2
! if (this_image() == 2) fail image ! crash image 2
write (*,'(a,i0,a,f10.2)') 'IMAGE ',this_image(),' PRE-SYNC, VALUE: ', myvalue
sync all (stat=mystat)
if (mystat == stat_failed_image .or. mystat == stat_stopped_image) then
write (*,*) 'ERROR OCCURRED ON AN IMAGE'
error stop
end if
write (*,'(a,i0,a,f10.2)') 'IMAGE ',this_image(),' POST-SYNC, VALUE: ', myvalue
end program main
Output
$ caf coarr_error.f90
$ cafrun -np 4 ./a.out
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
IMAGE 1 PRE-SYNC, VALUE: 10.00
IMAGE 3 PRE-SYNC, VALUE: 30.00
IMAGE 4 PRE-SYNC, VALUE: 40.00
Could not print backtrace: executable file is not an executable
#0 0x1056e0d3e
#1 0x1056dff4d
#2 0x7fff205fad7c
#3 0x10549b0e8
#4 0x10549b3c1
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node Michals-MBP-2 exited on signal 8 (Floating point exception: 8).
--------------------------------------------------------------------------
Error: Command:
`/usr/local/bin/mpiexec -n 4 ./a.out`
failed to run.
I was able to test with ifort 2021.4 today too and discovered the same thing–that failed_images() seems to work, but only for intentionally failing the image via calling fail image (as opposed to being able to catch an unintentionally failed/crashed image).
A thought I just had is that checking if (mystat /= 0) might work, since perhaps there’s some other non-zero status thrown for a crashed image other than stat_failed_image or stat_stopped_image. The ifort on my Mac doesn’t support coarrays and I haven’t figured out how to compile with OpenCoarrays and gfortran, so I can’t test until Monday at work.
If you use Homebrew on Mac, install the opencoarrays and open-mpi formulae. OpenCoarrays includes two scripts, caf and cafrun, to compile and run coarray programs (see the ‘Output’ above to see how).
Thanks, that worked and now I can run coarray stuff at home. Quite a bit clunkier than ifort’s implementation where it just runs out of its own executable, but alas.
Checking if (mystat /= 0) also didn’t work. It may be the case that sync stat is only intended for checking images that were intentionally failed as opposed to failed due to an unexpected error (like floating overflow). Too bad.
At this point I can’t come up with a good dummy test code to reproduce the original problem for which I opened the thread: all images hanging if one image crashes. Perhaps I need to work on creating a minimal example to reproduce that and bump the thread again in the future if I can.
The fail image statement is of limited practical use for the programmer as it can only fail the executing image. But it is still very useful for trying out failure recovery methods already. (I have only tried out some basics so far, and already have some ideas.)
The fail image statement would be more useful if it could also fail remote images, since the programmer can develop techniques to get hints as to whether a remote image is a candidate for image failure. (I can’t tell whether such a remote fail image would be technically possible, though.) Without such a feature, raising and detecting failed images is left solely to the run-time. (The compilers are still at an early stage and won’t do the job yet.) Thus, raising (remote) image failure is one aspect to which the Fortran programmer has no low-level access yet (as opposed to data transfer/synchronization and recovery methods at the coarray team level, as far as I can tell from some basic testing so far).
Moreover, failed images in Fortran, specifically with OpenCoarrays/gfortran, are still cutting-edge and ongoing research: Refining Fortran Failed Images.
I can comment on Intel Fortran coarrays and failed image detection. The Intel Fortran coarray implementation is built on top of Intel MPI. Intel MPI is MPICH-based and implements MPI 3.1. iMPI has not moved to MPI 4, which is where better support for failed rank detection is. Our MPI team provided a few non-standard functions to give us some rudimentary failed image detection, but it’s incomplete and non-MPI-standard. This is to say we’re hopeful iMPI will give us more capability for failed image detection in the future. But as our MPI architect cautioned, it’s a 2-way responsibility. The MPI or CAF programmer will have to add code to detect the failure AND devise a recovery scheme (if possible) in their code, or to fail “gracefully”.
Intel is working to fully support Fortran 2018, which includes the features of ISO/IEC TS 18508. Note that some of the features listed in the technical spec were modified somewhat in the integration into the Fortran standard. These differences are noted in the Fortran 2018 standard on page xv of the introduction, in the second bullet on that page. Ifort (and ifx) will support these features as specified in F2018, not in TS 18508 where there are differences. This work is on the ‘to-do’ list which is quite long these days so don’t expect something overnight or even this calendar year. It’ll be a multi-year journey.
Thanks for sharing, I really appreciate this. Still, if I had to ask only a single question, it would be:
‘Is it possible to make the FAIL IMAGE statement so that it can fail remote images as well, i.e. to implement a new intrinsic function FAIL_IMAGE ( image ) that would do that job?’
From my current understanding, and from a Fortran programmer’s viewpoint, fault detection and recovery is a three-step process (step 2 is the reason for my request here):
1. Detection of Failed Images
The Fortran 2018 programmer can already detect any failed images through failing (missing or delayed) atomic data transfer(s). (This requires a programming model that constantly does fine-grained switches between sending and receiving on the images.)
2. Marking Images as Failed Images
After detection of failed images, the Fortran programmer should be able to (remotely) mark these images as failed images using something like a FAIL_IMAGE function. That is what I am asking for here, because this alone would give the programmer enough low-level access to implement the recovery process, as it should allow leaving a CHANGE TEAM construct gracefully in the presence of failed images.
3. Recovery Process
Without the possibility for the Fortran programmer to mark failed images, we can only STOP (not ERROR STOP) all active images of a coarray team with (a) failed image(s). Thus, the main advantage of such failed images features would be to not lose more images than necessary in a recovery process.
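To make step 1 a bit more concrete, here is a rough sketch of the "missing atomic transfer" idea: every image atomically sets its own slot on image 1, and image 1 polls the slots a bounded number of times. The names beat, alive and max_polls, and the polling budget itself, are my own assumptions for illustration. An image without a heartbeat is only a candidate for failure here; nothing in this sketch marks it as a failed image in the sense of the standard.

```fortran
program heartbeat_sketch
   use, intrinsic :: iso_fortran_env, only: atomic_int_kind
   implicit none
   integer, parameter :: max_polls = 100000   ! hypothetical polling budget
   integer(atomic_int_kind), allocatable :: beat(:)[:]   ! one slot per image
   integer(atomic_int_kind) :: seen
   logical, allocatable :: alive(:)
   integer :: img, poll

   allocate (beat(num_images())[*])
   beat = 0
   sync all   ! all slots are zeroed before any remote write happens

   ! Each image reports in by atomically setting its own slot on image 1.
   call atomic_define(beat(this_image())[1], int(1, atomic_int_kind))

   if (this_image() == 1) then
      allocate (alive(num_images()))
      alive = .false.
      ! Poll the slots; stop early once every image has reported in.
      do poll = 1, max_polls
         do img = 1, num_images()
            call atomic_ref(seen, beat(img)[1])
            if (seen == 1) alive(img) = .true.
         end do
         if (all(alive)) exit
      end do
      do img = 1, num_images()
         if (.not. alive(img)) write (*,'(a,i0)') 'no heartbeat from image ', img
      end do
      if (all(alive)) write (*,'(a)') 'all images reported in'
   end if
end program heartbeat_sketch
```

A real scheme would repeat the heartbeat throughout the run and use a time-based limit rather than a fixed poll count, but the send/receive pattern would be the same.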