Force all images to crash if one crashes

sblionel · March 25, 2022, 12:33am

The model for failed images is that, on detection of a failed image, the current team is reconfigured to exclude the failed image and perhaps include a “spare” image, and execution then resumes. There are examples of this, and it doesn’t require “marking” an image as failed.
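A minimal sketch of that model using the Fortran 2018 team features, not a tested recovery scheme. The team variable `survivors`, the team number 1, and the placement of the `sync all` are illustrative assumptions, and compiler and runtime support for failed images varies.

```fortran
program failed_image_recovery
  use, intrinsic :: iso_fortran_env, only: team_type, stat_failed_image
  implicit none
  type(team_type) :: survivors   ! hypothetical name for the reduced team
  integer :: stat

  ! ... normal parallel work would go here ...

  ! With STAT= present, a failed image is reported instead of
  ! causing error termination of the whole program.
  sync all (stat=stat)

  if (stat == stat_failed_image) then
     ! failed_images() lists the failed image indices in the current team.
     print *, 'image', this_image(), 'sees failed images:', failed_images()

     ! All active images join team number 1; failed images take no part.
     form team (1, survivors, stat=stat)

     change team (survivors, stat=stat)
        ! Inside the construct, this_image() and num_images() refer
        ! to the new team, so work can be redistributed and resumed.
        print *, 'new index', this_image(), 'of', num_images()
     end team (stat=stat)
  end if
end program failed_image_recovery
```

Note that nothing here marks an image as failed: the run-time detects the failure, and the surviving images only react to the `stat` value.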

Federchen · March 25, 2022, 8:49pm

Of course, I was talking about customized detection of failed images using F18 low-level features. Simply put, we can already detect missing atomic data transfers (along with the remote image numbers). In most cases these are due to delayed data transfers, but they may also be due to failed images (i.e. algorithm failure or hardware failure). The process can be refined so that delayed transfers are identified in most cases, which reduces the missing atomic data transfers to a smaller set that could indicate failed images (with their image numbers).
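A rough sketch of what such a detection could look like, assuming each image atomically sets a flag on every other image and the receiver waits with a timeout. The names `arrived`, `seen` and `timeout_s` are made up for illustration, and the fixed timeout is only a crude stand-in for the refinement that separates delayed transfers from real failures.

```fortran
program suspect_images
  use, intrinsic :: iso_fortran_env, only: atomic_int_kind, int64
  implicit none
  integer(atomic_int_kind), allocatable :: arrived(:)[:]  ! one flag per sender
  integer(atomic_int_kind) :: val
  logical, allocatable :: seen(:)
  integer :: me, n, img, stat
  integer(int64) :: t0, t, rate
  real, parameter :: timeout_s = 2.0   ! assumed timeout, tune per system

  me = this_image()
  n  = num_images()
  allocate (arrived(n)[*], seen(n))
  arrived = 0
  seen = .false.
  sync all (stat=stat)   ! flags are zeroed everywhere before any transfer

  ! Announce this image to every image; STAT= avoids error termination
  ! if the target image has already failed.
  do img = 1, n
     call atomic_define (arrived(me)[img], 1, stat=stat)
  end do

  ! Wait for the flags of all images, but give up after the timeout.
  call system_clock (t0, rate)
  do
     do img = 1, n
        if (.not. seen(img)) then
           call atomic_ref (val, arrived(img))
           seen(img) = (val == 1)
        end if
     end do
     if (all(seen)) exit
     call system_clock (t)
     if (real(t - t0) / real(rate) > timeout_s) exit
  end do

  ! Images whose transfer is still missing: delayed or possibly failed.
  print *, 'image', me, 'suspects:', pack([(img, img = 1, n)], .not. seen)
end program suspect_images
```

The result is only a list of suspect image numbers on each image; the run-time itself knows nothing about them.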

Thus, we already have a set of images (with image numbers) that we want to be treated as failed images, but we can’t yet pass this to the run-time, because the FAIL IMAGE statement only fails the executing image. In principle, all we might need is something like an intrinsic FAIL_IMAGE(image) (if such a thing could be implemented, or something else) as a means of giving the results of our customized failed-image detection to the run-time, so that, as you say, the current team could reconfigure itself to exclude the (customized) failed images from further execution with image control statements. Then we could use a customized recovery process as well.
