Force all images to crash if one crashes

The model for failed images is that, on detection of a failed image, the current team is reconfigured to exclude the failed image and, perhaps, to include a “spare” image, and execution then resumes. There are examples of this, and it doesn’t require “marking” an image as failed.
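A minimal sketch of that reconfiguration, assuming a compiler and run-time that actually implement the Fortran 2018 failed-image features (support varies). The program and variable names are made up for illustration, and the "spare" image handling is left out:

```fortran
program reform_team_sketch
  use, intrinsic :: iso_fortran_env, only: team_type, stat_failed_image
  implicit none
  type(team_type) :: survivors   ! hypothetical name for the reformed team
  integer :: stat

  ! ... ordinary work in the initial team ...

  ! With STAT=, a failed image is reported instead of causing error termination.
  sync all (stat=stat)

  if (stat == stat_failed_image) then
    ! Only the active images execute FORM TEAM, so the new team
    ! consists of the surviving images only.
    form team (1, survivors, stat=stat)

    change team (survivors, stat=stat)
      ! Inside the construct, this_image() and num_images()
      ! refer to the reformed team.
      print *, 'image', this_image(), 'of', num_images(), 'resumes'
      ! ... resume from the last consistent state ...
    end team (stat=stat)
  end if
end program reform_team_sketch
```

Note that nothing here marks an image as failed: the survivors simply carry on in a team that no longer contains it.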


Of course, I was talking about a customized detection of failed images using F18 low-level features. Simply put, we can already detect missing atomic data transfers, together with the remote image numbers. In most cases a missing transfer is merely delayed, but it may also be due to a failed image (i.e. an algorithm failure or a hardware failure). The process can be refined so that delayed transfers are identified as such in most cases, which narrows the missing transfers down to a smaller set that likely corresponds to failed images (again with their image numbers).
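To make the idea concrete, here is a rough sketch of such a detection, not my actual implementation: every image atomically writes a flag into image 1's coarray, and image 1 polls those flags until a timeout. The names `flag`, `arrived` and `timeout_s` and the timeout value are assumptions for illustration only, and the sketch assumes that any failure happens after the initial setup.

```fortran
program missing_transfer_sketch
  use, intrinsic :: iso_fortran_env, only: atomic_int_kind, int64
  implicit none
  integer(atomic_int_kind), allocatable :: flag(:)[:]
  integer(atomic_int_kind) :: val
  logical, allocatable :: arrived(:)
  integer :: i, n
  integer(int64) :: t0, t1, rate
  real, parameter :: timeout_s = 2.0   ! hypothetical timeout in seconds

  n = num_images()
  allocate (flag(n)[*])
  flag = 0
  sync all

  if (this_image() /= 1) then
    ! Atomic data transfer to image 1: "I am still alive".
    call atomic_define(flag(this_image())[1], 1_atomic_int_kind)
  else
    allocate (arrived(n))
    arrived = .false.
    arrived(1) = .true.
    call system_clock(t0, rate)
    do
      ! Poll the local flags that the remote images write into.
      do i = 2, n
        call atomic_ref(val, flag(i))
        if (val == 1) arrived(i) = .true.
      end do
      if (all(arrived)) exit
      call system_clock(t1)
      if (real(t1 - t0) / real(rate) > timeout_s) exit
    end do
    ! Whatever is still missing is either delayed or failed.
    do i = 2, n
      if (.not. arrived(i)) print *, 'no transfer from image', i
    end do
  end if
end program missing_transfer_sketch
```

A single timeout cannot tell a delayed transfer from a failed image, which is why the refinement step matters before any of the remaining images is treated as failed.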

Thus, we already have a set of images (by image number) that we want to be treated as failed images, but we cannot yet pass it to the run-time, because the FAIL IMAGE statement fails only the executing image. In principle, all we would need is something like an intrinsic FAIL_IMAGE(image) (if such a thing could be implemented at all, or some other mechanism) as a means to hand the results of our customized failed-image detection to the run-time so that, as you say, the current team could reconfigure itself to exclude the (customized) failed images from further execution with image control statements. Then we could use a customized recovery process as well.
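For illustration, the closest thing I can write with today's FAIL IMAGE is a cooperative workaround: the detecting image atomically sets a flag on the suspect image, and the suspect fails itself once it sees the flag. This is only a sketch under the assumption that the compiler supports FAIL IMAGE and FAILED_IMAGES; `please_fail` and `suspect` are hypothetical names, and the choice of the last image as suspect merely stands in for the customized detection.

```fortran
program cooperative_fail_sketch
  use, intrinsic :: iso_fortran_env, only: atomic_int_kind, stat_failed_image
  implicit none
  integer(atomic_int_kind) :: please_fail[*]   ! hypothetical request flag
  integer(atomic_int_kind) :: val
  integer :: suspect, stat

  please_fail = 0
  sync all

  if (this_image() == 1 .and. num_images() > 1) then
    suspect = num_images()   ! stand-in for the customized detection result
    call atomic_define(please_fail[suspect], 1_atomic_int_kind)
  end if
  sync all

  ! FAIL IMAGE affects only the image that executes it, so the suspect
  ! image has to cooperate and fail itself.
  call atomic_ref(val, please_fail)
  if (val == 1) fail image

  sync all (stat=stat)
  if (stat == stat_failed_image) then
    print *, 'image', this_image(), 'sees failed images:', failed_images()
  end if
end program cooperative_fail_sketch
```

The obvious limitation is that an image that is really hung or dead never reaches the check, which is exactly why something like FAIL_IMAGE(image) on the run-time side would be needed.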