Refining Fortran Failed Images

The slides are available.

2020 IEEE/ACM Fifth International Workshop on Extreme Scale Programming Models and Middleware (ESPM2)
by Nathan Weeks and Glenn Luecke
Publication Year: 2020
The Fortran 2018 standard introduced syntax and semantics that allow a parallel application to recover from failed images (fail-stop processes) during execution. Teams are a key new language feature that facilitates this capability for applications that use collective subroutines: when a team of images is partitioned into one or more sets of new teams, only active images comprise the new teams; failed images are excluded. This paper summarizes the language facilities for handling failed images specified in the Fortran 2018 standard and subsequent interpretations by the US Fortran Programming Language Standards Technical Committee. We propose standardizing some semantics that have been left processor (implementation) dependent to enable the development of portable fault-tolerant parallel Fortran applications. Finally, we present a prototype implementation of a substantial subset of the Fortran 2018 failed images functionality, including semantic changes proposed herein. This prototype comprises OpenCoarrays, with failed-images enhancements constructed using Open MPI ULFM routines, and a GFortran compiler customized to support additional syntax needed to enable fault-tolerant execution of image control statements.
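
The team-based recovery pattern described in the abstract can be sketched roughly as follows. This is a minimal illustration that uses only standard Fortran 2018 features (FORM TEAM, CHANGE TEAM, FAILED_IMAGES, CO_SUM); it is not taken from the paper or its prototype. The program and variable names are hypothetical, and it assumes a compiler and runtime that support teams and failed images.

```fortran
program failed_images_sketch
  use, intrinsic :: iso_fortran_env, only: team_type, stat_failed_image
  implicit none

  type(team_type) :: active_team          ! hypothetical name for the new team
  integer :: status, n_failed, contribution

  ! Number of images in the current team that are known to have failed.
  n_failed = size(failed_images())

  ! Place every image in team number 1. Because STAT= is present, a failed
  ! image is reported through status instead of causing error termination;
  ! the new team is then formed from the active images only.
  form team (1, active_team, stat=status)
  if (status /= 0 .and. status /= stat_failed_image) error stop "form team failed"

  change team (active_team)
    ! Collective over the new team: each active image contributes 1.
    contribution = 1
    call co_sum(contribution, stat=status)
    if (status == 0 .and. this_image() == 1) then
      print *, "active images:", contribution, " failed images:", n_failed
    end if
  end team
end program failed_images_sketch
```

In this sketch an image that fails after the team has been formed is handled only at the collective call; an application would typically also pass STAT= to CHANGE TEAM and form a fresh team when STAT_FAILED_IMAGE is reported.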