Out-of-core, Checkpointing, Fault Tolerance

The recent emphasis on desirable features for future Fortran has been on generics and tidy-ups of existing syntax with new semantics. If we follow that route, Fortran will be playing catch-up to C++, and it is already quite far behind. In my opinion, Fortran needs to chase a different frontier: the largest-scale computations. What language/runtime features are needed there? I found this video very interesting and informative on the challenges that a large-scale (in fact, record-breaking) computation had to face and how the developers constructed their own solutions to those challenges. As a result, we already have a de facto implementation of the desired features; what remains is tying it to new Fortran language constructs and implementing them in a compiler. I would love to hear your thoughts.

4 Likes

I have not looked at the video yet, but large-scale calculations certainly seem a popular direction of development. Here is an article that was discussed in the context of Mathematics of Arrays (see the thread by that name for some references) a few weeks ago - Slack
And I know from practical developments at my own institute/company that computations keep getting larger - some phenomena that we want to describe require very detailed grids so that all the relevant physics is resolved.

Interesting video and interesting topic overall, IMO. The moment I read "checkpointing" and "out-of-core computation", the HDF5 library sprang to mind. I have used it successfully for both, but mostly for checkpointing solutions on unstructured and adapted 6D meshes.

1 Like

I agree. It seems focusing on HPC features could be Fortran's differentiator from C++, whose goal is flexibility. Fortran could be a simpler, less flexible, but faster and more stable language for HPC. Here are two ideas:

  1. A facility to query device (CPU/accelerator) properties would be good. For example, querying how much RAM is on the machine, the size of the L1/L2 caches, etc. Knowing these can make a significant difference for micro-optimization. To keep it somewhat hardware-agnostic, it could be something like ProcessorData(info), letting each compiler decide what info it can provide. It is true that metaprogramming could be used instead, to query this information and write appropriate Fortran source files to compile.

  2. The idea of checkpointing is very nice. If Fortran went this route as a language, it could eliminate a lot of the code we write to make runs restartable. If there were, say, a SaveState(filename, istat) intrinsic that saved to disk all of the information necessary to restart the calculation (and could be invoked separately on different images or teams?), and that ran asynchronously so as not to interrupt program flow too much, that would be great. Each saved state could have a unique ID attached to it, such that the programmer may only need to write, at the beginning of the program:

if (exists(save_filename)) then
  call LoadState(save_filename, istat)
else
  ! {computation ...}
  call SaveState(save_filename, istat)  ! istat = 0: success; istat = 1: still saving asynchronously; istat = -1: failed to save. Perhaps some other values.
  ! LoadState would resume execution here, right after the 'same' SaveState that saved the state
  ! {more computation}
end if

The idea with the unique IDs is that (1) they'd be saved as part of the state, and (2) on LoadState, execution would automatically jump to the code right after the SaveState that saved this state in the first place.
In a sense, this idea is like purposefully saving a core file. It would use a large amount of disk space, but it would be automatic. Also, modern NVMe SSDs can write bulk data quite quickly, within an order of magnitude of RAM bandwidth. There is the issue of running out of disk space; the istat parameter could tell the programmer if there wasn't enough space (or the save otherwise failed) and leave it up to them how to handle it. Also, SaveState could detect allocated variables that the program no longer uses and skip saving them (to save disk space).
Perhaps the idea needs tweaking (and details would have to be filled in on how this would work for parallel programs; or maybe not, if you just designate separate filenames for separate programs), but as a programmer who uses clusters where at least one node fails fairly often, I love the idea.

Is checkpointing the same as the Corefile Resume Feature of g95? Quoting the manual:

On x86 Linux systems, the execution of a g95-compiled program can be suspended and resumed. If you interrupt a program by sending it the QUIT signal, which is usually bound to control-backslash, the program will write an executable file named dump to the current directory. Running this file causes the execution of your program to resume from when the dump was written.

andy@fulcrum:~/g95/g95 % cat tst.f90

b = 0.0
do i=1, 10
do j=1, 3000000
call random_number(a)
a = 2.0*a - 1.0
b = b + sin(sin(sin(a)))
enddo
print *, i, b
enddo
end
andy@fulcrum:~/g95/g95 % g95 tst.f90
andy@fulcrum:~/g95/g95 % a.out
1 70.01749
2 830.63153
3 987.717
4 316.48703
5 -426.53815
6 25.407673
Process dumped
7 -694.2718
8 -425.95465
9 -413.81763
10 -882.66223
andy@fulcrum:~/g95/g95 % ./dump
Restarting
............Jumping
7 -694.2718
8 -425.95465
9 -413.81763
10 -882.66223
andy@fulcrum:~/g95/g95 %

Any open files must be present and in the same places as in the original process. If you link against
other languages, this may not work. While the main use is allowing you to preserve the state of a run across a reboot, other possibilities include pushing a long job through a short queue or moving a running process to another machine. Automatic checkpointing of your program can be done by setting the environment variable G95_CHECKPOINT with the number of seconds to wait between dumps. A value of zero means no dumps. New checkpoint files overwrite old checkpoint files.

1 Like

Yes, it is the same idea, but control over when checkpointing happens rests with the code rather than with the user.

Both ways may be useful. Checkpointing (a restartable dump) on a signal seems to me a smart idea, although probably impossible to put into the standard. But I'd say it would be a most welcome extension.