To MPI IO or not to MPI IO?

Hello everyone, I would like to ask your opinions/ideas:
(The post is about MPI but I come here because we use it mainly from Fortran and I know many of you have HPC experience :slight_smile: )

My team and I are currently debating whether to change the IO strategy we use, which is basically: there are as many output files as MPI processes. For several reasons we are now strongly considering MPI IO in order to have all processes write/read to/from a single file … well, we do time-varying computations with multi-object FEM simulations … so it’s really 1 file per MPI process per object per time step, and we want to get rid of the “per MPI process” part …

So, has anyone here made that jump? If so, what was your experience in terms of development time (human resources) and of performance gains/losses? We have noticed that if we recombine the meshes into single files, we can visualize much faster: multithreaded IO on a single file seems to be more efficient for the graphical rendering than multithreading across several files. The remaining question is the possible penalty to the solver’s performance.

Thanks for any feedback!

3 Likes

Will this run on a system that has a file system such as Lustre that can support parallel IO? If not, it might be a waste of your time.

We deploy from personal workstations to clusters, on Windows and Linux.

First time I hear about this, I’ll take a closer look.

When you say a waste of time, do you mean that the development effort could be wasted by a downgrade in execution performance? On this particular point, IO is not a bottleneck; as long as it doesn’t heavily hinder the global execution time, I think the effort would be paid off by the overall workflow simplification.

Just to give you an idea, it can happen that a given simulation runs on 32 MPI processes and, for whatever good reason, the user wants the next simulation in a chained process to continue on 16 or 64… While we do have the tools to merge and repartition, this comes at a price. There are other scenarios too, but the overall idea is that (maybe) a workflow with parallel IO to a single file could improve many aspects of the user experience, not only runtime execution.

My point is that if a large increase in IO performance is your goal, then you might not get it with parallel IO unless your underlying hardware provides support for it. If you look at total solution time, letting each processor write to a dedicated file asynchronously and then merging the files in a post-processing step may actually be more efficient than MPI-IO, since each processor gets to write at its own pace. IO performance is almost always driven more by the hardware than by the software you are using. So I would be very careful about embarking on a large-scale code refactoring before I know for sure that there will be an actual benefit.

I would suggest you build a parallel version of HDF5 and write some simple test cases that mimic your IO patterns and requirements. If you see a usable improvement in performance, then by all means proceed with something like HDF5, netCDF with HDF5 parallel support, or the parallel flavor of netCDF before trying to implement your own MPI-IO routines.

If you are doing FEM, you might also look at the Exodus/Nemesis libraries from Sandia that are now open source. Exodus and Nemesis are part of the SEACAS libraries and tools you can download from here:
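
To give a feel for what such a test case might look like from Fortran, here is a rough sketch of a collective parallel-HDF5 write, assuming an MPI-enabled HDF5 build with the Fortran interface. The file name, dataset name, and block size are placeholders, and your real test should of course mimic your actual data layout:

```fortran
! Rough parallel-HDF5 test: every rank writes its contiguous slice of a
! 1-D dataset into one shared file, using collective MPI-IO underneath.
program phdf5_test
  use mpi
  use hdf5
  implicit none

  integer, parameter :: nlocal = 100000     ! placeholder block size per rank
  integer :: rank, nprocs, ierr, hdferr
  integer(HID_T)   :: plist_id, xfer_id, file_id, dset_id, filespace, memspace
  integer(HSIZE_T) :: dims(1), count(1), offset(1)
  real(8) :: buf(nlocal)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  buf = real(rank, 8)                       ! dummy data

  call h5open_f(hdferr)

  ! File access property list: open the shared file through MPI-IO
  call h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, hdferr)
  call h5pset_fapl_mpio_f(plist_id, MPI_COMM_WORLD, MPI_INFO_NULL, hdferr)
  call h5fcreate_f("test.h5", H5F_ACC_TRUNC_F, file_id, hdferr, access_prp=plist_id)
  call h5pclose_f(plist_id, hdferr)

  ! One global dataset; each rank selects its own hyperslab
  dims(1)   = int(nlocal, HSIZE_T) * nprocs
  count(1)  = nlocal
  offset(1) = int(nlocal, HSIZE_T) * rank
  call h5screate_simple_f(1, dims, filespace, hdferr)
  call h5dcreate_f(file_id, "solution", H5T_NATIVE_DOUBLE, filespace, dset_id, hdferr)
  call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, count, hdferr)
  call h5screate_simple_f(1, count, memspace, hdferr)

  ! Request a collective data transfer
  call h5pcreate_f(H5P_DATASET_XFER_F, xfer_id, hdferr)
  call h5pset_dxpl_mpio_f(xfer_id, H5FD_MPIO_COLLECTIVE_F, hdferr)
  call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buf, count, hdferr, &
                  mem_space_id=memspace, file_space_id=filespace, xfer_prp=xfer_id)

  call h5pclose_f(xfer_id, hdferr)
  call h5sclose_f(memspace, hdferr); call h5sclose_f(filespace, hdferr)
  call h5dclose_f(dset_id, hdferr);  call h5fclose_f(file_id, hdferr)
  call h5close_f(hdferr)
  call MPI_Finalize(ierr)
end program phdf5_test
```

Time something like this against your current file-per-process scheme (including any merge step) on the actual file systems you deploy to before committing to a refactoring.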

Edit

I should have mentioned that Nemesis was at one time a standalone collection of routines to read/write Exodus files in parallel. The Nemesis functions have now been merged into the Exodus software. In addition, there are several tools for splitting Exodus files into multiple files for parallel runs and, more importantly for the OP, for merging or joining the parallel files back into a single Exodus database. The previous Nemesis merge tool NEM_JOIN has been replaced by a routine called EPU. There are also several other tools for merging and splitting Exodus files, plus extracting Exodus data to MATLAB and other formats. Plus, I believe that Gmsh can create Exodus grid files (but don’t quote me on that), and I think ParaView and maybe LLNL’s VisIt program can display Exodus files. Note that the Exodus Fortran interfaces are still C wrappers. A few years ago I wrote a complete set of Fortran C-interop interfaces for Exodus, but they were never really tested beyond reading some simple grid files. If I get some time I’ll revisit my interfaces and see if they are in shape to release upon an unsuspecting world.

Also, Exodus files are really netCDF files, which, depending on how you build netCDF, are HDF5 files underneath.

2 Likes

Thanks so much for all the information @rwmsu, we will take a closer look at all of these; hopefully I can come back later with some lessons learned :slight_smile:

Perhaps you should also consider the master-process approach, i.e. collecting all the data on one MPI process and then writing from that process into a single file, as sketched below. We used this strategy in our code (atmosphere modeling) for I/O with <1000 processes and 1-2 GB files, and we only had to switch to parallel I/O using netCDF when the file size and the number of processes increased even further.
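
A rough sketch of that pattern (sizes and file name are made up): variable-sized blocks are gathered to rank 0 with MPI_Gatherv, which then writes one unformatted stream file serially:

```fortran
program gather_write
  use mpi
  implicit none
  integer :: rank, nprocs, ierr, i, u, nlocal
  integer, allocatable :: counts(:), displs(:)
  real(8), allocatable :: local(:), global(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  nlocal = 1000 + rank                          ! locally sized dummy data
  allocate(local(nlocal)); local = real(rank, 8)
  allocate(counts(nprocs), displs(nprocs)); counts = 0; displs = 0

  ! Root learns how much data each rank holds
  call MPI_Gather(nlocal, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, &
                  0, MPI_COMM_WORLD, ierr)

  if (rank == 0) then
     do i = 2, nprocs
        displs(i) = displs(i-1) + counts(i-1)
     end do
     allocate(global(sum(counts)))
  else
     allocate(global(1))                        ! dummy buffer on non-root ranks
  end if

  ! Collect everything on rank 0 ...
  call MPI_Gatherv(local, nlocal, MPI_DOUBLE_PRECISION, &
                   global, counts, displs, MPI_DOUBLE_PRECISION, &
                   0, MPI_COMM_WORLD, ierr)

  ! ... which then writes the single file serially
  if (rank == 0) then
     open (newunit=u, file='out.dat', form='unformatted', access='stream')
     write (u) global
     close (u)
  end if

  call MPI_Finalize(ierr)
end program gather_write
```

The obvious limits are the memory and bandwidth of the root process, which is why this only scales so far.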

We already tried that and the results were not good, so we’d rather avoid this approach. We might reconsider it, but for the moment we leave it as a last resort.

I have no experience with MPI IO myself, but it was recently added in the geometry reading phase of a HPC code I follow (a lattice Boltzmann fluid solver for sparse geometries), and produced much better performance at least for small or moderate rank counts. (For high rank counts there appears to be a different bottleneck in this code.)

There are a few items concerning MPI I/O in the List of Best Practices from the European Performance Optimisation and Productivity (POP) Centre of Excellence:

In EXCELLERAT (another European engineering and HPC-related project), their Best Practice Guide (PDF, 1.9 MB) states the following:

Parallel I/O

Libraries for parallel input and output, such as HDF5 and NetCDF, exist and should be used.
They are built on top of MPI I/O and provide higher level interfaces. As well as providing
performance benefits over serial I/O they make some simulations possible that would otherwise
be impossible. This is because serial output is implemented by gathering all of the data in a
single MPI process and then writing it from there. The amount of memory available on a node
may not be large enough to allow this. Similar considerations apply to serial input.

Checkpointing should use the same I/O mechanism as the rest of the I/O in the program. A
single checkpoint only is required to restart a simulation. Checkpoints consume disc space and
it is usually desirable to delete them as soon as possible. However, to allow for the possibility
of a program terminating whilst writing a checkpoint, the previous checkpoint should not be
deleted until the new checkpoint has been completed.
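
For a feel of the raw MPI-IO layer these libraries sit on, here is a minimal Fortran sketch (the file name, block size, and the 8-bytes-per-element assumption are mine) in which every rank writes its contiguous block at a rank-dependent offset with a single collective call:

```fortran
program mpiio_test
  use mpi
  implicit none
  integer, parameter :: nlocal = 100000          ! placeholder block size
  integer :: rank, ierr, fh
  integer(kind=MPI_OFFSET_KIND) :: disp
  real(8) :: buf(nlocal)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  buf = real(rank, 8)                            ! dummy data

  call MPI_File_open(MPI_COMM_WORLD, 'out.dat', &
                     MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)

  ! Byte offset of this rank's block (assumes 8 bytes per real(8) element)
  disp = int(rank, MPI_OFFSET_KIND) * nlocal * 8_MPI_OFFSET_KIND

  ! Collective write: all ranks participate, one shared file
  call MPI_File_write_at_all(fh, disp, buf, nlocal, MPI_DOUBLE_PRECISION, &
                             MPI_STATUS_IGNORE, ierr)

  call MPI_File_close(fh, ierr)
  call MPI_Finalize(ierr)
end program mpiio_test
```

The higher-level libraries take care of the offset bookkeeping, metadata, and portability of the file format for you, which is why they are usually recommended over hand-rolled MPI-IO like this.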

One more resource I’m aware of are the Best practices for parallel IO and MPI-IO hints (PDF, 604 KB) by Philippe Wautelet from CNRS - IDRIS (Institute for Development and Resources in Intensive Scientific Computing).

Further resources can be found in materials from the PRACE Course: Advanced Topics in High Performance Computing, courtesy of the Leibniz Supercomputing Centre (LRZ) and the Erlangen Regional Computing Centre (RRZE). The following slide decks are relevant:

The following slide (taken from the second slide deck) provides a nice overview of the I/O Software Stack:

You can find dozens more presentations just by searching for “<name of computing centre> + mpi io”. For example, this way I found another slide deck on Advanced MPI and Parallel I/O (PDF, 25.6 MB) provided by CSC in Finland, which hosts the LUMI supercomputer. (The I/O slides start on page 43.) They show a similar software stack diagram:

Anyways, I hope you find some of the resources useful. As @rwmsu has said, you will generally need hardware that supports parallel I/O to see an improvement. In one of the slide decks I linked, they also mention this:

Generally, use of MPI I/O is often limited to special file systems; do not expect it to work on your average NFS-mounted $HOME [directory].

4 Likes

Brilliant! thanks @ivanpribec for all those resources!!

This quote from Parallel I/O with HDF5 and Performance Tuning Techniques, page 7, just captures the general feeling:


:grin:

You can find some good lectures on YouTube too.

From ARCHER:

From ANL (part of the playlist on Data Intensive Computing and I/O):

From the Erlangen National High Performance Computing Center (NHR@FAU):

1 Like

I noticed this paper just got accepted a few days ago:

Here’s the full abstract:

The high-performance computing (HPC) I/O stack has been complex due to multiple software layers, the inter-dependencies among these layers, and the different performance tuning options for each layer. In this complex stack, the definition of an “I/O access pattern” has been re-appropriated to describe what an application is doing to write or read data from the perspective of different layers of the stack, often comprising a different set of features. It has become common having to redefine what is meant when discussing a pattern in every new study as no assumption can be made. This survey aims to propose a baseline taxonomy, harnessing the I/O community’s knowledge over the last 20 years. This definition can serve as a common ground for HPC I/O researchers and developers to apply known I/O tuning strategies and design new strategies for improving I/O performance. We seek to summarize and bring a consensus with the multiple ways to describe a pattern based on common features already used by the community over the years.

Curiously, although they mention stdio (the C file interface) in the article, there is no mention of Fortran at all. I wonder how Fortran unformatted file I/O fits into the HPC picture today along with co-arrays.

1 Like

I did a small search and found nothing about I/O with coarrays. I have the feeling that it could be extremely interesting and useful to have “cowrite” and “coread” intrinsic procedures.
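
Until something like that exists, the gather-to-one-image pattern can at least be emulated by hand with current coarray features. A toy sketch (not tuned for performance, sizes made up) where image 1 pulls each image’s block through the coarray and writes a single stream file:

```fortran
program coarray_gather_write
  implicit none
  integer, parameter :: nlocal = 1000
  real :: buf(nlocal)[*]            ! each image holds its own block
  integer :: img, u

  buf = real(this_image())          ! dummy local data

  sync all                          ! make sure every block is ready
  if (this_image() == 1) then
     open (newunit=u, file='out.dat', form='unformatted', access='stream')
     do img = 1, num_images()
        write (u) buf(:)[img]       ! image 1 pulls each block and writes it
     end do
     close (u)
  end if
end program coarray_gather_write
```

Of course this has the same drawback as the MPI master-process approach: everything funnels through a single image, so it says nothing about how a true parallel “cowrite” could perform.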

Edit: this blog/tutorial Blog articles using Fortran - #5 by vmagnin deserves a mention in this thread