I have no experience with MPI I/O myself, but it was recently added to the geometry-reading phase of an HPC code I follow (a lattice Boltzmann fluid solver for sparse geometries), where it produced much better performance, at least for small to moderate rank counts. (For high rank counts there appears to be a different bottleneck in that code.)
There are a few items concerning MPI I/O in the List of Best Practices from the European Performance Optimisation and Productivity (POP) Centre of Excellence:
In EXCELLERAT (another European engineering and HPC-related project), their Best Practice Guide (PDF, 1.9 MB) states the following:
Parallel I/O

Libraries for parallel input and output, such as HDF5 and NetCDF, exist and should be used. They are built on top of MPI I/O and provide higher-level interfaces. As well as providing performance benefits over serial I/O, they make some simulations possible that would otherwise be impossible. This is because serial output is implemented by gathering all of the data in a single MPI process and then writing it from there. The amount of memory available on a node may not be large enough to allow this. Similar considerations apply to serial input.

Checkpointing should use the same I/O mechanism as the rest of the I/O in the program. Only a single checkpoint is required to restart a simulation. Checkpoints consume disc space and it is usually desirable to delete them as soon as possible. However, to allow for the possibility of a program terminating whilst writing a checkpoint, the previous checkpoint should not be deleted until the new checkpoint has been completed.
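To make the quoted point about serial vs. parallel output concrete, here is a minimal sketch (my own, not taken from any of the guides above) of a collective MPI I/O write in C: each rank writes its own contiguous block of a shared file, so no single process ever has to gather the whole data set. The file name, block size, and data are made up for illustration:

```c
#include <mpi.h>

#define LOCAL_N 1000   /* number of doubles owned by each rank (made up for the example) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Placeholder local data; in a real code this would be the rank's
     * part of the simulation field. */
    double local[LOCAL_N];
    for (int i = 0; i < LOCAL_N; ++i)
        local[i] = (double)rank;

    /* All ranks open the shared file together. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "field.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Byte offset of this rank's contiguous block in the shared file. */
    MPI_Offset offset = (MPI_Offset)rank * LOCAL_N * sizeof(double);

    /* Collective write: every rank writes its own block, so no single
     * process ever needs to hold the whole data set in memory. */
    MPI_File_write_at_all(fh, offset, local, LOCAL_N, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```

The collective call (`MPI_File_write_at_all`) is what lets the MPI library aggregate the per-rank requests into fewer, larger file accesses, which is typically where the performance benefit over independent per-rank writes comes from.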
One more resource I’m aware of is Best practices for parallel IO and MPI-IO hints (PDF, 604 KB) by Philippe Wautelet from CNRS - IDRIS (Institute for Development and Resources in Intensive Scientific Computing).
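As a rough illustration of what such hints look like in code (the hint names below are ROMIO conventions; which hints are honoured, and what effect they have, depends on the MPI implementation and the file system, and unsupported hints are simply ignored), the `MPI_File_open` call from the sketch above can be given an `MPI_Info` object instead of `MPI_INFO_NULL`:

```c
/* Sketch only: pass MPI-IO hints via an MPI_Info object when opening the file. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "romio_cb_write", "enable");   /* force collective buffering for writes */
MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MiB aggregation buffer */

MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "field.dat",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
/* ... collective writes as in the previous sketch ... */
MPI_File_close(&fh);
MPI_Info_free(&info);
```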
Further resources can be found in materials from the PRACE Course: Advanced Topics in High Performance Computing, courtesy of the Leibniz Supercomputing Centre (LRZ) and the Erlangen Regional Computing Centre (RRZE). The following slide decks are relevant:
The following slide (taken from the second slide deck) provides a nice overview of the I/O Software Stack:
You can find dozens more presentations just by searching for “<name of computing centre> + mpi io”. For example, this way I found another slide deck on Advanced MPI and Parallel I/O (PDF, 25.6 MB) provided by CSC in Finland, which hosts the LUMI supercomputer. (The I/O slides start on page 43.) They show a similar software stack diagram:
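To show what the top of that stack looks like from application code, here is a minimal sketch of a collective write through HDF5’s MPI-IO file driver (again my own example with made-up file and dataset names; it assumes an HDF5 build with parallel support):

```c
#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Create the file collectively through HDF5's MPI-IO file driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("snapshot.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One global 1-D dataset; each rank owns a contiguous slice of 4 values. */
    const hsize_t local_n = 4;
    hsize_t global_dims[1] = { local_n * (hsize_t)nranks };
    hid_t filespace = H5Screate_simple(1, global_dims, NULL);
    hid_t dset = H5Dcreate(file, "u", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Select this rank's slice in the file ... */
    hsize_t offset[1] = { local_n * (hsize_t)rank };
    hsize_t count[1]  = { local_n };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(1, count, NULL);

    /* ... and write it with a collective transfer. */
    double data[4];
    for (int i = 0; i < 4; ++i) data[i] = (double)rank;
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, data);

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Pclose(fapl);
    H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```

NetCDF-4’s parallel mode sits on the same foundation, since it uses HDF5 underneath.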
Anyway, I hope you find some of these resources useful. As @rwmsu has said, you will generally need hardware that supports parallel I/O to see an improvement. One of the slide decks I linked also mentions this:
Generally, use of MPI I/O is often limited to special file systems; do not expect it to work on your average NFS-mounted $HOME [directory].