I noticed that the File I/O Best Practices webpage makes no mention of binary writes/reads to/from files.
I am not personally aware of a best practice for this, but I think it would be useful to include, whether for increased I/O speed, reduced storage requirements, easier file transfers, etc. A common issue I encounter is I/O or storage bottlenecks, both presumably common enough in HPC environments.
I’d want to see all the information needed to make a binary file readable/writable in a portable manner (across compilers and machines; for example, a file created in an HPC environment that I want to bring back to my local machine), which I believe requires querying record lengths when writing and using the same ones when reading.
If this is too specific, perhaps a “Fortran Best Practices for HPC” guide could be made?
Good idea. Unformatted stream I/O is fast and simple to use and should be mentioned. From the File I/O section of my list of Fortran codes I see that HDF5 and NetCDF are used by many Fortraners, so they should probably be discussed.
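For reference, a minimal sketch of what unformatted stream I/O looks like (file name and array size are just illustrative): no record markers are written, only the raw bytes, so the same data can be read straight back.

```fortran
program stream_demo
  use iso_fortran_env, only: real64
  implicit none
  real(real64) :: a(1000), b(1000)
  integer :: u
  call random_number(a)
  ! write: raw bytes only, no record markers
  open(newunit=u, file='data.bin', access='stream', form='unformatted', &
       status='replace')
  write(u) a
  close(u)
  ! read the same bytes back
  open(newunit=u, file='data.bin', access='stream', form='unformatted', &
       status='old', action='read')
  read(u) b
  close(u)
  print *, 'round trip ok:', all(a == b)
end program stream_demo
```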
Lately I had to write a binary file that would be read back by another Fortran program, both under my control.
So I decided to write binary files with access='stream', giving all the variables a fixed-size kind such as real64 from iso_fortran_env.
Moreover, I wrote a control string at the beginning and at the end of the file, so I can raise an error if the file is not in the format I have chosen.
I think this is important: if you store the length of your arrays in the binary file, you should be certain that the file is in the format you have chosen and not another one, or you may end up allocating an array with a huge bogus size that may slow your computer to a crawl. But I may just be fixated on this idea, so your choice.
I’m using stream access so I don’t have to worry about how a particular compiler stores the record length.
Moreover, having fixed the size of the variables in bits, I’m more on the safe side when reading the file back, even after some time or with a program compiled with a different compiler. The endianness will still depend on the particular machine, but that’s all I can do. A sketch of this scheme is below.
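A minimal sketch of the scheme, assuming a hypothetical MYFMT001 magic string and an arbitrary cap on plausible array sizes:

```fortran
program tagged_binary
  use iso_fortran_env, only: real64, int64
  implicit none
  character(len=8), parameter :: magic = 'MYFMT001'     ! hypothetical format tag
  integer(int64), parameter :: max_n = 100000000_int64  ! refuse absurd sizes
  character(len=8) :: tag
  integer(int64) :: n
  real(real64), allocatable :: x(:)
  integer :: u

  ! --- write: magic, length, payload, magic ---
  allocate(x(1000)); call random_number(x)
  open(newunit=u, file='data.bin', access='stream', form='unformatted', &
       status='replace')
  write(u) magic, size(x, kind=int64), x, magic
  close(u)
  deallocate(x)

  ! --- read back, validating before allocating ---
  open(newunit=u, file='data.bin', access='stream', form='unformatted', &
       status='old', action='read')
  read(u) tag
  if (tag /= magic) stop 'not a MYFMT001 file'
  read(u) n
  if (n < 1 .or. n > max_n) stop 'implausible array length'
  allocate(x(n))
  read(u) x, tag
  if (tag /= magic) stop 'truncated or corrupted file'
  close(u)
end program tagged_binary
```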
In another situation I wanted to read the data back using Python, so I decided to use NetCDF, taking care to produce a file compatible with xarray. xarray is a Python package that can read a NetCDF file with a single command (provided the file uses a particular layout among the various possibilities NetCDF offers).
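For the Fortran side, a hedged sketch using the netCDF-Fortran 90 interface (the nf90_* routines; file and variable names are illustrative). A coordinate variable with the same name as its dimension, plus a units attribute, is the kind of layout xarray.open_dataset() loads directly:

```fortran
program write_netcdf
  use iso_fortran_env, only: real64
  use netcdf
  implicit none
  integer, parameter :: n = 10
  integer :: ncid, x_dimid, x_varid, t_varid, i
  real(real64) :: x(n), temp(n)

  x    = [(real(i, real64), i = 1, n)]
  temp = 300.0_real64 + x

  call check( nf90_create('example.nc', NF90_CLOBBER, ncid) )
  call check( nf90_def_dim(ncid, 'x', n, x_dimid) )
  ! coordinate variable named after its dimension, so xarray picks it up
  call check( nf90_def_var(ncid, 'x', NF90_DOUBLE, x_dimid, x_varid) )
  call check( nf90_def_var(ncid, 'temperature', NF90_DOUBLE, x_dimid, t_varid) )
  call check( nf90_put_att(ncid, t_varid, 'units', 'K') )
  call check( nf90_enddef(ncid) )
  call check( nf90_put_var(ncid, x_varid, x) )
  call check( nf90_put_var(ncid, t_varid, temp) )
  call check( nf90_close(ncid) )
contains
  subroutine check(status)
    integer, intent(in) :: status
    if (status /= nf90_noerr) then
       print *, trim(nf90_strerror(status))
       stop 1
    end if
  end subroutine check
end program write_netcdf
```

Then `xarray.open_dataset('example.nc')` should pick up the coordinate and the variable without further ceremony.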
Endianness is a key thing that complicates matters. Fortran programs on different systems may use a different order for storing the higher- and lower-order bytes, i.e. one system may store the bytes of a 64-bit real in order abcdefgh, while the other expects them in order hgfedcba, completely scrambling the values read back in. It’s possible to correct for this, but you need to be aware of the possibility; see the sketch below.
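You can at least detect the native byte order at run time with transfer(), as in this minimal sketch; several compilers (e.g. gfortran and ifort) also accept a nonstandard convert= specifier on open (convert='big_endian', 'little_endian', or 'swap') that byte-swaps unformatted I/O for you:

```fortran
program endian_check
  use iso_fortran_env, only: int8, int32
  implicit none
  integer(int32) :: one = 1
  integer(int8)  :: bytes(4)
  ! reinterpret the 4 bytes of a 32-bit integer holding the value 1
  bytes = transfer(one, bytes)
  if (bytes(1) == 1_int8) then
     print *, 'little endian: least significant byte first'
  else
     print *, 'big endian: most significant byte first'
  end if
end program endian_check
```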
For many years I have written my own random-access, variable-length-record binary file I/O (based on CDC openms). It was based on superimposing my logical records onto a Fortran fixed-length-record, random-access file.
More recently I have changed to stream access to increase the addressable file size and remove unnecessary buffering, as the OS now provides adequate file I/O buffering and larger records on 64-bit systems. Fortran stream access is a more flexible and preferable approach to the file system, though the increase in flexibility is minimal.
Although I have shared files between different hardware and compilers, I have never had to deal with endianness in these transfers. That is the main hurdle for file portability.
What I do notice is that, for all the care taken to provide record headers/footers to cope with record sizes, in 40 years I have never seen errors in record reads/writes, apart from catastrophic loss of disk access. I wonder if all the wrapping and checking has been necessary?
It probably goes back to magnetic tape usage, but disk access is much more reliable than the history of record wrapping implies.
Writing binary I/O libraries in Fortran has also had problems with mixed-type arguments, though that has been a problem with compiler usage, not file usage.
Now, with newer, faster disks, a text file format can be a much easier format to use and inspect, provided the issue of accuracy of floating-point reals is addressed (see the sketch below).
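On that accuracy point: 17 significant decimal digits are enough to round-trip any IEEE 64-bit real through text, so a format such as es24.16e3 loses nothing. A small sketch:

```fortran
program text_roundtrip
  use iso_fortran_env, only: real64
  implicit none
  real(real64) :: x, y
  character(len=24) :: buf
  x = 4.0_real64 * atan(1.0_real64)    ! pi
  write(buf, '(es24.16e3)') x          ! 17 significant digits
  read(buf, *) y
  print *, 'exact round trip:', x == y
end program text_roundtrip
```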
Recent focus on high-speed disk transfer rates often ignores the practical limits of data processing rates.
@egio
That could be a significant achievement, as little-endian formats offer a number of advantages, such as overlaying 1-, 2-, 4- and 8-byte integers without any problems. I am not familiar with what is required for big-endian memory addressing, apart from assuming it would have a lot of problems.
Big-endian formats may require more information about the type and kind of the integer variables and arrays being written. Different real kinds have this issue under either endianness.
I’m not sure what the initial reason for the record markers in Fortran I/O was; maybe it had to do with the reliability of the physical devices at the time, but maybe also with faster seeking in a file even when the records did not have a fixed length (so that no true random access is possible).
The record markers are those I created for my variable-length-record, random-access file structure, and were typical of record structures at the time. I basically superimposed a Fortran unformatted record structure (where the header/footer is the record size) onto a Fortran fixed-length-record, random-access file, where the fixed-length records are buffered. The header/footer can be used to regenerate the file index, although I maintain the index in memory and write it to an index file.
What I was trying to identify was that, for all the error checking that takes place on binary files, actual read/write errors are virtually non-existent, i.e. disk files are a very reliable medium.
If netCDF / HDF5 copes with endianness, it must have a very sophisticated record structure for type and kind, to cope with this aspect of the OS and possibly a preferred endianness when storing data.
Fortran stream I/O provides a file access method that lets Fortran users easily implement their own file database structure.
The record markers in Fortran unformatted I/O contain the lengths of the records. That length information allows records to be skipped over, or partially read and then skipped, and allows BACKSPACE to work. A sketch of walking these markers is below.
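As an illustration (not portable: the marker layout is compiler-specific, not standardized), here is a sketch that walks the records of a sequential unformatted file via stream access, assuming the common 4-byte length markers that gfortran and ifort use for records under 2 GiB:

```fortran
program walk_records
  use iso_fortran_env, only: int32
  implicit none
  integer(int32) :: reclen, trailer
  integer :: u, ios, nrec, pos
  open(newunit=u, file='data.unf', access='stream', form='unformatted', &
       status='old', action='read')
  pos  = 1
  nrec = 0
  do
     read(u, pos=pos, iostat=ios) reclen       ! leading length marker
     if (ios /= 0) exit                        ! end of file
     read(u, pos=pos + 4 + reclen) trailer     ! trailing marker, same value
     if (trailer /= reclen) stop 'inconsistent record markers'
     nrec = nrec + 1
     pos  = pos + reclen + 8                   ! skip payload plus both markers
  end do
  print *, 'records found:', nrec
end program walk_records
```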
I still feel like writing data to different records could be parallelized, rather than simply writing to different files (something some clusters really don’t like).
> If netCDF / HDF5 copes with endianness, it must have a very sophisticated record structure for type and kind, to cope with this aspect of the OS and possibly a preferred endianness when storing data.
I’ve heard the HDF5 documentation/standard is quite obscure and long, which is given as a reason there’s only one implementation of it out there.
Parallel netCDF, parallel HDF5. The point really being that libraries like netCDF and the HDF storage layer it sits on top of are meant to solve many issues of inter-platform compatibility, along with making data more self-describing. Of course, they are both oriented around specific data models; netCDF, for example, is really aimed at gridded data.