GSoC '25: stdlib filesystem

A while back I wrote a light wrapper around MPI-IO for single-file, multi-process I/O. It might align with what you describe, though it would probably need some adaptation.

Now, this might be quite beyond the scope of the current GSoC project.

If you are interested in having such a parallel I/O library polished, either within stdlib or as a small standalone Fortran package, let me know and we can discuss it. This thread contains many good resources on the topic: MPI IO or not to MPI IO?

OK, I see. My misunderstanding was coming from the terminology: it’s rather a (simulated) “ramfile”, not a “ramdisk”.

Note that something like this (which is a user-defined statement) is not possible:

write_stdlib(50,rec=rec) x,y,z

It would have to be a generic routine with much less flexibility:

call write_stdlib(50, rec=rec, x=x)
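To make the "much less flexibility" point concrete, here is a minimal sketch of what such a generic routine could look like. Everything in it is an assumption made for illustration: the module name ramfile_sketch, the read_stdlib counterpart, the fixed number of records, and the byte-copy storage via TRANSFER are not an existing stdlib API. Note that one specific procedure is needed per data type (and per rank), and that a mixed list such as x,y,z cannot be passed in a single call.

```fortran
module ramfile_sketch
  ! Hypothetical sketch: none of these names is an existing stdlib API.
  implicit none
  private
  public :: write_stdlib, read_stdlib

  integer, parameter :: max_rec = 100
  character(len=1), parameter :: byte_mold(1) = ' '

  type :: record_t
    character(len=1), allocatable :: bytes(:)   ! raw copy of one record
  end type record_t

  type(record_t), save :: ramfile(max_rec)      ! a single in-memory "file"

  ! One specific procedure per supported type: this is the lost flexibility.
  interface write_stdlib
    module procedure write_real, write_int
  end interface write_stdlib

  interface read_stdlib
    module procedure read_real, read_int
  end interface read_stdlib

contains

  subroutine write_real(unit, rec, x)
    integer, intent(in) :: unit, rec   ! unit is ignored: only one ramfile here
    real, intent(in) :: x(:)
    ramfile(rec)%bytes = transfer(x, byte_mold)
  end subroutine write_real

  subroutine write_int(unit, rec, x)
    integer, intent(in) :: unit, rec
    integer, intent(in) :: x(:)
    ramfile(rec)%bytes = transfer(x, byte_mold)
  end subroutine write_int

  subroutine read_real(unit, rec, x)
    integer, intent(in) :: unit, rec
    real, intent(out) :: x(:)
    x = transfer(ramfile(rec)%bytes, 0.0, size(x))
  end subroutine read_real

  subroutine read_int(unit, rec, x)
    integer, intent(in) :: unit, rec
    integer, intent(out) :: x(:)
    x = transfer(ramfile(rec)%bytes, 0, size(x))
  end subroutine read_int

end module ramfile_sketch

program ramfile_demo
  use ramfile_sketch
  implicit none
  real :: x(3), x_back(3)
  integer :: n(2), n_back(2)

  x = [1.0, 2.0, 3.0]
  n = [7, 8]
  call write_stdlib(50, rec=1, x=x)   ! resolves to write_real
  call write_stdlib(50, rec=2, x=n)   ! resolves to write_int
  call read_stdlib(50, rec=1, x=x_back)
  call read_stdlib(50, rec=2, x=n_back)
  if (any(x_back /= x) .or. any(n_back /= n)) error stop 'round trip failed'
  print *, 'round trip ok'
end program ramfile_demo
```

A real library would also have to track units, record sizes and types, which the intrinsic WRITE statement handles for free.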

Anyway, I am still a bit skeptical about the final purpose. If it was just keeping stuff in RAM I wouldn’t see the point, as instead of writing x,y,z to the ramfile you would just have to keep the x,y,z objects alive (or copy them to other objects). So I guess that the main motivation is the transparent communication between processes to access the ramfiles, by encapsulating all the MPI stuff in the library routines? Is that correct? If yes, isn’t it somehow reinventing the coarrays?

Tying these ideas together, I think the true game changer would be a well-defined class for I/O streams that abstracts away the I/O target. It could mimic the Fortran syntax (via call io%open, call io%write, via operators, or something else). Possible targets:

  • Files (external units)
  • Memory (strings / internal units)
  • External Processes via pipes

those could have an option for compressed I/O via zlib, for example.

those could have an option for compressed I/O via zlib, for example.

I would also add BASE64 and maybe XDR encoding, BASE64 primarily for writing VTK XML binary files. Some option for encoding or compressing data is something I would like to see in Fortran’s intrinsic I/O facilities, but that will never happen in my lifetime, so this would be the next best option.

List of already opened issues on compressed IO:

A file OPENed with the STATUS='SCRATCH' specifier is available only to the process that opened it, and its resources are released at normal termination. Your compiler documentation might mention how it handles SCRATCH files efficiently.
If you want multiple processes to access the file, or want automatic compression, then I think it’s best to rely on OS services (shm_overview(7), the zfs kernel module, etc.).
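For reference, a minimal sketch of the scratch-file behaviour described above, using only standard Fortran. Where the data actually lives (RAM, /tmp, a disk) is up to the compiler and is not visible from the code.

```fortran
program scratch_demo
  implicit none
  integer :: u, i
  real :: x(5), y(5)
  logical :: is_named

  x = [(real(i), i = 1, 5)]

  ! No FILE= specifier: the processor picks the name and the location.
  open(newunit=u, status='scratch', form='unformatted', action='readwrite')
  inquire(unit=u, named=is_named)   ! processor dependent for scratch files

  write(u) x
  rewind(u)
  read(u) y
  close(u)                          ! the scratch file is deleted on close

  if (any(y /= x)) error stop 'scratch round trip failed'
  print *, 'scratch file has a name visible to the program: ', is_named
end program scratch_demo
```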

For base64, I would suggest using BeFoR64 if you can get the author to remove the GPLv3 license for FOSS development. I also have my own implementation that I’ll contribute for free if there is interest. It’s almost equivalent to BeFoR64 but still needs some “polishing” to remove some of the rough edges.

The MPI or coarray approach would only work for a single execution image (running in parallel, perhaps on multiple compute nodes). A ramdisk would presumably allow the data to be shared by a sequence of execution images (so “nonvolatile” rather than “volatile”).

Another model is the use of named FIFO buffers. These look like files to a user program, but they can be used for temporary storage and communication between processes. I’m unsure how persistent (nonvolatile) they are. Also, the programmer does not control where the data actually lives (in RAM, on actual disks, etc.); that is up to the OS and how it uses its resources. See: Exploring Named Pipes in Linux. Introduction | by Murad Bayoun | Medium

A user process cannot create a (non volatile) ramdisk, what is discussed here is rather a “ramfile”, which disappears when the process ends

The ability of a user to create a ramdisk (or ramfile) is just a matter of choice. If a user has the permissions necessary to create an actual file (in the file system on a spinning disk or an SSD), then he could be given the permissions to create a persistent ramdisk. The confusion in this discussion is because the original post specified “volatile,” when I think he might have intended to say “nonvolatile”.

To make discussions even more confusing, even the word volatile has two meanings. One is related to persistence or permanence, as I’m using it here. The other meaning is for a permanent object whose value can change at any moment; that is the meaning used in the Fortran VOLATILE attribute.

I only use scratch files when I’m too lazy to guess a file name, but I always assume these will be real files. Are there any compiler implementations that may actually store the scratch file data in memory rather than using the filesystem?

Thanks for all the comments and suggestions.

I’ll reproduce part of the Advantages section from BeeGFS On Demand:

The main advantages of the typical BeeOND use-case on compute nodes are:

  1. A very easy way to remove I/O load and nasty I/O patterns from your persistent global file system. Temporary data created during the job runtime will never need to be moved to your global persistent file system, anyways. But even the data that should be preserved after the job end might be better stored to a BeeOND instance initially and then at the end can be copied to the persistent global storage completely sequentially in large chunks for maximum bandwidth.

  2. Applications that run on BeeOND do not “disturb” other users of the global parallel file system and in turn also got the performance of the BeeOND drives exclusively for themselves without any influence by other users.

  3. Applications can complete faster, because with BeeOND, they can be running on SSDs or a RAM-disk, while they might only be running on spinning disks on your normal persistent global file system. Combining the SSDs of multiple compute nodes not only gets you to high bandwidth easily, it also gets you to a system that can handle very high IOPS.

The idea would be to reproduce this but only for Fortran files and with a portable library which can be used on any shared system without requiring root access.

@PierU
…If yes, isn’t it somehow reinventing the coarrays?

I was thinking of extending our simple ‘RAM disk’ to multiple nodes with coarrays. However, I’m not sure how the images would be distributed among MPI processes (this seems to depend on the compiler options). In the end, I think asynchronous MPI would be more predictable. However, we don’t want to merely keep the data in memory in large arrays, for several reasons: they can be very large (100s of GB), which would require writing to disk, and the data may also have to be committed to disk for the next job. [This is common practice for electronic structure codes which generate large sets of eigenvectors on the first run and then use these for subsequent runs.]

Having something like BeeOND, ideally as a path such as /stdlib_fs or as drop-in replacements for open, write, read, etc. could speed up many HPC codes for very little coding effort.

I have never really found a good use for Fortran scratch files except for the most trivial of situations. In my field of electronic structure, it is most often necessary to direct files to a particular device (a particular hard disk, or more recently an SSD) in order to have sufficient capacity or to optimize bandwidth. Fortran scratch files do not allow the programmer to do that. I would say that Fortran scratch files are one of the most poorly designed and poorly conceived features in the language. The workaround is to use an ordinary named file, which of course lets the programmer fully specify the device and file name path, and then to close the file with status='delete' upon termination.
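A minimal sketch of that workaround follows. The path ./work_eigvec.tmp is made up for illustration; in practice it would point at whichever device has the capacity or bandwidth you need.

```fortran
program delete_on_close
  implicit none
  ! Hypothetical path: replace it with a file on the device of your choice.
  character(len=*), parameter :: path = './work_eigvec.tmp'
  integer :: u
  real :: v(4), w(4)
  logical :: exists

  v = [1.0, 2.0, 3.0, 4.0]

  open(newunit=u, file=path, status='replace', form='unformatted', &
       access='sequential', action='readwrite')
  write(u) v
  rewind(u)
  read(u) w

  ! The file is only removed if execution reaches this statement.
  close(u, status='delete')

  inquire(file=path, exist=exists)
  if (any(w /= v)) error stop 'data mismatch'
  if (exists) error stop 'file was not deleted'
  print *, 'work file used and deleted'
end program delete_on_close
```

The weakness is visible in the code: if the job dies before the CLOSE statement, the file stays behind.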

It would be ideal if status='delete' (or some equivalent) could be specified in the open statement; then the programmer would not need to be so careful to close and delete all of the relevant files before program termination. I could not count the number of times I’ve had a job fail, leaving multiple external disks filled to capacity and preventing any further jobs from running until those files were manually deleted. I eventually discovered that it is possible to unlink an open file on most Unix/POSIX computers, which mostly works correctly. Namely, the unlinked file can be read, written, rewound, and appended as normal, but upon program termination, even without a close statement, its contents are freed. However, a problem with this quirky workaround is that it is not possible to monitor the individual file sizes as the job is running, since there are no longer external paths to those files.
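A minimal sketch of the unlink trick, assuming a POSIX system. The Fortran standard has no unlink, so this calls the C library’s unlink(2) through a bind(C) interface; c_unlink is only a local name for that interface, and the path is made up for illustration.

```fortran
program unlink_open_file
  use, intrinsic :: iso_c_binding, only: c_int, c_char, c_null_char
  implicit none

  interface
    ! POSIX unlink(2); not part of the Fortran standard.
    function c_unlink(path) bind(C, name='unlink') result(stat)
      import :: c_int, c_char
      character(kind=c_char), intent(in) :: path(*)
      integer(c_int) :: stat
    end function c_unlink
  end interface

  ! Hypothetical path, for illustration only.
  character(len=*), parameter :: path = './unlinked_work.tmp'
  integer :: u
  real :: a(3), b(3)

  a = [1.0, 2.0, 3.0]

  open(newunit=u, file=path, status='replace', form='unformatted', &
       action='readwrite')

  ! Remove the directory entry while the unit stays connected.
  if (c_unlink(path // c_null_char) /= 0) error stop 'unlink failed'

  ! The unit is still usable: write, rewind and read as normal.
  write(u) a
  rewind(u)
  read(u) b
  if (any(b /= a)) error stop 'data mismatch after unlink'

  close(u)   ! the storage is released here, or at termination if never reached
  print *, 'unlinked file was still readable and writable'
end program unlink_open_file
```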

Most compilers do have features that let you specify where scratch files reside, but since the standard does not define much at all about how the system implements scratch files, the options are very compiler-specific. There is a discussion in the Fortran Wiki regarding this …

The documentation for the compiler you use might have some surprises in it for you.

The closest thing to what you are describing was actually available, in large part, on pre-Unix HPC systems. The best remaining example is probably the assign(1) command in Cray programming environments.

One description of a utility for assigning attributes to Fortran files is

For base64 encoding there are also the encode_base64() and decode_base64() procedures in

https://github.com/urbanjost/M_string

What do you mean by “could be given”? Technically yes, a regular user can be given any administrative right. But the permission to create a ramdisk is completely different from the one to create a file.

I am a bit confused by the description here, but I think you are saying you have a program written to read and write from a file.
You would like to be able to open a file as an internal file even if the machine does not have a memory-resident file system you can write to; and maybe you are saying all images even in a parallel job running on multiple nodes should have access to the file.

If it is just a coarray that holds the data you can write it out using stream I/O, but just use it as an array while executing. That does mean it gets complicated if you do not know the size required when the program starts.

Note that internal files can only be written to using formatted I/O. The loophole is that you can write anything with an “A” format (perhaps not a well-known Fortran feature, but standard). So (again best off if you know the maximum size when you allocate the “file”) you can actually use that and a few other tricks to write the record length at the beginning of each record (available via an INQUIRE). If the file is direct-access it is pretty easy to just have the code conditionally do a WRITE or a store into a user-defined type array; and again just write the data out at the end if you need to retain it.
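The direct-access case can be sketched as follows: the same put/get calls either WRITE to a direct-access file or store into an array of a user-defined type, and INQUIRE(IOLENGTH=) supplies the record length. All names here (record_store, rec_t, store_open, put_rec, get_rec, store_close, the file path) are hypothetical and only meant to show the shape of the idea.

```fortran
module record_store
  ! Hypothetical sketch: all names here are made up for illustration.
  implicit none
  private
  public :: rec_t, store_open, put_rec, get_rec, store_close

  type :: rec_t
    real    :: x = 0.0, y = 0.0
    complex :: z = (0.0, 0.0)
  end type rec_t

  logical :: in_memory = .true.
  integer :: u = -1
  type(rec_t), allocatable :: mem(:)

contains

  subroutine store_open(use_memory, nrec, path)
    logical, intent(in) :: use_memory
    integer, intent(in) :: nrec            ! only needed for the memory back end
    character(len=*), intent(in) :: path   ! only needed for the file back end
    type(rec_t) :: r
    integer :: rl
    in_memory = use_memory
    if (in_memory) then
      allocate(mem(nrec))
    else
      ! Record length in the processor-dependent units expected by RECL=.
      inquire(iolength=rl) r%x, r%y, r%z
      open(newunit=u, file=path, access='direct', form='unformatted', &
           recl=rl, status='replace', action='readwrite')
    end if
  end subroutine store_open

  subroutine put_rec(irec, r)
    integer, intent(in) :: irec
    type(rec_t), intent(in) :: r
    if (in_memory) then
      mem(irec) = r
    else
      write(u, rec=irec) r%x, r%y, r%z
    end if
  end subroutine put_rec

  subroutine get_rec(irec, r)
    integer, intent(in) :: irec
    type(rec_t), intent(out) :: r
    if (in_memory) then
      r = mem(irec)
    else
      read(u, rec=irec) r%x, r%y, r%z
    end if
  end subroutine get_rec

  subroutine store_close()
    if (in_memory) then
      deallocate(mem)
    else
      close(u, status='delete')   ! use status='keep' to retain the data
    end if
  end subroutine store_close

end module record_store

program record_store_demo
  use record_store
  implicit none
  type(rec_t) :: a, b
  logical :: modes(2)
  integer :: i

  modes = [.true., .false.]   ! memory back end first, then file back end
  a = rec_t(10.3, sqrt(40.0), (3.0, -4.0))
  do i = 1, 2
    call store_open(modes(i), nrec=10, path='./records_demo.dat')
    call put_rec(3, a)
    call get_rec(3, b)
    call store_close()
    if (b%x /= a%x .or. b%y /= a%y .or. b%z /= a%z) error stop 'record mismatch'
  end do
  print *, 'both back ends returned the same record'
end program record_store_demo
```

The memory back end needs the number of records up front, which is the same "know the maximum size" caveat mentioned above.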

But on just about any HPC cluster that uses the nodes as general resources and not just for floating point ops there is usually a memory-resident file system available to the user. Many job schedulers create a private /tmp mount for each job that is often memory-resident in one form or another nowadays.

The tricky part is if all the processes have to see the file. Sometimes it is trivial to rewrite the code so one node does the I/O and the rest get and put data to that node, or it puts the data into coarrays.

Some of the products already listed (as well as others) let you create a single parallel file system visible to multiple nodes; but a lot of HPC systems already have high-speed I/O available (e.g., Lustre), so there are often alternatives, albeit not as convenient.

But is that what you are saying? Instead of internal files having to be of fixed width and length (and formatted, although the ‘(*(A))’ format implied above gives you a partial solution for that) you want to be able to write binary sequential data of arbitrary record width as well as an arbitrary number of records to an internal file?

And do you need other images or other processes or sub-processes to see that data concurrently?

program really
   ! a program I hesitate to show someone
   implicit none
   character(len=128) :: file(1000)
   character(len=17) :: str
   real :: x, y
   complex :: z
   write(file(1),'(*(a))') 'this is standard?',10.3,sqrt(40.0),(3,-4)
   read(file(1),'(*(a))') str,x,y,z
   write(*,*) str,x,y,z
end program really

A more legitimate use of that trick is when you need to mix formatted and binary data on the same output file. Rare, but sometimes a legitimate need. That comes up on writing to stdout in particular.

Neat trick. Never thought of mixing in binary with text this way. I would have probably tried using TRANSFER but that can get a little unwieldy sometimes.

It predates F90, so I tried it on some compilers that came after F90. I was concerned that, since I do not see it heavily used, it might be a dusty corner, but it worked with everything I tried, so congrats to the compiler developers.

It was particularly handy for creating globs of data that I combined with a hash table from Kennedy; that turned a project I thought was going to be hard in Fortran into something that was working as an fpm project in a few minutes. So it is not common, as far as I know, but it is standard and uses WRITE’s ability to process arbitrary arguments to good advantage.

It has an interesting history as to why the feature was standardized long ago which I thought I saw described recently but did not find with a quick search. You have to increment the “line count” like a direct access file but if you use an allocated string and query the length of the I/O with an INQUIRE I find it much easier to use than TRANSFER with composite data myself. Nothing beats EQUIVALENCE but it seems to be in the doghouse currently.

I agree that some modern compilers allow reading numbers, and even intrinsics that return numbers, with A format.
Where does the standard allow that?

I played around with your code a little. One thing I noticed is that if you write out the file(1) values, the complex number gets written out as four characters, not the eight I would expect for the 8 bytes required to store a REAL32 complex number. This puzzled me until I used TRANSFER to move (3.0, -4.0) into an array (size 8) of 1-byte characters and then used ICHAR to look at the resulting integer value of each character. For (3.0, 4.0) the first two bytes are 0, which translates into the non-printable NUL character in the ASCII character set. Fortran obviously doesn’t care about this when you do an internal READ on the resulting data.

However, what would happen in, say, a VTK XML file if you used your method to mix binary data with the XML text that would then be read by a C or C++ code? Wouldn’t there be a couple of bytes missing, or am I not seeing something obvious here? Note that if you write (3.1, -4.1) you get 8 characters printed instead of 4. VTK lets you use either raw binary or base64-encoded values for binary data in its XML files. I also suspect that the list-directed output plays a role in this behaviour, since it apparently skips the non-printable characters.

Having to do tricks like this and/or use something like base64 encoding is why I think Fortran needs some intrinsic capability to mix text and binary, but again I don’t see that ever happening. I guess you could do this with stream I/O, but I’ve never taken the time to work out the best way to do that.
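On the stream I/O idea, here is a minimal sketch of one way it could be done with unformatted stream access (standard since Fortran 2003). It writes a text tag, the raw bytes of a complex value, and a closing tag into one file, then checks that no bytes went missing. The file name and the tags are made up for illustration and are not valid VTK XML; the byte count check assumes the file storage unit is one byte, which is the case on the common compilers.

```fortran
program mixed_stream
  implicit none
  ! Hypothetical file name and tags, for illustration only.
  character(len=*), parameter :: path   = './mixed_demo.bin'
  character(len=*), parameter :: header = '<data>'  // achar(10)
  character(len=*), parameter :: footer = '</data>' // achar(10)
  character(len=len(header)) :: head_back
  complex :: z, z_back
  integer :: u, pos_after, nbytes_z

  z = (3.0, -4.0)
  nbytes_z = storage_size(z) / 8   ! size of z in bytes

  open(newunit=u, file=path, access='stream', form='unformatted', &
       status='replace', action='readwrite')

  write(u) header    ! text, written byte for byte, no record markers
  write(u) z         ! raw binary, NUL bytes included
  write(u) footer

  ! POS= is the position of the next byte, so POS-1 bytes were written.
  inquire(unit=u, pos=pos_after)
  if (pos_after - 1 /= len(header) + nbytes_z + len(footer)) &
     error stop 'byte count mismatch'

  ! Read the pieces back from the start of the file.
  read(u, pos=1) head_back
  read(u) z_back
  close(u, status='delete')

  if (head_back /= header) error stop 'header mismatch'
  if (z_back /= z) error stop 'binary payload mismatch'
  print '(a,i0,a)', 'wrote ', pos_after - 1, ' bytes, none dropped'
end program mixed_stream
```

Unlike the internal-file trick, nothing here depends on how character output treats non-printable bytes, so a C or C++ reader would see all of the payload bytes.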