GSoC '25: stdlib filesystem

Hello everyone!
As you may be aware, I have been accepted into this year’s GSoC under the mentorship of @hkvzjal and @fxm.

I will be giving weekly updates in this thread, along with the relevant PRs, discussions, etc.

The initial plan is to start by implementing some path functions useful for further functionality, taking inspiration from stdlib_os’s os_path module developed by @MarDie, @Arjen and @awvwgk, and from how other languages like Python, Go and Rust handle it.

If you have any feedback, suggestions, questions etc do let me know!
Thank you!

7 Likes

Thanks for starting this, @suprit05. I’m tagging @FedericoPerini here as another mentor, who - alongside Jose - has a much deeper understanding of stdlib itself. :slight_smile:

3 Likes

Additional resources that might be of interest are

  1. M_io
  2. M_path
  3. M_process
  4. M_system

:smile:. I listed mine in probable order of relevance, but that is just a guess; there are others that other people can perhaps list. They show some interesting differences: OOP versus procedural style, how different platforms are distinguished, how preprocessing is applied, whether FORD or Doxygen is used, documentation approaches in general, …

3 Likes

If you plan to include support for files and directories, you may also want to have a look at what @interkosmos did with fortran-unix, and in particular the binding to dirent.

It is not portable to Windows but you can make it so with that dirent.h header.

Just some food for thought for your path functionality. When working with Nuke (in C#) I came across their AbsolutePath (resp. RelativePath) object, which I found very convenient. In brief, it overloads the / operator so that you can build paths in a very visual way (e.g. newpath = root / folder / filename), and the operator takes care of choosing / or \ depending on the OS and does the proper path concatenation.
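A minimal Fortran sketch of the same idea (the module and procedure names are made up for illustration, and the separator is hard-coded rather than chosen per platform):

module path_join_sketch
  implicit none
  private
  public :: operator(/)

  ! Illustrative only; a real implementation would pick '/' or '\' per platform.
  character(len=1), parameter :: sep = '/'

  interface operator(/)
    module procedure join_paths
  end interface

contains

  pure function join_paths(lhs, rhs) result(path)
    character(len=*), intent(in) :: lhs, rhs
    character(len=:), allocatable :: path
    ! Naive concatenation; a real version would also normalise duplicate
    ! or trailing separators.
    path = trim(lhs) // sep // trim(rhs)
  end function join_paths

end module path_join_sketch

With this, newpath = root / folder / filename works for character variables just as in the C# example, since the intrinsic / is not defined for character operands and can therefore be overloaded.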

5 Likes

This would really be amazing!

1 Like

A welcome feature of any stdlib file system for use with HPC applications would be a RAM disk implemented as a fixed directory.

For example, we could open a file like:

open(60,file='/stdlib_fs/file.txt',form='FORMATTED')

with the directory /stdlib_fs looking like that of a normal file system but existing in volatile memory only.

This is important for HPC systems where the networked file system can be a severe bottleneck. In fact, we implement a simple version of this in our code for direct access files only. Here’s a snippet:

! construct the filename
fname=trim(scrpath)//'EVECSV'//trim(fext)
!$OMP CRITICAL(u206)
! read from RAM disk if required
if (ramdisk) then
  call getrd(fname,ik,tgs,v1=vkl_,n1=nstsv_,nzv=nstsv*nstsv,zva=evecsv)
  if (tgs) goto 10
end if
! find the record length
inquire(iolength=recl) vkl_,nstsv_,evecsv
open(206,file=fname,form='UNFORMATTED',access='DIRECT',recl=recl)
read(206,rec=ik) vkl_,nstsv_,evecsv
close(206)
10 continue
!$OMP END CRITICAL(u206)

In this case, if ramdisk is .true. the data will be retrieved from memory, unless the data is not in the RAM disk, in which case the code falls back to the normal file system. Simple as this is, it allows almost ideal scaling as more nodes are added; the networked file system does not scale like this and can even result in longer run times with increasing node count.

To be particularly useful, /stdlib_fs would have to be visible to, and consistent among different nodes on a cluster. This would be fairly difficult to implement (probably using MPI) and our simple RAM disk does not do this.

[However, it could likely be done at a system level using BeeGFS On Demand which is an on-the-fly parallel file system. Used in combination with the Linux RAM disk in tmpfs, a RAM-only parallel file system could be generated for the running application only, avoiding having to access the shared file system.]

1 Like

Progress Update: as of 17th June 2025

The goal was to start by implementing path-related functions to facilitate the future addition of functionality.

Pull Requests

Basic path-related functions were added (a usage sketch follows the list):

joinpath: joins the given paths according to the platform’s path-separator.

operator(/): as suggested here on Discourse by @davidpfister, an operator is also provided for the same functionality.

splitpath: splits the path at the last path-separator and returns the head and tail.

basename: just returns the tail returned by splitpath.

dirname: just returns the head returned by splitpath
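A hypothetical usage sketch, following the names described above; the module name, the exact interfaces, and whether splitpath is a subroutine or a function are assumptions and may differ from the merged stdlib API:

program demo_path_functions
  ! Hypothetical imports; see the note above.
  use stdlib_system, only: joinpath, operator(/), splitpath, basename, dirname
  implicit none
  character(len=:), allocatable :: p, head, tail

  p = joinpath('usr', 'local')   ! 'usr/local' on Posix, 'usr\local' on Windows
  p = p / 'bin'                  ! same result via the overloaded operator

  call splitpath(p, head, tail)  ! head = 'usr/local', tail = 'bin'
  print *, basename(p)           ! the tail: 'bin'
  print *, dirname(p)            ! the head: 'usr/local'
end program demo_path_functions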

Build-system adjustments were also made to add some compile-time constants according to the platform.

A derived type fs_error containing an integer code and a fixed-length string was added for concise error handling.

This is a stripped-down version of state_type, which is already present in stdlib; the syntax is also kept very similar, except for the introduction of code to store integers returned by C functions such as GetLastError (Windows API) or global variables like errno (POSIX platforms).
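A rough sketch of what such a type could look like (the field names, the message length and the ok helper are assumptions, not the exact stdlib definition):

module fs_error_sketch
  implicit none
  private
  public :: fs_error

  type :: fs_error
    ! 0 means success; otherwise an errno value (POSIX) or a
    ! GetLastError code (Windows), as described above.
    integer :: code = 0
    ! Fixed-length message buffer, similar in spirit to state_type.
    character(len=512) :: message = ''
  contains
    procedure :: ok
  end type fs_error

contains

  pure logical function ok(self)
    class(fs_error), intent(in) :: self
    ok = (self%code == 0)
  end function ok

end module fs_error_sketch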

All PRs are accompanied by their relevant tests, documentation and examples.

Concepts Learnt

How different operating systems handle error handling, system calls, backward compatibility, legacy code, etc. This was mostly learnt by reading the code of already existing filesystem libraries like:

And of course man pages and Windows API documentation :grinning_face: .

4 Likes

I don’t really get it: a ramdisk is typically an OS feature, and once a ramdisk is created on the machine, it can be transparently accessed by any application as if it was a regular disk.

HPC systems often have local disks (preferably SSD, nowadays) on the nodes.

@PierU:
I don’t really get it: a ramdisk is typically an OS feature, and once a ramdisk is created on the machine, it can be transparently accessed by any application as if it was a regular disk.

Setting up a RAM disk may not be possible on a cluster with user privileges only.

@PierU
HPC systems often have local disks (preferably SSD, nowadays) on the nodes.

How does Node A read the SSD of Node B? This is what BeeGFS On Demand was made for, it can link the disks of separate nodes together into a concurrent parallel filesystem, which is made available only for the running application. However root privileges are required to set it up.

For our code we have data which should be preserved between runs and be available to all nodes during runs. Our simple ‘RAM disk’ stores only the data generated by that node, and defaults to the regular file system for off-node data.

All the HPC systems I’ve used in the last 15 or so years used the Lustre file system for programs running on the backend compute nodes. If I remember how Lustre works, you write data to a cluster of metadata servers which then figures out the best way to “stripe” your data to the final storage disks.

It seems like there are two different features being discussed as if they are the same thing. The OP specified a ram disk where data could be accessed with i/o statements, yet it lived in volatile memory. Volatile memory implies memory just during a single program execution step. But the important feature of a ram disk is that it persists beyond a single program execution step so that it can be accessed not only by a running program but also other programs running at the same time or other programs running at some later time. The data is placed in ram memory presumably to enhance bandwidth or reduce latency compared to, for example, a physical spinning hard disk.

If the shared data is intended to exist only within the lifetime of an executing program, including a parallel program running on multiple threads or multiple nodes, then other options are available, including data within a shared module or data accessed on remote nodes by MPI or with coarrays. But if the data is intended to persist over the lifetimes of several program execution steps, then MPI or coarrays are insufficient.

So there are 2 fully orthogonal topics:

  • setting up and using a ramdisk; yes, setting up the ramdisk requires root privileges at some point, and I can’t see how a regular user process can create it on-the-fly (which was your initial wish)
  • setting up a shared FS between nodes, which is what BeeGFS does. As mentioned in the description, it uses existing local disks -including ramdisks- to create a shared FS, but it won’t create a ramdisk on its own.

At the end I can’t really see what may be added in stdlib here (?).

Hi @suprit05,
Good PR! I had a look at it and I have a few comments. I hope it is fine if I put them here.

This is mostly about hardcoded pathsep in stdlib_system. While the approach looks fine, you may want to have a look at the function separator() in fpm.

I also noticed the comment ! if no pathsep, then it probably was a root dir like C:\. That probably means that you need functions like isrooted and isabsolute/isrelative (and eventually fullpath).
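For illustration, a hedged sketch of what such a check might look like (hypothetical helper, not part of the PR; Windows drive letters and UNC prefixes are only partially handled, and forms like C:relative are ignored):

pure logical function is_absolute(path, is_windows)
  implicit none
  character(len=*), intent(in) :: path
  logical, intent(in) :: is_windows
  integer :: n
  is_absolute = .false.
  n = len_trim(path)
  if (n == 0) return
  if (is_windows) then
    ! 'C:\...' or 'C:/...' drive-letter roots
    if (n >= 3) is_absolute = (path(2:2) == ':' .and. &
                               (path(3:3) == '\' .or. path(3:3) == '/'))
    ! '\\server\share' UNC roots
    if (.not. is_absolute .and. n >= 2) is_absolute = (path(1:2) == '\\')
  else
    ! POSIX: absolute means the path starts at the root
    is_absolute = (path(1:1) == '/')
  end if
end function is_absolute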

General question here: would it make sense to unify fpm_filesystem with stdlib_filesystem?
Keep up the good work :+1:

2 Likes

I think that for stdlib, having the path separator as a compile-time constant instead of a function is the ideal approach.

That’s definitely true if you compile stdlib with your project on the desired platform. This implicitly means that you do not plan to distribute stdlib as binaries for dynamic linking. That said, I do agree that calling a function every time you need to access the separator value is a bit heavy. This could be done just once, with the result stored in a module variable.

1 Like

AFAIK, you can’t use a Linux-compiled shared library on Windows, nor the other way around. You can do cross-compilation with MinGW to use a Linux kernel to produce a Windows binary, but it is still a binary targeting Windows. On Windows you would use WSL, but as far as the compiler and the produced binary are concerned, it is a Linux kernel for all practical purposes. So I would say that this is not an actual problem, since by default you have to produce your binaries per targeted platform in any case.

1 Like

On Windows, you may be working in a non-cmd shell, e.g. a Git Bash or MinGW shell, so \ may not be the default separator under all circumstances.

1 Like

Just another thing that came to mind: AFAIK, unlike ifort/ifx, gfortran does not define macros like _WIN32 depending on the platform you compile on. So, if you use fpm as a build system, you have to supply it in the TOML file:

[preprocess]
cpp.suffixes = ["F90", "f90"]
cpp.macros = ["_WIN32"]

which makes the fpm.toml platform-dependent, unless you use profiles as discussed here.
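For reference, a minimal sketch of the kind of preprocessed constant this enables (illustrative module name; it needs preprocessing, e.g. an .F90 suffix, and the _WIN32 macro supplied as above or predefined by the compiler):

module pathsep_sketch
  implicit none
#ifdef _WIN32
  character(len=1), parameter :: pathsep = '\'
#else
  character(len=1), parameter :: pathsep = '/'
#endif
end module pathsep_sketch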

Since the introduction of the file system functionality, stdlib also has a little bit of C baked in, so maybe this could be resolved in the C files and the constant passed to the Fortran module? gcc does define all the necessary macros that gfortran doesn’t.

2 Likes

@PierU
At the end I can’t really see what may be added in stdlib here (?).

To demonstrate the effect of our toy ‘RAM disk’ I ran the same quantum time-evolution calculation on 16 nodes (4096 cores) of a 768 node cluster with a 20 PB parallel file system, shared by many other users.

With our ‘RAM disk’ enabled and no writing of direct access files, the job took 21 minutes. The CPU utilization was over 90%.

Without using the ‘RAM disk’ the time was 87 minutes with a CPU utilization of about 40%.

As you can see, the shared file system is a severe bottleneck for our type of calculation, with CPUs having to wait for data to be written and read.

This feature of a temporary, volatile file system could be made more general-purpose and part of stdlib.

For example, there could be routines like

open_stdlib(50,file='data',form='UNFORMATTED',access='DIRECT',recl=recl)
write_stdlib(50,rec=rec) x,y,z

which, by default, would be just wrappers for the usual open and write statements.

However, you could specify additional options like

open_stdlib(50,file='data',form='UNFORMATTED',access='DIRECT',recl=recl, &
 volatile=.true.)

which would then store the data in RAM only. If required, the same data could be written once to the non-volatile file system at the end of the calculation (this is usually what we do).

On a cluster, one could pass the MPI communicator

open_stdlib(50,file='data',form='UNFORMATTED',access='DIRECT',recl=recl, &
 volatile=.true.,MPI_Comm=mpi_comm_world)

which would allow data to be read by all nodes. This would require some nifty coding using asynchronous MPI communication.

From our HPC point of view, this would be a great feature.