Hard Disk Access on HPC Systems

What are the best ways to access hard disk on HPC clusters?

We tried to use parallel netCDF, but it didn’t speed up the program, because compute nodes are usually diskless and therefore must access hard disks over the network.
For those interested, here is the configuration of the JUWELS Cluster Module I’m referring to.

When I did a quick benchmark (saving 8 GB of data into a netCDF4 file), it took about 8 s with one core and never got significantly faster than 5 s with multiple cores.
Using time and likwid on a single core I could see that the benchmark barely uses any FLOPS. Instead, up to 9/10 of the time was spent waiting (I think we can safely assume on hard disk access).

I have some ideas for how to work around this, but first I want to check whether I am reinventing the wheel or whether there are better options. I would simply use a thread for the netCDF operations to make hard disk access asynchronous, along the lines of the sketch below.
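To make the idea concrete, here is a hypothetical double-buffering sketch of what I mean (plain unformatted IO standing in for the netCDF calls; all names are illustrative; compile with OpenMP):

program threaded_io
  implicit none
  integer, parameter :: n = 1000000, nsteps = 10
  real(8), allocatable :: work(:), out(:)
  integer :: u, it

  allocate(work(n), out(n))
  call random_number(work)
  open(newunit=u, file='snapshots.dat', form='unformatted', &
       access='stream', status='replace')
  do it = 1, nsteps
    out = work                  ! snapshot the state into a second buffer
    !$omp parallel sections
    !$omp section
    write(u) out                ! IO section: write the snapshot
    !$omp section
    work = 0.5d0*work + 0.1d0   ! compute section: advance the model (stand-in)
    !$omp end parallel sections
  end do
  close(u)
end program

The write and the computation touch different buffers, so the two sections can safely run concurrently.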

What are best practices and are there any libraries?

PS: We use hard disks for two reasons: to save the model output, and to overcome the limited main memory, because we need to save every time step, which requires too much memory, at least so far.

Writing data to disk is likely limited by the disk bandwidth, so throwing more cores at it isn’t likely to help.

Also, on an HPC cluster the disks are often shared with other users, which makes benchmarking difficult. If another job outside your control is reading or writing a lot to the same disk, you have to share the disk bandwidth with that job and your time waiting for the disk will increase.

In general I can think of three ways to improve the situation:

  1. Get faster disks
  2. Write less data to disk
  3. Perform other computations while waiting for the write to complete

If you don’t control the HPC hardware then option 1 likely isn’t an option, and I think 1 GB/s is already quite decent. You could write less data by using compression. That will cost you more CPU, though, so whether it’s worth it is case dependent. I don’t know whether netCDF supports it, but at least HDF5 has some built-in compression algorithms.

The last option is perhaps the most interesting one. If you can start writing parts of the data set to disk while continuing the computation, you can keep the CPU busy while waiting for the disk to catch up. Of course, you need to do this from a different chunk of memory than the one holding the data waiting to be written. Fortran actually has standardized support for asynchronous IO, though I’ve never tested it myself.
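For reference, a minimal sketch of that standardized asynchronous IO (Fortran 2003); whether the transfer actually overlaps with the computation is up to the compiler and OS:

program async_write
  implicit none
  integer, parameter :: n = 1024*1024
  real(8) :: buf(n)
  integer :: u

  call random_number(buf)
  ! open the unit with asynchronous='yes' to allow non-blocking transfers
  open(newunit=u, file='out.dat', form='unformatted', access='stream', &
       asynchronous='yes', status='replace')
  ! start the write; control may return before the data reach the disk
  write(u, asynchronous='yes') buf
  ! ... do other computations here, but do not modify buf ...
  ! block until all pending transfers on unit u have completed
  wait(u)
  close(u)
end program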


The newest HDF5 release also has support for asynchronous IO: https://www.hdfgroup.org/2022/12/release-of-hdf5-1-14-0-newsletter-189. I have not looked into it in detail, but it does not seem straightforward. Nevertheless, I find HDF5 much better than plain Fortran IO.


Since netCDF4 uses HDF5, I can try both compression and asynchronous IO. Thanks for the hints!
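For the compression part, something like this is what I have in mind — a minimal, untested sketch using the nf90 interface, which (as far as I can tell) exposes HDF5’s deflate filter through optional arguments to nf90_def_var (error handling omitted; names are illustrative):

program nc_compress
  use netcdf
  implicit none
  integer :: ncid, dimid, varid, ierr
  real(8) :: field(1000)

  call random_number(field)
  ierr = nf90_create('out.nc', nf90_netcdf4, ncid)
  ierr = nf90_def_dim(ncid, 'x', size(field), dimid)
  ! shuffle and deflate are HDF5 filters exposed via the netCDF-4 API;
  ! deflate_level ranges from 0 (off) to 9 (slowest, smallest)
  ierr = nf90_def_var(ncid, 'field', nf90_double, [dimid], varid, &
                      shuffle=.true., deflate_level=4)
  ierr = nf90_enddef(ncid)
  ierr = nf90_put_var(ncid, varid, field)
  ierr = nf90_close(ncid)
end program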

Don’t expect miracles from compression unless your data has patterns that can be exploited.
If data size or IO time is really an issue and you use double precision, casting to single precision for output could be an option. For HDF5 that works (not sure about netCDF), and for normal postprocessing, e.g. plotting, single precision is often enough.
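In plain Fortran the cast can even be done on the fly in the write statement; a minimal sketch:

program sp_output
  use, intrinsic :: iso_fortran_env, only: real32, real64
  implicit none
  real(real64) :: field(1000)
  integer :: u

  call random_number(field)
  open(newunit=u, file='out.dat', form='unformatted', access='stream', &
       status='replace')
  ! convert to single precision on the fly; the file is half the size
  write(u) real(field, real32)
  close(u)
end program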


While I have no experience with HPC clusters, I think option 2 is the only likely improvement.

For option 1: I have been testing newer NVMe SSD options, and although they claim up to 7 GB/s, there are many barriers to achieving this performance, including OS drivers, data processing rates, and exhausting the capacity of the disk’s buffers (plural) with real data. These high transfer rates can exceed the rate at which the data can be processed, which also makes the performance hard to demonstrate.

Option 3 looks to be a lot of work, although with the right knowledge it may work in some cases.

My implementation of option 2 has been to define larger derived types with allocatable components so that external disk access is not required. This is less effective where the data source is external rather than generated, or where the memory model is distributed rather than shared.

Legacy solutions often benefit from in-memory derived data structures. I have had good results with this approach.

My solution is to divert data to memory and so “write less to disk” (and get more memory!), as in the sketch below.
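A minimal sketch of that pattern (all names hypothetical): each time step gets its own allocatable component instead of a scratch-file record.

program in_memory_steps
  implicit none
  ! one record per time step, held in memory instead of a scratch file
  type :: step_t
    real(8), allocatable :: u(:,:)
  end type
  integer, parameter :: nsteps = 100, nx = 256, ny = 256
  type(step_t), allocatable :: steps(:)
  integer :: it

  allocate(steps(nsteps))
  do it = 1, nsteps
    allocate(steps(it)%u(nx,ny))
    call random_number(steps(it)%u)   ! stand-in for the real computation
  end do
  ! later passes read steps(it)%u directly; no disk access required
end program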


We had exactly this problem with our electronic structure code, the Elk Code. Data had to be written so that it could be used in the next run, and some arrays were too large to keep in memory.

We solved it in two ways. The first was to switch some of the variables to single precision (as also suggested by MarDie above). This did not significantly affect the accuracy of the results and was faster as a bonus. For example, large arrays are calculated in single precision but are then contracted and accumulated into smaller double precision arrays.

The second way was to implement a simple RAM disk in Fortran for direct-access files (see the module modramdisk.f90 in the elk/src directory).

The ‘disk’ is an allocatable array of allocatable arrays (I think this is a Fortran 2003 feature):

! record data stored as 4-byte words
type, private :: rec_t
  integer(4), allocatable :: dat(:)
end type

! RAM disk file consisting of the filename and an array of records
type, private :: file_t
  character(len=:), allocatable :: fname
  type(rec_t), allocatable :: rec(:)
end type

! arrays of files constituting the RAM disk
type(file_t), allocatable, private :: file(:)

which are allocated on the fly. All variables (real, complex, integer) are stored in 4-byte integer arrays using the Fortran transfer function.
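As a standalone illustration of the transfer trick (a minimal sketch, not the Elk code itself), a complex array is packed into 4-byte words and recovered again:

program transfer_demo
  implicit none
  complex(8) :: zv(4), zw(4)
  integer(4), allocatable :: dat(:)
  integer :: i

  do i = 1, 4
    zv(i) = cmplx(i, -i, kind=8)
  end do
  ! pack the complex array into 4-byte integer words
  ! (dat is auto-allocated; each complex(8) occupies four words)
  dat = transfer(zv, [0_4])
  ! recover the complex values from the stored words
  zw = transfer(dat, zw)
  print *, all(zw == zv)   ! prints T
end program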

Here is an example of writing to the RAM disk:

! write to RAM disk if required
if (ramdisk) then
  call putrd(fname,ik,v1=vkl(:,ik),n1=nstsv,nzv=nstsv*nstsv,zva=evecsv)
end if
! write to disk if required
if (wrtdsk) then
! find the record length
  inquire(iolength=recl) vkl(:,ik),nstsv,evecsv
  open(206,file=fname,form='UNFORMATTED',access='DIRECT',recl=recl)
  write(206,rec=ik) vkl(:,ik),nstsv,evecsv
  close(206)
end if

By default, the record is written to both the RAM disk and the regular disk, although the latter can be disabled.

Here is code which reads from the RAM disk:

! read from RAM disk if required
if (ramdisk) then
  call getrd(fname,ik,tgs,v1=vkl_,n1=nstsv_,nzv=nstsv*nstsv,zva=evecsv)
  if (tgs) goto 10
end if
open(206,file=fname,form='UNFORMATTED',access='DIRECT',recl=recl)
read(206,rec=ik) vkl_,nstsv_,evecsv
close(206)
10 continue

If a record is not in the RAM disk, then the code seamlessly falls back to reading it from the regular disk. This can happen if the maximum number of records was exceeded or the data is in the RAM disk on a different node. Fortunately for our code, each node mostly reads the data it has written and not the data written by another node.

This simple RAM disk sped up the code enormously because the big direct access files were not written to or read from the networked filesystem. (We’re currently working on a way to read the RAM disk of a different node using one-sided MPI communication, but it’s not that easy.)

Lastly, there may be another method you could use. We thought about trying it, but we needed root access on the cluster, so we didn’t pursue it. It is to use BeeOND, the temporary on-demand variant of the BeeGFS file system. The idea was to stitch together the Linux RAM disks (tmpfs) of the nodes with BeeOND into a coherent networked file system independent of the cluster file system. For each run, the files needed by the code would be copied to the BeeOND disk, the code would be run, and the files would be copied back to the regular file system. In principle, we could devote the RAM of entire nodes to this file system and make the disk as large as needed.

I’m not sure if there are any gotchas with this approach: you’d certainly need the system administrator to set it up and build it into the batch submission system; it’s not something you could do at user level.


I don’t get it at all… What you call a “RAM disk” here is just a structure of allocatable arrays used to store the data (so nothing like what we usually call a RAM disk), and these arrays are allocated in memory. But the initial problem to solve was (quoting you) “some arrays were too large to keep in memory”. Putting the data in your “RAM disk” doesn’t change the memory issue. Maybe I am missing something important, but what?

If you are using a traditional HDD, whose input/output operations per second (IOPS) are very limited (about 100, so parallel IO is perhaps of no use because the total is still only about 100 IOPS), the most straightforward fix is likely to upgrade to an NVMe SSD, which usually delivers 100,000 IOPS or more. You can compare the performance of an HDD and an NVMe SSD on your data.
If you are already using an NVMe SSD, I guess it is likely you will benefit from parallel IO. Perhaps some testing is needed.

HPC clusters can also be bottlenecked by their network speed. So perhaps you can run this benchmark on your local PC: if it takes 8 s with 1 core and 1 s with 8 cores there, then the parallel IO works fine, and the bottleneck is likely the HPC cluster’s network.

The basic reason why parallel IO makes little sense on an HDD is that the way the data are read from or written to the disk is not parallel at all. There is usually only one arm that holds the read/write heads, and the arm cannot be in two locations at the same time.

Still, it can be faster as long as the write cache (either in RAM or on the disk itself) is not full. That is, for a limited volume of data (unless you have a huge amount of RAM usable by the cache).

We’re still talking about a supercomputer… It has a storage cluster attached, which is way faster than a single HDD or even an SSD. In my benchmark with 8 GB of data I reached well over 12 GB/s (using two nodes).
The supercomputer’s support staff said the storage is so fast that it shouldn’t be the bottleneck. The network, however, could be: it is 100 Gbit/s per node.

Is binary data output an option?
EDIT: should’ve looked up the netCDF/HDF file formats before asking

netCDF is a binary file format, so yes: it isn’t only an option, we are already using it.