Speed of Writing to File vs /dev/null

I was curious lately about the speed of writing with and without a format string write(lun,'(a)') vs write(lun,*) for example. Previously I had identified that writing with * was always faster, which I suppose makes sense because it seems that something about writing (besides the actual action of putting data on the disk) is happening/being computed at run-time. For instance, a program will compile without errors, then fail at runtime if a write statement has mismatched format + output variables.

Anyways, I then wondered what happens if I just send my output to /dev/null, and I was very surprised to find that for some compilers this was faster, and others significantly worse.

I found the following:

  • writing to files, gfortran was about an order of magnitude faster than Intel OneAPI ifort or ifx when writing Hello World! via * or '(a)', but only about twice as fast as AMD AOCC Flang with * and the same speed with '(a)'

  • writing to /dev/null, gfortran slowed down considerably, which was completely unexpected; AMD AOCC Flang was relatively the same timing, and only Intel OneAPI ifort and ifx took about 1/3 the time they spent writing to files

Any insights to this behavior? I am surprised how inconsistent the writing performance was between different compilers, and even more surprised that writing to /dev/null was not substantially faster for all of them.

code (click to see)
module mymod
implicit none
private

    public :: hw1, hw2

    contains

        impure subroutine hw1(lun)
            integer, intent(in), optional :: lun
            if (present(lun)) then
                write(lun,*) 'Hello World!'
            else
                write(*,*) 'Hello World!'
            end if
        end subroutine hw1

        impure subroutine hw2(lun)
            integer, intent(in), optional :: lun
            if (present(lun)) then
                write(lun,'(a)') 'Hello World!'
            else
                write(*,'(a)') 'Hello World!'
            end if
        end subroutine hw2

end module mymod

program main
use, intrinsic :: iso_fortran_env
use, non_intrinsic :: mymod
implicit none

    integer(int64), parameter :: i_max = 100000000_int64
    integer(int64) :: c1, c2, cr, i, j
    real(real64) :: elapsed
    integer :: lun, lun1, lun2

    write(*,'(a)') 'COMPILER: '//compiler_version()//', OPTIONS: '//compiler_options()

    do j=1_int64,2_int64

        if (j == 2_int64) then
            write(*,*) 'dumping output to /dev/null'
            open(newunit=lun, file='/dev/null', action='write')
            lun1 = lun
            lun2 = lun
        else
            write(*,*) 'hw1 writes to 831, hw2 writes to 530'
            lun1 = 831
            lun2 = 530
        end if

        call system_clock(c1, cr)
        do i=1_int64,i_max
            call hw1(lun1)
        end do
        call system_clock(c2)
        elapsed = real(max(c2 - c1, 1_int64), real64)/real(cr, real64)
        write(*,*) 'write(831,*): ',elapsed,' seconds'

        call system_clock(c1, cr)
        do i=1_int64,i_max
            call hw2(lun2)
        end do
        call system_clock(c2)
        elapsed = real(max(c2 - c1, 1_int64), real64)/real(cr, real64)
        write(*,*) 'write(530,''(a)''): ',elapsed,' seconds'

    end do

write(*,*)''
end program main
terminal output (click to see)
COMPILER: GCC version 13.1.1 20230429, OPTIONS: -march=znver3 -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -msse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mvpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mclwb -mclzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mmwaitx -mno-pconfig -mpku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mrdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -mno-sgx -msha -mshstk -mno-tbm -mno-tsxldtrk -mvaes -mno-waitpkg -mwbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint -mno-amx-complex --param=l1-cache-size=32 --param=l1-cache-line-size=64 --param=l2-cache-size=512 -mtune=znver3 -Ofast -flto -fwhole-program -fpre-include=/usr/include/finclude/math-vector-fortran.h
 hw1 writes to 831, hw2 writes to 530
 write(831,*):    7.8473035189999996       seconds
 write(530,'(a)'):    10.895682978000000       seconds
 dumping output to /dev/null
 write(831,*):    17.326002824000000       seconds
 write(530,'(a)'):    21.694812569000000       seconds
 
COMPILER: flang Flang - 1.5 2017-05-01, OPTIONS: -Ofast -march=native -flto -o proj_flang
 hw1 writes to 831, hw2 writes to 530
 write(831,*):     12.12563490000000       seconds
 write(530,'(a)'):     10.88859700000000       seconds
 dumping output to /dev/null
 write(831,*):     11.41602090000000       seconds
 write(530,'(a)'):     10.42051290000000       seconds
 
COMPILER: Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.9.0 Build 20230302_000000, OPTIONS: -Ofast -mavx2 -ipo -fp-model=fast=2 -fp-speculation=fast -o proj_ifort
 hw1 writes to 831, hw2 writes to 530
 write(831,*):    74.2634630000000       seconds
 write(530,'(a)'):    75.9310700000000       seconds
 dumping output to /dev/null
 write(831,*):    24.6305700000000       seconds
 write(530,'(a)'):    24.2716460000000       seconds
 
COMPILER: Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320, OPTIONS: -Ofast -mavx2 -ipo -fp-model=fast=2 -fp-speculation=fast -o proj_ifx 
 hw1 writes to 831, hw2 writes to 530
 write(831,*):    73.6425570000000       seconds
 write(530,'(a)'):    74.0088560000000       seconds
 dumping output to /dev/null
 write(831,*):    22.1595820000000       seconds
 write(530,'(a)'):    23.5241890000000       seconds
compile + run script (click to see)
#!/bin/bash

clear
rm -v a.out fort.*

gfortran -Ofast -march=native -flto -fwhole-program -o proj_gfortran main.f90
./proj_gfortran

flang -Ofast -march=native -flto -o proj_flang main.f90
./proj_flang

ifort -Ofast -mavx2 -ipo -fp-model=fast=2 -fp-speculation=fast -o proj_ifort main.f90
./proj_ifort

ifx -Ofast -mavx2 -ipo -fp-model=fast=2 -fp-speculation=fast -o proj_ifx main.f90
./proj_ifx

Though no idea about /dev/null, string buffering might be related to the difference among compilers (like gfortran and ifort)? On my Linux machine (Ubuntu22), gfortran-12 -O3 gives

COMPILER: GCC version 12.1.0, OPTIONS: -mtune=generic -march=x86-64 -O3 -fpre-include=/usr/include/finclude/math-vector-fortran.h

 hw1 writes to 831, hw2 writes to 530
 write(831,*):   0.12493006700000001       seconds
 write(530,'(a)'):   0.15688484599999999       seconds
 dumping output to /dev/null
 write(831,*):   0.27056407999999998       seconds
 write(530,'(a)'):   0.32190670300000002       seconds

and with this environment variable

export GFORTRAN_UNBUFFERED_ALL="Y"

it gives

COMPILER: GCC version 12.1.0, OPTIONS: -mtune=generic -march=x86-64 -O3 -fpre-include=/usr/include/finclude/math-vector-fortran.h
 hw1 writes to 831, hw2 writes to 530
 write(831,*):   0.65039449299999996       seconds
 write(530,'(a)'):   0.71268454800000003       seconds
 dumping output to /dev/null
 write(831,*):   0.27168524100000002       seconds
 write(530,'(a)'):   0.32212702100000001       seconds

(*1) I changed i_max to 1000000 because the original code outputs GB size files…)
(*2) This page also has some tests (here with flush())

(except from the above code)
! Flush() is used to avoid merely buffering the data, giving artificial results.
! in a real program, the amount of data written to disk is substantial
! on each iteration, and the buffer would naturally get flushed.

2 Likes

There have been several threads about /dev/null. The main points are that writing to /dev/null or nul can save filespace but that the appropriate solution is to conditionally execute the write statements; and if they are debug statements to conditionally compile them using preprocessing.

Formatted files generate overhead even if not actually written, as can be demonstrated by writing to internal files instead of physical files; in particular if functions are called in the I/O statements.

The presence of I/O statements prevents many compiler optimizations, particularly in loops.

Depending on the version of Fortran and the OS you may or may not be able to open /dev/null for mixed formatted/unformatted I/O or attach it to multiple LUNs; making such usage non-portable.

Almost all the speed differences between writing to /dev/null or somewhere else are related to cacheing and buffering. The compiler may implement its own buffers, the system and the hardware device almost always do by default as well.

things like defining a macro to be “!WRITE” or “WRITE” instead of using #if/#else/#endif, writing to internal files (you have to know the size of your biggest I/O statement); assigning files to memory residence, and asking for an additional feature to the language like “STATUS=‘NULL’” are all discussed; even intentionally assigning a LUN of -1 and using IOSTAT to handle the error are all interesting discussions.

The general conclusion is to expect what you are seeing – writing to /dev/null is NOT a NOOP and can slow down large programs and is very system-dependent. Performance can be generally improved by changing the buffering options.

Using the comp.lang.fortran google interface I am not finding conversations I know are there, accept this one:

https://groups.google.com/g/comp.lang.fortran/c/WjcMg0tqdvA/m/4QEBYGJwDgAJ

there have been others including some dramatic performance improvements where debug statements including timing calls (which usually require a system interrupt to make a system call and gather kernel statistics) that were being written to /dev/null in inner loops were conditionally compiled only for the debug version of a code showed how misleading it can be to think anything like" “writing to /dev/null is like a noop” can be.

3 Likes

Thank you for your detailed response. Unfortunately, my actual use case involves writing debug information from tens of thousands of locations across thousands of files, with inconsistent spacing and capitalization. Perhaps a good regex could find them all, but it would be extremely difficult to verify. Hiding the writes behind preprocessor macros or writing to just some huge buffer would require herculean effort. Currently this file gets opened on /dev/null when such information is not required, but this is obviously leaving a lot of performance on the table.

At one time, I used the convention that a positive unit number meant to write to the file, and a negative unit number meant to avoid the write. I had statements like

if(unit.gt.0) write(unit...)

scattered throughout my codes. Then the newunit= keyword was added to fortran open statements, and those negative values became potentially legitimate unit numbers, so I had to go through my codes and back out my tests. I replaced them with #if preprocessor blocks, which of course are nonstandard, but that seemed like the best approach.

Yep, I have lots of that too. Except it’s not even that linked, and actually more like if(whatever_logical_expression_the_author_felt_like) write(debug_lun.... I was thinking more about this, maybe it would be faster if rather than opening /dev/null, I opened debug_lun like open(newunit=debug_lun, status='scratch') when output was not desired to be saved.

UPDATE: Unfortunately, I found no improvement by writing to a file opened with status='scratch'.

code (click to see)
module mymod
implicit none
private

    public :: hw1, hw2

    contains

        impure subroutine hw1(lun)
            integer, intent(in), optional :: lun
            if (present(lun)) then
                write(lun,*) 'Hello World!'
            else
                write(*,*) 'Hello World!'
            end if
        end subroutine hw1

        impure subroutine hw2(lun)
            integer, intent(in), optional :: lun
            if (present(lun)) then
                write(lun,'(a)') 'Hello World!'
            else
                write(*,'(a)') 'Hello World!'
            end if
        end subroutine hw2

end module mymod

program main
use, intrinsic :: iso_fortran_env
use, non_intrinsic :: mymod
implicit none

    integer(int64), parameter :: i_max = 10000000_int64
    integer(int64) :: c1, c2, cr, i, j
    real(real64) :: elapsed
    integer :: lun, lun1, lun2

    write(*,'(a)') 'COMPILER: '//compiler_version()//', OPTIONS: '//compiler_options()

    do j=1_int64,3_int64

        if (j == 2_int64) then
            write(*,*) 'dumping output to /dev/null'
            open(newunit=lun, file='/dev/null', action='write')
            lun1 = lun
            lun2 = lun
        else if (j == 3_int64) then
            write(*,*) 'dumping output to SCRATCH file'
            open(newunit=lun, status='scratch')
            lun1 = lun
            lun2 = lun
        else
            write(*,*) 'hw1 writes to 831, hw2 writes to 530'
            lun1 = 831
            lun2 = 530
        end if

        call system_clock(c1, cr)
        do i=1_int64,i_max
            call hw1(lun1)
        end do
        call system_clock(c2)
        elapsed = real(max(c2 - c1, 1_int64), real64)/real(cr, real64)
        write(*,'(a,i0,a,f0.4,a)') 'write(',lun1,',*): ',elapsed,' seconds'

        call system_clock(c1, cr)
        do i=1_int64,i_max
            call hw2(lun2)
        end do
        call system_clock(c2)
        elapsed = real(max(c2 - c1, 1_int64), real64)/real(cr, real64)
        write(*,'(a,i0,a,f0.4,a)') 'write(',lun2,',''(a)''): ',elapsed,' seconds'

    end do

write(*,*)''
end program main
output to terminal (click to see)
COMPILER: GCC version 13.1.1 20230429, OPTIONS: -march=znver3 -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -msse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mvpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mclwb -mclzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mmwaitx -mno-pconfig -mpku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mrdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -mno-sgx -msha -mshstk -mno-tbm -mno-tsxldtrk -mvaes -mno-waitpkg -mwbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint -mno-amx-complex --param=l1-cache-size=32 --param=l1-cache-line-size=64 --param=l2-cache-size=512 -mtune=znver3 -Ofast -flto -fwhole-program -fpre-include=/usr/include/finclude/math-vector-fortran.h
 hw1 writes to 831, hw2 writes to 530
write(831,*): .7974 seconds
write(530,'(a)'): 1.1112 seconds
 dumping output to /dev/null
write(-10,*): 1.7593 seconds
write(-10,'(a)'): 2.1744 seconds
 dumping output to SCRATCH file
write(-11,*): .7593 seconds
write(-11,'(a)'): 1.0907 seconds
 
COMPILER: flang Flang - 1.5 2017-05-01, OPTIONS: -Ofast -march=native -flto -o proj_flang
 hw1 writes to 831, hw2 writes to 530
write(831,*): 1.2256 seconds
write(530,'(a)'): 1.1631 seconds
 dumping output to /dev/null
write(-13,*): 1.1675 seconds
write(-13,'(a)'): 1.1140 seconds
 dumping output to SCRATCH file
write(-14,*): 1.1890 seconds
write(-14,'(a)'): 1.1365 seconds
 
COMPILER: Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.9.0 Build 20230302_000000, OPTIONS: -Ofast -mavx2 -ipo -fp-model=fast=2 -fp-speculation=fast -o proj_ifort
 hw1 writes to 831, hw2 writes to 530
write(831,*): 7.9760 seconds
write(530,'(a)'): 7.9733 seconds
 dumping output to /dev/null
write(-129,*): 2.3500 seconds
write(-129,'(a)'): 2.6184 seconds
 dumping output to SCRATCH file
write(-130,*): 7.8103 seconds
write(-130,'(a)'): 7.8982 seconds
 
COMPILER: Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320, OPTIONS: -Ofast -mavx2 -ipo -fp-model=fast=2 -fp-speculation=fast -o proj_ifx 
 hw1 writes to 831, hw2 writes to 530
write(831,*): 7.9988 seconds
write(530,'(a)'): 8.0506 seconds
 dumping output to /dev/null
write(-129,*): 2.4725 seconds
write(-129,'(a)'): 2.5456 seconds
 dumping output to SCRATCH file
write(-130,*): 7.9204 seconds
write(-130,'(a)'): 8.0303 seconds

Because of the high potential for performance improvements (because code can often be optimized better that does not contain I/O or conditional statements) the #if/#else/#endif approach can have a lot of advantages; but note one number is still reserved as being an invalid unit – minus one. That can be useful .

1 Like

With several caveats, on most Linux systems if you do “export TMPDIR=/dev/shm” you can get a significant speedup from scratch files.

Several (not all) compilers place scratch files in the default scratch space area. Hopefully this is controlled by the TMPDIR variable and/or the compiler provides extensions for where you can put the file. This is definitely a “check the manual” topic though.

This of course can only be used if you have sufficient memory on your machine to both run your program and hold the scratch data, but even if not usable in production it can be a useful test for trying to determine how much of a performance gain you might get by cleaning up things; as I/O to memory-resident files is typically much faster than to disk. If you already have high-performance solid-state disk storage you might not see as dramatic a change as when using conventional disks. Since you already made a STATUS=‘SCRATCH’ test that is an easy one if you have a /dev/shm or make a tmpfs filesystem though.

By itself, a scratch file would not typically give you a performance gain as it is typically implemented as regular file, often in /tmp; that is unlinked after being created. So it is very likely it will perform at the same rate as a file in your scratch directory unless you have made configuration changes to optimize it.

Indeed. Thanks for this tip, I’ll check it out Monday. I’ll do Fortran, micro benchmarks, and play with things for free (it’s fun!). I don’t find+read compiler manuals though :slight_smile:

Yeah, this was surprising to me. I had assumed a file on scratch would be in memory. Evidently this is not the case, but perhaps some environment variable will let be so (as you mentioned above)…

Yeah. No way to know how big the file will be or so on. Fortran leaves what a file is pretty much an open question so it could be in memory up to a certain size or in a database for that matter, but usually the default is pretty mundane and is rarely optimized accept in HPC environments, and then not always. For big programs with potentially terabytes of I/O and atypically big memories and high-performance networks and things like Lustre file systems this is all a big deal and something as simple as making sure environment varialbles for buffer sizes (in particular for ifort/ifx) are optimal values and that default $TMPDIR $TMP $TEMP variables are set properly and so on is a huge deal; but for a lot of stuff that does not need a supercomputer it is not enough of an issue not to keep the defaults simple and the most common, I’m afraid.

I was actually too curious and ended up asking ChatGPT how to do it, the suggestion matched what you said more or less, but when I try it I don’t see any differences really…

ChatGPT question + response, not really formatted...
User
how do I instruct gfortran to place a file opened as open(newunit=lun, status='scratch') in memory?
ChatGPT

In GNU Fortran (gfortran), the status='scratch' specifier in the OPEN statement indicates that the file is temporary and resides in memory rather than on disk. By default, gfortran will allocate memory for the scratch file automatically. However, you can also explicitly instruct gfortran to use memory for the scratch file by setting the TMPDIR environment variable to a memory-based location.

Here's how you can do it:

    Set the TMPDIR environment variable to a memory-based directory. This step depends on your operating system and shell. Below are a few examples:

    Linux/Mac (Bash):

    shell

export TMPDIR=/dev/shm

Windows (Command Prompt):

shell

set TMPDIR=RAM:

Windows (PowerShell):

shell

$env:TMPDIR = "RAM:"

This sets the temporary directory to /dev/shm on Linux/Mac or RAM: on Windows, both of which typically represent a memory-based file system.

Compile and run your Fortran program as you normally would, using the OPEN statement with status='scratch' to create the file in memory:

fortran

    program main
      implicit none
      integer :: lun, i

      ! Open a scratch file
      open(newunit=lun, status='scratch')

      ! Write some data to the file
      do i = 1, 10
        write(lun, *) i
      end do

      ! Close the file
      close(lun)

    end program main

With the TMPDIR environment variable set to a memory-based location, the scratch file created by OPEN will be allocated in memory instead of on disk. Remember to adjust the TMPDIR environment variable as needed for your specific operating system and shell.

If you have any info about Lustre as far as what it’s actually optimized for, how to get good performance when writing to it, etc, that would be great. No one at work can explain anything about it beyond “it’s a high performance file system,” which is essentially useless.

UPDATE: ChatGPT also claims that ifort will place files opened with status='scratch' in memory by default as well, but, again, haven’t checked the manual…

The design flaw with status='scratch' has always been (since f77) that you cannot specify the file name. The file name is what allows you to place the file in a particular directory, or on a particular device, or on a network file system, or to a soft link, or to link to an environment variable, and so on. My applications use lots of file space, and one often needs to spread the various files around in order for everything to fit.

Just a wild guess, but isn’t the bottleneck the conversion (creation) of string data on write statements, rather than the work done by OS with /dev/null? (because this page indicates /dev/null is “bloody fast”…)

Much much ago, I also tried using Ramdisk as an external file (on a usual Linux PC), but did not get much (or any??) speedup as compared to usual files… I am not sure if this will also be the case with recent OS + recent hardware (fast memory).

The ifort manual does not mention memory residence at all in their on-line manual:

Scratch files go into a temporary directory and are visible while they are open. Scratch files are deleted when the unit is closed or when the program terminates normally, whichever occurs first.

To specify the path for scratch files, you can use one of the following environment variables:

    On Windows*: FORT_TMPDIR, TMP, or TEMP, searched in that order

    On Linux* and macOS: FORT_TMPDIR or TMPDIR, searched in that order

If no environment variable is defined, the default is the current directory. 

nor did gfortran. Not always wanting to read the manual myself, I used strace(1) and also ls -ld /proc/$PID/fd and looked at other /proc files and all scratch files were actual disk files.

Proper lustre usage is a bit of a long subject. VERY dependent on the problem. Typically it is connected over a high-speed MPI network, which can obviously be a performance advantage. Other than that you have to know if the code is reading and writing for a few files, what their sizes are, and how much I/O is performed on the files; and whether it is asynchronous, sequential or random, and so on. How many slices you assign to the files is critical.

In simple terms your files can reside on a very high number of disks. Being broken up on many devices means you can have simultaneous access to many different parts of the file without the normal contention a sequential disk would have; and that random access can occur at different file points at the same time. Each one of those devices typically has a cache which can also be used to advantage. So a whole lot of devices are available that gives you some of the performance you would expect from solid state devices or better. If the files are small or you do not use the lustre commands to optimize or there is nothing parallel or asynchronous going on it can actually be slower than using your local files or memory-resident files. And since it is usually using the highest speed network on your system that is also likely the same network messages are being passed on for MPI codes. So you can get contention. If you are using Lustre with a giant parallel code that does a lot of simultaneous I/O at points in the code where it is not passing messages over the same network and someone helped you set up the right number of slices you will rave about lustre. If you do not you will probably wonder why on earth anyone bought it.

What are you referencing here?

Some things I cannot change: Files will be “small” (on the order of MB, 10GB would be an exceptionally large output), many will be created/written to, and that will all happen asynchronously. The jobs I am concerned with will not be utilizing MPI at all, so if infiniband is the transport from compute node ↔ Lustre filesystem, it will be unburdened by MPI traffic. I have no idea what these “Lustre slices” are.

Also, I do not believe my bottleneck is string creation. I do see a speed improvement when I am definitely writing to memory (internal file, just a variable character(len=1000) :: buffer). I also see significant slowdown wrapping my loop with !$omp parallel do, which further indicates to me that the overhead of using multiple threads for my simple example code is actually greater than the computation work being done. This is true even for writing to the internal buffer variable, which was made threadprivate.

code implementing all discussed write strategies so far (click to see)
module mymod
implicit none
private

    public :: hw1, hw2, hw3

    contains

        impure subroutine hw1(lun)
            integer, intent(in), optional :: lun
            if (present(lun)) then
                write(lun,*) 'Hello World!'
            else
                write(*,*) 'Hello World!'
            end if
        end subroutine hw1

        impure subroutine hw2(lun)
            integer, intent(in), optional :: lun
            if (present(lun)) then
                write(lun,'(a)') 'Hello World!'
            else
                write(*,'(a)') 'Hello World!'
            end if
        end subroutine hw2

        impure subroutine hw3(internal_file)
            character(len=*), intent(inout), optional :: internal_file
            if (present(internal_file)) then
                write(internal_file,*) 'Hello World!'
            else
                write(*,*) 'Hello World!'
            end if
        end subroutine hw3

end module mymod

program main
use, intrinsic :: iso_fortran_env
use, non_intrinsic :: mymod
implicit none

    integer(int64), parameter :: i_max = 10000000_int64
    integer(int64) :: c1, c2, cr, i, j
    real(real64) :: elapsed
    integer :: lun, lun1, lun2
    character(len=1000) :: dump_iout

    !$omp threadprivate(dump_iout)

    write(*,'(a)') 'COMPILER: '//compiler_version()//', OPTIONS: '//compiler_options()

    do j=1_int64,3_int64

        if (j == 2_int64) then
            write(*,*) 'dumping output to /dev/null'
            open(newunit=lun, file='/dev/null', action='write', asynchronous='yes')
            lun1 = lun
            lun2 = lun
        else if (j == 3_int64) then
            write(*,*) 'dumping output to SCRATCH file'
            open(newunit=lun, status='scratch', action='write', asynchronous='yes')
            lun1 = lun
            lun2 = lun
        else
            write(*,*) 'hw1 writes to 831, hw2 writes to 530'
            lun1 = 831
            lun2 = 530
        end if

        call system_clock(c1, cr)
        !$omp parallel do firstprivate(lun1)
        do i=1_int64,i_max
            call hw1(lun1)
        end do
        call system_clock(c2)
        elapsed = real(max(c2 - c1, 1_int64), real64)/real(cr, real64)
        write(*,'(a,i0,a,f0.4,a)') 'write(',lun1,',*): ',elapsed,' seconds'

        call system_clock(c1, cr)
        !$omp parallel do firstprivate(lun2)
        do i=1_int64,i_max
            call hw2(lun2)
        end do
        call system_clock(c2)
        elapsed = real(max(c2 - c1, 1_int64), real64)/real(cr, real64)
        write(*,'(a,i0,a,f0.4,a)') 'write(',lun2,',''(a)''): ',elapsed,' seconds'

        if (j == 3_int64) then
            write(*,*) 'writing to internal file...'
            call system_clock(c1, cr)
            !$omp parallel do
            do i=1_int64,i_max
                call hw3(dump_iout)
            end do
            call system_clock(c2)
            elapsed = real(max(c2 - c1, 1_int64), real64)/real(cr, real64)
            write(*,'(a,f0.4,a)') 'write(dump_iout,*): ',elapsed,' seconds'
        end if

    end do

write(*,*)''
end program main
terminal output when compiled with gfortran, no OpenMP, then OpenMP (click to see)
(base) [tyranids@valinor canidb]$ gfortran -Ofast -march=native -flto -fwhole-program -o proj_gfortran main.f90
./proj_gfortran
COMPILER: GCC version 13.1.1 20230429, OPTIONS: -march=znver3 -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -msse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mvpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mclwb -mclzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mmwaitx -mno-pconfig -mpku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mrdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -mno-sgx -msha -mshstk -mno-tbm -mno-tsxldtrk -mvaes -mno-waitpkg -mwbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint -mno-amx-complex --param=l1-cache-size=32 --param=l1-cache-line-size=64 --param=l2-cache-size=512 -mtune=znver3 -Ofast -flto -fwhole-program -fpre-include=/usr/include/finclude/math-vector-fortran.h
 hw1 writes to 831, hw2 writes to 530
write(831,*): .8666 seconds
write(530,'(a)'): 1.1941 seconds
 dumping output to /dev/null
write(-10,*): 3.7639 seconds
write(-10,'(a)'): 4.0636 seconds
 dumping output to SCRATCH file
write(-11,*): 2.6869 seconds
write(-11,'(a)'): 2.9038 seconds
 writing to internal file...
write(dump_iout,*): 1.3779 seconds
 
(base) [tyranids@valinor canidb]$ gfortran -Ofast -march=native -flto -fwhole-program -o proj_gfortran -fopenmp main.f90
(base) [tyranids@valinor canidb]$ ./proj_gfortran 
COMPILER: GCC version 13.1.1 20230429, OPTIONS: -march=znver3 -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -msse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mvpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mclwb -mclzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mmwaitx -mno-pconfig -mpku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mrdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -mno-sgx -msha -mshstk -mno-tbm -mno-tsxldtrk -mvaes -mno-waitpkg -mwbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint -mno-amx-complex --param=l1-cache-size=32 --param=l1-cache-line-size=64 --param=l2-cache-size=512 -mtune=znver3 -Ofast -flto -fwhole-program -fopenmp -fpre-include=/usr/include/finclude/math-vector-fortran.h
 hw1 writes to 831, hw2 writes to 530
write(831,*): 4.0256 seconds
write(530,'(a)'): 4.4494 seconds
 dumping output to /dev/null
write(-10,*): 6.8750 seconds
write(-10,'(a)'): 7.7228 seconds
 dumping output to SCRATCH file
write(-11,*): 4.9806 seconds
write(-11,'(a)'): 5.9535 seconds
 writing to internal file...
write(dump_iout,*): 5.6348 seconds
1 Like

One performance issue can be the formatting of the data. I cannot think of a case where unformatted I/O would not be faster than formatted.

Another one is that your IO statements may include function calls that are being called to generate information you are then throwing away.

Very system dependent, and depends very much on what buffering is being used, but if /dev/null I/O is unbuffered /dev/null can easily be (and in some of the discussions the numbers show it dramatically) slower than writing to an actual file is in command time as when writing to a file the I/O is now-a-days often heavily buffered and optimized and initially cached in memory.

Many optimizations cannot be performed by the compiler when I/O statements and/or conditionals are present, which requires some kind of conditional compilation. "deed"ing was a very common extension largely because of this in pre-free-formatting FORTRAN. It is usually still available in fixed-format input but I rarely see any equivalent in free-format other than use of a pre-processor. A “deed” line started with “D” instead of “{C,c,*}” and was just an alternate form of a comment line unless the switch was turned on to compile the lines instead.

I have seen people prefix all their debug write statements with things like DEBUG1, DEBUG2, …
and then use macro preprocessors to define DEBUG1 to be an exclamation or an empty string, which seems to work well except for continued lines.

So any one of those can be a big bottle neck, and in some of the other threads whether /dev/null was faster or slower or the same performance-wise varied so it is definitely not true it is always faster, and the fast it still does not eliminate the formatting of the data or the lost optimizations means there can be major hidden performance losses that only eliminating the statements can erase completely.

1 Like

Ouch. Bluntly, if you have a lustre file system that is available to a lot of users for general use and no instructions are being provided that sounds like a recipe for wasting the investment.

 lfs help

shows some of the commands you can use to effectively use Lustre. To oversimplify you can control how many different devices your file will be stored on. Configuring your “slices” is basically how you specify this layout. If your files are typically around 1MB that is so small that you want to keep them on one slice. Assuming you have a lot of processes or threads working with individual files simultaneously you want to make sure those files are spread out on many of the devices so they can all be accessed simultaneously without contention at those lower implementation-level locations. If at the other extreme you had a few large files being simultaneously accessed at different locations you would want (probably) want the opposite layout – the file would be split into many parts and reside on many different devices. There are tutorials on-line about Lustre usage and about Lustre configuration. The short story is unless you get really lucky You have to have a good idea of the I/O access pattern your application requires
(how often, in what size chunks, in what kind of order, using how many files, …) to optimize your performance, and use commands that are specific to your Lustre file system (as listed by “lfs help”) to tell Lustre how it needs to lay your files out to perform well. It is possible that if a lot of this data is scratch and small enough to fit in memory you should be using something like the /dev/shm file systemand binary files opar perhaps just using arrays and not using files at all instead of using the Lustre server at all. I am not sure how much I/O is even a performance issue for you, so there are way too many unknowns to give any definitive answer from my perspective but if you can provide a good description of your application’s I/O pattern to someone responsible for the Lustre file system they should be able to guide you through how to optimize for your code as-is and/or what you want to change to maximize performance if you are resource bound; or maybe just tell you what you want to do instead.

If you look at your current system time, user time, and wallclock time is system time more than a few percent of the user time? Is a performance gain a big deal because of the time or amount of runs that makes the effort worthwhile? Are there only a few codes using Lustre at once or hundreds or thousands? and on and on … I personally don’t have any short answers with all the unknowns.

It is curious where I/O has and has not been optimised by compilers.
You would expect “write (lu, * )” to be faster than “write (lu,fmt=‘(a)’)”, as the format expression does not need to be interpreted, but perhaps ‘*’ may require assessment of the variable type ?
You would expect that writing to a null device should be faster, but most file writes now only go to memory, such is the efficieny of I/O buffering with most operating systems. (Windows does do well with this)

It all depends on where the compiler developers put their effort into improving performance, after achieving functionality.
As for “write (lu,fmt=‘(2i8,6f12.6)’) ng, nl, disp(1:6)”, Gfortran ver 7(or 6?) performance was abysmal, so I wrote my own routine, in Fortran, for I8 and F12.6 (for 8 byte reals) and have never bothered to remove it.
I had put all the work into applying OpenMP to improve eigenvector calculations, but my benchmark program was significantly slower than the non-openmp F95 compiler for writing the results to a formatted file, (which I now never read!)

I think the answer is if the use of * vs fmt or LU > file vs LU > null does’t agree with the compiler testing, stick with what makes sense, as the compiler can change, or wait for a hardware upgrade that is more likely. (more memory or faster SSD drives are coming very quickly, which make old I/O test codes obsolete)
Just beware of buffering (especially OS and SSD disk buffer sizes) when trying to do I/O benchmarks.