IO flush issue on Linux cluster

Hi everyone,

A colleague reported an issue we have encountered on a new Linux cluster regarding file I/O (writing) over NFS. Essentially, the cluster's configuration is such that files are not flushed until something like 1 GB of data has accumulated or the file is closed. This is problematic because some of our simulations can be quite long, and we expect to monitor them at runtime through files that report what is happening.

At first we thought the issue might be with MPI in parallel runs, but after several tests one of my colleagues managed to create an MWE showing that the issue already appears in sequential code.

main.f90
program main
    use iso_fortran_env
    implicit none
    real(real64) :: t1, t2, dt
    integer(int32) :: u1, u2, u3
    integer(int32) :: i, N, counter_timer
    integer :: ret
    ! Declare the interface for POSIX fsync function
 
    interface
        function fsync (fd) bind(c,name="fsync")
          use iso_c_binding, only: c_int
          integer(c_int), value :: fd
          integer(c_int) :: fsync
        end function fsync
    end interface  
    !------------------------------------------------
    write(*,*) 'Printing files'
    call cpu_time(t1)
    open(newunit=u1, file="fortran_no_flush.txt", status='replace')
    open(newunit=u2, file="fortran_with_flush.txt", status='replace')
    open(newunit=u3, file="fortran_with_sync.txt" , status='replace')
    i = 0
    N = 60
    counter_timer = 0
    
    do  
        i = i + 1
        write(u1, 001) i
        write(u2, 001) i
        write(u3, 001) i
001 format("Line ", I9)        
        flush(u2)
        flush(u3)
        ret = fsync(fnum(u3)) !> fnum only works with gfortran
        if (ret /= 0) stop "Error calling FSYNC"
 
        call cpu_time(t2)
        dt = t2 - t1
        if (dt > counter_timer) then
            write(*,'(I2,"/",I2)') counter_timer, N
            counter_timer = counter_timer + 1
        endif
        if (t2 - t1 > N) exit
    enddo

    close(u1)
    close(u2)
    close(u3)

end program

In this program, three files are opened for writing, and we even apply a FLUSH statement to two of them. Compiling with gfortran -Wall -O3 main.f90 -o test, running the “test” program on our cluster, and simultaneously monitoring from the frontend node with

tail -f fortran_with_flush.txt 

(second file) it is not until about 30 seconds later that we actually see the lines flushed to the file, while

tail -f fortran_with_sync.txt 

(third file) the lines appear immediately.

Now, the fsync solution is more of a hack: it would imply heavy modifications to a codebase that has worked for more than 40 years, and the fnum function is not available under the Intel compiler, which is the compiler we actually use in production.
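For reference, one portable variant we are considering (only a sketch, not yet tested with ifort) re-opens the file by name through the POSIX C API instead of relying on gfortran's fnum; the module and routine names below are just placeholders:

module posix_sync
    !> Sketch of a compiler-independent replacement for fsync(fnum(unit)):
    !> re-open the file by name through the POSIX C API and fsync that
    !> descriptor.  O_WRONLY is hard-coded with its usual Linux value.
    use iso_c_binding, only: c_int, c_char, c_null_char
    implicit none
    private
    public :: sync_file

    interface
        function c_open(path, flags) bind(c, name="open")
            import :: c_int, c_char
            character(kind=c_char), intent(in) :: path(*)
            integer(c_int), value :: flags
            integer(c_int) :: c_open
        end function c_open
        function c_fsync(fd) bind(c, name="fsync")
            import :: c_int
            integer(c_int), value :: fd
            integer(c_int) :: c_fsync
        end function c_fsync
        function c_close(fd) bind(c, name="close")
            import :: c_int
            integer(c_int), value :: fd
            integer(c_int) :: c_close
        end function c_close
    end interface

contains

    subroutine sync_file(fname, stat)
        !> Ask the kernel to push any cached data for file "fname" to storage
        !> (or, over NFS, back to the server).  stat is 0 on success.
        character(*), intent(in)  :: fname
        integer,      intent(out) :: stat
        integer(c_int) :: fd
        integer(c_int), parameter :: o_wronly = 1_c_int  ! value from <fcntl.h> on Linux
        stat = -1
        fd = c_open(trim(fname)//c_null_char, o_wronly)
        if (fd < 0) return
        stat = int(c_fsync(fd))
        if (c_close(fd) /= 0 .and. stat == 0) stat = -1
    end subroutine sync_file

end module posix_sync

In the MWE above the call would then become flush(u3) followed by call sync_file("fortran_with_sync.txt", ret), keeping the FLUSH so the data first leaves the Fortran runtime buffers before the kernel is asked to push it to the server.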

Has anyone encountered such a strange problem on newer clusters? How would you approach such long latencies when flushing files? We are hesitant to use the “noac” option in the NFS mount configuration, which could solve the problem but seems to degrade performance.

Some resources on the topic:
https://www.alibabacloud.com/help/en/nas/writing-delays-when-accessing-a-file-from-different-nfs-clients
https://linux.die.net/man/5/nfs

I’ve seen this before on other non-NFS clusters. In the end I just closed and immediately reopened the output files regularly to get the data to sync to disk.
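Roughly along these lines (a sketch from memory, not the exact code I used):

subroutine resync(unit, fname)
    !> Close and immediately reopen an output file: closing the unit forces
    !> the client to write the cached data back, and reopening with
    !> position='append' lets the run keep writing to the same file.
    implicit none
    integer,      intent(inout) :: unit
    character(*), intent(in)    :: fname
    close(unit)
    open(newunit=unit, file=fname, status='old', position='append', action='write')
end subroutine resync

Calling something like this every so many iterations (or seconds of wall time) kept the monitoring files reasonably up to date without touching anything else in the code.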

I think this is basically the tradeoff. Either the file contents are locally cached on the client or they aren’t. If they are locally cached, then you can’t see the results immediately on the server, but if caching is disabled, then you will see the performance hit because each I/O operation will stall until the server data is updated and reported back to the client. The Fortran FLUSH statement is usually implemented so that it only applies to the local Fortran I/O buffers on the client, not to the remote server state.

There are sometimes multiple layers of caches between nodes on parallel machines themselves. When one node writes data to a remote disk, it can pass through several levels of handshaking before it arrives at the disk.

There is another layer of caching on the disk itself. These days, SSDs have local caches that can be several GB in size, so when the data is written to the disk, all that really means is that the data has been pushed to the disk cache, not necessarily that it has been written to the magnetic surface or to the SSD’s nonvolatile memory itself. That usually isn’t important to the user program unless robustness against power loss at the disk itself is critical.

@rfarmer that’s one of the solutions we are trying to avoid, as it would require changing many places in the code. We also saw the issue in C and C++, so changes at the code level would be very lengthy and bug-prone.

@RonShepard Thanks for the insights! This is a very dark zone for me and for the colleague who is actually trying to fix it. We still need to iterate with the IT guys managing the cluster to find a robust solution.

All inputs are very appreciated :slight_smile:

Typically the executable has a software cache, and as mentioned almost all modern hardware has a cache as well; this is quite normal for NFS. Assuming it is available on your system, try the unbuffer command, with the caveat that if you are doing large amounts of I/O it can degrade performance significantly. Instead of going down all these paths, a FLUSH in the Fortran code combined with reading the files from the node they were written on (assuming the I/O is indeed all coming from a single node) is usually sufficient, unless you truly need real-time monitoring. Most compilers document environment variables and/or OPEN options to minimize the internal software buffer sizes, which can eliminate one of the simpler caches (see the sketch below). There are many things you can do, but each has costs and several require changing the code, so I would first try a simple approach like the unbuffer command on the node doing the writes.
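For example (a sketch; BUFFERED= is an Intel Fortran extension, not standard Fortran, so check your compiler's documentation — with gfortran the GFORTRAN_UNBUFFERED_ALL environment variable plays a similar role):

program unbuffered_open
    !> Ask the runtime not to buffer this unit, so each WRITE goes straight
    !> to the operating system.  BUFFERED= is a vendor extension (Intel
    !> Fortran); it does not change the NFS client cache, only the
    !> compiler's own internal I/O buffer.
    implicit none
    integer :: u, i
    open(newunit=u, file='monitor.txt', status='replace', buffered='no')
    do i = 1, 5
        write(u,'("Line ",i9)') i
    end do
    close(u)
end program unbuffered_open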

If that is not sufficient, …

  1. what compiler are you using?
  2. can you write the output to a local device? A tmpfs filesystem would probably be best unless you have an unusually high-end system. A df command will list the tmpfs filesystems you already have available; for small files you probably at least have /dev/shm writable. Note that filling it up will not go well for your system if only a simple default configuration is available (see the sketch after this list).
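As a sketch of option 2 (assuming /dev/shm is mounted and writable on the compute node, and remembering that the file is then only visible on that node, so you would tail it there rather than on the frontend):

program shm_monitor
    !> Write the monitoring file to node-local tmpfs instead of NFS.
    !> Nothing here touches the NFS client cache, so a "tail -f" run on the
    !> same node sees the lines as soon as they are flushed.
    implicit none
    integer :: u, i
    open(newunit=u, file='/dev/shm/monitor.txt', status='replace')
    do i = 1, 10
        write(u,'("Line ",i9)') i
        flush(u)
    end do
    close(u)
end program shm_monitor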

Thanks @urbanjost, we’ll try that out! The actual production compiler is ifort 19 (I referenced gfortran at the top because we found that the issue was present regardless of the compiler).