Hi everyone,
A colleague reported an issue we have encountered on a new Linux cluster regarding file I/O (writing) over NFS. In short, the configuration of this new cluster is such that writes are not actually flushed to the file until something like 1 GB of data has accumulated or the close statement is reached. This is problematic because some of our simulations can run for quite a long time, and we expect to monitor them at runtime through files that report what is happening.
At first we thought the issue might be with MPI in parallel, but after several tests one of my colleagues managed to create an MWE showing that the issue already appears in sequential code.
main.f90
program main
    use iso_fortran_env
    implicit none
    real(real64)   :: t1, t2, dt
    integer(int32) :: u1, u2, u3
    integer(int32) :: i, N, counter_timer
    integer        :: ret

    ! Declare the interface for the POSIX fsync function
    interface
        function fsync(fd) bind(c, name="fsync")
            use iso_c_binding, only: c_int
            integer(c_int), value :: fd
            integer(c_int) :: fsync
        end function fsync
    end interface
    !------------------------------------------------
    write(*,*) 'Printing files'
    call cpu_time(t1)
    open(newunit=u1, file="fortran_no_flush.txt",   status='replace')
    open(newunit=u2, file="fortran_with_flush.txt", status='replace')
    open(newunit=u3, file="fortran_with_sync.txt",  status='replace')
    i = 0
    N = 60               ! run for N seconds
    counter_timer = 0
    do
        i = i + 1
        write(u1, 001) i   ! never flushed explicitly
        write(u2, 001) i   ! flushed via the Fortran flush statement
        write(u3, 001) i   ! flushed AND pushed to storage via POSIX fsync
        001 format("Line ", I9)
        flush(u2)
        flush(u3)
        ret = fsync(fnum(u3)) !> fnum only works with gfortran
        if (ret /= 0) stop "Error calling FSYNC"
        call cpu_time(t2)
        dt = t2 - t1
        if (dt > counter_timer) then  ! report progress roughly once per second
            write(*,'(I2,"/",I2)') counter_timer, N
            counter_timer = counter_timer + 1
        endif
        if (t2 - t1 > N) exit
    enddo
    close(u1)
    close(u2)
    close(u3)
end program
In this program, three files are opened for writing, and we even call the flush statement on two of them. Compiling with

gfortran -Wall -O3 main.f90 -o test

and running the resulting "test" program on our cluster while simultaneously monitoring from the frontend node with

tail -f fortran_with_flush.txt

(second file), it is not until about 30 seconds later that we see the lines actually flushed to the file, whereas with

tail -f fortran_with_sync.txt

(third file) the output appears immediately.
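As a side note on the MWE itself: cpu_time returns processor time rather than wall-clock time, which happens to work here only because the loop is CPU-bound. For timing against a wall-clock budget, the standard system_clock intrinsic would be the safer choice; a minimal sketch of a drop-in replacement for the t1/t2 pair (int64 and real64 from iso_fortran_env, as in the MWE):

integer(int64) :: count0, count1, count_rate
real(real64)   :: elapsed_seconds

call system_clock(count0, count_rate)  ! wall-clock reference point
! ... the write/flush loop ...
call system_clock(count1)
elapsed_seconds = real(count1 - count0, real64) / real(count_rate, real64)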
Now, the fsync solution is more of a hack: it would require heavy modifications to a codebase that has worked for more than 40 years, and the fnum function is not available under the Intel compiler, which is the compiler we actually use for production.
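For what it is worth, one way to keep the fsync path while staying compiler-portable might be a small preprocessor-guarded helper. This is a minimal, untested sketch: unit_to_fd is a hypothetical name, and it assumes Intel's GETFD function from the IFPORT portability module returns the same POSIX file descriptor as gfortran's FNUM extension (compile with -cpp under gfortran, -fpp under ifort/ifx):

function unit_to_fd(unit) result(fd)
#ifdef __INTEL_COMPILER
    use ifport, only: getfd   ! Intel portability module
#endif
    implicit none
    integer, intent(in) :: unit
    integer :: fd
#ifdef __INTEL_COMPILER
    fd = getfd(unit)          ! Intel: descriptor for a Fortran unit (assumption: matches fnum behavior)
#else
    fd = fnum(unit)           ! GNU extension
#endif
end function unit_to_fd

The fsync call in the MWE would then become ret = fsync(unit_to_fd(u3)).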
Has anyone encountered such a weird problem with newer clusters? How would you go about such a long flush latency? We are hesitant to use the "noac" mount option in the NFS client configuration, which could solve the problem but seems to degrade performance.
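For context, our current understanding (mostly from the nfs(5) man page linked below) is that flush only hands the data to the kernel; the NFS client may keep caching it, and the reading client additionally caches file attributes (acregmin/acregmax default to 3 and 60 seconds), which would be consistent with the roughly 30-second delay we observe. noac is documented as the combination of sync and actimeo=0, i.e. it disables attribute caching and forces synchronous writes, which is presumably where the performance hit comes from; actimeo=0 alone would be a milder variant. A hypothetical mount entry for illustration (server path and mount point are placeholders):

# /etc/fstab -- noac disables attribute caching and implies synchronous writes
server:/export  /mnt/shared  nfs  rw,noac  0 0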
Some resources on the topic:
https://www.alibabacloud.com/help/en/nas/writing-delays-when-accessing-a-file-from-different-nfs-clients
https://linux.die.net/man/5/nfs