Parallel Asynchronous Fortran I/O

Hello,

I am trying to figure out how to read and write unformatted files in Fortran, where different processes may read and write different records, and ideally can do so asynchronously.

The first program below, containing the first piece of a pi estimator, seems to work with the parallel I/O.

This program writes to the same file and the same (fixed) unit number, but to different records based on the image number. The compilation command is ifx -debug -threads -coarray=shared -coarray-num-images=8 -o my_caf_prog ./basic_newunit.f90 . (-threads is not necessary since I do not use the asynchronous specifier anywhere yet; also, -coarray=distributed yields the same behavior.)
The code is below:

program main
  implicit none
  integer, parameter :: blocks_per_image = 2**16
  integer, parameter :: block_size = 2**10
  real, dimension(block_size) :: x, y
  integer :: in_circle[*]  ! an integer but each image has a different local copy
  integer :: i, n_total, rec_len
  real :: step, xfrom

  n_total = blocks_per_image * block_size * num_images()
  step = 1./real(num_images())
  xfrom = (this_image() - 1) * step

  inquire(iolength=rec_len) in_circle, n_total

  open(100, file='output.txt', form='UNFORMATTED', access='DIRECT', recl=rec_len)

  in_circle = 0
  do i = 1, blocks_per_image
     call random_number(x)
     call random_number(y)
     in_circle = in_circle + count((xfrom + step*x)**2 + y**2 < 1.)
  end do

  write(100, rec=this_image()) in_circle, n_total
  sync all
  close(100)

  ! Reset in_circle, n_total to make sure we read values
  in_circle = 10
  n_total = 10
  ! read from the file we wrote to earlier

  open(100, file='output.txt', form='UNFORMATTED', access='DIRECT', action='READ', recl=rec_len, status='OLD')
  read(100, rec=this_image()) in_circle, n_total
  write(*,*) this_image(), " reads in_circle and n_total: ", in_circle, n_total

  sync all

  close(100)
end program main

with expected output

./my_caf_prog 
           2  reads in_circle and n_total:     65871670   536870912
           3  reads in_circle and n_total:     63695869   536870912
           4  reads in_circle and n_total:     60285902   536870912
           5  reads in_circle and n_total:     55407149   536870912
           6  reads in_circle and n_total:     48613368   536870912
           7  reads in_circle and n_total:     38896892   536870912
           8  reads in_circle and n_total:     21944055   536870912
           1  reads in_circle and n_total:     66933288   536870912

EDIT: I have figured out my first question.

I can also avoid hardcoding the unit by specifying newunit=unit, and it works so long as I obtain a new unit whenever I reopen a closed file.
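To illustrate the point about newunit=, here is a minimal sketch (file name is hypothetical): request a fresh unit number via newunit= every time the file is reopened, rather than reusing the stale value left in the variable after close:

```fortran
program reopen_newunit
  implicit none
  integer :: unit

  open(newunit=unit, file='scratch.dat', form='unformatted', access='stream')
  write(unit) 1
  close(unit)

  ! On reopening, request a fresh unit via newunit= again, rather than
  ! passing the old value of `unit` as a hardcoded unit number.
  open(newunit=unit, file='scratch.dat', form='unformatted', access='stream', status='old')
  close(unit, status='delete')
end program reopen_newunit
```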

My second question is: Does this approach write to the file asynchronously with respect to the different processes? (I.e., each file read/write doesn't force a sync all?) If not, is there a way to do this? (I know there is an 'asynchronous' specifier, but I think that refers only to the read/write overlapping with the rest of that image's execution. In principle I can do that alongside parallel I/O, right?)

My goal is to have fast (parallelizable) I/O, at least on a single computer w/ multiple processors. I know there’s HDF5 and netCDF and whatnot, but I’d like to know how to make this work within Fortran, without the need of external libraries.

The error feels like a bug to me, but I'll note that connecting to the same file from multiple images is (as of F2023) implementation-dependent. In F2018 and earlier, it was non-standard.

There is no inherent synchronization of I/O across images, beyond a suggestion in the standard that writes to the standard output and error files be "merged" as good practice.

You do know that Fortran has asynchronous I/O, yes?

I think you would be better off to use asynchronous I/O in a single image if you can. What you’re doing now is playing with fire.


I do not know how to do this.

It might help to indicate which operating system you are using.
Are you controlling OS file buffering?

Is this similar to providing shared access to files from multiple processes ?
Most compilers provide SHARE= in OPEN to control multiple access. You may need to investigate this non-standard functionality. This can control opening a file by multiple processes or multiple opens in the same process.

My solution to this problem is typically to open a unique file for each thread and then at the end of the long computation phase to process into a single file in an OMP CRITICAL construct where I can open, use and close the merged file. This is my way to avoid OS file buffering issues.
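A minimal sketch of the one-file-per-thread pattern described above, assuming OpenMP (file names and the stand-in computation are hypothetical): each thread writes only to its own scratch file during the computation phase, and the merge into a single file happens inside an OMP CRITICAL construct so only one thread touches the merged file at a time:

```fortran
program per_thread_files
  use omp_lib
  implicit none
  integer :: tid, u, mu, ios
  real :: result
  character(len=32) :: fname

  !$omp parallel private(tid, u, mu, ios, result, fname)
  tid = omp_get_thread_num()
  write(fname,'(a,i0,a)') 'part_', tid, '.dat'

  ! long computation phase: each thread writes only to its own file
  result = real(tid)   ! stand-in for the thread's real result
  open(newunit=u, file=fname, form='unformatted', access='stream')
  write(u) result
  close(u)

  ! merge phase: only one thread appends to the merged file at a time
  !$omp critical
  open(newunit=mu, file='merged.dat', form='unformatted', access='stream', position='append')
  open(newunit=u, file=fname, form='unformatted', access='stream', status='old')
  read(u, iostat=ios) result
  write(mu) result
  close(u, status='delete')
  close(mu)
  !$omp end critical
  !$omp end parallel
end program per_thread_files
```

Compile with OpenMP enabled (e.g. ifx -qopenmp or gfortran -fopenmp).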

I ended up editing my original post - the error seemed to be related to reopening a file that had been closed while using the same unit. For some reason a hardcoded unit of 100 didn't trigger the error, but the newunit= functionality did.

You do know that Fortran has asynchronous I/O, yes?

Indeed, that is why I mentioned the ‘asynchronous’ tag.

I think you would be better off to use asynchronous I/O in a single image if you can. What you’re doing now is playing with fire.

If different images are writing to different records, why is there an issue?

re: @JohnCampbell

It might help to indicate which operating system you are using.
Are you controlling OS file buffering?

Ubuntu. It's a personal computer, so presumably I can control it. But ideally I'll have something that works reliably on Linux clusters.

Is this similar to providing shared access to files from multiple processes ?
Most compilers provide SHARE= in OPEN to control multiple access. You may need to investigate this non-standard functionality. This can control opening a file by multiple processes or multiple opens in the same process.

I did look into this, but I came across a source that said modern Fortran does not recommend this practice. I may look into it regardless, but right now I am experimenting with standard Fortran functionality.
EDIT: According to Intel documentation, SHARED is now default on Linux/macOS systems.
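For concreteness, here is a sketch of what the non-standard SHARE= specifier looks like in Intel Fortran (the file name is hypothetical; values like 'DENYNONE' and 'DENYRW' are Intel extensions, and other compilers may not accept this at all):

```fortran
program share_open
  implicit none
  integer :: u, rec_len, a, b

  inquire(iolength=rec_len) a, b
  ! SHARE= is an Intel Fortran extension; 'DENYNONE' requests that other
  ! processes may also open the file while this connection is active.
  open(newunit=u, file='output.txt', form='unformatted', access='direct', &
       recl=rec_len, share='DENYNONE')
  close(u)
end program share_open
```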

My solution to this problem is typically to open a unique file for each thread and then at the end of the long computation phase to process into a single file in an OMP CRITICAL construct where I can open, use and close the merged file. This is my way to avoid OS file buffering issues.

Not a bad idea. This will work on my personal computer, but if I want something I can run on a cluster, the typical clusters I work with don't like user code creating tons of different files during the computation - both policy-wise and efficiency-wise.

I’ll keep experimenting and see if I find something I’m happy with.

This is because the records are artificial constructs on many i/o devices, such as spinning disks. Say two images read the same disk sector, one updates one subset of bits, the other updates another subset of bits, and then they both write back their modifications. Whichever one writes first will get overwritten by the one that writes later.


According to this guide for improving I/O, records are related to the BLOCKSIZE of the system. Your point seems valid if the RECL is less than the BLOCKSIZE, which the automatic behavior may allow to happen (a quote from the site is below).

Is it safe to say that if individual records are larger than the BLOCKSIZE, this behavior should not occur? If so, I could always chunk records appropriately.

The sum of the record length (RECL specifier in an OPEN statement) and its overhead is a multiple or divisor of the blocksize, which is device-specific. For example, if the BLOCKSIZE is 8192 then RECL might be 24576 (a multiple of 3) or 1024 (a divisor of 8).

The RECL value should fill blocks as close to capacity as possible (but not over capacity). Such values allow efficient moves, with each operation moving as much data as possible; the least amount of space in the block is wasted. Avoid using values larger than the block capacity, because they create very inefficient moves for the excess data only slightly filling a block (allocating extra memory for the buffer and writing partial blocks are inefficient).

The RECL value unit for formatted files is always 1-byte units. For unformatted files, the RECL unit is 4-byte units, unless you specify the -assume byterecl (Linux) or /assume:byterecl (Windows) option for the ASSUME specifier to request 1-byte units.
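Following the chunking idea above, here is a sketch of padding the record length up to a whole multiple of an assumed blocksize, so that no two records share a device block. The 8192-byte blocksize is an assumption (it is device-specific), and the code assumes compilation with -assume byterecl so that iolength/RECL are in bytes:

```fortran
program padded_recl
  implicit none
  ! Assumes -assume byterecl, so RECL and iolength units are bytes.
  integer, parameter :: blocksize_bytes = 8192   ! assumed; device-specific
  integer :: in_circle, n_total, payload_len, rec_len, u

  in_circle = 0
  n_total = 0
  inquire(iolength=payload_len) in_circle, n_total
  ! round the record length up to a whole number of blocks
  rec_len = ((payload_len + blocksize_bytes - 1) / blocksize_bytes) * blocksize_bytes
  print *, 'payload length:', payload_len, ' padded recl:', rec_len

  open(newunit=u, file='padded.dat', form='unformatted', access='direct', recl=rec_len)
  write(u, rec=1) in_circle, n_total
  close(u, status='delete')
end program padded_recl
```

The trade-off the guide warns about still applies: padding wastes the unused tail of each block, so this only makes sense when the payload is already close to (or larger than) a block.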

I now have some working code that does both parallel (coarray) and asynchronous read/writes to a single file, of different records. It’s not too different from the above code, but I’ll include it at the bottom for completeness.

I obtain the expected output, which is a good sign. As Ron points out, however, this may just be luck. I ran it 20 times without seeing any errors, but a stress test (large files of small records) might trigger the behavior Ron is alluding to.

program main
  implicit none
  integer, parameter :: blocks_per_image = 2**16
  integer, parameter :: block_size = 2**10
  real, dimension(block_size) :: x, y
  integer :: in_circle[*]  ! an integer but each image has a different local copy
  integer :: unit, i, n_total, rec_len, io_id
  real :: step, xfrom

  n_total = blocks_per_image * block_size * num_images()
  step = 1./real(num_images())
  xfrom = (this_image() - 1) * step

  inquire(iolength=rec_len) in_circle, n_total

  open(newunit=unit, file='output.txt', form='UNFORMATTED', access='DIRECT', recl=rec_len, asynchronous='yes')

  in_circle = 0
  do i = 1, blocks_per_image
     call random_number(x)
     call random_number(y)
     in_circle = in_circle + count((xfrom + step*x)**2 + y**2 < 1.)
  end do

  write(unit, rec=this_image(), asynchronous='yes') in_circle, n_total
  sync all
  close(unit)  ! async operations finish before the unit closes

  ! Reset in_circle, n_total to make sure we read values
  in_circle = 10
  n_total = 10

  open(newunit=unit, file='output.txt', form='UNFORMATTED', access='DIRECT', action='READ', recl=rec_len, status='OLD', asynchronous='yes')
  read(unit, rec=this_image(), asynchronous='yes', id=io_id) in_circle, n_total
  ! can in principle do computations here, so long as they don't need in_circle, n_total

  ! wait before printing, to let the asynchronous read complete:
  ! unit= specifies the file unit, id= the particular I/O operation
  wait(unit=unit, id=io_id)
  write(*,*) this_image(), " reads in_circle and n_total: ", in_circle, n_total

  sync all

  close(unit)
end program main

Because the behavior is not specified by the standard. Even the ability to open the same file in more than one connection (and each image has its own set of connections - you are not sharing a unit) is implementation-dependent and may require non-standard OPEN options. An implementation may allow you to do these writes but the file may end up corrupted, depending on just how it is done internally.

I remember from my VMS days that the file system offered a particular file structure (called “relative organization”) that included extra metadata the OS used for coordinating record I/O across multiple processes. Most OSes used today don’t have such protections - you’re on your own.


If you think about how the records would be updated, those sectors that are entirely within a larger record might be safe, but the boundary sectors that contain parts of two records could still be corrupted by access from the different images. I would say that it is still not safe to do this. The only exception would be a file system that is designed with parallel access in mind. Those kinds of file systems do exist. Here is a link to one that I have heard about, but not used. Parallel Virtual File System - Wikipedia

The only exception would be a file system that is designed with parallel access in mind.

I see. I went down a rabbit hole of parallel/distributed filesystems, including OrangeFS (the successor to PVFS), and found that OrangeFS was merged into Linux kernel 4.6.

I am not so familiar with this, and the Linux kernel page of the OrangeFS documentation doesn't tell me much about actually using it. Do my regular READ/WRITE statements use OrangeFS automatically, since it is part of the Linux kernel? I am not sure. Perhaps I will open a discussion on their GitHub page about it.

Because the behavior is not specified by the standard

I see… perhaps it should be? If coarrays are in the current standard, and they largely can replace MPI API usage in Fortran, then maybe the standard could also have something that could replace MPI I/O (ROMIO)? [The link suggests ROMIO allows different processes to access different bytes of a file]

The simplest syntax I can think of is simply being able to WRITE a coarray to a single file, with multiple processes working independently - and being able to do so asynchronously as well. One would presumably need to READ the file back with the same number of processors it was written with. The standard would require that these reads/writes can be done independently, so that speedup is achieved; the implementation could worry about the details.

If we don’t want to involve coarrays, perhaps the standard could support the behavior in my code above explicitly. Essentially, there could be a flag in an OPEN statement guaranteeing that different records can be accessed simultaneously by different processes. I am aware of non-standard SHARE flags in some compilers - in fact the Intel compiler uses SHARED by default - so perhaps that could be added to the standard. The implementation could figure out the details, whether that means records are padded so they always occupy different sectors, or something else.

A more flexible (or drastic) approach would be to extend direct-access read/write so that, in addition to records, you have bins of records, each bin containing a number of records guaranteed not to overlap in sectors with any other bin. The idea is that each image could be responsible for one or more bins: image 1 could own bins 1 and 2, image 2 bins 3 and 4, and so on. This would let you READ these files with a different number of processors, so long as you can specify which processors are responsible for which bins (or let the implementation decide that for you and return an array telling you which image each bin was assigned to).

I don’t think so. Coarray applications work best when there is minimal communication across images. Adding what is effectively an all-images (or at least all-team?) synchronization on I/O, when an application could be spread across thousands of images, seems untenable.

Instead consider having image 1 do all the I/O and receive data from the other images. This is the most common method.
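A minimal sketch of that pattern, where each image computes its partial result and image 1 pulls the values across and performs all the file I/O (the file name and the stand-in computation are hypothetical):

```fortran
program gather_io
  implicit none
  integer :: in_circle[*]   ! each image's partial result
  integer :: img, u

  in_circle = this_image()  ! stand-in for each image's real computation

  sync all                  ! ensure every image's value is ready before reading it
  if (this_image() == 1) then
     ! image 1 gathers the coindexed values and does all the I/O
     open(newunit=u, file='counts.dat', form='unformatted', access='stream')
     do img = 1, num_images()
        write(u) in_circle[img]
     end do
     close(u)
  end if
  sync all
end program gather_io
```

Only one image ever has the file open, so none of the shared-connection concerns discussed above apply.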

I’m not going to consider the “more flexible/drastic approach” - this seems to me to be far outside what the Fortran standard tries to do. You can emulate your approach with the current language.


This whole idea sounds like a great way to destroy your own work and crash the computer. I would use one thread to handle one file at a time, plus a dedicated thread to take the finished files from the other threads and merge them back into the main or output file. That's what I've always had to do with reports that include totals in a callout: open, write, close, toss the name to the merge thread.

Knarfnarf
