Get record lengths from unformatted sequential access file

I am working with a heritage data file format that is unformatted, sequential access. It contains time history output from an engineering code that basically looks like this:

ncol "descriptive string of text" "time[sec]    " "var1[units]  " "var2[units]  " "[ncol-3 more columns...]"
realdata     realdata     realdata     [...]
realdata     realdata     realdata     [...]
realdata     realdata     realdata     [...]

where ncol is the number of columns of data (different for each file), and realdata can be either single (default real) or double precision as specified by the user at runtime.
(Edit: turns out the first three lines are actually written as a single line, so I have updated the example.)

The file has no explicit indication whether single or double precision data was printed, but gfortran and Intel Fortran store the record length in bytes in each record marker. I think we can figure out the precision by comparing the record length of the data lines in bytes against the number of columns ncol. If the record length is 4*ncol, it’s single, and if it’s 8*ncol it’s double. But how can I directly access the record length? (Note it’s a sequential access file, so recl is not relevant as far as I know.)

Or is there a better Fortranic way to do this?

I figured out a way to do this with access='stream': count exactly how many bytes to step past the introductory content, then read the record marker of the first data line.
(unable to copy/paste from source system so retyping here, please forgive errors)

integer :: iunit, ncol, reclen

open(newunit=iunit,file='filename.txt',access='stream')
read(iunit,pos=5) ncol ! skip the 4-byte record marker and read ncol from the first record

! the first record spans 48 bytes = 8 bytes of record markers + 4 bytes of `ncol` + 36 bytes of descriptive text,
! plus 36 bytes for each column header text, and
! +1 to land on the next byte, the record marker of the first data line
read(iunit,pos=(48+ncol*36+1)) reclen
close(iunit)

if (reclen == ncol*4) then
   print *, 'This file is single precision.'
elseif (reclen == ncol*8) then
   print *, 'This file is double precision.'
else
   print *, 'This file is trouble precision.'
endif

You can open the file as a binary stream and then read the header record plus the two four-byte markers that delimit the start and end of a record. That would give you the length of the first record; as you say, the first record has a specific length. The next four bytes would then give the length in bytes of the second record.

Just a sketch of course.


Yes, you’re right, I could simplify the method of finding the second record marker, instead of (48+ncol*36+1) which assumes 36 bytes for each text entry:

integer :: iunit, ncol, reclen1, reclen

open(newunit=iunit,file='filename.txt',access='stream')
read(iunit,pos=1) reclen1 ! read the 4-byte record marker of the first record
read(iunit,pos=5) ncol    ! ncol sits just past that marker

! skip reclen1 bytes of data plus 2*4 bytes of record markers to land on the record marker of the second line
read(iunit,pos=(reclen1+2*4+1)) reclen

Another possibility:

character(:), allocatable :: str
integer :: lu, ncol, stat

open(newunit=lu,file='filename',form='unformatted')
read(lu) ncol
allocate( character(8*ncol) :: str )
read(lu,iostat=stat) str   ! try reading 8*ncol bytes from the first data record
if (stat == 0) then
   ! success, the data are likely 64 bits
else
   ! failure, the record is shorter than 8*ncol bytes,
   ! so the data are likely 32 bits
end if

I’ve always thought it would be nice if INQUIRE could be extended to return this type of information for a file, in this case the number of records in the file and an array of the record lengths. Just being able to return the number of records in a sequential access file would be a plus for me. I would not have to keep repeating the same (granted, small) loop constructs just to count the number of records in an input file for every code I write.
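For reference, the small counting loop in question looks something like this (a minimal sketch; the file name and variable names are placeholders):

integer :: lu, stat, nrec

nrec = 0
open(newunit=lu, file='filename', form='unformatted', status='old')
do
   read(lu, iostat=stat)   ! a read with an empty i/o list advances one record
   if (stat /= 0) exit     ! end of file (or an error) stops the count
   nrec = nrec + 1
end do
rewind(lu)                 ! reposition to the start for the real reads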

I’ve always thought this too, but there is a reason why it is not supported. For unformatted sequential disk files, the i/o library must know the record length in order to support reading partial records, skipping to the end of the record, and backspacing, so it seems natural that the programmer should be able to query that information from the library. It is almost free information, with little or no extra effort required. This extends also to more modern media, like SSDs, which emulate spinning disks. But the reason it isn’t required is the tape drive legacy of fortran. I’m talking about the old reel-to-reel drives. On those devices, the record length was not available; rather, there were gaps between the records. To read a record, the tape had to be physically advanced past the read head, transferring the information until the end-of-record gap was detected. Only after the fact was the record length available. So for a programmer to request that information and then use it to actually read the data would have required a scan step, a backspace step, and then the actual read step, all of which were slow and required effort.

So I think a better solution would be to define within the fortran standard a way to read a record of unknown length. In modern fortran, this could be done with an allocatable array or an allocatable character string. The library could reallocate the array if necessary, and return both the data and the record length together. This was not possible with f77 and earlier, because there were no allocatable variables, but with modern fortran (at least since f2003) it would be straightforward to implement. Reading an array of unknown size, or in this case a record of unknown size, is a common task, and all of the necessary pieces of the language are now there to support it. Of course, it should also work for formatted i/o in a similar way. Also note that although this can be emulated using, for example, nonadvancing i/o or stream i/o and linked-list data structures, these approaches require copying the data multiple times and are less efficient than if this capability were incorporated directly into the fortran i/o library.
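To illustrate the stream-i/o emulation mentioned above, here is a rough sketch that reads one unformatted record of unknown length, assuming the gfortran/Intel Fortran convention of 4-byte leading and trailing record markers (this is compiler-specific, and it ignores the subrecord scheme those compilers use for records longer than 2 GB):

use, intrinsic :: iso_fortran_env, only: int8, int32
integer :: lu
integer(int32) :: head, tail
integer(int8), allocatable :: payload(:)

open(newunit=lu, file='filename', access='stream', form='unformatted')
read(lu) head        ! leading record marker = record length in bytes
allocate( payload(head) )
read(lu) payload     ! the record contents, as raw bytes
read(lu) tail        ! trailing marker; should equal the leading one
close(lu)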

I appreciate the historical perspective. But I have to admit I don’t see why this prevents inquire from being extended to provide the current record length for unformatted sequential files.

Sequential files, both formatted and unformatted, can have variable length records. Simply looking at the first record does not tell anything about the rest of the records in the file. It is also important to note that there is no standard way to delimit individual records. In the formatted case, these days it is generally with a 'newline' or 'CR/LF' sequence at the end of each record. In the unformatted case, some form of byte count is used before each record. The individual record delimiters or lengths for each record must be present so that BACKSPACE can be supported.

In the case of direct access files, the record lengths are fixed as specified at OPEN time. Generally there are no control bytes preceding the records, so INQUIRE can’t really tell beforehand what the correct record length should be. (Perhaps some systems keep it in the file’s metadata; I seem to recall VAX/VMS is one of them.) Once the file has been opened with a given RECL, it would be trivial to have INQUIRE return the number of records.

Sorry, to be more specific, I’m talking only about unformatted sequential files (the previous post has been updated). And by "current record" I meant the record length adjacent to the current position in the file, not the first record.

Unless I’m mistaken, in unformatted files (at least in gfortran and Intel Fortran) the record markers at the beginning and end of each record contain a 4-byte integer indicating the number of bytes enclosed in the record.

The standard does not specify how the records are separated and marked in the file, and what @RonShepard meant is that on tapes, the record lengths were not explicitly written on the tape. The separation between two records was just a gap between them. That means that to know the record length, one had to actually read the record.

Nonetheless, the capability to inquire the current record length could be added to the language, just without a guarantee that the information is available (inquire(unit=lu,recl=length) could possibly return -1).

That said, this is asking the language to provide some workarounds for poorly specified/written files, or for missing/lost documentation.

At least at one corporation, binary files with a simple repetitive structure were required to have a starting block that contained the FORMAT statements used and/or a description of the file structure. Using a self-describing format à la HDF5 and similar solutions is a good approach but requires depending on a good deal of infrastructure. Some groups require binary-to-ASCII programs to be available for any data files that are archived in long-term storage, and they store only the ASCII versions and the code for the ASCII-to-binary converter. In the past I have spent more than a few hours trying to read binary files on a current platform that were generated on one with a different word size, mantissa, and endianness. Seems like a recurring problem with multiple solutions over the years. Direct access files can suffer from this too, although if the data is all floating point it is typically relatively easy. You cannot beat simple binary files for speed, though.
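As a sketch of that header convention (the layout here is invented for illustration, not the actual format that was used), the idea is just to write enough metadata in the first record for any reader to decode the rest of the file:

integer :: lu
character(16) :: fmt = '(4es16.8)'   ! hypothetical: the FORMAT an ASCII converter would use
real(8) :: row(4)

row = [1.0d0, 2.0d0, 3.0d0, 4.0d0]
open(newunit=lu, file='archive.dat', form='unformatted')
write(lu) fmt, storage_size(row)/8, size(row)   ! header: format, bytes per value, columns
write(lu) row                                   ! data records follow
close(lu)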

Does any language address this, like using a record separator with a key to a format contained somewhere in the file? Binary files with an unknown structure have been a recurring issue for many years, but I do not see an obvious generic solution that is not much less efficient. At least there are far fewer binary formats for simple unformatted reads and writes than there used to be.

I do not expect Exdir or HDF5 or netCDF or … to be part of the Fortran standard anytime soon, so keep your FORMAT statements and unformatted data structure info close.

For a while all files had to start with a magic string, and we kept a magic file for the file(1) command that described the file type and pointed to a description of the file format. That worked very well for cases like this, where the first few bytes of a file basically let you know how the rest must be read for simple cases. Lately, with more use of out-of-house codes and utilities, that practice is not being followed very rigorously.

On magtapes, there were a plethora of differing approaches with unformatted files. For efficiency, it generally wasn’t simply one unformatted write/read per physical block on tape. For example, on CDC NOS, the I (Internal) and SI (SCOPE Internal) tape drivers would use physical blocks of 512 60-bit words, with a short block at the end of an OS 'logical record' and 'end of file'. However, with the S (Stranger) and L (Long Stranger) drivers, one could read/write physical blocks of varying sizes with a single read/write request. The interface between the Fortran program and the OS calls (RA+1 requests) to the tape driver was done via the Cyber Record Manager. Record Manager attempted to insulate the Fortran (and COBOL, etc.) programmer by supporting a bunch of different file formats, some IBM compatible.

IBM mainframers had their JCL decks specifying similar things. Same with Univac and others. The main way we’d transfer text files was with fixed size records and fixed number of records per physical tape block - either in ASCII or EBCDIC. Unformatted compatibility between differing systems was almost non-existent. So folks would write lots of custom tape reading programs to try to make sense of things. (On CDC, the FORM utility, when it wasn’t broken, combined with Cyber Record Manager, could often make sense of at least tapes which originated on IBM systems.)

Anyway, these days for disk files that need to transcend different systems, NetCDF has a bit of a learning curve but seems to work great.

It would be pretty easy for unformatted disk files. Slightly harder for formatted files - since the I/O library would have to read the entire record to find the end-of-record terminator. Then reposition back at the beginning of the record. Not sure how it would work if you were doing partial record processing via advance='no', or with interactive input.

Much like a standardized module format, I see this as another area where an effort to create a standard file format for sequential access files is needed. Would some sort of file format with a global descriptor/header at the start of the file and individual descriptors written before each record, instead of an EOR mark at the end, enable INQUIRE to return more info about the file? This is another area where we are stuck with something designed to work with slow hardware that has been obsolete for decades.
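To make the idea concrete, here is one hypothetical shape for such a format (the descriptor layout is invented here purely for illustration):

integer :: lu
real :: payload(100)

call random_number(payload)
open(newunit=lu, file='data.seq', form='unformatted')
! a descriptor record written ahead of each payload record: element count plus
! a type tag, so a reader (or an extended INQUIRE) could size or skip records
! without decoding them
write(lu) size(payload), 'r4'
write(lu) payload
close(lu)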


I just looked online, and it appears that the gap was about 3/4 inch, or 19 mm. That is a relatively large gap compared to the bit density on the tape itself, which ranged from an early 200 characters per inch to eventually 6250 characters per inch. The tape itself was 1/2 inch wide, to put that in perspective. Tape drives worked by starting and stopping the motion of the magnetic tape as it was pulled over the read/write heads. It was not possible to start and stop the tape instantly because of the mass of the moving tape and also because of the mass of the various rotating mechanical parts of the drive, so that is why the gaps were so large. The drives would suck a couple of feet of tape into a vacuum chamber positioned between the reel and the read/write heads to help mechanically isolate these start/stop operations from the rotation of the heavy tape reel itself.

This reminds me of watching TV shows in the 1960s and 1970s that had computers in them. They were usually big devices, the size of several upright refrigerators, with flashing lights on the front. But the focus was almost always on either a line printer, which was the size of a washing machine and could print a full page in one or two seconds at full speed, or on a spinning tape drive, which itself was also the size of an upright refrigerator. A tape drive might weigh 750 lbs and cost $20k at that time (more than the cost of an average house). But the tape drives in the TV shows were never actually shown reading or writing a tape, which looked like a quick series of starts and stops of the rotating reel; it was almost always a tape drive rewinding, which was a much faster continuous spinning operation, faster than, for example, a more familiar reel-to-reel audio tape. This all persisted on TV shows even into the 1980s, when the common perception of a computer switched to a CRT monitor and keyboard sitting atop a desk.

Ah, the Hollywood Electrodata/Burroughs B205s. Appeared in lots of '60s shows - Batman, Lost in Space, Time Tunnel, Voyage to the Bottom of the Sea, etc.

Expensive magtape drives used vacuum chambers on both sides of the heads to buffer the tape. Very low latency to start and stop the tape motion. Lower end drives used rollers and such, so were not so good.

You could literally hear the inefficient use of the tape drives when reading/writing excessively small physical blocks. Add some blocking of records and all became well.


Although much less prevalent, tapes are certainly not obsolete. They are still used in some domains that have to deal with gigatons of data.

On all the HPC systems I’ve used over the last 20-plus years, tapes are only used for long-term storage and backup. For an application to use any data stored on tape, the data must first be copied to a hard disk or, more recently, some form of solid-state storage device. I stand by my assertion that tapes are obsolete as a form of storage that is directly accessed from a running Fortran program. The fact that Fortran’s intrinsic I/O system is still based on tape storage is what I think needs to be addressed. My position is that Fortran needs an alternative intrinsic I/O system based on more modern concepts of how you store and access scientific data in sequential access files. While something like an intrinsic netCDF or HDF5 might be overkill, something better than the current record-based system is needed.

I would agree with this. I remember writing my own tape reading/writing programs in the early 1980s in fortran. Those codes handled various record and block sizes and translated characters between 7-bit and 8-bit ASCII, EBCDIC, and a few other character sets of that time. As the internet became more popular in the 1980s, I gradually switched over to using FTP to download, upload, and distribute data files. I also sent and received quite a few 3-1/2 inch floppy disks through the mail in the 1980s. But before that, 9-track tape was the best way to store, distribute, and exchange data. And somehow, fortran with all of its quirks at that time provided all the flexibility a programmer might need to process the data; there was little or no need to write assembler or to access low-level OS calls.

I would say that the use of 9-track tapes as an exchange medium began to decline once FTP over the internet became widely available. Since then, other tools and protocols such as curl, scp, svn, and git have become available to share and broadcast data, but I think even these days FTP is still used. But before all that, 9-track tape reels, mailed or carried by hand, were the common denominator, and fortran was sufficient to do almost anything the programmer needed with those tapes.

This was actually one of the major points of contention with the fortran 8x revision that caused that 15-year delay in the 1980s and early 1990s. Some vendors and users wanted fortran to go in the direction of C, where everything (files, fifo pipes, interprocess communication, etc.) was treated as a sequential stream of bytes, while others wanted to transform fortran i/o into a full-featured database language. What won out was somewhere in between, where new functionality was added to the underlying record-based i/o based on tape drives. Even direct-access i/o, which was based on an underlying spinning disk model, was not standardized until f77, over a decade after such devices were in common use.

I should have added some further discussion in my previous post about the use of tapes in general. Those previous comments were all about the 7-track and 9-track tapes in use in the 1970s and 1980s. Tapes in general continued to be used up through the 2000s and even to the present, but those newer devices use cartridges, not the reel-to-reel technology. I have never written a fortran program that directly accessed a cartridge tape device. I think several technologies were used in those devices, but some of the significant features are that they use helical-scan technology and tapes that continuously stream over the read/write heads, not the older start/stop record-based technology. The recording densities are higher, the capacities are larger, data compression is built into the device, and the cartridges are designed to be fetched and loaded by robotic devices, not just by human operators. So tapes have continued to be used as bulk storage and archival media, but they are no longer central as distribution and exchange media.