How to read a mixed formatted/binary file?

lars · July 1, 2023, 11:12am

I am trying to read a UNV-58b file, which consist of a mix of formatted ascii data and binary data blocks.

C     OPEN FILE LOCATED AT <FILEPATH>
      LUN = 7
      OPEN (
     & UNIT = LUN,
     & FILE = FILEPATH,
     & STATUS = 'OLD',
     & IOSTAT = STATUS_INT,
     & POSITION = 'REWIND',
     & ACTION = 'READ',
     & ENCODING = 'DEFAULT',
     & FORM = 'FORMATTED')

Then a number of fields are read (formats provided by the UNV standard):

        READ(LUN, 582, END=10, ERR=9)
     &  DATASETNUM, F02, F03, F04, F05, F06, F07, F08, F09, F010
582   FORMAT(I6, 1A1, I6, I6, I12, I12, I6, I6, I12, I12)

…some more records…

C       RECORD 11
        READ(LUN, 5811, END=10, ERR=9)
     &    F111, F112, F113, F114, F115, F116
5811    FORMAT(I10, 3I5, 2(1X, A20))

Then some binary data is read, let’s say it is 200 real(4) values:

READ(LUN, '(200A)', END=9, ERR=9) (Y(I), I = 1,200)

It works fine until a CR or LF byte is encountered, but they are just incidental (from the standard: “Note: there are no CR/LF characters embedded in or following the binary data.”)
So data in Y will only be correct up until the appearance of CR/LF byte.

How to avoid this problem?

mecej4 · July 1, 2023, 1:19pm

This is the first time that I heard about the existence of the UNV file format. The documentation appears to imply that a record-oriented file system was used when the standard was created, and the use of the word “record” is bound to cause confusion in file systems where a file is just a sequence of bytes.

The statement “there are no CR/LF characters embedded in or following the binary data” cannot be true in general. For example, the single-precision real number 0.13481537 has an internal representation of Z’3E0A0D0A’; if the file has been opened as a formatted file, those CR/LF bytes may be treated as EOL marks in formatted READ statements. To avoid such misinterpretation, you may need to read 80 bytes from the file at a time into a buffer and process the data in the buffer.

lars · July 1, 2023, 1:51pm

I interpret “there are no CR/LF characters embedded in or following the binary data” to mean that any CR/LF bytes are unintentional and should not be read as such. Since there are no forbidden numbers in the data block.

lars · July 1, 2023, 1:53pm

I have even tried reading the binary data byte by byte, but any CR or LF byte is skipped, and must be re-added to array, and then cast to a real(4) or real(8), and later to complex(8) or complex(16) if needed. But I haven’t managed to figure out (in the code) if the byte skipped was a CR or LF, so reconstruction is difficult.

mecej4 · July 1, 2023, 4:23pm

When a file is opened as a formatted file, the Fortran I/O runtime gives special treatment to CR and LF characters.

Trouble should be expected when choosing FORM=‘formatted’ to open a file and then trying to read binary data (which may contain embedded CR and LF bytes mixed with other bytes) .

urbanjost · July 1, 2023, 5:33pm

It would be interesting to see a dump of the file; but it is using “effective character variables” which although I have not seen used in a while I think are still standard. Essentially, you are reading and writing a normal formatted file except using the “A” format descriptor to print “any” or “all” variables, it’s original meaning. So that is, everything is just working with a normal formatted file by default, except that anything read with the A descriptor is treated as if the data were transferred byte per byte to a CHARACTER variable. But if it were a standard Fortran file there would be normal record structure present and the only oddity would be that there would be character strings in the files that would likely have non-printable character values.

Without seeing a dump of the file I am not sure if that file format, dating back to F66 is being followed or if record terminators or line terminators are there or not. So what compiler are you using? If it supports effective character I/O this program should echo back 10,20,30,100,200,300 twice:

program testit
! effective character variables
real,parameter :: arr(*)=[10.0,20.0,30.0]
real           :: arrin(size(arr))
! the real values are treated as if stored into
! a character variable of the same number of 
! bytes, so you still have a formatted record
! structure
write(10,'(9999a)') arr
write(10,'(9999a)') arr*10
rewind(10)
read(10,'(9999a)') arrin 
write(*,*)arrin
read(10,'(9999a)') arrin
write(*,*)arrin

write(20,'(9999a)',advance='no') arr
write(20,'(9999a)',advance='no') arr*20
rewind(20)
read(20,'(9999a)',advance='no') arrin 
write(*,*)arrin
read(20,'(9999a)',advance='no') arrin
write(*,*)arrin
end program testit

So if it does not work as first described, try using “advance=‘no’” on the I/O statements writing the “binary” data (actually, writing the effective character data). If that does not work it can be done with real stream I/O; but would need a dump of the file to figure out if you want formatted stream I/O (probably what would work for you even though rarely used) or if you need to use real stream I/O and do some parsing of your own. If these are really old files they might be using a totally different record structure than where they were generated at as well. If you do not have a command on your system like “od” a simple Fortran program that reads a file using stream I/O and writes the bytes read in as hex values would tell you what kind of record and line structure you have.

So are these newly generated CAD files (in which case the program hopefully can just generate plain 58 files instead of 58b files (all formatted versus mixed formatted/binary) and avoid this altogether; or old files that might have been generated on a Cray, DEC, HP, IBM, … machine long ago? And does your compiler support the rarely used (but still standard?) effecitve character variable I/O being used? Do you have a small file you can share? What platform are you running on now?

I think effective character I/O is still standard(?) using the A format but I almost never see it being used anymore, It is sort of like you can print anything with the A format and it acts like you had called TRANSFER() to convert it to or from a CHARACTER variable automatically.

urbanjost · July 1, 2023, 5:46pm

If you are truly reading the binary data byte by byte nothing would be skipped. Are you opening the file as a stream? Note that if it was written formatted and you read it back truly as a stream of bytes you have to handle the record structure information yourself.

lars · July 1, 2023, 6:03pm

No it is not truly byte-by-byte, but opened in formatted mode and then reading one byte at a time, with advance=‘no’.

lars · July 1, 2023, 6:06pm

unv58b_example.txt (1.6 KB)

RonShepard · July 1, 2023, 6:14pm

I think with modern fortran, this would be the right approach. You basically read the file as a sequence of bytes, and then reconstruct the integers and real numbers using TRANSFER(). The way TRANSFER() works, you will need to know if these data were written to the file in big- or little-endian conventions, and then whatever was done when creating the file, you do the corresponding reverse operations to read the file.

There might be other approaches to try too, such as with nonadvancing i/o, but from the description given above, I think I would try access='stream' first, and see how that works.

There are two types of stream access, formatted and unformatted. The best choice depends on how the file is written. Are you changing also how the file is written, or is that fixed already and you are just reading the file? If you have control of both the writing and the reading of the file, then that gives you much more flexibility and you should be able to construct a simple and easy to maintain solution. If you are reading a file with a fixed format, perhaps created on a different computer or using a different language, then you have less flexibility in your approach.

lars · July 1, 2023, 6:23pm

access=‘stream’
Makes no difference

lars · July 1, 2023, 6:27pm

They are fairly new, but can’t be considered do be re-generatable (which would solve the issue altogether)
Windows, ifort 2021.9.0

lars · July 1, 2023, 6:33pm

So in the attached example file, see printed output from the read loop, the problem occurs first at byte 361 (which should return 10), where iostat can be seen return an end-of-file (-2). Bytes 362 and 363 are correct however.

RonShepard · July 1, 2023, 9:17pm

ASCII 32 is a space character. Are you saying that character should instead be 10 (which is a line feed)?

I have no idea why either of those characters would result in a -2 iostat value.

lars · July 1, 2023, 9:27pm

Yes, instead of the actual character (10), 32 is returned, and iostat returns -2.
Byte #362 is 32, so that is fine.

I would have assumed that advance=‘no’, would ignore the LF byte, but it does not.

urbanjost · July 2, 2023, 12:26am

There was just a discussion about old programming techniques and creating resources so I wondered if anything treated the A format like it originally worked in FORTRAN 66 so I looked
for options that would cause the (VERY) old behavior when I saw none of the compilers i tried
did exactly what this assumed was the behavior.

I could not find any switches that made it act the old way.

Long ago in a far-away land the A format descriptor would read or write any type (there were not many) as a stream of bytes and read it as such too. So it was like you converted the values to characters (bytes) with the subtle feature than upon reading back with the A descriptor bytes were read back, ignoring whether they looked like a LF or CR-LF line terminator even though formatted I/O was being performed.

It made sense in that there WAS no character type and characters were always stored in something else (typically integers) , there was this thing called Holleriths, and most machines stored files in a very different manner than a stream of bytes with some bytes being “special” and
indicating line terminators as in now common.

As far as I can tell even though the “A” descriptor still works with non-character types nothing reads back a formatted sequential file like that anymore.

So one approach is to read everything as a stream into “lines” yourself. Read back the formatted data into a character variable and then read it with the format you were using on the file before; assuming the line ends with a NEW_LINE(‘A’) character. But when you need to read back “binary” data just grab that many bytes and a newline. You can do that with direct stream I/O,
in this example (which might have lots of bugs) I used a routine I already had to slurp the file into a character array to test the concept. It just reads the first thirteen lines into character variables you could then read from, but reads the “14th line” just as 210 real values. It looks like reasonable values that might be correct get read that way. In a real one you would read formatted lines into LINE and read from it, and read binary lines as a number of bytes you would calculate and then read the values from that with the A descriptor or a TRANSFER or even equivalence I suppose. Anyway, this shows a general approach I think would work that would require the minimal change from the method described in your reference.

I am thinking this file format and that documentation must go WAY back. So what was once a simple “trick” is not so simple anymore. If anyone knows of a simple way to get the old behavior I would like to know what I missed. I do remember seeing older codes that made the assumption made here but have not seen anything else in a long time that did, so maybe this behavior has been gone a long time. Did you have anything written in Fortran recently that could read the files?

code

module M_bytes
use, intrinsic :: iso_fortran_env, only : stderr=>ERROR_UNIT, stdin=>INPUT_UNIT, stdout=>OUTPUT_UNIT
implicit none
private
public :: filebyte
public :: c2s
contains

function c2s(chars) result(string)
character(len=1),intent(in) :: chars(:)
character(len=:), allocatable  :: string
integer :: i
   string=repeat(' ',size(chars))
   do i=1,size(chars)
     string(i:i)=chars(i)
   enddo
end function c2s

subroutine filebyte(filename,text)

!@(#) M_io filebyte(3f) allocate text array and read file filename into it"

character(len=*),intent(in)              :: filename    ! filename to shlep
character(len=1),allocatable,intent(out) :: text(:)     ! array to hold file
integer :: nchars=0             ! holds size of file
integer :: igetunit             ! use newunit=igetunit in f08
integer :: ios=0                ! used for I/O error status
integer :: i
integer :: icount
character(len=256)  :: message
character(len=:),allocatable :: line
   message=''
   open(newunit=igetunit, file=filename, action="read", iomsg=message,&
   &form="unformatted", access="stream",status='old',iostat=ios)
   if(ios == 0)then  ! if file was successfully opened
      inquire(unit=igetunit, size=nchars)
      if(nchars <= 0)then
         write(stderr,'(a)') '*filebyte* empty file '//filename
         return
      endif
      ! read file into text array
      if(allocated(text))deallocate(text) ! make sure text array not allocated
      allocate ( text(nchars) )           ! make enough storage to hold file
      read(igetunit,iostat=ios,iomsg=message) text      ! load input file -> text array
      if(ios /= 0)then
         write(stderr,'(a)')'*filebyte* bad read of '//filename//':'//trim(message) 
      endif
   else
      write(stderr,'(a)') '*filebyte* '//message
      allocate ( text(0) )           ! make enough storage to hold file
   endif

   close(iostat=ios,unit=igetunit)            ! close if opened successfully or not

end subroutine filebyte
end module M_bytes

program demo_filebyte
use M_bytes, only      : filebyte, c2s
implicit none
integer :: ipos=1
character(len=1),allocatable :: text(:) ! array to hold file in memory
character(len=*),parameter :: FILENAME='unv58b_example.txt' ! file to read
character(len=:),allocatable :: line
integer :: iline
integer,parameter :: iarr=210
real :: arr(iarr)

call filebyte(FILENAME,text) ! allocate character array and copy file into it
!!
if(.not.allocated(text))then
   write(*,*)'*rever* failed to load file '//FILENAME
   stop
endif
write(*,*)'BYTES=',size(text)
!!
!----------------------------------------------------
! write file reversed to stdout
!write(*,'(*(a:))',advance='no')text(size(text):1:-1)
!----------------------------------------------------
! SO INSTEAD OF THIS LOOP, YOU WOULD CALL EITHER 
! GET_NEXT_BYTES() or GET_NEXT_LINE() as appropriate
! AND DO INTERNAL READS FROM LINE INSTEAD OF READING
! FROM THE LUN.
!----------------------------------------------------
iline=1
do
   if(iline.eq.14)then
      call get_next_bytes(iarr*4,line)
      read(line,'(*(a))')arr
      write(*,*)'VALUES:',arr
   else
      call get_next_line(line)
      write(*,*)'LINE:',iline,len(line),ipos,line
   endif
   if(ipos.gt.size(text))exit
   iline=iline+1
enddo

   deallocate(text)  ! release memory
contains

subroutine get_next_line(line)
character(len=:),allocatable :: line
integer :: i
NEXTLINE: block 
      do i=ipos,size(text)
         if(text(i) == NEW_LINE('A'))then
             line = c2s(text(ipos:i-1))
             ipos=i+1
             exit NEXTLINE
         endif
      enddo
      line=c2s(text(ipos:))
      ipos=i
endblock NEXTLINE
end subroutine get_next_line

subroutine get_next_bytes(bytes,line)
integer,intent(in) :: bytes
character(len=:),allocatable :: line
   if(ipos+bytes-1.gt.size(text))stop '<ERROR> past end of file'
   line=c2s(text(ipos:ipos+bytes-1))
   ipos=ipos+bytes+1
   ipos=ipos+1  ! assume read to end of data and a line terminator follows
end subroutine get_next_bytes

end program demo_filebyte

BYTES=        1589
 LINE:           1          81          83     -1                                                                          
 LINE:           2          81         165     58b     1     2          11         420     0     0           0           0 
 LINE:           3          20         186 OA-level for xxxxxx
 LINE:           4           5         192 NONE
 LINE:           5          19         212 19-Sep-22 10:54:02
 LINE:           6          30         243 Record    1 of section "XXX" 
 LINE:           7          63         307 Tracked processing\Fixed sampling\Runup\Sections\Overall level
 LINE:           8          81         389     2         0    0         0 NONE               1   0 NONE               0   0
 LINE:           9          70         460          2       105         0  0.00000e+00  0.00000e+00  0.00000e+00
 LINE:          10          68         529         19    0    0    0 X-axis               rpm                 
 LINE:          11          68         598          1    0    0    0 40                   Nm                  
 LINE:          12          68         667          0    0    0    0 NONE                 NONE                
 LINE:          13          68         736          0    0    0    0 NONE                 NONE                
 VALUES:   860.707825       314.939606       867.283386       320.570984       878.022034       337.360718       883.781677       351.171478       890.069397       371.175171       900.991882       423.692688       907.901184       484.064331       921.356445       2018.21899       932.245972       2257.45752       941.161011       2442.19775       950.253174       2618.11499       961.920410       2821.09229       970.904785       2951.55884       979.062622       3046.41870       992.936157       3151.50586       999.022766       3179.69727       1010.41034       3219.88892       1019.90747       3249.84961       1031.69629       3291.14453       1041.29199       3328.74756       1050.18799       3368.27881       1059.35815       3415.11841       1068.06091       3460.06104       1079.90356       3524.89526       1085.96411       3559.76758       1098.23645       3628.70264       1108.24890       3683.48364       1117.12329       3733.09912       1127.28613       3787.78857       1139.84583       3857.04102       1149.38525       3906.50049       1159.94678       3961.69116       1169.59131       4012.30933       1179.80811       4067.71875       1189.90967       4123.00635       1196.87036       4162.32568       1210.03357       4236.45654       1216.65088       4275.26660       1227.23022       4334.57568       1236.71619       4389.52930       1243.43909       4427.48535       1259.89587       4524.29395       1263.11047       4542.60596       1276.25037       4617.89355       1283.07324       4654.68164       1297.00122       4729.23926       1304.23950       4765.48486       1318.62708       4839.17383       1326.10083       4876.05811       1337.38220       4934.21533       1347.63000       4987.21436       1355.13098       5026.53711       1369.02087       5101.07324       1369.17322       5102.07520       1383.46667       5179.69727       1397.91589       5257.01123       1405.34363       5295.79883       1412.71387       5333.92969       1428.01868       5410.66064       1435.97449       5448.85547       1444.77014       5489.57129       1452.86023       5525.54834       1461.70337       5562.80615       1480.33142       5632.68701       1490.00049       5665.90186       1500.34351       5699.23584       1510.79797       5731.00879       1510.88025       5731.23779       1521.47595       5761.94189       1542.94409       5819.56738       1543.08594       5820.04102       1554.15881       5848.30420       1565.60010       5876.05273       1576.67114       5902.21680       1587.78125       5927.85059       1588.20691       5928.75439       1610.17615       5976.96875       1610.41992       5977.54688       1621.39478       6001.01074       1632.23145       6023.45996       1642.50220       6044.25195       1652.49390       6064.38818       1662.44031       6083.75781       1671.64697       6101.58252       1680.54199       6118.58496       1696.86292       6148.51074       1704.27942       6161.63086       1711.53687       6174.09521       1717.95093       6184.85938       1734.16699       6211.69189       1738.36804       6218.49609       1748.10400       6231.94434       1757.94495       6228.55371       1771.95081       6162.43750       1782.47961       6123.80908       1782.56873       6123.44092       1799.77832       6051.79004       1808.71143       6008.29541       1821.77234       5934.43359       1835.02356       5850.09375       1839.39221       5819.51318       1851.78918       5723.18359       1859.98303       5652.51660       1872.04395       5538.95654       1883.42554       5418.01367    
 LINE:          15          12        1590     -1

The long line that starts with “VALUES” is the one of interest that looks to have reasonable numbers. I would use the concept demonstrated to read the mixed files; not necessarily the code I patched together to try it.

lars · July 2, 2023, 5:39am

It is not that old, I think it was published in 1993 and last revised in 1999.

urbanjost · July 2, 2023, 11:24am

Well, assuming you know the number of values in the binary sections so you can calculate the number of bytes for the binary reads reading the file as a stream and creating lines to read from appears to work with the test file. If the binary writes are always on WRITE statements by themselves and of known size and/or always just one binary section always at the end of the file that approach should work. Perhaps some F77-only compiler might work directly with the data, but on the compilers I tried even asking for F77-compatibility none would read the data ss sequential formatted without interpreting any “newline” character as an end-of-line when reading back the data; so I think a change in the documentation is required. Are there any other Fortran programs that read the mixed file with binary sections that could point to other solutions?

lars · July 2, 2023, 11:46am

They are rarely at the end of the file, since you typically have many data blocks (the example contains only one), so at least you can assume that the example block repeats 10-100 times in a file.

The number of values must be stated in the header, so that is “known”, including its format (number of bytes per value, which is 4-16). In addition the first record (the dataset ID) contains the number of bytes in the binary block, however some software, e.g. Siemens Testlab, sometimes does not write the correct number of bytes in that field.

Even if it is a widely used format, I have only encountered it in proprietary closed source software, so I’ve never seen their source code/implementation details. The standard contains only sample code for the all-ASCII variant.

lars · July 2, 2023, 11:58am

Yes, the values are correct. I will have to do some more benchmarking to determine the best option, since one of the important criteria is performance. Actual files are typically 1-10 GB (and sometimes even more than that), so I want to avoid doing unnecessary allocations.
Another option could be to re-open the file at the binary blocks and read unformatted (?), and the continue reading the next data block…

Topic		Replies	Views
Reading a binary file Help	5	1451	December 10, 2022
Binary records and reading into sliced arrays Help	8	195	February 2, 2025
Read/Write Binary files with F90-95 Help	18	6266	March 24, 2021
Reading a file with NUL values into an array	6	237	September 23, 2024
Read out Fortran unformatted file in C	21	2290	April 19, 2022

How to read a mixed formatted/binary file?

Related topics