Read out the content of a Fortran binary file based on the limited information given by its hexadecimal representation

The corresponding Fortran code snippet that reads these files appears to be located here, but I could not figure out the logic it uses.

For more details on some of these files, see the following.


$ strings kLG_1.data 
P1        
0.123 0.313 0.427 0 GP1 1 2 GP 0        (
0.123 0.313 0.427 0 -GP2 1 2 GP 0       (
0.5 0.5 0.5 1 R1 1 2 R 1                (
0.5 0.5 0.5 1 -R2 1 2 R 1               (
0 0.5 0.5 1 T1 1 2 T 1                  (
0 0.5 0.5 1 -T2 1 2 T 1                 (
0.5 0 0.5 1 U1 1 2 U 1                  (
0.5 0 0.5 1 -U2 1 2 U 1                 (
0.5 0.5 0 1 V1 1 2 V 1                  (
0.5 0.5 0 1 -V2 1 2 V 1                 (
0.5 0 0 1 X1 1 2 X 1                    (
0.5 0 0 1 -X2 1 2 X 1                   (
0 0.5 0 1 Y1 1 2 Y 1                    (
0 0.5 0 1 -Y2 1 2 Y 1                   (
0 0 0.5 1 Z1 1 2 Z 1                    (
0 0 0.5 1 -Z2 1 2 Z 1                   (
0 0 0 1 GM1 1 2 GM 1                    (
0 0 0 1 -GM2 1 2 GM 1                   (
0 0 0 1 -GM2 1 2 GM 1                   (
$ strings kLG_158.data 
P3c1      
?F]k
?F]k
?F]k
?F]k
?F]k
?F]k
?F]k
?F]k
?F]k
?F]k
?F]k
?F]k
?F]k
?F]k
?F]k
?F]k
0 0 0.5 1 A1 1 12 A 0                   (
0 0 0.5 1 A2 1 12 A 0                   (
0 0 0.5 1 A3 2 12 A -1                  (
0 0 0.5 1 -A4 1 12 A 1                  (
0 0 0.5 1 -A5 1 12 A 1                  (
0 0 0.5 1 -A6 2 12 A 1                  (
0 0 0.427 0 DT1 1 12 DT 0               (
0 0 0.427 0 DT2 1 12 DT 0               (
0 0 0.427 0 DT3 2 12 DT 0               (
0 0 0.427 0 -DT4 1 12 DT 0              (
0 0 0.427 0 -DT5 1 12 DT 0              (
0 0 0.427 0 -DT6 2 12 DT 0              (
0 0 0 1 GM1 1 12 GM 1                   (
0 0 0 1 GM2 1 12 GM 1                   (
0 0 0 1 GM3 2 12 GM 1                   (
0 0 0 1 -GM4 1 12 GM 0                  (
0 0 0 1 -GM5 1 12 GM 0                  (
0 0 0 1 -GM6 2 12 GM -1                 (
0.333333 0.333333 0.5 1 H1 1 6 H -1     (
0.333333 0.333333 0.5 1 H2 1 6 H -1     (
?F]k
?F]k
?F]k
?F]k
0.333333 0.333333 0.5 1 H3 1 6 H -1     (
?F]k
?F]k
?F]k
?F]k
0.333333 0.333333 0.5 1 -H4 1 6 H 1     (
0.333333 0.333333 0.5 1 -H5 1 6 H 1     (
?F]k
?F]k
?F]k
?F]k
0.333333 0.333333 0.5 1 -H6 1 6 H 1     (
?F]k
?F]k
?F]k
?F]k
0.333333 0.333333 0 1 K1 1 6 K 1        (
0.333333 0.333333 0 1 K2 1 6 K 1        (
?F]k
?F]k
?F]k
?F]k
0.333333 0.333333 0 1 K3 1 6 K 1        (
?F]k
?F]k
?F]k
?F]k
0.333333 0.333333 0 1 -K4 1 6 K -1      (
0.333333 0.333333 0 1 -K5 1 6 K -1      (
?F]k
?F]k
?F]k
?F]k
0.333333 0.333333 0 1 -K6 1 6 K -1      (
?F]k
?F]k
?F]k
?F]k
0.333333 0.333333 0.427 0 P1 1 6 P 0    (
0.333333 0.333333 0.427 0 P2 1 6 P 0    (
?F]k
?F]k
?F]k
?F]k
0.333333 0.333333 0.427 0 P3 1 6 P 0    (
?F]k
?F]k
?F]k
?F]k
0.333333 0.333333 0.427 0 -P4 1 6 P 0   (
0.333333 0.333333 0.427 0 -P5 1 6 P 0   (
?F]k
?F]k
?F]k
?F]k
0.333333 0.333333 0.427 0 -P6 1 6 P 0   (
?F]k
?F]k
?F]k
?F]k
0.123 0 0.427 0 D1 1 4 D 0              (
0.123 0 0.427 0 D2 1 4 D 0              (
0.123 0 0.427 0 -D3 1 4 D 0             (
0.123 0 0.427 0 -D4 1 4 D 0             (
0.5 0 0.5 1 L1 1 4 L 0                  (
0.5 0 0.5 1 L2 1 4 L 0                  (
0.5 0 0.5 1 -L3 1 4 L 1                 (
0.5 0 0.5 1 -L4 1 4 L 1                 (
0.5 0 0 1 M1 1 4 M 1                    (
0.5 0 0 1 M2 1 4 M 1                    (
0.5 0 0 1 -M3 1 4 M 0                   (
0.5 0 0 1 -M4 1 4 M 0                   (
0.123 0 0.5 0 R1 1 4 R 0                (
0.123 0 0.5 0 R2 1 4 R 0                (
0.123 0 0.5 0 -R3 1 4 R 0               (
0.123 0 0.5 0 -R4 1 4 R 0               (
0.123 0 0 0 SM1 1 4 SM 0                (
0.123 0 0 0 SM2 1 4 SM 0                (
0.123 0 0 0 -SM3 1 4 SM 0               (
0.123 0 0 0 -SM4 1 4 SM 0               (
0.5 0 0.427 0 U1 1 4 U 0                (
0.5 0 0.427 0 U2 1 4 U 0                (
0.5 0 0.427 0 -U3 1 4 U 0               (
0.5 0 0.427 0 -U4 1 4 U 0               (
0.123 0.313 0 0 B1 1 2 B 0              (
0.123 0.313 0 0 -B2 1 2 B 0             (
0.123 0.123 0.427 0 C1 1 2 C 0          (
0.123 0.123 0.427 0 -C2 1 2 C 0         (
0.123 0.313 0.5 0 E1 1 2 E 0            (
0.123 0.313 0.5 0 -E2 1 2 E 0           (
0.123 0.313 0.427 0 GP1 1 2 GP 0        (
0.123 0.313 0.427 0 -GP2 1 2 GP 0       (
0.123 0.123 0 1 LD1 1 2 LD 1            (
0.123 0.123 0 1 -LD2 1 2 LD -1          (
0.123 0.123 0.5 1 Q1 1 2 Q -1           (
0.123 0.123 0.5 1 -Q2 1 2 Q 1           (
0.123 0.123 0.5 1 -Q2 1 2 Q 1           (

Regards,
HZ

Based on the code you referred to and some understanding of the structure of unformatted files, I would say that it is an “ordinary” unformatted file with data in little-endian byte order. You can recognise the records because each record starts with a 4-byte integer that indicates how many bytes the record consists of, and ends with the same integer value. In the screenshot:
0e00 0000 indicates that 14 bytes follow. From the code you can see that this should be the integer doubnum (value: 2) and the string spacegroupsymbol (10 characters). Then you see the same value 0e00 0000 again, which closes the record.
7c00 0000 is the start of the next record, but I leave analysing that content, via the presented code, to you :)
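
To make the framing concrete, here is a minimal sketch in C that checks one record held in memory. It assumes 4-byte little-endian markers, as in this file; the helper names `le32` and `record_payload_len` are made up for illustration and are not part of the code under discussion.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a 4-byte little-endian unsigned integer (hypothetical helper). */
static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return the payload length of the record that starts at buf[0], or -1 if
   the buffer is too short or the leading and trailing markers disagree. */
static long record_payload_len(const unsigned char *buf, size_t n)
{
    if (n < 8) return -1;                       /* room for two markers? */
    uint32_t len = le32(buf);                   /* leading marker */
    if ((size_t)len > n - 8) return -1;         /* payload must fit */
    if (le32(buf + 4 + len) != len) return -1;  /* trailing marker must match */
    return (long)len;
}
```

Applied to the first 22 bytes of kLG_1.data (marker, 14 data bytes, marker), this should return 14; the next record then starts right after the trailing marker.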

@Arjen Thank you very much for your analysis and explanation. I tried to test it with the following code, but my Fortran fundamentals are poor, and I still don’t know how to read the data correctly:

! This is the kLG.f90 source file for testing.
program kLG
   implicit none

   integer  :: recl
   real, dimension(4)   :: x
   
   open(12, file='kLG_1.data',status='old',form='unformatted',access='direct',recl=36)
   read(12,rec=1)x  
   close(12)
   
   write(*, *) x   
end program 

But I got the following strange output:

$ gfortran kLG.f90
$ ./a.out 
   1.96181785E-44   2.80259693E-45   1.35688433E-19   1.35631564E-19

Any more hints for solving this problem?

Regards,
HZ

The problem is not really related to the files or the programs. It is your (unfounded) expectations of what Fortran unformatted files are, and how they are to be used. I urge you to obtain and read good textbooks and manuals regarding the subject, and work through several example programs in which you write and read unformatted Fortran files containing small amounts of data.

For instance, open an unformatted file for writing, write 9 integer values to it, and examine the bytes contained in the file using Emacs, a hex-file dumper, etc. Next, write a program to read that file and see if the data are read back correctly.
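
If it helps to see what to expect in the hex dump before doing that exercise, here is a rough C sketch that builds the bytes of one sequential record in memory. It assumes the 4-byte little-endian length markers seen in this thread (other compilers or options may use a different layout), and `put_le32` and `frame_record` are illustrative names only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value as 4 little-endian bytes. */
static void put_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)((v >> 8) & 0xff);
    p[2] = (unsigned char)((v >> 16) & 0xff);
    p[3] = (unsigned char)((v >> 24) & 0xff);
}

/* Write marker + payload + marker into out; return the total number of
   bytes written, or 0 if out (capacity cap) is too small. */
static size_t frame_record(const void *payload, uint32_t len,
                           unsigned char *out, size_t cap)
{
    if (cap < 8 || len > cap - 8) return 0;
    put_le32(out, len);              /* leading length marker */
    memcpy(out + 4, payload, len);   /* data in machine representation */
    put_le32(out + 4 + len, len);    /* trailing length marker */
    return (size_t)len + 8;
}
```

For nine 4-byte integers the payload is 36 bytes, so the record occupies 44 bytes and begins with 24 00 00 00 (hex 24 is 36).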

What I see as your stumbling block: not recognizing that unformatted Fortran files contain (i) the data, usually in the internal machine representation, and (ii) metadata, i.e., information regarding record size, etc., that is written by the Fortran I/O system and later used to read the records correctly. The data representation usually depends on the CPU architecture and OS. The metadata may vary from compiler to compiler, even on a given CPU and OS, but many compilers on Linux and Windows use a common format for the metadata.

You have to know how to distinguish between data bytes and metadata bytes. If you do not know exactly what information was written into an unformatted file, you are probably not going to be able to read that file.

There is strong evidence in the file ‘kLG_1.data’ that it is an unformatted sequential file, and not a direct access file. The file starts with records of byte lengths 14, 124, 124, 8, 40, followed by 18 groups of byte lengths (8, 4, 16, 8, 4, 16, 40), i.e., a total of 131 records.

@mecej4 is quite right: an unformatted sequential file (note that this is different from an unformatted direct-access file, which is what you assumed, given the keywords in the open statement) is a combination of data and metadata. For most programmers, only the data are of interest; the metadata, i.e. the record information, is handled internally by the read/write statements.

That said, these files contain no information per se about what data they actually hold. You need to have access to a description or, as in this case, the actual source code, to make sense of the contents. (This is not unique to Fortran, by the way. Self-describing data files, such as netCDF or SQLite databases, exist by the grace of conventions enforced by the libraries that read or write them, not by something inherent in the programming language, or at least not in languages like C or Fortran).

As the question was to identify what the file contains and nothing was apparently known about the origin, I examined the individual bytes and recognised the structure.

I am not familiar with Emacs (no 8 or 9 present?). I am trying to understand the metadata, but I cannot get 14 from 0e00. If reversed, hex 0000 00e0 as a 4-byte integer is 224? (Does a leading or trailing 0 imply a multiple of 16?)
I am also struggling to find consistent Fortran unformatted sequential file metadata structures.
This does not look like an unformatted Fortran sequential file.

Without the source of the (presumably) Fortran code that generated the file, including the I/O lists and declarations that describe both the type and kind of the variables written, there is limited definite information to be gained.

Also, Fortran unformatted direct access files do not include metadata, so the information to be gained is even more limited unless some associated information about the record size and I/O lists is provided.

  1. What are your clues to this conclusion?
  2. I have many files like this, which should include a similar storage structure, but with a different number of record entries. I want to automatically determine the data and metadata information, and extract the data of all these files in batches.

Regards,
HZ

The following document should provide the missing information you mentioned above:

Here is a C program to read unformatted files. The 40-byte records appear to be character strings, the 16 byte records contain pairs of double precision reals, and the other records contain one or two 4-byte integers. Note that I said “appear to”. When I see the 8 bytes 00 00 00 00 00 00 F0 3F, I recognize that as double precision 1.0. Similarly for other integers and reals.

/* Read Fortran unformatted file with record markers containing byte sizes */
#include <stdio.h>
int main(int argc, char *argv[]){
int recn,recl,nin; char buf[0x1000]; long offset;
FILE *fil=fopen(argv[1],"rb");
recn=0; offset = 0;
do{
   nin = fread(&recl,4,1,fil); // prefix marker
   if(nin < 1)break;
   offset+=4;
   if(recl < 0 || recl > 0x1000){ // test the record length, not fread's return value
      fprintf(stderr,"Buffer is too small, need more than 4096 bytes\n");
      exit(1);
      }
   nin=fread(buf,1,recl,fil);
   fread(&recl,4,1,fil); // postfix marker
   recn++; printf("%4d %8d %12ld\n",recn,nin,offset);
   offset+=nin+4;
   }while(1);
fclose(fil);
}


I added the #include <stdlib.h> line as follows:

/* Read Fortran unformatted file with record markers containing byte sizes */
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]){
int recn,recl,nin; char buf[0x1000]; long offset;
FILE *fil=fopen(argv[1],"rb");
recn=0; offset = 0;
do{
   nin = fread(&recl,4,1,fil); // prefix marker
   if(nin < 1)break;
   offset+=4;
   if(recl < 0 || recl > 0x1000){ // test the record length, not fread's return value
      fprintf(stderr,"Buffer is too small, need more than 4096 bytes\n");
      exit(1);
      }
   nin=fread(buf,1,recl,fil);
   fread(&recl,4,1,fil); // postfix marker
   recn++; printf("%4d %8d %12ld\n",recn,nin,offset);
   offset+=nin+4;
   }while(1);
fclose(fil);
}

See the following for the testing result:

$ gcc kLG.c
$ ./a.out kLG_1.data 
   1       14            4
   2      124           26
   3      124          158
   4        8          290
   5       40          306
   6        8          354
   7        4          370
   8       16          382
   9        8          406
  10        4          422
  11       16          434
  12       40          458
  13        8          506
  14        4          522
  15       16          534
  16        8          558
  17        4          574
  18       16          586
  19       40          610
  20        8          658
  21        4          674
  22       16          686
  23        8          710
  24        4          726
  25       16          738
  26       40          762
  27        8          810
  28        4          826
  29       16          838
  30        8          862
  31        4          878
  32       16          890
  33       40          914
  34        8          962
  35        4          978
  36       16          990
  37        8         1014
  38        4         1030
  39       16         1042
  40       40         1066
  41        8         1114
  42        4         1130
  43       16         1142
  44        8         1166
  45        4         1182
  46       16         1194
  47       40         1218
  48        8         1266
  49        4         1282
  50       16         1294
  51        8         1318
  52        4         1334
  53       16         1346
  54       40         1370
  55        8         1418
  56        4         1434
  57       16         1446
  58        8         1470
  59        4         1486
  60       16         1498
  61       40         1522
  62        8         1570
  63        4         1586
  64       16         1598
  65        8         1622
  66        4         1638
  67       16         1650
  68       40         1674
  69        8         1722
  70        4         1738
  71       16         1750
  72        8         1774
  73        4         1790
  74       16         1802
  75       40         1826
  76        8         1874
  77        4         1890
  78       16         1902
  79        8         1926
  80        4         1942
  81       16         1954
  82       40         1978
  83        8         2026
  84        4         2042
  85       16         2054
  86        8         2078
  87        4         2094
  88       16         2106
  89       40         2130
  90        8         2178
  91        4         2194
  92       16         2206
  93        8         2230
  94        4         2246
  95       16         2258
  96       40         2282
  97        8         2330
  98        4         2346
  99       16         2358
 100        8         2382
 101        4         2398
 102       16         2410
 103       40         2434
 104        8         2482
 105        4         2498
 106       16         2510
 107        8         2534
 108        4         2550
 109       16         2562
 110       40         2586
 111        8         2634
 112        4         2650
 113       16         2662
 114        8         2686
 115        4         2702
 116       16         2714
 117       40         2738
 118        8         2786
 119        4         2802
 120       16         2814
 121        8         2838
 122        4         2854
 123       16         2866
 124       40         2890
 125        8         2938
 126        4         2954
 127       16         2966
 128        8         2990
 129        4         3006
 130       16         3018
 131       40         3042

It really confirms the previous conclusion you made:

But I still can’t figure out why each data record is stored as 7 fields with the byte-length sequence (8, 4, 16, 8, 4, 16, 40). As you can see below, each record has 9 fields, with a headline denoting the space group name:

$ strings kLG_1.data | grep -v '^\?' | sed -re 's/\($//' | awk '!a[$0]++ {print $0, NF,NR}'
P1         1 1
0.123 0.313 0.427 0 GP1 1 2 GP 0         9 2
0.123 0.313 0.427 0 -GP2 1 2 GP 0        9 3
0.5 0.5 0.5 1 R1 1 2 R 1                 9 4
0.5 0.5 0.5 1 -R2 1 2 R 1                9 5
0 0.5 0.5 1 T1 1 2 T 1                   9 6
0 0.5 0.5 1 -T2 1 2 T 1                  9 7
0.5 0 0.5 1 U1 1 2 U 1                   9 8
0.5 0 0.5 1 -U2 1 2 U 1                  9 9
0.5 0.5 0 1 V1 1 2 V 1                   9 10
0.5 0.5 0 1 -V2 1 2 V 1                  9 11
0.5 0 0 1 X1 1 2 X 1                     9 12
0.5 0 0 1 -X2 1 2 X 1                    9 13
0 0.5 0 1 Y1 1 2 Y 1                     9 14
0 0.5 0 1 -Y2 1 2 Y 1                    9 15
0 0 0.5 1 Z1 1 2 Z 1                     9 16
0 0 0.5 1 -Z2 1 2 Z 1                    9 17
0 0 0 1 GM1 1 2 GM 1                     9 18
0 0 0 1 -GM2 1 2 GM 1                    9 19

OTOH, I also checked another such file with your code above, but it reports a very large number of records, which should not really be the case:

$ ./a.out ~/Public/repo/github.com/zjwang11/irvsp.git/IRVSPDATA/kLittleGroups/kLG_68.data 
   1       14            4
   2      124           26
   3      124          158
   4      124          290
   5      124          422
   6      124          554
   7      124          686
   8      124          818
   9      124          950
  10      124         1082
  11      124         1214
  12      124         1346
  13      124         1478
  14      124         1610
  15      124         1742
  16      124         1874
  17      124         2006
  18        8         2138
  19       40         2154
  20        8         2202
  21        4         2218
  22       16         2230
  23        8         2254
  24        4         2270
  25       16         2282
[...]
3020        8        58334
3021        8        58350
3022        8        58366
3023        8        58382
3024        8        58398
3025        4        58414
3026       16        58426
3027        8        58450
3028        8        58466
3029        8        58482
3030        8        58498
3031        8        58514
3032        8        58530
3033        8        58546
3034       40        58562
3035        8        58610
3036        4        58626
3037       16        58638
3038        8        58662
3039        8        58678
3040        8        58694
3041        8        58710
3042        8        58726
3043        8        58742
3044        8        58758
3045        8        58774
3046        4        58790
3047       16        58802
3048        8        58826
3049        8        58842
3050        8        58858
3051        8        58874
3052        8        58890
3053        8        58906
3054        8        58922
3055       40        58938

As I am not familiar with Emacs’ hex display, I wrote a Fortran program using stream I/O and reproduced the record structure that mecej4 reported.

   use bucket_info
   call open_stream_file

   num_rec = 0
   do
     num_rec = num_rec+1
     call get_next_record
     call display_this_record
   end do
   end

The program terminates after 131 records with an end of file or unexpected record size.

Without the I/O list that generated the file, we can recover a series of bytes for each record and then have to guess the grouping of bytes into integer, real or character types of unknown kinds.
mecej4 can fortunately recognise the byte pattern for 1.0d0, and some character strings are recognisable (especially runs of hex 20 blanks), but that is a long way from the certainty provided by the I/O list that generated the file.
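
For anyone who wants to test such a guess rather than recognise the pattern by eye, here is a small C sketch that reinterprets 8 little-endian bytes as a double. It assumes the machine's `double` is IEEE 754 binary64, which is the usual case but still an assumption, and `le_bytes_to_double` is just an illustrative name.

```c
#include <stdint.h>
#include <string.h>

/* Interpret 8 little-endian bytes as an IEEE 754 binary64 value. */
static double le_bytes_to_double(const unsigned char *p)
{
    uint64_t bits = 0;
    /* p[7] is the most significant byte, so fold it in first. */
    for (int i = 7; i >= 0; i--)
        bits = (bits << 8) | p[i];
    double d;
    memcpy(&d, &bits, sizeof d);   /* copy the bit pattern, no conversion */
    return d;
}
```

Feeding it the bytes 00 00 00 00 00 00 F0 3F gives 1.0. A plausible value only supports the guess about the type; it does not prove it.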

Text files are a lot more portable!

The attached file contains both a buffer read and an attempt at a Fortran record read.

binary_read.f90 (3.5 KB)

Sorry, this line of reasoning / method of extracting information makes no sense at all. The Linux/Unix strings utility displays printable characters when they occur consecutively in clumps of 4 or more. Everything else in the file is ignored! When a Fortran unformatted file is written, most of the bytes in the file correspond to “unprintable” characters, so by ignoring them you are discarding almost all the numerical information (integers, reals, etc.) and carrying on as if what is left is all that is important.

For instance, for the data file kLG_1.data, your application of strings and sed discards 90 percent of the content of the file.
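
To check this kind of figure yourself, the following C sketch counts how many bytes of a buffer would survive a strings-like filter. It is only an approximation of the real utility: it assumes the C locale and the default minimum run of 4, and treats exactly the `isprint` characters as printable. The function name is made up.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the bytes that lie inside runs of at least min_run printable
   characters; everything else is what a strings-like filter drops. */
static size_t bytes_kept_by_strings(const unsigned char *buf, size_t n,
                                    size_t min_run)
{
    size_t kept = 0, run = 0;
    for (size_t i = 0; i < n; i++) {
        if (isprint(buf[i])) {
            run++;                       /* extend the current run */
        } else {
            if (run >= min_run) kept += run;
            run = 0;                     /* unprintable byte ends the run */
        }
    }
    if (run >= min_run) kept += run;     /* run that reaches the end */
    return kept;
}
```

On the 22-byte first record of kLG_1.data (two markers, a 4-byte integer and the 10-character symbol) it keeps only the 10 characters of the symbol.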

The question is, do you have the Fortran program sources that produced every one of the unformatted data files that are of interest to you? Or, at least complete and precise documentation of the structure and content of the files? If not, you are simply inviting us to go with you for a walk in a minefield while wearing blindfolds.

All I know is the code snippet below that reads these files (in this case, the unit specifier 11 is used):

Reading the source file nonsymm.f90, in particular the READ(11) statements and the declarations of the variables in the IOLists of those READ statements, answers all the questions that remained unanswered earlier.

The input file is a Fortran unformatted sequential file. Here are the READ statements that use that file.


nonsymm.f90:207:      read(11) DoubNum, spacegroupsymbol
nonsymm.f90:248:         read(11) SymElemR(:,:,i),SymElemt(:,i),Df(:,:)
nonsymm.f90:347:      read(11)  Numk,tnir
nonsymm.f90:350:        read(11) ListIrrep
            371:     irk: DO J=1,DoubNum
nonsymm.f90:372:            read(11) itmp,itmp
            373:            IF(itmp==1) THEN
nonsymm.f90:376:               read(11) itmp;
            378:               IF(itmp==1) THEN         
nonsymm.f90:379:                  read(11) abcde(1:2)
            380:               ELSEIF(itmp==2) THEN
nonsymm.f90:381:                  read(11) abcde(1:5)

The number of records read, as well as the sizes of the records, depend on the input data rather than being fixed, because the DO loops have loop counts that depend on input data and some of the READs are located inside the IF…THEN…ELSEIF…ENDIF.
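
As an illustration of how an I/O list maps onto the bytes, here is a C sketch that decodes the payload of the first record, the one read by `read(11) DoubNum, spacegroupsymbol`. It assumes a 4-byte little-endian integer followed by a 10-character string, which matches the 14-byte record length seen earlier but is not taken from the declarations; `struct klg_header` and `parse_header` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

struct klg_header {        /* hypothetical container for the first record */
    int32_t doubnum;       /* assumed to be a 4-byte integer */
    char symbol[11];       /* 10 characters plus terminating NUL */
};

/* payload points at the data bytes between the two record markers.
   Returns 0 on success, -1 if the length is not the expected 14 bytes. */
static int parse_header(const unsigned char *payload, size_t len,
                        struct klg_header *h)
{
    if (len != 14) return -1;
    h->doubnum = (int32_t)((uint32_t)payload[0] |
                           ((uint32_t)payload[1] << 8) |
                           ((uint32_t)payload[2] << 16) |
                           ((uint32_t)payload[3] << 24));
    memcpy(h->symbol, payload + 4, 10);
    h->symbol[10] = '\0';
    /* Fortran pads character variables with blanks; trim them. */
    for (int i = 9; i >= 0 && h->symbol[i] == ' '; i--)
        h->symbol[i] = '\0';
    return 0;
}
```

For kLG_1.data this should give doubnum = 2 and the symbol "P1". The later records need the same treatment, one decoder per READ statement, following the DO and IF logic in the source.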

As I did earlier, I recommend that you read documentation to understand the rules that govern Fortran unformatted I/O. They do not have to be mysteries.

Got it. Thank you very much. The following are the code snippet lines I extracted, which correspond to the relevant data extraction logic you gave above:

werner@X10DAi-00:~/Public/repo/github.com/zjwang11/irvsp.git$ rg  '^[ ]*(irk[:]|IF\(itmp|read\((bb|11)\))'
lib_irrep_bcs/lib_bilbao.f90
64:    read(bb) num_doub_sym, symbol_sg
90:        read(bb) rot_bilbao(:,:,i), tau_bilbao(:,i), Df(:,:)
135:    read(11)  Numk,tnir
138:      read(11) ListIrrep
159:      irk:DO j=1,num_doub_sym
160:          read(11) itmp,itmp
161:          IF(itmp==1) THEN
164:             read(11) itmp;labels(2,j,iir,ikt)=itmp
166:             IF(itmp==1) THEN
167:                read(11) abcde(1:2)
169:                read(11) abcde(1:5)

src_irvsp_v2/nonsymm.f90
207:      read(11) DoubNum, spacegroupsymbol
275:         read(11) SymElemR(:,:,i),SymElemt(:,i),Df(:,:)
374:      read(11)  Numk,tnir
377:        read(11) ListIrrep
398:        irk:DO j=1,Doubnum
399:            read(11) itmp,itmp
400:            IF(itmp==1) THEN
403:               read(11) itmp;labels(2,j,iir,ikt)=itmp
405:               IF(itmp==1) THEN
406:                  read(11) abcde(1:2)
408:                  read(11) abcde(1:5)

No, the hexadecimal string “0e00 0000” consists of 4 bytes: 0e, 00, 00 and 00. The byte ordering is little-endian, so the first byte, 0e, is the least significant one, and that is 14 (decimal).
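
A short C sketch may make the difference between the two byte orders explicit. The function names are illustrative; nothing here depends on the machine the code runs on, because the bytes are combined by hand.

```c
#include <stdint.h>

/* First byte is least significant (little-endian). */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* First byte is most significant (big-endian). */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

For the bytes 0e 00 00 00, `read_le32` gives 14 while `read_be32` gives 234881024 (hex 0E000000). Note that only the order of whole bytes changes; the two hex digits inside a byte are never swapped, which is why 0e never becomes e0.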

Thanks for clarifying the order.

Based on the byte order and the hex-digit order within each byte, it might be better for a binary file display, such as Emacs, to show the byte values from right to left rather than from left to right, so that we read all hex values from right to left in 4-bit order.
Is this done with any binary file displays ?

I very much doubt that, for the simple reason that it would break (ASCII) strings and 16-byte integers, not to mention UTF-8 strings. You would really need to know which bytes make up a single item.

The ambiguity as to order often arises when bits are grouped into chunks that are more than 8 bits long, and we do not know where one data item within a record ends (and the next data item within the same record begins, if there is such an item). I, too, find it confusing when the display shows groups of more than two hex-characters at a time.

Here are the displays from two Unix/Linux utilities for the same data file.

T:\>xxd kLG_1.data | head
00000000: 0e00 0000 0200 0000 5031 2020 2020 2020  ........P1
T:\>od -t x2 kLG_1.data | head
0000000 000e 0000 0002 0000 3150 2020 2020 2020
T:\>od -t x1 kLG_1.data | head
0000000 0e 00 00 00 02 00 00 00 50 31 20 20 20 20 20 20
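
The difference between the xxd line and the od -t x2 line above comes from grouping. The following C sketch, with an illustrative function name, combines two consecutive bytes into a 16-bit word the way a little-endian machine would, which appears to be what od does for 2-byte units here.

```c
#include <stdint.h>

/* Combine two consecutive file bytes into a 16-bit word, treating the
   first byte as the least significant one (little-endian). */
static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}
```

The file bytes 0e 00 become the word 000e, and 50 31 (the characters "P1") become 3150, matching the od -t x2 output. xxd only puts a space after every second byte and leaves the file order alone, so it shows 0e00 and 5031.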

I will modify my binary_display program to display from right to left so the hex values are in order.
We can then see if this breaks the mold, or we can read like other languages?

I am suggesting this because the Emacs display in the OP has “0e00” at the start of the first line, but I think it would read better if it were “000e” at the right end of the first line, filling to the left from there.