The Fortran binary file is located here. I obtained the following hexadecimal representation of it in Emacs using M-x hexl-find-file:
So, I want to know if it’s possible to read out the content of this Fortran binary file based on the limited information given by its hexadecimal representation, as shown above.
Reliably, no. There are ways to detect patterns in the file and guess whether it is a sequential or a direct access file, whether it is likely to contain strings, and what the word size and endianness of the data probably were (by assuming a likely range for the values), but this is neither trivial nor foolproof. Knowing what platform the file was written on, what compiler was used, the vintage of the file, … almost anything can help, but the short answer is no. Even attempting to guess is a non-trivial and somewhat unreliable process unless the file was written on one of a few very specific platforms, almost all of which have not existed for a long time.
If you have the format statements that were used to write it, it becomes far more plausible, especially if the file was written with just one format, or if you know the file is nothing but floating point values, etc. Do you know nothing about the file except that it is a “Fortran” file? Some developers were thoughtful (or required to be) and actually wrote the FORMAT statements used as strings into the files, but that is very rare in general. If you are on GNU/Linux or another Unix-like system, try running the “strings” command on the file and see whether it contains any strings that tell you something about its origin or type.
Although it is an interesting puzzle, unless the format is exceedingly simple (such as, as mentioned, a stream of float values, hopefully of a known size and endianness) you would need a hugely compelling reason to even try; and even if you succeeded, the result would likely just be a bunch of numbers. Without knowing what the code did with those numbers, what value would they have?
With the code you referred to and some understanding of the structure of unformatted files, I would say that it is an “ordinary” unformatted file with data in little-endian byte order. You can recognise the records because each record starts with a 4-byte integer that gives the number of bytes in the record. The record ends with the same integer value. In the screenshot:
0e00 0000 indicates that 14 bytes follow. From the code you can see that these should be the integer doubnum (value: 2) and the string spacegroupsymbol (10 characters). Then you see the same value 0e00 0000 again, which closes the record.
7c00 0000 (124 bytes) is the start of the next record, but I leave analysing that content, via the presented code, to you.
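The framing described above (length marker, 14 data bytes, same length marker) can be sketched in C. This is only an illustration under the assumptions stated in the post: 4-byte little-endian markers, a 4-byte integer followed by a 10-character string. The struct and the function names are hypothetical and do not come from nonsymm.f90.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the first record: a 4-byte integer followed
   by a 10-character string, as described in the post above. */
struct first_record {
    int32_t doubnum;
    char    spacegroupsymbol[11];   /* 10 characters + terminating NUL */
};

/* Decode a 4-byte little-endian integer from raw bytes. */
static int32_t le32(const unsigned char *p)
{
    return (int32_t)((uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                     ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24));
}

/* Parse one framed record (marker, 14 data bytes, marker) from buf.
   Returns 0 on success, -1 if the buffer is too short or the two
   markers are not both 14. */
static int parse_first_record(const unsigned char *buf, size_t len,
                              struct first_record *out)
{
    if (len < 4 + 14 + 4) return -1;
    if (le32(buf) != 14 || le32(buf + 18) != 14) return -1;
    out->doubnum = le32(buf + 4);                 /* bytes 4..7   */
    memcpy(out->spacegroupsymbol, buf + 8, 10);   /* bytes 8..17  */
    out->spacegroupsymbol[10] = '\0';
    return 0;
}
```

Calling `parse_first_record` on the first 22 bytes of the file should give `doubnum == 2` if the interpretation above is right. A mismatch between the two markers is a strong hint that the file is not framed this way.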
@Arjen Thank you very much for your analysis and explanation. I tried to test it with the following code, but my Fortran fundamentals are poor, and I still don’t know how to read the data correctly:
! This is the kLG.f90 source file for testing.
program kLG
implicit none
integer :: recl
real, dimension(4) :: x
open(12, file='kLG_1.data',status='old',form='unformatted',access='direct',recl=36)
read(12,rec=1)x
close(12)
write(*, *) x
end program
The problem is not really related to the files or the programs. It is your (unfounded) expectations of what Fortran unformatted files are, and how they are to be used. I urge you to obtain and read good textbooks and manuals regarding the subject, and work through several example programs in which you write and read unformatted Fortran files containing small amounts of data.
For instance, open an unformatted file for writing, write 9 integer values to it, and examine the bytes contained in the file using Emacs, a hex-file dumper, etc. Next, write a program to read that file and see if the data are read back correctly.
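To get a feel for what such an experiment should show in the hex dump, here is a rough C sketch that builds the bytes of one sequential unformatted record holding a list of 4-byte integers. It assumes 4-byte little-endian record markers, which is what many compilers on Linux and Windows use by default, but it is not guaranteed for every compiler. The function names are made up for this example.

```c
#include <stdint.h>
#include <stddef.h>

/* Store a 32-bit value as 4 bytes, least significant byte first. */
static void put_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)((v >> 8) & 0xff);
    p[2] = (unsigned char)((v >> 16) & 0xff);
    p[3] = (unsigned char)((v >> 24) & 0xff);
}

/* Build the bytes of one record: leading marker, n 4-byte integers,
   trailing marker. Returns the number of bytes written into out,
   or 0 if out (capacity cap) is too small. */
static size_t frame_int_record(const int32_t *vals, size_t n,
                               unsigned char *out, size_t cap)
{
    size_t nbytes = n * 4;
    if (cap < nbytes + 8) return 0;
    put_le32(out, (uint32_t)nbytes);                 /* leading marker  */
    for (size_t i = 0; i < n; i++)
        put_le32(out + 4 + 4 * i, (uint32_t)vals[i]);
    put_le32(out + 4 + nbytes, (uint32_t)nbytes);    /* trailing marker */
    return nbytes + 8;
}
```

For 9 integers this gives 44 bytes: a marker with value 36 (hex 24), 36 data bytes, and the marker again. Under the stated assumptions, that is the pattern to look for when examining a file written by a single Fortran WRITE of 9 default integers.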
What I see as your stumbling block: not recognizing that unformatted Fortran files contain (i) the data, usually in the internal machine representation, and (ii) metadata, i.e., information regarding record size, etc., that is written by the Fortran I/O system and later used to read the records correctly. The data representation usually depends on the CPU architecture and OS. The metadata may vary from compiler to compiler, even on a given CPU and OS, but many compilers on Linux and Windows use a common format for the metadata.
You have to know how to distinguish between data bytes and metadata bytes. If you do not know exactly what information was written into an unformatted file, you are probably not going to be able to read that file.
There is strong evidence in the file ‘kLG_1.data’ that it is an unformatted sequential file, and not a direct access file. The file starts with records of byte lengths 14, 124, 124, 8, 40, followed by 18 groups of byte lengths (8, 4, 16, 8, 4, 16, 40), i.e., a total of 131 records.
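As a quick sanity check of that layout, the record count and the implied file size can be tallied with a few lines of C. This is only arithmetic on the numbers reported above, assuming a 4-byte marker before and after every record; the struct and function names are invented for the example.

```c
#include <stddef.h>

struct totals {
    size_t records;     /* number of records                 */
    size_t data_bytes;  /* sum of the record payload lengths */
    size_t file_bytes;  /* payload plus two markers/record   */
};

/* Tally a layout made of nhead leading records followed by a group of
   ngroup record lengths repeated `repeats` times. */
static struct totals layout_totals(const size_t *head, size_t nhead,
                                   const size_t *group, size_t ngroup,
                                   size_t repeats, size_t marker_size)
{
    struct totals t = {0, 0, 0};
    size_t group_bytes = 0;
    for (size_t i = 0; i < nhead; i++) t.data_bytes += head[i];
    for (size_t i = 0; i < ngroup; i++) group_bytes += group[i];
    t.data_bytes += repeats * group_bytes;
    t.records = nhead + repeats * ngroup;
    t.file_bytes = t.data_bytes + 2 * marker_size * t.records;
    return t;
}
```

With the leading lengths 14, 124, 124, 8, 40 and the group (8, 4, 16, 8, 4, 16, 40) repeated 18 times, this gives 131 records, 2038 data bytes and, with 4-byte markers, a file size of 3086 bytes. Comparing that last number with the actual size of the file is a cheap way to confirm or refute the layout.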
@mecej4 is quite right: an unformatted sequential file (note that this is different from an unformatted direct-access file, as you assume it is, given the keywords in the open statement) is a combination of data and metadata. For most programmers, only the data are of interest - the metadata, i.e. record information, is handled via the read/write statements internally.
That said, these files contain no information per se about what data they actually hold. You need to have access to a description or, as in this case, the actual source code, to make sense of the contents. (This is not unique to Fortran, by the way. Self-describing data files, such as netCDF or SQLite databases, exist by the grace of conventions enforced by the libraries that read or write them, not by something inherent in the programming language, or at least not in languages like C or Fortran).
As the question was to identify what the file contains and nothing was apparently known about its origin, I examined the individual bytes and recognised the structure.
I am not familiar with Emacs (no 8 or 9 present?). I am trying to understand the metadata, but I cannot get 14 from 0e00. If reversed, hex 0000 00e0 as a 4-byte integer is 224? (Does a leading or trailing 0 imply a multiple of 16?)
I also am struggling to find consistent Fortran unformatted sequential file metadata structures.
This does not look like an unformatted Fortran sequential file.
Without the source of the (presumably) Fortran code that generated the file, including the I/O lists and declarations that describe both the type and kind of the variables written, there is limited definite information to be gained.
Also, Fortran unformatted direct access files do not include metadata, so the information to be gained is even more limited, unless some associated information about the record size and I/O lists is provided.
I have many files like this, which should include a similar storage structure, but with a different number of record entries. I want to automatically determine the data and metadata information, and extract the data of all these files in batches.
Here is a C program to read unformatted files. The 40-byte records appear to be character strings, the 16 byte records contain pairs of double precision reals, and the other records contain one or two 4-byte integers. Note that I said “appear to”. When I see the 8 bytes 00 00 00 00 00 00 F0 3F, I recognize that as double precision 1.0. Similarly for other integers and reals.
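/* Read Fortran unformatted file with record markers containing byte sizes */
/* Assumes 4-byte markers in the byte order of the machine running this program */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
int main(int argc, char *argv[]){
   int recn = 0; int32_t recl, recl2; size_t nin; char buf[0x1000]; long offset = 0;
   FILE *fil;
   if(argc < 2){ fprintf(stderr,"Usage: %s file\n",argv[0]); return 1; }
   fil = fopen(argv[1],"rb");
   if(fil == NULL){ perror(argv[1]); return 1; }
   do{
      nin = fread(&recl,4,1,fil);       // prefix marker: record length in bytes
      if(nin < 1)break;                 // end of file
      offset += 4;                      // offset of the record data in the file
      if(recl < 0 || recl > 0x1000){    // test the record length, not the fread count
         fprintf(stderr,"Buffer is too small, need more than 4096 bytes\n");
         exit(1);
      }
      nin = fread(buf,1,(size_t)recl,fil);
      // postfix marker must be present and equal to the prefix marker
      if(nin != (size_t)recl || fread(&recl2,4,1,fil) < 1 || recl2 != recl){
         fprintf(stderr,"Record %d is truncated or its markers do not match\n",recn+1);
         exit(1);
      }
      recn++; printf("%4d %8d %12ld\n",recn,(int)nin,offset);
      offset += (long)nin + 4;
   }while(1);
   fclose(fil);
   return 0;
}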
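The remark above about recognising 00 00 00 00 00 00 F0 3F as double precision 1.0 can be checked with a small helper. This is a sketch that assumes the bytes are an IEEE 754 binary64 value stored least significant byte first, and that the machine running the code also uses IEEE 754 doubles; the function name is made up.

```c
#include <stdint.h>
#include <string.h>

/* Reassemble 8 little-endian bytes into a 64-bit pattern, then
   reinterpret that pattern as a double. */
static double le_bytes_to_double(const unsigned char b[8])
{
    uint64_t bits = 0;
    double d;
    for (int i = 7; i >= 0; i--)
        bits = (bits << 8) | b[i];      /* b[7] ends up most significant */
    memcpy(&d, &bits, sizeof d);        /* avoids pointer-cast aliasing  */
    return d;
}
```

The bytes 00 00 00 00 00 00 F0 3F assemble to the pattern 0x3FF0000000000000, which is 1.0. Applying the same helper to each half of a 16-byte record is one way to test the guess that those records hold pairs of double precision reals, though a plausible-looking value is still not proof.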
It really confirms the previous conclusion you made:
But I still can’t figure out why each data record has been stored in 7 fields with the byte-length sequence (8, 4, 16, 8, 4, 16, 40). As you can see below, each record has 9 fields with a headline denoting the space group name:
As I am not familiar with Emacs’ hex dump output, I wrote a Fortran program using stream I/O and reproduced the record structure that mecej4 reported.
use bucket_info
call open_stream_file
num_rec = 0
do
num_rec = num_rec+1
call get_next_record
call display_this_record
end do
end
The program terminates after 131 records with an end of file or unexpected record size.
Without the I/O list that generated the file, we can recover a series of bytes for each record and then have to guess the grouping of bytes into integer, real or character types of unknown kinds.
Fortunately, mecej4 can recognise the byte pattern for 1.0d0, and some character strings are recognisable (especially hex 20 blanks), but that is a long way from the certainty provided by the I/O list that generated the file.
Text files are a lot more portable!
The attached file contains both a buffer read and then a Fortran record read attempt.
Sorry, this line of reasoning / method of extracting information makes no sense at all. The Linux/Unix strings utility displays printable characters when they occur consecutively in clumps of 4 or more. Everything else in the file is ignored! When a Fortran unformatted file is written, most of the bytes in the file correspond to “unprintable” characters, so by ignoring them you are discarding almost all the numerical information (integers, reals, etc.) and carrying on as if what is left is all that is important.
For instance, for the data file kLG_1.data, your application of strings and sed discards 90 percent of the content of the file.
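To see how much of a buffer survives that kind of filtering, one can count the bytes that sit in runs of printable characters. The sketch below only approximates what the strings utility does (it uses isprint in the default C locale and a configurable minimum run length); the function name is hypothetical.

```c
#include <stddef.h>
#include <ctype.h>

/* Count the bytes of buf that belong to runs of at least min_run
   consecutive printable characters. All other bytes are what a
   strings-like filter would throw away. */
static size_t bytes_kept_by_strings(const unsigned char *buf, size_t len,
                                    size_t min_run)
{
    size_t kept = 0, run = 0;
    for (size_t i = 0; i < len; i++) {
        if (isprint(buf[i])) {
            run++;
        } else {
            if (run >= min_run) kept += run;   /* close a long enough run */
            run = 0;
        }
    }
    if (run >= min_run) kept += run;           /* run reaching the end    */
    return kept;
}
```

On a 22-byte record made of two 4-byte markers, a 4-byte integer and a 10-character string, only the 10 string bytes are kept with a minimum run of 4. The markers and the integer, which carry the structure and the numbers, are silently dropped.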
The question is, do you have the Fortran program sources that produced every one of the unformatted data files that are of interest to you? Or, at least complete and precise documentation of the structure and content of the files? If not, you are simply inviting us to go with you for a walk in a minefield while wearing blindfolds.
Reading the source file nonsymm.f90, in particular the READ(11) statements and the declarations of the variables in the IOLists of those READ statements, answers all the questions that remained unanswered earlier.
The input file is a Fortran unformatted sequential file. Here are the READ statements that use that file.
nonsymm.f90:207: read(11) DoubNum, spacegroupsymbol
nonsymm.f90:248: read(11) SymElemR(:,:,i),SymElemt(:,i),Df(:,:)
nonsymm.f90:347: read(11) Numk,tnir
nonsymm.f90:350: read(11) ListIrrep
371: irk: DO J=1,DoubNum
nonsymm.f90:372: read(11) itmp,itmp
373: IF(itmp==1) THEN
nonsymm.f90:376: read(11) itmp;
378: IF(itmp==1) THEN
nonsymm.f90:379: read(11) abcde(1:2)
380: ELSEIF(itmp==2) THEN
nonsymm.f90:381: read(11) abcde(1:5)
The number of records read, as well as the sizes of the records, depend on the input data rather than being fixed, because the DO loops have loop counts that depend on input data and some of the READs are located inside IF…THEN…ELSEIF…ENDIF blocks.
As I did earlier, I recommend that you read documentation to understand the rules that govern Fortran unformatted I/O. They do not have to be mysteries.
Got it. Thank you very much. The following are the code snippet lines I extracted, which correspond to the relevant data extraction logic you gave above:
No, the hexadecimal string “0e00 0000” consists of 4 bytes: 0e, 00, 00 and 00. The byte ordering is little-endian, so the first byte, 0e, is actually the least significant one. And that is 14 (decimal).
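The same point in code form, as a small sketch: only the order of the bytes changes between the two interpretations, never the order of the two hex digits inside a byte. The function names are made up for this example.

```c
#include <stdint.h>

/* Interpret 4 raw bytes as an integer, least significant byte first. */
static uint32_t read_le32(const unsigned char b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Interpret the same 4 bytes with the most significant byte first. */
static uint32_t read_be32(const unsigned char b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}
```

For the bytes 0e 00 00 00, `read_le32` gives 14 and `read_be32` gives 234881024 (hex 0e000000). The value 224 (hex e0) comes from reversing the hex digits as well as the bytes, which neither byte order does.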
Based on the byte order and the hex digit order within each byte, it might be better for a binary file display, such as Emacs, to show the byte values from right to left rather than from left to right, so that we read all hex values from right to left in 4-bit order.
Is this done with any binary file displays?
I very much doubt that, for the simple reason that it would break (ASCII) strings and 16-byte integers, not to mention UTF-8 strings. You would really need to know what bytes make up a single item.