Binary data mischief with od-xxd-hexdump and fortran

Hi Forts! I am a longtime lurker and first-time poster. This is something I am writing for my blog. It is not quite complete but I felt like sharing it here. Hope it's not too basic for this crowd!

There is a mysterious file with a magic-number binary header. It is not known how this file was created. We wish to store the binary header data as a variable within a Fortran program so that we can test other binary files for a match against this reference header. One way to test the other files would be to keep the reference file around, read its header into a variable at the top of the program, and use that as the reference. A neater solution is to encode the data in the source code of the program itself. A somewhat related exercise is embedding arbitrary binary data in an executable (see "Linking a binary blob with GCC" on Freedom Embedded), but let's not digress.

Let us say that the header is the first 8 bytes. For demonstration, I created a random file of 64 bytes using

dd if=/dev/urandom of=random-file.bin bs=1 count=64 iflag=fullblock

but literally any file will do.

(If you use a larger file, you might want to trim the output of the next set of commands to the first few lines by piping it through head.)

The first step is to look at the binary using one of od/hexdump/xxd. Instead of showing us lots and lots of 0s and 1s that would cause your eyes to glaze over, these commands output a compact representation of the binary data in hexadecimal (or octal). This leads to our first problem. By default, these commands do not print the same thing for the same file:

od

od stands for octal dump and prints octals by default. It needs some options to print hexadecimal.

od -Ax -tx random-file.bin
000000 bd188db2 950361bd df635ec8 907d5cd8
000010 541eebca 3cc7968c 6b053c19 fcc91ed4
000020 3e5b2291 947f60dd d1f87cfd ffffc55c
000030 7613604a 60e26218 db09e09a e733906b
000040

Sidebar: od is OG, first released in November 1971 (I had linked to the Wikipedia page for od here but had to remove it to stay within the two-link limit). It predates the Bourne shell, from which bash inherited its syntax, and is reportedly the reason for the inconsistency in the shell's do loop syntax. The usual convention for constructs is if ... fi, case ... esac, and so on. If not for the already existing od command, do loops would have been closed with od instead of done!

xxd

xxd random-file.bin

00000000: b28d 18bd bd61 0395 c85e 63df d85c 7d90  .....a...^c..\}.
00000010: caeb 1e54 8c96 c73c 193c 056b d41e c9fc  ...T...<.<.k....
00000020: 9122 5b3e dd60 7f94 fd7c f8d1 5cc5 ffff  ."[>.`...|..\...
00000030: 4a60 1376 1862 e260 9ae0 09db 6b90 33e7  J`.v.b.`....k.3.

hexdump

hexdump random-file.bin

0000000 8db2 bd18 61bd 9503 5ec8 df63 5cd8 907d
0000010 ebca 541e 968c 3cc7 3c19 6b05 1ed4 fcc9
0000020 2291 3e5b 60dd 947f 7cfd d1f8 c55c ffff
0000030 604a 7613 6218 60e2 e09a db09 906b e733
0000040

If you are well versed in these tools you can probably see what's coming. I got lucky here: I used xxd first and was able to do what I wanted relatively painlessly. If I had used one of the other tools with its default options, I would probably have pulled out a non-negligible percentage of my hair before getting to a place of understanding.

One hexadecimal digit encodes four bits of data, so two digits make one byte. od printed 16 “words” of 4 bytes each. The other two commands printed 4x8 = 32 words of 2 bytes each. And no two are alike! We will square away the output of hexdump and od at a later time. Right now, let us continue with xxd's output and add some options to cleanly print the first 8 bytes of the file in hex:

xxd -p -l8 random-file.bin
b28d18bdbd610395

It's the same as what we got previously, with the spaces removed and only the first 8 bytes kept. So far so good.

Now, we will use this hex inside Fortran to generate the binary header using the transfer intrinsic. Since it is only 8 bytes, we could use any 8-byte type to store it. In the program below, I chose a scalar INT64 type, transfer'ed the hex into it, and also read the first 8 bytes of the binary file into another variable of the same type. We are testing the reference file against itself, so it should of course pass.

program main
    use iso_fortran_env, only: iwp=> int64
    implicit none
    integer(iwp) i,j

    i = transfer(Z"b28d18bdbd610395",i)

    open(unit=11, file="random-file.bin", access='stream')
    read(11) j
    close(11)

    write(*,'(a, L2)') "Are they equal?: ", i==j
endprogram

The result of this program for me was:

Are they equal?:  F

That didn't quite work as expected! Let us look at the binary representation of i and j to see if we can get a hint. Adding the following lines after reading in j in the previous program

write(*,'(A,Z0)')"i: ",i
write(*,'(A,Z0)')"j: ",j

gave me

i: B28D18BDBD610395
j: 950361BDBD188DB2
Are they equal?:  F

This is interesting! Of course, i and j look different. But, we see that the order of bytes is reversed (recall that one byte equals two hex digits). Variable i is the same as what we assigned it. j however has its bytes reversed. We have been victimized by Endianness!

Some Fortran compilers come with a non-standard extension to convert between endianness when handling files. The program below opens the binary file assuming big-endian ordering:

program main
    use iso_fortran_env, only: iwp=> int64
    implicit none
    integer(iwp) i,j

    i = transfer(Z"b28d18bdbd610395",i)

    open(unit=11, file="random-file.bin", access='stream', convert='big_endian')
    read(11) j
    close(11)

    write(*,'(A,Z0)')"i: ",i
    write(*,'(A,Z0)')"j: ",j
    write(*,'(A, L2)') "Are they equal?: ", i==j
endprogram

with the output:

i: B28D18BDBD610395
j: B28D18BDBD610395
Are they equal?:  T

Now this is exactly what we expected!
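
Since convert= is a compiler extension, the same effect can probably be had in standard Fortran by reversing the bytes after reading. The snippet below is only a sketch under assumptions: the value 0102030405060708 is a made-up stand-in for header bytes (chosen so that its top bit is clear), and the names x, y and b are introduced here for illustration. It copies the integer into eight 1-byte integers with transfer, reverses them, and copies them back.

```fortran
program byteswap_demo
    use iso_fortran_env, only: int8, int64
    implicit none
    integer(int8)  :: b(8)
    integer(int64) :: x, y

    ! Made-up 8-byte value (not from the post); int() accepts a BOZ constant.
    x = int(Z'0102030405060708', int64)

    ! View the in-memory bytes of x as eight 1-byte integers.
    b = transfer(x, b)

    ! Reverse the byte order and reinterpret the result as a 64-bit integer.
    y = transfer(b(8:1:-1), y)

    write(*,'(a,Z16.16)') "x: ", x
    write(*,'(a,Z16.16)') "y: ", y

    ! The first stored byte reveals the byte order of this machine.
    if (b(1) == 8_int8) then
        write(*,'(a)') "this machine is little-endian"
    else
        write(*,'(a)') "this machine is big-endian"
    end if
end program byteswap_demo
```

The first two lines should read x: 0102030405060708 and y: 0807060504030201 on any machine. Whether a swap is actually needed after a read depends on the host: on a big-endian machine the plain read already gives the value as written in the hex dump.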

Platform details:

Win10 WSL2

gfortran --version
GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

First of all, welcome to the forum, @Abhilash :slight_smile: - lurking is quite okay, posting is better.

Just a few remarks here:

  • I noticed a few typos in your post. You will want to correct them, I guess (“there a …” → “there is a …”, “know” → “known”, and some others.)
  • As you rightly say, the keyword is non-standard. A more standard solution is to use a string instead of a 64-bit integer, but it does illustrate the issue of endianness.

Hi Arjen. I appreciate the kind welcome!

Thank you for pointing out the typos. I reviewed it and fixed those and some others.

It would be nice to do it in a standards-compliant way. I am looking to improve it. The solution is not immediately apparent to me though. I think if I use the hex as a string in the code rather than as a literal it would be much better. The literal cannot be arbitrarily large. So having it as a string would avoid that issue.

I found “a” solution for this:

program main
    implicit none
    character(len=20) :: binhead
    character(len=20) :: binary_data
    integer           :: i

    binary_data = "b28d18bdbd610395"

    read( binary_data, '(8Z2)' ) (binhead(i:i), i = 1,8)

    write(*, '(8Z2.2)' ) (binhead(i:i), i = 1,8)
endprogram

Note the Z2.2 output format: with plain Z2, a leading zero would be written as a space.
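
As far as I can tell, the standard defines Z editing only for numeric list items, so reading straight into character variables relies on the compiler being tolerant. A hedged variant, sketched below, reads each pair of hex digits into a default integer and then converts it with char(). The names hex, vals and head are illustrative, and char() for values above 127 is processor dependent, although it commonly yields the byte with that value.

```fortran
program hex_to_bytes
    implicit none
    integer, parameter :: n = 8                  ! header length in bytes
    character(len=2*n) :: hex = "b28d18bdbd610395"
    integer            :: vals(n), k
    character(len=n)   :: head

    ! Read each pair of hex digits into an ordinary integer (0..255).
    read(hex, '(8Z2)') vals

    ! Turn each value into a single character, ready to compare with file data.
    do k = 1, n
        head(k:k) = char(vals(k))
    end do

    write(*,'(8Z2.2)') vals
    write(*,'(a,i0)') "bytes in head: ", len(head)
end program hex_to_bytes
```

This should print B28D18BDBD610395 followed by bytes in head: 8.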

If you want to see hex contents of a file byte-by-byte, use
od -t x1 filename
or
hexdump -C filename
The latter also gives an ASCII representation of the printable bytes in an extra column on the right.

Magic-number headers are stored byte by byte, so they should be endianness-independent. The Z"hexdigits" constant in the code is always interpreted as big-endian, but its value in memory may be byte-reversed. So to do any robust comparison, you need to use 1-byte entities: either characters, as @Arjen suggested, or 1-byte integers, if available.
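
The 1-byte integer route could look roughly like the sketch below. It assumes the compiler provides int8, and it writes its own scratch file (demo-header.bin is a hypothetical name) so that it runs without any other input; ref, head and u are names made up for the example. Because int8 is signed, each byte is masked back into the range 0..255 before the comparison.

```fortran
program header_bytes
    use iso_fortran_env, only: int8
    implicit none
    ! Reference header from the post, one byte per element, in file order.
    integer, parameter :: ref(8) = [ int(Z'B2'), int(Z'8D'), int(Z'18'), int(Z'BD'), &
                                     int(Z'BD'), int(Z'61'), int(Z'03'), int(Z'95') ]
    integer(int8) :: head(8)
    integer       :: u, k

    ! Write a scratch file holding the same 8 bytes so the sketch is self-contained.
    ! (char() above 127 is processor dependent; it commonly gives that byte value.)
    open(newunit=u, file="demo-header.bin", access="stream", form="unformatted", status="replace")
    write(u) (char(ref(k)), k = 1, 8)
    close(u)

    ! Read the first 8 bytes back as 1-byte integers; no byte order is involved.
    open(newunit=u, file="demo-header.bin", access="stream", form="unformatted", status="old")
    read(u) head
    close(u, status="delete")

    ! Mask each signed byte into 0..255, then print and compare.
    write(*,'(a,8(1x,Z2.2))') "read:", iand(int(head), 255)
    write(*,'(a,L2)') "Are they equal?: ", all(iand(int(head), 255) == ref)
end program header_bytes
```

To test a real file, the scratch-file part would be dropped and the second open pointed at that file instead.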

Thanks @msz59! I did read about the default assumptions of hexdump and od and what options to use for my desired output. For example, od -An -t x1 -N16 random-file.bin|tr -d " " to just print the first 16 bytes. (I brought up the problem so I should have probably cleared it up in the post!).
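
To square away the earlier od and hexdump listings: the difference comes down to the unit each tool treats as one number. As a rough illustration in Fortran (b, h and w are names made up here, and the four bytes 01 02 03 04 are sample data rather than anything from the file), the same bytes can be viewed as 1-byte, 2-byte and 4-byte integers with transfer:

```fortran
program word_views
    use iso_fortran_env, only: int8, int16, int32
    implicit none
    integer(int8)  :: b(4) = [1_int8, 2_int8, 3_int8, 4_int8]   ! sample bytes in file order
    integer(int16) :: h(2)
    integer(int32) :: w

    ! Reinterpret the same four bytes as two 2-byte integers and one 4-byte integer.
    h = transfer(b, h)
    w = transfer(b, w)

    write(*,'(a,4(1x,Z2.2))') "1-byte units:", b
    write(*,'(a,2(1x,Z4.4))') "2-byte words:", h
    write(*,'(a,1x,Z8.8)')    "4-byte word: ", w
end program word_views
```

On a little-endian machine the three lines should show 01 02 03 04, then 0201 0403, then 04030201, which mirrors od -t x1, default hexdump and od -tx respectively. A big-endian machine would show the digits in file order on all three lines.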

I did not know about the endianness of the Z"---" in the source, and of course the endianness of multi-byte types on disk would depend on the system. That really clears up the endianness issues for me.

@Arjen This is great! Thank you for taking the time. I think this is what I will go with for my purpose.

program main
    implicit none
    integer  i
    character(len=64) ::  binary_data="b28d18bdbd610395c85e63df&
                                      &d85c7d90caeb1e548c96c73c&
                                      &193c056bd41ec9fc"
    character(len=32) ::  refbinhead, readbinhead

    read( binary_data, '(32Z2)' ) (refbinhead(i:i), i = 1,32)
    open(unit=11, file="random-file.bin", access='stream')
    read(11) readbinhead
    close(11)

    write(*,'(A,(100Z2.2))')"ref  : ",( refbinhead(i:i), i = 1,32)
    write(*,'(A,(100Z2.2))')"read : ",(readbinhead(i:i), i = 1,32)
    write(*,'(A, L2)') "Are they equal?: ", refbinhead==readbinhead
endprogram

which outputs

ref  : B28D18BDBD610395C85E63DFD85C7D90CAEB1E548C96C73C193C056BD41EC9FC
read : B28D18BDBD610395C85E63DFD85C7D90CAEB1E548C96C73C193C056BD41EC9FC
Are they equal?:  T

This was just a figurative use of the term, which strictly refers to the way multi-byte values are stored in memory or on disk. I just meant that the text form of BOZ constants, like that of decimal ones, reads (naturally) from the most significant to the least significant digit. One can see that as an analogy to big-endianness.

Ah got it! Thanks for clarifying!

The many different ways in which numbers are represented in natural languages is an additional source of confusion.

In modern English, we use little-endian names (thirteen, fourteen, …, nineteen) for some two-digit numbers, and big-endian names (twenty-one, …) for others. North Indian languages use little-endian names, while South Indian languages use big-endian names for the same numbers. In English poetry, it is not uncommon to see

When I was one-and-twenty
I heard a wise man say …,

This resembles ein-und-zwanzig in German, which continues little-endian from 13 to 99. Slavic languages like Polish do it from 11 to 19, interestingly not having special terms for 11 and 12.

In some natural languages, there is sometimes a combination of counting forwards and backwards from the nearest multiple of a power of ten. In German, 468 would be vier-hundert-acht-und-sech-zig (4 X 100 + 8 + 6 X 10). In Hindi, 19 is unnees (-1+20 or one-short-of-twenty). French sometimes uses more than one number base; 99 is quatre-vingt-dix-neuf (4 X 20 + 10 + 9). In Danish, I read, 456 is fire-hundred-seks-og-halv-treds (4 X 100 + 6 + (3-1/2) X 20).

Some of us may remember that in tables of common logarithms log 2 is 0.3010, but log 0.2 is represented as -1 + 0.3010 (displayed as ¯1.3010, with the bar directly over the ‘1’), rather than as -0.6990 , and was to be read as “bar-one-point-three-zero-one-zero”.

Thanks for catching that. gfortran 9 does not flag that with -std=f2018. gfortran 12 did. (I didn’t check any others.)

Using the int() intrinsic (along with -fno-range-check) would be a conforming replacement for that particular line, I suppose.
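
A minimal sketch of the int() form follows. To stay clear of the range-check question it uses a made-up 8-byte constant whose most significant bit is clear (7F454C4602010100 is an assumption for illustration, not the header from the post). For a constant such as b28d18bdbd610395, whose top bit is set, the standard leaves the interpretation to the processor, which is presumably why gfortran wants -fno-range-check there.

```fortran
program boz_int
    use iso_fortran_env, only: int64
    implicit none
    integer(int64) :: i

    ! Made-up constant with the top bit clear, so no overflow question arises.
    i = int(Z'7F454C4602010100', int64)

    ! Z16.16 pads with leading zeros to a full 16 hex digits.
    write(*,'(a,Z16.16)') "i: ", i
end program boz_int
```

This should print i: 7F454C4602010100.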

I searched through the F2018 standard's description of the intrinsics and found the following functions that allow BOZ literal constants:

  1. Type conversion: int, cmplx, dble, real
  2. Bitwise comparison: bge, bgt, ble, blt
  3. Bit operations: dshiftl, dshiftr, iand, ieor, ior, merge_bits - in these only one argument can be a BOZ constant

It may also be worth noting that in a data statement a BOZ constant must correspond to an integer object.
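
A few of these places can be tried out in a short program. This is a sketch that assumes a compiler following the F2008/F2018 BOZ rules (a recent gfortran, for instance; older releases may reject or treat some of these lines differently), and the names m and low are made up for the example.

```fortran
program boz_places
    implicit none
    integer :: m, low
    data m / Z'FF' /            ! data statement: the BOZ constant initialises an integer

    ! Bit operation: only one of the two arguments may be a BOZ constant.
    low = iand(300, Z'FF')

    write(*,'(a,i0)') "m   = ", m
    write(*,'(a,i0)') "low = ", low

    ! Bitwise comparison: is m bitwise greater than or equal to 0F?
    write(*,'(a,L2)') "bge: ", bge(m, Z'0F')
end program boz_places
```

Here m should be 255 and low should be 44, since 300 is 12C in hex and the mask keeps only the low byte 2C.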