Improving Fortran Results in the Julia Micro-benchmarks

To avoid the transposition, can you use a named-list?

I am asking because I have never used a named-list myself.

Best

Would you like to try this?

subroutine printfd(n)
integer, intent(in) :: n
integer :: i , unit
character(*), parameter :: newline = new_line("")
open(unit=1, file="/dev/null")
write(unit=1, fmt=*) (i, i+1, newline, i=1, n)
close(unit=1)
end subroutine

(Also, the integer::unit variable declaration here is not necessary, maybe it can be removed later)

1 Like

Does this change the benchmark considering that you’re replacing 10000 write statements with one? E.g. C and Julia programs in the suite don’t do that. It’s possible that the intent of the benchmark was to compare the speed of doing many small writes.

4 Likes

I come there from Julia benchmark. The newest result is here.

Fortran’s performance is in general pretty good, except for the print_to_file. I notice that the corresponding C code specifies the format explicitly

fprintf(f, "%ld %ld\n", i, i+1);

while the Fortran code is

write(unit=1, fmt=*) i, i+1

I don’t know how fmt=* works. Can this be a problem for the performance?

A similar fortran format would be

write(unit=1,'(i0,1x,i0)') i, i+1

You could try it to see if it makes a difference. I’ve seen codes that do minimal computation and lots of output where things like this do matter.

I notice that you are writing to unit 1. If that is a preconnected tty unit, then it is probably unbuffered, and flushing each line does take some effort. If you can write to a normal file, it might also speed things up.

Let me paste the full functions.

C code

void printfd(int n) {
    FILE *f = fopen("/dev/null", "w");
    long i = 0;
    for (i = 0; i < n; i++)
        fprintf(f, "%ld %ld\n", i, i+1);
    fclose(f);
}

Fortran:

subroutine printfd(n)
integer, intent(in) :: n
integer :: i , unit
open(unit=1, file="/dev/null")
do i = 1, n
    write(unit=1, fmt=*) i, i+1
end do
close(unit=1)
end subroutine

I’m not familiar with Fortran and doesn’t know the meaning of unit=1, but FILE *f = fopen("/dev/null", "w") and open(unit=1, file="/dev/null") look the same to me, unless the action mode (ā€œwriteā€ vs ā€œreadwriteā€) matters.

Writing to /dev/null is indeed not realistic. I don’t know why it was designed in that way. Maybe it’s not allowed to a file on Github?

I tried the following simple file, which is transplanted from Microbenchmarks.jl/perf.f90 at master Ā· JuliaLang/Microbenchmarks.jl Ā· GitHub

module types
implicit none
private
public dp, i64
integer, parameter :: dp=kind(0.d0)          ! double precision
integer, parameter :: i64 = selected_int_kind(18) ! At least 64-bit integer
end module




module utils
! Various utilities
use types, only: dp, i64
implicit none
private
public sysclock2ms

contains

! Convert a number of clock ticks, as returned by system_clock called
! with integer(i64) arguments, to milliseconds
function sysclock2ms(t)
  integer(i64), intent(in) :: t
  integer(i64) :: rate
  real(dp) :: sysclock2ms, r
  call system_clock(count_rate=rate)
  r = 1000._dp / rate
  sysclock2ms = t * r
end function sysclock2ms

end module


subroutine printfd(n)
integer, intent(in) :: n
integer :: i , unit
open(unit=10, file="/dev/null")
do i = 1, n
    write(unit=10, fmt='(i0,1x,i0)') i, i+1
end do
close(unit=10)
end subroutine


program perf
use types, only: dp, i64
use utils, only: sysclock2ms
implicit none

integer, parameter :: NRUNS = 1000
integer :: i, f, n, m, k, k2
integer(i64) :: t1, t2, tmin

tmin = huge(0_i64)
do i = 1, 5
    call system_clock(t1)
    call printfd(100000)
    call system_clock(t2)
    if (t2-t1 < tmin) tmin = t2-t1
end do
print "('fortran,print_to_file,',f0.6)", sysclock2ms(tmin)

end program

Some observations:

  • fmt=* is actually faster than fmt='(i0,1x,i0)' by about 10%
  • No difference in changing unit from 1 to 10
  • Output to a regular file makes a big difference, more than 3 times faster!
  • flush after write for the regular-file case is slower than the /dev/null case, by about 30-40%.

So I think you are right, writing to /dev/null is unbuffered, and very slow.

The real equivalent to the C fopen code is to use newunit= in the open statement. This lets the I/O library choose an unused unit number and avoid any potential unit number collision problems with unit pre-connections and unit use in other parts of a large application . Your example would then become something like:

subroutine printfd (n)
integer, intent(in) :: n
integer :: i, iunit
open (newunit=iunit, file="/dev/null")
do i = 1, n
    write (unit=iunit, fmt=*) i, i+1
end do
close (unit=iunit)
end subroutine

Doubtful this will change the timing much - since most of the time is spent encoding the integers into character strings during the writes.

With respect to buffering, Fortran leaves it up to the implementation to determine how to do it. Details vary considerably between implementations. E.g., it could even mmap a disk file into the process address space, and just copy data into or out of a file via a simple memcopy - rather than use explicit buffers and read/write system calls. On a linux system one can use a tool like strace to find out for sure how it is being handled.

1 Like

It does not surprise me that there might be a small difference, but I don’t think the trend you see will be universal for all fortran compilers. That is, another compiler (or a new version of a compiler) might well show the reverse timing difference. A format like '(2i8)' with fixed widths might also be faster than one with variable output widths like i0 or g0. Usually these things are minor and no one cares about these small differences, but in this example there is essentially no other computation occurring, it is just a benchmark for the integer-to-character conversions within the i/o library.

Now that I see the rest of the code this should be correct. There is an open statement, so that determines the connection. My previous comment was about something called a preconnection, where the compiler establishes the connection without an open statement. Those are sometimes treated in a special way, e.g. with unbuffered output to a tty. Those preconnections sometimes require handshaking between the fortran code and the external device on every i/o operation, which of course would affect timings in this kind of benchmark. Apparently /dev/null, which is a special virtual device provided by the operating system, also requires some extra handshaking overhead. I was not aware of that, so that behavior is good to know.

This is likely because everything is working within an output buffer, so it is limited only by the memory-to-memory, or cache-to-memory, speed. The actual i/o to the output device is likely done outside of the timing operations, maybe even at or after the program termination. The programmer usually does not have much control over these details of the interactions between the i/o library, the operating system, and the file system.

I will also comment that for someone unfamiliar with fortran that your code looks pretty good otherwise. For example, you have been careful with the 64-bit integer declarations and operations.

Thanks. Good to know this.

I have to say I’m not the person who wrote the code. I just copied and reorganized the relevant pieces from the repo mentioned above, to make a short test program. Nonetheless, it’s nice to know that the code quality in that benchmark is good.

Formatted output only makes sense for human consumption, and a human can process such an item in one second or so. There is no realistic scenario where performance matters and you put a human in the loop looking at thousands of numbers in their formatted output form. I question the relevance of this benchmark.

Maybe this is a bit faster with gfortran. (see @RonShepard’s comment about different compilers.)

subroutine printfd(n)
implicit none
integer, intent(in) :: n
integer :: i , unit
open(unit=10, file="/dev/null")
write(unit=10, fmt='('//int2str(n)//'(I0,1X,I0,A1))',advance='no') (i,i+1,new_line('a'),i=1,n)
close(unit=10)

contains 
    function int2str(ival) result(str)
        integer, intent(in) :: ival
        character(len=:), allocatable :: str
        integer, parameter :: ibuffer_len = range(ival)+2
        character(len=ibuffer_len) :: buffer
        integer :: i, sign, n
        if (ival == 0) then
            str = '0'
            return
        end if
        sign = 1
        if (ival < 0) sign = -1
        n = abs(ival)
        buffer = ""
        i = ibuffer_len
        do while (n > 0)
            buffer(i:i) = char(mod(n,10) + ichar('0'))
            n = n/10
            i = i - 1
        end do
        if (sign == -1) then
            buffer(i:i) = '-'
            i = i - 1
        end if
        str = buffer(i+1:ibuffer_len)
    end function
end subroutine

For some reason, adding a new_line('a') instead of using slash editing (/) seems to be 4x faster on the computer I tested this. A friendlier way would be:

write(unit=10, fmt='(*(I0,1X,I0,:,/))') (i,i+1,i=1,n)

but it was about the same rate as the loop writing single records.


I’m afraid that in practice formatted output gets (ab)used for things it’s not meant for; but I think we can not judge these things too harshly. For instance just take a look at the FASTQ format for biological sequencing data,

where they use ASCII to represent base pairs (A, C, G, T, perhaps also U, Z, representable in 3 bits). Bioinformaticians compete who can write the fastest parser for this format in programming language XYZ.

I feel this changes the logic of the benchmark. I guess the goal of that microbenchmark is to test the system calls. The above code seems only calling write once.

The side-effect is the same. But yes, the Julia microbenchmark notes specify that

the benchmarks are written to test the performance of identical algorithms and code patterns implemented in each language.

I didn’t scroll back enough in the thread earlier; I see now that @zoziha and @milancurcic already discussed the idea of using a single write with an implied do-loop.

Instead of system calls I would prefer saying ā€œruntime library implementationā€. Theoretically, a compiler could replace the Fortran IO statement with something similar to what C does,

fprintf(f, "%ld %ld\n", i, i+1);

But in practice the existing Fortran implementations delegate I/O statements to runtime routines. For example with gfortran you get this type assembly control flow chart:

With flang, the main body of the loop looks very similar, but using different runtime routines:

.LBB0_2:
        mov     edi, 1
        mov     rsi, r14
        mov     edx, 6
        call    _FortranAioBeginExternalListOutput@PLT
        mov     r15, rax
        mov     rdi, rax
        mov     esi, ebx
        call    _FortranAioOutputInteger32@PLT
        inc     ebx
        mov     rdi, r15
        mov     esi, ebx
        call    _FortranAioOutputInteger32@PLT
        mov     rdi, r15
        call    _FortranAioEndIoStatement@PLT
        dec     r12
        cmp     r12, 1
        ja      .LBB0_2

I guess this causes some overhead, compared to the C routine:

printfd:
        push    r12
        mov     esi, OFFSET FLAT:.LC0
        push    rbp
        movsx   rbp, edi
        mov     edi, OFFSET FLAT:.LC1
        push    rbx
        call    fopen
        mov     r12, rax
        test    ebp, ebp
        jle     .L2
        xor     ebx, ebx
.L3:
        mov     rdx, rbx
        inc     rbx
        xor     eax, eax
        mov     esi, OFFSET FLAT:.LC2
        mov     rcx, rbx
        mov     rdi, r12
        call    fprintf
        cmp     rbx, rbp
        jne     .L3
.L2:
        pop     rbx
        mov     rdi, r12
        pop     rbp
        pop     r12
        jmp     fclose
1 Like

One might hope so. However, I remember once using a compiler that generated separate calls for each iteration through the implied loop, and it did that for both formatted and unformatted i/o. We learned very quickly on that compiler to replace

write(unit) (a(i),i=1,n)

with a subroutine call that did

write(unit) a

where the dummy argument was dimensioned appropriately. That compiler also did backspace by rewinding the file and then skipping over the appropriate number of records, also a very slow implementation.

Ultimately, these types of benchmarks always feel contrived and artificial because languages have multiple ways to do things, so comparing one similar looking operation in one language to that in another language often obscures some subtle differences. In fortran, the fastest way to do this kind of operation would probably be to manually convert the integer values to characters (as @ivanpribec did above with int2scr() for the value of n), move those characters into a character array, and then do a single write of that character array with stream i/o (although in this case, going to /dev/null even could be skipped).

My main concern about that int2str() function for this purpose is that the output is an allocatable string. To be most efficient, the output should be either a fixed-length string or written directly into the character buffer in order to avoid the overhead of all those memory allocations. Memory allocations are expensive, and very much so for some compilers.

On the other hand, I think these kinds of benchmarks do show shortcomings of intrinsic functions and, in this case, of i/o library routines. The i/o library routines to convert integer and floating point values to character strings in most compilers are often very slow and can easily be outpaced with manual code, even code written in fortran. I don’t know why that is the case. Perhaps they do more error testing than the typical user code? I have maintained my own library codes to do integer to character conversions for some 40+ years because of this ā€œfeatureā€ of fortran. I have also done special floating point to character conversions, but the general floating point case is much more difficult, so I just fall back to fortran internal i/o for those cases, slow as that is.

1 Like

Yes, the reason I delved a little into this issue is that Fortran is good at all other tests in that benchmark, so it’s a little mysterious to me why this particular test seems different.

I agree with you. Sorry for my bad wording. I think this shows some difference between the implementations of this kind of functions.