To avoid the transposition, can you use a named-list?
I am asking because I have never used a named-list myself.
Best
To avoid the transposition, can you use a named-list?
I am asking because I have never used a named-list myself.
Best
Would you like to try this?
subroutine printfd(n)
integer, intent(in) :: n
integer :: i , unit
character(*), parameter :: newline = new_line("")
open(unit=1, file="/dev/null")
write(unit=1, fmt=*) (i, i+1, newline, i=1, n)
close(unit=1)
end subroutine
(Also, the integer::unit
variable declaration here is not necessary, maybe it can be removed later)
Does this change the benchmark considering that youāre replacing 10000 write
statements with one? E.g. C and Julia programs in the suite donāt do that. Itās possible that the intent of the benchmark was to compare the speed of doing many small writes.
I come there from Julia benchmark. The newest result is here.
Fortranās performance is in general pretty good, except for the print_to_file
. I notice that the corresponding C code specifies the format explicitly
fprintf(f, "%ld %ld\n", i, i+1);
while the Fortran code is
write(unit=1, fmt=*) i, i+1
I donāt know how fmt=*
works. Can this be a problem for the performance?
A similar fortran format would be
write(unit=1,'(i0,1x,i0)') i, i+1
You could try it to see if it makes a difference. Iāve seen codes that do minimal computation and lots of output where things like this do matter.
I notice that you are writing to unit 1. If that is a preconnected tty unit, then it is probably unbuffered, and flushing each line does take some effort. If you can write to a normal file, it might also speed things up.
Let me paste the full functions.
C code
void printfd(int n) {
FILE *f = fopen("/dev/null", "w");
long i = 0;
for (i = 0; i < n; i++)
fprintf(f, "%ld %ld\n", i, i+1);
fclose(f);
}
Fortran:
subroutine printfd(n)
integer, intent(in) :: n
integer :: i , unit
open(unit=1, file="/dev/null")
do i = 1, n
write(unit=1, fmt=*) i, i+1
end do
close(unit=1)
end subroutine
Iām not familiar with Fortran and doesnāt know the meaning of unit=1
, but FILE *f = fopen("/dev/null", "w")
and open(unit=1, file="/dev/null")
look the same to me, unless the action mode (āwriteā vs āreadwriteā) matters.
Writing to /dev/null
is indeed not realistic. I donāt know why it was designed in that way. Maybe itās not allowed to a file on Github?
I tried the following simple file, which is transplanted from Microbenchmarks.jl/perf.f90 at master Ā· JuliaLang/Microbenchmarks.jl Ā· GitHub
module types
implicit none
private
public dp, i64
integer, parameter :: dp=kind(0.d0) ! double precision
integer, parameter :: i64 = selected_int_kind(18) ! At least 64-bit integer
end module
module utils
! Various utilities
use types, only: dp, i64
implicit none
private
public sysclock2ms
contains
! Convert a number of clock ticks, as returned by system_clock called
! with integer(i64) arguments, to milliseconds
function sysclock2ms(t)
integer(i64), intent(in) :: t
integer(i64) :: rate
real(dp) :: sysclock2ms, r
call system_clock(count_rate=rate)
r = 1000._dp / rate
sysclock2ms = t * r
end function sysclock2ms
end module
subroutine printfd(n)
integer, intent(in) :: n
integer :: i , unit
open(unit=10, file="/dev/null")
do i = 1, n
write(unit=10, fmt='(i0,1x,i0)') i, i+1
end do
close(unit=10)
end subroutine
program perf
use types, only: dp, i64
use utils, only: sysclock2ms
implicit none
integer, parameter :: NRUNS = 1000
integer :: i, f, n, m, k, k2
integer(i64) :: t1, t2, tmin
tmin = huge(0_i64)
do i = 1, 5
call system_clock(t1)
call printfd(100000)
call system_clock(t2)
if (t2-t1 < tmin) tmin = t2-t1
end do
print "('fortran,print_to_file,',f0.6)", sysclock2ms(tmin)
end program
Some observations:
fmt=*
is actually faster than fmt='(i0,1x,i0)'
by about 10%unit
from 1 to 10flush
after write
for the regular-file case is slower than the /dev/null case, by about 30-40%.So I think you are right, writing to /dev/null
is unbuffered, and very slow.
The real equivalent to the C fopen
code is to use newunit=
in the open statement. This lets the I/O library choose an unused unit number and avoid any potential unit number collision problems with unit pre-connections and unit use in other parts of a large application . Your example would then become something like:
subroutine printfd (n)
integer, intent(in) :: n
integer :: i, iunit
open (newunit=iunit, file="/dev/null")
do i = 1, n
write (unit=iunit, fmt=*) i, i+1
end do
close (unit=iunit)
end subroutine
Doubtful this will change the timing much - since most of the time is spent encoding the integers into character strings during the writes.
With respect to buffering, Fortran leaves it up to the implementation to determine how to do it. Details vary considerably between implementations. E.g., it could even mmap
a disk file into the process address space, and just copy data into or out of a file via a simple memcopy
- rather than use explicit buffers and read
/write
system calls. On a linux system one can use a tool like strace
to find out for sure how it is being handled.
It does not surprise me that there might be a small difference, but I donāt think the trend you see will be universal for all fortran compilers. That is, another compiler (or a new version of a compiler) might well show the reverse timing difference. A format like '(2i8)'
with fixed widths might also be faster than one with variable output widths like i0
or g0
. Usually these things are minor and no one cares about these small differences, but in this example there is essentially no other computation occurring, it is just a benchmark for the integer-to-character conversions within the i/o library.
Now that I see the rest of the code this should be correct. There is an open statement, so that determines the connection. My previous comment was about something called a preconnection, where the compiler establishes the connection without an open statement. Those are sometimes treated in a special way, e.g. with unbuffered output to a tty. Those preconnections sometimes require handshaking between the fortran code and the external device on every i/o operation, which of course would affect timings in this kind of benchmark. Apparently /dev/null
, which is a special virtual device provided by the operating system, also requires some extra handshaking overhead. I was not aware of that, so that behavior is good to know.
This is likely because everything is working within an output buffer, so it is limited only by the memory-to-memory, or cache-to-memory, speed. The actual i/o to the output device is likely done outside of the timing operations, maybe even at or after the program termination. The programmer usually does not have much control over these details of the interactions between the i/o library, the operating system, and the file system.
I will also comment that for someone unfamiliar with fortran that your code looks pretty good otherwise. For example, you have been careful with the 64-bit integer declarations and operations.
Thanks. Good to know this.
I have to say Iām not the person who wrote the code. I just copied and reorganized the relevant pieces from the repo mentioned above, to make a short test program. Nonetheless, itās nice to know that the code quality in that benchmark is good.
Formatted output only makes sense for human consumption, and a human can process such an item in one second or so. There is no realistic scenario where performance matters and you put a human in the loop looking at thousands of numbers in their formatted output form. I question the relevance of this benchmark.
Maybe this is a bit faster with gfortran. (see @RonShepardās comment about different compilers.)
subroutine printfd(n)
implicit none
integer, intent(in) :: n
integer :: i , unit
open(unit=10, file="/dev/null")
write(unit=10, fmt='('//int2str(n)//'(I0,1X,I0,A1))',advance='no') (i,i+1,new_line('a'),i=1,n)
close(unit=10)
contains
function int2str(ival) result(str)
integer, intent(in) :: ival
character(len=:), allocatable :: str
integer, parameter :: ibuffer_len = range(ival)+2
character(len=ibuffer_len) :: buffer
integer :: i, sign, n
if (ival == 0) then
str = '0'
return
end if
sign = 1
if (ival < 0) sign = -1
n = abs(ival)
buffer = ""
i = ibuffer_len
do while (n > 0)
buffer(i:i) = char(mod(n,10) + ichar('0'))
n = n/10
i = i - 1
end do
if (sign == -1) then
buffer(i:i) = '-'
i = i - 1
end if
str = buffer(i+1:ibuffer_len)
end function
end subroutine
For some reason, adding a new_line('a')
instead of using slash editing (/)
seems to be 4x faster on the computer I tested this. A friendlier way would be:
write(unit=10, fmt='(*(I0,1X,I0,:,/))') (i,i+1,i=1,n)
but it was about the same rate as the loop writing single records.
Iām afraid that in practice formatted output gets (ab)used for things itās not meant for; but I think we can not judge these things too harshly. For instance just take a look at the FASTQ format for biological sequencing data,
where they use ASCII to represent base pairs (A, C, G, T, perhaps also U, Z, representable in 3 bits). Bioinformaticians compete who can write the fastest parser for this format in programming language XYZ.
I feel this changes the logic of the benchmark. I guess the goal of that microbenchmark is to test the system calls. The above code seems only calling write
once.
The side-effect is the same. But yes, the Julia microbenchmark notes specify that
the benchmarks are written to test the performance of identical algorithms and code patterns implemented in each language.
I didnāt scroll back enough in the thread earlier; I see now that @zoziha and @milancurcic already discussed the idea of using a single write with an implied do-loop.
Instead of system calls I would prefer saying āruntime library implementationā. Theoretically, a compiler could replace the Fortran IO statement with something similar to what C does,
fprintf(f, "%ld %ld\n", i, i+1);
But in practice the existing Fortran implementations delegate I/O statements to runtime routines. For example with gfortran you get this type assembly control flow chart:
With flang, the main body of the loop looks very similar, but using different runtime routines:
.LBB0_2:
mov edi, 1
mov rsi, r14
mov edx, 6
call _FortranAioBeginExternalListOutput@PLT
mov r15, rax
mov rdi, rax
mov esi, ebx
call _FortranAioOutputInteger32@PLT
inc ebx
mov rdi, r15
mov esi, ebx
call _FortranAioOutputInteger32@PLT
mov rdi, r15
call _FortranAioEndIoStatement@PLT
dec r12
cmp r12, 1
ja .LBB0_2
I guess this causes some overhead, compared to the C routine:
printfd:
push r12
mov esi, OFFSET FLAT:.LC0
push rbp
movsx rbp, edi
mov edi, OFFSET FLAT:.LC1
push rbx
call fopen
mov r12, rax
test ebp, ebp
jle .L2
xor ebx, ebx
.L3:
mov rdx, rbx
inc rbx
xor eax, eax
mov esi, OFFSET FLAT:.LC2
mov rcx, rbx
mov rdi, r12
call fprintf
cmp rbx, rbp
jne .L3
.L2:
pop rbx
mov rdi, r12
pop rbp
pop r12
jmp fclose
One might hope so. However, I remember once using a compiler that generated separate calls for each iteration through the implied loop, and it did that for both formatted and unformatted i/o. We learned very quickly on that compiler to replace
write(unit) (a(i),i=1,n)
with a subroutine call that did
write(unit) a
where the dummy argument was dimensioned appropriately. That compiler also did backspace
by rewinding the file and then skipping over the appropriate number of records, also a very slow implementation.
Ultimately, these types of benchmarks always feel contrived and artificial because languages have multiple ways to do things, so comparing one similar looking operation in one language to that in another language often obscures some subtle differences. In fortran, the fastest way to do this kind of operation would probably be to manually convert the integer values to characters (as @ivanpribec did above with int2scr()
for the value of n
), move those characters into a character array, and then do a single write of that character array with stream i/o (although in this case, going to /dev/null even could be skipped).
My main concern about that int2str()
function for this purpose is that the output is an allocatable string. To be most efficient, the output should be either a fixed-length string or written directly into the character buffer in order to avoid the overhead of all those memory allocations. Memory allocations are expensive, and very much so for some compilers.
On the other hand, I think these kinds of benchmarks do show shortcomings of intrinsic functions and, in this case, of i/o library routines. The i/o library routines to convert integer and floating point values to character strings in most compilers are often very slow and can easily be outpaced with manual code, even code written in fortran. I donāt know why that is the case. Perhaps they do more error testing than the typical user code? I have maintained my own library codes to do integer to character conversions for some 40+ years because of this āfeatureā of fortran. I have also done special floating point to character conversions, but the general floating point case is much more difficult, so I just fall back to fortran internal i/o for those cases, slow as that is.
Yes, the reason I delved a little into this issue is that Fortran is good at all other tests in that benchmark, so itās a little mysterious to me why this particular test seems different.
I agree with you. Sorry for my bad wording. I think this shows some difference between the implementations of this kind of functions.