My improvements to this Fortran benchmark have now been merged into the Julia benchmarks repository and the benchmarks page has been updated.
Edit: as pointed out by Milan, it looks like the online results page has not actually been updated yet.
My improvements to this Fortran benchmark have now been merged into the Julia benchmarks repository and the benchmarks page has been updated.
Edit: as pointed out by Milan, it looks like the online results page has not actually been updated yet.
Great work, @lkedward! I suppose the update will eventually propagate to the plot on the Julia website (I donât think it has yet)?
Very disturbing⌠Fortran faster than Julia? What??
I need the functionality for a very low-volume niche need, so I am not sure if I ever timed them, but note there are several routines in M_string (a registered fpm(1) package, so easy to access), some of which I do not even list in the man-pages that convert to and from anything from base 2 to base 36 using what I think is a classic algorithm:
$@(#) M_strings::base(3f): convert whole number string in base [2-36] to string in alternate base [2-36]
$@(#) M_strings::decodebase(3f): convert whole number string in base [2-36] to base 10 number
$@(#) M_strings::codebase(3f): convert whole number in base 10 to string in base [2-36]
$@(#) M_strings::todecimal(3f): given string and base return decimal integer
$@(#) M_strings::todecimal(3f): given integer and base return string
that might be pertinent if looking to generalize base conversion for stdlib.
It is interesting in this thread, @mecej4 writes:
I had a very similar experience a couple of years ago when this thread by @lkedward first caught my attention. I do feel good to know I am not the only one who came away with the same impression as @mecej4 on Julia.
I was aware of Julia as a possibly leading contender in the near future in the HPC domain but had not looked into it until this thread surfaced here.
In addition to the comments by @mecej4 , I felt many of the claims among the Julia crowd were based on the microbenchmarks that were setup by the ardent users of Julia. But so many of those microbenchmarks were highly questionable in my opinion, especially when it came to other languages used in their comparisons. Honestly, the codes used in the microbenchmarks are poorer than beginner level.
I personally donât think code such as the following to reveal anything useful, let alone be a measure for anything. But this is what the Julia folks use to claim that their âparse integersâ test shows Julia faster than Fortran:
call system_clock(t1)
do k2 = 1, NRUNS
do k = 1, 1000
call random_number(s1)
n = int(s1*huge(n))
call hex_string(n,s)
m = parse_int(s(:len_trim(s)), 16)
call assert(m == n)
end do
end do
call system_clock(t2)
The hex_string subroutine in perf.f90 generates a hex representation of a 32-bit integer. The present implementation uses modulo() and integer divisions by 16 to generate the hex string, which is then used as the input for the parsing routine. The following replacement uses shifts and masks, instead, and on my PC reduced the parse_integers microbenchmark time from 0.130 s to 0.047 s. I have offered this replacement code as an issue on the Julia microbenchmarks Github.
! Convert an integer to a hex string
!
subroutine hex_string(dec,hexchar)
integer, intent(in) :: dec
character(*) :: hexchar
integer :: i, quotient
character(len=1), parameter :: table(0:15) = &
[(char(i),i=ichar('0'),ichar('9')),(char(i),i=ichar('A'),ichar('F'))]
quotient = dec
hexchar = '00000000'
i = 8
do while (quotient /= 0 .and. i > 0)
hexchar(i:i) = table(iand(quotient,15))
i = i-1
quotient = ishft(quotient,-4)
end do
end subroutine hex_string
The comparison of the benchmarks really makes sense only if all languages implement the same algorithm. Otherwise we have comparison of apples and orangutans. Example: solving a system of equations with Gauss eliminations vs. Cramerâs rule.
I agree mostly, but we have to make allowances where one language does not provide means for implementing certain algorithms (for example, standard Fortran 77 did not provide bit manipulation intrinsics, but C did), and so workarounds have to be used and these may result in poor performance.
Thanks to @oscardssmithâs interest, initiative and help, these changes have been accepted and merged into the Julia microbenchmarks test suite. With these changes, Fortran is the fastest for the parse_integer test among all the languages tested.
C 0.1635
Fortran 0.074126 (was 0.682692)
Go 0.097109
Java 0.275906
Julia 0.123103
Python 1.39308
Rust 0.155348
One should not put too much value on these timings, since the codes for the other languages can also benefit from similar changes. It is a quirk of the parse_integer benchmark that the time for creating the test strings is included in the âparsing timeâ and, indeed, my improvements were confined to the lines of code for this preparatory task.
It is to be hoped that the graph and the CSV table will be updated soon to reflect this change.
My shallow opinion is that, believe it or not, it does not really matter what algorithms are used, because there will never really be (and no need to be) apple to apple comparisons.
Because at the end of the day, the real question is really extremely simple, that is,
To achieve the same result, what programming language is the fastest?
PS.
In your example, solving a system of equations, I personally think that, it does not matter one uses Gauss eliminations or Cramerâs rule or anything else. At the end, what really matters is just what language (it can use any algorithms it prefers) solves the system of equations the fastest. That is all.
I mean it is like soccer, in the World Cup, the goal is to win the World Cup. It does not matter what formation you use, 442, 343, 41212, 352, 532, âŚetc. It does not matter what strategy you used. You are the best if you are the champion.
Why Fortranâs print_to_file is so slow in the Julia benchmark?
If you see in detail, the print_to_file is slow compared to many other counterparts such as c, go, jullia, lua, python, and rust. Really surprising.
Try using different Fortran compilers and comparing. Formatted writes have been noticeably slow with some versions of Gfortran, especially on Windows.
the print_to_file uses the following Fortran file :
(link:Microbenchmarks/perf.f90 at a963d284b09d04b3e0374f6dd46ec4b039ed5569 ¡ JuliaLang/Microbenchmarks ¡ GitHub):
subroutine printfd(n)
integer, intent(in) :: n
integer :: i , unit
open(unit=1, file="/dev/null")
do i = 1, n
write(unit=1, fmt=*) i, i+1
end do
close(unit=1)
end subroutine
For list-directed output of real type list items, the processor supplies its own values for w, d and e in Fw.d, Ew.dEe, etc. The format conversion has to be performed, nevertheless.
Your use of he phrase âfree formatâ in the context of output rather than Fortran source code format can be a source of confusion.
Yes, I agree. The w,d, and e are provided by the processor itself and not by the user. I will update the my previous reply.
In addition to the topic of print_to_file
, in the topic of matrix_statistics
, I found that the scale
function is used in that code (rnorm
function):
DO
CALL RANDOM_NUMBER( u )
CALL RANDOM_NUMBER( v )
u = SCALE( u, 1 ) - one
v = SCALE( v, 1 ) - one
sum = u*u + v*v + vsmall ! vsmall added to prevent LOG(zero) / zero
IF(sum < one) EXIT
END DO
sln = SQRT(- SCALE( LOG(sum), 1 ) / sum)
fn_val = u*sln
As far as I know, simply using 2*u
to replace the code like scale(u, 1)
can be much (a bit) faster:
DO
CALL RANDOM_NUMBER( u )
CALL RANDOM_NUMBER( v )
u = 2*u - one
v = 2*v - one
sum = u*u + v*v + vsmall ! vsmall added to prevent LOG(zero) / zero
IF(sum < one) EXIT
END DO
sln = SQRT(- 2*LOG(sum) / sum)
fn_val = u*sln
Simple example (Incorrect):
program main
real(8) :: t1, t2, x, y
integer, parameter :: N = 100000000
print *, "x = ", x, "; y = ", y
call cpu_time(t1)
do i = 1, N
x = 2*y + i
end do
call cpu_time(t2)
print *, "time for N = ", N, ", x = 2*y : ", t2 - t1, " s; x = ", x
call cpu_time(t1)
do i = 1, N
x = scale(y, 1) + i
end do
call cpu_time(t2)
print *, "time for N = ", N, ", x = scale(y, 1): ", t2 - t1, " s; x = ", x
end program main
Results:
(Compiler optimization options are not used, because with optimization, Fortran runs too fast, and the running time is 0)
# On WSL2-Debian gfortran 10.3
>> gfortran main.f90 && ./a.out
x = 9.8813129168249309E-324 ; y = 2.0000000000000000
time for N = 100000000 , x = 2*y : 0.16183899999999998 s; x = 100000004.00000000
time for N = 100000000 , x = scale(y, 1): 0.72182100000000005 s; x = 100000004.00000000
# On MSYS2-ucrt64-gfortran 12.1
>> gfortran main.f90 && ./a
x = 6.9530460433855700E-310 ; y = 2.0000000000000000
time for N = 100000000 , x = 2*y : 0.15625000000000000 s; x = 100000004.00000000
time for N = 100000000 , x = scale(y, 1): 0.60937500000000000 s; x = 100000004.00000000
# Windows10-ifort 2021.6.0
>> ifort main.f90 && ./main
x = 0.000000000000000E+000 ; y = 2.00000000000000
time for N = 100000000 , x = 2*y : 0.000000000000000E+000 s; x =
100000004.000000
time for N = 100000000 , x = scale(y, 1): 3.125000000000000E-002 s; x =
100000004.000000
Modified example:
program main
real(8) :: t1, t2, x, y = 2.0
integer, parameter :: N = 100000000
call cpu_time(t1)
do i = 1, N
call random_number(y)
x = 2*y
end do
call cpu_time(t2)
print *, "time for N = ", N, ", x = 2*y : ", t2 - t1, " s; x = ", x
call cpu_time(t1)
do i = 1, N
call random_number(y)
x = scale(y, 1)
end do
call cpu_time(t2)
print *, "time for N = ", N, ", x = scale(y, 1): ", t2 - t1, " s; x = ", x
end program main
Results:
# WSL2-Debian gfortran 10.3
>> gfortran -Ofast main.f90 && ./a
time for N = 100000000 , x = 2*y : 0.78022399999999992 s; x = 1.1880948469991348
time for N = 100000000 , x = scale(y, 1): 1.4326160000000001 s; x = 1.6401134682038268
# MSYS2-ucrt64-gfortran 12.1
>> gfortran -Ofast main.f90 && ./a
time for N = 100000000 , x = 2*y : 2.5156250000000000 s; x = 1.1728265567290517
time for N = 100000000 , x = scale(y, 1): 3.0625000000000000 s; x = 1.5269402525788756
# On Windows-ifort 2021.6.0
>> ifort /fast main.f90 && ./main
time for N = 100000000 , x = 2*y : 0.437500000000000 s; x =
1.03704569309432
time for N = 100000000 , x = scale(y, 1): 0.625000000000000 s; x =
1.14890448267426
Whether you request optimization or not, the compiler may determine that an expression that is to be output may be calculated entirely at compile time, store that value in the EXE/a.out, and simply output that previously computed value.
Obviously, the result of the first loop is N+4, and it appears that the Intel compiler did not have to place code in the executable to calculate that result (elapsed time = 0).
We have seen an extreme example of such compile time calculation in a recent thread.
Youâre right, I modified the above example to use random numbers to avoid compiler compile-time optimizations, I donât know if this really works.
All in all, I think scale(u, 1)
is still slower than 2*u
, maybe not that significant.
To the topic print_to_file
:
The use of implied DO-loops leads to a better performance: fortran,print_to_file,8.043278
.
However, the output will be transposed.
subroutine printfd(n)
integer, intent(in) :: n
integer :: i , unit
open(unit=1, file="/dev/null")
write(unit=1,fmt=*) (i,i=1,n)
write(unit=1,fmt=*) (i,i=2,n+1)
close(unit=1)
end subroutine
Previous version: fortran,print_to_file,23.939606
subroutine printfd(n)
integer, intent(in) :: n
integer :: i , unit
open(unit=1, file="/dev/null")
do i = 1, n
write(unit=1,fmt=*) i, i+1
end do
close(unit=1)
end subroutine