Improving Fortran Results in the Julia Micro-benchmarks

lkedward · April 12, 2022, 8:08pm

My improvements to this Fortran benchmark have now been merged into the Julia benchmarks repository and the benchmarks page has been updated.

Edit: as pointed out by Milan, it looks like the online results page has not actually been updated yet.

milancurcic · April 12, 2022, 8:48pm

Great work, @lkedward! I suppose the update will eventually propagate to the plot on the Julia website (I don’t think it has yet)?

jacobwilliams · April 12, 2022, 8:54pm

Very disturbing… Fortran faster than Julia? What??

urbanjost · April 13, 2022, 4:41am

I need the functionality for a very low-volume niche need, so I am not sure if I ever timed them, but note there are several routines in M_string (a registered fpm(1) package, so easy to access), some of which I do not even list in the man-pages that convert to and from anything from base 2 to base 36 using what I think is a classic algorithm:

$@(#) M_strings::base(3f): convert whole number string in base [2-36] to string in alternate base [2-36]
$@(#) M_strings::decodebase(3f): convert whole number string in base [2-36] to base 10 number
$@(#) M_strings::codebase(3f): convert whole number in base 10 to string in base [2-36]
$@(#) M_strings::todecimal(3f): given string and base return decimal integer
$@(#) M_strings::todecimal(3f): given integer and base return string

that might be pertinent if looking to generalize base conversion for stdlib.

FortranFan · April 30, 2022, 5:37pm

It is interesting in this thread, @mecej4 writes:

Why is my code compiled with GFortran on Windows slower than on Ubuntu?

… Before joining this forum, I had barely heard of Julia. Some of the posts in fortran-lang.discourse named Julia as a threat to Fortran, so I searched and read some articles claiming that Julia was xxx times faster than Fortran, etc. I felt that the claims were not credible, and wanted to give Julia a test-drive. I downloaded the 1.7.2 distribution for Windows, and ran a few examples of nonlinear regression.

The Fortran version compiles and links in ~ 2 seconds, and the run takes less than a second. A second run of the EXE takes 0.04 s, since the DLLs are now cached. The Julia version took 25 seconds for the first run, of which 17 s were used to process the directive using PyPlot . A second run while staying within the Julia REPL took 0.4 s. What disappointed me was the inability to compile stable source code into object files, EXEs and libraries and the huge startup time for running any Julia program unless one stays inside the REPL.

I had a very similar experience a couple of years ago when this thread by @lkedward first caught my attention. I do feel good to know I am not the only one who came away with the same impression as @mecej4 on Julia.

I was aware of Julia as a possibly leading contender in the near future in the HPC domain but had not looked into it until this thread surfaced here.

In addition to the comments by @mecej4 , I felt many of the claims among the Julia crowd were based on the microbenchmarks that were setup by the ardent users of Julia. But so many of those microbenchmarks were highly questionable in my opinion, especially when it came to other languages used in their comparisons. Honestly, the codes used in the microbenchmarks are poorer than beginner level.

I personally don’t think code such as the following to reveal anything useful, let alone be a measure for anything. But this is what the Julia folks use to claim that their “parse integers” test shows Julia faster than Fortran:

    call system_clock(t1)
    do k2 = 1, NRUNS
        do k = 1, 1000
            call random_number(s1)
            n = int(s1*huge(n))
            call hex_string(n,s)
            m = parse_int(s(:len_trim(s)), 16)
            call assert(m == n)
        end do
    end do
    call system_clock(t2)

mecej4 · June 5, 2022, 10:24am

The hex_string subroutine in perf.f90 generates a hex representation of a 32-bit integer. The present implementation uses modulo() and integer divisions by 16 to generate the hex string, which is then used as the input for the parsing routine. The following replacement uses shifts and masks, instead, and on my PC reduced the parse_integers microbenchmark time from 0.130 s to 0.047 s. I have offered this replacement code as an issue on the Julia microbenchmarks Github.

! Convert an integer to a hex string
!
subroutine hex_string(dec,hexchar)
  integer, intent(in) :: dec
  character(*) :: hexchar

  integer :: i, quotient

  character(len=1), parameter :: table(0:15) = &
  [(char(i),i=ichar('0'),ichar('9')),(char(i),i=ichar('A'),ichar('F'))]

  quotient = dec
  hexchar  = '00000000'
  i = 8
  do while (quotient /= 0 .and. i > 0)
      hexchar(i:i) = table(iand(quotient,15))
      i = i-1
      quotient = ishft(quotient,-4)
  end do

end subroutine hex_string

PetrKryslUCSD · June 6, 2022, 1:57pm

The comparison of the benchmarks really makes sense only if all languages implement the same algorithm. Otherwise we have comparison of apples and orangutans. Example: solving a system of equations with Gauss eliminations vs. Cramer’s rule.

mecej4 · June 6, 2022, 4:13pm

I agree mostly, but we have to make allowances where one language does not provide means for implementing certain algorithms (for example, standard Fortran 77 did not provide bit manipulation intrinsics, but C did), and so workarounds have to be used and these may result in poor performance.

mecej4 · June 19, 2022, 9:21pm

Thanks to @oscardssmith’s interest, initiative and help, these changes have been accepted and merged into the Julia microbenchmarks test suite. With these changes, Fortran is the fastest for the parse_integer test among all the languages tested.

C       0.1635
Fortran 0.074126 (was 0.682692)
Go      0.097109
Java    0.275906
Julia   0.123103
Python  1.39308
Rust    0.155348

One should not put too much value on these timings, since the codes for the other languages can also benefit from similar changes. It is a quirk of the parse_integer benchmark that the time for creating the test strings is included in the “parsing time” and, indeed, my improvements were confined to the lines of code for this preparatory task.

It is to be hoped that the graph and the CSV table will be updated soon to reflect this change.

CRquantum · June 20, 2022, 5:37am

My shallow opinion is that, believe it or not, it does not really matter what algorithms are used, because there will never really be (and no need to be) apple to apple comparisons.

Because at the end of the day, the real question is really extremely simple, that is,

To achieve the same result, what programming language is the fastest?

PS.

In your example, solving a system of equations, I personally think that, it does not matter one uses Gauss eliminations or Cramer’s rule or anything else. At the end, what really matters is just what language (it can use any algorithms it prefers) solves the system of equations the fastest. That is all.

I mean it is like soccer, in the World Cup, the goal is to win the World Cup. It does not matter what formation you use, 442, 343, 41212, 352, 532, …etc. It does not matter what strategy you used. You are the best if you are the champion.

fortran4r · June 21, 2022, 6:30am

Why Fortran’s print_to_file is so slow in the Julia benchmark?

snano · June 21, 2022, 6:44am

If you see in detail, the print_to_file is slow compared to many other counterparts such as c, go, jullia, lua, python, and rust. Really surprising.

mecej4 · June 21, 2022, 7:55am

Try using different Fortran compilers and comparing. Formatted writes have been noticeably slow with some versions of Gfortran, especially on Windows.

snano · June 21, 2022, 8:21am

the print_to_file uses the following Fortran file :
(link:Microbenchmarks/perf.f90 at a963d284b09d04b3e0374f6dd46ec4b039ed5569 · JuliaLang/Microbenchmarks · GitHub):


subroutine printfd(n)
integer, intent(in) :: n
integer :: i , unit
open(unit=1, file="/dev/null")
do i = 1, n
    write(unit=1, fmt=*) i, i+1
end do
close(unit=1)
end subroutine

mecej4 · June 21, 2022, 8:36am

For list-directed output of real type list items, the processor supplies its own values for w, d and e in Fw.d, Ew.dEe, etc. The format conversion has to be performed, nevertheless.

Your use of he phrase “free format” in the context of output rather than Fortran source code format can be a source of confusion.

snano · June 21, 2022, 8:50am

Yes, I agree. The w,d, and e are provided by the processor itself and not by the user. I will update the my previous reply.

zoziha · June 21, 2022, 2:26pm

In addition to the topic of print_to_file, in the topic of matrix_statistics, I found that the scale function is used in that code (rnorm function):

DO
    CALL RANDOM_NUMBER( u )
    CALL RANDOM_NUMBER( v )
    u = SCALE( u, 1 ) - one
    v = SCALE( v, 1 ) - one
    sum = u*u + v*v + vsmall         ! vsmall added to prevent LOG(zero) / zero
    IF(sum < one) EXIT
  END DO
  sln = SQRT(- SCALE( LOG(sum), 1 ) / sum)
  fn_val = u*sln

As far as I know, simply using 2*u to replace the code like scale(u, 1) can be ~~much~~ (a bit) faster:

DO
    CALL RANDOM_NUMBER( u )
    CALL RANDOM_NUMBER( v )
    u = 2*u - one
    v = 2*v - one
    sum = u*u + v*v + vsmall         ! vsmall added to prevent LOG(zero) / zero
    IF(sum < one) EXIT
  END DO
  sln = SQRT(- 2*LOG(sum) / sum)
  fn_val = u*sln

~~Simple example (Incorrect)~~:

Summary

program main

    real(8) :: t1, t2, x, y
    integer, parameter :: N = 100000000

    print *, "x = ", x, "; y = ", y

    call cpu_time(t1)
    do i = 1, N
        x = 2*y + i
    end do
    call cpu_time(t2)
    print *, "time for N = ", N, ", x = 2*y        : ", t2 - t1, " s; x = ", x

    call cpu_time(t1)
    do i = 1, N
        x = scale(y, 1) + i
    end do
    call cpu_time(t2)
    print *, "time for N = ", N, ", x = scale(y, 1): ", t2 - t1, " s; x = ", x

end program main

Results:
(Compiler optimization options are not used, because with optimization, Fortran runs too fast, and the running time is 0)

# On WSL2-Debian gfortran 10.3
>> gfortran main.f90 && ./a.out
 x =    9.8813129168249309E-324 ; y =    2.0000000000000000     
 time for N =    100000000 , x = 2*y        :   0.16183899999999998       s; x =    100000004.00000000     
 time for N =    100000000 , x = scale(y, 1):   0.72182100000000005       s; x =    100000004.00000000
# On MSYS2-ucrt64-gfortran 12.1
>> gfortran main.f90 && ./a
 x =    6.9530460433855700E-310 ; y =    2.0000000000000000     
 time for N =    100000000 , x = 2*y        :   0.15625000000000000       s; x =    100000004.00000000     
 time for N =    100000000 , x = scale(y, 1):   0.60937500000000000       s; x =    100000004.00000000  
# Windows10-ifort 2021.6.0
>> ifort main.f90 && ./main
 x =   0.000000000000000E+000 ; y =    2.00000000000000     
 time for N =    100000000 , x = 2*y        :   0.000000000000000E+000  s; x =
   100000004.000000
 time for N =    100000000 , x = scale(y, 1):   3.125000000000000E-002  s; x = 
   100000004.000000

Modified example:

program main

    real(8) :: t1, t2, x, y = 2.0
    integer, parameter :: N = 100000000

    call cpu_time(t1)
    do i = 1, N
        call random_number(y)
        x = 2*y
    end do
    call cpu_time(t2)
    print *, "time for N = ", N, ", x = 2*y        : ", t2 - t1, " s; x = ", x

    call cpu_time(t1)
    do i = 1, N
        call random_number(y)
        x = scale(y, 1)
    end do
    call cpu_time(t2)
    print *, "time for N = ", N, ", x = scale(y, 1): ", t2 - t1, " s; x = ", x

end program main

Results:

# WSL2-Debian gfortran 10.3
>> gfortran -Ofast main.f90 && ./a
 time for N =    100000000 , x = 2*y        :   0.78022399999999992       s; x =    1.1880948469991348     
 time for N =    100000000 , x = scale(y, 1):    1.4326160000000001       s; x =    1.6401134682038268
# MSYS2-ucrt64-gfortran 12.1
>> gfortran -Ofast main.f90 && ./a
 time for N =    100000000 , x = 2*y        :    2.5156250000000000       s; x =    1.1728265567290517     
 time for N =    100000000 , x = scale(y, 1):    3.0625000000000000       s; x =    1.5269402525788756
# On Windows-ifort 2021.6.0
>> ifort /fast main.f90 && ./main
 time for N =    100000000 , x = 2*y        :   0.437500000000000       s; x = 
   1.03704569309432
 time for N =    100000000 , x = scale(y, 1):   0.625000000000000       s; x = 
   1.14890448267426

mecej4 · June 21, 2022, 8:35pm

Whether you request optimization or not, the compiler may determine that an expression that is to be output may be calculated entirely at compile time, store that value in the EXE/a.out, and simply output that previously computed value.

Obviously, the result of the first loop is N+4, and it appears that the Intel compiler did not have to place code in the executable to calculate that result (elapsed time = 0).

We have seen an extreme example of such compile time calculation in a recent thread.

zoziha · June 22, 2022, 1:50am

You’re right, I modified the above example to use random numbers to avoid compiler compile-time optimizations, I don’t know if this really works.

All in all, I think scale(u, 1) is still slower than 2*u, maybe not that significant.

Ali · June 23, 2022, 6:29am

To the topic print_to_file:

The use of implied DO-loops leads to a better performance: fortran,print_to_file,8.043278.
However, the output will be transposed.

subroutine printfd(n)
integer, intent(in) :: n
integer :: i , unit
open(unit=1, file="/dev/null")
write(unit=1,fmt=*) (i,i=1,n)
write(unit=1,fmt=*) (i,i=2,n+1)
close(unit=1)
end subroutine

Previous version: fortran,print_to_file,23.939606

subroutine printfd(n)
integer, intent(in) :: n
integer :: i , unit
open(unit=1, file="/dev/null")
do i = 1, n
    write(unit=1,fmt=*) i, i+1
end do
close(unit=1)
end subroutine

Topic		Replies	Views
Julia: Fast as Fortran, Beautiful as Python	184	11731	November 13, 2022
Unbeatable micro benchmark Help	11	917	August 13, 2020
Simple summation 8x slower than in Julia	89	14995	April 2, 2022
The Computer Language Benchmarks Game Announcements	1	653	July 22, 2020
Comparing Fortran and Julia's Bessel function performance	69	4833	October 23, 2022

Improving Fortran Results in the Julia Micro-benchmarks

Related topics