Improving Fortran Results in the Julia Micro-benchmarks

My improvements to this Fortran benchmark have now been merged into the Julia benchmarks repository and the benchmarks page has been updated.

Edit: as pointed out by Milan, it looks like the online results page has not actually been updated yet.

6 Likes

Great work, @lkedward! I suppose the update will eventually propagate to the plot on the Julia website (I don’t think it has yet)?

1 Like

Very disturbing… Fortran faster than Julia? What?? :smile:

2 Likes

I need the functionality for a very low-volume niche need, so I am not sure if I ever timed them, but note there are several routines in M_string (a registered fpm(1) package, so easy to access), some of which I do not even list in the man-pages that convert to and from anything from base 2 to base 36 using what I think is a classic algorithm:

$@(#) M_strings::base(3f): convert whole number string in base [2-36] to string in alternate base [2-36]
$@(#) M_strings::decodebase(3f): convert whole number string in base [2-36] to base 10 number
$@(#) M_strings::codebase(3f): convert whole number in base 10 to string in base [2-36]
$@(#) M_strings::todecimal(3f): given string and base return decimal integer
$@(#) M_strings::todecimal(3f): given integer and base return string

that might be pertinent if looking to generalize base conversion for stdlib.

2 Likes

It is interesting in this thread, @mecej4 writes:

I had a very similar experience a couple of years ago when this thread by @lkedward first caught my attention. I do feel good to know I am not the only one who came away with the same impression as @mecej4 on Julia.

I was aware of Julia as a possibly leading contender in the near future in the HPC domain but had not looked into it until this thread surfaced here.

In addition to the comments by @mecej4 , I felt many of the claims among the Julia crowd were based on the microbenchmarks that were setup by the ardent users of Julia. But so many of those microbenchmarks were highly questionable in my opinion, especially when it came to other languages used in their comparisons. Honestly, the codes used in the microbenchmarks are poorer than beginner level.

I personally don’t think code such as the following to reveal anything useful, let alone be a measure for anything. But this is what the Julia folks use to claim that their “parse integers” test shows Julia faster than Fortran:

    call system_clock(t1)
    do k2 = 1, NRUNS
        do k = 1, 1000
            call random_number(s1)
            n = int(s1*huge(n))
            call hex_string(n,s)
            m = parse_int(s(:len_trim(s)), 16)
            call assert(m == n)
        end do
    end do
    call system_clock(t2)
3 Likes

The hex_string subroutine in perf.f90 generates a hex representation of a 32-bit integer. The present implementation uses modulo() and integer divisions by 16 to generate the hex string, which is then used as the input for the parsing routine. The following replacement uses shifts and masks, instead, and on my PC reduced the parse_integers microbenchmark time from 0.130 s to 0.047 s. I have offered this replacement code as an issue on the Julia microbenchmarks Github.

! Convert an integer to a hex string
!
subroutine hex_string(dec,hexchar)
  integer, intent(in) :: dec
  character(*) :: hexchar

  integer :: i, quotient

  character(len=1), parameter :: table(0:15) = &
  [(char(i),i=ichar('0'),ichar('9')),(char(i),i=ichar('A'),ichar('F'))]

  quotient = dec
  hexchar  = '00000000'
  i = 8
  do while (quotient /= 0 .and. i > 0)
      hexchar(i:i) = table(iand(quotient,15))
      i = i-1
      quotient = ishft(quotient,-4)
  end do

end subroutine hex_string
3 Likes

The comparison of the benchmarks really makes sense only if all languages implement the same algorithm. Otherwise we have comparison of apples and orangutans. Example: solving a system of equations with Gauss eliminations vs. Cramer’s rule.

1 Like

I agree mostly, but we have to make allowances where one language does not provide means for implementing certain algorithms (for example, standard Fortran 77 did not provide bit manipulation intrinsics, but C did), and so workarounds have to be used and these may result in poor performance.

1 Like

Thanks to @oscardssmith’s interest, initiative and help, these changes have been accepted and merged into the Julia microbenchmarks test suite. With these changes, Fortran is the fastest for the parse_integer test among all the languages tested.

C       0.1635
Fortran 0.074126 (was 0.682692)
Go      0.097109
Java    0.275906
Julia   0.123103
Python  1.39308
Rust    0.155348

One should not put too much value on these timings, since the codes for the other languages can also benefit from similar changes. It is a quirk of the parse_integer benchmark that the time for creating the test strings is included in the “parsing time” and, indeed, my improvements were confined to the lines of code for this preparatory task.

It is to be hoped that the graph and the CSV table will be updated soon to reflect this change.

13 Likes

My shallow opinion is that, believe it or not, it does not really matter what algorithms are used, because there will never really be (and no need to be) apple to apple comparisons.

Because at the end of the day, the real question is really extremely simple, that is,

To achieve the same result, what programming language is the fastest?

PS.

In your example, solving a system of equations, I personally think that, it does not matter one uses Gauss eliminations or Cramer’s rule or anything else. At the end, what really matters is just what language (it can use any algorithms it prefers) solves the system of equations the fastest. That is all.

I mean it is like soccer, in the World Cup, the goal is to win the World Cup. It does not matter what formation you use, 442, 343, 41212, 352, 532, …etc. It does not matter what strategy you used. You are the best if you are the champion.

Why Fortran’s print_to_file is so slow in the Julia benchmark?

If you see in detail, the print_to_file is slow compared to many other counterparts such as c, go, jullia, lua, python, and rust. Really surprising.

Try using different Fortran compilers and comparing. Formatted writes have been noticeably slow with some versions of Gfortran, especially on Windows.

1 Like

the print_to_file uses the following Fortran file :
(link:Microbenchmarks/perf.f90 at a963d284b09d04b3e0374f6dd46ec4b039ed5569 ¡ JuliaLang/Microbenchmarks ¡ GitHub):


subroutine printfd(n)
integer, intent(in) :: n
integer :: i , unit
open(unit=1, file="/dev/null")
do i = 1, n
    write(unit=1, fmt=*) i, i+1
end do
close(unit=1)
end subroutine

For list-directed output of real type list items, the processor supplies its own values for w, d and e in Fw.d, Ew.dEe, etc. The format conversion has to be performed, nevertheless.

Your use of he phrase “free format” in the context of output rather than Fortran source code format can be a source of confusion.

Yes, I agree. The w,d, and e are provided by the processor itself and not by the user. I will update the my previous reply.

1 Like

In addition to the topic of print_to_file, in the topic of matrix_statistics, I found that the scale function is used in that code (rnorm function):

DO
    CALL RANDOM_NUMBER( u )
    CALL RANDOM_NUMBER( v )
    u = SCALE( u, 1 ) - one
    v = SCALE( v, 1 ) - one
    sum = u*u + v*v + vsmall         ! vsmall added to prevent LOG(zero) / zero
    IF(sum < one) EXIT
  END DO
  sln = SQRT(- SCALE( LOG(sum), 1 ) / sum)
  fn_val = u*sln

As far as I know, simply using 2*u to replace the code like scale(u, 1) can be much (a bit) faster:

DO
    CALL RANDOM_NUMBER( u )
    CALL RANDOM_NUMBER( v )
    u = 2*u - one
    v = 2*v - one
    sum = u*u + v*v + vsmall         ! vsmall added to prevent LOG(zero) / zero
    IF(sum < one) EXIT
  END DO
  sln = SQRT(- 2*LOG(sum) / sum)
  fn_val = u*sln

Simple example (Incorrect):

Summary
program main

    real(8) :: t1, t2, x, y
    integer, parameter :: N = 100000000

    print *, "x = ", x, "; y = ", y

    call cpu_time(t1)
    do i = 1, N
        x = 2*y + i
    end do
    call cpu_time(t2)
    print *, "time for N = ", N, ", x = 2*y        : ", t2 - t1, " s; x = ", x

    call cpu_time(t1)
    do i = 1, N
        x = scale(y, 1) + i
    end do
    call cpu_time(t2)
    print *, "time for N = ", N, ", x = scale(y, 1): ", t2 - t1, " s; x = ", x

end program main

Results:
(Compiler optimization options are not used, because with optimization, Fortran runs too fast, and the running time is 0)

# On WSL2-Debian gfortran 10.3
>> gfortran main.f90 && ./a.out
 x =    9.8813129168249309E-324 ; y =    2.0000000000000000     
 time for N =    100000000 , x = 2*y        :   0.16183899999999998       s; x =    100000004.00000000     
 time for N =    100000000 , x = scale(y, 1):   0.72182100000000005       s; x =    100000004.00000000
# On MSYS2-ucrt64-gfortran 12.1
>> gfortran main.f90 && ./a
 x =    6.9530460433855700E-310 ; y =    2.0000000000000000     
 time for N =    100000000 , x = 2*y        :   0.15625000000000000       s; x =    100000004.00000000     
 time for N =    100000000 , x = scale(y, 1):   0.60937500000000000       s; x =    100000004.00000000  
# Windows10-ifort 2021.6.0
>> ifort main.f90 && ./main
 x =   0.000000000000000E+000 ; y =    2.00000000000000     
 time for N =    100000000 , x = 2*y        :   0.000000000000000E+000  s; x =
   100000004.000000
 time for N =    100000000 , x = scale(y, 1):   3.125000000000000E-002  s; x = 
   100000004.000000

Modified example:

program main

    real(8) :: t1, t2, x, y = 2.0
    integer, parameter :: N = 100000000

    call cpu_time(t1)
    do i = 1, N
        call random_number(y)
        x = 2*y
    end do
    call cpu_time(t2)
    print *, "time for N = ", N, ", x = 2*y        : ", t2 - t1, " s; x = ", x

    call cpu_time(t1)
    do i = 1, N
        call random_number(y)
        x = scale(y, 1)
    end do
    call cpu_time(t2)
    print *, "time for N = ", N, ", x = scale(y, 1): ", t2 - t1, " s; x = ", x

end program main

Results:

# WSL2-Debian gfortran 10.3
>> gfortran -Ofast main.f90 && ./a
 time for N =    100000000 , x = 2*y        :   0.78022399999999992       s; x =    1.1880948469991348     
 time for N =    100000000 , x = scale(y, 1):    1.4326160000000001       s; x =    1.6401134682038268
# MSYS2-ucrt64-gfortran 12.1
>> gfortran -Ofast main.f90 && ./a
 time for N =    100000000 , x = 2*y        :    2.5156250000000000       s; x =    1.1728265567290517     
 time for N =    100000000 , x = scale(y, 1):    3.0625000000000000       s; x =    1.5269402525788756
# On Windows-ifort 2021.6.0
>> ifort /fast main.f90 && ./main
 time for N =    100000000 , x = 2*y        :   0.437500000000000       s; x = 
   1.03704569309432
 time for N =    100000000 , x = scale(y, 1):   0.625000000000000       s; x = 
   1.14890448267426 
1 Like

Whether you request optimization or not, the compiler may determine that an expression that is to be output may be calculated entirely at compile time, store that value in the EXE/a.out, and simply output that previously computed value.

Obviously, the result of the first loop is N+4, and it appears that the Intel compiler did not have to place code in the executable to calculate that result (elapsed time = 0).

We have seen an extreme example of such compile time calculation in a recent thread.

You’re right, I modified the above example to use random numbers to avoid compiler compile-time optimizations, I don’t know if this really works.

All in all, I think scale(u, 1) is still slower than 2*u, maybe not that significant.

To the topic print_to_file:

The use of implied DO-loops leads to a better performance: fortran,print_to_file,8.043278.
However, the output will be transposed.

subroutine printfd(n)
integer, intent(in) :: n
integer :: i , unit
open(unit=1, file="/dev/null")
write(unit=1,fmt=*) (i,i=1,n)
write(unit=1,fmt=*) (i,i=2,n+1)
close(unit=1)
end subroutine

Previous version: fortran,print_to_file,23.939606

subroutine printfd(n)
integer, intent(in) :: n
integer :: i , unit
open(unit=1, file="/dev/null")
do i = 1, n
    write(unit=1,fmt=*) i, i+1
end do
close(unit=1)
end subroutine
1 Like