Improving Fortran Results in the Julia Micro-benchmarks

lkedward · July 8, 2020, 2:39pm

I recently heard about the Julia microbenchmarks (also on GitHub here), where Fortran appears to be loosely ranked 6th (behind Go, Rust, LuaJIT, Julia & C) based on a collection of benchmark tests.
Fortran performs very well on 6 of the 8 benchmarks, but falls somewhat behind on the hex integer parsing and file I/O benchmarks.

For the Hex integer parsing benchmark, most of the time is taken up by an internal write statement that converts a random integer to a hexadecimal string (see this line).

I was able to achieve a substantial 78% speedup by replacing this internal write statement with a simple subroutine that converts an input integer to a hex string.
I got a further 20% speedup by replacing the branching logic (see here) in the parse_int routine with a static lookup table, inspired by Ivan’s (@ivanpribec) ascii benchmarks.
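
A minimal sketch of the two ideas above, for readers who want something concrete. This is not the code from the diff linked below: the module and procedure names (hex_utils, int_to_hex, parse_hex) are made up for illustration, and the table is built with the index and achar intrinsics in a constant expression, which is an assumption about what the compiler accepts.

```fortran
module hex_utils
    implicit none
    private
    public :: int_to_hex, parse_hex

    character(len=*), parameter :: hex_lower = '0123456789abcdef'
    character(len=*), parameter :: hex_upper = '0123456789ABCDEF'

    integer :: i   ! implied-do index used only in the table initialiser below

    ! Static lookup table indexed by ASCII code:
    ! 0..15 for a valid hex digit (either case), -1 for anything else.
    integer, parameter :: hex_value(0:127) = &
        [(max(index(hex_lower, achar(i)), index(hex_upper, achar(i))) - 1, i = 0, 127)]

contains

    ! Convert an integer to a lowercase hex string without formatted I/O.
    ! Negative values are rendered as their raw bit pattern.
    pure function int_to_hex(n) result(hex)
        integer, intent(in) :: n
        character(len=:), allocatable :: hex
        character(len=bit_size(0)/4) :: buf   ! one character per 4 bits
        integer :: m, pos, digit

        m = n
        pos = len(buf)
        do
            digit = iand(m, 15)                       ! lowest 4 bits
            buf(pos:pos) = hex_lower(digit+1:digit+1)
            m = ishft(m, -4)                          ! logical shift right
            if (m == 0) exit
            pos = pos - 1
        end do
        hex = buf(pos:)
    end function int_to_hex

    ! Parse a hex string using the lookup table instead of branching
    ! on character ranges.
    function parse_hex(s) result(n)
        character(len=*), intent(in) :: s
        integer :: n
        integer :: k, c, d

        n = 0
        do k = 1, len(s)
            c = iachar(s(k:k))
            d = -1
            if (c <= 127) d = hex_value(c)
            if (d < 0) error stop 'parse_hex: invalid hex digit'
            n = ior(ishft(n, 4), d)                   ! n*16 + d without overflow
        end do
    end function parse_hex

end module hex_utils
```

With this module, int_to_hex(255) returns 'ff' and parse_hex('ff') returns 255. The baseline it stands in for is an internal write with a Z edit descriptor, which goes through the full formatted I/O machinery for every conversion.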

I don’t know very much about benchmarking, so I’m posting here for some feedback:

1. In your opinion, is this a legitimate/justified improvement to the Fortran benchmark for which I should open a pull request to the Julia repo?
2. Are there any changes/corrections you would make to my new code?
3. Do you have similar experience of internal write being slow and know why it is so?

You can see a full diff of my changes here:

certik · July 8, 2020, 2:59pm

I recently posted about adding benchmarks to our website here:

https://github.com/fortran-lang/fortran-lang.org/issues/117

Thanks for improving the benchmark. You can also speed up the benchmarks by enabling all optimization options in gfortran, but as I document in the issue, my suggestion was not accepted.

As such, we should have our own repository of benchmarks, and ensure that they are fast and compiled properly.

milancurcic · July 8, 2020, 3:06pm

I think so. Both approaches (internal write and your conversion subroutine) have the same outcome. I don’t know how internal write works. Julia people might argue that you’re using a different algorithm rather than the compiler provided one, and thus, cheating on the benchmark. AFAIK the Fortran standard doesn’t specify how internal write should be implemented.

No and I’m surprised as well. I’m curious how internal write works in gfortran/gcc. @kargl do you know?

septc · July 8, 2020, 3:16pm

Possibly related to Point 3…? (In the answer of the SO page below, the result “Parellel (2) / marked with (*1)” is slowest, where an internal write is used for each array element.)

sblionel · July 8, 2020, 11:12pm

Formatted I/O is very complex and time-consuming. There is interpreting the format, comparing format against the type of the list item, and the conversion itself. There is a lot of error checking involved. (I have worked extensively on DEC/Compaq/Intel formatted I/O support.)

FortranFan · July 9, 2020, 3:11am

@lkedward,

Your enhancement looks both legitimate and efficient. Since the matter on hand is Julia benchmark and the Julia language happens to include an intrinsic function Base.String for which the compiler implementation can be highly optimized and which they are using to compare against other languages, it does make complete sense to employ a specific procedure for the task, especially when a language such as Fortran does not include any native facility for the same.

Your subprogram to “write” an integer to a “hex string” also looks like a good candidate for “string utils” section of the Fortran standard library, perhaps it can be extended further to transform an integer to B or O or Z string?

Though I personally think certain string utilities, such as transforming any of the other intrinsic types to CHARACTER and from CHARACTER back to the other intrinsic types, should be part of the Fortran language itself. For these are common needs that every scientist and engineer coding in anger has experienced while working with data. They end up having to “roll their own” solutions using IO instructions with an internal file. I think adding certain intrinsic procedures, similar to SPLIT in Fortran 202X, will only be par for the course should an element of the wish-list conveyed for Fortran at FortranCon last week - that Fortran should feel like play not work - become an aspect of a vision for this language.

Kudos on the speedup you achieved.

lkedward · July 9, 2020, 8:22am

I agree we should have our own repository of benchmarks for a variety of reasons.
In the meantime I think it’s worth contributing to an existing effort like the Julia one to improve the quality of the Fortran code there.

Regarding optimisation flags, I actually tried -march=native -ffast-math -funroll-loops, but this only noticeably improves the mandelbrot and iteration_pi_sum benchmarks, for which Fortran is already on par with C.

lkedward · July 9, 2020, 8:36am

This is true; most noticeably, the benchmarks redundantly reallocate large arrays within the test loop. This is almost certainly intentional for direct comparison with Julia.

lkedward · July 9, 2020, 8:51am

Thanks for the feedback @milancurcic & @FortranFan, I also think that this is a legitimate improvement. As you point out, Julia and some of the other language benchmarks use a specific routine for generating the hex string. I’ll open a PR and see what they think.
Good point! I will have a go at extending the routine for general BOZ strings as an exercise and for stdlib.

ivanpribec · July 10, 2020, 9:03am

Nice work! I’m happy to see the lookup tables can provide performance benefits.

Concerning your second question, I think it would be nicer to have an allocatable string (similar to the API in the stdlib string issue) when you copy from the tempchar. But I have the feeling you wanted to conform to the API as used in the driver program.

I second @FortranFan’s comments, this would be a perfect addition to the standard library.

simong · July 10, 2020, 11:01pm

The pisum benchmark can be turbocharged with OpenMP:

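real(dp) function pisum() result(s)
    ! dp is assumed to be a double-precision kind defined in the enclosing scope
    integer :: j, k
    do j = 1, 500
        s = 0.0_dp
        ! Split the inner sum across threads; partial sums are combined into s
        !$omp parallel do reduction(+:s)
        do k = 1, 10000
            s = s + 1.0_dp / real(k, dp)**2
        end do
    end do
end function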

and adding -fopenmp to the build line (for gfortran). This reduces the runtime on my laptop from ~4s to ~1.5s.

FortranFan · July 11, 2020, 3:22am

Well, with any reasonably optimizing Fortran compiler, the so-called “iteration_pi_sum” used in the Julia Micro-benchmark should show an immeasurably low run-time i.e., 0s. A couple of the loops will be optimized away. Are they using -O0 with gfortran? It doesn’t look like a good benchmark case.

lkedward · July 11, 2020, 9:30am

Yes, there are many optimisations that can be made to the Fortran code, but for the Julia benchmark repo they need to be kept consistent with the other language implementations for a ‘fair’ comparison - unfortunately the definition of ‘fair’ is not clear.

@FortranFan, the benchmarks are run at several optimisation levels including -O3.
I think you are right that it isn’t a good test case; the two things that I (a non compiler-developer) notice are that:

1. The function result is independent of the number of outer-loop iterations, so maybe the outer loop could be removed completely?
2. The function result can be calculated entirely at compile time.

I had a play on godbolt.org (see Compiler Explorer) and I can see that neither of these has occurred (gfortran 9.3). However if you change the number of inner loop iterations to only 17, then it does become a compile-time constant - so presumably there is a trade-off between compile-time and runtime going on here?

Update: the compiler will remove all loops, if you use an implied do-loop, see Compiler Explorer.
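
A minimal sketch of what an implied do-loop version could look like next to the explicit-loop form. This is not necessarily the code behind the Compiler Explorer link: the module and function names (pisum_variants, pisum_loop, pisum_implied) are made up for illustration, and the kind dp is taken from iso_fortran_env as an assumption.

```fortran
module pisum_variants
    use iso_fortran_env, only: dp => real64
    implicit none
    private
    public :: pisum_loop, pisum_implied

contains

    ! Explicit loops, mirroring the structure of the benchmark:
    ! the outer loop recomputes the same sum 500 times.
    function pisum_loop() result(s)
        real(dp) :: s
        integer :: j, k

        do j = 1, 500
            s = 0.0_dp
            do k = 1, 10000
                s = s + 1.0_dp / real(k, dp)**2
            end do
        end do
    end function pisum_loop

    ! Same sum written as an array constructor with an implied do-loop,
    ! reduced with the sum intrinsic. The redundant outer loop is gone.
    function pisum_implied() result(s)
        real(dp) :: s
        integer :: k

        s = sum([(1.0_dp / real(k, dp)**2, k = 1, 10000)])
    end function pisum_implied

end module pisum_variants
```

Both functions should agree to within rounding error, since they add the same 10000 terms. Whether a given compiler folds either form into a constant depends on the compiler version and optimisation level, so it is worth checking the generated assembly rather than assuming it.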

simong · July 11, 2020, 9:51am

@lkedward Yes, this argument frequently comes up in benchmarking and you’ve hit the nail on the head: what exactly does “fair” mean in this context? All the implementations should use the same algorithm so that timing differences are not due to algorithmic differences, but compiler optimizations should be turned up, and if one language can take advantage of multi-threading and another can’t, then that’s +1 for threading. It’s arguable that for Fortran OpenMP isn’t strictly part of the language, but do concurrent is (although I couldn’t get that to compile).
Another POV is that really all you’re testing is the compiler’s ability to create efficient machine code and that therefore the results say nothing about the language (I don’t agree with that BTW).
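
For reference, a minimal sketch of one way do concurrent could be applied to the pisum kernel. This is an assumption about what such a version might look like, not the code that failed to compile above. The iterations of a plain do concurrent must be independent, so they cannot accumulate into a shared scalar; the sketch therefore stores the terms in a temporary array (terms, a name made up here, as are pisum_concurrent and pisum_dc) and sums them afterwards.

```fortran
module pisum_concurrent
    use iso_fortran_env, only: dp => real64
    implicit none
    private
    public :: pisum_dc

contains

    function pisum_dc() result(s)
        real(dp) :: s
        real(dp) :: terms(10000)   ! temporary storage, not part of the benchmark
        integer :: j, k

        do j = 1, 500
            ! Each iteration writes a distinct element, so the iterations
            ! are independent as do concurrent requires.
            do concurrent (k = 1:size(terms))
                terms(k) = 1.0_dp / real(k, dp)**2
            end do
            s = sum(terms)
        end do
    end function pisum_dc

end module pisum_concurrent
```

Note that do concurrent only tells the compiler the iterations may run in any order; whether it actually produces parallel or vectorised code is up to the compiler and its flags. The temporary array also changes the memory behaviour relative to the other languages' implementations, so this variant would probably not count as the same algorithm for the Julia repo.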

septc · July 11, 2020, 10:11am

According to the last sentence of the benchmark, all the other languages seem to be using only one core (serial execution), so if that is the case, I guess OpenMP may not be the way to go…

(Apart from that, it is interesting that OpenMP speeds up this “pi” calculation from 4 to 1.6 sec (with possibly 4 cores?). If a parallel version of the benchmark is to be made, I guess it would also be an interesting comparison.)

Btw, I remember I sometimes saw a comment (on the net) that “why not use Intel Fortran also (for such benchmarking)?” Although ifort is not necessarily faster, I feel it would also be an interesting comparison (given the result of the benchmarkgame site, for example).

ivanpribec · July 13, 2020, 5:50pm

I just stumbled upon the paper Statistically Significant Comparative Performance Testing of Julia and Fortran Languages in Case of Runge–Kutta Methods. Their final conclusion was Fortran is faster in a Runge-Kutta benchmark.

certik · July 13, 2020, 8:24pm

Great find @ivanpribec. This would be a great addition to fortran-lang/benchmarks, so I just created an issue for it: Add Runge-Kutta benchmarks (Issue #5, fortran-lang/benchmarks).

The article published the benchmark codes: Bitbucket, and also notes in the conclusion:

… / Julia is in development /

Julia is a dynamic language, so in most cases it will win against Fortran in development speed.

The last point is something I would like to fix with LFortran and our other fortran-lang related tasks (stdlib, fpm, …). I think Fortran can be made as easy to develop in as Python or Julia.

ivanpribec · July 14, 2020, 11:58am

Thanks @certik.

The authors do mention that Julia could be made faster using the StaticArrays.jl package. For small arrays the LLVM compiler can then perform loop unrolling giving significant speedup.

There is a similar issue for the n-body benchmark in The Computer Language Benchmarks Game. Someone left some comments about it on Twitter indicating more effort had been put into making the Julia code performant than the Fortran one.

ivanpribec · July 14, 2020, 12:58pm

I found another Twitter thread, this one compares Fortran and Julia for a multi-threaded pi calculation benchmark: https://twitter.com/owainkenway/status/1227595296173182978

certik · July 14, 2020, 3:41pm

@ivanpribec, @septc, can you please create new issues at https://github.com/fortran-lang/benchmarks/issues for each benchmark that you found? Those would all be great additions to our benchmark suite.
