I recently heard about the Julia microbenchmarks (also on GitHub here), where Fortran appears to be loosely ranked 6th (behind Go, Rust, LuaJIT, Julia & C) based on a collection of benchmark tests.
Fortran performs very well on 6 out of the 8 benchmarks, but falls somewhat behind on the hex integer parsing and file IO benchmarks.
For the Hex integer parsing benchmark, most of the time is taken up by an internal write statement that converts a random integer to a hexadecimal string (see this line).
I was able to achieve a substantial 78% speedup by replacing this internal write statement with a simple subroutine that converts an input integer to a hex string.
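The replacement is essentially something like the following sketch (simplified, not the exact code in my branch; the fixed 8-character width and digit handling here are illustrative):

! Simplified sketch of the replacement: build the 8-digit hex string
! directly from the low 4 bits of the value, one nibble at a time.
subroutine int_to_hex(n, s)
  integer, intent(in) :: n
  character(len=8), intent(out) :: s
  character(len=16), parameter :: hexdigits = '0123456789ABCDEF'
  integer :: i, m, d
  m = n
  do i = 8, 1, -1
    d = iand(m, 15) + 1      ! low nibble, shifted to a 1-based index
    s(i:i) = hexdigits(d:d)
    m = shiftr(m, 4)         ! move on to the next nibble
  end do
end subroutine int_to_hex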
I got a further 20% speedup by replacing the branching logic (see here) in the parse_int routine with a static lookup table, inspired by Ivan's (@ivanpribec) ASCII benchmarks.
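The lookup-table idea is roughly the following (illustrative names, not the actual benchmark code): each ASCII code maps directly to its digit value, so parsing a digit is a single array reference instead of a chain of comparisons.

module hex_lookup
  implicit none
  integer :: hexval(0:127) = -1   ! -1 marks characters that are not hex digits
contains
  subroutine init_hexval()
    integer :: i
    do i = 0, 9
      hexval(iachar('0') + i) = i
    end do
    do i = 0, 5
      hexval(iachar('A') + i) = 10 + i
      hexval(iachar('a') + i) = 10 + i
    end do
  end subroutine init_hexval

  integer function parse_hex(s) result(n)
    character(len=*), intent(in) :: s
    integer :: i
    n = 0
    do i = 1, len_trim(s)
      n = shiftl(n, 4) + hexval(iachar(s(i:i)))   ! one table lookup per digit
    end do
  end function parse_hex
end module hex_lookup

(init_hexval would be called once, outside the timed loop.)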
I don't know very much about benchmarking, so I'm posting here for some feedback:
In your opinion, is this a legitimate/justified improvement to the Fortran benchmark for which I should open a pull request to the Julia repo?
Are there any changes/corrections you would make to my new code?
Do you have similar experience of internal write being slow, and do you know why that is?
Thanks for improving the benchmark. You can also speed up the benchmarks by enabling all optimization options in gfortran, but as I document in the issue, my suggestion was not accepted.
As such, we should have our own repository of benchmarks, and ensure that they are fast and compiled properly.
I think so. Both approaches (internal write and your conversion subroutine) have the same outcome. I don't know how internal write works. Julia people might argue that you're using a different algorithm rather than the compiler-provided one, and thus cheating on the benchmark. AFAIK the Fortran standard doesn't specify how internal write should be implemented.
No, and I'm surprised as well. I'm curious how internal write works in gfortran/gcc. @kargl do you know?
Possibly related to Point 3…? (In the answer of the SO page below, the result 'Parallel (2) / marked with (*1)' is slowest, where an internal write is used for each array element.)
Formatted I/O is very complex and time-consuming. It involves interpreting the format, comparing the format against the type of the list item, and the conversion itself. There is also a lot of error checking involved. (I have worked extensively on DEC/Compaq/Intel formatted I/O support.)
Your enhancement looks both legitimate and efficient. Since the matter at hand is the Julia benchmark, and the Julia language happens to include an intrinsic function Base.String whose compiler implementation can be highly optimized and which they use to compare against other languages, it makes complete sense to employ a specific procedure for the task, especially when a language such as Fortran does not include any native facility for the same.
Your subprogram to 'write' an integer to a 'hex string' also looks like a good candidate for the 'string utils' section of the Fortran standard library; perhaps it can be extended further to transform an integer to a B, O, or Z string?
Though I personally think certain string utilities, such as transforming any of the other intrinsic types to CHARACTER and from CHARACTER back to the other intrinsic types, should be part of the Fortran language itself. These are common needs that every scientist and engineer coding in anger has experienced while working with data; they end up having to 'roll their own' solutions using IO instructions with an internal file. I think adding certain intrinsic procedures, similar to SPLIT in Fortran 202X, will only be par for the course should an element of the wish-list conveyed for Fortran at FortranCon last week - that Fortran should feel like play, not work - become an aspect of a vision for this language.
I agree we should have our own repository of benchmarks for a variety of reasons.
In the meantime I think itās worth contributing to an existing effort like the Julia one to improve the quality of the Fortran code there.
Regarding optimisation flags, I actually tried -march=native -ffast-math -funroll-loops, but this only noticeably improves the mandelbrot and iteration_pi_sum benchmarks, for which Fortran is already on par with C.
This is true; most noticeably, the benchmarks redundantly reallocate large arrays within the test loop. This is almost certainly intentional for direct comparison with Julia.
Thanks for the feedback @milancurcic & @FortranFan, I also think that this is a legitimate improvement. As you point out, Julia and some of the other language benchmarks use a specific routine for generating the hex string. I'll open a PR and see what they think.
Good point! I will have a go at extending the routine for general BOZ strings as an exercise and for stdlib.
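As a first rough sketch of the generalisation (my own names, assuming a non-negative input and base 2, 8 or 16; not an stdlib API):

function to_base_string(n, base) result(s)
  integer, intent(in) :: n, base               ! base expected to be 2, 8 or 16
  character(len=:), allocatable :: s
  character(len=16), parameter :: digit_chars = '0123456789ABCDEF'
  integer :: m, d
  m = n                                        ! assumes n >= 0
  s = ''
  do
    d = mod(m, base) + 1
    s = digit_chars(d:d) // s                  ! prepend the next digit
    m = m / base
    if (m == 0) exit
  end do
end function to_base_string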
Nice work! Iām happy to see the lookup tables can provide performance benefits.
Concerning your second question, I think it would be nicer to have an allocatable string (similar to the API in the stdlib string issue) when you copy from the tempchar. But I have the feeling you wanted to conform to the API as used in the driver program.
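Something like this is what I had in mind (just an untested sketch, keeping the fixed-length buffer internally but returning an allocatable string):

function hex_string(n) result(s)
  integer, intent(in) :: n
  character(len=:), allocatable :: s
  character(len=8) :: tempchar
  write(tempchar, '(z8.8)') n   ! or the faster conversion subroutine
  s = tempchar                  ! allocatable assignment, no fixed-length result
end function hex_string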
I second @FortranFan's comments, this would be a perfect addition to the standard library.
The pisum benchmark can be turbocharged with OpenMP:
real(dp) function pisum() result(s)
  ! dp is the real kind defined in the benchmark module
  integer :: j, k
  do j = 1, 500
    s = 0
    !$omp parallel do reduction(+:s)
    do k = 1, 10000
      s = s + 1._dp / k**2
    end do
    !$omp end parallel do
  end do
end function
and adding -fopenmp to the build line (for gfortran). This reduces the runtime on my laptop from ~4 s to ~1.5 s.
Well, with any reasonably optimizing Fortran compiler, the so-called 'iteration_pi_sum' case used in the Julia Micro-benchmark should show an immeasurably low run time, i.e., 0 s. A couple of the loops will be optimized away. Are they using -O0 with gfortran? It doesn't look like a good benchmark case.
Yes, there are many optimisations that can be made to the Fortran code, but for the Julia benchmark repo they need to be kept consistent with the other language implementations for a 'fair' comparison - unfortunately the definition of 'fair' is not clear.
@FortranFan, the benchmarks are run at several optimisation levels including -O3.
I think you are right that it isn't a good test case; the two things that I (a non-compiler-developer) notice are that:
The function result is independent of the number of outer-loop iterations, so maybe the outer loop could be removed completely?
The function result can be calculated entirely at compile time.
I had a play on godbolt.org (see Compiler Explorer) and I can see that neither of these has occurred (gfortran 9.3). However if you change the number of inner loop iterations to only 17, then it does become a compile-time constant - so presumably there is a trade-off between compile-time and runtime going on here?
Update: the compiler will remove all loops, if you use an implied do-loop, see Compiler Explorer.
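For reference, the implied-do variant looks roughly like this (a sketch; dp is the kind defined in the benchmark module):

real(dp) function pisum_implied() result(s)
  integer :: j, k
  do j = 1, 500                                ! redundant outer loop kept as in the benchmark
    s = sum([(1._dp / k**2, k = 1, 10000)])    ! gfortran folds this to a constant
  end do
end function pisum_implied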
@ikedward Yes this argument frequently comes up in benchmarking and you've hit the nail on the head: what exactly does 'fair' mean in this context? All the implementations should use the same algorithm so timing differences are not due to algorithmic differences, but compiler optimizations should be turned up, and if one language can take advantage of multi-threading and another can't then that's +1 for threading. It's arguable that for Fortran OpenMP isn't strictly part of the language but do concurrent is (although I couldn't get that to compile).
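(For the record, what I attempted was roughly the following; the reduce locality specifier on do concurrent is not in Fortran 2018 - it is only proposed for 202X - and gfortran rejects it, which is presumably why it would not compile.)

real(dp) function pisum_dc() result(s)
  integer :: j, k
  do j = 1, 500
    s = 0
    ! reduce(...) on do concurrent is not in Fortran 2018, so gfortran rejects this
    do concurrent (k = 1:10000) reduce(+:s)
      s = s + 1._dp / k**2
    end do
  end do
end function pisum_dc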
Another POV is that really all you're testing is the compiler's ability to create efficient machine code and that therefore the results say nothing about the language (I don't agree with that BTW).
According to the last sentence of the benchmark, all the other languages seem to be using only one core (serial execution), so if that is the case, I guess OpenMP may not be the way to go…
(Apart from that, it is interesting that OpenMP gives acceleration from 4 to 1.6 sec in this 'pi' calculation (with possibly 4 cores?). If a parallel version of the benchmark is to be made, I guess it would also be an interesting comparison.)
Btw, I remember I sometimes saw a comment (on the net) asking 'why not use Intel Fortran also (for such benchmarking)?' Although ifort is not necessarily faster, I feel it would also be an interesting comparison (given the results on the benchmarks game site, for example).
The article published the benchmark codes: Bitbucket, and also notes in the conclusion:
… / Julia is in development /
Julia is a dynamic language, so in most cases it will win against Fortran in development speed.
The last point is something I would like to fix with LFortran and our other fortran-lang related tasks (stdlib, fpm, …). I think Fortran can be made as easy to develop in as Python or Julia.
The authors do mention that Julia could be made faster using the StaticArrays.jl package. For small arrays the LLVM compiler can then perform loop unrolling, giving a significant speedup.