Let’s create an unbeatable micro-benchmark where we can show how Fortran, the immortal programming language, is better than other scientific languages.
Which algorithm do you think could be the first one to be tested?
Today I have begun creating a small and simple benchmark, but I am afraid that I am not able to optimize the Fortran version enough. I am just at the beginning with Fortran.
Round 1. Fight
--------------
FORTRAN
real 0m1,235s
user 0m1,219s
sys 0m0,016s
JULIA
real 0m0,470s
user 0m0,673s
sys 0m0,316s
After compiling my own libblas.a with the -O3 flag, I got these results. They are better, of course, but either they are not what I am looking for or I am not able to understand them.
Round 1.1 Fight
---------------
FORTRAN
real 0m0,663s
user 0m0,647s
sys 0m0,016s
JULIA
real 0m0,477s
user 0m0,694s
sys 0m0,298s
Nice start! Which BLAS implementation are you using out of interest? I would strongly suggest using an optimised one like OpenBLAS.
Some ideas for you:

- Isolate the dgemm call as the micro-benchmark goal (i.e. benchmark one thing at a time).
- See cpu_time and system_clock for calculating the time within Fortran (I forget which one is best for benchmarking…).
This first test case is really benchmarking a library call so I would ultimately expect both Fortran and Julia to be effectively equal since they are both calling an optimised BLAS implementation. Is that the normal way to multiply matrices in Julia?
We’re hoping to put together a repo of Fortran benchmarks, see here - feel free to contribute there also
After following the advice of Ikedward, I downloaded OpenBLAS. It was really easy to compile. After that, I tested both dynamic and static linking. Finally, I got these results with the dynamic version; the static version is also much faster than the Julia version.
Round 1.2. Fight!
-----------------
FORTRAN
real 0m0,102s
user 0m0,277s
sys 0m0,087s
JULIA
real 0m0,485s
user 0m0,663s
sys 0m0,329s
Do not forget that Julia has a well-known issue called the “time to first plot” problem (code is JIT-compiled on first use), so total times matter, and a compiled language such as Fortran takes advantage of this.
Now I would like to create some benchmarks based on intensive use of matrices and create some plots with gnuplot to show the results. I am open to your ideas and suggestions. Thank you!
Nice try!
BLAS libraries are written in Fortran, so the timing of this test doesn’t give a definitive answer. In fact, you’ll be testing the efficiency of the two BLAS (OpenBLAS) libraries.
Some other ideas:

- Create your own matrix–matrix multiplication subroutine (the Fortran intrinsic matmul might use a BLAS subroutine!).
- Instead of random initialisation, you can try an explicit expression (like i+j).
- As suggested by Ikedward, run it several times, and you can try increasing the matrix size.
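To illustrate the first two suggestions, here is a minimal sketch in plain Python (no NumPy): a hand-written triple-loop matrix multiplication with a deterministic i+j initialisation, so every language being compared computes exactly the same numbers. The function name and loop order are my own choices; the same structure translates line for line into a Fortran subroutine or a Julia function.

```python
def matmul_naive(a, b):
    """Hand-written matrix multiplication, C = A * B (triple loop).

    Loop order i-k-j keeps the inner loop walking a row of `b`
    contiguously, which is usually friendlier to the cache than i-j-k.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c

# Deterministic initialisation instead of random numbers:
n = 4
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i + j) for j in range(n)] for i in range(n)]
c = matmul_naive(a, b)
```

With identical inputs across languages, the results can be checked against each other, which also guards against a compiler optimising away a computation whose result is never used.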
Have you looked at https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/julia-ifc.html? It appears to be derived from The Great Language Shootout (circa 2003).
The original test:
λ Measure-Command{.\Fortran1}
Seconds : 0
Milliseconds : 124
Ticks : 1243009
λ Measure-Command{julia test.jl}
Seconds : 0
Milliseconds : 496
Ticks : 4960004
I ran the same program with n = 10,000, which is the furthest I could go in Julia before it complained. Because I am on Windows, I had to use allocatable arrays due to the image size restriction.
The results are interesting:
λ Measure-Command{.\fortran2}
Minutes : 0
Seconds : 15
Milliseconds : 754
λ Measure-Command{julia test2.jl}
Minutes : 0
Seconds : 15
Milliseconds : 537
With Fortran I used Intel’s compiler against MKL, linked with /Qmkl:parallel (/Qmkl:sequential was 3x slower); latest Julia 1.5 on Windows 10.
I’m impressed that Julia’s out-of-the-box BLAS library is as fast as Intel MKL (on my machine at least).
The test I created just measures how fast a BLAS function is called from Julia.
BLAS is a Fortran library, so with this test we are measuring the same Fortran function called natively (from Fortran) or called from Julia.
Something like this:
# Julia code
for i = 1:1000000
    invoke_a_fortran_function()
end
The benchmark is unbeatable because we are comparing Fortran with Fortran.
This isn’t quite correct; BLAS/LAPACK really only defines a standard set of C and Fortran interfaces; the actual implementation isn’t necessarily written in Fortran. Netlib BLAS/LAPACK is only really a human-readable implementation provided for reference. ‘Production’ implementations such as OpenBLAS and Intel MKL are written in assembly at the lowest levels in order to optimise for particular platforms and instruction sets (see here for example). That is why OpenBLAS is so much faster than Netlib BLAS.
I can’t be sure but I think Julia ships with the OpenBLAS implementation so in this case you are correct to say that you are essentially benchmarking the language differences in the library call.
Calling a BLAS routine is not a very good benchmark, even if you implement it by hand in each language, because many languages such as Fortran have built-in operations like matmul and dot_product that people should use instead of reimplementing such operations by hand.
A better benchmark would be some more complicated algorithm, say some iterative eigensolver or linear solver, or some other numerical task such as solving an ODE using some method.
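As a sketch of what such a benchmark could look like, here is a hypothetical example of a small iterative eigensolver (plain power iteration) in Python; the function name and iteration count are my own, but the kernel is a matrix–vector product done in the language itself rather than in a BLAS call, so it translates directly to Fortran or Julia and exercises the language's own loops.

```python
import math

def power_iteration(a, iters=200):
    """Estimate the dominant eigenvalue of square matrix `a` by
    repeatedly applying it to a vector and renormalising."""
    n = len(a)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        # Matrix-vector product: the O(n^2) work done per iteration.
        w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        # Once v is a unit vector, ||A v|| estimates the dominant eigenvalue.
        lam = math.sqrt(sum(x * x for x in w))
        v = [x / lam for x in w]
    return lam

# For [[2, 1], [1, 2]] the eigenvalues are 3 and 1, so this converges to 3.
dominant = power_iteration([[2.0, 1.0], [1.0, 2.0]])
```

An algorithm like this benchmarks loop performance, memory access, and per-iteration overhead together, which is much closer to real scientific workloads than a single library call.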
For comparison, I’ve tried the matrix multiplication part with Python3.8. The code is like this:
import numpy as np
np.show_config()
N = 10000
a = np.zeros( (N, N) )
b = np.zeros( (N, N) )
c = np.dot( a, b )
print( c[0, 0] )
Python 3.8 installed via Homebrew gives this output (timed with the time command):
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
0.0
real 0m26.933s
user 1m43.801s
sys 0m1.947s
Here, the CPU is Core i7-3615QM CPU @ 2.30GHz (on Mac mini 2012, very old), and 4 cores are used in parallel (according to htop, with 400% CPU usage). The timing fluctuates somewhat because I am opening three browsers and playing music… (so please do not take this as a rigorous comparison XD)
Python3.8 in Anaconda3 (2019) gives this output (on the same machine):
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Volumes/.../anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Volumes/.../anaconda3/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Volumes/.../anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Volumes/.../anaconda3/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Volumes/.../anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Volumes/.../anaconda3/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Volumes/.../anaconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Volumes/.../anaconda3/include']
0.0
real 0m24.285s
user 1m32.364s
sys 0m2.319s
So, Anaconda uses MKL, and 4 cores are used in parallel (according to htop).
Here is similar code with Julia 1.4:
function test()
    N = 10000
    a = zeros( N, N )
    b = zeros( N, N )
    c = a * b
    println( c[1, 1] )
end

@time test()
which gives
0.0
25.960560 seconds (510.33 k allocations: 2.258 GiB, 0.38% gc time)
real 0m27.348s
user 1m39.876s
sys 0m2.216s
So, the timing seems very close to Python 3.8 + OpenBLAS, with 4 cores used in parallel. (I guess Julia uses OpenBLAS, but I have not checked it myself…)
In the case of gfortran, matmul() can use an external BLAS library via the -fexternal-blas option, so I guess it will also give a similar result.