Unbeatable micro benchmark

Let’s create an unbeatable micro benchmark where we can show how Fortran, the immortal programming language, is better than other scientific languages.

Which algorithm do you think could be the first one to be tested?


Today I have begun creating a small and simple benchmark, but I am afraid that I am not able to optimize Fortran enough. I am just at the beginning with Fortran.

Results

Round 1. Fight
--------------

FORTRAN
real	0m1,235s
user	0m1,219s
sys 	0m0,016s

JULIA
real	0m0,470s
user	0m0,673s
sys     0m0,316s

The code
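
The code itself is not reproduced here. For orientation, a minimal sketch of what such a dgemm-based benchmark could look like in Fortran; the matrix size, random initialisation, and the OpenBLAS link line are my assumptions, not the original source:

! Hypothetical sketch of a dgemm-based micro benchmark (not the original code).
! Build, e.g.: gfortran bench_dgemm.f90 -lopenblas
program bench_dgemm
   implicit none
   integer, parameter :: n = 2000                 ! assumed matrix size
   real(8), allocatable :: a(:,:), b(:,:), c(:,:)
   external dgemm                                 ! provided by the linked BLAS library

   allocate(a(n,n), b(n,n), c(n,n))
   call random_number(a)
   call random_number(b)

   call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)   ! C = A*B

   print *, c(1,1)
end program bench_dgemm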

2 Likes

Round 1.1, fight!

After compiling my own libblas.a with -O3, I got these results. They are better, of course, but they are not what I am looking for, or perhaps I am not able to understand the results.

Round 1.1 Fight
---------------

FORTRAN
real	0m0,663s
user	0m0,647s
sys     0m0,016s

JULIA
real	0m0,477s
user	0m0,694s
sys     0m0,298s

Nice start! Which BLAS implementation are you using out of interest? I would strongly suggest using an optimised one like OpenBLAS.

Some ideas for you:

  • It’s normal to calculate time based on repeated runs to help mitigate variation due to ‘warm-up’/test-noise;
  • Run timings across a sweep in matrix size n to see how Julia/Fortran comparison varies with matrix size (this is a more interesting result);
  • Separate out the random number generation from the timed section to isolate the dgemm call as the micro-benchmark goal (i.e. benchmark one thing at a time)

See cpu_time and system_clock for calculating the time within Fortran (I forget which one is best for benchmarking…)
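
To illustrate those suggestions, here is a minimal sketch (not the original benchmark) that keeps the random-number generation outside the timed section and puts system_clock around repeated dgemm calls; the matrix size and repetition count are arbitrary assumptions:

! Sketch: time only the dgemm calls, averaged over several repetitions.
program time_dgemm
   implicit none
   integer, parameter :: n = 1000, nrep = 10      ! assumed size and repeat count
   real(8), allocatable :: a(:,:), b(:,:), c(:,:)
   integer(8) :: t0, t1, rate
   integer :: i
   external dgemm

   allocate(a(n,n), b(n,n), c(n,n))
   call random_number(a)      ! initialisation kept outside the timed section
   call random_number(b)

   call system_clock(t0, rate)
   do i = 1, nrep
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
   end do
   call system_clock(t1)

   print *, c(1,1)
   print *, 'mean dgemm time (s):', real(t1 - t0, 8) / real(rate, 8) / nrep
end program time_dgemm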

This first test case is really benchmarking a library call so I would ultimately expect both Fortran and Julia to be effectively equal since they are both calling an optimised BLAS implementation. Is that the normal way to multiply matrices in Julia?

We’re hoping to put together a repo of Fortran benchmarks, see here - feel free to contribute there also :slight_smile:

1 Like

Fortran reborn. Success

After following the advice of lkedward, I downloaded OpenBLAS. It was really easy to compile. After that, I tested both dynamic and static compilation. Finally, I got these results with the dynamic version; the static version is also much faster than the Julia version.

Round 1.2. Fight!
-----------------

FORTRAN
real	0m0,102s
user	0m0,277s
sys     0m0,087s

JULIA
real	0m0,485s
user	0m0,663s
sys     0m0,329s

Do not forget that Julia has a well-known problem called the “time to first plot” problem (start-up compilation latency), so total run time matters, and a compiled language such as Fortran takes advantage of this.

What’s next

Now I would like to create some benchmarks based on intensive use of matrices and produce some plots with gnuplot to show the results. I am open to your ideas and suggestions. Thank you.

1 Like

Nice try!

BLAS libraries are written in Fortran, so the timing of this test doesn’t give a definitive answer. In fact, you are really testing the efficiency of the two BLAS (OpenBLAS) libraries.

Some other ideas:

  • create your own matrix-matrix multiplication subroutine (the Fortran intrinsic matmul might itself call a BLAS routine!); see the sketch after this list

  • instead of random initialisation, you can try with an explicit expression (like i+j).

  • as suggested by lkedward, run it several times, and you can try increasing the matrix size.
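
The hand-written multiply with explicit initialisation mentioned above might look roughly like this; a sketch only, with the size, the second matrix’s expression, and the loop order chosen arbitrarily:

! Sketch: naive triple-loop matrix multiply with deterministic initialisation.
program my_matmul
   implicit none
   integer, parameter :: n = 200
   real(8) :: a(n,n), b(n,n), c(n,n)
   integer :: i, j

   ! explicit expressions (i+j style) instead of random numbers
   do j = 1, n
      do i = 1, n
         a(i,j) = real(i + j, 8)
         b(i,j) = real(i - j, 8)
      end do
   end do

   call my_mul(a, b, c, n)
   print *, c(1,1), maxval(abs(c - matmul(a, b)))   ! sanity check against the intrinsic

contains

   subroutine my_mul(a, b, c, n)
      integer, intent(in)  :: n
      real(8), intent(in)  :: a(n,n), b(n,n)
      real(8), intent(out) :: c(n,n)
      integer :: i, j, k
      c = 0.0d0
      do j = 1, n            ! column-major friendly loop order
         do k = 1, n
            do i = 1, n
               c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end do
         end do
      end do
   end subroutine my_mul

end program my_matmul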

1 Like

Have you looked at

https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/julia-ifc.html

It appears to be derived from The Great Language Shootout (circa 2003).

The original test:

λ  Measure-Command{.\Fortran1}
Seconds           : 0
Milliseconds      : 124
Ticks             : 1243009

λ  Measure-Command{julia test.jl}
Seconds           : 0
Milliseconds      : 496
Ticks             : 4960004

I ran the same program with n = 10,000, which is the furthest I could go with Julia before it complained. Because I am on Windows, I had to use allocatable arrays due to the image size restriction.
The results are interesting:

λ  Measure-Command{.\fortran2}
Minutes           : 0
Seconds           : 15
Milliseconds      : 754
:

λ  Measure-Command{julia test2.jl}
Minutes           : 0
Seconds           : 15
Milliseconds      : 537
:

With Fortran I used Intel’s compiler against MKL, linked with /Qmkl:parallel (/Qmkl:sequential was 3x slower); on the Julia side, the latest Julia 1.5 on Windows 10.
I’m impressed that Julia’s out-of-the-box BLAS library is as fast as Intel MKL (on my machine at least).
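
For reference, the allocatable-array workaround mentioned above looks roughly like this; a sketch only, not the actual test program, with the multiplication step and error handling as assumptions:

! Sketch: heap-allocated matrices for n = 10000.
program big_matrices
   implicit none
   integer, parameter :: n = 10000
   ! Fixed-size (statically stored) arrays of this size can hit the image-size
   ! limit mentioned above; allocatable arrays are heap-allocated instead.
   real(8), allocatable :: a(:,:), b(:,:), c(:,:)
   integer :: istat

   allocate(a(n,n), b(n,n), c(n,n), stat=istat)
   if (istat /= 0) stop 'allocation failed'

   call random_number(a)
   call random_number(b)
   c = matmul(a, b)        ! or a dgemm call, as in the benchmark
   print *, c(1,1)
end program big_matrices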

1 Like

The test I created just measures how fast a BLAS function is called from Julia.

BLAS is a Fortran library, so with this test we are measuring the same Fortran function called natively (from Fortran) or called from Julia.

Something like this:

# Julia code (schematic)
for i = 1:1000000
    invoke_a_fortran_function()   # stands in for the underlying BLAS call
end

The benchmark is unbeatable because we are comparing Fortran with Fortran.

1 Like

This isn’t quite correct; BLAS/LAPACK really only defines a standard set of C and Fortran interfaces, and the actual implementation isn’t necessarily written in Fortran. Netlib BLAS/LAPACK is really just a human-readable reference implementation. ‘Production’ implementations such as OpenBLAS and Intel MKL are written in assembly at the lowest levels in order to optimise for particular platforms and instruction sets (see here for example). That is why OpenBLAS is so much faster than Netlib BLAS.

I can’t be sure but I think Julia ships with the OpenBLAS implementation so in this case you are correct to say that you are essentially benchmarking the language differences in the library call.

2 Likes

Calling a BLAS routine is not a very good benchmark, even if you implement the operation by hand in each language, because many languages such as Fortran have built-in operations like matmul and dot_product that people should use instead of reimplementing them by hand.
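
For completeness, those intrinsics are used like this; a trivial illustration, not a benchmark, with sizes chosen arbitrarily:

! Trivial illustration of the matmul and dot_product intrinsics.
program intrinsics_demo
   implicit none
   real(8) :: a(3,3), b(3,3), c(3,3), x(3), y(3)

   call random_number(a)
   call random_number(b)
   call random_number(x)
   call random_number(y)

   c = matmul(a, b)               ! matrix-matrix product via the intrinsic
   print *, dot_product(x, y)     ! vector dot product via the intrinsic
   print *, c(1,1)
end program intrinsics_demo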

A better benchmark would be some more complicated algorithm, say some iterative eigensolver or linear solver, or some other numerical task such as solving an ODE using some method.

3 Likes

For comparison, I’ve tried the matrix multiplication part with Python3.8. The code is like this:

import numpy as np
np.show_config()          # print which BLAS/LAPACK NumPy is linked against

N = 10000
a = np.zeros( (N, N) )
b = np.zeros( (N, N) )
c = np.dot( a, b )        # N x N matrix product, handled by the linked BLAS
print( c[0, 0] )

Python3.8 installed via Homebrew gives this output (with the time command):

blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
0.0

real	0m26.933s
user	1m43.801s
sys	0m1.947s

Here, the CPU is a Core i7-3615QM @ 2.30 GHz (on a Mac mini 2012, very old), and 4 cores are used in parallel (according to htop, with 400% CPU usage). The timing fluctuates somewhat because I am opening three browsers and playing music… (so please do not take this as a rigorous comparison XD)

Python3.8 in Anaconda3 (2019) gives this output (on the same machine):

blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Volumes/.../anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Volumes/.../anaconda3/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Volumes/.../anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Volumes/.../anaconda3/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Volumes/.../anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Volumes/.../anaconda3/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Volumes/.../anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Volumes/.../anaconda3/include']
0.0

real	0m24.285s
user	1m32.364s
sys	0m2.319s

So, Anaconda uses MKL, and 4 cores are used in parallel (according to htop).

This is similar code with Julia 1.4:

function test()

    N = 10000
    a = zeros( N, N )
    b = zeros( N, N )
    c = a * b

    println( c[1, 1] )
end

@time test()

which gives

0.0
 25.960560 seconds (510.33 k allocations: 2.258 GiB, 0.38% gc time)

real 0m27.348s
user 1m39.876s
sys 0m2.216s

So, the timing seems very close to Python3.8 + OpenBLAS, with 4 cores used in parallel. (I guess Julia uses OpenBLAS, but I have not checked it myself…)

In the case of gfortran, matmul() can use an external BLAS library via the -fexternal-blas option, so I guess it will also give a similar result.
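
A small sketch of that setup; the file name, matrix size, and the OpenBLAS link line are assumptions, while -fexternal-blas is the gfortran option mentioned above:

! Sketch: let gfortran's matmul dispatch to an external BLAS.
! Build, e.g.: gfortran -O2 -fexternal-blas matmul_blas.f90 -lopenblas
program matmul_blas
   implicit none
   integer, parameter :: n = 10000
   real(8), allocatable :: a(:,:), b(:,:), c(:,:)

   allocate(a(n,n), b(n,n), c(n,n))
   a = 0.0d0
   b = 0.0d0
   c = matmul(a, b)     ! with -fexternal-blas, large matmul calls are forwarded to BLAS
   print *, c(1,1)
end program matmul_blas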

1 Like