In OpenBLAS, a subset of the most important LAPACK functions is reimplemented for better performance and threading. This is why the repository contains both a `/lapack-netlib` folder (the reference version) and a `/lapack` folder. It looks like they re-implement the LU and Cholesky factorizations, but not the other algorithms.
There is also ReLAPACK, which uses recursive (cache-oblivious) algorithms instead of blocked ones. At the time of publication in 2016, it improved on the results of MKL and OpenBLAS.
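The divide-and-conquer idea behind such recursive libraries is easy to sketch. Here is a toy illustration in Python, using matrix multiplication as a stand-in for the factorizations ReLAPACK actually targets (a simplification for brevity): quadrant recursion shrinks the working set until it fits in cache at every level of the memory hierarchy, without any machine-specific block-size tuning.

```python
# Toy sketch of the recursive (cache-oblivious) idea, illustrated on
# matrix multiplication rather than an actual LU/Cholesky factorization.
# Pure Python on nested lists; a real library would call tuned BLAS
# kernels at the leaves. Square power-of-two sizes only, for simplicity.

def matmul_naive(A, B):
    # triple loop; stands in for the tuned kernel at the recursion leaves
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = A[i][k]
            for j in range(p):
                C[i][j] += aik * B[k][j]
    return C

def _add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def _split(X):
    # split a square matrix into four quadrants
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def matmul_recursive(A, B, leaf=2):
    # C11 = A11*B11 + A12*B21, and similarly for the other quadrants
    if len(A) <= leaf:
        return matmul_naive(A, B)
    A11, A12, A21, A22 = _split(A)
    B11, B12, B21, B22 = _split(B)
    C11 = _add(matmul_recursive(A11, B11, leaf), matmul_recursive(A12, B21, leaf))
    C12 = _add(matmul_recursive(A11, B12, leaf), matmul_recursive(A12, B22, leaf))
    C21 = _add(matmul_recursive(A21, B11, leaf), matmul_recursive(A22, B21, leaf))
    C22 = _add(matmul_recursive(A21, B12, leaf), matmul_recursive(A22, B22, leaf))
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

The appeal is exactly that no blocking parameter needs to be chosen: the recursion adapts to whatever cache sizes the machine happens to have.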
There is also a Julia package, RecursiveFactorization.jl, which improves LU factorization performance compared to OpenBLAS.
libFLAME is another library that improves upon the performance of LAPACK. They also offer excellent education material on their linear algebra methodology: http://www.ulaff.net/
I’m not aware how much of these algorithmic advances trickle-down into Reference LAPACK. The problem is that the architectures keep changing, so the optimal performance is a moving target. Also it appears to be hard to optimize for all sizes. At some point linear algebra could be part of the compiler, but it doesn’t appear we are there yet.
We’ve read complaints here on Discourse on how BLAS and LAPACK don’t handle small sizes efficiently, hence the Julia/Python/Mojo communities are motivated to write their own libraries. On the other side you have the DOE and industry using MAGMA / cusolver / rocSOLVER / oneMKL (DPC++) to tackle large problems on GPUs.
Tim Mattson from Intel once showed a nice slide on why we have so many parallel “standards” for the same thing, but the point he makes extends to scientific computing languages, linear algebra libraries, and machine learning frameworks alike:
(Source: Tim Mattson, An Introduction to Parallel programming with OpenMP, PDF, 13.9 MB)
The Julia community is very proud of their benchmarks (SciMLBenchmarks.jl: Benchmarks for Scientific Machine Learning (SciML) and Equation Solvers), and hats off, the improvements they have made in some (if not most) areas are enormous. But sometimes they get over-excited. Let’s take the interval root-finding benchmark as an example: Simple Interval Rootfinding (NonlinearSolve.jl vs Roots.jl vs MATLAB) (archived version from Sep 4 2023 here)
The claim made is:
> MATLAB 2022a reportedly achieves 1.66s. Try this code yourself: we receive ~0.05 seconds, or a 33x speedup.
What is not emphasized is that the MATLAB version was run on an Intel Xeon E5-1650 v4 @ 3.60 GHz 6-core CPU (2016, Broadwell architecture), whereas the Julia code ran on an AMD EPYC 7502 @ 2.5 GHz 32-core processor (2019, Zen 2 architecture). So we should expect some differences just because of the hardware and clock speeds.
To make things worse, the 33x comparison pits MATLAB, which uses a variant of Dekker’s algorithm, against a Julia number that comes from a Newton-Raphson solver. So the comparison is meaningless and misleading. (I can highly recommend the video by Georg Hager, Lecture on Experiments and Data Presentation in High Performance Computing, as a source of good practices for performance measurements.)
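To see why the algorithm choice dominates such a comparison, here is a minimal sketch (in Python, chosen for brevity) of the two algorithm families on the same problem, f(x) = x**2 - 2 on [0, 2]: a bracketing method (plain bisection, standing in for the Dekker/Brent family) versus Newton-Raphson. Both find the root, but their requirements and iteration counts differ wildly, so cross-algorithm timings say little:

```python
def bisect(f, a, b, tol=1e-12):
    """Bracketing method: guaranteed to converge, but gains roughly
    one bit of accuracy per iteration (~41 halvings here)."""
    fa = f(a)
    it = 0
    while b - a > tol:
        m = 0.5 * (a + b)
        fm = f(m)
        it += 1
        if fa * fm <= 0.0:   # root is in the left half
            b = m
        else:                # root is in the right half
            a, fa = m, fm
    return 0.5 * (a + b), it

def newton(f, df, x0, tol=1e-12, maxit=50):
    """Newton-Raphson: quadratic convergence (a handful of iterations),
    but requires the derivative and a good starting guess."""
    x, it = x0, 0
    while abs(f(x)) > tol and it < maxit:
        x -= f(x) / df(x)
        it += 1
    return x, it

f = lambda x: x * x - 2.0
df = lambda x: 2.0 * x
print(bisect(f, 0.0, 2.0))   # sqrt(2), after ~41 iterations
print(newton(f, df, 1.0))    # sqrt(2), after only a few iterations
```

Comparing the wall time of one against the other mostly measures the algorithms, not the languages.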
I wrote my own little test in Fortran, using a wrapper of the classic `zeroin` routine from Forsythe, Malcolm, and Moler (1987):
```fortran
allocate (levels(n), out(n))
call random_number(levels)
levels = 1.5_dp * levels

niter = 1
do
   s = timestamp()
   do k = 1, niter
      do i = 1, n
         out(i) = pzeroin(0.0_dp, 2.0_dp, myfun, tol, levels(i))
      end do
   end do
   runtime = timestamp() - s
   if (runtime > 0.2_dp) exit   ! repeat until at least 0.2 s have elapsed
   niter = niter * 2
end do

print *, runtime / niter
```
Note how the number of outer iterations is doubled to make sure the measured runtime exceeds 0.2 seconds. The results I get are the following:
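The same “double the repetition count until it runs long enough” strategy can be sketched in a few lines of Python. Repeating the kernel this way guards against timer resolution and jitter when a single call takes far less time than the timer can resolve:

```python
import time

def time_per_call(func, min_elapsed=0.2):
    """Double the repetition count until the total runtime exceeds
    min_elapsed seconds, then report the average time per call --
    the same strategy as the Fortran driver above."""
    niter = 1
    while True:
        start = time.perf_counter()
        for _ in range(niter):
            func()
        elapsed = time.perf_counter() - start
        if elapsed > min_elapsed:
            return elapsed / niter
        niter *= 2

# example: time a trivial workload
print(time_per_call(lambda: sum(range(1000))))
```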
Results are shown for:
- gcc version 13.1.0 (Homebrew GCC 13.1.0)
- Intel(R) Core™ i7-9750H CPU @ 2.60GHz (6 cores, Coffee Lake architecture)
- Python v3.9.6 [Clang 14.0.0 (clang-1400.0.29.202)] on darwin; SciPy v1.10.1
The programs and scripts can be found here: GitHub - ivan-pi/zerobench: Benchmark of zeroin
The naive conclusion would be that SciPy is c. 20 times slower than Fortran. In reality the SciPy wrapper does a lot more work: checking the signs of the initial bracket, checking for NaN values, and other stuff that adds overhead. The core root-finding algorithm is actually implemented in C, and can be found in the file `brentq.c`:
```c
extern double brentq(callback_type f, double xa, double xb, double xtol,
                     double rtol, int iter, void *func_data_param,
                     scipy_zeros_info *solver_stats);
```
We can call this routine from Fortran, with the following results:
For some more variety, I added the implementations from the PORT Mathematical Software Library (1978) and NAPACK (1988, codes associated with the book of W. W. Hager, Applied Numerical Linear Algebra). As you can see, we are in the same order of magnitude, with differences of at most 2x (the C `brentq` routine has more elaborate error reporting compared to `zeroin`). Keeping in mind that I’m using different hardware, the classic `zeroin` subroutine is about 4x-5x faster than the Julia benchmark.
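As an aside, the “safety net” layering mentioned above is easy to picture. Here is a hedged Python sketch: the function names are hypothetical and the core is plain bisection rather than SciPy’s actual Brent implementation (which adds interpolation steps), but the split mirrors how a thin validation layer (bracket sign check, NaN check) wraps a tight inner loop:

```python
import math

def core_bisect(f, a, b, xtol=1e-12, maxiter=100):
    """Bare root-finding loop with no input validation -- analogous in
    spirit to the compiled brentq core (the real thing combines
    bisection with inverse quadratic interpolation)."""
    fa = f(a)
    for _ in range(maxiter):
        m = 0.5 * (a + b)
        fm = f(m)
        if fm == 0.0 or b - a < xtol:
            return m
        if fa * fm < 0.0:
            b = m
        else:
            a, fa = m, fm
    return 0.5 * (a + b)

def safe_bisect(f, a, b, **kw):
    """The kind of up-front validation a high-level wrapper performs
    before dispatching to the fast core. Hypothetical names; see
    SciPy's Python layer for the real checks."""
    if not a < b:
        raise ValueError("invalid interval: need a < b")
    fa, fb = f(a), f(b)
    if math.isnan(fa) or math.isnan(fb):
        raise ValueError("NaN encountered at a bracket endpoint")
    if fa * fb > 0.0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    return core_bisect(f, a, b, **kw)
```

The checks cost little in absolute terms, but for microsecond-scale problems they (plus interpreter dispatch) can dominate the measured time, which is exactly the gap between the SciPy and raw-C numbers above.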
Btw, the original version of Brent’s routine can be found in his book Algorithms for Minimization without Derivatives (1973). A comment in the source code says it was tested on the IBM 360/91; here is one photographed at NASA Goddard Space Flight Center:
Granted, compilers and software practices have come a long way since then. But I think it’s pretty cool that code from 50 years ago still compiles and runs.
In conclusion:
- the SciML benchmark pertaining to root-finding is misleading; it ignores hardware and algorithmic differences.
- the SciML benchmark chooses a poor baseline; arguably, if the 1.66 seconds that MATLAB takes were the bottleneck, a MATLAB user could drop down into a MEX-file and benefit from the speed of compiled languages.
- the performance of SciPy’s `brentq` (in C) is similar to Fortran’s, once the Python interpreter and its safety nets are taken out of the picture.