Testing the performance of `matmul` under default compiler settings

Fortran (as in the language standard) doesn’t prescribe any kind of optimization for `matmul`. The standard aims to remain agnostic of the hardware it runs on. If anything, you should ask why different compiler vendors pick different default settings.

In Cleve’s Corner on the MathWorks pages I found the following:

We use the Intel Math Kernel Library, which includes multithreaded versions of the BLAS (Basic Linear Algebra Subroutines). For vector arguments, the MATLAB elementary function library, which includes exponential and trigonometric functions, is multithreaded.

The reason MathWorks (and Intel) are able to get this performance is that they probably __don’t use__ the default compiler settings. Instead, they take the time to read the compiler documentation, pick the right optimization flags, and fine-tune critical operations for maximum performance. Once they are happy, they add these flags to their build system and go on with their day.

I’ve tested your code on an Intel Core™ i7-11700K @ 3.60GHz. With `OMP_NUM_THREADS=8`, the N=20000 case takes about 80 seconds to complete both the naive matrix product and the `matmul` version combined. The command I used for compilation was `ifort -O3 -xHost -qopenmp -qmkl=parallel -heap-arrays 40 -qopt-matmul test_matmul.f90`.
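The test program itself is not reproduced in this post, so the following is only a minimal sketch of what such a comparison could look like. The program name `matmul_bench`, the variable names, and the small default size `n = 500` are my own assumptions for illustration; the timings above were obtained with N=20000.

```fortran
program matmul_bench
  ! Hypothetical benchmark sketch: names and problem size are illustrative
  ! and are not taken from the original test code.
  use, intrinsic :: iso_fortran_env, only: dp => real64, int64
  implicit none
  integer, parameter :: n = 500        ! assumed small size; the post used N = 20000
  real(dp), allocatable :: a(:,:), b(:,:), c_naive(:,:), c_mm(:,:)
  integer(int64) :: t0, t1, rate
  integer :: i, j, k

  allocate(a(n,n), b(n,n), c_naive(n,n), c_mm(n,n))
  call random_number(a)
  call random_number(b)

  ! Naive triple loop. The innermost index i walks down a column,
  ! which matches Fortran's column-major array layout.
  call system_clock(t0, rate)
  c_naive = 0.0_dp
  do j = 1, n
    do k = 1, n
      do i = 1, n
        c_naive(i,j) = c_naive(i,j) + a(i,k) * b(k,j)
      end do
    end do
  end do
  call system_clock(t1)
  print '(a,f10.3,a)', 'naive loops: ', real(t1 - t0, dp) / real(rate, dp), ' s'

  ! Intrinsic matmul; how this is implemented depends on the compiler and flags.
  call system_clock(t0)
  c_mm = matmul(a, b)
  call system_clock(t1)
  print '(a,f10.3,a)', 'matmul:      ', real(t1 - t0, dp) / real(rate, dp), ' s'

  ! Sanity check: both products should agree up to rounding error.
  if (maxval(abs(c_naive - c_mm)) > 1.0e-9_dp * n) error stop 'results differ'
  print '(a,es10.3)', 'max abs difference: ', maxval(abs(c_naive - c_mm))
end program matmul_bench
```

Saved as, say, `matmul_bench.f90`, this can be built once with default settings and once with the flags from the command above to see how much the two timings move. The wall-clock timer `system_clock` is used rather than `cpu_time` because CPU time summed across threads would hide any multithreading speed-up.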

Why can’t these flags be the default? My guess is that the needs of Fortran users, who want fine control over their hardware, are simply much more diverse than those of MATLAB users. Other behaviors, such as why a compiled program fails once it exceeds a specific stack size, are probably historical. If you look into the categories of flags available with `ifort -help`, there are hundreds of options for all kinds of scenarios. At supercomputing centers, users will often pick specific `-x<code>` flags and even target specific instruction sets that are known to work best. They also might spend days or weeks optimizing and fine-tuning the dominant kernels in their codes.

Just as MATLAB has extensive documentation pages that teach you how to use “vectorized” operations, masks, and other performant MATLAB constructs, Fortran compiler vendors also provide performance tuning guides. I’ve found that the ones from Sun Microsystems and IBM contain a lot of valuable tips on how to get good performance.

To draw an analogy, using default settings is a bit like learning to drive with the base model of a car. You can’t just start with a Formula 1 car. First you need to learn how to drive the base model. As you become more comfortable and start hitting the limits of what the car can do, you start upgrading. Even if you had an F1 car, driving it in the city would be of no benefit: the constant starting and stopping at traffic lights would bring you to a halt. You also need the right roads to exercise it, e.g. to keep the engine, brakes, and tires at the temperatures needed for peak performance.
