Testing the performance of `matmul` under default compiler settings

Thank you for pointing out this.

1 Like

If I set n=10000, it took 15 seconds, (note that call cpu_time is not accurate when multiple cores are involved, here it actually took 15 seconds, but call cpu_time shows 60 seconds or so)

1 Like

Thank you for pointing our the inaccuracy in cpu_time. However, when comparing 3000 with 300, the accuracy does not matter so much.

So if you have 10 cores and are running in parallel, then I think that CPU time will return about 10x the wall time.

2 Likes

Even though cpu_time was included in the code, the time reported in my post is indeed wall-clock time. It was not that precise, but the magnitude was correct.

1 Like

That is because I used Intel MKL’s matmul which automatically uses multiple cores. So call cpu_time() will not show real wall time.

About showing true wall time, you may check,

1 Like

I hope the intrinsic matmul will have a comparable performance by default without tuning needed. Before the tuning finished, half of the users would have turned to our languages, if not more.

1 Like

So we know that by linking to an optimal blas, Fortran can perform very well.

I think you really want to focus on why this doesn’t always happen by default:

Blockquote
let us discuss why Fortran should not optimize its intrinsic matrix/vector operations in the same way (MATLAB is optimizing its intrinsics using Fortran!) and how to explain this to Fortran beginners and Fortran criticizers.

Not my expertise. Presumably it isn’t trivial to do this across the full range of platforms (?anyone informed about this?). No doubt more progress would be possible if more programmers choose to contribute to open source compilers, and/or more funding was available for compilers across the board.

2 Likes

I have not even a tiny doubt about it.

This is exactly my point.

2 Likes

I used the intrinsic matmul which only uses one core, it took 54 seconds when n=10000, and call cpu_time() in this case is accurate.
So, in general, call cpu_time() does return the total time, not the wall time. I mean, for this same code, if I use 10 cores, call cpu_time() will still return like 60 seconds which is total time, but the actual wall time is smaller than that.
I mean call cpu_time() can give you the estimation of total core hours you need.

Sorry for naïve/lazy question, once I installed gfortran, can I just run the above command? I mean do I need to install some blas?

I’m running Ubuntu where blas is already installed, so I didn’t need to do anything extra.

On other platforms, I’m not sure.

1 Like

Another example of widely-used scientific computing infrastructure, that doesn’t default to an optimized blas, is the R interpreter.

They show how to do it here, and discuss the various trade-offs.

On my machine, the default R interpreter took 713s to do a matrix multiplication with n=10000 (single core), vs the above results of 115s with gfortran’s default matmul (also single core) and 8s using gfortran with the external blas (6 cores with hyperthreading).

2 Likes

The choice of algorithm for the various fortran intrinsic procedures is a general issue, not specific to matmul. There are two extreme possibilities.

  1. The fortran intrinsic should return the most accurate result that is practical using robust algorithms and memory allocation that fails as seldom as possible. Then if the user wants to use a different algorithm to improve performance, then the user should know his application enough to choose the appropriate algorithm with the correct tradeoffs between accuracy and speed.

  2. The fortran intrinsic should always return the fastest result, even if accuracy or robustness is compromised. The user is then responsible for using a different algorithm to improve the accuracy to be appropriate for his application.

Think about trig functions, where accurate results require much effort to reduce the input argument, then followed by the appropriate taylor expansion, or the appropriate pade approximate, or continued fraction, or whatever. Or think about random numbers, where the fast ones might not satisfy all of the exotic tests for randomness, while the good ones with long periods that do pass the tests can be more expensive. Or in the case of matmul, the single-core version is almost certainly fastest for small matrices, while for larger matrices there are various choices of multiple threads, multiple cores, AVX hardware, gpu hardware, and so on. The use of fast stack allocation (which can sometimes fail) vs. slower heap allocation (which is more robust) of intermediate work arrays is also an issue.

So which of these choices should fortran do? And does the answer for matmul() apply also to the other intrinsic procedures, such as random_number(), sin(), cos(), sqrt(), etc.? Should this be selectable with compiler options or with compiler directives within the code?

For the stack/heap allocation issue, the choices are stack allocation, heap allocation, or a conditional approach where stack allocation is first tested, and if that fails then heap allocation is invoked.

4 Likes

Ron,
Your conditional stack approach is so sensible, but rarely adopted.
Actually, I don’t know any operating system that has !!

Imagine a world where this conditional approach was the general case. The forum Stack Overflow would probably never have been started !! as stack overflow errors would not be so common.

I did a google search of “what is the stack size limit for Windows 64 bit”. It is very worrying to see some of the answers that google identifies, such as 1MB

The existing approach for the stack is really not suitable for a 64-bit OS. My understanding is that code + stack is limited to 2 Gbytes. I recall @slionel suggesting a limit for combined code + stack, which appears to be based on 32-bit Windows constraints. This approach was such a lost opportunity for a new 64-bit OS.

The only recent improvement to stack design, that I have noticed, is with OpenMP where each thread is given it’s own stack.
It is a shame that each thread was not also given it’s own heap address. At present 64-bit OS can address a 48-bit memory address space, which is much larger than available physical memory. Seperate heap addreses should be a possibility.
What I would like to have is for each heap private array to start on a new memory page to limit memory conflicts between threads.

2 Likes

Take matmul as an example. Thanks to the development of numerical analysis and scientific computing for so many years, there is no real need for tradeoffs between speed and accuracy, unless the matrices under consideration are huge (in a modern sense) or the requirement for speed/accuracy is extraordinarily high, in which case the user should consider specially designed algorithms rather than intrinsic functions anyway. For “daily-life” usage such as multiplying 20000-by-20000 matrices, we know quite well how to do it sufficiently fast with good accuracy.

I totally agree. Indeed, this seems to be the only reasonable approach. How could we make it happen at least for Fortran?

I understand quite well the nontriviality, and I totally agree that more contributions are needed for open-source compilers (a huge thank to anyone who has contributed).

On the other hand, I do believe that commercial vendors like Intel have sufficient resources to move in this direction, and I do not believe that the engineers at MathWorks are much more intelligent/capable than those at Intel.

If MathWorks can optimize MATLAB intrinsic procedures up to a satisfactory level with the help of Fortran, Intel should be able to do it (much) better for the corresponding Fortran intrinsic procedures with the help of Fortran itself.

However, look at the reality again.

Few years back I asked Intel why matmul speed was so different for different optimization levels -O… For O3 they used MKL tuned code at that time. I do not see any reason to make complain regarding this choice if still applied.

Fortran (as in the language standard) doesn’t prescribe any kind of optimization for matmul. It is the goal of the standard to try and remain agnostic of the hardware it runs on. If anything you should ask why do different compiler vendors pick different default settings.

In the Cleve’s Corner on the MathWorks pages I have found that:

We use the Intel Math Kernel Library, which includes multithreaded versions of the BLAS (Basic Linear Algebra Subroutines). For vector arguments, the MATLAB elementary function library, which includes exponential and trigonometric functions, is multithreaded.

The reason MathWorks (and Intel) are able to get the performance is because they probably __don’t use__the default compilers, instead taking their time to read the compiler documentation and pick the right optimization flags and fine-tune critical operations for maximum performance. Once they are happy, they add these flags to their build system, and go on with their day.

I’ve tested your code on an Intel Core™ i7-11700K @ 3.60GHz. With OMP_NUM_THREADS=8, the N=20000 case takes c. 80 seconds to complete both the naive matrix product and the matmul version combined. The command I used for compilation was $ ifort -O3 -xHost -qopenmp -qmkl=parallel -heap-arrays 40 -qopt-matmul test_matmul.f90.

Why can’t these flags be the default? My guess is the needs of Fortran users to exert fine control over their hardware are simply much more diverse than those of MATLAB users. Other reasons why a compiler fail at a specific stack size are probably historic. If you look into the categories of flags available with ifort -help, there are hundreds of options for all kinds of scenarios. At super-computing centers, users will often pick specific -xcode flags and target even specific instructions sets that are known to work best. They also might spend days or weeks optimizing and fine-tuning dominant kernels in their codes.

Now just like MATLAB has extensive documentation pages, that teach you how to use “vectorized” operations, masks, and other performant MATLAB constructs, Fortran compiler vendors also provide some performance tuning guides. I’ve found that the ones from Sun Microsystems and IBM contain a lot of valuable tips on how to get good performance:

To draw an analogy, using default settings is a bit like how you learn to drive with the base model of a car. You can’t just start with Formula 1 car. First you need to learn how to drive the base model. As you become more comfortable and start hitting the limits of what the car can do, you start upgrading. Now even if you had an F1 car, if you were driving it in the city, it would be of no benefit, the starting and stopping at traffic lights, would bring you to a halt. You also need the right roads to exercise it, e.g. to maintain the engine, brake, and tire temperatures to get the peak performance.

8 Likes