Testing the performance of `matmul` under default compiler settings

I understand quite well that this is nontrivial, and I totally agree that more contributions are needed for open-source compilers (a huge thanks to anyone who has contributed).

On the other hand, I do believe that commercial vendors like Intel have sufficient resources to move in this direction, and I do not believe that the engineers at MathWorks are much more intelligent/capable than those at Intel.

If MathWorks can optimize MATLAB intrinsic procedures up to a satisfactory level with the help of Fortran, Intel should be able to do it (much) better for the corresponding Fortran intrinsic procedures with the help of Fortran itself.

However, look at the reality again.

A few years back I asked Intel why the matmul speed was so different at different optimization levels -O… At -O3 they used MKL-tuned code at that time. I do not see any reason to complain about this choice if it is still applied.

Fortran (as in the language standard) doesn’t prescribe any kind of optimization for matmul. It is the goal of the standard to remain agnostic of the hardware it runs on. If anything, you should ask why different compiler vendors pick different default settings.

In Cleve’s Corner on the MathWorks pages I found the following:

We use the Intel Math Kernel Library, which includes multithreaded versions of the BLAS (Basic Linear Algebra Subroutines). For vector arguments, the MATLAB elementary function library, which includes exponential and trigonometric functions, is multithreaded.

The reason MathWorks (and Intel) are able to get this performance is because they probably __don’t use__ the default compiler settings, instead taking their time to read the compiler documentation, pick the right optimization flags, and fine-tune critical operations for maximum performance. Once they are happy, they add these flags to their build system, and go on with their day.

I’ve tested your code on an Intel Core™ i7-11700K @ 3.60GHz. With OMP_NUM_THREADS=8, the N=20000 case takes about 80 seconds to complete the naive matrix product and the matmul version combined. The command I used for compilation was:

ifort -O3 -xHost -qopenmp -qmkl=parallel -heap-arrays 40 -qopt-matmul test_matmul.f90

Why can’t these flags be the default? My guess is that the needs of Fortran users to exert fine control over their hardware are simply much more diverse than those of MATLAB users. Other reasons why a compiler fails at a specific stack size are probably historical. If you look at the categories of flags available with ifort -help, there are hundreds of options for all kinds of scenarios. At supercomputing centers, users will often pick specific -x<code> flags and even target specific instruction sets that are known to work best. They might also spend days or weeks optimizing and fine-tuning the dominant kernels in their codes.

Now, just like MATLAB has extensive documentation pages that teach you how to use “vectorized” operations, masks, and other performant MATLAB constructs, Fortran compiler vendors also provide performance tuning guides. I’ve found that the ones from Sun Microsystems and IBM contain a lot of valuable tips on how to get good performance.

To draw an analogy, using default settings is a bit like learning to drive with the base model of a car. You can’t just start with a Formula 1 car; first you need to learn how to drive the base model. As you become more comfortable and start hitting the limits of what the car can do, you start upgrading. Even if you had an F1 car, driving it in the city would be of no benefit: the starting and stopping at traffic lights would bring you to a halt. You also need the right roads to exercise it, e.g. to maintain the engine, brake, and tire temperatures needed for peak performance.

8 Likes

Hello @ivanpribec Ivan, thank you very much for the informative and insightful input! I very much appreciate your posts with detailed data and references, a style that all of us should follow. :slight_smile:

I like your metaphor about the base model of a car. However, this base model should also evolve with time. I suppose that most users today would expect and be comfortable with the base model of a 21st-century car, but not that of a 19th-century one.

For most users, when they choose either a car or a language, they just need and will choose “something that simply works”.

I guess this is why Python has become so popular: it provides very easy access to tools that “simply work” for the user’s daily jobs. Users do not need to tune these tools unless their needs are extreme.

I guess this is also why Fortran was popular among scientists: it simply worked for their daily jobs. However, since the birth of Fortran, the group of “scientists” has expanded enormously, and their “daily jobs” have evolved considerably. Within this large group of scientists, I guess over 90 percent (if not more) do not want to spend any time tuning their car, or their programming language. In addition, their daily jobs have become much more challenging than 50 years ago, so that something that simply worked at that time simply does not work anymore.

Finally, I would like to point out one thing I think is a trend in the development of science, technology, and engineering: let the professionals do the professional thing. This is why we have so many subdomains in science/technology/engineering, and this is how scientists/technologists/engineers can focus on advancing their own subdomains while benefiting from the advances in other subdomains that they do not need/want to know anything about.

A tool that needs tuning for daily jobs is bound to be abandoned. To me, this seems a simple truth that is applicable to any field. If this ever happens, I will be among the last to abandon it, for multiple reasons, and I want to be loud about this truth so that it never happens.

Thank you for this information. Putting performance aside, what about robustness? Is a segfault ever acceptable as the default behavior?

Regarding matmul, it happened once and I reported it to Intel support. Since then it has worked for me as expected.

This is very interesting. I have tried the code in my post on several different machines (all running Ubuntu 20.04/22.04) with the latest version of ifort, and it always leads to a quick segfault like the following:

ifort -g -warn all test_matmul.f90 && ./a.out 
 Initialize
 MATMUL
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
a.out              000000000040601A  Unknown               Unknown  Unknown
libpthread-2.31.s  00007FC09489E420  Unknown               Unknown  Unknown
a.out              0000000000405042  Unknown               Unknown  Unknown
a.out              0000000000403822  Unknown               Unknown  Unknown
libc-2.31.so       00007FC0946BA083  __libc_start_main     Unknown  Unknown
a.out              000000000040372E  Unknown               Unknown  Unknown

Note that the compiler options are -g -warn all.

True, we can circumvent the segfault with -heap-arrays, but that is not the point. It is a workaround rather than a fix for the bug. There is no acceptable excuse for a segfault as the default behavior.

Maybe it is only me who has a buggy copy of Intel oneAPI, or flawed machines.
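For context, here is a minimal sketch of the kind of code under discussion (this is not the original test_matmul.f90; the size and declarations are only illustrative). The arrays themselves live on the heap, but with ifort’s default settings the temporary used to evaluate matmul may be placed on the stack, and for large N that far exceeds a typical 8 MB default stack limit:

program matmul_stack_sketch
  implicit none
  integer, parameter :: n = 20000                 ! illustrative size, as in the timing above
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)

  allocate (a(n,n), b(n,n), c(n,n))               ! heap allocations
  call random_number(a)
  call random_number(b)
  print *, 'MATMUL'
  ! ifort may evaluate matmul into a stack temporary before the assignment;
  ! for n = 20000 that temporary is about 3.2 GB of double precision data.
  c = matmul(a, b)
  print *, c(1,1)
end program matmul_stack_sketch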

Process stack-size limits vary by OS and distro. On macOS the default is 8192 KB, for example, and the same holds for a default CentOS 8.2 installation.

OpenMP applications do tend to require more stack. Most admins of HPC clusters unlimit the process stack size for this reason, or users just modify their .bashrc to do this.
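For example, on a Linux system with bash the relevant settings look something like this (illustrative shell commands; the right values depend on the system and the code):

# raise the main process stack limit for the current shell (often put in ~/.bashrc)
ulimit -s unlimited

# give each OpenMP worker thread a larger private stack as well
export OMP_STACKSIZE=512M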

The debate over stack vs heap allocation for the Intel compiler has gone on for years. Stack is easier to manage, has less overhead, etc., and actually helps performance. I have a user this month complaining that using heap-arrays instead of the default stack allocation slowed their code by 2% and who wants us “to fix heap arrays”. Yes, right, of course, so simple, why didn’t we think of that? There are lots of array temporaries in that code over millions of iterations.

Different compilers take different strategies on heap vs stack. I do agree that a segfault out of the box at -O2 is probably not friendly. Counter that with the argument that -O2 asks for performance optimizations, and stack allocation can (by some compilers) be considered a valid optimization; you asked for optimization, no? Or you can argue that if you are doing HPC, and probably using OpenMP or threaded libraries for multicore, you would have an unlimited stack by default. Anyhow, we can debate all day and never come to a consensus.

The matmul performance is interesting. I’ll look into this example.

1 Like

@greenrongreen, the “debate of stack vs heap” can be clarified if the size of the arrays is also considered.
(My impression is that) for “large” arrays, there is little difference in compute efficiency between stack and heap, but for “small” arrays it can be more efficient to have them on the stack and so get better access to the L1 cache. Small arrays are also probably allocated more frequently.
For OpenMP, especially on the heap, it would be an improvement if large arrays could be positioned on a new memory page to mitigate memory consistency problems. (I have not seen this as a compile option?)

program test_matmul modification
I have reviewed the loop order in program test_matmul and enhanced the program to provide a comparison with other approaches. These other approaches consider:

  • revised DO loop order
  • partitioning the calculation into sub matrices to reduce memory <> cache transfers
  • applying OpenMP to improve performance
  • applying partitioning to OpenMP to reduce memory <> cache transfers

The test is for a range of matrix sizes x 7 coded approaches.
! Program to test different approaches to emulate MATMUL
!
! 1 MATMUL use intrinsic routine
! 2 maprod_v0 original proposed DO order : j i k
! 3 maprod_v1 revised DO order j k i to improve sequential memory access
! 4 maprod_v2 replace inner loop with array syntax
! 5 maprod_part attempt to partition the multiplication to improve L1 Cache usage
! 6 maprod_omp OpenMP approach from V2 using temporary column vector on thread stack
! 7 maprod_omp_part OpenMP approach on J that includes partitioning ( different sizes )
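For reference, a minimal sketch of the first two loop orders listed above (the routine names follow the comments; the interfaces and declarations are my own assumptions, not necessarily those of test_matmulp.f90):

! Original order (j, i, k): the inner loop strides through a row of A,
! which is non-contiguous in Fortran's column-major storage.
subroutine maprod_v0(a, b, c, n)
  implicit none
  integer, intent(in)  :: n
  double precision, intent(in)  :: a(n,n), b(n,n)
  double precision, intent(out) :: c(n,n)
  integer :: i, j, k
  do j = 1, n
    do i = 1, n
      c(i,j) = 0.0d0
      do k = 1, n
        c(i,j) = c(i,j) + a(i,k)*b(k,j)   ! a(i,k) accessed with stride n
      end do
    end do
  end do
end subroutine maprod_v0

! Revised order (j, k, i): the inner loop walks down columns of A and C,
! so all array accesses are unit-stride (sequential memory access).
subroutine maprod_v1(a, b, c, n)
  implicit none
  integer, intent(in)  :: n
  double precision, intent(in)  :: a(n,n), b(n,n)
  double precision, intent(out) :: c(n,n)
  integer :: i, j, k
  c = 0.0d0
  do j = 1, n
    do k = 1, n
      do i = 1, n
        c(i,j) = c(i,j) + a(i,k)*b(k,j)   ! a(:,k) and c(:,j) traversed contiguously
      end do
    end do
  end do
end subroutine maprod_v1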

I am reporting both the elapsed time and the computation rate in GFLOP/s vs matrix size.
The GFLOP/s vs matrix size results show how the memory <> cache transfer bottleneck is affected by the different strategies, which helps to demonstrate their benefits.
The range of sizes should be chosen carefully, as the elapsed time scales as O(n^3), so larger sizes take a lot longer!

The partitioning sizes (di, dj, dk) are related to the processor cache sizes and also to the number of threads available. I use different sizes for a single thread (related to the L1 cache size) vs OpenMP (related to the L1 and shared L3 cache sizes). The optimal sizes will vary with processor.
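A minimal sketch of the partitioning idea follows; the block sizes shown are placeholders, not the values used in test_matmulp.f90, and min() handles the edge blocks when n is not a multiple of the block size:

! Blocked (partitioned) product: work on sub-matrices small enough to stay in cache.
! The block sizes di, dj, dk below are illustrative only and should be tuned per processor.
subroutine maprod_part(a, b, c, n)
  implicit none
  integer, intent(in)  :: n
  double precision, intent(in)  :: a(n,n), b(n,n)
  double precision, intent(out) :: c(n,n)
  integer, parameter :: di = 64, dj = 64, dk = 64
  integer :: i, j, k, i0, j0, k0

  c = 0.0d0
  do j0 = 1, n, dj
    do k0 = 1, n, dk
      do i0 = 1, n, di
        ! multiply a di x dk block of A by a dk x dj block of B,
        ! accumulating into the corresponding di x dj block of C
        do j = j0, min(j0+dj-1, n)
          do k = k0, min(k0+dk-1, n)
            do i = i0, min(i0+di-1, n)
              c(i,j) = c(i,j) + a(i,k)*b(k,j)
            end do
          end do
        end do
      end do
    end do
  end do
end subroutine maprod_part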

Hopefully these results show how each computation approach addresses the memory <> cache bottleneck. The approaches provided do show a significant improvement over the initial posting.

I tested the program using gfortran, built with

del %1.exe
del %1.o
set vec=-march=native
set options=-fimplicit-none -O3 %vec% -fstack-arrays -ffast-math -fopenmp

gfortran %1.f90 %options% -o %1.exe

dir %1.* /od

I am sure there are better approaches available.
I would welcome suggestions so we could all learn from these better approaches.

test_matmulp.f90 (11.6 KB)

2 Likes

Thank you for this very detailed and careful testing.

On my machine, the result for 2000×2000 matrices is

 Initialize        2000
 int   552.86281235372076        1.3641471130000000     
 V1    2.8421709430404007E-012   48.336866372000003     
 V2    2.8421709430404007E-012   24.470532088999999     
 Par   2.8421709430404007E-012   13.288560129000000     
 Omp   2.8421709430404007E-012   17.728390386000001     
 OPR   2.8421709430404007E-012   15.154852551999999     
 int   2.8421709430404007E-012   2.3784506919999999     
 V1    2.8421709430404007E-012   91.001646123000000     
 V2    2.8421709430404007E-012   46.876135239999996     
 Par   2.8421709430404007E-012   24.167207832999999     
 Omp   2.8421709430404007E-012   31.981271034000002     
 OPR   2.8421709430404007E-012   29.062833046999998     
 int   2.8421709430404007E-012   3.8525379700000002     
 V1    2.8421709430404007E-012   137.83232808500000     
 V2    2.8421709430404007E-012   69.763581262999992     
 Par   2.8421709430404007E-012   35.665185921999999     
 Omp   2.8421709430404007E-012   45.847050401000004     
 OPR   2.8421709430404007E-012   41.282760373000002     
 int   2.8421709430404007E-012   4.8537640529999999     
 V1    2.8421709430404007E-012   183.49185632699999     
 V2    2.8421709430404007E-012   90.303348412999995     
 Par   2.8421709430404007E-012   48.394379927999999     
 Omp   2.8421709430404007E-012   63.160495253000008     
 OPR   2.8421709430404007E-012   53.523109639000005     
Succeed   2000   0.077950   1.213441   0.000000  45.872964  22.575837  12.098595  15.790124  13.380777

I am particularly interested in comparing V1 and V2, the latter using array syntax instead of an inner loop. How should I interpret the comparison between them according to the result?

I don’t know the compiler, compiler options, or processor you used, but from what I see:
V2 with array syntax performs much better than V1 with loop syntax.
Did you enable OpenMP?
The MATMUL intrinsic is performing much better than any of the coded options, which shows there is much more improvement possible compared with the options I provided. It suggests that the partitioning sizes are not appropriate for your cache sizes.

My Ryzen results are:

Succeed   2000   0.021353   0.753539   0.000000   0.901243   0.900175   0.709549   0.136021   0.087422
  
GFLOP/s   2000     0.027     9.918     0.000     8.342     8.365    10.612    56.898    87.511
1 Like

Sorry, I forgot to mention the compiler and options. What I did was simply

gfortran test_matmul.f90

I suppose this did not enable OpenMP.

This is very interesting. Is it true in general that the whole-array assignment

a = b

is faster than a loop?

We would hope that the compiler will optimise this into a sequential memory copy. It is very basic array syntax to get right.
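For reference, the two forms being compared look like this (a minimal, self-contained sketch; the size n is only illustrative):

program copy_forms
  implicit none
  integer, parameter :: n = 1000                 ! illustrative size
  double precision, allocatable :: a(:,:), b(:,:)
  integer :: i, j

  allocate (a(n,n), b(n,n))
  call random_number(b)

  a = b                                          ! whole-array assignment

  do j = 1, n                                    ! equivalent explicit loops, inner index first
    do i = 1, n                                  ! for contiguous, column-major access
      a(i,j) = b(i,j)
    end do
  end do

  print *, a(n,n)
end program copy_forms

With optimization enabled, a good compiler should turn both forms into essentially the same sequential copy when the arrays are contiguous.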

You could try :

gfortran test_matmulp.f90 -fimplicit-none -O3 -march=native -fstack-arrays -ffast-math -fopenmp -o test_matmulp.exe

1 Like

Totally agree. I do not see any reason why anyone should write such an array operation as a loop. The compiler has all the information about what the user wants to do. It should be able to get this optimized easily, if the vendor is willing to do so.

Meanwhile, I do not see a big essential difference between a = b and c = matmul(d, e). If compilers can easily optimize a = b, they should be able to optimize c = matmul(d, e) as well (not to mention that many other languages have already done it successfully), if they are willing to do so. Unfortunately, I see an unreasonably huge and arrogant resistance.

1 Like

These things are all “quality of implementation” issues. Maybe a compiler will do these high-level constructs right, and maybe it won’t. Or maybe it will get them right when the arrays are contiguous, but not otherwise. As written, the statement c = matmul(d, e) might be expected to allocate some temporary storage (from the stack, or the heap) for the function evaluation, then assign that temporary array to c, and then release the temporary memory. One would hope that the optimizer would recognize that the memory allocation and deallocation is unnecessary in this simple statement, but who knows. Or what if the statement were a little more complicated, with the matmul() in the middle? Or what if instead of the intrinsic matmul, there were a user-written function? None of these things are really defined by the standard; they are all up to the compiler and its optimization flags.
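As a point of comparison, one way to sidestep the quality-of-implementation question for this particular operation is to call an optimized BLAS routine directly (a sketch only; it assumes you link against a BLAS implementation such as MKL or OpenBLAS, e.g. with -qmkl or -lopenblas):

program gemm_example
  implicit none
  integer, parameter :: n = 2000                 ! illustrative size
  double precision, allocatable :: c(:,:), d(:,:), e(:,:)
  external :: dgemm                              ! level-3 BLAS matrix-matrix multiply

  allocate (c(n,n), d(n,n), e(n,n))
  call random_number(d)
  call random_number(e)

  ! c := 1.0*d*e + 0.0*c, written directly into c with no intermediate temporary
  call dgemm('N', 'N', n, n, n, 1.0d0, d, n, e, n, 0.0d0, c, n)

  print *, c(1,1)
end program gemm_example

Of course, this trades a dependency on an external library for more predictable performance, which is exactly the kind of trade-off the standard leaves to implementations and users.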

2 Likes

Once we have something that works, “quality of implementation” is naturally the next thing people should work on.

1 Like

We should not attribute arrogance unjustly. Compilers don’t just write themselves. There have been several prompts on this forum before to contribute to open-source compilers if you can.

With commercial compilers the story is different. There I’m afraid that unless you have an official support contract there’s not much you can do other than report errors. But even with freely-available commercial compilers I’d say “don’t look a gift horse in the mouth”.

3 Likes

I agree.

However, I find that nobody will listen unless I exaggerate the problems a bit. So I have started to be loud and aggressive, so that Fortran will continue to exist in good health.

2 Likes