Testing the performance of `matmul` under default compiler settings

Another example of widely-used scientific computing infrastructure, that doesn’t default to an optimized blas, is the R interpreter.

They show how to do it here, and discuss the various trade-offs.

On my machine, the default R interpreter took 713s to do a matrix multiplication with n=10000 (single core), vs the above results of 115s with gfortran’s default matmul (also single core) and 8s using gfortran with the external blas (6 cores with hyperthreading).

2 Likes

The choice of algorithm for the various fortran intrinsic procedures is a general issue, not specific to matmul. There are two extreme possibilities.

  1. The fortran intrinsic should return the most accurate result that is practical using robust algorithms and memory allocation that fails as seldom as possible. Then if the user wants to use a different algorithm to improve performance, then the user should know his application enough to choose the appropriate algorithm with the correct tradeoffs between accuracy and speed.

  2. The fortran intrinsic should always return the fastest result, even if accuracy or robustness is compromised. The user is then responsible for using a different algorithm to improve the accuracy to be appropriate for his application.

Think about trig functions, where accurate results require much effort to reduce the input argument, then followed by the appropriate taylor expansion, or the appropriate pade approximate, or continued fraction, or whatever. Or think about random numbers, where the fast ones might not satisfy all of the exotic tests for randomness, while the good ones with long periods that do pass the tests can be more expensive. Or in the case of matmul, the single-core version is almost certainly fastest for small matrices, while for larger matrices there are various choices of multiple threads, multiple cores, AVX hardware, gpu hardware, and so on. The use of fast stack allocation (which can sometimes fail) vs. slower heap allocation (which is more robust) of intermediate work arrays is also an issue.

So which of these choices should fortran do? And does the answer for matmul() apply also to the other intrinsic procedures, such as random_number(), sin(), cos(), sqrt(), etc.? Should this be selectable with compiler options or with compiler directives within the code?

For the stack/heap allocation issue, the choices are stack allocation, heap allocation, or a conditional approach where stack allocation is first tested, and if that fails then heap allocation is invoked.

4 Likes

Ron,
Your conditional stack approach is so sensible, but rarely adopted.
Actually, I don’t know any operating system that has !!

Imagine a world where this conditional approach was the general case. The forum Stack Overflow would probably never have been started !! as stack overflow errors would not be so common.

I did a google search of “what is the stack size limit for Windows 64 bit”. It is very worrying to see some of the answers that google identifies, such as 1MB

The existing approach for the stack is really not suitable for a 64-bit OS. My understanding is that code + stack is limited to 2 Gbytes. I recall @slionel suggesting a limit for combined code + stack, which appears to be based on 32-bit Windows constraints. This approach was such a lost opportunity for a new 64-bit OS.

The only recent improvement to stack design, that I have noticed, is with OpenMP where each thread is given it’s own stack.
It is a shame that each thread was not also given it’s own heap address. At present 64-bit OS can address a 48-bit memory address space, which is much larger than available physical memory. Seperate heap addreses should be a possibility.
What I would like to have is for each heap private array to start on a new memory page to limit memory conflicts between threads.

2 Likes

Take matmul as an example. Thanks to the development of numerical analysis and scientific computing for so many years, there is no real need for tradeoffs between speed and accuracy, unless the matrices under consideration are huge (in a modern sense) or the requirement for speed/accuracy is extraordinarily high, in which case the user should consider specially designed algorithms rather than intrinsic functions anyway. For “daily-life” usage such as multiplying 20000-by-20000 matrices, we know quite well how to do it sufficiently fast with good accuracy.

I totally agree. Indeed, this seems to be the only reasonable approach. How could we make it happen at least for Fortran?

I understand quite well the nontriviality, and I totally agree that more contributions are needed for open-source compilers (a huge thank to anyone who has contributed).

On the other hand, I do believe that commercial vendors like Intel have sufficient resources to move in this direction, and I do not believe that the engineers at MathWorks are much more intelligent/capable than those at Intel.

If MathWorks can optimize MATLAB intrinsic procedures up to a satisfactory level with the help of Fortran, Intel should be able to do it (much) better for the corresponding Fortran intrinsic procedures with the help of Fortran itself.

However, look at the reality again.

Few years back I asked Intel why matmul speed was so different for different optimization levels -O… For O3 they used MKL tuned code at that time. I do not see any reason to make complain regarding this choice if still applied.

Fortran (as in the language standard) doesn’t prescribe any kind of optimization for matmul. It is the goal of the standard to try and remain agnostic of the hardware it runs on. If anything you should ask why do different compiler vendors pick different default settings.

In the Cleve’s Corner on the MathWorks pages I have found that:

We use the Intel Math Kernel Library, which includes multithreaded versions of the BLAS (Basic Linear Algebra Subroutines). For vector arguments, the MATLAB elementary function library, which includes exponential and trigonometric functions, is multithreaded.

The reason MathWorks (and Intel) are able to get the performance is because they probably __don’t use__the default compilers, instead taking their time to read the compiler documentation and pick the right optimization flags and fine-tune critical operations for maximum performance. Once they are happy, they add these flags to their build system, and go on with their day.

I’ve tested your code on an Intel Core™ i7-11700K @ 3.60GHz. With OMP_NUM_THREADS=8, the N=20000 case takes c. 80 seconds to complete both the naive matrix product and the matmul version combined. The command I used for compilation was $ ifort -O3 -xHost -qopenmp -qmkl=parallel -heap-arrays 40 -qopt-matmul test_matmul.f90.

Why can’t these flags be the default? My guess is the needs of Fortran users to exert fine control over their hardware are simply much more diverse than those of MATLAB users. Other reasons why a compiler fail at a specific stack size are probably historic. If you look into the categories of flags available with ifort -help, there are hundreds of options for all kinds of scenarios. At super-computing centers, users will often pick specific -xcode flags and target even specific instructions sets that are known to work best. They also might spend days or weeks optimizing and fine-tuning dominant kernels in their codes.

Now just like MATLAB has extensive documentation pages, that teach you how to use “vectorized” operations, masks, and other performant MATLAB constructs, Fortran compiler vendors also provide some performance tuning guides. I’ve found that the ones from Sun Microsystems and IBM contain a lot of valuable tips on how to get good performance:

To draw an analogy, using default settings is a bit like how you learn to drive with the base model of a car. You can’t just start with Formula 1 car. First you need to learn how to drive the base model. As you become more comfortable and start hitting the limits of what the car can do, you start upgrading. Now even if you had an F1 car, if you were driving it in the city, it would be of no benefit, the starting and stopping at traffic lights, would bring you to a halt. You also need the right roads to exercise it, e.g. to maintain the engine, brake, and tire temperatures to get the peak performance.

8 Likes

Hello @ivanpribec Ivan, thank you very much for the informative and insightful input! I appreciate very much your posts with detailed data and references, a style that all of us should follow. :slight_smile:

I like your metaphor about the base mode of a car. However, this base mode should also evolve along with time. I suppose that most users today would expect and be comfortable with the base mode of a 21-century car, but not that of a 19-century one.

For most users, when they choose either a car or a language, they just need and will choose “something that simply works”.

I guess this is why Python becomes so popular — it provides very easy access to tools that “simply works” for the user’s daily jobs. Users do not need to tune these tools unless their need is extreme.

I guess this is also why Fortran was popular among scientists. It simply works for the user’s daily jobs. However, since the birth of Fortran, the group of “scientists” has expanded largely, and their “daily jobs” have evolved largely. Within this large group of scientists, I guess over 90 percent (if not more) of them do not want to spend any time tuning their car, or their programming language. In addition, their daily jobs have become much more challenging than 50 years ago, so that something simply worked at that time simply does not work anymore.

Finally, I would like to point out one thing I think is a trend in the development of science, technology, and engineering: let the professionals do the professional thing. This is why we have so many subdomains in science/technology/engineering, and this is how scientists/technologists/engineers can focus on advancing their own subdomains while benefiting from the advances in other subdomains that they do not need/want to know anything about.

A tool that needs tuning for daily jobs is bound to be abandoned. For me, this seems a simple truth that is applicable to any field. I will be among the last to abandon it for multiple reasons if this ever happens, and I want to be loud about this truth so it will never happen.

Thank you for this information. Putting performance aside, what about robustness? Is segfault ever acceptable as a default behavior?

Regarding matmul it happend once and I reported this to Intel support. Since then it works for me as expected.

This is very interesting. I have tried the code in my post on several different machines (all installed Ubuntu 20.04/22.04) with the latest version of ifort, and it always leads to a quick segfault like the following:

ifort -g -warn all test_matmul.f90 && ./a.out 
 Initialize
 MATMUL
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
a.out              000000000040601A  Unknown               Unknown  Unknown
libpthread-2.31.s  00007FC09489E420  Unknown               Unknown  Unknown
a.out              0000000000405042  Unknown               Unknown  Unknown
a.out              0000000000403822  Unknown               Unknown  Unknown
libc-2.31.so       00007FC0946BA083  __libc_start_main     Unknown  Unknown
a.out              000000000040372E  Unknown               Unknown  Unknown

Note that the compiler options are -g -warn all.

True, we can circumvent the segfault by -heap-arrays, but it is not the point. It is a workaround rather than a fix of the bug. There is no acceptable excuse for a segfault as the default behavior.

Maybe it is only me who has a buggy copy of Intel oneAPI, or flawed machines.

Process stacksize limits vary by OS and distro. macOS 8s 8192K by default for example, same with a default CentOS 8.2.

OpenMP apps do tend to require more stack. Most admins of HPC cluster unlimit the process stacksize for this reason. Or users just modify their .bashrc to do this.

The debate of stack vs heap allocations for the Intel compiler has gone on for years. Stack is easier to manage, less overhead, etc. Actually helps performance. I have a user this month complaining that using heap-arrays instead of default stack allocation slowed their code 2% and want us “to fix heap arrays”. Yes, right, of course, so simple, why didn’t we think of that? . Lots of array temps in that code over millions of iterations.

Different compilers take different strategies on heap vs stack. I do agree that out of box at O2 giving a segfault is probably not friendly. Counter that with the argument that O2 asks for performance opts, and stack allocation can (by some compilers) to be a valid optimization - you asked for optimization, no? Or you can argue that if you are doing HPC, and probably OpenMP or using threaded libraries for multicore, you’d have unlimited stack by default. Anyhow, we can debate all day and never come to consensus.

The matmul performance is interesting. I’ll look into this example.

1 Like

@greenrongreen, the “debate of stack vs heap” can be clarified if the size of the arrays is also considered.
(My impression is) For “large” arrays, there is little difference between stack and heap compute efficiency, but for “small” arrays, it can be more efficient to have these on the stack and so have better access to L1 cache. Small arrays are also probably more frequently allocated.
For OpenMP, especially on the heap, it would be an improvement if large arrays could be positioned on a new memory page to mitigate memory consistency problems. (I have not seen this as a compile option?)

program test_matmul modification
I have reviewed the loop order in program test_matmul and enhanced the program to provide comparison with other approaches. These other approaches consider:

  • revised DO loop order
  • partitioning the calculation into sub matrices to reduce memory <> cache transfers
  • applying OpenMP to improve performance
  • applying partitioning to OpenMP to reduce memory <> cache transfers

The test is for a range of matrix sizes x 7 coded approaches.
! Program to test different approaches to emulate MATMUL
!
! 1 MATMUL use intrinsic routine
! 2 maprod_v0 original proposed DO order : j i k
! 3 maprod_v1 revised DO order j k i to improve sequential memory access
! 4 maprod_v2 replace inner loop with array syntax
! 5 maprod_part attempt to partition the multiplication to improve L1 Cache usage
! 6 maprod_omp OpenMP approach from V2 using temporary column vector on thread stack
! 7 maprod_omp_part OpenMP approach on J that includes partitioning ( different sizes )

I am reporting as both elapsed time and computation as GFLOP/sec vs matrix size.
This GFLOP/sec vs matrix size shows how the memory <> cache transfers bottleneck is affected by the different strategies, which helps to demonstrate the benefits of the different strategies.
The range of sizes should be used carefully, as the elapsed time is related to O(n^3) so larger sizes take a lot longer!

The partitioning sizes ( di,dj,dk ) are related to the processor cache sizes and also the number of threads available. I use different sizes for single thread (related to L1 cache size) vs OpenMP (related to L1 and shared L3 cache size). The optimal sizes will vary with processor.

Hopefully these results show some how each computation approach addresses the memory <> cache bottleneck. The approaches provided do show significant improvement on the initial posting.

I tested the program using gfortran, built with

del %1.exe
del %1.o
set vec=-march=native
set options=-fimplicit-none -O3 %vec% -fstack-arrays -ffast-math -fopenmp

gfortran %1.f90 %options% -o %1.exe

dir %1.* /od

I am sure there are better approaches available.
I would welcome suggestions so we could all learn from these better approaches.

test_matmulp.f90 (11.6 KB)

2 Likes

Thank you for this very detailed and careful testing.

On my machine, the result for 2000-dimensional matrices is

 Initialize        2000
 int   552.86281235372076        1.3641471130000000     
 V1    2.8421709430404007E-012   48.336866372000003     
 V2    2.8421709430404007E-012   24.470532088999999     
 Par   2.8421709430404007E-012   13.288560129000000     
 Omp   2.8421709430404007E-012   17.728390386000001     
 OPR   2.8421709430404007E-012   15.154852551999999     
 int   2.8421709430404007E-012   2.3784506919999999     
 V1    2.8421709430404007E-012   91.001646123000000     
 V2    2.8421709430404007E-012   46.876135239999996     
 Par   2.8421709430404007E-012   24.167207832999999     
 Omp   2.8421709430404007E-012   31.981271034000002     
 OPR   2.8421709430404007E-012   29.062833046999998     
 int   2.8421709430404007E-012   3.8525379700000002     
 V1    2.8421709430404007E-012   137.83232808500000     
 V2    2.8421709430404007E-012   69.763581262999992     
 Par   2.8421709430404007E-012   35.665185921999999     
 Omp   2.8421709430404007E-012   45.847050401000004     
 OPR   2.8421709430404007E-012   41.282760373000002     
 int   2.8421709430404007E-012   4.8537640529999999     
 V1    2.8421709430404007E-012   183.49185632699999     
 V2    2.8421709430404007E-012   90.303348412999995     
 Par   2.8421709430404007E-012   48.394379927999999     
 Omp   2.8421709430404007E-012   63.160495253000008     
 OPR   2.8421709430404007E-012   53.523109639000005     
Succeed   2000   0.077950   1.213441   0.000000  45.872964  22.575837  12.098595  15.790124  13.380777

I am particularly interested in comparing V1 and V2, the latter using broadcasting instead of a loop. How should I interpret the comparison between them according to the result?

I don’t know the compiler, compiler options or processor you used, but from what I see:
V2 with array syntax performs much better than V1 loop syntax.
Did you enable OpenMP ?
The MATMUL intrinsic is performing much better than any coded options, which shows there is much more improvement possible, in comparison the options I provided. It would suggest that the partitioning sizes are not appropriate for your cache sizes.

My Ryzen results are:

Succeed   2000   0.021353   0.753539   0.000000   0.901243   0.900175   0.709549   0.136021   0.087422
  
GFLOP/s   2000     0.027     9.918     0.000     8.342     8.365    10.612    56.898    87.511
1 Like

Sorry, I forgot to mention the compiler and options. What I did was simply

gfortran test_matmul.f90

I suppose this did not enable OpenMP.

This is very interesting. It is true in general that the whole array assignment

a = b

is faster than a loop?

We would hope that the compiler will optimise how to code this as a sequential memory copy. It is very basic array syntax, to get right.

You could try :

gfortran test_matmulp.f90 -fimplicit-none -O3 -march=native -fstack-arrays -ffast-math -fopenmp -o test_matmulp.exe

1 Like

Totally agree. I do not see any reason why anyone should write such an array operation by a loop. The compiler has all the information about what the user wants to do. They should be able to get things optimized easily, if they are willing to do so.

Meanwhile, I do not see a big essential difference between a = b and c = matmul(d, e). If complilers can easily optimize a = b, they should be able to optimize c = matmul(d, e) as well (not to mention that many other languages have already successfully done it), if they are willing to do so. Unfortunately, I see an unreasonably huge and arrogant resistance.

These things are all “quality of implementation” issues. Maybe a compiler will do these high-level constructs right, and maybe they won’t. Or maybe they will get them right when the arrays are contiguous, but not otherwise. As written, the statement c = matmul(d, e) might be expected to allocate some temporary storage (from the stack, or the heap) for the function evaluation, then assign that temporary array to c, and then release the temporary memory. One would hope that the optimizer would recognize that the memory allocation and deallocaton is unnecessary in this simple statement, but who knows. Or what if the statement were a little more complicated with the matmul() in the middle? Or what if instead of the intrinsic matmul, there were a user written function? None of these things are really defined by the standard, they are all up to the compiler and its optimization flags.

1 Like