Compiler options and matmul speedup

Hi, I’m looking at the impact of different compiler options on the speed of vector matrix multiplication. I compared both Fortran and C, and got essentially the same top speed, but Fortran’s matmul intrinsic was much faster with no optimization turned on (and, interestingly, it gets slowed way down by -O3). See the gist below. I’m curious if anybody has thoughts on the analysis I did: are there other options I should try, other circumstances in which the matmuls are occurring, something I overlooked? Thanks!
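For reference, here is a minimal sketch of the kind of timing loop being described (this is an assumption about the gist’s structure, not its actual contents; the array size and file layout are made up):

```fortran
program bench_matmul
   implicit none
   integer, parameter :: n = 4096
   real(8), allocatable :: a(:, :), x(:), y(:)
   integer(8) :: t0, t1, rate

   allocate(a(n, n), x(n), y(n))
   call random_number(a)
   call random_number(x)

   call system_clock(t0, rate)
   y = matmul(a, x)          ! the intrinsic being benchmarked
   call system_clock(t1)

   print *, 'matmul time (s):', real(t1 - t0, 8) / rate
   print *, 'y(1) =', y(1)   ! use the result so it is not optimized away
end program bench_matmul
```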


You don’t say which compiler you are using, but from the options I’m guessing it’s gfortran. Please note that different compilers will not necessarily show the same behavior.

For information, your Fortran OpenMP code is not correct: j and val should be private. In C they are declared within the parallel region, so they are implicitly private.
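A minimal sketch of the corrected loop (the variable and array names are assumptions, not necessarily those of the gist):

```fortran
program mv_omp
   implicit none
   integer, parameter :: n = 4096
   real(8), allocatable :: a(:, :), x(:), y(:)
   real(8) :: val
   integer :: i, j

   allocate(a(n, n), x(n), y(n))
   call random_number(a)
   call random_number(x)

   ! i is private by default as the parallel loop index; j and val must
   ! be declared private so each thread gets its own copy, otherwise
   ! threads overwrite each other's inner index and partial sum.
   !$omp parallel do private(j, val)
   do i = 1, n
      val = 0.0d0
      do j = 1, n
         val = val + a(i, j) * x(j)
      end do
      y(i) = val
   end do
   !$omp end parallel do

   print *, 'y(1) =', y(1)
end program mv_omp
```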

By the way, multithreading matrix-vector multiplication is known not to be very efficient, at least on consumer hardware.


Yes, it’s gfortran-10.

Your example is matrix-vector multiplication. Perhaps the title would be clearer if it explicitly mentioned matrix-vector multiplication. If you try the Intel compilers, you can use -qopt-report to find out which optimizations are applied. On modern CPUs you may also want to benefit from AVX2 or AVX-512, …
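For example (the file name mv.f90 is hypothetical; the report detail level is a matter of taste):

```sh
# Intel: emit an optimization report and target the host's instruction set
ifort -O3 -xHost -qopt-report=2 mv.f90

# gfortran: report vectorization decisions and target the host's instruction set
gfortran -O3 -march=native -fopt-info-vec mv.f90
```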

Thanks, I made the change but it didn’t affect the performance. I am surprised (naively?) that matrix-vector multiplication is not efficient across threads because it seems like something that would be very easy to parallelize.

It’s easy to parallelize, but it has a moderately low arithmetic intensity (AI), that is, the ratio between the number of arithmetic operations and the number of memory reads/writes is low. Under these conditions the bottleneck is the bandwidth between the CPU and the memory, and using more cores does not help.
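A rough back-of-the-envelope estimate, assuming double precision and an $n \times n$ matrix too large to fit in cache: each matrix element is read once (8 bytes) and used for one multiply and one add, so

$$\mathrm{AI} \approx \frac{2n^2 \ \text{flops}}{8n^2 \ \text{bytes}} = 0.25 \ \text{flop/byte}.$$

At, say, 25 GB/s of memory bandwidth, that caps the whole computation at about 6 GFLOP/s, which a single modern core can already deliver on its own.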

This is particularly true of the Intel Core CPUs, which have pretty good single-core performance (all the more so as the turbo boost frequency often kicks in when a single core is used), high enough to saturate the bandwidth to/from the memory for such simple computations. This is less true of the Xeon line, which has a higher bandwidth (and no turbo boost, IIRC), or of the AMD CPUs, which have more cores with lower single-core performance.


With gfortran you can use the -fexternal-blas compiler flag, which inserts calls to BLAS in place of the intrinsic matmul function above a certain matrix size (tunable with -fblas-matmul-limit). In addition you have to link an optimized BLAS library:

| Library | Link flags | Comment |
|---|---|---|
| OpenBLAS | `-lopenblas` | |
| BLIS | `-lblis` | |
| Intel oneMKL | see the Link Line Advisor for oneMKL | for Intel processors |
| Accelerate | `-framework Accelerate` | for macOS |
| AOCL-BLAS | `-lblis` | for AMD processors; derived from BLIS |
| Arm Performance Libraries | `-larmpl_lp64` | for Arm processors |
| MATLAB BLAS | `-lmwblas` | for MATLAB MEX functions |
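For example, with OpenBLAS (assuming a hypothetical source file mv.f90 and that OpenBLAS is on the linker’s search path):

```sh
gfortran -O2 -fexternal-blas mv.f90 -lopenblas
```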

AMD seems to have higher single-core bandwidth:
https://sites.utexas.edu/jdm4372/2023/04/25/the-evolution-of-single-core-bandwidth-in-multicore-processors/


It’s true that in recent years, bandwidth has tended to grow faster than single-core performance.

Thanks, your explanation made a big difference in how I approached the problem I’m working on.