Hi, I’m looking at the impact of different compiler options on the speed of vector matrix multiplication. I compared with both Fortran and C, and got essentially the same top speed but Fortran’s matmul intrinsic was much faster with no optimization turned on (and interestingly gets slowed way down by -O3). See the gist below. I’m curious if anybody has thoughts on the analysis I did - are there other options I should try, other circumstances in which the matmuls are occurring, something I overlooked? Thanks!
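For context, a stripped-down version of the kind of comparison in the gist might look like this (illustrative only, not the actual gist code; the array names, sizes, and repetition count are placeholders). Compiling it once with no flags and once with -O3 reproduces the kind of comparison described above:

```fortran
program matvec_bench
   implicit none
   integer, parameter :: n = 4096, nrep = 100
   real(8), allocatable :: a(:,:), x(:), y(:)
   real(8) :: t0, t1
   integer :: i, j, r

   allocate (a(n,n), x(n), y(n))
   call random_number(a)
   call random_number(x)

   ! Time the intrinsic matmul
   call cpu_time(t0)
   do r = 1, nrep
      y = matmul(a, x)
   end do
   call cpu_time(t1)
   print '(a,f8.3,a)', 'matmul:        ', t1 - t0, ' s'
   print *, 'checksum:', sum(y)   ! keep the result live so the loop is not optimized away

   ! Time a naive explicit double loop (dot-product form; note the inner
   ! loop strides across rows of a column-major array, which matters for speed)
   call cpu_time(t0)
   do r = 1, nrep
      do i = 1, n
         y(i) = 0.0d0
         do j = 1, n
            y(i) = y(i) + a(i,j) * x(j)
         end do
      end do
   end do
   call cpu_time(t1)
   print '(a,f8.3,a)', 'explicit loop: ', t1 - t0, ' s'
   print *, 'checksum:', sum(y)
end program matvec_bench
```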
You don’t say which compiler you are using, but from the options I’m guessing it’s gfortran. Please note that different compilers will not necessarily show the same behavior.
For information, your Fortran OpenMP code is not correct: `j` and `val` should be `private`. In C they are declared within the parallel region, so they are implicitly private.
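For reference, the corrected directive might look like this (a sketch only; `a`, `x`, `y`, and `n` are my placeholder names, since the gist isn't reproduced here):

```fortran
subroutine matvec_omp(n, a, x, y)
   implicit none
   integer, intent(in) :: n
   real(8), intent(in) :: a(n,n), x(n)
   real(8), intent(out) :: y(n)
   integer :: i, j
   real(8) :: val

   ! i is the parallel loop index and is private automatically;
   ! j and val must be declared private explicitly (they are shared by default)
   !$omp parallel do private(j, val)
   do i = 1, n
      val = 0.0d0
      do j = 1, n
         val = val + a(i,j) * x(j)
      end do
      y(i) = val
   end do
   !$omp end parallel do
end subroutine matvec_omp
```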
By the way, multithreading matrix-vector multiplication is known not to be very efficient, at least on consumer hardware.
Yes it’s gfortran-10
Your example is matrix-vector multiplication; the title would be clearer if it mentioned matrix-vector multiplication explicitly. If you try the Intel compilers, you can use -qopt-report to find out which optimizations are applied. On modern CPUs you may also want to benefit from AVX2 or AVX512, etc.
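For example (assuming the classic ifort driver; the file name is a placeholder):

```
ifort -O3 -xCORE-AVX2 -qopt-report=2 -qopt-report-phase=vec matvec.f90
```

This writes a `.optrpt` file describing, among other things, which loops were vectorized and why others were not.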
Thanks, I made the change but it didn’t affect the performance. I am surprised (naively?) that matrix-vector multiplication is not efficient across threads because it seems like something that would be very easy to parallelize.
It’s easy to parallelize, but it has a fairly low arithmetic intensity (AI): the ratio between the number of floating-point operations and the number of memory reads/writes is low. Under these conditions the bottleneck is the bandwidth between the CPU and the memory, and using more cores does not help.
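To put rough numbers on it (my own back-of-the-envelope estimate, not a measurement): for a double-precision $n \times n$ matrix-vector product, each of the $n^2$ matrix elements is read once (8 bytes) and used in one multiply and one add (2 flops), so

$$\mathrm{AI} \approx \frac{2n^2\ \text{flops}}{8n^2\ \text{bytes}} = 0.25\ \text{flops/byte}.$$

A machine with, say, 40 GB/s of memory bandwidth therefore tops out around 10 Gflop/s on this kernel, which a single modern core can already sustain; extra threads just wait on the same memory bus.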
This is particularly true of the Intel Core CPUs, which have quite good single-core performance (all the more so since the turbo-boost frequency is typically reached when only a single core is active), high enough to saturate the memory bandwidth for such simple computations. It is less true of the Xeon line, which has higher bandwidth (and no turbo boost, IIRC), or of the AMD CPUs, which have more cores with lower single-core performance.
With gfortran you can use the `-fexternal-blas` compiler flag, which inserts calls to BLAS in place of the intrinsic `matmul` function above a certain matrix size. In addition you have to link an optimized BLAS library:
| Library | Link flags | Comment |
|---|---|---|
| OpenBLAS | `-lopenblas` | |
| BLIS | `-lblis` | |
| Intel oneMKL | see Link Line Advisor for oneMKL | for Intel processors |
| Accelerate | `-framework Accelerate` | for macOS |
| AOCL-BLAS | `-lblis` | for AMD processors; derived from BLIS |
| Arm Performance Libraries | `-larmpl_lp64` | for Arm processors |
| MATLAB BLAS | `-lmwblas` | for MATLAB MEX functions |
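For example, with OpenBLAS the compile line might look like this (the file name is a placeholder; `-fblas-matmul-limit` sets the matrix size above which `matmul` is dispatched to BLAS, and whether the matrix-vector case actually gets dispatched may depend on the gfortran version):

```
gfortran -O2 -fexternal-blas -fblas-matmul-limit=32 matvec.f90 -lopenblas
```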
AMD seems to have higher single-core bandwidth:
https://sites.utexas.edu/jdm4372/2023/04/25/the-evolution-of-single-core-bandwidth-in-multicore-processors/
It’s true that in recent years, bandwidth has tended to grow faster than single-core performance.
Thanks, your explanation made a big difference in how I approached the problem I’m working on.