Hi, I’m looking at the impact of different compiler options on the speed of vector matrix multiplication. I compared with both Fortran and C, and got essentially the same top speed, but Fortran’s matmul intrinsic was much faster with no optimization turned on (and, interestingly, gets slowed way down by -O3). See the gist below. I’m curious if anybody has thoughts on the analysis I did: are there other options I should try, other circumstances in which the matmuls are occurring, something I overlooked? Thanks!
You don’t say which compiler you are using, but from the options I’m guessing it’s gfortran. Please note that different compilers will not necessarily show the same behavior.
For information, your Fortran OpenMP code is not correct: `j` and `val` should be `private`. In C they are declared within the parallel region, so they are implicitly private.
By the way, multithreading matrix-vector multiplications is known to not be very efficient, at least on consumer hardware.
Yes, it’s gfortran 10
Your example is matrix-vector multiplication. Perhaps the title would be clearer if it explicitly mentioned matrix-vector multiplication. If you try the Intel compilers, you can use `-qopt-report` to find out which optimizations are applied. On modern CPUs you may want to benefit from AVX2 or AVX-512, …
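With gfortran there is a rough counterpart to Intel’s `-qopt-report`: the `-fopt-info` family of flags. A command sketch (the file name `matvec.f90` is just an example, not from the gist):

```shell
# Report which loops the compiler managed to vectorize:
gfortran -O3 -march=native -fopt-info-vec-optimized -c matvec.f90
# -march=native enables AVX2/AVX-512 if the host CPU supports them;
# -fopt-info-vec-missed additionally shows loops that were NOT vectorized and why.
```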
Thanks, I made the change but it didn’t affect the performance. I am surprised (naively?) that matrix-vector multiplication is not efficient across threads, because it seems like something that would be very easy to parallelize.
It’s easy to parallelize, but it has a moderately low arithmetic intensity (AI), that is, the ratio between the number of operations and the number of memory reads/writes is low. In these conditions, the bottleneck is the bandwidth between the CPU and the memory, and using more cores does not help.
This is particularly true of the Intel Core CPUs, which have pretty good single-core performance (all the more so since the turbo boost frequency is often engaged when a single core is used), high enough to saturate the bandwidth to/from the memory for such simple computations. This is less true of the Xeon line, which has a higher bandwidth (and no turbo boost, iirc), or of the AMD CPUs, which have more cores with lower single-core performance.
With gfortran you can use the `-fexternal-blas` compiler flag, which inserts calls to BLAS in place of the intrinsic `matmul` function above a certain matrix size. In addition you have to link an optimized BLAS library:
| Library | Link flags | Comment |
| --- | --- | --- |
| OpenBLAS | `-lopenblas` | |
| BLIS | `-lblis` | |
| Intel oneMKL | see Link Line Advisor for oneMKL | for Intel processors |
| Accelerate | `-framework Accelerate` | for macOS |
| AOCL-BLAS | `-lblis` | for AMD processors; derived from BLIS |
| Arm Performance Libraries | `-larmpl_lp64` | for Arm processors |
| MATLAB BLAS | `-lmwblas` | for MATLAB MEX functions |
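Putting the flag and the link line together, a build against OpenBLAS looks something like this (the source file name `matvec.f90` is just an example):

```shell
# matmul calls above the size threshold are redirected to DGEMV/DGEMM
# from the linked BLAS instead of gfortran's built-in implementation:
gfortran -O2 -fexternal-blas matvec.f90 -lopenblas -o matvec
# For oneMKL, the single-dynamic-library shortcut is -lmkl_rt, but
# check the Link Line Advisor for the exact flags for your setup.
```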
AMD seems to have higher single-core bandwidth:
https://sites.utexas.edu/jdm4372/2023/04/25/the-evolution-of-single-core-bandwidth-in-multicore-processors/
It’s true that in recent years, bandwidth has tended to grow faster than single-core performance.
Thanks, your explanation made a big difference in how I approached the problem I’m working on.