Compiler options and matmul speedup

Hi, I’m looking at the impact of different compiler options on the speed of vector matrix multiplication. I compared both Fortran and C, and got essentially the same top speed, but Fortran’s matmul intrinsic was much faster with no optimization turned on (and, interestingly, it gets slowed way down by -O3). See the gist below. I’m curious if anybody has thoughts on the analysis I did: are there other options I should try, other circumstances in which the matmuls are occurring, something I overlooked? Thanks!
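For reference, here is a minimal sketch of the kind of timing loop being described (this is an assumption about the gist’s structure, not its actual contents; the array size and file layout are made up):

```fortran
program bench_matmul
   implicit none
   integer, parameter :: n = 4096
   real(8), allocatable :: a(:, :), x(:), y(:)
   integer(8) :: t0, t1, rate

   allocate(a(n, n), x(n), y(n))
   call random_number(a)
   call random_number(x)

   call system_clock(t0, rate)
   y = matmul(a, x)          ! the intrinsic being benchmarked
   call system_clock(t1)

   print *, 'matmul time (s):', real(t1 - t0, 8) / rate
   print *, 'y(1) =', y(1)   ! use the result so it is not optimized away
end program bench_matmul
```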


You don’t say which compiler you are using, but from the options I’m guessing it’s gfortran. Please note that different compilers will not necessarily show the same behavior.

For information, your Fortran OpenMP code is not correct: j and val should be private. In C they are declared within the parallel region, so they are implicitly private.
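A minimal sketch of the corrected loop (the variable and array names are assumptions, not necessarily those of the gist):

```fortran
program mv_omp
   implicit none
   integer, parameter :: n = 4096
   real(8), allocatable :: a(:, :), x(:), y(:)
   real(8) :: val
   integer :: i, j

   allocate(a(n, n), x(n), y(n))
   call random_number(a)
   call random_number(x)

   ! i is private by default as the parallel loop index; j and val must
   ! be declared private so each thread gets its own copy, otherwise
   ! threads overwrite each other's inner index and partial sum.
   !$omp parallel do private(j, val)
   do i = 1, n
      val = 0.0d0
      do j = 1, n
         val = val + a(i, j) * x(j)
      end do
      y(i) = val
   end do
   !$omp end parallel do

   print *, 'y(1) =', y(1)
end program mv_omp
```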

By the way, multithreading matrix-vector multiplication is known not to be very efficient, at least on consumer hardware.


Yes, it’s gfortran-10.

Your example is matrix-vector multiplication. Perhaps the title would be clearer if it explicitly mentioned matrix-vector multiplication. If you try the Intel compilers, you can use -qopt-report to find out which optimizations are applied. On modern CPUs you may also want to benefit from AVX2 or AVX-512, …
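For example (the file name mv.f90 is hypothetical; the report detail level is a matter of taste):

```sh
# Intel: emit an optimization report and target the host's instruction set
ifort -O3 -xHost -qopt-report=2 mv.f90

# gfortran: report vectorization decisions and target the host's instruction set
gfortran -O3 -march=native -fopt-info-vec mv.f90
```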

Thanks, I made the change but it didn’t affect the performance. I am surprised (naively?) that matrix-vector multiplication is not efficient across threads because it seems like something that would be very easy to parallelize.

It’s easy to parallelize, but it has a moderately low arithmetic intensity (AI), that is, the ratio between the number of arithmetic operations and the number of memory reads/writes is low. Under these conditions the bottleneck is the bandwidth between the CPU and the memory, and using more cores does not help.
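A rough back-of-the-envelope estimate, assuming double precision and an $n \times n$ matrix too large to fit in cache: each matrix element is read once (8 bytes) and used for one multiply and one add, so

$$\mathrm{AI} \approx \frac{2n^2 \ \text{flops}}{8n^2 \ \text{bytes}} = 0.25 \ \text{flop/byte}.$$

At, say, 25 GB/s of memory bandwidth, that caps the whole computation at about 6 GFLOP/s, which a single modern core can already deliver on its own.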

This is particularly true of the Intel Core CPUs, which have pretty good single-core performance (all the more so as the turbo boost frequency often kicks in when a single core is used), high enough to saturate the bandwidth to/from the memory for such simple computations. This is less true of the Xeon line, which has a higher bandwidth (and no turbo boost, IIRC), or of the AMD CPUs, which have more cores with lower single-core performance.


With gfortran you can use the -fexternal-blas compiler flag, which inserts calls to BLAS in place of the intrinsic matmul function above a certain matrix size (tunable with -fblas-matmul-limit). In addition you have to link an optimized BLAS library:

| Library | Link flags | Comment |
|---|---|---|
| OpenBLAS | `-lopenblas` | |
| BLIS | `-lblis` | |
| Intel oneMKL | see the Link Line Advisor for oneMKL | for Intel processors |
| Accelerate | `-framework Accelerate` | for macOS |
| AOCL-BLAS | `-lblis` | for AMD processors; derived from BLIS |
| Arm Performance Libraries | `-larmpl_lp64` | for Arm processors |
| MATLAB BLAS | `-lmwblas` | for MATLAB MEX functions |
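For example, with OpenBLAS (assuming a hypothetical source file mv.f90 and that OpenBLAS is on the linker’s search path):

```sh
gfortran -O2 -fexternal-blas mv.f90 -lopenblas
```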

AMD seems to have higher single-core bandwidth:
https://sites.utexas.edu/jdm4372/2023/04/25/the-evolution-of-single-core-bandwidth-in-multicore-processors/


It’s true that in recent years, bandwidth has tended to grow faster than single-core performance.

Thanks, your explanation made a big difference in how I approached the problem I’m working on.