C++ Standard Library dense linear algebra interface

Interesting… so I guess the C++ standard is going to absorb BLAS now?

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p1673r12.html

2 Likes

The “B” in BLAS stands for “Basic”, but the routines are complicated. Arguably a subroutine, especially one that must be fast and may be called in an inner loop, should do one kind of calculation rather than one of many possible calculations determined by flag arguments, since it takes time to check those flags. A versatile subroutine may require many arguments and be harder for a new user to call correctly. The BLAS procedures look “busy” to me. Would it make sense to break them up into specialized procedures? For example, instead of the current dgemm, which multiplies two matrices or the transposes of either matrix, there could be one subroutine for A*B, one for A'*B, one for A*B', and additional versions that add a multiple of a matrix C (something dgemm allows). It is faster to resolve things at compile time, and when there are many calls to dgemm for small matrices, it could make a difference. One can write Fortran wrappers for BLAS with fewer arguments, but the run-time cost of checking flag arguments remains.

It may be hard for humans to manually streamline a BLAS implementation such as the C code of OpenBLAS while maintaining correctness and speed, but it should be possible to write a program to do so.

The interface of the BLAS subroutine dgemm is

subroutine dgemm ( character                          TRANSA,
                   character                          TRANSB,
                   integer                            M,
                   integer                            N,
                   integer                            K,
                   double precision                   ALPHA,
                   double precision, dimension(lda,*) A,
                   integer                            LDA,
                   double precision, dimension(ldb,*) B,
                   integer                            LDB,
                   double precision                   BETA,
                   double precision, dimension(ldc,*) C,
                   integer                            LDC )
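
To make the wrapper idea above concrete, here is a minimal sketch (my own illustration, not from any library) of thin wrappers that fix the transpose flags and prefactors at the call site. Note that the run-time flag check still happens inside whatever dgemm you link against, which is exactly the remaining cost mentioned above.

module gemm_wrappers
    implicit none
    private
    public :: gemm_nn, gemm_tn
    external :: dgemm   ! resolved by whichever BLAS you link against
contains

    ! C := A*B + C (no transposes, alpha = beta = 1)
    subroutine gemm_nn(a, b, c)
        double precision, intent(in)    :: a(:,:), b(:,:)
        double precision, intent(inout) :: c(:,:)
        call dgemm('N', 'N', size(a,1), size(b,2), size(a,2), 1.0d0, &
                   a, size(a,1), b, size(b,1), 1.0d0, c, size(c,1))
    end subroutine gemm_nn

    ! C := A'*B + C
    subroutine gemm_tn(a, b, c)
        double precision, intent(in)    :: a(:,:), b(:,:)
        double precision, intent(inout) :: c(:,:)
        call dgemm('T', 'N', size(a,2), size(b,2), size(a,1), 1.0d0, &
                   a, size(a,1), b, size(b,1), 1.0d0, c, size(c,1))
    end subroutine gemm_tn

end module gemm_wrappers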

instead of the current dgemm, which multiplies two matrices, or the transposes of either matrix, there could be one subroutine for A*B, one for A'*B, one for A*B'

This is almost correct. The even better approach (which I don’t think Fortran is generic enough to support, but I may be wrong) is to have only A*B, but to use a different interface than characters for dispatching at compile time to the right low-level routine. (Julia, for example, uses its type system and multiple dispatch for this. Another option would be something like a symbol type that the compiler can introspect easily.)
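
To make that concrete, here is a rough sketch of what compile-time dispatch could look like in Fortran today, using empty “tag” derived types and a generic interface instead of character flags. The names are made up for illustration and the kernels are just matmul stand-ins, not optimized code.

module tagged_gemm
    implicit none
    private
    public :: no_trans, trans, mm

    type :: no_trans_t
    end type
    type :: trans_t
    end type
    type(no_trans_t), parameter :: no_trans = no_trans_t()
    type(trans_t),    parameter :: trans    = trans_t()

    interface mm   ! the specific routine is selected at compile time from the tags
        module procedure mm_nn, mm_tn
    end interface
contains

    subroutine mm_nn(taga, tagb, a, b, c)
        type(no_trans_t), intent(in) :: taga, tagb
        real(8), intent(in)    :: a(:,:), b(:,:)
        real(8), intent(inout) :: c(:,:)
        c = c + matmul(a, b)              ! stand-in for an optimized A*B kernel
    end subroutine mm_nn

    subroutine mm_tn(taga, tagb, a, b, c)
        type(trans_t),    intent(in) :: taga
        type(no_trans_t), intent(in) :: tagb
        real(8), intent(in)    :: a(:,:), b(:,:)
        real(8), intent(inout) :: c(:,:)
        c = c + matmul(transpose(a), b)   ! stand-in for an optimized A'*B kernel
    end subroutine mm_tn

end module tagged_gemm

! usage: call mm(trans, no_trans, a, b, c) resolves to mm_tn with no runtime flag check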

Ha, that’s kind of funny. If something like Eigen or OpenBLAS became packaged as part of the C++ standard library, with easily callable interfaces and without the above-mentioned character-type flags of LAPACK… I would honestly expect that to be the absolute end of Fortran once and for all.

I absolutely agree with this. For maximum speed, the routines should be nothing but vectorized loops that get inlined. Extra flag arguments, and the checks on those flags, are both things a speed-first numerical kernel should not be spending time on. Fortran already has the matmul and transpose intrinsics, but quite advanced compiler optimization would be required to really analyze how they are used and generate the correct optimized routines.

You also probably want blocking (although arguably that should be done at the compiler level).

Every time this sort of topic comes up, it always seems to come down to hand-tuned assembly. Fortran has no mechanism to include that. A pure Fortran source project will never be as fast as linking against OpenBLAS, Intel MKL, or any other external BLAS library (pretty much all written in C or C++, with the fast ones using the aforementioned hand-written assembly in the core).

I think this is simply reality today, and in light of that fact, Fortran should probably double down on the “high level” aspect. At the same time there doesn’t seem to be much appetite for that, but I am hopeful for generics. I think a solid generic system is currently Fortran’s best bet as far as ‘exciting new features to attract increased active development.’ Languages need speakers/writers/readers, and programming languages need actively developed projects that people actually use.

Agreed. Fortran does not have much time left.

I don’t think it will change a lot. Eigen and BLAS were/are already easily callable with standardized interfaces from C++. I mean, the current situation didn’t keep people from migrating from Fortran to C++.

3 Likes

I am more interested whether mdspan will support whole array expressions (like a = 2 * b + c) and also whether the capability of bounds check is built into the specification. Are these documented somewhere, possibly…?

Since the standard defines just an interface between the programmer and the underlying compiler, I’ve always wondered why there appears to be no appetite for defining a standard linear algebra interface for Fortran à la MATLAB. How the compiler implements that interface, by say mapping it to explicit MKL functions, is up to the developers and should be hidden from the programmer (with some mechanism to allow the programmer to select an alternative underlying linear algebra package). At least then we would have a better chance at truly portable code. Since Fortran already has a distributed data object (coarrays), you could also (with a lot of work) map the standard interface to ScaLAPACK or Trilinos for parallel processing. I guess this would step on too many toes at IMSL and NAG.
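
For what it’s worth, here is a toy sketch of the user-facing side of such an interface. The module and its solve function are purely illustrative names of my own; this particular sketch just forwards to LAPACK’s dgesv, but a vendor could map the same interface to MKL, ScaLAPACK, or anything else behind the scenes.

module std_linalg_sketch
    implicit none
    private
    public :: solve
contains

    ! Solve A x = b for x; the backend is an implementation detail.
    function solve(a, b) result(x)
        real(8), intent(in) :: a(:,:), b(:)
        real(8) :: x(size(b))
        real(8), allocatable :: awork(:,:)
        integer, allocatable :: ipiv(:)
        integer :: n, info
        external :: dgesv          ! one possible backend among many
        n = size(b)
        awork = a                  ! work on a copy so A is not overwritten
        allocate(ipiv(n))
        x = b
        call dgesv(n, 1, awork, n, ipiv, x, n, info)
        if (info /= 0) error stop 'solve: dgesv failed'
    end function solve

end module std_linalg_sketch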

Well, they were never part of the standard library. Once they are, then when a new C++ programmer is looking at doing linear algebra operations, they will find that “oh cool, the standard library just has a function that does this.” They’ll call it. It’ll be fast. If they’re especially curious they may look into the history and find that their standard library implementation is based on what used to be super-specialized, high-performance libraries. Cool, they can just use routines out of std:: and get performance that people used to have to learn entire libraries to obtain.

Speaking of history, I am rather interested in that aspect. My understanding was that Fortran was originally invented at a time when there weren’t really many “languages,” and anyone wanting high performance was pretty much stuck writing machine code by hand. How is it that, over time, any performance advantage Fortran once had has fallen to parity with, or even below, the C family of languages, whose compilers didn’t even start development until decades later?

I want to write code that executes quickly. Usually, in the first one or two sentences about “what is Fortran,” speed will be mentioned. Lots of the features one might expect to make code faster (pure, elemental) actually seem to do nothing in terms of enabling compilers to generate faster binaries. It is just odd for a language that, to my understanding, started out with the goal of going fast.

Having the function in the standard library is indeed a step forward, but just a small step. Any C++ developer who needs linear algebra has always been able to quickly find and use the Eigen and BLAS libraries.

I can’t see any reason why C/C++ could not eventually be as fast as Fortran, from the moment where the people in charge decide that speed is an objective.

elemental aims at having a rank-agnostic routine that is as fast as an inline loop on the elements. It can’t be faster than an inline loop.
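
For example (a trivial illustration of my own), one scalar definition serves every rank, and at best the compiler turns the array call into the loop you would have written by hand:

elemental function twice_plus_one(x) result(y)
    real(8), intent(in) :: x
    real(8) :: y
    y = 2.0d0*x + 1.0d0
end function twice_plus_one

! b = twice_plus_one(a) works whether a is a scalar, a(:), or a(:,:,:)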

For the rest, the main and most critical problem is resources. Some features of the language (such as pure) possibly make some optimizations possible, but people are needed to actually implement those optimizations in the compilers.

1 Like

You are late to the party. It has already been discussed on Discourse:

You can try the C++ Standard BLAS using the reference Kokkos stdBLAS implementation (GitHub - kokkos/stdBLAS: Reference Implementation for stdBLAS).

It’s also available (experimentally) as part of the NVIDIA HPC SDK version 22.7 (released July 2022).


In Fortran this requires compiler support. It can be done already by gfortran when you use the option -fexternal-blas and also by the NVIDIA Fortran compiler. For more info, see


See my previous response. gfortran and nvfortran already know how to do this when you turn on the right options. What the compilers don’t always know in advance is the size of your problem or the architecture or the platform you’d like to run on. There is no “one size fits all” routine.


These concerns are shared by many BLAS and LAPACK users. Luckily the authors have recently commented on this. In the C++ API for BLAS and LAPACK (PDF, 1.2 MB) they write:

Occasionally, users express concern about the overhead of error checks. For even modestly sized matrices, error checks take negligible time. [emphasis added] However, for very small matrices, with n < 20 or so, there can be noticeable overhead. Intel introduced MKL_DIRECT_CALL to disable error checks in these cases. However, libraries compiled for specific sizes, either via templating or just-in-time (JIT) compilation, provide an even larger performance boost for these small sizes; for instance, Intel’s libxsmm3 for extra-small matrix-multiply or batched BLAS for sets of small matrices. Thus, users with such small matrices are encouraged to use special purpose interfaces rather than try to optimize overheads in a general purpose interface.

With gfortran and matmul at least, you have two tuning levers, -fblas-matmul-limit=n for when -fexternal-blas is in effect, and -finline-matmul-limit=n otherwise.
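
A minimal way to try those levers (the file name and the limit value 32 are just examples; roughly, with -fexternal-blas sizes at or above the limit go to the linked BLAS, while otherwise sizes up to the inline limit are expanded inline):

! demo.f90
! gfortran -O2 -fexternal-blas -fblas-matmul-limit=32 demo.f90 -lopenblas
! gfortran -O2 -finline-matmul-limit=32 demo.f90
program matmul_levers
    implicit none
    real(8) :: a(64,64), b(64,64), c(64,64)
    call random_number(a)
    call random_number(b)
    c = matmul(a, b)   ! dispatched to BLAS or inlined depending on the flags above
    print *, c(64,64)
end program matmul_levers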


No whole-array expressions. The rationale is explained in the C++ proposals: it’s difficult to agree on a common notation. For example, in Fortran we use * for element-wise multiplication, while in MATLAB * is the matrix product (and .* the element-wise one). Python faced the same problem (PEP 465 – A dedicated infix operator for matrix multiplication | peps.python.org) and introduced @ to prevent confusion.
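
A small illustration of the clash (values are arbitrary):

program star_vs_matmul
    implicit none
    real(8) :: a(2,2), b(2,2), ew(2,2), mp(2,2)
    a = reshape([1.d0, 2.d0, 3.d0, 4.d0], [2, 2])
    b = reshape([5.d0, 6.d0, 7.d0, 8.d0], [2, 2])
    ew = a * b          ! Fortran: element-wise, ew(i,j) = a(i,j)*b(i,j)
    mp = matmul(a, b)   ! Fortran: true matrix product
    print *, ew(1,1), mp(1,1)   ! 5.0 vs 23.0 -- two different operations
end program star_vs_matmul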

The proposal document, mdspan P0009r16, doesn’t mention bounds checking. There is an open issue for it in the reference implementation: mdspan: Add optional bounds checking · Issue #181 · kokkos/mdspan · GitHub


Intel Fortran has had !DIR$ BLOCK_LOOP for ages. The Cray and (former) SGI compilers also have loop blocking directives. OpenMP 5.1 introduces !$omp tile for this purpose. OpenACC 2.0 also offers a tile clause for loops.

It’s good to be able to separate the code semantics (your simple loop nest) from optimization.
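
For example, a sketch of a GEMM loop nest blocked with the OpenMP 5.1 tile construct (the tile sizes are placeholders that would need tuning, and this requires a compiler that implements the 5.1 loop transformations):

subroutine gemm_tiled(m, n, k, a, b, c)
    implicit none
    integer, intent(in) :: m, n, k
    real(8), intent(in)    :: a(m,k), b(k,n)
    real(8), intent(inout) :: c(m,n)
    integer :: i, j, p
    !$omp tile sizes(64, 64, 64)
    do j = 1, n
        do p = 1, k
            do i = 1, m
                c(i,j) = c(i,j) + a(i,p) * b(p,j)
            end do
        end do
    end do
    !$omp end tile
end subroutine gemm_tiled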


You don’t need inline assembly. You can link the assembly “micro-kernels” just like you would link any other function. You can turn on link-time optimization or use directives to force inlining.

I don’t think the fact such libraries are in C and C++ means Fortran can’t be just as fast. Sure, you can’t do inline assembly or use SIMD intrinsics like with C and C++, but at the end of the day there are no magic tricks. A tutorial for the BLIS approach on optimizing GEMM is provided here: GitHub - flame/how-to-optimize-gemm
What makes it cumbersome in Fortran, IMO, is the absence of a standard preprocessor or templates.
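
For example, a hand-written micro-kernel (in C or assembly) can be declared to Fortran through an interoperable interface and linked like any other object file; the kernel name and argument list below are hypothetical, not taken from BLIS or OpenBLAS:

module microkernel_iface
    use, intrinsic :: iso_c_binding, only: c_int, c_double
    implicit none
    interface
        ! hypothetical 4x4 double-precision kernel: C(1:4,1:4) += A(1:4,1:k) * B(1:k,1:4)
        subroutine gemm_kernel_4x4(k, a, b, c, ldc) bind(c, name='gemm_kernel_4x4')
            import :: c_int, c_double
            integer(c_int), value :: k, ldc
            real(c_double), intent(in)    :: a(*), b(*)
            real(c_double), intent(inout) :: c(*)
        end subroutine gemm_kernel_4x4
    end interface
end module microkernel_iface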


The devil is in the details. For general linear algebra expressions, the problem is NP-complete. This recent ACM paper gives an overview of the topic: The Linear Algebra Mapping Problem. Current State of Linear Algebra Languages and Libraries | ACM TOMS. For BLAS expressions (so essentially matmul and variations), gfortran and nvfortran are able to do it already within certain constraints as mentioned above.


At the beginning it wasn’t a matter of performance but economics. As explained by John Backus in The History of Fortran I, II, and III (PDF, 1.6 MB),

[…] The cost of programmers associated with a computer center was usually at least as great as the cost of the computer itself. In addition, from one quarter to one half of the computer’s time was spent in debugging. Thus programming and debugging accounted for as much as three quarters of the cost of operating a computer; and obviously, as computers got cheaper, this situation would get worse.

This economic factor was one of the prime motivations which led me to propose the FORTRAN project in a letter to my boss, Cuthbert Hurd, in late 1953 (the exact date is not known but other facts suggest December 1953 as a likely date).

[…]

We went on to raise the question “…can a machine translate a sufficiently rich mathematical language into a sufficiently economical program at a sufficiently low cost to make the whole affair feasible?”


The standard BLAS routines fit on a single A4 page. You need to learn how to link them, e.g. -lblas (Reference BLAS), -lopenblas (OpenBLAS), -qmkl (with Intel Fortran, or using the MKL link line advisor), -framework Accelerate (on MacOS), and potentially how to install them. I admit all of these steps can be difficult when you are a beginner.
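
For a beginner, a complete first example might look like this (the file name and link line are just one possibility; any of the variants above work the same way):

! example.f90 -- build with: gfortran example.f90 -lblas   (or -lopenblas, -qmkl, ...)
program call_dgemm
    implicit none
    integer, parameter :: n = 3
    double precision :: a(n,n), b(n,n), c(n,n)
    external :: dgemm
    call random_number(a)
    call random_number(b)
    c = 0.0d0
    ! dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
    print *, 'max |C - matmul(A,B)| =', maxval(abs(c - matmul(a, b)))
end program call_dgemm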

The future C++ std::linalg addition does not mandate that the routines be implemented in C++; they could just call an existing library instead. It’s up to your compiler vendor whether you’ll have to perform the explicit linking yourself or whether it is handled internally.

Overall, I think it’s a nice addition to C++, and one that Fortran users can also benefit from. The mdspan class has a template parameter layout_left which corresponds to Fortran column layout, so a Fortran wrapper layer can be written with moderate effort.

Addendum: I mistakenly referred to OpenBLAS for the “How to Optimize Gemm” tutorial. It actually originates from the FLAME group at the University of Texas. Both the BLIS and OpenBLAS libraries use the approach originally developed by Kazushige Goto and published in

Goto, K., & van de Geijn, R. A. (2008). Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software (TOMS), 34(3), 1–25. (preprint, PDF, 439 KB)

7 Likes

This is a truly great resource! Thanks for the link. I am going through it myself now and will be happy to report back results here. So far, it looks like this is only going to be possible with assumed-size arrays everywhere, which, as you pointed out, gets a bit messy to look at without macros.

1 Like

Julia’s approach has been a tradition in the Fortran language for 30 years, since Fortran 90. Multiple dispatch has existed in Fortran since 1990, contrary to what is often incorrectly said about Fortran on Julia’s Discourse forum. The fact that BLAS/LAPACK routines require runtime flags as input arguments is a design choice of the original developers to keep backward compatibility. These runtime flags affect performance for small calculations. Writing code that outperforms OpenBLAS by converting these runtime flags to compile-time dispatch is not magic.

1 Like

I am greatly struggling in that regard. For one, OpenBLAS (at least the version I have) appears to be multithreaded by default. Even still, following the instructions outlined here: Home · flame/how-to-optimize-gemm Wiki · GitHub has thus far yielded varying degrees of success…

  • gfortran -Ofast -march=native -flto -fwhole-program results in intrinsic matmul always being faster, adding -fexternal-blas -lopenblas shows that matmul becomes multithreaded, with minimal (maybe 0?) impact for the other routines
  • ifort -Ofast -mavx2 -fp-model=fast=2 -fp-speculation=fast -ipo usually results in intrinsic matmul and the most naive possible triple nested loop being the fastest (by about 4x). Adding -qmkl does not change results, but I have an AMD processor, so that’s probably why.
  • ifx -Ofast -mavx2 -fp-model=fast=2 -fp-speculation=fast -ipo -qmkl displays pitiful performance, on par or worse than gfortran
  • flang -Ofast -march=native -flto is on par with gfortran and ifx (aka nothing to write home about, intrinsic matmul is fastest, yet still an order of magnitude slower than external blas with OpenBLAS from gfortran)

I understand that it is possible, even easy, to link to external libraries in Fortran code. That is not what I am trying to do. I want to write Fortran that can be compiled to a binary with performance comparable to libraries written in other languages (all C/C++, from what I have seen).

Here are results for 256x256 matrix multiplication from each set of compiler and flags:

 GCC version 13.2.1 20230801 -- -march=znver3 -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -msse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mvpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mclwb -mclzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mmwaitx -mno-pconfig -mpku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mrdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -mno-sgx -msha -mshstk -mno-tbm -mno-tsxldtrk -mvaes -mno-waitpkg -mwbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint -mno-amx-complex --param=l1-cache-size=32 --param=l1-cache-line-size=64 --param=l2-cache-size=512 -mtune=znver3 -Ofast -flto -fwhole-program -fpre-include=/usr/include/finclude/math-vector-fortran.h
test n=256
  MATMUL GFLOPS: 9.84, last element: 70.61
    MM00 GFLOPS: .81, last element: 70.61
    MM01 GFLOPS: .87, last element: 70.61
    MM02 GFLOPS: .82, last element: 70.61
    MM03 GFLOPS: .80, last element: 70.61
    MM04 GFLOPS: .80, last element: 70.61
    MM05 GFLOPS: 4.12, last element: 70.61
    MM06 GFLOPS: 4.13, last element: 70.61
    MM08 GFLOPS: 3.14, last element: 70.61
 GCC version 13.2.1 20230801 -- -march=znver3 -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -msse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mvpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mclwb -mclzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mmwaitx -mno-pconfig -mpku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mrdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -mno-sgx -msha -mshstk -mno-tbm -mno-tsxldtrk -mvaes -mno-waitpkg -mwbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint -mno-amx-complex --param=l1-cache-size=32 --param=l1-cache-line-size=64 --param=l2-cache-size=512 -mtune=znver3 -Ofast -flto -fwhole-program -fexternal-blas -fpre-include=/usr/include/finclude/math-vector-fortran.h
test n=256
  MATMUL GFLOPS: 49.44, last element: 63.73
    MM00 GFLOPS: .81, last element: 63.73
    MM01 GFLOPS: 2.49, last element: 63.73
    MM02 GFLOPS: .74, last element: 63.73
    MM03 GFLOPS: .74, last element: 63.73
    MM04 GFLOPS: .74, last element: 63.73
    MM05 GFLOPS: 8.92, last element: 63.73
    MM06 GFLOPS: 8.86, last element: 63.73
    MM08 GFLOPS: 9.17, last element: 63.73
 Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.9.0 Build 20230302_000000 -- -Ofast -mavx2 -fp-model=fast=2 -fp-speculation=fast -ipo -o test_ifort
test n=256
  MATMUL GFLOPS: 17.92, last element: 59.41
    MM00 GFLOPS: 21.10, last element: 59.41
    MM01 GFLOPS: .61, last element: 59.41
    MM02 GFLOPS: 2.84, last element: 59.41
    MM03 GFLOPS: 2.84, last element: 59.41
    MM04 GFLOPS: .58, last element: 59.41
    MM05 GFLOPS: 2.85, last element: 59.41
    MM06 GFLOPS: 2.85, last element: 59.41
    MM08 GFLOPS: 3.87, last element: 59.41
 Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.9.0 Build 20230302_000000 -- -Ofast -mavx2 -fp-model=fast=2 -fp-speculation=fast -ipo -qmkl -o test_ifort_mkl
test n=256
  MATMUL GFLOPS: 18.35, last element: 59.41
    MM00 GFLOPS: 19.54, last element: 59.41
    MM01 GFLOPS: .63, last element: 59.41
    MM02 GFLOPS: 2.99, last element: 59.41
    MM03 GFLOPS: 2.98, last element: 59.41
    MM04 GFLOPS: .60, last element: 59.41
    MM05 GFLOPS: 2.99, last element: 59.41
    MM06 GFLOPS: 3.00, last element: 59.41
    MM08 GFLOPS: 4.16, last element: 59.41
 Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320 -- -Ofast -mavx2 -fp-model=fast=2 -fp-speculation=fast -ipo -o test_ifx
test n=256
  MATMUL GFLOPS: 7.42, last element: 59.41
    MM00 GFLOPS: .64, last element: 59.41
    MM01 GFLOPS: .97, last element: 59.41
    MM02 GFLOPS: .86, last element: 59.41
    MM03 GFLOPS: .85, last element: 59.41
    MM04 GFLOPS: .85, last element: 59.41
    MM05 GFLOPS: .87, last element: 59.41
    MM06 GFLOPS: 3.70, last element: 59.41
    MM08 GFLOPS: 3.71, last element: 59.41
 Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320 -- -Ofast -mavx2 -fp-model=fast=2 -fp-speculation=fast -ipo -qmkl -o test_ifx_mkl
test n=256
  MATMUL GFLOPS: 10.16, last element: 59.41
    MM00 GFLOPS: .67, last element: 59.41
    MM01 GFLOPS: 1.03, last element: 59.41
    MM02 GFLOPS: .87, last element: 59.41
    MM03 GFLOPS: .86, last element: 59.41
    MM04 GFLOPS: .85, last element: 59.41
    MM05 GFLOPS: 1.20, last element: 59.41
    MM06 GFLOPS: 3.89, last element: 59.41
    MM08 GFLOPS: 3.93, last element: 59.41
 flang Flang - 1.5 2017-05-01 -- -Ofast -march=native -flto -o test_flang
test n=256
  MATMUL GFLOPS: 8.97, last element: 60.90
    MM00 GFLOPS: 0.63, last element: 60.90
    MM01 GFLOPS: 0.64, last element: 60.90
    MM02 GFLOPS: 0.61, last element: 60.90
    MM03 GFLOPS: 0.62, last element: 60.90
    MM04 GFLOPS: 0.62, last element: 60.90
    MM05 GFLOPS: 2.64, last element: 60.90
    MM06 GFLOPS: 2.62, last element: 60.90
    MM08 GFLOPS: 2.46, last element: 60.90

I have implemented through Optimization 8. I do not believe that 7 or 9 really apply to Fortran? I will be moving on to the block 4x4 stuff next. Hopefully that brings some large improvements.


CODE:

module my_mod
use, intrinsic :: iso_fortran_env, only: rk => real64, i64 => int64
implicit none
private

    public :: rk, mm00, i64, mm01, mm02, mm03, mm04, mm05, mm06, mm08

    contains

        pure subroutine mm00(m, n, k, a, lda, b, ldb, c, ldc)
            integer, intent(in) :: m, n, k, lda, ldb, ldc
            real(rk), intent(in) :: a(lda,*), b(ldb,*)
            real(rk), intent(inout) :: c(ldc,*)
            integer :: i, j, p
            do i=1,m
                do j=1,n
                    do p=1,k
                        c(i,j) = c(i,j) + a(i,p)*b(p,j)
                    end do
                end do
            end do
        end subroutine mm00

        pure subroutine mm01(m, n, k, a, lda, b, ldb, c, ldc)
            integer, intent(in) :: m, n, k, lda, ldb, ldc
            real(rk), intent(in) :: a(*), b(*)
            real(rk), intent(inout) :: c(*)
            integer :: i, j
            do j=1,n
                do i=1,m
                    call add_dot(k, a(i), lda, b((j-1)*ldb+1), c((j-1)*ldc+i))
                end do
            end do
        end subroutine mm01

        pure subroutine add_dot(k, x, incx, y, gamma)
            integer, intent(in) :: k, incx
            real(rk), intent(in) :: x(*), y(k)
            real(rk), intent(inout) :: gamma
            integer :: p
            do p=1,k
                gamma = gamma + x((p-1)*incx+1)*y(p)
            end do
        end subroutine add_dot

        pure subroutine mm02(m, n, k, a, lda, b, ldb, c, ldc)
            integer, intent(in) :: m, n, k, lda, ldb, ldc
            real(rk), intent(in) :: a(*), b(*)
            real(rk), intent(inout) :: c(*)
            integer :: i, j
            do j=1,n,4
                do i=1,m
                    call add_dot(k, a(i), lda, b((j-1)*ldb+1), c((j-1)*ldc+i))
                    call add_dot(k, a(i), lda, b(j*ldb+1), c(j*ldc+i))
                    call add_dot(k, a(i), lda, b((j+1)*ldb+1), c((j+1)*ldc+i))
                    call add_dot(k, a(i), lda, b((j+2)*ldb+1), c((j+2)*ldc+i))
                end do
            end do
        end subroutine mm02

        pure subroutine mm03(m, n, k, a, lda, b, ldb, c, ldc)
            integer, intent(in) :: m, n, k, lda, ldb, ldc
            real(rk), intent(in) :: a(*), b(*)
            real(rk), intent(inout) :: c(*)
            integer :: i, j
            do j=1,n,4
                do i=1,m
                    call add_dot_1x4(k, a(i), lda, b((j-1)*ldb+1), ldb, c((j-1)*ldc+i), ldc)
                end do
            end do
        end subroutine mm03

        pure subroutine add_dot_1x4(k, a, lda, b, ldb, c, ldc)
            integer, intent(in) :: k, lda, ldb, ldc
            real(rk), intent(in) :: a(*), b(*)
            real(rk), intent(inout) :: c(*)
            call add_dot(k, a(1), lda, b(1), c(1))
            call add_dot(k, a(1), lda, b(ldb+1), c(ldc+1))
            call add_dot(k, a(1), lda, b(2*ldb+1), c(2*ldc+1))
            call add_dot(k, a(1), lda, b(3*ldb+1), c(3*ldc+1))
        end subroutine add_dot_1x4

        pure subroutine mm04(m, n, k, a, lda, b, ldb, c, ldc)
            integer, intent(in) :: m, n, k, lda, ldb, ldc
            real(rk), intent(in) :: a(*), b(*)
            real(rk), intent(inout) :: c(*)
            integer :: i, j
            do j=1,n,4
                do i=1,m
                    call add_dot_1x4_inline(k, a(i), lda, b((j-1)*ldb+1), ldb, c((j-1)*ldc+i), ldc)
                end do
            end do
        end subroutine mm04

        pure subroutine add_dot_1x4_inline(k, a, lda, b, ldb, c, ldc)
            integer, intent(in) :: k, lda, ldb, ldc
            real(rk), intent(in) :: a(*), b(*)
            real(rk), intent(inout) :: c(*)
            integer :: p
            do p=1,k
                c(1) = c(1) + a((p-1)*lda+1)*b(p)
            end do
            do p=1,k
                c(ldc+1) = c(ldc+1) + a((p-1)*lda+1)*b(ldb+p)
            end do
            do p=1,k
                c(2*ldc+1) = c(2*ldc+1) + a((p-1)*lda+1)*b(2*ldb+p)
            end do
            do p=1,k
                c(3*ldc+1) = c(3*ldc+1) + a((p-1)*lda+1)*b(3*ldb+p)
            end do
        end subroutine add_dot_1x4_inline

        pure subroutine mm05(m, n, k, a, lda, b, ldb, c, ldc)
            integer, intent(in) :: m, n, k, lda, ldb, ldc
            real(rk), intent(in) :: a(*), b(*)
            real(rk), intent(inout) :: c(*)
            integer :: i, j
            do j=1,n,4
                do i=1,m
                    call add_dot_1x4_inline_fused(k, a(i), lda, b((j-1)*ldb+1), ldb, c((j-1)*ldc+i), ldc)
                end do
            end do
        end subroutine mm05

        pure subroutine add_dot_1x4_inline_fused(k, a, lda, b, ldb, c, ldc)
            integer, intent(in) :: k, lda, ldb, ldc
            real(rk), intent(in) :: a(*), b(*)
            real(rk), intent(inout) :: c(*)
            integer :: p
            do p=1,k
                c(1) = c(1) + a((p-1)*lda+1)*b(p)
                c(ldc+1) = c(ldc+1) + a((p-1)*lda+1)*b(ldb+p)
                c(2*ldc+1) = c(2*ldc+1) + a((p-1)*lda+1)*b(2*ldb+p)
                c(3*ldc+1) = c(3*ldc+1) + a((p-1)*lda+1)*b(3*ldb+p)
            end do
        end subroutine add_dot_1x4_inline_fused

        pure subroutine mm06(m, n, k, a, lda, b, ldb, c, ldc)
            integer, intent(in) :: m, n, k, lda, ldb, ldc
            real(rk), intent(in) :: a(*), b(*)
            real(rk), intent(inout) :: c(*)
            integer :: i, j
            do j=1,n,4
                do i=1,m
                    call add_dot_1x4_inline_fused_accumulators(k, a(i), lda, b((j-1)*ldb+1), ldb, c((j-1)*ldc+i), ldc)
                end do
            end do
        end subroutine mm06

        pure subroutine add_dot_1x4_inline_fused_accumulators(k, a, lda, b, ldb, c, ldc)
            integer, intent(in) :: k, lda, ldb, ldc
            real(rk), intent(in) :: a(*), b(*)
            real(rk), intent(inout) :: c(*)
            integer :: p
            real(rk) :: c1, c2, c3, c4, a1
            c1 = 0.0_rk
            c2 = 0.0_rk
            c3 = 0.0_rk
            c4 = 0.0_rk
            do p=1,k
                a1 = a((p-1)*lda+1)
                c1 = c1 + a1*b(p)
                c2 = c2 + a1*b(ldb+p)
                c3 = c3 + a1*b(2*ldb+p)
                c4 = c4 + a1*b(3*ldb+p)
            end do
            c(1) = c(1) + c1
            c(ldc+1) = c(ldc+1) + c2
            c(2*ldc+1) = c(2*ldc+1) + c3
            c(3*ldc+1) = c(3*ldc+1) + c4
        end subroutine add_dot_1x4_inline_fused_accumulators

        !! can't really do 7, Fortran already passes by reference... 

        pure subroutine mm08(m, n, k, a, lda, b, ldb, c, ldc)
            integer, intent(in) :: m, n, k, lda, ldb, ldc
            real(rk), intent(in) :: a(*), b(*)
            real(rk), intent(inout) :: c(*)
            integer :: i, j
            do j=1,n,4
                do i=1,m
                    call add_dot_1x4_inline_fused_accumulators_unrolled(k, a(i), lda, b((j-1)*ldb+1), ldb, c((j-1)*ldc+i), ldc)
                end do
            end do
        end subroutine mm08

        pure subroutine add_dot_1x4_inline_fused_accumulators_unrolled(k, a, lda, b, ldb, c, ldc)
            integer, intent(in) :: k, lda, ldb, ldc
            real(rk), intent(in) :: a(*), b(*)
            real(rk), intent(inout) :: c(*)
            integer :: p
            real(rk) :: c1, c2, c3, c4, a1
            c1 = 0.0_rk
            c2 = 0.0_rk
            c3 = 0.0_rk
            c4 = 0.0_rk
            do p=1,k,4
                a1 = a((p-1)*lda+1)
                c1 = c1 + a1*b(p)
                c2 = c2 + a1*b(ldb+p)
                c3 = c3 + a1*b(2*ldb+p)
                c4 = c4 + a1*b(3*ldb+p)

                a1 = a(p*lda+1)
                c1 = c1 + a1*b(p+1)
                c2 = c2 + a1*b(ldb+p+1)
                c3 = c3 + a1*b(2*ldb+p+1)
                c4 = c4 + a1*b(3*ldb+p+1)

                a1 = a((p+1)*lda+1)
                c1 = c1 + a1*b(p+2)
                c2 = c2 + a1*b(ldb+p+2)
                c3 = c3 + a1*b(2*ldb+p+2)
                c4 = c4 + a1*b(3*ldb+p+2)

                a1 = a((p+2)*lda+1)
                c1 = c1 + a1*b(p+3)
                c2 = c2 + a1*b(ldb+p+3)
                c3 = c3 + a1*b(2*ldb+p+3)
                c4 = c4 + a1*b(3*ldb+p+3)
            end do
            c(1) = c(1) + c1
            c(ldc+1) = c(ldc+1) + c2
            c(2*ldc+1) = c(2*ldc+1) + c3
            c(3*ldc+1) = c(3*ldc+1) + c4
        end subroutine add_dot_1x4_inline_fused_accumulators_unrolled

        !! 9 is more stuff with pointers...

end module my_mod


program main
use, intrinsic :: iso_fortran_env, only: compiler_version, compiler_options
use, non_intrinsic :: my_mod
implicit none

    integer, parameter :: r_max = 10, methods = 2
    logical, parameter :: debug = .false.

    real(rk), allocatable :: x(:,:), y(:,:), z(:,:)
    integer :: i, n, r
    integer(i64) :: c1(methods), cr(methods), c2(methods)
    real(rk) :: gflops(methods)

    write(*,*) compiler_version()//' -- '//compiler_options()

    do i=8,8
        n = 2**i
        if (allocated(x)) deallocate(x)
        if (allocated(y)) deallocate(y)
        if (allocated(z)) deallocate(z)
        allocate(x(n,n), y(n,n), z(n,n))
        call random_number(x)
        call random_number(y)
        write(*,'(a,i0)') 'test n=',n

        !! matmul
        call system_clock(c1(1), cr(1))
        do r=1,r_max
            z = 0.0_rk
            z = matmul(x, y)
        end do
        call system_clock(c2(1))
        gflops(1) = real(r_max, rk)*real(n, rk)**3*1.0D-9/(real(max(c2(1) - c1(1), 1_i64), rk)/real(cr(1), rk))
        write(*,'(2(a,f0.2))') '  MATMUL GFLOPS: ',gflops(1),', last element: ',z(n,n)
        if (debug) write(*,*) z

        !! mm00
        call system_clock(c1(1), cr(1))
        do r=1,r_max
            z = 0.0_rk
            call mm00(n, n, n, x, n, y, n, z, n)
        end do
        call system_clock(c2(1))
        gflops(1) = real(r_max, rk)*real(n, rk)**3*1.0D-9/(real(max(c2(1) - c1(1), 1_i64), rk)/real(cr(1), rk))
        write(*,'(2(a,f0.2))') '    MM00 GFLOPS: ',gflops(1),', last element: ',z(n,n)
        if (debug) write(*,*) z

        !! mm01
        call system_clock(c1(1), cr(1))
        do r=1,r_max
            z = 0.0_rk
            call mm01(n, n, n, x, n, y, n, z, n)
        end do
        call system_clock(c2(1))
        gflops(1) = real(r_max, rk)*real(n, rk)**3*1.0D-9/(real(max(c2(1) - c1(1), 1_i64), rk)/real(cr(1), rk))
        write(*,'(2(a,f0.2))') '    MM01 GFLOPS: ',gflops(1),', last element: ',z(n,n)
        if (debug) write(*,*) z

        !! mm02
        call system_clock(c1(1), cr(1))
        do r=1,r_max
            z = 0.0_rk
            call mm02(n, n, n, x, n, y, n, z, n)
        end do
        call system_clock(c2(1))
        gflops(1) = real(r_max, rk)*real(n, rk)**3*1.0D-9/(real(max(c2(1) - c1(1), 1_i64), rk)/real(cr(1), rk))
        write(*,'(2(a,f0.2))') '    MM02 GFLOPS: ',gflops(1),', last element: ',z(n,n)
        if (debug) write(*,*) z

        !! mm03
        call system_clock(c1(1), cr(1))
        do r=1,r_max
            z = 0.0_rk
            call mm03(n, n, n, x, n, y, n, z, n)
        end do
        call system_clock(c2(1))
        gflops(1) = real(r_max, rk)*real(n, rk)**3*1.0D-9/(real(max(c2(1) - c1(1), 1_i64), rk)/real(cr(1), rk))
        write(*,'(2(a,f0.2))') '    MM03 GFLOPS: ',gflops(1),', last element: ',z(n,n)
        if (debug) write(*,*) z

        !! mm04
        call system_clock(c1(1), cr(1))
        do r=1,r_max
            z = 0.0_rk
            call mm04(n, n, n, x, n, y, n, z, n)
        end do
        call system_clock(c2(1))
        gflops(1) = real(r_max, rk)*real(n, rk)**3*1.0D-9/(real(max(c2(1) - c1(1), 1_i64), rk)/real(cr(1), rk))
        write(*,'(2(a,f0.2))') '    MM04 GFLOPS: ',gflops(1),', last element: ',z(n,n)
        if (debug) write(*,*) z

        !! mm05
        call system_clock(c1(1), cr(1))
        do r=1,r_max
            z = 0.0_rk
            call mm05(n, n, n, x, n, y, n, z, n)
        end do
        call system_clock(c2(1))
        gflops(1) = real(r_max, rk)*real(n, rk)**3*1.0D-9/(real(max(c2(1) - c1(1), 1_i64), rk)/real(cr(1), rk))
        write(*,'(2(a,f0.2))') '    MM05 GFLOPS: ',gflops(1),', last element: ',z(n,n)
        if (debug) write(*,*) z

        !! mm06
        call system_clock(c1(1), cr(1))
        do r=1,r_max
            z = 0.0_rk
            call mm06(n, n, n, x, n, y, n, z, n)
        end do
        call system_clock(c2(1))
        gflops(1) = real(r_max, rk)*real(n, rk)**3*1.0D-9/(real(max(c2(1) - c1(1), 1_i64), rk)/real(cr(1), rk))
        write(*,'(2(a,f0.2))') '    MM06 GFLOPS: ',gflops(1),', last element: ',z(n,n)
        if (debug) write(*,*) z

        !! can't really do 7, Fortran already passes by reference...

        !! mm08
        call system_clock(c1(1), cr(1))
        do r=1,r_max
            z = 0.0_rk
            call mm08(n, n, n, x, n, y, n, z, n)
        end do
        call system_clock(c2(1))
        gflops(1) = real(r_max, rk)*real(n, rk)**3*1.0D-9/(real(max(c2(1) - c1(1), 1_i64), rk)/real(cr(1), rk))
        write(*,'(2(a,f0.2))') '    MM08 GFLOPS: ',gflops(1),', last element: ',z(n,n)
        if (debug) write(*,*) z

        !! 9 is more stuff with pointers...

    end do

end program main

BUILD SCRIPT (bash):

#!/bin/bash

clear

gfortran -Ofast -march=native -flto -fwhole-program -o test_gfortran main.f90
gfortran -Ofast -march=native -flto -fwhole-program -fexternal-blas -lopenblas -o test_gfortran_openblas main.f90
ifort -Ofast -mavx2 -fp-model=fast=2 -fp-speculation=fast -ipo -o test_ifort main.f90
ifort -Ofast -mavx2 -fp-model=fast=2 -fp-speculation=fast -ipo -qmkl -o test_ifort_mkl main.f90
ifx -Ofast -mavx2 -fp-model=fast=2 -fp-speculation=fast -ipo -o test_ifx main.f90
ifx -Ofast -mavx2 -fp-model=fast=2 -fp-speculation=fast -ipo -qmkl -o test_ifx_mkl main.f90
flang -Ofast -march=native -flto -o test_flang main.f90

./test_gfortran
./test_gfortran_openblas
./test_ifort
./test_ifort_mkl
./test_ifx
./test_ifx_mkl
./test_flang

Read my comment in full again, please. Also, in general one cannot beat multithreaded code with single-threaded code, and assembly has little to do with BLAS routines. It is mostly a matter of efficient memory access. That is why cache-oblivious algorithms can compete with hardware-tuned code.

This comment?

I agree with it completely.

  • Julia - no comment, I am not concerned with Julia, or their community opinion of Fortran static/dynamic/multiple dispatch methods etc
  • Yes, I am well aware that any additional arguments and following conditional checks negatively impact performance. I also agree with you that this performance impact will be most noticeable when the overall runtime is short (small calculations). I believe I mentioned this same notion previously in this same thread
  • Writing code that outperforms OpenBLAS…

Typically, spawning threads has significant overhead when compared to small calculations (like 64x64 matrices, most things that are going to take less than a second to complete, etc). There are environment variables I believe I could set to force OpenBLAS singlethreaded, and I suspect it would still be significantly faster than the other approaches.

(base) [tyranids@valinor dmm]$ export OPENBLAS_NUM_THREADS=1
(base) [tyranids@valinor dmm]$ ./test_gfortran_openblas 
 GCC version 13.2.1 20230801 -- -march=znver3 -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -msse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mvpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mclwb -mclzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mmwaitx -mno-pconfig -mpku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mrdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -mno-sgx -msha -mshstk -mno-tbm -mno-tsxldtrk -mvaes -mno-waitpkg -mwbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint -mno-amx-complex --param=l1-cache-size=32 --param=l1-cache-line-size=64 --param=l2-cache-size=512 -mtune=znver3 -Ofast -flto -fwhole-program -fexternal-blas -fpre-include=/usr/include/finclude/math-vector-fortran.h
test n=256
  MATMUL GFLOPS: 13.56, last element: 65.98
    MM00 GFLOPS: .83, last element: 65.98
    MM01 GFLOPS: 2.53, last element: 65.98
    MM02 GFLOPS: .78, last element: 65.98
    MM03 GFLOPS: .77, last element: 65.98
    MM04 GFLOPS: .77, last element: 65.98
    MM05 GFLOPS: 8.93, last element: 65.98
    MM06 GFLOPS: 8.89, last element: 65.98
    MM08 GFLOPS: 9.21, last element: 65.98

It appears this notion was correct.

As I proceed with further blocking and probably unrolling, perhaps I will see performance like Home · flame/how-to-optimize-gemm Wiki · GitHub

Well… I’m not sure what to do in Fortran when it comes to steps like these: One of the largest performance boosts was putting stuff in vector registers https://github.com/flame/how-to-optimize-gemm/blob/master/src/MMult_4x4_10.c

The next biggest win comes from packing the a matrix, which should be easily doable: https://github.com/flame/how-to-optimize-gemm/blob/master/src/MMult_4x4_13.c

I think I’ll have another go at this tomorrow, and probably start with trying to make a fully vectorizable Fortran-from-the-beginning 4x4 kernel. Or perhaps 8x8, since my machine has AVX2. I’m not sure yet, but I think starting over now that I really see where this is going makes sense.
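
Roughly what I have in mind for the 4x4 kernel, as a first sketch (assumes m and n are divisible by 4, keeps a 4x4 block of C in locals in the hope that the compiler maps them to vector registers, completely untuned):

pure subroutine mm_4x4_kernel(m, n, k, a, lda, b, ldb, c, ldc)
    implicit none
    integer, intent(in) :: m, n, k, lda, ldb, ldc
    real(8), intent(in)    :: a(lda,*), b(ldb,*)
    real(8), intent(inout) :: c(ldc,*)
    integer :: i, j, p
    real(8) :: cb(4,4), av(4)
    do j = 1, n, 4
        do i = 1, m, 4
            cb = 0.0d0
            do p = 1, k
                av = a(i:i+3, p)                  ! 4 contiguous elements of A
                cb(:,1) = cb(:,1) + av * b(p, j)
                cb(:,2) = cb(:,2) + av * b(p, j+1)
                cb(:,3) = cb(:,3) + av * b(p, j+2)
                cb(:,4) = cb(:,4) + av * b(p, j+3)
            end do
            c(i:i+3, j:j+3) = c(i:i+3, j:j+3) + cb
        end do
    end do
end subroutine mm_4x4_kernel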

1 Like

Here’s my addition to your menagerie of matrix multipliers. It has a slight disadvantage because it includes the prefactors alpha and beta.

pure subroutine dgemm_nn(m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
implicit none
integer, intent(in) :: m,n,k,lda,ldb,ldc
real(8), intent(in) :: alpha,beta,a(lda,k),b(ldb,n)
real(8), intent(inout) :: c(ldc,n)
! local variables
integer i,j,l
real(8) cc(m),t
!DIR$ ATTRIBUTES ALIGN : 64 :: cc
do j=1,n
  cc(1:m)=beta*c(1:m,j)
!DIR$ UNROLL_AND_JAM(8)
  do l=1,k
    t=alpha*b(l,j)
!$OMP SIMD SIMDLEN(8)
    do i=1,m
      cc(i)=cc(i)+t*a(i,l)
    end do
  end do
  c(1:m,j)=cc(1:m)
end do
end subroutine

Compiled with gfortran version 11:

gfortran-11 -Ofast -march=native -mtune=native -fomit-frame-pointer -fopenmp -ffpe-summary=none test557b.f90 -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lpthread

on an AMD Ryzen Threadripper 2990WX 32-Core yields:

test n=256
  MATMUL GFLOPS: 6.86, last element: 65.45
    MM00 GFLOPS: .47, last element: 65.45
    MM01 GFLOPS: .53, last element: 65.45
    MM02 GFLOPS: .43, last element: 65.45
    MM03 GFLOPS: .44, last element: 65.45
    MM04 GFLOPS: .44, last element: 65.45
    MM05 GFLOPS: .72, last element: 65.45
    MM06 GFLOPS: 1.85, last element: 65.45
    MM08 GFLOPS: 2.81, last element: 65.45
    dgemm_nn GFLOPS: 7.77, last element: 65.45

On an Intel(R) Xeon(R) Platinum 8360Y with Intel Fortran 2021.9.0 20230302 and

ifort -O3 -xHost -ipo -qopenmp -qmkl=parallel test557b.f90 -liomp5 -lpthread -lm -ldl

in addition to setting !$OMP SIMD SIMDLEN(16), which is faster with AVX-512, yields:

test n=256
  MATMUL GFLOPS: 17.04, last element: 59.41
    MM00 GFLOPS: 19.38, last element: 59.41
    MM01 GFLOPS: 1.37, last element: 59.41
    MM02 GFLOPS: 5.34, last element: 59.41
    MM03 GFLOPS: 5.26, last element: 59.41
    MM04 GFLOPS: 5.22, last element: 59.41
    MM05 GFLOPS: 5.28, last element: 59.41
    MM06 GFLOPS: 5.22, last element: 59.41
    MM08 GFLOPS: 3.46, last element: 59.41
    dgemm_nn GFLOPS: 15.62, last element: 59.41

(Like you, I find ifx has abysmal performance).

The routine dgemm_nn works better because it copies a column of C into a temporary array on the stack and repeatedly uses that for accumulation. This seems to be more cache-friendly.

I spent several months trying to write an FFT library in Fortran that competed with FFTW and MKL for speed. For short-length transforms the Fortran version was slightly faster. But for longer multi-dimensional complex transforms both FFTW and MKL were substantially faster. My conclusion was that you have to carefully manage memory in order to get good cache hit rates, and keep the SIMD lanes open as long as possible – this is difficult or even impossible in Fortran. Furthermore, I had to generate bespoke code for short-length FFTs (up to n=64) and also use both !$OMP SIMD and Intel-compiler-specific directives to force inlining. It was a good educational experience though.

3 Likes

Read that FLAME/OpenBLAS/Goto link again. The dgemm operation is a triply nested loop, which you have to convert to a six-deep nest by blocking at each level, and then you have to choose permutations and block sizes depending on your hardware. All of that can be done in a high-level language except for the last level: that needs to be an assembly-coded accumulation straight into a block of registers.
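
For reference, that loop structure looks roughly like this in Fortran (block sizes are placeholders to be tuned to the cache hierarchy; in BLIS the body of the innermost block is further split into loops 4-6 plus the packed, assembly-coded micro-kernel):

subroutine gemm_blocked(m, n, k, a, b, c)
    implicit none
    integer, intent(in) :: m, n, k
    real(8), intent(in)    :: a(m,k), b(k,n)
    real(8), intent(inout) :: c(m,n)
    integer, parameter :: NC = 256, KC = 128, MC = 64
    integer :: jc, pc, ic, i, j, p
    do jc = 1, n, NC                      ! loop 1: panel of B/C columns
        do pc = 1, k, KC                  ! loop 2: K-panel of A and B
            do ic = 1, m, MC              ! loop 3: block of A/C rows
                ! macro-kernel over the current block
                do j = jc, min(jc+NC-1, n)
                    do p = pc, min(pc+KC-1, k)
                        do i = ic, min(ic+MC-1, m)
                            c(i,j) = c(i,j) + a(i,p) * b(p,j)
                        end do
                    end do
                end do
            end do
        end do
    end do
end subroutine gemm_blocked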

So you may be right conceptually, but not if you look at implementation/performance.

1 Like