C++ Standard Library dense linear algebra interface

You are late to the party. It has already been discussed on Discourse:

You can try the C++ Standard BLAS using the reference Kokkos stdBLAS implementation (GitHub - kokkos/stdBLAS: Reference Implementation for stdBLAS).

It’s also available (experimentally) as part of the NVIDIA HPC SDK version 22.7 (released July 2022).
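To make the shape of the interface concrete, here is a minimal, untested sketch of a GEMM-like call against the stdBLAS reference implementation. The headers and the std::experimental namespace follow that implementation; exact spellings have shifted between proposal revisions, so treat the names as assumptions rather than the final std::linalg API.

```cpp
#include <experimental/linalg>   // stdBLAS reference implementation
#include <experimental/mdspan>
#include <cstddef>
#include <vector>

namespace stdex = std::experimental;

int main() {
  const std::size_t m = 4, k = 3, n = 2;
  std::vector<double> a(m * k, 1.0), b(k * n, 2.0), c(m * n, 0.0);

  // Non-owning 2-D views over flat buffers (default layout_right).
  stdex::mdspan<double, stdex::dextents<std::size_t, 2>> A(a.data(), m, k);
  stdex::mdspan<double, stdex::dextents<std::size_t, 2>> B(b.data(), k, n);
  stdex::mdspan<double, stdex::dextents<std::size_t, 2>> C(c.data(), m, n);

  stdex::linalg::matrix_product(A, B, C);  // C = A * B
}
```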


In Fortran this requires compiler support. It can already be done by gfortran when you use the option -fexternal-blas, and also by the NVIDIA Fortran compiler.


See my previous response. gfortran and nvfortran already know how to do this when you turn on the right options. What the compilers don’t always know in advance is the size of your problem or the architecture or the platform you’d like to run on. There is no “one size fits all” routine.


These concerns are shared by many BLAS and LAPACK users. Luckily the authors have recently commented on this. In the C++ API for BLAS and LAPACK (PDF, 1.2 MB) they write:

Occasionally, users express concern about the overhead of error checks. For even modestly sized matrices, error checks take negligible time. [emphasis added] However, for very small matrices, with n < 20 or so, there can be noticeable overhead. Intel introduced MKL_DIRECT_CALL to disable error checks in these cases. However, libraries compiled for specific sizes, either via templating or just-in-time (JIT) compilation, provide an even larger performance boost for these small sizes; for instance, Intel’s libxsmm for extra-small matrix-multiply or batched BLAS for sets of small matrices. Thus, users with such small matrices are encouraged to use special purpose interfaces rather than try to optimize overheads in a general purpose interface.

With gfortran and matmul at least, you have two tuning levers: -fblas-matmul-limit=n for when -fexternal-blas is in effect, and -finline-matmul-limit=n otherwise.


No whole-array expressions. The rationale is explained in the C++ proposals: it is difficult to agree on a common notation. For example, in Fortran * is element-wise multiplication, while in MATLAB * is the matrix product and .* is the element-wise form. Python faced the same problem (PEP 465 – A dedicated infix operator for matrix multiplication | peps.python.org) and introduced @ as a dedicated matrix-multiplication operator to prevent confusion.

The proposal document, MDSPAN P0009r16, doesn’t mention bounds checking. There is an open issue for it in the reference implementation: mdspan: Add optional bounds checking · Issue #181 · kokkos/mdspan · GitHub


Intel Fortran has had !DIR$ LOOP_BLOCK for ages. The Cray and (former) SGI compilers also have loop blocking directives. OpenMP 5.1 introduces !$omp tile for this purpose. OpenACC 2.0 also offers a tile clause for loops.

It’s good to be able to separate the code semantics (your simple loop nest) from optimization.
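To show what that separation looks like in practice, here is a rough, untested sketch of the OpenMP 5.1 tile construct applied to a plain loop nest (C++ shown here; the Fortran !$omp tile directive is the analogue). The tile sizes are made up for illustration, and compiler support for this construct is still uneven.

```cpp
// Naive matrix product with the OpenMP 5.1 tile construct blocking the
// outer i/j loops. The loop body stays untouched; only the directive
// expresses the blocking. Tile sizes are illustrative, not tuned.
void matmul_tiled(int n, const double* A, const double* B, double* C) {
  #pragma omp tile sizes(32, 32)
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
      double s = 0.0;
      for (int k = 0; k < n; ++k)
        s += A[i * n + k] * B[k * n + j];
      C[i * n + j] = s;
    }
}
```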


You don’t need inline assembly. You can link the assembly “micro-kernels” just like you would link any other function, and then turn on link-time optimization or use directives to force inlining.

I don’t think the fact that such libraries are written in C and C++ means Fortran can’t be just as fast. Sure, you can’t do inline assembly or use SIMD intrinsics as you can in C and C++, but at the end of the day there are no magic tricks. A tutorial on the BLIS approach to optimizing GEMM is provided here: GitHub - flame/how-to-optimize-gemm
What makes it cumbersome in Fortran, IMO, is the absence of a standard preprocessor or templates.
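For a flavour of what that tutorial builds up to, here is a stripped-down, untested sketch of the cache-blocking idea in C++ (the real BLIS/OpenBLAS kernels add packing of the blocks, vectorized micro-kernels, and architecture-specific block sizes, none of which is shown here).

```cpp
#include <algorithm>

// Toy cache-blocked GEMM for column-major matrices, C += A * B.
// Block sizes are placeholders; in practice they are chosen so the
// A/B/C blocks fit the different cache levels.
constexpr int MC = 64, NC = 64, KC = 64;

void gemm_blocked(int m, int n, int k,
                  const double* A, int lda,
                  const double* B, int ldb,
                  double*       C, int ldc) {
  for (int jc = 0; jc < n; jc += NC)
    for (int pc = 0; pc < k; pc += KC)
      for (int ic = 0; ic < m; ic += MC) {
        const int nb = std::min(NC, n - jc);
        const int kb = std::min(KC, k - pc);
        const int mb = std::min(MC, m - ic);
        // Stand-in for the "micro-kernel": a plain triple loop on one block.
        for (int j = 0; j < nb; ++j)
          for (int p = 0; p < kb; ++p)
            for (int i = 0; i < mb; ++i)
              C[(ic + i) + (jc + j) * ldc] +=
                  A[(ic + i) + (pc + p) * lda] *
                  B[(pc + p) + (jc + j) * ldb];
      }
}
```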


The devil is in the details. For general linear algebra expressions, the problem is NP-complete. This recent ACM paper gives an overview of the topic: The Linear Algebra Mapping Problem. Current State of Linear Algebra Languages and Libraries | ACM TOMS. For BLAS expressions (so essentially matmul and variations), gfortran and nvfortran are able to do it already within certain constraints as mentioned above.


At the beginning it wasn’t a matter of performance but economics. As explained by John Backus in The History of Fortran I, II, and III (PDF, 1.6 MB),

[…] The cost of programmers associated with a computer center was usually at least as great as the cost of the computer itself. In addition, from one quarter to one half of the computer’s time was spent in debugging. Thus programming and debugging accounted for as much as three quarters of the cost of operating a computer; and obviously, as computers got cheaper, this situation would get worse.

This economic factor was one of the prime motivations which led me to propose the FORTRAN project in a letter to my boss, Cuthbert Hurd, in late 1953 (the exact date is not known but other facts suggest December 1953 as a likely date).

[…]

We went on to raise the question “…can a machine translate a sufficiently rich mathematical language into a sufficiently economical program at a sufficiently low cost to make the whole affair feasible?”


The standard BLAS routines fit on a single A4 page. You need to learn how to link them, e.g. -lblas (Reference BLAS), -lopenblas (OpenBLAS), -qmkl (with Intel Fortran, or using the MKL link line advisor), or -framework Accelerate (on macOS), and potentially how to install them. I admit all of these steps can be difficult when you are a beginner.

The future C++ std::linalg addition does not mandate that implementations be written in C++; they could simply call an existing library instead. It’s up to your compiler vendor to decide whether you have to perform the explicit linking yourself or whether the link step is handled internally.
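As an illustration of the “call an existing library” route, here is a small, untested sketch of C++ code forwarding a matrix product to a Fortran BLAS dgemm. The trailing-underscore symbol name is the usual gfortran-style mangling but is not guaranteed by any standard, and you still need the link flag (e.g. -lblas or -lopenblas).

```cpp
// Declaration of the Fortran-77 BLAS routine: everything is passed by
// reference and matrices are column-major. The symbol name (dgemm_) is
// an assumption that holds for gfortran-compiled BLAS libraries.
extern "C" void dgemm_(const char* transa, const char* transb,
                       const int* m, const int* n, const int* k,
                       const double* alpha, const double* a, const int* lda,
                       const double* b, const int* ldb,
                       const double* beta, double* c, const int* ldc);

// C = A * B for column-major A (m x k), B (k x n), C (m x n).
void gemm(int m, int n, int k, const double* A, const double* B, double* C) {
  const double one = 1.0, zero = 0.0;
  dgemm_("N", "N", &m, &n, &k, &one, A, &m, B, &k, &zero, C, &m);
}
```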

Overall, I think it’s a nice addition to C++, and one that Fortran users can also benefit from. The mdspan class takes a layout policy as a template parameter, and the layout_left policy corresponds to Fortran’s column-major storage order, so a Fortran wrapper layer can be written with moderate effort.
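Here is a minimal, untested sketch of what such a wrapper layer rests on: with layout_left, mdspan’s index-to-memory mapping is exactly the Fortran column-major rule, so an array coming from Fortran can be viewed without copying. The C++23 A[i, j] subscript is assumed; the reference implementation also offers a call-operator form for older compilers.

```cpp
#include <experimental/mdspan>
#include <cstddef>

namespace stdex = std::experimental;

// Column-major (Fortran-ordered) rank-2 view: A[i, j] maps to data[i + j*m].
using fortran_matrix =
    stdex::mdspan<double, stdex::dextents<std::size_t, 2>, stdex::layout_left>;

// Example operation on a Fortran-owned buffer, with no copy involved.
void scale(fortran_matrix A, double alpha) {
  for (std::size_t j = 0; j < A.extent(1); ++j)
    for (std::size_t i = 0; i < A.extent(0); ++i)
      A[i, j] *= alpha;   // C++23 multidimensional subscript
}
```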

Addendum: I mistakenly referred to OpenBLAS for the “How to Optimize Gemm” tutorial. It actually originates from the FLAME group at the University of Texas. Both the BLIS and OpenBLAS libraries use the approach originally developed by Kazushige Goto and published in

Goto, K., & van de Geijn, R. A. (2008). Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software (TOMS), 34(3), 1–25. (preprint, PDF, 439 KB)
