In a somewhat similar thread, already mentioned by @ivanpribec in post #6 above, there was a discussion about including BLAS etc. in the C++ standard library, with some guys prophesying the end of Fortran (a.k.a. EOF). But they will struggle with the very same problem. There is no single superfast C/C++ function using inline assembler that would suit a general-purpose libstdc++ or any other library, since they all have to support a huge variety of CPU hardware. So IMHO the only way is to provide several functions for different "-march" or equivalent options.
That is not much different from what a Fortran implementation would have to provide for intrinsics like matmul to make them really fast.
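
To illustrate what "several functions for different -march options" could look like in practice, here is a minimal sketch, assuming GCC or Clang on x86. The names dot, dot_generic, dot_avx2 and select_dot are made up for this example and are not taken from libstdc++ or any BLAS. The AVX2 variant simply reuses the same loop and lets the compiler vectorize it for that target; a real library would more likely use hand-tuned kernels.

```cpp
#include <cstddef>

// Portable baseline: compiled for the default target of the build.
static double dot_generic(const double* a, const double* b, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
// Same source, but the compiler is allowed to emit AVX2 instructions here,
// as if only this function had been built with -mavx2 (assumption: GCC/Clang).
__attribute__((target("avx2")))
static double dot_avx2(const double* a, const double* b, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}
#endif

using dot_fn = double (*)(const double*, const double*, std::size_t);

// Pick the best variant the running CPU supports.
static dot_fn select_dot() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return dot_avx2;
#endif
    return dot_generic;
}

// Public entry point: the choice is made once, on the first call.
double dot(const double* a, const double* b, std::size_t n) {
    static const dot_fn impl = select_dot();
    return impl(a, b, n);
}
```

A caller only ever sees dot(a, b, n); which variant runs depends on the machine. On other compilers or architectures the sketch falls back to the generic loop, which is the point: every extra hardware target means one more variant that somebody has to write and maintain.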