It is the *ad hoc* approaches, once they prove themselves successful, that become part of the standard. There is a whole open development project occurring now with the fortran standard library that consists of *ad hoc* approaches. The successful ones, which are widely useful and prove themselves, may eventually end up in the standard. This is, more or less, what programming (in any language) is about. One has a problem, so they write programs to solve it. If `matmul()`

isn’t fast enough, then you write an *ad hoc* replacement that is. If it takes 30 or 40 years for `matmul()`

to catch up, then you make your choice of which to use during that time and use it.

I agree.

On the other hand, Fortran is already relatively rich in terms of standard matrix/vector procedures, compared to C/C++ for example. The language will be largely improved if these procedures can be further sharpened (up to the level of MATLAB, for example). Such an improvement is a low-hanging apple.

It will be a great pity if the standard matrix/vector procedures stay uncompetitive compared with MATLAB in the foreseeable future (e.g., 10 years). It is to some extent even worse than not providing the procedures at all. Users will hate the language after a bitter experience of SEGFAULT when using standard procedures.

In this case, the real problem is obviously not the choice of using `matmul`

, but the 30 or 40 years. **I am sadly sure that Fortran will not survive that long if the situation comes true.**

I posted some timings here a few weeks ago that shows that this mapping works very well. Particularly for larger matrices, there is only a minor amount of overhead associated with the mapping compared to direct xGEMM() calls.

Can you explain briefly how this mapping works? Does it replace all matrix-matrix products with xGEMM() calls? Under what circumstances is copy-in/copy-out invoked (with associated temporary memory allocation). Does it recognize vector-matrix and matrix-vector products and replace those with xGEMV() calls?

If all of the BLAS routines (levels 1, 2, and 3) were incorporated into standard fortran, would there be any inherent advantages to the current situation in gfortran with compiler-recognized mapping?

*Within Fortran, there is a much smaller and cleaner language struggling to get out*

Yes, of course that is the problem, but what is there that can be done about it? We cannot force programmers to write fortran compilers for us, and for the existing fortran compilers, we cannot force them to write optimal matrix and vector routines.

Instead, we use the tools that we have available at any time, and we do the best that we can to make things better in the present and in the future. As to whether fortran will survive another 30 or 40 years, no one knows that. If someone had told me in the 1980s that the proposed intrinsic `matmul()`

routine in fortran 8x would generate segmentation faults because it exhausts stack space and it would be 10x to 100x slower than external library routines in 2022, I would not have believed them. I would have thought that even in the 1980s that this was a solved problem. Yet that is the situation we have been for the last 30+ years. Will it continue this way for another 30 years? Who knows?

I posted in a separate thread a while back that one veiwpoint for fortran intrinsic functions (all of them generally, not just array operations) is that fortran should provide robust algorithms that never segment fault and that always return the correct result (correct to the last few bits at least for floating point). Then it is up to the programmer who understands his problem and who knows where accuracy can be compromised for higher efficiency, or for tuning specific hardware, for storage requirement vs efficiency tradeoffs, etc. to optimize his application for performance. Except for the segment faults that matmul() sometimes generates, this is pretty much where we are in fortran now. Some compilers maybe do a little better than others, but generally speaking `matmul()`

is slow for all compilers and on all hardware. If a programmer wants to improve performance, there are dozens of ways to do so, from compiler options to external libraries to writing tuned assembler code. Or, as is often mentioned, one can contribute effort to one of the open source compiler projects.

Ah, this is another possible trade-off, efficiency for instruction set portability. I use clusters for computing. When the cluster is new, then all of the cpus are the same (number of cores, size of caches, SSE instruction sets, etc.). Thus one can set compiler options to write instruction sequences that are specific to that hardware. But over time, nodes fail and are replaced, and new nodes are added to the old machine. These new replacements usually have whatever new processors are available at the time, perhaps different SSE instructions, different numbers of cores, and so on. In this case, the old executables are now nonoptimal on the new nodes, or perhaps they even fail entirely if some feature of the old nodes is no longer supported. So this requires the executables to be recompiled so that they run on all of the available nodes, and this is usually some compromise in the performance.

I was thinking about this thread Does LAPACK/BLAS automatically use multi cores or threads?, but they both covered some of the same ground.

Yes, synthetic benchmarks can be unrealistic. I’ve seen compilers remove entire do loops in benchmarks because they recognize and remove dead code. But in general I’ve found that what one learns with tuning and optimizing benchmarks can be applied also to optimizing production codes.

**User**: I am looking for a language that is fairly easy to use and can solve my problems sufficiently well out of the box. I am not a programmer, so I may not know software and hardware that well, and I hope to focus on my problem. Programming is only the way to finish a part of my job, but it is not the job itself.

**MATLAB/Python/R/Julia**: Great! That is exactly why and how we design our language.

**Fortran**: Silly.

Vive le Fortran !

I would say that this is not a fair characterization of the comment by @kargl. Pointing out the problems with synthetic benchmarks, a very limited one in this particular case, is not the same as saying that a particular user’s needs are silly. All of those things that he mentions, the different SSE and AVX instructions, do have practical consequences. They can affect performance by an order of magnitude or more. In a language like fortran, the programmer has some control over that. In the scripting languages that you mention, he doesn’t. Whether that is good or bad depends on the perspective of the user, and even further, a particular user might opt for a scripting solution for one of his problems and for the more flexible and powerful fortran solution for another.

In Julia the user does have some control over that. But there is a common tension among new Julia users, which come to the language imagining that it will magically provide stunning performances without any effort, and that is not true. In general, either the user has a black-box package that solves effectively his/her specific problem, and that can be in any language, or he/she has to learn some programming, in any language. Thus, I do not agree with the general picture that @zaikunzhang pictured. That choice is not that simple, in any case.

Hello, nice to receive your questions / comments, and thank you very much for them. I am a MATLAB user. So let me focus on MATLAB.

Reading the documentation is not very often needed. It is **performant out of the box** for most of the work I need to do. As I mentioned, users need a tool that is fairly easy to use and can solve their problems sufficiently well.

Exceptional cases exist, of course, but they are exceptions.

This is impressive! However, most users do not care. They just want a tool that is fairly easy to use and can solve their problems sufficiently well on the platform they are using with the least programming effort, because they are not programmers.

My teachers did not teach me good English, but they did teach me that it is downgrading to answer questions that contain words like “silly”. So I will neglect this one. Thank you anyway.

By the way, the “test” mentioned above (it is too nonsystematic to be called a benchmark) did not contain any conclusion. Instead, it just showed some phenomena and ended openly with questions. I hope the English in that post was good enough so that anyone literate can understand it correctly. If no, I apologize again for my poor English.

Anyway, it is a bit surprising but still amusing to see people attacking imaginary conclusions that never exist.

For Julia information: Julia supports ARM64 (v7 and v8), powerpc, aarch64, i686, x86_64 and supports Linux, Windows, Mac, and FreeBSD (and has been run on Fujitsu which as far as I know is the only spark64 system around). Congrats if you have an alpha system, but I’m not sure that any maintenance cost is worth supporting an architecture that was dead 20 years ago. Despite supporting all relevant architectures for modern computers (though we do need risc-V) we have CPU feature detection code so that users aren’t giving up 20x performance by default unless they set architecture specific flags.

I’m not really a Matlab user, but I’ve watched over the shoulder those who are. If you have some code in Matlab that performs poorly, then the “solution” is to write the code in fortran, compile it, and then link that compiled code into Matlab. I might be incorrect, or out of date, about this, but as far as I know, there is not a way to actually optimize loops and expressions in the Matlab scripting code to achieve optimal performance. One must go outside the language, with fortran, or C, or assembly, to step beyond scripting level performance. I’m also unsure about the other scripting languages, R, Python, etc., about this issue. Is it possible to write, compile, and tune applications for optimal performance in these languages, or must the programmer go outside of the language to achieve these things? Of course, one can mix optimal assembler code with fortran code too, but with the popular fortran compilers one also has some control within the language over which instructions are allowed, how expressions are evaluated, how loops are optimized, how parallelization can be exploited, and so on.

I am a MATLAB user, so I would like to share my personal view, which may be wrong.

No, this is not true (it might have been true last century, but I did not know computers yet at that time).

Here is what I copy-paste from the official documentation of MEX (N.B.: it is not my opinion; I only paste it here for your reference; the bold fonts of the last sentence are by me):

This is not true either.

First of all, loops: **MATLAB users do not write loops** unless loops are intrinsically needed for the algorithm being coded (e.g., a line search algorithm for optimization intrinsically needs a loop, but a matrix-matrix product is NOT intrinsically a loop). Instead, they write almost everything in terms of matrix/vector operations (that’s in the name: MATrix LAB), which provide most likely the best representation of the mathematics and logic of the algorithm under consideration. I would like to quote my elaboration in the post Automatic arrays and intrinsic array operations: to use or not to use?:

Second, expressions: all the built-in operations (e.g., matrix-matrix/vector multiplication/addition, matrix factorization, linear solver …) are optimized and they are performant **out of the box** unless your problem is extraordinary or you specify a wrong platform-dependant option (most users do not and need not know how to specify such an option, because it is rarely needed). In addition, MATLAB optimizes expressions (up to a certain level). For example, suppose that you write

```
x = y*B'
```

where `B'`

is the transpose of a matrix `B`

, `y`

is a row vector, and `*`

is a matrix-vector product. According to my test (**without any tuning**), MATLAB will not take the transpose of B, but will automatically calculate it as

```
x = (B*y')'
```

Sure. See Performance of the MATLAB Compiler: Speedups by Two Orders of Magnitude Are Possible, which is an announcement made in 1996. I did not check what is the current status, but it seems to have elvolvded to the MATLAB coder, but I am not sure since I have never used/needed it.

Loops are rarely needed as mentioned above. MATLAB does provide (some) control on how the matrix-vector operations are evaluated — you can specify the library to link to, but you do not need to do so unless your platform is unusual (hence most users do not know the existence of such an option). Admittably, it does not provide control on how **expressions** should be evaluated — but why should the user control it if it is already optimized? To de-optimize it?

For parallelization in MATLAB, see the documentation of `parfor`

for example — **wait,** you do not really need to read the documentation. Just write

```
parfor iter = 1 : maxiter
% Your parallel code here
end
```

and it will work surprisingly well without any tuning or specifying any option. Of course, if you would like to control its behavior, you do need to consult the documentation, but most users do not need.

In my applications, it runs on a single computer. However, I know that it can work on multiple nodes, although I do not know how to do it since my current job can be done without doing so.

My co-author told me `parfor`

, and mentioned that the syntax is the same as a plain `for`

. That’s it.

This is half true. It is possible to design many algorithms avoiding loops, but that is not necessarily natural and may require significant creativity and experience. Many algorithms are just more natural if one can use loops effectively.

Additionally, that necessity of adapting the algorithm to the language frequently leads to suboptimal implementations with, in particular, lots of unnecessary intermediates.

This is one of the reasons for Julia to be nice, and for the proliferation of JIT compilers for Python. (Afaik even for Matlab there are some development in this front)

I do agree that there exist scenarios where a loop is intrinsically needed, as is stated below.

However, how many times have you seen an algorithm whose natural description involves more than two loops? I can only enumerate very few after years of study in computational mathematics. Most algorithms involve at most a main loop and not more than that.

Ideally, anything that is not described/defined as a loop (a typical example is a matrix-matrix multiplication) should not be coded as a loop in a language named after “formula translation”.

Surly, on a lower level, you still need to implement things like matrix-matrix multiplication as loops, but it is a level that should be handled by low-level programmers, and most users should spend little time on it, if ever. Otherwise, we are re-inventing wheels, wasting the efforts made by numerical analysts since the 1970s, and also wasting the time spent by Fortran compiler writers on intrinsic matrix/vector procedures.

I agree that intermediates are sometimes needed, but they cannot be completely avoided in any language. In addition, a good MATLAB user should be able to avoid them in most cases.

I do not agree that coding in matrix/vector operations is “adapting the algorithm to the language”. The truth is quite the opposite. In most cases, matrix/vector operations provide the most natural description of (numerical) algorithms, so no adaptation is needed.

In contrast, coding matrix/vector operations by loops is very often truly “adapting the algorithm to the language”, because most matrix/vector operations are not defined by loops in mathematics. Loops are only the way we compute them on computers.

Matrix operations are better expressed as such. But *algorithms* that do stuff with them (simulations, optimization) require loops all the time, and not using them requires extra effort.

Agreed.

I do agree that there exist scenarios where a loop is intrinsically needed, as is stated below.

There are areas where Matrix operations are almost all that matters. Important ones, like machine learning. But that is area dependent. In what I do (particle simulations) matrix operations are a small part of what matters for having performant code. Loops are not an exception, they permeate all the code.

But, back to the topic, there is no reason for Fortran to not have the simplest linear algebra interfaces possible.