Ah, this is another possible trade-off, efficiency for instruction set portability. I use clusters for computing. When the cluster is new, then all of the cpus are the same (number of cores, size of caches, SSE instruction sets, etc.). Thus one can set compiler options to write instruction sequences that are specific to that hardware. But over time, nodes fail and are replaced, and new nodes are added to the old machine. These new replacements usually have whatever new processors are available at the time, perhaps different SSE instructions, different numbers of cores, and so on. In this case, the old executables are now nonoptimal on the new nodes, or perhaps they even fail entirely if some feature of the old nodes is no longer supported. So this requires the executables to be recompiled so that they run on all of the available nodes, and this is usually some compromise in the performance.
I was thinking about this thread Does LAPACK/BLAS automatically use multi cores or threads?, but they both covered some of the same ground.
Yes, synthetic benchmarks can be unrealistic. I’ve seen compilers remove entire do loops in benchmarks because they recognize and remove dead code. But in general I’ve found that what one learns with tuning and optimizing benchmarks can be applied also to optimizing production codes.
User: I am looking for a language that is fairly easy to use and can solve my problems sufficiently well out of the box. I am not a programmer, so I may not know software and hardware that well, and I hope to focus on my problem. Programming is only the way to finish a part of my job, but it is not the job itself.
MATLAB/Python/R/Julia: Great! That is exactly why and how we design our language.
Vive le Fortran !
I would say that this is not a fair characterization of the comment by @kargl. Pointing out the problems with synthetic benchmarks, a very limited one in this particular case, is not the same as saying that a particular user’s needs are silly. All of those things that he mentions, the different SSE and AVX instructions, do have practical consequences. They can affect performance by an order of magnitude or more. In a language like fortran, the programmer has some control over that. In the scripting languages that you mention, he doesn’t. Whether that is good or bad depends on the perspective of the user, and even further, a particular user might opt for a scripting solution for one of his problems and for the more flexible and powerful fortran solution for another.
In Julia the user does have some control over that. But there is a common tension among new Julia users, which come to the language imagining that it will magically provide stunning performances without any effort, and that is not true. In general, either the user has a black-box package that solves effectively his/her specific problem, and that can be in any language, or he/she has to learn some programming, in any language. Thus, I do not agree with the general picture that @zaikunzhang pictured. That choice is not that simple, in any case.
Hello, nice to receive your questions / comments, and thank you very much for them. I am a MATLAB user. So let me focus on MATLAB.
Reading the documentation is not very often needed. It is performant out of the box for most of the work I need to do. As I mentioned, users need a tool that is fairly easy to use and can solve their problems sufficiently well.
Exceptional cases exist, of course, but they are exceptions.
This is impressive! However, most users do not care. They just want a tool that is fairly easy to use and can solve their problems sufficiently well on the platform they are using with the least programming effort, because they are not programmers.
My teachers did not teach me good English, but they did teach me that it is downgrading to answer questions that contain words like “silly”. So I will neglect this one. Thank you anyway.
By the way, the “test” mentioned above (it is too nonsystematic to be called a benchmark) did not contain any conclusion. Instead, it just showed some phenomena and ended openly with questions. I hope the English in that post was good enough so that anyone literate can understand it correctly. If no, I apologize again for my poor English.
Anyway, it is a bit surprising but still amusing to see people attacking imaginary conclusions that never exist.
For Julia information: Julia supports ARM64 (v7 and v8), powerpc, aarch64, i686, x86_64 and supports Linux, Windows, Mac, and FreeBSD (and has been run on Fujitsu which as far as I know is the only spark64 system around). Congrats if you have an alpha system, but I’m not sure that any maintenance cost is worth supporting an architecture that was dead 20 years ago. Despite supporting all relevant architectures for modern computers (though we do need risc-V) we have CPU feature detection code so that users aren’t giving up 20x performance by default unless they set architecture specific flags.
I’m not really a Matlab user, but I’ve watched over the shoulder those who are. If you have some code in Matlab that performs poorly, then the “solution” is to write the code in fortran, compile it, and then link that compiled code into Matlab. I might be incorrect, or out of date, about this, but as far as I know, there is not a way to actually optimize loops and expressions in the Matlab scripting code to achieve optimal performance. One must go outside the language, with fortran, or C, or assembly, to step beyond scripting level performance. I’m also unsure about the other scripting languages, R, Python, etc., about this issue. Is it possible to write, compile, and tune applications for optimal performance in these languages, or must the programmer go outside of the language to achieve these things? Of course, one can mix optimal assembler code with fortran code too, but with the popular fortran compilers one also has some control within the language over which instructions are allowed, how expressions are evaluated, how loops are optimized, how parallelization can be exploited, and so on.
I am a MATLAB user, so I would like to share my personal view, which may be wrong.
No, this is not true (it might have been true last century, but I did not know computers yet at that time).
Here is what I copy-paste from the official documentation of MEX (N.B.: it is not my opinion; I only paste it here for your reference; the bold fonts of the last sentence are by me):
This is not true either.
First of all, loops: MATLAB users do not write loops unless loops are intrinsically needed for the algorithm being coded (e.g., a line search algorithm for optimization intrinsically needs a loop, but a matrix-matrix product is NOT intrinsically a loop). Instead, they write almost everything in terms of matrix/vector operations (that’s in the name: MATrix LAB), which provide most likely the best representation of the mathematics and logic of the algorithm under consideration. I would like to quote my elaboration in the post Automatic arrays and intrinsic array operations: to use or not to use?:
Second, expressions: all the built-in operations (e.g., matrix-matrix/vector multiplication/addition, matrix factorization, linear solver …) are optimized and they are performant out of the box unless your problem is extraordinary or you specify a wrong platform-dependant option (most users do not and need not know how to specify such an option, because it is rarely needed). In addition, MATLAB optimizes expressions (up to a certain level). For example, suppose that you write
x = y*B'
B' is the transpose of a matrix
y is a row vector, and
* is a matrix-vector product. According to my test (without any tuning), MATLAB will not take the transpose of B, but will automatically calculate it as
x = (B*y')'
Sure. See Performance of the MATLAB Compiler: Speedups by Two Orders of Magnitude Are Possible, which is an announcement made in 1996. I did not check what is the current status, but it seems to have elvolvded to the MATLAB coder, but I am not sure since I have never used/needed it.
Loops are rarely needed as mentioned above. MATLAB does provide (some) control on how the matrix-vector operations are evaluated — you can specify the library to link to, but you do not need to do so unless your platform is unusual (hence most users do not know the existence of such an option). Admittably, it does not provide control on how expressions should be evaluated — but why should the user control it if it is already optimized? To de-optimize it?
For parallelization in MATLAB, see the documentation of
parfor for example — wait, you do not really need to read the documentation. Just write
parfor iter = 1 : maxiter % Your parallel code here end
and it will work surprisingly well without any tuning or specifying any option. Of course, if you would like to control its behavior, you do need to consult the documentation, but most users do not need.
In my applications, it runs on a single computer. However, I know that it can work on multiple nodes, although I do not know how to do it since my current job can be done without doing so.
My co-author told me
parfor, and mentioned that the syntax is the same as a plain
for. That’s it.
This is half true. It is possible to design many algorithms avoiding loops, but that is not necessarily natural and may require significant creativity and experience. Many algorithms are just more natural if one can use loops effectively.
Additionally, that necessity of adapting the algorithm to the language frequently leads to suboptimal implementations with, in particular, lots of unnecessary intermediates.
This is one of the reasons for Julia to be nice, and for the proliferation of JIT compilers for Python. (Afaik even for Matlab there are some development in this front)
I do agree that there exist scenarios where a loop is intrinsically needed, as is stated below.
However, how many times have you seen an algorithm whose natural description involves more than two loops? I can only enumerate very few after years of study in computational mathematics. Most algorithms involve at most a main loop and not more than that.
Ideally, anything that is not described/defined as a loop (a typical example is a matrix-matrix multiplication) should not be coded as a loop in a language named after “formula translation”.
Surly, on a lower level, you still need to implement things like matrix-matrix multiplication as loops, but it is a level that should be handled by low-level programmers, and most users should spend little time on it, if ever. Otherwise, we are re-inventing wheels, wasting the efforts made by numerical analysts since the 1970s, and also wasting the time spent by Fortran compiler writers on intrinsic matrix/vector procedures.
I agree that intermediates are sometimes needed, but they cannot be completely avoided in any language. In addition, a good MATLAB user should be able to avoid them in most cases.
I do not agree that coding in matrix/vector operations is “adapting the algorithm to the language”. The truth is quite the opposite. In most cases, matrix/vector operations provide the most natural description of (numerical) algorithms, so no adaptation is needed.
In contrast, coding matrix/vector operations by loops is very often truly “adapting the algorithm to the language”, because most matrix/vector operations are not defined by loops in mathematics. Loops are only the way we compute them on computers.
Matrix operations are better expressed as such. But algorithms that do stuff with them (simulations, optimization) require loops all the time, and not using them requires extra effort.
There are areas where Matrix operations are almost all that matters. Important ones, like machine learning. But that is area dependent. In what I do (particle simulations) matrix operations are a small part of what matters for having performant code. Loops are not an exception, they permeate all the code.
But, back to the topic, there is no reason for Fortran to not have the simplest linear algebra interfaces possible.
There remains the question of how much should be in the language itself, and how much should be simply in a library with a standard API. The latter approach can be done independently of the fortran standards process.
Also there is the quality of implementation issue, particularly regarding stack/heap allocation of temporary arrays used in the computation. Many think that a valid expression should never result in run time errors unless all resources are exhausted and it is impossible to compute. Others think that the results should be computed in the fastest possible way (e.g. stack storage only) and the programmer is responsible for rearranging the expression to avoid any runtime errors. Which of those two extremes should be the goal for fortran?
Libraries are perfectly fine, imho.
Fortran has included a nice amount of basic matrix/vector operations as intrinsic procedures. While adding more would be very nice, improving the existing ones are more important and urgent.
For intrinsic procedures, definitely the first one, IMHO. For libraries, it is the developers’ decision to make.
My first language is not English, and I understand that there is a technical meaning to “silly”. It means expending a lot of effort for no good result apart from amusement or distraction/diversion. When a technical person uses “silly” in a technical discussion, that is the first meaning that comes to mind.
This brings up something else that I’ve advocated in standard fortran, namely the ability to query the amount of stack, heap, or cache space that is available at run time at any moment during program execution. In the present situation, limits on these resources can cause a program to fail, yet the programmer has no ability to test for the available resources to prevent that failure. There are nonstandard, nonportable, ways for that test, but there are no standard and portable ways to do so. So when a programmer writes a library routine that requires these resources, at present he has no standard way to tune or modify the algorithm to automatically adjust itself based on that query.
Just to give one example, if one could query the amount of stack space, then the programmer could choose to use automatic array allocation of a temporary work array if that array fits on the stack. Otherwise the programmer could ALLOCATE the work array on the heap. That library routine could be written in a robust way so that it would never fail due to lack of stack space.
Yes, I understand that the fortran standard does not know anything about stacks, heap, or caches, but these are practical issues that are important. They have been important for decades to programmers, not just fortran programmers. It is long past time that fortran programmers are given access to this information.