There seems to be a misunderstanding that matrix multiplication in matlab is not performant, or that it is used only when “performance is not the primary concern”. I am sorry, but this is wrong.
The matrix multiplication in matlab performs much (much) better than the intrinsic matmul of Fortran compiled with the default compilation options, regardless of the compiler. See the comparison at
Sure, the matrix multiplication in matlab cannot work so well without the Fortran library behind it.
Thank you for your observation. English is not my native language, so I apologize if my previous statement was imprecise. I would like to clarify that my intention was not to compare the performance of Fortran and Matlab, but rather to discuss the notation used for matrix multiplication. In Fortran, you can perform matrix multiplication using the overloaded operator .x. when utilizing the IMSL libraries, for example. However, the performance of this (compact) approach may suffer compared to specialized matrix multiplication subroutines.
Allow me to rephrase the last part of my statement: while performance may not be the primary concern, we can perform matrix multiplication using the .x. operator, as in C = A .x. B, similar to Matlab syntax.
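For readers who haven’t seen this style, here is a minimal sketch of how such an operator can be defined in standard Fortran (the module name and the plain matmul body are illustrative assumptions; the IMSL version presumably dispatches to optimized kernels):

```fortran
module matmul_operator
  implicit none
  interface operator(.x.)
    module procedure mat_x_mat
  end interface
contains
  function mat_x_mat(a, b) result(c)
    real, intent(in) :: a(:,:), b(:,:)
    real :: c(size(a,1), size(b,2))
    c = matmul(a, b)   ! a real library would dispatch to an optimized GEMM here
  end function mat_x_mat
end module matmul_operator
```

With this module in scope, C = A .x. B reads almost like the mathematical notation.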
@zaikunzhang, as was pointed out to you, what you’re doing with MATLAB vs the intrinsic matmul is not an apples-to-apples comparison. In the case of Intel, comparing MATLAB, which is a “user” abstraction (albeit a commercial one), against your own abstraction around Intel MKL with its DGEMM offering would be closer to a reasonable comparison.
The problem with things like matmul is that they are often too inflexible to be used for real-world, long-term numerical projects.
As @PierU mentioned, the structure of the matrix matters (real, complex, Hermitian, symmetric, banded, upper/lower triangular, single or double precision, sparse, etc.). This puts a huge burden on the compiler writers to cover all these cases efficiently.
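To make the point concrete, here is a hedged sketch of how BLAS encodes structure that matmul cannot see (single precision, reference-BLAS interfaces; the program is illustrative and needs to be linked against a BLAS library):

```fortran
program structure_demo
  implicit none
  integer, parameter :: n = 100
  real :: a(n,n), b(n,n), c(n,n)
  call random_number(b)
  a = 0.5 * (b + transpose(b))   ! make A symmetric for the SSYMM case
  ! general product C = A*B: the structure of A is invisible to the routine
  call sgemm('N', 'N', n, n, n, 1.0, a, n, b, n, 0.0, c, n)
  ! the same product exploiting symmetry: only one triangle of A is read
  call ssymm('L', 'U', n, n, 1.0, a, n, b, n, 0.0, c, n)
  print *, c(1,1)
end program structure_demo
```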
This is true of some other features in modern Fortran, for example coarrays. These seem like a good idea on paper, but in practice don’t match the flexibility of the combination of MPI and OpenMP. For example, we may need to know if data is stored locally in shared memory or on another node in order to have efficient parallelism, or have deeply nested parallel loops with ranges determined at runtime. This seems to be currently impossible to replicate efficiently with coarrays, and again places the complex job of parallelism on the shoulders of the compiler writers.
And yet, not including these features (and others like OOP) would provide ammunition for people to claim that Fortran is old-fashioned and unable to compete with C++, Julia, Rust, etc.
The majority of quantum chemistry and condensed matter codes are written in Fortran: List of quantum chemistry and solid-state physics software - Wikipedia The reason for this is that Fortran is particularly well suited to converting the difficult mathematics into fast, readable code. If you take a look at the source code of these packages, you’ll find it’s mostly simple procedural Fortran 90: subroutines, functions, loops, conditional statements, contractions, etc.
I’ve been working on a condensed matter DFT code for the past twenty years, and found that the combination of Fortran 90* + BLAS + LAPACK + FFTW + OpenMP + MPI is basically perfect for the job. Most of my time is not spent fixing ‘traditional’ bugs, like segmentation faults; rather it’s spent figuring out why the numbers are wrong: is it a problem with the physics, the mathematical derivation, the numerics, or just a typo in the code? The close relation between Fortran syntax and the mathematical algorithm makes doing this much easier.
(*With some Fortran 2003 features)
In other words, the only way to do good, efficient, and system-level and portable C++ ends up to limit yourself to all the things that are basically available in C. And limiting your project to C means that people don’t screw that up, and also means that you get a lot of programmers that do actually understand low-level issues and don’t screw things up with any idiotic “object model” cr*p.
I think the same applies to Fortran, namely that for large-scale, complex, multi-year numerical projects, you’re better off limiting yourself to simple, procedural Fortran 90 and a well-chosen set of libraries.
Yes, there are lots of possibilities. This is not just a question for the matmul() intrinsic, but also for the general math libraries themselves. There are separate PLAPACK, SCALAPACK and PBLAS libraries, for example, to handle parallel linear algebra codes with various storage/distribution conventions. And even within LAPACK, I think there is some missing functionality. A common operation in my codes is the matrix product A.B.C, very often with the first or the last matrix being transposed. There are dozens of ways to compute that product, all with different operation counts and intermediate storage requirements, depending on the dimensions involved, the matrix subblocking, and so on. A very common example of that is the computation of A^T.B.A where B, and the result, are either symmetric or skew-symmetric. Back in the early 1990s, I lobbied the BLAS/LAPACK people to add those cases, but I was unsuccessful. The earlier EISPACK and LINPACK libraries had a special packed storage form for symmetric matrices, and separate routines to go with them. LAPACK does support some symmetric matrix operations, but they are stored within a rectangular array, not the packed storage convention of the earlier libraries. My codes date back to before LAPACK, and they still have some matrices stored in packed form.
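As an illustration, here is one hedged way to form S = A^T.B.A with two GEMM calls and an explicit intermediate (the dimensions are assumptions for the sketch; link against a BLAS library). When B and S are symmetric, roughly half of this work is redundant, which is exactly the special case that was never added:

```fortran
program atba
  implicit none
  integer, parameter :: n = 50, m = 20
  real(8) :: a(n,m), b(n,n), t(n,m), s(m,m)
  call random_number(a)
  call random_number(b)
  b = 0.5d0 * (b + transpose(b))   ! symmetric B for this example
  call dgemm('N', 'N', n, m, n, 1.0d0, b, n, a, n, 0.0d0, t, n)  ! T = B*A
  call dgemm('T', 'N', m, m, n, 1.0d0, a, n, t, n, 0.0d0, s, m)  ! S = A^T*T
  print *, s(1,2) - s(2,1)   ! symmetric result: off-diagonal difference ~ 0
end program atba
```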
I think it is possible to overload matmul() to handle many of those situations with special matrices or with special storage conventions, but I wonder whether that is really a good idea. This is the same question as with a general shallow_transpose() function vs. optional arguments in matmul().
With over 6000 employees and 1.25 billion USD in revenue (2021), I’d consider it a scam if they didn’t. According to the Fortran Discourse site statistics, we have about 1100 signed-up users. The whole Fortran Discourse can’t match the effort that goes into making MATLAB profitable. According to the MathWorks website, they sell over 2200 campus-wide licenses to universities and schools. How many universities or schools make direct donations to organizations supporting Fortran, or to other FOSS organizations? I doubt it’s anywhere near that.
The LANL report says:
Compiler technology costs are significant for Fortran and are trending upwards. Estimates of tens of millions of dollars have been discussed recently.
Tens of millions is still only on the order of ~1% of MATLAB’s revenue. As a different comparison, the estimated costs of the three US exascale supercomputers are:
With a few exceptions, nearly all the work on gfortran since 2002 or 3 has been done on a voluntary, unpaid basis to meet immediate needs required to support specific coding needs.
[…]
Thus, in my humble opinion, the “new generation of gfortran contributors” will only emerge if they are motivated either by need or are being paid to support gfortran. We give every assistance that we can to newbies, but our ability to do so is typically limited by our daytime jobs. That the gfortran compiler is written in C, with a tiny touch of C++, has also proven to be a barrier. All my younger colleagues are brought up to write their codes in Matlab/Octave or python and only delve into fortran when they have to, and so the pool is ever diminishing.
Given that MATLAB supports usage of the Intel oneAPI compilers, perhaps you can lobby them, as a large commercial player, to exert some influence on the Intel Fortran team over the matmul issue, or, even better, ask them to donate to free Fortran compilers like gfortran, which, they are happy to point out, is available at no charge.
In total, NVIDIA technologies power 361 of the systems on the TOP500 list, including 90% of the new systems (see chart).
I’m not sure what we can infer from this: NVIDIA showing a lack of interest in F2008+, HPC users not putting enough pressure on them, or maybe even HPC users being satisfied with the classic combination of Fortran 90 + BLAS + LAPACK + FFTW + OpenMP + MPI.
I find this bullet point silly, given that the same standards, AND EVEN MORE, are available for C++: OpenMP, OpenACC, OpenCL, SYCL, C++ stdpar. If Fortran + OpenMP/OpenACC is not supported well on the MI250, that should be taken up with AMD and HPE Cray. As Jim Cownie explains in his essay Is Fortran a “dead” language?,
The compilers used in High Performance Computing [HPC] are mostly written by the providers of the hardware, and, normally, that means the designers of the CPU (or GPU) chips.
There are good reasons for this:

- To have compilers ready when a new chip comes out, work has to start before the specifications are public, so an unrelated software company cannot do it.
- If they are sane, the hardware architects should be worrying about whether their wonderful new features can be used by compilers, so they need to be talking to compiler writers and accepting their feedback even before the specification of new instructions is finalised.
- Since the price at which a CPU or GPU can be sold is related to its performance, having a compiler which improves the performance of whatever benchmark codes are used for this evaluation is valuable, as it allows you to charge more for each machine you sell. Indeed, this may be a simpler, cheaper way to achieve performance than using an improved process to make the chip, or having it run at a higher clock frequency (and thus consume more power).
- Having an in-house compiler team allows rapid fixes to be made if there are problems when running the benchmarks required for specific procurements.
- To compete in a specific market, the customers there must be able to run their codes on the machine you are trying to sell. In the HPC market that means having good Fortran support.
As to why Fortran support may be lacking behind on the MI250 compared with C++, perverse incentives could be one explanation. Here are some slides I’ve borrowed from Jim Cownie’s talk, Growing Up as a (Software) Engineer:
@ivanpribec great points, as usual. One comment from me:
My understanding of this point is that they are saying AMD GPUs are not well supported, and there are multiple competing technologies, but none of them is “mature”. C++ has the same technologies, but I suspect they all work better on AMD GPUs.
You provided a good answer as to why it is not supported well, due to the mixed incentives.
I truly believe that Fortran users, as well as managers of Fortran codes, should not be held hostage because of this. I think the fix is simply to bring some enthusiasm into Fortran compilers: I want compilers to be leading the way, eager to support every possible piece of hardware and every accelerator. I think it will happen.
And yet, when browsing the forums about Matlab performance issues, things do not look like the perfect heaven that is suggested here.
Don’t get me wrong, Matlab is definitely great software (*). And I have no real doubt that Matlab sometimes does better with some complex array expressions than Fortran compilers do in practice. There are good reasons for that: not only the money they can put into it, as you point out, but also the fact that this is the whole point of Matlab: there’s no escaping to explicit loops when array expressions are not efficient.
Still, users sometimes complain about things that are not as fast as they expect, or about huge temporary allocations… showing that even with 6000+ employees, getting this right is not as easy as a snap of the fingers.
(*) even though the language itself is close to crap IMO. I guess this is all the weight of history, and the willingness to keep a large degree of backward compatibility.
The important first step should be to identify current trends and future directions, which point towards hierarchical parallelism:
The same is true for SYCL, but SYCL wasn’t designed around hierarchical parallelism from the start, and it already suffers from this: there are numerous language features in SYCL/DPC++ that are incompatible with hierarchical parallelism:
Also, the hierarchical parallelism introduced with the Fortran 2018 standard is fully compatible with the coarray features from the Fortran 2008 standard; Coarray Fortran was designed correctly for this from the start.
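For those who haven’t used them, here is a minimal sketch of the Fortran 2018 team features that provide this hierarchy (splitting the images into two teams is an illustrative assumption, e.g. one team per node):

```fortran
program teams_demo
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none
  type(team_type) :: subteam
  integer :: color
  ! split the images into two teams based on image index
  color = merge(1, 2, this_image() <= num_images() / 2)
  form team (color, subteam)
  change team (subteam)
    ! inside the block, this_image() and num_images() refer to the subteam
    print *, 'team', color, ': image', this_image(), 'of', num_images()
  end team
end program teams_demo
```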
But then, as the authors of the DPC++ book explain for SYCL, we have to be patient with those who implement compilers and frameworks: “The other aspect of hierarchical parallelism that was briefly touched on in Chapter 4 is its implementation complexity. Mapping nested parallelism to accelerators is a challenge that is not unique to SYCL or DPC++, and this topic is the subject of much interest and research. As implementers gain experience with the implementation of hierarchical parallelism and the capabilities of different devices, we expect syntax in SYCL and DPC++ to evolve in alignment with standard practice.”, https://link.springer.com/content/pdf/bbm:978-1-4842-5574-2/1?pdf=chapter%20toc .
For over two years, 2019-2021, I witnessed a massive misinformation campaign against Fortran by JuliaComputing (Julia 1000X faster than Fortran (false), Julia faster than SciPy because of SciPy’s use of Fortran (false), Fortran lacks inlining and loop unrolling (false), …).
I’d be very surprised if that kind of statement came from JuliaLang. Can you point me to examples?
The 1000x speed-up I see is against Python in some cases, but Fortran has always been my benchmark. And JuliaLang have always stated that Julia can reach Fortran speeds (true), but never that it’s orders of magnitude faster than Fortran. (Matrix multiplication, for some reason, is 2x faster in Julia than in Fortran even with -O3, which baffled me. But then I only tried gfortran.)
But in general, the official discussion at JuliaLang treats C and Fortran as a performance benchmark (and fairly).
Admittedly, there are some fanboys out there that may spout such nonsense (1000x faster than Fortran) but they exist in every culture and are usually put in their place quickly.
Although, I have to admit, I use Julia much more than Fortran or MATLAB these days. So there is a threat scenario for Fortran, since Julia has solved the two-language problem for me (partially; memory optimisation is sometimes still a problem).
But then I’m just a scientist using these languages, not a computer scientist.
I think that a sort of shallow_transpose or similar would be more useful in other contexts. I’m quite sure that in a simple expression a compiler can optimize it just fine.
But there are situations where transpose is not allowed even though a real shallow transpose would be useful:
```fortran
integer, parameter :: n = 3, m = 4
real, target  :: a(n,m)     ! target attribute required for pointer association
real, pointer :: ap(:,:)

ap(:,:) => shallow_transpose(a)   ! not allowed with transpose(a)
```
Now, I would prefer an alternative syntax: instead of a function called shallow_transpose, one could use something like
```fortran
ap(:,:) => a % transpose    ! pointer to a transposed view, no copy
call mysub(a % transpose)   ! pass a transposed view to a subroutine
```
It is reminiscent of the %re and %im designators used to get the real and imaginary parts of a complex number: they are definable and can be used in both contexts.
By the way, this syntax is similar to Python’s (NumPy’s) .T syntax for getting the transpose of a matrix.
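For reference, a minimal sketch of how the complex-part designators behave in both contexts (standard Fortran 2008):

```fortran
program complex_parts
  implicit none
  complex :: z(3)
  z = (1.0, 2.0)
  z%re = 0.0          ! definable: sets all real parts
  print *, z%im       ! usable in expressions: reads the imaginary parts
end program complex_parts
```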
Similarly, one could allow a sort of reorder pseudo-method that would be equivalent to changing the descriptor of the array (basically a sort of transpose for a general multi-dimensional array). This pseudo-method would need an order argument, so it would look like a function call.
```fortran
ap => a % reorder(2,1)   ! equivalent to a % transpose
```
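For comparison, standard Fortran today can express this reordering only as a copy, via the order argument of reshape; a minimal sketch with assumed shapes:

```fortran
program reorder_copy
  implicit none
  real :: a(2,3), at(3,2)
  call random_number(a)
  ! a copy with permuted dimensions, equivalent to transpose(a),
  ! but not a shallow view of a's storage
  at = reshape(a, shape=[3,2], order=[2,1])
  print *, maxval(abs(at - transpose(a)))   ! prints 0.
end program reorder_copy
```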
Sorry for the misunderstanding. I was referring to the two-language problem in scientific computing, where a researcher (a subject expert, but not a computer scientist) starts developing a model/method in a high-level, interpreted language like Matlab or Python, and then, when the increasing complexity requires high performance at HPC level, needs to re-implement the whole thing from scratch in a lower-level, compiled language like C, C++ or Fortran.
Julia solves that for many applications, since I can work at a very high abstraction level, even higher than, e.g., in Matlab, and still get close to C or Fortran performance.
Naive implementations in Julia, in my (limited) experience, are a few times slower than Fortran, but with a few simple optimisations (to ensure type stability) I get on-par performance and straightforward parallelisability (Julia has demonstrated petaflop scalability).
It doesn’t (yet) cover the last step, where real-time performance or easily deployable binaries are required, but it pushes that boundary back quite a bit, which means a lot in research. Many projects are abandoned, or never used on larger models, because the performance is not good enough in Python or Matlab, and nobody can afford the reimplementation.
With my first scientific language having been Fortran, I can say that this problem has become more pressing now that more scientific high-performance software is written in C++, which is just oodles more complicated for a natural scientist to learn than Fortran (so it is more difficult to move to a compiled language). With none of my students learning Fortran anymore, Julia has been a good experience for me so far (albeit a bit niche, but a niche with a lot of interesting people in it).
This can occur even when the researcher uses the compiled language. My son is working as a C++ programmer at a financial company. He is rewriting some of the C++ code written by financial quants at his firm because it is too slow and not written well.
Intel Fortran supports DO CONCURRENT offload to GPU (using their OpenMP backend). I verified it for trivial cases but found some bugs, which I reported to Intel immediately. @rouson has been using it and he will have a better sense of how mature it is.
Cray, Intel, Fujitsu and NVIDIA all support CPU parallelism in DO CONCURRENT. Details can be found in BabelStream Fortran.
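For anyone who hasn’t tried it, here is a minimal DO CONCURRENT kernel of the kind these compilers can parallelise; with nvfortran, for example, -stdpar=gpu maps such loops to the GPU (flags vary by vendor, so check your compiler’s documentation):

```fortran
program saxpy_dc
  implicit none
  integer, parameter :: n = 1000000
  real :: x(n), y(n)
  integer :: i
  x = 1.0
  y = 2.0
  ! iterations must be free of cross-iteration dependencies,
  ! which is what allows the compiler to run them in parallel
  do concurrent (i = 1:n)
    y(i) = y(i) + 2.0 * x(i)
  end do
  print *, y(1), y(n)
end program saxpy_dc
```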
I thought Intel only supported offload to Intel GPUs, not to their competitors’ products. Also, to get DO CONCURRENT to work on a CPU with ifort, I think you have to turn on auto-parallelization, i.e. -parallel. I’ve never had a case where Intel’s auto-parallelization improved performance (and in most cases it ran slower than serial). Can you offload with OpenMP without auto-parallel on Intel?
Intel supports only their own GPUs when it comes to OpenMP and DO CONCURRENT offload; sorry that was not clear. You should check out the docs for IFX. I think -parallel was specific to ifort and isn’t supported with ifx.
Well, I wanted it to be definable.
Actually I was following the ideas of @RonShepard.
Take it as a sort of exercise. You can have a look at
that uses ISO_Fortran_binding to get something similar to:
```fortran
ap => transpose(a)
```
It is still non-conforming, as I’m modifying the descriptor of the output pointer array outside the permitted routines.
But it seems to work with gfortran version 11 and ifort 2021.