An evaluation of risks associated with relying on Fortran for mission critical codes for the next 15 years

Two things I notice in these slides. I am not sure where they get that do concurrent is not widely adopted (if they mean supported by compilers). All the compilers that matter in HPC support do concurrent. The problem (and probably the meaning of the “widely adopted” comment) is that none of the compilers outside the most recent NVIDIA compilers actually do anything (most of the time) concurrently, at least as far as I can see. (I would also like to hear from the compiler developers how many intrinsic functions etc. are threaded or designed to take advantage of multi-core processors.) I also see no mention of coarrays and MPI, but I guess those are not germane to the SYCL/GPU crowd.

2 Likes

I think they are mainly concerned with exploiting parallelism within a single GPU to start with. We can learn more in the full open-access research article about PSyclone: Transforming Fortran weather and climate applications to OpenCL using PSyclone | Proceedings of the 2023 International Workshop on OpenCL.

PSyclone is a domain-specific compiler, chosen/developed as the “portability” approach for Fortran codes developed by the UK Met Office. I’m not 100% sure, but I think there may have been a presentation about it at FortranCon.

Speaking of distributed-memory parallelism, I noticed a paper in the past which mentioned the idea of a SYCL PGAS concept, but I haven’t looked for or noticed any further developments in this direction:

Current SYCL implementations do not provide distributed-memory parallelism, so no explicit communication statements and no data transfers can be specified, greatly reducing the source of potential errors, but, of course, also restricting the application to single nodes. While theoretically it would be possible to implement a hybrid parallelism concept such as MPI+SYCL, this would violate the ideas and concepts behind SYCL. Rather, a better route is to employ a SYCL implementation that internally handles data transfers and exposes a partitioned global address space (PGAS) to the user. Basically, SYCL already provides a task graph model from which data dependencies could be derived and resolved accordingly.

2 Likes

If they don’t want to use Fortran, leave them, but this evaluation is not proper work and should not be used by anyone as an excuse for rejecting Fortran. The authors completely ignore the coarray runtime (comprising coarrays and collective subroutines for communication), which should be expected to be Fortran’s core component for (upcoming) heterogeneous programming in pure Fortran.

They don’t even mention coarrays at all, which leaves a really bad taste: it seems they only want to justify an already-made decision against Fortran, no matter what. But then the only obvious reason for publishing such an evaluation would be to fool others, and even the sections with ‘Contrasting Reviewer Judgments’ appear to be an attempt at (self-)delusion.

To be crystal clear, it is just way too early to give any evaluation of what exactly heterogeneous programming will look like in the next years. Personally, I am still learning much from SYCL/DPC++, and I am using it to apply programming techniques with Coarray Fortran. But I can only guess whether this will really work on future heterogeneous systems. Approaching heterogeneous computing with any programming technique must be done gradually, in small steps, but we can’t just sit and wait anymore these days.

The main problem, as I see it at the moment, could be a limitation of what implementers (of any parallel programming language or framework) can really do. I see this for Coarray Fortran, and I believe the authors of the DPC++ book see it very similarly for SYCL/DPC++; for example, see chapter 19 in the book, the explanations of memory models, especially pages 496 and 504-506:
Memory Model and Atomics | SpringerLink .

This is very relevant to Coarray Fortran as well, and we can already apply some relevant programming techniques. The authors of ‘Modern Fortran Explained’ declared the Fortran 2008 SYNC MEMORY statement, and ATOMIC_DEFINE and ATOMIC_REF, as deprecated features (section A.9.1 in the book). I can understand why, but this is still a mistake: in my (experimental) programming practice, only a single ATOMIC_DEFINE, a single ATOMIC_REF, and a single atomic coarray variable are required. The authors state (on page 452): “We see the construction of reliable and portable code in this way as very difficult – it is all too easy to introduce subtle bugs that manifest themselves only occasionally.” In my practice, it is quite the contrary: atomic_define and atomic_ref are the only fully reliable technique to detect any failure (i.e. a missing atomic operation) across coarray images. (I am using additional checks for non-atomic coarrays as well, but these are never 100% reliable.)
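
For illustration, here is a minimal sketch of the kind of pattern I mean (not the actual code from my repository, just the general shape: one atomic coarray flag, one atomic_define, one atomic_ref; run it with at least two images):

program atomic_flag_sketch
   use, intrinsic :: iso_fortran_env, only: atomic_int_kind
   implicit none
   integer(atomic_int_kind) :: flag[*]
   integer(atomic_int_kind) :: val

   if (num_images() < 2) error stop 'run with at least 2 images'

   flag = 0
   sync all

   if (this_image() == 1) then
      ! image 1 signals image 2 by atomically setting the flag there
      call atomic_define(flag[2], 1_atomic_int_kind)
   else if (this_image() == 2) then
      ! image 2 polls its local flag without a blocking synchronization
      do
         call atomic_ref(val, flag)
         if (val == 1) exit
      end do
   end if
end program atomic_flag_sketch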

Two years ago, @rouson made a statement here: Learning coarrays, collective subroutines and other parallel features of Modern Fortran - #29 by rouson

From a plain Coarray Fortran language perspective this is certainly a valid statement: “Let the compiler do its job.” (this is a quote from his quote).

So far I have used a (single) atomic coarray to implement customized non-blocking and lightweight synchronization to allow for (single-image) asynchronous coroutines in Fortran (still experimental): GitHub - MichaelSiehl/Spatial_Fortran_1.

I am now experimenting with COLLECTIVE SUBROUTINES myself and was soon able to achieve results similar to those of my coarray approach described in the GitHub repository above. (It requires the use of three different kernel types: a single-task kernel, a multi-task kernel, and an all-image kernel to allow execution of co_broadcast, which must always run on all coarray images.) It works (on a CPU), but on the receiving side there is certainly an implicit synchronization for this to work. I have not tried it out yet, but this should rule out (single-image) asynchronous execution. If my assumption is correct, then the statement made two years ago by @rouson could turn out to be no longer valid, because single-image asynchronous code execution (the customized coarray approach) could well be a key to keeping coarray images busier than collective subroutines with an implicit synchronization (on the receiving side) can.
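
For readers unfamiliar with the collective subroutines, here is a minimal sketch of co_broadcast (this is just the intrinsic being discussed, not my kernel code): every image must execute the call, and the payload from the source image is defined on each image once its call returns.

program co_broadcast_sketch
   implicit none
   integer :: payload(4)

   payload = 0
   if (this_image() == 1) payload = [1, 2, 3, 4]

   ! collective call: must be executed by all images
   call co_broadcast(payload, source_image=1)

   print '(a,i0,a,4i3)', 'image ', this_image(), ': ', payload
end program co_broadcast_sketch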

Of course, all this is ongoing work and preliminary. I think it is just too early to make any valid evaluation yet.

4 Likes

I think coarrays have a great potential for the future, but they are unfortunately not ready yet for the kinds of production codes that LANL has (Coarrays: Not ready for prime time).

I must strongly object to how the paper is being criticized by many in this thread. We lose all credibility if we only address the way the paper was written and do not take the actual problems with Fortran seriously.

Is it not our goal to try to convince big players to keep using Fortran? The answer should be yes!

So let’s address the actual issues that they have.

Just read page 6 of the report to get started on the serious issues that Fortran has; I just picked a few:

  • As a specific example, efforts to port xRAGE to the PGI Fortran compiler (now rebranded as the nvfortran compiler) have been ongoing since April of 2019 and xRAGE is still unable to be built and run with this compiler. It is important to note that nvfortran still does not support the Fortran 2008 standard, a full 15 years after its ratification by the standards committee.

  • Multiple competing standards for Fortran-based GPU programming with varied levels of robustness and support exist today (Fortran OpenMP Target offload and OpenACC). Neither of these technologies is robustly supported on the AMD GPU (MI250) today.

Do we all agree that these are real issues that we need to fix, or at least provide a plan how they will be fixed and when?

3 Likes

First, it’s not that bad, there are many cases of array syntax that are well handled by the optimizers of the usual compilers.

And second, I don’t think it’s ridiculous at all if explicit loops perform better in complicated cases. For instance, in b = matmul(transpose(a),a) it’s actually better not to transpose a at all to compute the result: do you expect the compiler to see that? OK, maybe it could in this case, which looks quite simple. But there are endless combinations you can write with array syntax, and it can get terribly complex to analyse: where is the limit?
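
To make the example concrete, here is one possible explicit-loop version (my own sketch, not claimed to be optimal) that computes b = matmul(transpose(a), a) without ever forming transpose(a): since b(i,j) is the dot product of columns i and j of a, both operands are read with stride 1.

subroutine ata(a, b)
   implicit none
   real, intent(in)  :: a(:,:)
   real, intent(out) :: b(:,:)   ! shape (size(a,2), size(a,2))
   integer :: i, j

   do j = 1, size(a, 2)
      do i = 1, size(a, 2)
         ! columns of a are contiguous, so both accesses are stride-1;
         ! the symmetry of b could be exploited to halve the work
         b(i, j) = dot_product(a(:, i), a(:, j))
      end do
   end do
end subroutine ata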

In parts of the code that are not critical for the runtime I use array syntax as much as possible. But in critical parts I do not hesitate to revert to explicit loops if I suspect that the compilers will have trouble correctly optimizing a more compact form.

Another example: a subroutine that takes an assumed-shape array argument. In the calling routine the actual argument is an array section with a stride different from 1. The compiler can pass it as is thanks to the assumed shape, but maybe it would be more efficient to make a copy. It all depends on the computations in the routine: what should the compiler do?
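
A minimal sketch of that situation (names are made up for illustration): the dummy is assumed shape, the actual argument is a section with stride 4, and the compiler has to choose between passing the strided descriptor as is and making a contiguous copy-in/copy-out.

module strided_demo
   implicit none
contains
   subroutine scale_vec(x)
      real, intent(inout) :: x(:)   ! assumed shape: the stride travels in the descriptor
      x = 2.0 * x
   end subroutine scale_vec
end module strided_demo

program main
   use strided_demo
   implicit none
   real :: a(100)
   a = 1.0
   ! non-unit stride: pass the descriptor as is, or copy-in/copy-out?
   call scale_vec(a(1:100:4))
   print *, a(1), a(2)   ! 2.0 and 1.0
end program main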

2 Likes

This is an excellent example of where a programmer could specify shallow_transpose(a) as the argument, and expect the result to be computed correctly without the unnecessary memory allocation for the transpose(). At least it gives the programmer some control over the process instead of relying entirely on the compiler to recognize what is obvious to the programmer, but perhaps obscure to the compiler.

A while ago I tested this transpose() actual-argument situation with a few compilers. The subprogram had an intent(in) dummy argument declaration for that argument. I examined c_loc(a(1,1)) of the actual and dummy arguments to see whether a temporary copy had been made. To my surprise at the time, there were compilers that, with the appropriate optimization flags, recognized this situation, and the two addresses were identical. Then a few years later, after some compiler updates, I redid the same tests and found no compilers that got it right. One step forward, two steps back. That is when I decided that an explicit shallow_transpose() function that the user could invoke was best.
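
A rough reconstruction of that test (my own sketch, not the original code): pass transpose(a) to an assumed-shape intent(in) dummy and compare the address of the first element inside and outside the call; if the addresses differ, a temporary copy was made.

module addr_check
   use, intrinsic :: iso_c_binding, only: c_ptr, c_loc
   implicit none
contains
   subroutine first_element_address(a, addr)
      real, intent(in), target :: a(:,:)
      type(c_ptr), intent(out) :: addr
      addr = c_loc(a(1,1))
   end subroutine first_element_address
end module addr_check

program copy_probe
   use, intrinsic :: iso_c_binding, only: c_ptr, c_loc, c_associated
   use addr_check
   implicit none
   real, target :: a(3,3)
   type(c_ptr) :: p_actual, p_dummy

   a = 1.0
   p_actual = c_loc(a(1,1))
   call first_element_address(transpose(a), p_dummy)
   ! identical addresses => the compiler avoided the temporary copy
   print *, 'copy avoided: ', c_associated(p_actual, p_dummy)
end program copy_probe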

Of course, within the subprogram, it might still be optimal to make a copy of the argument, or, as is typical of matrix operations in linear algebra, to make local copies of subblocks of the input matrices. This ensures data locality, good use of registers, good cache use, good use of vector hardware, good use of gpu accelerators, and so on. The programmer writing portable code cannot possibly know the optimal choice for these low-level optimizations since they depend on hardware characteristics that cannot be queried by standard fortran intrinsics, so the compiler must step in at some point and do the right thing. But I think a shallow_transpose() function is a good step in the right direction.

2 Likes

Concerning this particular point, here are some humble comments from a non-expert user of both Fortran and MATLAB, a mediocre computational mathematician with modest experience in numerical computation.

In MATLAB, any loop that can be replaced with an array operation should be replaced; otherwise, it is considered a coding error.

In other words, what MATLAB encourages is “give me the array (i.e., matrix-vector) formulation of your computation, I will do everything else, and they will be done in the best way. This includes optimizing the low-level computation, deciding whether copies are needed, etc, everything.”

How MATLAB does everything under the hood, I don’t know, but I do know that a significant part of it depends on Fortran (for the moment).

Does MATLAB do everything well? I guess, and observe, that it does in most cases. Otherwise, MATLAB would not be able to survive.

At least, MATLAB never vomits a segfault however large the matrix you ask it to handle (it does tell you that your matrix is too large if its size is, say, 10^10). It does not need you to provide any option — everything works in an optimal way by default unless your hardware is peculiar (so that most users do not even know that MATLAB has options …). It discourages handcrafting your own “optimized loops”, because it believes that doing so is the job of its engineers, and people should let the professionals do the professional things.

However, the most important thing is not that MATLAB does it well. The important thing is the willingness to do so, the belief that matrix-vector operations are the correct way to formulate numerical algorithms, and the vision that algorithms should be coded in a way as close as possible to their mathematical formulations. I see similar things in other languages that are popular in scientific computing, e.g., Python and Julia.

If you find that MATLAB is not optimizing some array operations well enough, it will be considered a bug of MATLAB, and it will be fixed, as has been happening since last century. As long as the direction is correct and you are working in that direction, it is OK if you are not doing everything perfectly for the time being (or ever).

Reassuringly, I see @JeffH and @certik hold similar opinions on array operations in Fortran. So my understanding of things is not completely foolish.

I have made similar comments on this forum a few times, and I know what I will hear. Thanks in advance. Again, I am no expert on Fortran or MATLAB, and my understanding of numerical computation is superficial, so please forgive my ignorance.

2 Likes

I don’t think anyone denies the weaknesses and limitations. It is a debate about the least costly solution to fix these weaknesses. If a rocketship is complete and ready but the launch pad is broken, one does not destroy and rebuild the whole ship to fix the broken launch pad.
For over a decade, I have seen the C++ programmers and community bullying Fortran in formal conferences and events. For over two years, 2019-2021, I witnessed a massive misinformation campaign against Fortran by JuliaComputing (Julia 1000X faster than Fortran (false), Julia faster than SciPy because of SciPy’s use of Fortran (false), Fortran lacks inlining and loop unrolling (false), …). Such actions do impact the prospects for Fortran training in academia. Years of bullying and misinformation campaigns are slowly paying off. The solution to the lack of talent to maintain a working rocketship is not to build a new rocket but to train talented maintainers. That’s my perspective on this discussion.

4 Likes

I believe it is perfectly reasonable to expect that intrinsic functions offer performance no worse than a manual loop, provided the compiler is made aware of its target hardware. Gfortran, ifort, and ifx have such options with -march=native or -xHost, and I have no doubt other compilers such as flang and nvfortran have similar functionality. If any simple array operation like addition, subtraction, multiplication, division, or exponentiation were faster as an explicit do loop rather than a = a {operator} C, where C is a constant, I would consider that a compiler bug. I think it is fair to extend this to matmul as well in cases where array sizes and shapes are known at compile time.
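
As a deliberately trivial illustration of the two forms (my own example, not a benchmark), a compiler given -march=native or -xHost should generate equally good code for both:

program scale_compare
   implicit none
   integer, parameter :: n = 100000
   real, parameter :: c = 1.5
   real :: a(n)
   integer :: i

   a = 1.0

   a = a * c            ! array syntax

   do i = 1, n          ! equivalent explicit loop
      a(i) = a(i) * c
   end do

   print *, a(1)
end program scale_compare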

Well, I am building a new rocket anyway. :slight_smile:

4 Likes

You are building a new modern, robust launch pad for the existing rockets, which I and everyone appreciate.

2 Likes

For this particular example I don’t think so. If matmul(b,a) is correctly implemented (e.g. as a wrapper to some optimized lapack implementation) it will perform the computation trying to be cache friendly for both b and a, with the usual knowledge that they are stored column-major. If b is shallow_transpose(a), then it is effectively row-major, and matmul will have a very bad memory access pattern on it. Unless of course row-major storage is introduced in the standard as an alternative to column-major for all arrays.

But anyway it was just an example, one can find tons of array expressions which are unlikely to be perfectly optimized by the compilers.

About matmul() specifically, maybe it could have options similar to lapack routines for matrix/matrix multiplication (first argument should be transposed, etc…) but again where is the limit? lapack has many different routines for different cases (the input matrices are like this, the output matrix is like that…).

Matlab is an entirely different animal: it is basically an interpreted language, and consequently explicit loops have terrible performance. So yes, in Matlab, loops (at least inner loops over elements) must be avoided whenever possible.

In compiled languages (whether Fortran, C, …), in contrast, and with a decent compiler, appropriate explicit loops can potentially get you close to the peak performance of your hardware.

So in Matlab you have no real choice: it’s array operations or guaranteed terrible performance. In Fortran you have the choice. You can rely entirely on array syntax for simplicity, but you can also revert to explicit loops when you want to ensure optimal performance on complex expressions.

This reminds me of the debate around goto: “Using gotos is always a coding error!”… No, it’s just that most of the time there are better ways than using gotos. But in some cases, a goto can still be a reasonable solution.

4 Likes

I confess to no longer being a Fortran user, but over the years I have heard many folks diss Fortran for not being ‘modern’. But for anyone working in high-level math or physics, Fortran is a lot easier to learn than the math! Scientists do science … learn Fortran … and off they go coding. FORmula TRANslation – geddit? Fortran has been dissed by computer “professionals” for many years because it keeps them out of the loop. The model seen in non-scientific applications of computers, where domain specialists devolve programming to computer ‘experts’, is far from optimal and creates much Brownian futzing. Witness Fortran’s (and Cobol’s) long life and the constant popping up of low-code/no-code offerings and “4th generation” languages (remember them?) that put the computer in the hands of the domain specialist.

3 Likes

Of course, matmul() in this case would need to recognize that the argument was transposed and then map that to the appropriate BLAS argument or otherwise account for it explicitly. I would only point out one quirk of the BLAS library: one of the spacing arguments, either row or column, must be contiguous. If your fortran array does not have that property, then it cannot be mapped to a level-2 or level-3 BLAS routine. To give an example,

REAL :: A(n1,n2,n3,n4), B(m1,m2,n4,n5), C(n3,n5)
...
C = matmul( A(i,j,:,:), B(k,l,:,:) )

The matmul() reference is perfectly fine, but it cannot be mapped to DGEMM(). This is a limitation of the DGEMM() interface, not of fortran, so if fortran is expected to handle these cases efficiently, some additional work needs to be done within the fortran math library.

BTW, you point out another use case for shallow_transpose(). It could be used to map more or less directly from fortran arrays to row-major arrays in other languages such as C.

[edit] I just realized I should have said SGEMM() rather than DGEMM above. The matmul() intrinsic is generic, of course, but the BLAS routines are not.

1 Like

It’s not difficult to write an extended DGEMM that handles non-contiguous arrays, but what should the compiler do when such cases are encountered? Work in place with the extended version and a lot of cache misses? Or create temporary arrays if the copy overhead is lower than the gain in terms of locality? There is no unique answer, because the temporary-array solution may be faster, but maybe the developer would prefer to keep the memory usage as low as possible for whatever reason. And such a decision requires the compiler to be able to weigh the cost in terms of cache misses against the cost of the copies: for a matrix multiplication it’s not too difficult to decide on some threshold, but for more general routines/algorithms it can get quite difficult.

In HPC, as of today, developers still have to understand what happens behind the code they write, and cannot entirely rely on abstractions, hoping the compilers will understand what they really want.

3 Likes

This is another way to effect a shallow transpose, but specifically for that one intrinsic function matmul(). There could be optional arguments added for this case like this:

function matmul( A, B, transA, transB ) result(C)
   real, intent(in) :: A(:,:), B(:,:)
   logical, intent(in), optional :: transA, transB
   real, allocatable :: C(:,:)
end function matmul

These are the only two matrix arguments for this function, so two optional arguments are sufficient to cover all possible cases (four). If you want to return a transposed result, this can be done by switching the order of the two matrix arguments: for example, C^T = A.B is the same as C = B^T.A^T. So effectively there are eight possible cases that are covered.
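
To make the dispatch concrete, here is a runnable user-level sketch (a hypothetical helper I call matmul_t here, not the proposed intrinsic itself): it only demonstrates the interface, since it still pays for the transpose() copies that an intrinsic implementation could avoid, e.g. by passing swapped strides to the underlying GEMM.

module matmul_trans
   implicit none
contains
   function matmul_t(A, B, transA, transB) result(C)
      real, intent(in) :: A(:,:), B(:,:)
      logical, intent(in), optional :: transA, transB
      real, allocatable :: C(:,:)
      logical :: ta, tb

      ta = .false.; if (present(transA)) ta = transA
      tb = .false.; if (present(transB)) tb = transB

      ! dispatch over the four transA/transB combinations
      if (ta .and. tb) then
         C = matmul(transpose(A), transpose(B))
      else if (ta) then
         C = matmul(transpose(A), B)
      else if (tb) then
         C = matmul(A, transpose(B))
      else
         C = matmul(A, B)
      end if
   end function matmul_t
end module matmul_trans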

However, a separate shallow_transpose() function would be useful everywhere, not just for the matmul() case, in general expressions, as actual arguments, etc. And a side benefit is that it would make interfacing to row-major languages easier.

This matmul design you’re talking about really looks like an ugly hack to work around not having multiple dispatch. With multiple dispatch, all of these would just be A*B.

1 Like

@RonShepard blas/lapack have more cases for matrix-matrix multiplication: are the input or output matrices symmetric/hermitian? Upper or lower triangular? Tridiagonal? Or beyond blas/lapack: Toeplitz? Banded?… There are infinite combinations… Matrices from practical problems often have special properties, and these properties can often be used to reduce the amount of computation. Should the standard take care of all of this? Or just provide a generic matmul() that can be used for simplicity when performance is not critical?

I don’t get what you mean here. Apart from the fact that * is element-wise multiplication and not matrix multiplication, in A*B how do you specify, for instance, that A should be transposed?

I would like to bring attention to the fact that it is possible to define overloaded operators for basic matrix operations, similar to the functionality provided by the IMSL linear_operators library (https://www.irya.unam.mx/computo/sites/manuales/imsl/F9040.pdf page 155). While performance may not be the primary concern, as in Matlab, we can perform matrix multiplication using the .x. operator, such as C = A .x. B.

Here is a user-friendly usage:
S11 = .i. ( R + ( tP .x. iQ ) .x. P ) .x. ( R - ( tP .x. iQ ) );
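
For anyone who wants to try this without IMSL, a minimal toy implementation of such an operator (my own sketch built on matmul, not the IMSL code) could look like this:

module linear_ops
   implicit none
   interface operator(.x.)
      module procedure mat_mul
   end interface
contains
   function mat_mul(a, b) result(c)
      real, intent(in) :: a(:,:), b(:,:)
      real :: c(size(a,1), size(b,2))
      c = matmul(a, b)
   end function mat_mul
end module linear_ops

program demo
   use linear_ops
   implicit none
   real :: a(2,3), b(3,2), c(2,2)
   a = 1.0
   b = 2.0
   c = a .x. b      ! each entry is 3 * 1.0 * 2.0 = 6.0
   print *, c
end program demo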