LFortran now supports all intrinsic functions

This is very useful, thanks for sharing.

Thanks @hkvzjal. Where is the fsum_pair mentioned there? I can’t find it. I think Julia also uses pairwise summation by default.

Yes, there are many sum algorithms. How should the Fortran compiler choose? Probably a combination of heuristics, compiler options and pragmas.

@certik, the pairwise sum, along with other summation methods, was discussed here last year (May). Several of us posted code.

Sorry, I deleted it but forgot about that comment; in the end I found it to be slower, with precision similar to the alternatives, especially compared to the chunked Kahan version. It was indeed inspired by the thread mentioned by @rwmsu.

I could imagine that the choices could be simplified to: (1) What is the optimal chunk size for a given kind? (This depends on the register size of the hardware.) (2) Do a plain sequential sum if the size of the array is < chunk_size, or the full chunked sum otherwise. (3) With a flag, the user could trade a little speed for even more precision, e.g. with something like the Kahan version, which a compiler developer could perhaps optimize to reduce the compromise in speed even further.
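A minimal sketch of steps (1) and (2), assuming an illustrative chunk size of 16 (a real implementation would tune this to the hardware's vector registers):

```fortran
! Sketch of a chunked sum for default reals; CHUNK = 16 is an
! illustrative value, not a tuned one.
pure function chunked_sum(x) result(s)
   implicit none
   real, intent(in) :: x(:)
   real :: s
   integer, parameter :: chunk = 16
   real :: acc(chunk)
   integer :: n, i, r
   n = size(x)
   if (n < chunk) then
      s = sum(x)                      ! plain sequential sum for small arrays
      return
   end if
   acc = 0.0
   r = chunk * (n / chunk)            ! largest multiple of chunk <= n
   do i = 1, r, chunk
      acc = acc + x(i:i+chunk-1)      ! vectorizable partial sums
   end do
   s = sum(acc) + sum(x(r+1:n))       ! fold sub-accumulators + remainder
end function chunked_sum
```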

I think one needs to be careful here with the summation of 32-bit reals. The ‘computational’ sum (sum = 1.6777216E+07) is in fact correct given the limitations of 32-bit arithmetic. This applies to C, C++, Fortran, etc. for 32-bit reals. Jane and I reported the issue to Intel regarding their sum (sum = 4.9997040E+07), and they said that a customer had paid for this behavior, so they return this result - the Intel sum uses higher precision during the calculation. NAG, gfortran, Cray and IBM all return the ‘correct’ computational sum.
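The 1.6777216E+07 figure is 2**24, the point at which binary32 can no longer represent consecutive integers, so adding 1.0 no longer changes the accumulator. A small demonstration:

```fortran
! Why an intrinsic 32-bit sum can get stuck at 1.6777216E+07:
! at 2**24 the spacing between consecutive binary32 values
! reaches 2, so x + 1.0 rounds back to x.
program saturate
   implicit none
   real :: x
   x = 16777216.0           ! 2**24
   print *, x + 1.0 == x    ! T: 16777217 is not representable
   print *, spacing(x)      ! 2.0: gap between consecutive values here
end program saturate
```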

“Correct”? I think that term should be reserved for the sum or dot-product that is the correctly rounded value from the infinite-precision computation, from a Kulisch long-accumulator or similar.

Everything else compromises on accuracy to provide low latency.

There is a sweet spot that increases latency (relative to the minimum) by something like 10-20% and takes the error almost down to higher-precision levels (from 7 correct digits to 14). Directly using higher precision would ordinarily cost you an extra 100% of latency.

If you’re grabbing (millions of) data items from non-cached memory, a few extra flops to improve accuracy (by compensating and using more sub-accumulators) are not going to hurt; you are just sitting there waiting for the next data item to arrive at the arithmetic unit anyway.
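The compensated accumulation alluded to above can be sketched as the textbook Kahan algorithm (this is the generic scheme, not any particular compiler's implementation; note that aggressive optimization flags can elide the compensation step):

```fortran
! Kahan compensated summation: roughly the accuracy of a
! higher-precision accumulator at a modest extra flop count.
pure function kahan_sum(x) result(s)
   implicit none
   real, intent(in) :: x(:)
   real :: s, c, y, t
   integer :: i
   s = 0.0
   c = 0.0                  ! running compensation for lost low-order bits
   do i = 1, size(x)
      y = x(i) - c
      t = s + y
      c = (t - s) - y       ! algebraically zero; captures the rounding error
      s = t
   end do
end function kahan_sum
```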

I agree with this. Not only does this “correct” return value not agree with exact arithmetic, it does not necessarily agree with floating-point arithmetic when the items are summed in all C_N = (1/(N+1)) * (2N choose N) possible orders. This was discussed in a little more detail in the previous summation thread. This is ultimately because floating-point addition is commutative but not associative, while exact arithmetic is both (see Associative property - Wikipedia). The computed result also depends on the rounding modes of the various intermediate results, making the number of possible return values even larger than the Catalan numbers.
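A two-line demonstration of the non-associativity:

```fortran
! Floating-point addition is commutative but not associative:
! the same three values summed in two groupings give different results.
program assoc
   implicit none
   real :: a, b, c
   a = 1.0e8; b = -1.0e8; c = 1.0
   print *, (a + b) + c    ! 1.0: the large terms cancel first
   print *, a + (b + c)    ! 0.0: c is absorbed by the large b
end program assoc
```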

In retrospect, calling the Fortran intrinsics SOME_SUM, SOME_PRODUCT, SOME_NORM2, &c. could have done more for education than thousands of lecture hours.

The mathematical ‘correctness’ of the intrinsic summation results for 32-bit reals depends on the number of data items:

10,000,000

Intrinsic summation 4998846.5000000000
Kahan summation 4999268.5000000000

100,000,000

Intrinsic summation 16777216.0000000000
Kahan summation 50000388.0000000000

1,000,000,000

Intrinsic summation 16777216.0000000000
Kahan summation 499987104.0000000000

The NAG, gfortran, Cray and IBM compilers will generate similar results to those shown above.

Perhaps “computationally consistent” would have been a better term than “computationally correct.”

There are always compromises and tradeoffs when evaluating mathematical functions in floating-point arithmetic. As discussed previously in the summation thread that was referenced, a Fortran compiler could give a fast result with larger errors, and then programmers who need more accuracy would be expected to write their own code. Or a Fortran compiler could give the most accurate result possible, and programmers who need a faster result would be expected to write their own code. In addition to accuracy compared to the mathematically exact result, there are also issues with function evaluation involving avoiding unnecessary overflows, underflows, divide by zero, sqrt of negative numbers, stack overflow, and so on. In all these cases, the compiler must make some choice, and then the programmer would be expected to write his own code if he wants different behavior of some kind: more robust, higher accuracy, faster, etc.

At present, for the most part the Fortran standard leaves those decisions up to the compiler for most functions. The hypot() intrinsic is an exception to this, where the standard actually mentions avoiding unnecessary underflow and overflow. Maybe there are some others like that, but the underlying algorithms for most functions, such as sum() being discussed here, are not specified by the standard. Perhaps ironically, that puts even more burden on the programmer using those intrinsics. If a programmer needs, for example, an accurate sum() function, then he must write his own, regardless of what his current compiler does, because he might need to port his code to another compiler that has made different choices, or he must protect his code from future versions of that compiler that might make different choices. The only time that a programmer can use an intrinsic is when nothing is critical: not the speed, not the accuracy, not the possibility of a stack overflow, not the possibility of underflow or overflow, nothing. If any of these features is critical, then a cautious programmer must write his own version of that function.

Perhaps a better approach would be for the fortran standard to specify the algorithms in some detail along with the various features and a discussion of the tradeoffs. Then all programmers could rely on those functions for situations where those tradeoffs match the code requirements, and they could know exactly when they need to write their own codes for any standard-conforming compiler.

Then, once the algorithms are described in detail for the intrinsic functions within the standard, multiple algorithm choices could be specified in a portable way with, e.g., optional arguments. We could have a sum(x) that uses the default algorithm, a sum(x,method='kahan') that uses compensation, a sum(x,method='enhanced_precision') for an enhanced-precision accumulation, a sum(x,method='recursive_pairwise') for that algorithm, and so on.
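A hypothetical interface for such a selectable sum; the method names are the illustrative ones from the paragraph above, not part of any standard, and the kahan_sum and pairwise_sum helpers are assumed to be routines of the kind posted earlier in this thread:

```fortran
! Hypothetical sketch: a sum with a selectable algorithm.
! Neither the name nor the 'method' argument exists in any compiler.
function sum_with_method(x, method) result(s)
   implicit none
   real, intent(in) :: x(:)
   character(len=*), intent(in), optional :: method
   real :: s
   character(len=20) :: m
   m = 'default'
   if (present(method)) m = method
   select case (trim(m))
   case ('kahan')
      s = kahan_sum(x)       ! assumed helper: compensated accumulation
   case ('recursive_pairwise')
      s = pairwise_sum(x)    ! assumed helper: divide-and-conquer sum
   case default
      s = sum(x)             ! whatever the compiler's intrinsic provides
   end select
end function sum_with_method
```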

With that kind of approach the programmer has a choice among several algorithms, and, to the extent possible, they would all be computationally consistent from compiler to compiler.

DIY procedures that perform “intrinsic” functions have a long history. I remember IBM, CDC and Cray manuals showing implementation descriptions, and projects containing DIY procedures because many compilers did not even implement “basic” functions, or for the reasons stated here. I unfortunately lost most of my collection. It included even basic procedures like the trig functions, implemented for speed, accuracy, portability, and so on.

So it seems like we should be collecting intrinsic implementations and characterizing their performance and accuracy. There is mention here of a recent change in LFortran to implement intrinsics in Fortran. And in another concurrent thread there is a discussion about a new implementation of a compiler test suite.

That surely seems to all complement one another. The compiler tester could include Fortran implementations of intrinsics that could be used along with intrinsic benchmarks and regression tests. Perhaps the LFortran Fortran intrinsic versions could be used to seed that, along with packaging up the interesting discussions of the sum function that occurred on this very Discourse site. The intrinsic user documentation could be thrown into the mix.

Part of the compiler test would then be a complete resource for the intrinsics: regression tests and benchmarks, various Fortran implementations, and discussions about the intrinsics in a more formal format. The sum() discussions give a good guideline, providing the basis for various implementations, discussions of the pros and cons of the approaches, and comparisons to current compiler implementations. Add in the user-level descriptions and you have a home for all things intrinsic that fits nicely with the compiler tester.

If there is interest from the Compiler Test project, are the LFortran intrinsic Fortran versions of the procedures available as a seed for the DIY intrinsic examples/baselines?

Do you think this “sweet spot” should be the default in Fortran compilers? Then maybe with a fast-math equivalent it would use the fastest (but sometimes less accurate) method.

Could the method used to compute sum be specified by a compiler option?

That approach is inflexible. Suppose you have a single subroutine in which you want to compute the sum in several different ways. The programmer somehow must be able to select the algorithm within the source code. My example above uses an optional argument, but there are other ways too, such as a mode option that can be set, and reset, by the program.

Yes, the argument is the most flexible, but also incompatible with other compilers.

This is, largely, a matter of taste, informed by the anticipated needs of the users of a particular compiler (latency, accuracy, auto-parallelism, reproducibility, compatibility, certification).
I can imagine scenarios where people need to know, and modify, what algorithm was used for any particular invocation of any intrinsic with inexact result. That’s a whole runtime subsystem that needs designing, with uses beyond Fortran. Until such a system can be tested, I think the Fortran committees should avoid prescribing additional syntax for users.

Thanks @themos. I think it’s clear we have to have multiple implementations of most intrinsics, and the only questions left are how the user should select one and what the default should be.

Rather than a “sweet spot,” which implies some kind of optimum, I rather think of this as diminishing returns. It is sometimes easy to speed up an initial implementation by a factor of two or three. After that, it then becomes difficult to find the next, say 50% improvement. And then after that, to find even a 5% improvement requires too much time. So somewhere in this process, the programmer just decides that it requires too much effort (time, money, etc.) to look for further improvements, and he spends his limited time on other more productive tasks.

I had a supervisor who used to say that any program can be improved by eliminating one instruction. Of course, that cannot be literally true, or every program could be implemented in zero instructions, but in a practical sense it is almost true; it is just a matter of how much time you want to spend finding that instruction.

There is actually no “good answer” to that, as long as the standard doesn’t specify any implementation detail.

As a user who tries to write code that behaves “the same” whichever compiler is used, I rely on the sum intrinsic (or the dot_product intrinsic, or …) only in the cases where I know that even the straight naive summation will do. Otherwise I write my own routines (and I do that very often, to perform dot products on 10^10 elements or more). Even if at a given moment I am using a compiler that claims to have the best summation algorithm in the West, I won’t use it, because I don’t want the code to be locked to this specific compiler.

EDIT: at the moment, specific algorithms (pairwise, Kahan…) should be in stdlib IMHO
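For reference, a recursive pairwise summation of the kind that could live in stdlib; the base-case cutoff of 64 is illustrative, and real implementations tune it so the leaves still vectorize:

```fortran
! Recursive pairwise summation: rounding error grows like
! O(log n) with the array size rather than O(n) for the naive loop.
pure recursive function pairwise_sum(x) result(s)
   implicit none
   real, intent(in) :: x(:)
   real :: s
   integer, parameter :: base = 64   ! illustrative cutoff, not tuned
   integer :: n, m
   n = size(x)
   if (n <= base) then
      s = sum(x)                     ! small leaves: plain sequential sum
   else
      m = n / 2
      s = pairwise_sum(x(:m)) + pairwise_sum(x(m+1:))
   end if
end function pairwise_sum
```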

Based on the discussions, perhaps the right approach would be for stdlib to have 3 different modules that include implementations of the intrinsics, i.e.

  • stdlib_intrinsics_fast
  • stdlib_intrinsics_accurate
  • stdlib_intrinsics_balanced

That way, users who want or need some direction beyond just relying on their compiler’s implementation have a common set to draw from, rather than re-implementing their own.
