Can one design coding rules to follow so that `-ffast-math` is safe?

It is well known that -ffast-math in GFortran (and other similar options in other compilers) that do not follow the IEEE standard exactly can change numerical results, including even making some codes to produce completely incorrect results, the most famous example is probably the Kahan summation algorithm.

Q: Is -ffast-math always safe?
A: No, it can make your code return wrong answers, such as the Kahan algorithm.

Q: Can one design rules to follow in your code, so that -ffast-math is safe, and define exactly what you mean by “safe”?
A: I think the answer might be yes, and that is what this post is about. I have made some progress towards this, but need help getting it over the finish line. If you are interested to help out, you can do so below.

Q: What is the point of trying to design such rules if we can simply turn off “-ffast-math” and be safe?
A: Turning off “-ffast-math” is definitely a valid approach. However, often the code might not be as performing. One can recover performance by being intentional about optimizations, for example by using “fast-math” to discover which optimizations might help, and then applying them explicitly, and testing the code that it still works with them (or applying such optimizations “by hand” in the source code itself). The alternative approach being proposed here is to program with a set of rules, that then give the compiler greater freedom to rearrange the computation to improve performance.


Some requirements:

  • Such “fast-math” rules, if they exist, must say that Kahan algorithm does not satisfy them (and thus such code might not work with “fast-math”).
  • The rules are for the Fortran programmer to follow to ensure enabling “fast-math” in the compiler will still produce correct results; as well as for compilers to know what they can and cannot do. The idea of these rules is that they give compilers more freedom to rearrange the computation.

About 4 years ago I wrote this document:

That tries to come up with such rules and then shows on many examples how to apply such rules. Let me know what you think and if you have ideas how to make the rules more rigorous.

In particular, I am aware that when I say “well conditioned” in the document, it might not always be exactly equal to the condition number from numerical analysis, although it seems related.

5 Likes

I think trying to make -ffast-math safe is the wrong approach. What you want instead is lexically scoped compiler directives. The advantage of this approach is that it makes it easy to combine some code that would benefit from compiler tricks with other code that is relying on precise IEEE guarantees.

I agree you want to be able to optionally “localize” compiler optimizations to a specific subroutine or a loop.

I think the main idea of what you are proposing is to specifically choose which optimizations to apply by the programmer, such as simplifying ((sum + y) - sum) - y to 0 in some specific line, but not allowing this “optimization” in another line, such as the Kahan summation subroutine.

That seems like a valid approach. But if there is a way to write the code to give the compiler the freedom to rearrange the code (within limits), that seems very much worth it for me. It would be optional, so you don’t have to use it if you don’t like it.

I would love that. In numerical code, where things are approximated and floating point operations are not exact, being fault tolerant is essential in most cases and fast math heuristics just fine. In some cases small random perturbations are applied to even avoid being exact and having to deal with expensive special cases too often. Relying on the last bit to be exact generally hints at some design flaw in the algorithm. In general!

I have encountered one really bad exception. Essentially it was computing a neighbourhood graph of a 3d point cloud with some kdtree, which was required to be symmetric. Single threaded just compute (x(1)-y(1))^2 + (x(2)-y(2))^2 + (x(3)-y(3))^2 and compare it with some distance^2 threshold. Then put x as neighbour into the list of y and vice versa. In parallel you do not want to do that, as this generates more locking overhead then it saves computation time. Without fast-math you can rely on (y(1)-x(1))^2 + (y(2)-x(2))^2 + (y(3)-x(3))^2 being bitwise the same as the norm above. However, when we went from avx to avx2/avx512, the compiler started (using the wider ymm/zmm registers) to do some loop unrolling to compute these norms for several pairs of points tightly packed in several vector registers in parallel. This changed associativity and resulted in some really rare and difficult to find errors due to a non-symmetric graph.
Based on this experience I think it will be difficult even to judge where and when fast-math should not be applied. Having good tests and run-time assertions is essential if fast-math should be used. If it is even worth the extra small amount of performance gain.

I would prefer to have better understand when -ffast-math is unsafe, ie have better documentation of when -ffast-math should not been used.
While I have read warnings from others of the problems with -ffast-math, I have never experienced these problems in a practical calculation using 64-bit AVX. Rather, it is identified only in a very contrived calculation, that does not relate to when I am looking for improved performance.

Although not exhaustive, my review of some examples of when the Kahan sum may fail using -ffast-math, the magnitude of error identified is less that the accuracy estimate, based on the maximum error estimate of the values being summed. Basically my impression of Kahn sum is if comparing the accuracy of a higher precision accumulator, this ignores the inaccuracy estimate of the value dble(x(i)), expecially for max(x(i)).

Can someone please better explain how -ffast-math can fail in a practical calculation, rather than a contrived example.

My example above is a first-hand experience of a very real simple problem (except that you need avx2, or maybe it suffices to use single precision and avx, have not tried that back when we had this bug). Due to parallelisation we did rely on ||x-y||^2 being symmetric to ||y-x||^2, instead of enforcing symmetry directly. As the code evolved over some years and some iteration steps, this problem was not spotted.

Martin, you probably know a lot more than me about the use of registers, but my personal experience is I struggle to understand what is happening when things don’t go right.

At the moment I am finding Windows 21H1 updates are playing havoc with my multi-threading on Ryzen 5900X. I do know the performance has deteriated significantly, which I suspect may be due to poor thread allocation to “logical processors”, especially when preferencing “hyper-threading” over seperate cores. (essentially primary and secondary “logical processors”?)

Similarly, when so many things are being optimised, it is difficult to identify -ffast-math as the cause.

I am impressed with your confidence when you blame “using the wider ymm/zmm registers”, but I struggle to identify the key reason for my apparent coding failures.

Sorry, then I misunderstood your question and thought you were interested in a high-level example. But I can explain to you the details (see below) of why fast-math failed here. It is essential to know that (x+y)+z /= x +(y+z) in floating point arithmetic, but x+y = y+x and (x-y)^2 = (y-x)^2 (and I hope I have not misread floating point arithmetic and fast-math optimisations, and these assumptions are actually wrong)

My confidence is from looking at the compiler assembler output via -S option flag. In some cases such as this, that is very convienent! What happened here was the following. There was a loop over a small set of points in a list, and for each point x in that list, the distance to a given point y was computed as

(x_i(1)-y(1))^2 + (x_i(2)-y(2))^2 + (x_i(3)-y(3))^2

each difference requires 64 bit in double precision. If your vector register is 256bit wide, you can pack 4 differences into one register and do the square for all four in one step. However, with three differences to square, one slot remains unused. For some reason, going from avx to avx2/512, the compiler decided to do loop unrolling. As far as I remember, the registers were loaded like:

ymm1: (x_1(1)-y(1)), (x_1(2)-y(2)), (x_1(3)-y(3)), (x_2(1)-y(1))
ymm2: (x_2(2)-y(2)), (x_2(3)-y(3)), (x_3(1)-y(1)), (x_3(2)-y(2))
ymm3: (x_3(3)-y(3)), (x_4(1)-y(1)), (x_4(2)-y(2)), (x_4(3)-y(3))

(I think at least 6, 8 or even 12 points were unrolled.)
Now if you try to add the squared components distributed like that, then you need to be careful about the order, which the compiler did not do due to fast-math, hence addition was done with different associativity depending on were each point occured in the list, and the final result was in some rare cases not symmetric anymore (hitting/exceeding the threshold distance). As far as I remember, the loop unrolling was not done without fast math. So there was real gain paid by a tiny amount of loss of precision.
(I also see that avx already had ymm registers, but for some reason like missing a crucial vector command or some such difference, the compiler did not do that optimisation with avx instead of avx2 option.)
I hope this helps to explain the problem, and also that I have not derailed the thread with this example. But I thought this is a good example how fast-math can break code really subtly and how difficult it can be to find and understand the problem.

See the document in my top post, in the section Examples: schools.rst · GitHub which has many worked out examples that fail, and it also shows how to fix it, both by not using “fast-math” as well as by using “fast-math” and reworking the code to produce the answer we want.

1 Like

I can’t explain the how, but I do have an example of failure. The large C++ code that I work with fails magnificently with -ffast-math, but no one has taken the time to figure out why.

To make progress, I can see two paths forward, both of which I think should be pursued:

  • one can try to work on the rules based on my document above, add more examples, try to find counter examples, make the rules more precise, etc.

  • start implementing these “IEEE unsafe” optimizations in LFortran, and for each of such rules implement several debugging tools, such as:

    • implement a “fuzzer”, which will apply such a rule (possibly in a random way), so that when you run your code in this “fuzzer” mode, it will effectively sample the answer from the space of possible “unsafe” optimizations. So just by running it a few times one should see right away the accuracy of the answer. Examples: summing an array in any order; simplification of expressions (((sum + y) - sum) - y0), etc.
    • compute the original code exactly as well as the transformed code, and compute an error, and communicate the error somehow, either giving a runtime warning if the error is larger than say 1 digit lost, or keep track of the error somehow.
    • Analyze each rule separately and see if it fits into our generalized rules from the document above

As an example, the expression ((sum + y) - sum) - y can be simplified to 0, and it will usually be very accurate, except if sum is large and y is small (I think), in which case the difference can be non-negligible. I don’t know if there is a way to do this to trigger at runtime robustly, for example for the Kahan summation algorithm. It will depend on the input values. So I can imagine having this on in “Debug” mode and test it on actual data that we care about, and see if it triggers. If it doesn’t, the idea would be that “fast-math” is safe then. If it triggers, then “fast-math” will be inaccurate and should not be used, or the algorithm should be reworked.

1 Like

What is the biggest speed up anyone has found on a code with -ffast-math over the safer -O2 or -O3? The average speedup on a set of codes would also be interesting. If the speedup is only 1.1 then maybe it’s not worth thinking about -ffast-math for most people, but if it can be 2.0 that is a different matter.

Often I can see 2x speedup, due to being able to use fma and better vectorization.