A piece of code that causes LLVM Flang to generate NaN/Inf randomly

N.B.:

  1. I am well aware that -ffast-math can cause un-mathematical results. Whatever the results might be, they should be deterministic.

  2. The behavior of the code is indeed quite complex, more complex than it may appear, varying across different combinations of the dimension, printing, compilation flags, and operating system. This is why I did not provide a minimal example; in fact, I was not able to, because the pattern was unclear to me.

1 Like

I would argue the precise opposite. Returning the same answer is likely to make users think that this is the “correct” answer. And before you know it, they will be demanding to get that same answer with other flags. Better to give them garbage, since that is what they asked for.

Interesting. Let’s see what other compiler developers and Fortran programmers think about this point.

In case it is not obvious, numbers like 7.0E-45 are denormal floating point values. They are smaller than tiny() for the real32 kind (the default real kind), but they can have a binary representation if gradual underflow is enabled. I think the compiler is allowed to set them to zero at compile time, and also the run time behavior can vary depending on which exceptions are enabled. Further, the reciprocal of a number like that will overflow, so depending on what exceptions are enabled, which floating point flags are set, and how the expression is evaluated, the result could be huge(), INF, or NaN.

If I were a numerical analyst, I might want a specific kind of behavior with numbers like this. If I’m an applications programmer, I probably don’t care much exactly what the results are, but I would want to know that something isn’t normal.

3 Likes

FWIW, on my mac (M1),

gfortran-15 -ffpe-trap=invalid -O2 test_div_flang.f90

gives results with no NaN / Inf, while

gfortran-15 -ffpe-trap=invalid -O2 -ffast-math test_div_flang.f90

gives the following runtime error:

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:
#0  0x100b6a103
#1  0x100b69083
#2  0x1947ad6a3
#3  0x1007a50e7
#4  0x1007a50e7
zsh: illegal hardware instruction  ./a.out

which seems to suggest that something like NaN or Inf was sent to some “instruction” (at the assembly level?) and detected somewhere. I wonder if a similar flag is available for LLVM flang also…? (I was not able to find such an option via flang --help).

In the case of flang, the “random” results of the program might be explained if such an “illegal” value was sent to some instruction and caused some strange behavior (e.g., memory corruption)…?

1 Like

One of the optimizations nuked the division; we never load anything into the descriptor: we see it allocate the space (56 bytes) and save the pointer in the descriptor, but never populate it with results before the call.

        movl    $56, %edi
        callq   malloc@PLT
        movq    %rax, %rbp
        movq    %rax, 440(%rsp)
        movq    $4, 448(%rsp)
        movq    %r12, 456(%rsp)
        movq    $1, 464(%rsp)
        movq    $14, 472(%rsp)
        movq    $4, 480(%rsp)
        leaq    440(%rsp), %rsi
        movq    %r13, %rdi
        callq   _FortranAioOutputDescriptor@PLT
3 Likes

As someone who has used -ffast-math in finite element calculations for many years, I disagree with your post.
When using floating point calculations, there is never a “correct answer”. We are asking for an answer with an acceptable round-off.
When this is not the case, there are many reasons for the answer being unacceptable, even when using -ffast-math. In all cases I have found the problem being a poor numerical modelling approach.
If the modelling approach is improved, then the errors due to -ffast-math are never significant.
I have only observed problems with -ffast-math when the floating point values are unusual as a result of a poor modelling approach.
I find this criticism of -ffast-math a gross exaggeration for practical usage.

In structural FE analysis, where localised excessive round-off can occur, there is typically sufficient redundancy in the structural model equations that these round-off errors are not significant. I don’t work in turbulence analysis, where a butterfly can change the results. Practical analysis of round-off in large systems of equations is very difficult to assess.

3 Likes

Yes, I’ve been using -ffast-math with great success also, and never had any issues. I started this thread here about it: Can one design coding rules to follow so that `-ffast-math` is safe?, and I link a document there about some rules to follow that make using -ffast-math “safe”.

1 Like

I admit it was an exaggeration and kind of click-bait. The point I was trying to get across is that unless you have evidence to the contrary, -ffast-math output should be treated as garbage, because there is no numerical analyst hiding inside the compiler (yet), and we know that for every innocuous-looking transformation there is an example where radically different outputs are possible. You may have convinced yourself that already-completed calculations A to Y were not unduly affected by the transformations involved, but you cannot be sure that the Z calculation that you will do tomorrow will also be unaffected. In the early days, when computers still had names ending in -AC, people had no idea what the output would be; that is why they were using a computer in the first place. These days, almost everyone has a pretty good idea of what the output should be, roughly so many picobarns, Angstroms, dollars, degrees Kelvin. The game has changed from “I wonder what the answer is” to “I wonder if I can get a plausible answer faster”. The principle of “thou shalt not fool thyself” requires a tool that reminds you of the instability inherent in cutting corners.

Having said all that, the actual bug that I see in LLVM-flang is that the runtime routine Fortran::decimal::ConvertToDecimal<24> hits a reference to an uninitialised variable, as valgrind reveals (after compiling with -O2 -ffast-math). This could be how different results end up in the output. But please don’t wait for “noise” like that before you suspect that your output digits may be fiction.

2 Likes

I am not aware of this characterisation of the performance. Surely there is better information on the conditions under which -ffast-math calculations fail.
I thought that denormalized floating point numbers were one cause, but other cases are not identified.

Do you see reproducible results when you run the executable under a debugger?

You always test in Debug mode (no -ffast-math) and get the answer you want. Then you enable Release mode (and yes, also -ffast-math) and see if things change and whether the accuracy is acceptable to you. The document I linked above talks about the techniques you can use to always (in my experience) get an acceptable answer out of -ffast-math as currently implemented in compilers.

1 Like

If LLVM Flang people “fixed” the problem above so that the program produces the same output for each run, but takes a bit longer to do so, would the customer be happy?

For commercial software development, it might, yes, because the developers already have to deal with too many issues from their own code and answer questions about why results are different (even if better or faster). If the only explanation for a difference is some randomness produced by a compiler option such as -ffast-math, then it gets really complicated. Reproducibility is important to build trustworthiness.

3 Likes

Nonreproducibility is usually caused by accessing some undefined or uninitialized memory location, and then using that location somehow (as an array index, or a pointer, or in a floating point operation, etc.). It is unclear to me in this thread how the fast-math compiler option might cause that to occur.

2 Likes

I agree that things should be reproducible with -ffast-math also, unless there is some optimization that just fundamentally is not deterministic? Which one?

1 Like

A programming error (by the user, or the compiler writer) can cause undefined references leading to nonreproducibility.

But so can simple addition operations, if there are enough of them to be executed in parallel and combined in a non-reproducible order. Enforcing the same order takes synchronization which costs run-time, and some people would be happy enough with a different sum each time the program runs (because they have analysed the effect the variation will produce downstream).

I am one of those people. However, that’s typically a parallel sum, correct? On a single core, why would it give different results each time? A parallel code indeed will give different results each run, but you can run it on a single core to make it reproducible.

Because it does not have to. A REDUCE intrinsic function with ORDERED=.FALSE. is not required to return the same value each time, because the order of reducing array elements is left unspecified.

Yes, but in practice a compiler will pick some particular (unspecified) order, and once the binary is built, I would think each run would still be deterministic.