Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO

I have a code that, when compiled with --fast-math, prints the following after running (successfully):

Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO

Does anyone have any tips on how to debug that, given that the program does not throw any error? I can make the program break somewhere by adding the following compiler flags:

-ffpe-trap=zero,overflow,underflow

but the lines the error points to do not make much sense (for instance, a line with an operation like x = (1 - y)**2, where y arrives at the line well defined). Worse, the error line sometimes changes even when I add a write(*,*) x, y to check the values of the variables before the error (the error then occurs somewhere else).

Update: If I use only --fast-math -ffpe-trap=zero

then the error occurs, apparently, in a system library:


Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f5321d81ad0 in ???
#1  0x7f5321d80c35 in ???
#2  0x7f5321a9151f in ???
	at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3  0x5564a17b1886 in ???
#4  0x5564a17b9083 in ???
#5  0x5564a17b9ca1 in ???
#6  0x5564a17eabbe in ???
#7  0x5564a17bbf7e in ???
#8  0x5564a17e71d0 in ???
#9  0x5564a179f318 in ???
#10  0x7f5321a78d8f in __libc_start_call_main
	at ../sysdeps/nptl/libc_start_call_main.h:58
#11  0x7f5321a78e3f in __libc_start_main_impl
	at ../csu/libc-start.c:392
#12  0x5564a179f364 in ???
#13  0xffffffffffffffff in ???
Floating point exception (core dumped)

Any working source sample? I understand it is gfortran. Which version?

Well, yes, but the code is quite large. To reproduce the problem you can try:

git clone https://github.com/m3g/packmol
cd packmol
make

and then create the example with the files I attach here:

cd example
../packmol < test.inp

in the example directory two files must be added: test.txt and W.txt:

test.txt (499 Bytes)
W.txt (109 Bytes)

Yes, I’m using gfortran, GNU Fortran (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0.

It is a little complicated to run the example and try to track the bug, so I was only expecting some hints, but thanks for all the help.

(the error goes away if one edits the Makefile by removing the --fast-math flag, and recompiling with make clean; make).

Could it be due to some reordering of mathematical instructions?

It is probably related to some operation resulting in zero with the flag but not without it. But I am not able to track down where that might be, so that I could handle the exception manually.

How did you obtain that output?

(fastmath here provides quite a significant speedup, which I’m not willing to let go for now - the kind of solution the package gives is not something that needs to be “precise”. I would rather find where those incorrect operations occur and deal with them)

@lmiq,

Please take a look at the trivial example in this thread as just an illustration. The point is whether you could consider the standard Fortran IEEE-related intrinsics, such as IEEE_SET_HALTING_MODE, IEEE_SET_FLAG/IEEE_GET_FLAG, etc. A judicious use of these intrinsics can enable you to trap the floating-point exceptions, or even break at the right location(s), which may be deep inside some number-crunching library.
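A minimal sketch of the halting-mode idea, assuming a compiler and platform where ieee_support_halting reports support for divide-by-zero. The program name and the variables num, den and q are made up for illustration and are not taken from Packmol. Halting is switched on only around the suspect statement, so the stop should point at that statement rather than at an unrelated line.

```fortran
program halting_sketch
  use, intrinsic :: ieee_exceptions
  implicit none
  ! volatile keeps the compiler from folding the division at compile time
  double precision, volatile :: num, den
  double precision :: q
  logical :: saved_mode

  num = 1.d0
  den = 0.d0

  if (.not. ieee_support_halting(ieee_divide_by_zero)) then
     print *, 'halting control is not supported here'
     stop
  end if

  ! Remember the current mode, then ask for a halt on divide-by-zero
  call ieee_get_halting_mode(ieee_divide_by_zero, saved_mode)
  call ieee_set_halting_mode(ieee_divide_by_zero, .true.)

  q = num / den          ! suspect statement: the program should stop here

  ! Restore the previous mode once the suspect region is over
  call ieee_set_halting_mode(ieee_divide_by_zero, saved_mode)

  print *, 'only reached if halting was not honoured, q = ', q
end program halting_sketch
```

Compiling with -g -fbacktrace should make the resulting backtrace easier to read. With aggressive optimization the compiler may still move arithmetic across these calls, so the reported line is a hint rather than a guarantee.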


We need the input file test.inp in order to run the program, along with the files test.txt and W.txt.

Guessing that I could use test.txt as a substitute for test.inp, I ran the program, and I found (using the Lahey-Fujitsu compiler, with the option -NRtrap) that on line 1833 of gencan.f a division by gpsupn is performed when that variable has the value zero. Had the program been allowed to go to the next line, a similar problem would arise on that line when log(gpsupn) is evaluated there. Line 1832 contains:

if ( gpsupn .ge. 0.d0 ) then

It probably needs to say .gt. instead of .ge. The comment on line 1831 suggests that your intention was to write .gt.

c LM: changed to avoid error with gpsupn=0

You can reproduce these findings with your compiler of choice by using the appropriate options for trapping floating point division by zero.
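A minimal sketch of the .gt. guard suggested above. Only the name gpsupn and the zero value come from the thread; the numerator gpsupn0, the variables ratio and lg, and the else branch are assumptions made for illustration, since the actual statements on lines 1833 and 1834 are not quoted here.

```fortran
program guard_sketch
  implicit none
  double precision :: gpsupn, gpsupn0, ratio, lg

  gpsupn  = 0.d0      ! the problematic value reported in the thread
  gpsupn0 = 1.d0      ! hypothetical numerator, for illustration only

  ! Strict inequality: the branch is skipped when gpsupn is exactly zero,
  ! so neither the division nor the logarithm sees a zero argument.
  if ( gpsupn .gt. 0.d0 ) then
     ratio = gpsupn0 / gpsupn
     lg    = log( gpsupn )
     print *, 'ratio = ', ratio, ' log = ', lg
  else
     print *, 'gpsupn is not positive; skipping the division and the log'
  end if
end program guard_sketch
```

What the code should do in the skipped case depends on the algorithm, so the else branch here is only a placeholder.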


Indeed, it was me that changed that line (many years ago…). The problem is that after fixing that, I continue to have the problem somewhere else. I have a hard time trying to debug that, because, for instance, if I use .gt. there, the problem moves to line 2161, which contains:

                      kappa = log10( gpeucn2 / gpeucn20 )/
     +                        log10( epsgpen2 / gpeucn20 )

My issue is that if I try to print the values of the variables involved:

                     write(*,*) gpeucn2, gpeucn20
                      kappa = log10( gpeucn2 / gpeucn20 )/
     +                        log10( epsgpen2 / gpeucn20 )

The error is thrown before the printing:

.... ## all fine until this point
  Maximum internal distance of type            4 :    0.0000000000000000     

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7fa2e1f52ad0 in ???
#1  0x7fa2e1f51c35 in ???
#2  0x7fa2e1c6251f in ???
        at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3  0x5608f4201ee3 in gencan_
        at src/gencan.f:2163
#4  0x5608f42074f3 in easygencan_
        at src/gencan.f:737
#5  0x5608f42095d1 in pgencan_
        at src/pgencan.f90:73
#6  0x5608f429d95a in restmol_
        at src/restmol.f90:63
#7  0x5608f420ae1c in initial_
        at src/initial.f90:96
#8  0x5608f42981d3 in packmol
        at app/packmol.f90:686
#9  0x5608f429b09f in main
        at app/packmol.f90:36
Floating point exception (core dumped)

Then, if I move the write... upwards, the error stops occurring at that line, and happens to occur in another, unrelated position (line 1715).

So, again, I’m having a hard time trying to track the bug, as the error messages jump from place to place depending even on whether or not I add a write statement. I need a better debugging tool here. Although these errors appear in floating-point operations, it looks more like memory corruption, given the randomness of what’s happening.
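One possible alternative to relying on the trap location is to poll the IEEE flags right after a suspect expression. The sketch below does that for an expression shaped like the kappa assignment. The input values are hypothetical and chosen to provoke the flags; they are not taken from a real Packmol run. With an IEEE-conforming math library, the logarithm of zero is expected to raise divide-by-zero and the quotient of two infinities to raise invalid, but that is an assumption about the platform.

```fortran
program flag_poll_sketch
  use, intrinsic :: ieee_exceptions
  implicit none
  ! volatile stops the compiler from evaluating the expression at compile time
  double precision, volatile :: gpeucn2, gpeucn20, epsgpen2
  double precision :: kappa
  logical :: div0, invalid

  ! Hypothetical inputs, for illustration only
  gpeucn2  = 0.d0
  gpeucn20 = 1.d0
  epsgpen2 = 0.d0

  ! Clear the flags so that earlier operations do not confuse the result
  call ieee_set_flag(ieee_divide_by_zero, .false.)
  call ieee_set_flag(ieee_invalid, .false.)

  kappa = log10( gpeucn2 / gpeucn20 ) / log10( epsgpen2 / gpeucn20 )

  ! Query the flags in the same scope as the arithmetic
  call ieee_get_flag(ieee_divide_by_zero, div0)
  call ieee_get_flag(ieee_invalid, invalid)

  write(*,*) 'kappa = ', kappa
  write(*,*) 'divide_by_zero raised: ', div0
  write(*,*) 'invalid raised:        ', invalid
end program flag_poll_sketch
```

The flags are queried in the same procedure as the arithmetic on purpose: as far as I understand the standard, flags that are signaling on entry to a procedure may be quieted there and restored on return, so polling from inside a separate helper routine might report nothing.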

I did notice the subsequent divisions by zero, but I did not know what code to substitute for the situation where a divisor was zero.

I think that the apparent uncertainty (regarding which line the zero-divide exception occurs in) is caused by the code having been compiled with at least some optimizations enabled/requested. Please try using the lowest optimization level that your compiler allows until all these bugs have been detected and fixed.

The problem is that the errors disappear when I remove the compiler optimizations; they appear only when I use both -O3 and --fast-math.

Removing those, I can print the variables, but I don’t see anything wrong with them (they’re not zero, for instance).

It is possible that there is a code generation bug. Unfortunately, you will have to create a reasonably short reproducer and post it to the GCC Bugzilla. Creating such a reproducer while preserving a code generation or optimization bug is often difficult.

I have narrowed down things a bit. I compiled using Cygwin Gfortran 11.3 on Windows, using the flags -ffpe-trap=zero --fast-math -O2 -fbacktrace -g . I modified the source file that reads the input data to open and read from a file instead of from redirected standard input. I then ran the resulting EXE using gdb, and the result was, after many lines of normal program output:

Thread 1 "packmol" received signal SIGFPE, Arithmetic exception.
0x000000010040ec5d in gencan (n=6, x=..., l=..., u=..., m=0, lambda=..., rho=..., epsgpen=0, epsgpsn=9.9999999999999995e-07, maxitnfp=20, epsnfp=0, maxitngp=1000, fmin=1.0000000000000001e-05, maxit=20, maxfc=200, udelta0=-1, ucgmaxit=-1, cgscre=2, cggpnf=0.0001, cgepsi=0.10000000000000001, cgepsf=1.0000000000000001e-05, epsnqmp=0.0001, maxitnqmp=5, nearlyq=.FALSE., nint=2, next=2, mininterp=4, maxextrap=100, gtype=0, htvtype=1, trtype=1, iprint=0, ncomp=50, f=187416100, g=..., gpeucn2=0, gpsupn=0, iter=0, fcnt=1, gcnt=1, cgcnt=0, spgiter=0, spgfcnt=0, tniter=0, tnfcnt=0, tnstpcnt=0, tnintcnt=0, tnexgcnt=0, tnexbcnt=0, tnintfe=0, tnexgfe=0, tnexbfe=0, inform=0, s=..., y=..., d=..., ind=..., lastgpns=..., w=..., eta=0.90000000000000002, delmin=0.01, lspgma=10000000000, lspgmi=1e-10, theta=9.9999999999999995e-07, gamma=0.0001, beta=0.5, sigma1=0.10000000000000001, sigma2=0.90000000000000002, sterel=9.9999999999999995e-08, steabs=1e-10, epsrel=1e-10, epsabs=9.9999999999999995e-21, infrel=1e+20, infabs=9.9999999999999997e+98) at gencan.f:1713
1713          epsgpen2 = epsgpen ** 2
(gdb)

Note that epsgpen is already reported to be exactly zero, so why just squaring it should cause SIGFPE is hard to understand. When I ran the same EXE without using GDB, the output ended with

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Note, however, that code compiled with --fast-math may introduce errors that do not cause any noticeable effects, which after being propagated through large blocks of code cause an FPE to occur. At the point where the FPE occurs or a large error is noticed, looking at a few lines of code in the vicinity may give no clue to what actually went wrong.

The OP’s code is about 13,000 lines, and sufficiently complex for someone else to hesitate to rule out any of the errors that can affect a Fortran program. I am not convinced that the program is error free, or that there is no interaction between these suspected errors and the optimizations that are performed as a result of -O3 --fast-math being specified.

Based on what has been written about --fast-math, I have to ask, are there any known circumstances where it is safe to use? In most “real” programs, the code path taken may change with the input data, so one can not rule out that any results produced may be wrong because incorrect calculations were performed for just that input data set. In other words, verifying results obtained with one set of input data should not give any confidence in the results that may be obtained with other sets of input data.

Here is a workaround that, depending on the version of the compiler, may let you use --fast-math with all of your source files except 7+7 lines in your source file gencan.f.

  1. Move lines 1832 to 1838 to a new subroutine, say newsub1.f. (also change .ge. to .gt. on line 1832, as discussed earlier)
  2. Move lines 2159 to 2165 to a new subroutine, say newsub2.f .
  3. Compile all your sources except these new subroutines with your desired options, such as -O3 -ffpe-trap=zero --fast-math. For these short new subroutines, do not use --fast-math.
  4. Link and run.
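The steps above can be sketched roughly as follows for the second block. The subroutine name newsub2 follows the file name suggested in step 2, but its argument list is an assumption, since the full contents of lines 2159 to 2165 are not quoted in the thread; the driver program and its input values are purely illustrative. The code uses no continuation lines and starts statements in column 7, so it should compile as either fixed or free form.

```fortran
! newsub2: hypothetical interface, for illustration only.
! In a real build this routine would live in its own file and be
! compiled without --fast-math, while the other sources keep the flag.
      subroutine newsub2(gpeucn2, gpeucn20, epsgpen2, kappa)
      implicit none
      double precision gpeucn2, gpeucn20, epsgpen2, kappa

      kappa = log10(gpeucn2 / gpeucn20) / log10(epsgpen2 / gpeucn20)

      end subroutine newsub2

! Illustrative driver with made-up, well-behaved inputs.
      program driver
      implicit none
      double precision kappa

      call newsub2(1.d-2, 1.d0, 1.d-8, kappa)
      print *, 'kappa = ', kappa

      end program driver
```

In gencan.f the original lines would then be replaced by a call to the new routine. Keeping it in a separate file is what allows a different set of flags for it, and it also keeps the compiler from inlining it back into the optimized code unless link-time optimization is enabled.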

Thanks for the tip.

In fact, what happened here is that after moving those pieces to new routines, the error disappears even with all flags active.

Do you by chance observe the same thing?

The current master branch of the repository is updated with those changes, in any case:

git clone https://github.com/m3g/packmol
cd packmol
make # with gfortran installed
cd testing
../packmol < ieee_signaling.inp

Now the error does not happen, even though everything is being compiled with the same flags as before. One can control the flags of the compilation of those separated routines with the IEEE_SIGNAL_FLAGS in the Makefile.

Just for the record: this is a package that has been essentially stable for more than a decade, using those flags. Those IEEE signals did not appear before, or at least I have never found an example where they appeared, and in any case the program, even with that signaling, runs fine and ends up with the correct results. I tend to believe that I may have some error in the code causing a memory corruption that leads to such a strange and hard-to-track bug, but the bug is benign enough that it never manifests in actually breaking the execution.

Yes, most of the time, but whether this happens or not seems to depend on the other flags (such as -O2 vs. -O3, etc.) and the version of Gfortran used. I attempted to construct a short reproducer, but failed.
