Erroneous Arithmetic Operation - on a MIN statement?

garynewport · August 14, 2023, 10:35am

I have been tracking through a curious error that only occurs under certain conditions. The problem is that the code has to run for some time before those issues manifest.

Anyway, I have been hunting down issues as they arise, through the use of the debugging tools within GFortran and using a series of, basically, print statements.

I have been using the following compiler instructions:

gfortran -O -g -fbacktrace -ffpe-trap=invalid,zero,overflow modules.f90 stelcor.f90 dummymain.f90 -o dummymain

The -g is there so that I can use GDB, if I can get my head around it.

Anyway, I have been placing write statements around my code and have identified that the code breaks in a particular subroutine. In fact, I have been whittling it down to where in that subroutine it appears to break; yet the result makes no sense to me.

I am probably missing something obvious but here’s where the issue appears to be arising…

            if (j > 800) write (lunit, *) "addsub 11", j, n
            jcomp = min(j, n / 10)
            if (j > 800) write (lunit, *) "addsub 12"

This is within a loop, where j is the stepper variable. I have been able to identify that the code breaks well into the range where j > 850. So, I set my output to generate as much successful data as it can, without making the execution time insanely long.

The output file gives me…

...
 addsub 11         815         816
 addsub 12
 addsub 13
 add 1
 add 2
 addsub 11         816         817
 addsub 12
 addsub 14
 addsub 11         801         817
 addsub 12
 addsub 11         802         817
 addsub 12
 addsub 11         803         817
 addsub 12
 addsub 11         804         817
 addsub 12
 addsub 11         805         817
 addsub 12
 addsub 11         806         817
 addsub 12
 addsub 11         807         817
 addsub 12
 addsub 11         808         817
 addsub 12
 addsub 11         809         817
 addsub 12
 addsub 11         810         817
 addsub 12
 addsub 11         811         817

Those are the last 28 lines of output.

As you can see, at the end of the previous pass (j = 815 and n = 816), the code goes through my two outputs, continues to a later stage of the subroutine (addsub 13) and enters another subroutine, then returns to this subroutine afterwards, where j = 816 and n = 817.

The curious thing is that, as you can see, the file ends on addsub 11, where there is only one line of code being executed: a line that has run successfully multiple times, and where the values are, seemingly, perfectly fine.

The line number of this final line is 119776, whilst the size of the file is 15MB; so, this is not blowing out due to file size issues.

The error being reported is…

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

And the trace states…

Backtrace for this error:
#0  0x7fb22be66960 in ???
#1  0x7fb22be65ac5 in ???
#2  0x7fb22bb5651f in ???
#3  0x55d4300ed3dc in __stelcor_module_MOD_atmos
        at /mnt/c/Users/garyn/OneDrive/Stored/Programming/NEW STELCOR/STELCOR - 13.08.2023/stelcor.f90:661
#4  0x55d4300f203b in __stelcor_module_MOD_gi
        at /mnt/c/Users/garyn/OneDrive/Stored/Programming/NEW STELCOR/STELCOR - 13.08.2023/stelcor.f90:1951
#5  0x55d4300f510e in __stelcor_module_MOD_henyey
        at /mnt/c/Users/garyn/OneDrive/Stored/Programming/NEW STELCOR/STELCOR - 13.08.2023/stelcor.f90:2632
#6  0x55d4300fcc4e in __stelcor_module_MOD_stelcor
        at /mnt/c/Users/garyn/OneDrive/Stored/Programming/NEW STELCOR/STELCOR - 13.08.2023/stelcor.f90:166
#7  0x55d4300fe028 in MAIN__
        at /mnt/c/Users/garyn/OneDrive/Stored/Programming/NEW STELCOR/STELCOR - 13.08.2023/dummymain.f90:249
#8  0x55d4300fe588 in main
        at /mnt/c/Users/garyn/OneDrive/Stored/Programming/NEW STELCOR/STELCOR - 13.08.2023/dummymain.f90:19

I have advice from another question I posted, which I will be trying out later, in terms of a debugging tool I might be able to utilise. However, right now I am puzzled and wondering what I am missing.

tyranids · August 14, 2023, 11:50am

Is min the actual intrinsic function min (see MIN in The GNU Fortran Compiler manual)? Are you positive there isn’t a variable named min hiding in a common block or other global scope somewhere?

You might also try adding special logic before the line jcomp = min(j, n / 10) to say if (j == 811) then and add a bunch of debugging prints in that block. You can also much more easily set a breakpoint in gdb there and you know it will only trigger once you are near the problematic iteration.
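A minimal sketch of that conditional debug block, assuming the loop shape and the values seen in the log (the program name and the use of error_unit are illustrative choices, not taken from the original code):

```fortran
program breakpoint_demo
   use, intrinsic :: iso_fortran_env, only: error_unit
   implicit none
   integer :: j, n, jcomp

   n = 817                        ! value of n in the last lines of the log
   do j = 801, 816
      if (j == 811) then
         ! Reached only on the suspect iteration, so a gdb breakpoint set
         ! on the write below fires once instead of hundreds of times.
         write (error_unit, *) "before min: j =", j, " n =", n, " n/10 =", n / 10
         flush (error_unit)
      end if
      jcomp = min(j, n / 10)      ! the statement under suspicion
   end do
   write (error_unit, *) "final jcomp =", jcomp
end program breakpoint_demo
```

With these values n / 10 is 81, so jcomp ends up as 81 on every pass; nothing in this statement alone can raise a floating-point exception, which is a hint that the fault lies elsewhere.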

If this is an older legacy code, I heavily suspect some memory shenanigans are about. Are your errors consistent for the same input to your program? Also last thing to check, are both j and n the same type? The prints make it look like they are both of type integer, are you sure of that?

EDIT: This is a good reference to get you through gdb - https://users.ece.utexas.edu/~adnan/gdb-refcard.pdf

feenberg · August 14, 2023, 12:39pm

Valgrind has solved many mysteries for me. The most mysterious problems are often out of bounds memory writes, which Valgrind can catch.

I would also try replacing min() with an assignment and an if statement, just so the failure isn’t in a library. That will make gdb more of a help. Place a “watch” on j and n. (Note that in gdb there is the “up” command if your program abends in a library routine. It gets up to the calling routine for “p” commands).
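One way the min() replacement could look, as a sketch only: the temporary limit is a hypothetical name introduced here, and j and n are set to the values from the final log line.

```fortran
program min_by_hand
   implicit none
   integer :: j, n, jcomp, limit

   j = 811
   n = 817

   limit = n / 10            ! integer division on its own line, easy to inspect
   if (j < limit) then
      jcomp = j
   else
      jcomp = limit
   end if

   ! Cross-check against the intrinsic; the logical should print T.
   print *, jcomp, jcomp == min(j, n / 10)
end program min_by_hand
```

Splitting the statement this way gives gdb separate lines to stop on, so a fault can be attributed to the division, the comparison, or the assignment.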

garynewport · August 14, 2023, 12:53pm

What an excellent response; thank you!

Okay. So, j and n are both declared as integers. j is declared at the top of the subroutine and is a local variable. n is used from another module, where it is declared as an integer.

min is the intrinsic function and I have checked everywhere; there is no variable called min (local or global) in the code. Equally, everything is declared (implicit none) and almost all of my use statements specify only, to ensure that each one imports just the variables it actually needs.

I did run a few prints to see if the values being used suddenly became erroneous; they don’t.

I have killed all historic COMMONs and EQUIVALENCEs.

Thank you for your guide. Using that, and other sources, I have now gotten GDB to run the code and I await its feedback.

I am also going to be looking at VALGRIND shortly.

Yes, this will turn out to be a silly declaration somewhere. I am also concerned that I am missing something obvious and that there is no fault here - just my interpretation of what is happening. I had that a short while ago when I thought I had isolated an error, only to realise that I had simply reached the capacity of the file size.

garynewport · August 14, 2023, 12:53pm

Excellent ideas; thank you!

tyranids · August 14, 2023, 1:14pm

A last quick sanity check: “jcomp” is also an integer, yes?

garynewport · August 14, 2023, 1:28pm

Yes. And the code is run, successfully, many times prior to this one.

However, I have now successfully run GDB and this points to a completely different area of the code; so I need to solve that one first. I thought I had gotten all the bugs out a couple of months ago but, oh no - more bugs appear. hahahaha!

tyranids · August 14, 2023, 3:50pm

The same inputs fail at a different spot? Memory not being respected somewhere is 100% to blame.

You may try using the compiler flags -Wall -Werror -fcheck=all
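To illustrate the kind of fault -fcheck=all is meant to catch, here is a deliberately broken sketch (not taken from the code under discussion; the program name and the off-by-one are invented for the example):

```fortran
program bounds_demo
   implicit none
   integer :: a(10), i, last

   last = size(a) + 1        ! deliberate off-by-one, for illustration only
   do i = 1, last
      a(i) = i               ! when i = 11 this writes past the end of a
   end do
   print *, sum(a(1:10))
end program bounds_demo
```

Built with something like gfortran -O0 -g -fcheck=all -fbacktrace bounds_demo.f90, the run should stop at the offending assignment with a bounds-check runtime error. Without the flag, the stray write may silently overwrite a neighbouring variable, and the visible failure can then show up far from the real cause.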

rwmsu · August 14, 2023, 4:02pm

Just curious. Your first post appears to imply that you are compiling with -O -g. Why not -O0 -g, to eliminate potential optimization issues? That's usually the first thing I suspect, particularly after updating to a new version of a compiler.

RonShepard · August 14, 2023, 5:38pm

I do not have an answer for your general question, but this particular one might involve how the file connected to lunit is buffered. There could well be more lines in the buffer, but the program aborts before that buffer is flushed, so you simply do not see them. Some quick workarounds are to write the debug info to standard output instead of a file – standard output (and standard error) is usually not buffered, or at least the buffer is small, maybe four bytes. Another possibility is to change the buffer status for the file – this might be done in the open statement or by setting an environment variable. Another workaround is to add the statement flush(lunit) after each write statement.

If the file is not local, say it is a network file, then these things might not work, or they might only partially work, because there will be several other buffer layers between your write statement and the final bits that constitute the file. These are all just debug options; after your code is working again, you don’t want to do these things because they will slow down your code.
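The flush workaround can be sketched as follows, assuming a debug file opened by the program itself (the file name debug_trace.txt and the loop bounds are placeholders):

```fortran
program flush_demo
   implicit none
   integer :: lunit, j

   open (newunit=lunit, file="debug_trace.txt", status="replace", action="write")
   do j = 801, 811
      write (lunit, *) "addsub 11", j
      flush (lunit)    ! hand buffered records to the OS before the next statement runs
   end do
   close (lunit)
end program flush_demo
```

If a crash follows a flushed write, the last record in the file is a more trustworthy marker of how far execution got. For gfortran specifically, setting the GFORTRAN_UNBUFFERED_ALL environment variable is another option that avoids editing the source, at the same cost in speed.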

garynewport · August 14, 2023, 9:06pm

I understood that -O was defaulting to -O0 (or -O1, I can’t remember which one).

I will try that but can state that, when I get close enough to the problem, the problem is within the code itself.

garynewport · August 14, 2023, 9:08pm

Yes, @RonShepard, I fear it might be an issue with writing to the file; since nothing else makes sense here. Even in my trace back data, it is pointing to another subroutine. I have run the program in GDB and it is pointing to the same subroutine as is given by the traceback; so I am optimistic that there IS an issue there. Just need to find it.

I think the file data is a red herring.

rwmsu · August 14, 2023, 11:57pm

I understood that -O was defaulting to -O0 (or -O1, I can’t remember which one)

No compiler I’ve used in the last 30 years has defaulted to -O0 when you just specify -O. On ifort -O defaults to -O2. As best I can determine, gfortran defaults to something I guess is like -O1. The gfortran man page only mentions -O in the context of some specific optimizations. The gcc man page (and I’m assuming -O with gcc is close to what you get with gfortran) mentions that -O tries to “reduce code size and execution time without performing any optimizations that take a great deal of compilation time” and lists a bunch of optimization flags that are enabled by -O. -O2 turns on another level of optimization above -O, so -O is probably equivalent to -O1. I think you might get -O0 by not specifying any optimization flags but don’t quote me on that. I don’t use gfortran enough to say for certain what the various optimization levels actually do.

tyranids · August 15, 2023, 4:05am

I am almost certain it is at least -O1, if not -O2. Try explicit -O0 -g -Wall -Werror -fcheck=all -fbacktrace to help with debugging when using gfortran. If you have too many warnings you don’t care to fix, you can remove -Werror and compilation will still go through.

garynewport · August 15, 2023, 8:17pm

Yes, I think it is 1; I did read it somewhere.

I am testing GDB, which has been very informative at the moment. About to post a question on using this, actually, since it has already helped me to identify a potential issue.
