What @hkvzjal mentioned looks like what people call the “Heisenberg problem” in programming, or simply a “heisenbug” (see Heisenbug - Wikipedia).
I wonder, how do you guys fix heisenbugs? What may cause them, and how can they be prevented?
Thanks!
PS.
For example, most heisenbugs I encounter are caused by accidentally accessing a memory address that should not be accessed, while the compiler gives no warning or error message. One case is accessing the 10th element of an array that contains only 9 elements. Another is defining a function with 5 arguments but calling it with fewer than 5. Sometimes I have found that using an ‘optional’ argument in a subroutine or function may cause a heisenbug too.
Some heisenbugs can be found and fixed by enabling the check of routine interfaces, like below
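As a rough sketch of the source-side counterpart to interface checking (this is not the poster's original setting, and the names `rescale_mod` and `rescale` are made up for illustration): when a procedure lives in a module, the compiler sees its explicit interface at every call, so a wrong argument count becomes a compile-time error, and an optional argument can be tested safely with `present()`. With Intel Fortran, I believe the corresponding compiler option is `-warn interfaces` (`/warn:interfaces` on Windows), but check your compiler's documentation.

```fortran
module rescale_mod
  implicit none
contains
  ! Hypothetical routine: multiplies x by factor and optionally adds an offset.
  subroutine rescale(x, factor, offset)
    real, intent(inout)        :: x
    real, intent(in)           :: factor
    real, intent(in), optional :: offset
    x = x * factor
    ! An absent optional argument must not be referenced; test it first.
    if (present(offset)) x = x + offset
  end subroutine rescale
end module rescale_mod

program demo_interface
  use rescale_mod, only: rescale
  implicit none
  real :: x
  x = 2.0
  call rescale(x, 3.0)               ! offset omitted; x becomes 6.0
  print *, x
  call rescale(x, 3.0, offset=1.0)   ! x becomes 19.0
  print *, x
  ! call rescale(x)   ! rejected at compile time: required argument missing
end program demo_interface
```

If `rescale` were instead an external procedure without an explicit interface, the commented-out call would typically compile and then misbehave at run time, and the optional argument would not be usable at all, since optional dummy arguments require an explicit interface.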
Heisenbugs, by their very nature, are difficult to find. There is no generally applicable strategy to hunt them down. The common causes you mention and the possible remedies are definitely the things to look for, but it remains hard labour.
ei0=-1.d8; ei1=1.d8
de=1.d8
ikgap(1:3)=1
do ik=1,nkpt
  ed0=-1.d8; ed1=1.d8
  do ist=1,nstsv
    e=evalsv(ist,ik)
    if (e <= efermi) then
      if (e > ed0) ed0=e
      if (e > ei0) then
        ! transfer is a workaround for a bug in Intel Fortran versions 17 and 18
        ikgap(1)=transfer(ik,ik)
        ei0=e
      end if
    else
      if (e < ed1) ed1=e
      if (e < ei1) then
        ikgap(2)=ik
        ei1=e
      end if
    end if
  end do
  e=ed1-ed0
  if (e < de) then
    ikgap(3)=ik
    de=e
  end if
end do
The line after the comment should read:
ikgap(1)=ik
However, this results in nonsensical output with Intel compiler versions 17 and 18 at optimization level -O2 and higher. The bug vanishes if you put in a print statement.
I think we found this by just commenting out various lines.
I didn’t really say that. What I said was that printing is not always a valid debugging method in Fortran, since printing is a side effect and side effects are not allowed in pure procedures.
I actually tend to use print*, exclusively for debugging and write (... for proper output, unless the procedure is pure, in which case I take the debugger route.
(Even in Go, which has the superb delve, I tend to use fmt.Println for debugging and fmt.Printf for proper output)
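To make the restriction concrete, here is a minimal sketch (the module `stats_mod` and the function `mean` are hypothetical, and the input is assumed to be non-empty). A `print` inside the pure function is a compile-time error, so the debugging print has to sit in the impure caller.

```fortran
module stats_mod
  implicit none
contains
  ! Hypothetical pure function: output to an external unit is not allowed here.
  pure function mean(v) result(m)
    real, intent(in) :: v(:)
    real :: m
    ! print *, 'v = ', v   ! would not compile: print in a pure procedure
    m = sum(v) / real(size(v))   ! assumes size(v) > 0
  end function mean
end module stats_mod

program demo_pure
  use stats_mod, only: mean
  implicit none
  real :: v(4), m
  v = [1.0, 2.0, 3.0, 6.0]
  m = mean(v)
  ! The debugging print lives in the impure caller instead.
  print *, 'debug: mean = ', m
end program demo_pure
```

This only shows values at the call boundary, which is why a debugger becomes the more practical option once the interesting state is inside the pure procedure.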
With regard to heisenbugs, the kind that puzzles me the most is when compilation involves multiple libraries (with their own modules, etc.) and the bug is likely in the compiler… But as you try to create an MRE, the bug disappears.
My feeling from my experience is that heisenbugs (*) (I didn’t know the name, btw) are most of the time compiler bugs.
(*) if it means bugs that vanish in debug mode with all checks enabled, AND that behave differently when some prints (or any other statement that is not supposed to fix anything) are inserted
I have found a few compiler bugs like this, but in my case the vast majority of my own heisenbugs are code errors, that is programmer errors, that are not caught during compilation or during runtime. These are usually array bounds errors, but where the error is obscured somehow from the compiler (e.g. assumed size declarations, or explicit shape declarations with incorrect array bounds, or mismatched arguments with external subprograms). With f90 and later, another type of error like this is an incorrect intent(out) declaration which should be intent(inout), or a pointer assignment that points to a compiler generated temporary instead of the expected target. These are programmer errors, not compiler errors, but they can be difficult to locate because changing compiler options can make the symptoms vanish while leaving the error still in the code.
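As an illustration of two of these error classes, here is a small sketch (module and procedure names are invented for the example). It uses an assumed-shape dummy array, so the extent travels with the argument and runtime bounds checking can see it, and it declares `intent(inout)` where the old value is read before being updated.

```fortran
module fill_mod
  implicit none
contains
  ! Assumed-shape dummy a(:): the routine can query the real extent with
  ! size(a). With assumed size, a(*), the extent would be unknown here.
  subroutine fill(a, val)
    real, intent(out) :: a(:)
    real, intent(in)  :: val
    integer :: i
    do i = 1, size(a)
      a(i) = val
    end do
  end subroutine fill

  ! total is read before it is written, so it must be intent(inout).
  ! Declaring it intent(out) would make its incoming value undefined.
  subroutine accumulate(total, x)
    real, intent(inout) :: total
    real, intent(in)    :: x
    total = total + x
  end subroutine accumulate
end module fill_mod

program demo_bounds
  use fill_mod, only: fill, accumulate
  implicit none
  real :: a(9), total
  integer :: i
  call fill(a, 1.0)
  total = 0.0
  do i = 1, size(a)      ! size(a) rather than a hard-coded 10
    call accumulate(total, a(i))
  end do
  print *, total         ! nine elements of 1.0 summed
end program demo_bounds
```

Compiling with runtime checks while hunting a bug (for example `-fcheck=bounds` with gfortran or `-check bounds` with Intel Fortran) should catch an out-of-range index in code written this way, whereas with assumed-size or mismatched explicit-shape declarations the same error can slip through.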
On the other hand, I still have relatively simple looking code that uses parameterized data types that does not compile correctly on popular compilers. I also have similar issues with some object-oriented code. These are in fact compiler errors, not programmer errors, so these certainly do exist, even after a couple of decades since they were identified.
Pretty much the same experience. If it goes away with a print statement, it is usually because of a shift in memory, and the cause is likely an array bound or type mismatch. Modules help to greatly reduce the mismatch issues. Sometimes it goes away because the print statement causes certain optimizations to be turned off, but if the problem lingers and you do not find the error in the code, the best tool in many circumstances is multiple compilers. If you have access to three or four compilers and the error occurs on only one of them, it is a good time to look at the bug reports if possible (I really like that gfortran uses Bugzilla for that reason; it is easier to search and more accessible than most compiler bug trackers). So accessible bug trackers and multiple compilers help a lot. I used to teach an in-house Totalview class, and vendor tools and external tools like valgrind can be great, but I almost always start with print statements and deleting blocks of code. In my opinion debuggers are best used when looking for logic problems. As soon as memory is being clobbered or I am working with a compiler bug, I have found myself debugging the debugger more than the code I am working on if I jump to a debugger first.
As somewhat implied, reducing the complexity is almost always a good direction to take after first trying some of the low-hanging fruit, like turning on array bound checks, using other compiler switches that help debugging, and the aforementioned print statements. Because of the I/O restrictions on newer features like PURE and SIMPLE procedures, a debugger does get more attractive, as it generally lets you step through and inspect those kinds of routines; but that is a relatively recent development.
But reducing the code complexity as much as possible is not only a good approach to dealing with practically any bug; in my experience it is particularly fruitful with real compiler bugs. If it is a compiler bug, providing a 20-line reproducer is far more likely to get it worked on than providing a million-line code and saying “this does not work. Can you fix it?”. At one vendor, where I thought the guys probably hated me because I had 42 tickets open, they instead used to call and talk to me directly about the bugs when they were not supposed to. They said it was because I always verified the bug and made a small reproducer in (almost) all cases. A few years later they helped me get a job at their company, so another tip is: be nice to your compiler developers!
Exactly. Optimization issues can also be tricky. I had a problem in a code a few years back with one version of ifort where I got the wrong answers with optimization turned off but got my expected answers with -O2. I never figured out what was going on, so I just rewrote that section of the code (I was using some OO that I replaced with standard procedural code) and the problem went away. I’ve also had problems with debuggers not giving the correct point in the code where something is going wrong. In my case, the two classic flang compilers (Nvidia and AMD) both failed with an internal compiler error (ICE) on a bit of code. I couldn’t see any possible error in the section of the code the debugger pointed to. It turned out the error was in a module that compiled correctly but triggered an error when the module was USED in the routine that I thought was triggering the ICE. Again, the traceback and debugger led me to believe the error was elsewhere.
I have a code with a heisenbug, and I generated the debug info when building the exe file. The Intel oneAPI gdb and Visual Studio 2022’s debugger could not locate the place in the source code where the memory corruption occurs; they just showed some ??? symbols.
However, WinDbg clearly traced back to the exact subroutine where the memory corruption occurs. Thank god it helped me fix the problem; otherwise I could not have slept well.
I guess the reason WinDbg works so well is that the code is built with the Windows SDK.