How can you prove that two codes will produce the same output?

@kargl wrote

But, I have been told that a change to a certified code (for some definition of certified) could trigger an expensive review process (e.g., code in the nuclear energy industry or NASA).

and I have read similar statements before. Stylistic questions are often discussed here, and one may want to automate the conversion of a code to a perceived better style, but it is important not to inadvertently change the results that a code produces. A very incomplete list of ways two codes can look different but produce the same results:
(1) fixed vs. free format
(2) capitalization (except within character strings)
(3) spacing, including within end if and end do
(4) the order that variables are declared and whether many variables are declared on the same line or
declared separately on multiple lines
(5) comments

One should have a test suite, but a test suite considers a finite number of cases.

The two trivial codes

integer :: i,j
i = 3
j = 2*i
print*,j
end

and

integer :: i
integer :: j
i = 3
j = 2*i
print*,j
end

when compiled with gfortran give executables of the same size, but when I run fc (Windows file compare) on them they are distinct:

c:\fortran\test>fc same.exe same1.exe
Comparing files same.exe and SAME1.EXE
00000088: 65 6C
000000D8: EE B5
000000D9: D9 7C
000000DA: 21 22
00000B76: 04 05
0004BE04: 2E 31
0004BE05: 66 2E
0004BE06: 39 66
0004BE07: 30 39
0004BE08: 00 30
001F06FA: 2E 31
001F06FB: 66 2E
001F06FC: 39 66
001F06FD: 30 39
001F06FE: 00 30

so using fc on executables is not the right approach. John Burkardt has written and translated a huge amount of Fortran code, and his style preference is to declare each variable on a separate line, as shown here. I have thought about writing a program to reformat his code to group variable declarations and make other changes, but how would I prove that the codes had not changed substantively, beyond running the driver programs he provides?

Writing good tests is certainly a difficult task. I use the following strategies:

  1. Analytic solutions for simple cases. These tests are simple but their use is also limited.
  2. Test for invariants, ideally with random input. For example, abs(a) == sqrt(a**2) would be a test for sqrt if you trust your ** operator and ignore floating-point inaccuracies (note the abs: the plain identity fails for negative a).
  3. Compare to a previous result. These tests help to detect regressions, they might even ‘tell’ you if you have accidentally fixed a bug.

In your case, I would run the old code with a huge amount of random input and check whether the new code produces the same results. If your modernization goes beyond stylistic questions, getting reasonable tolerances for floating-point comparison will be a challenge.

Test cases are the best generic solution. What if you were trying to answer this question for two codes written in entirely different languages? There are, however, special cases where you want to see a little deeper into exactly what real code is being generated, or even how a compiler option may be affecting your code. Remember that on basically all machines there is really only one language, and all this higher-level stuff is essentially for the convenience of us humans; you can look one level closer to the true program. For instance, suppose you ran the same two little programs with gfortran

#!/bin/bash 
mkdir -p one
cat >one/simple.f90 <<\EOF
INTEGER   J
INTEGER   I
I = 3
J = 2*I
PRINT*,J
END
EOF
(
cd one
gfortran -S -fverbose-asm -fdump-tree-original-uid -fdump-tree-optimized-uid simple.f90
)
mkdir -p two
cat >two/simple.f90 <<\EOF
program simple
integer :: i,j

   i = 3
   j = 2 * i
   print *, j

end program simple
EOF
(
cd two
gfortran -S -fverbose-asm -fdump-tree-original-uid -fdump-tree-optimized-uid simple.f90
)

and did a diff of the files one/simple.s and two/simple.s, and the only differences you saw were comments and line numbers. Then you could be pretty certain you had not changed the results the program would produce. But that is entering a whole new world and basically leaving Fortran behind pretty quickly; and since the intermediate files are not the compiler's core purpose, just a side effect of its methods, I think the general solution is “test, test, test”.

PS: That is the same for some of the other discussions today, like “ENDIF” versus “END IF”. It is true the compiler could just find “end”, for example. But “endif” is there for humans, not for the compilers. Heck, someone could write a language with hardly a syntactic clue to be found and replace everything like SUBROUTINE and FUNCTION and END and ENDIF and ENDDO with just one or two characters, maybe “{” and “}”, but that would be C-razy :>

That reminds me, there are a lot of testing frameworks out there, and a lot of tools that are useful particularly for numeric libraries (I have a collection of my own). For various reasons I think you need more than one approach, but it would be useful to have some samples gathered in one place, like the Fortran Wiki or fortran-lang, and ideally for me in an fpm package on GitHub showing working examples of numeric diff utilities, routines for comparing float values using different criteria, what some of those goofy Fortran intrinsics like EPSILON are actually good for, and logging tools.

I have gotten differences with the same code depending on the platform, compiler, and compiler switches; and only testing (or “user testing”, as some like to think of bug reports) is really going to systematically find the changes.

To the list above I can add
(6) continuation lines vs. the same code on a single line
(7) :: in declarations
(8) old vs new comparison operators, for example .LT. vs <
(9) one line if vs. if block with one statement in the body
(10) print * vs. write (*,*)

It turns out that the abstract syntax tree produced by gfortran -fdump-fortran-original is invariant to these style choices, so that both

C     FIXED SOURCE FORM
      IMPLICIT NONE
      INTEGER N
      INTEGER ISUM
      INTEGER I
      PARAMETER (N=5)
      PRINT*,N
      ISUM=0
      DO I=1,N
      IF (ISUM.LT.8) ISUM=ISUM+I
      ENDDO
      PRINT*,ISUM
      END

and

! free source form
implicit none
integer, parameter :: n = 5
integer :: i,isum
write (*,*) n
isum = 0
do i=1,n
   if (isum < 8) then
       isum = isum +  & ! ugly, but I wanted to test a continuation line
                        i
   end if
end do
write (*,*) isum
end

give the same AST

Namespace: A-Z: (UNKNOWN 0)
procedure name = MAIN__
  symtree: 'MAIN__'      || symbol: 'MAIN__'       
    type spec : (UNKNOWN 0)
    attributes: (PROGRAM PUBLIC  SUBROUTINE IS-MAIN-PROGRAM)
  symtree: 'i'           || symbol: 'i'            
    type spec : (INTEGER 4)
    attributes: (VARIABLE )
  symtree: 'isum'        || symbol: 'isum'         
    type spec : (INTEGER 4)
    attributes: (VARIABLE )
  symtree: 'n'           || symbol: 'n'            
    type spec : (INTEGER 4)
    attributes: (PARAMETER IMPLICIT-SAVE)
    value: 5

  code:
  WRITE UNIT=6 FMT=-1
  TRANSFER 5
  DT_END
  ASSIGN MAIN__:isum 0
  DO MAIN__:i=1 5 1
    IF (< MAIN__:isum 8)
      ASSIGN MAIN__:isum (+ MAIN__:isum MAIN__:i)
    ENDIF
  END DO
  WRITE UNIT=6 FMT=-1
  TRANSFER MAIN__:isum
  DT_END

So for some style changes one can check that the AST is unchanged to verify that the meaning of the code has been preserved. The converse does not hold: the AST may change without changing the meaning of the code.

If I remove implicit none the Namespace line of the AST becomes

Namespace: A-H: (REAL 4) I-N: (INTEGER 4) O-Z: (REAL 4)

Replacing do ... end do with do ... continue terminated by a numbered line also changes the AST.

You need a specification. What is the program’s behavior supposed to be? If you have that, you should be able to write a test suite that covers all of the aspects of the specification. If you make changes to the source code and that test suite still passes, you haven’t changed the program’s behavior, at least as far as what it’s supposed to do.

Granted, it’s really hard (some say impossible for certain classes of problem) to come up with a comprehensive enough specification and test suite, but it’s the only way to prove that the code behaves “correctly”. And the Quality Assurance requirements in the nuclear industry arguably dictate that you have to do that. In most cases, even if your goal is to “prove” that the code is correct, what you really end up with is a high level of confidence that it isn’t wrong. It’s at least not wrong for all the things you tested for.

TL;DR: It’s not possible to prove that two programs are behaviorally equivalent; you can only demonstrate, to some degree of confidence, that they aren’t different in ways that matter.