Bug Hunting in Old F77 Code

Old Fortran codes did not always conform to the Fortran standard in force at the time they were written, and such deviations did not bother users as long as the codes ran correctly on the machines of the day. During the last few days I tangled with one such bug, and I felt it would be useful to record some highlights of the experience.

The code in question is Hairer’s stiff ODE solver RADAU5. Ernst Hairer is a distinguished mathematician, and is the author of a superb 2-volume treatise on differential equations, the second volume of which is devoted to stiff ODEs. Ernst Hairer’s son, Martin Hairer, won the 2014 Fields Medal.

The three source files needed are dr1_radau5.f, radau5.f and dc_lapack.f, which you can download from Hairer’s site.

Compiling the three files with the Lapack and BLAS libraries and running the resulting program produces the following output (a few digits may differ, depending on which compiler you use, etc.):

 X = 0.00    Y =  0.2000000000E+01 -0.6600000000E+00    NSTEP =   0
 X = 0.20    Y =  0.1858198964E+01 -0.7574791034E+00    NSTEP =  10
 X = 0.40    Y =  0.1693205231E+01 -0.9069021617E+00    NSTEP =  11
 X = 0.60    Y =  0.1484565763E+01 -0.1233096965E+01    NSTEP =  13
 X = 0.80    Y =  0.1083912895E+01 -0.6196077501E+01    NSTEP =  22
 X = 1.00    Y = -0.1863645642E+01  0.7535323767E+00    NSTEP = 123
 X = 1.20    Y = -0.1699724325E+01  0.8997692747E+00    NSTEP = 124
 X = 1.40    Y = -0.1493375073E+01  0.1213880891E+01    NSTEP = 126
 X = 1.60    Y = -0.1120780441E+01  0.4374794949E+01    NSTEP = 133
 X = 1.80    Y =  0.1869050586E+01 -0.7495753072E+00    NSTEP = 237
 X = 2.00    Y =  0.1706161101E+01 -0.8928074777E+00    NSTEP = 238
 X = 2.00    Y =  0.1706161101E+01 -0.8928074777E+00
       rtol=0.10D-03
 fcn= 2218 jac= 161 step= 275 accpt= 238 rejct=  8 dec= 248 sol=  660

Everything seems in order. The test problem (van der Pol equation) involves only two variables, and the program runs in less than a second.

However, I wish to use this ODE solver on a stiff ODE with 400 variables (Medakzo). Therefore, I wanted to identify the part of the code where most of the run time is spent, and improve the code by using modern Fortran or by calling fast MKL routines as replacements for expensive DO loops.

Profiling the code (still with the van der Pol problem) showed that the two busiest routines were FVPOL, which evaluates the right-hand side of the specific ODE system, and SLVRAD, which forms and solves a dense set of linear equations using Lapack. There is not much to be done to speed up FVPOL, and the number of times it gets called is controlled by the ODE solver algorithm. That took me into the source for SLVRAD, which is where my troubles began.

The section of code that forms the arguments to pass to the Lapack linear solver pair DGETRF/DGETRS is the following:

      DO I=1,N
         S2=-F2(I)
         S3=-F3(I)
         Z1(I)=Z1(I)-F1(I)*FAC1
         Z2(I)=Z2(I)+S2*ALPHN-S3*BETAN
         CONT(I)=Z3(I)+S3*ALPHN+S2*BETAN
      END DO
      DO I=N,1,-1
         Z2(2*I-1)=Z2(I)
         Z2(2*I)=CONT(I)
      END DO

All the arrays that you see here are dummy arguments. I felt that I could combine the two loops and eliminate the temporary variables, using this equivalent code:

      do i = n, 1, -1
         z1(i) = z1(i) - f1(i) * fac1
         z2(2*i)   = z3(i) - f3(i) * alphn - f2(i) * betan
         z2(2*i-1) = z2(i) - f2(i) * alphn + f3(i) * betan
      end do

The program output changed to

 X = 0.00    Y =  0.2000000000E+01 -0.6600000000E+00    NSTEP =   0
 X = 0.20    Y =  0.1899165797E+01 -0.7285341796E+00    NSTEP =1009
 X = 0.40    Y =  0.1787641272E+01 -0.8141696660E+00    NSTEP =2635
 X = 0.60    Y =  0.1660762839E+01 -0.9446167743E+00    NSTEP =5042
 X = 0.80    Y =  0.1556660519E+01 -0.1093780715E+01    NSTEP =****
 EXIT OF RADAU5 AT X=        0.9004E+00
  MORE THAN NMAX =      100000 STEPS ARE NEEDED
 X = 0.90    Y =  0.1556659825E+01 -0.1093786486E+01
       rtol=0.10D-03
 fcn=***** jac=**** step=**** accpt=**** rejct=*** dec=**** sol=*****

What happened? Oh, the arrays Z2 and Z3 are of formal size N, but the code treats Z2 as if it were of size 2*N. Yet, the original code ran fine on a number of test problems, whereas my “improved” version fails!

Unable to see why the seemingly correct replacement code caused the program to fail, I resorted to my favorite debugging tool: the PRINT statement. I added this before my replacement fused loop:

      write(*,'(A,2ES12.4)')'Before loop, Z3 = ',z3(1:n)

and this after the same loop:

      write(*,'(A,2ES12.4)')'After  loop, Z3 = ',z3(1:n)
      stop

The output after making these changes became:

 X = 0.00    Y =  0.2000000000E+01 -0.6600000000E+00    NSTEP =   0
Before loop, Z3 =  -9.7219E-01 -2.9460E+04
After  loop, Z3 =   8.0596E+04 -2.9460E+04

How can this be? Z3 appears only to the right of the = sign in the replacement loop. How did it get changed? Was it because of array bounds being crossed, or aliasing, or both?

[To be continued; please post your comments and suggestions if the topic interests you. Thanks in advance!]

3 Likes

SciPy has its own Python implementation of the Radau IIA method located here: scipy/radau.py at v1.8.0 · scipy/scipy · GitHub. Perhaps you can use it to cross-check the algorithm.

What strikes me as odd is that in Python, the first column uses slightly different coefficients:

T = np.array([
    [0.09443876248897524, -0.14125529502095421, 0.03002919410514742],
    [0.25021312296533332, 0.20412935229379994, -0.38294211275726192],
    [1, 1, 0]])

and in Fortran (radau5.f):

      T11=9.1232394870892942792D-02
      T12=-0.14125529502095420843D0
      T13=-3.0029194105147424492D-02
      T21=0.24171793270710701896D0
      T22=0.20412935229379993199D0
      T23=0.38294211275726193779D0
      T31=0.96604818261509293619D0

while in Julia's OrdinaryDiffEq.jl (file src/tableaus/firk_tableaus.jl) the coefficients match Hairer's Fortran code:

function RadauIIA5Tableau(T,T2)
  T11  = convert(T, 9.1232394870892942792e-02)
  T12  = convert(T, -0.14125529502095420843e0)
  T13  = convert(T, -3.0029194105147424492e-02)
  T21  = convert(T, 0.24171793270710701896e0)
  T22  = convert(T, 0.20412935229379993199e0)
  T23  = convert(T, 0.38294211275726193779e0)
  T31  = convert(T, 0.96604818261509293619e0)

I haven’t looked at Hairer’s textbook to see what it says. The Julia implementation as a second cross-check can be found here: OrdinaryDiffEq.jl/firk_perform_step.jl at master · SciML/OrdinaryDiffEq.jl · GitHub

Perhaps the Fortran version is not the only one affected by bugs :crazy_face:

PS: If it is of any use for cross-checking, a driver for the van der Pol equation solved by the SIRK3 method is located here: GitHub - ivan-pi/stiff3: Adaptive solver for stiff systems of ODEs using semi-implicit Runge-Kutta method of third order. The same Lapack routines are required.

3 Likes

I would check to see if Z2 is a real array aliased with a complex array either through an EQUIVALENCE statement or through argument passing. The Fortran storage equivalence rules allow a 2N real array to overlay an N complex array.
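For anyone who has not run into that rule, here is a minimal stand-alone sketch of the EQUIVALENCE form (hypothetical names, default real/complex kinds for brevity; the RADAU5 case would involve DOUBLE PRECISION and COMPLEX*16, but the storage-association idea is the same):

      PROGRAM EQUIV_DEMO
C     Storage association: a length-2N real array overlays a length-N
C     complex array, so R(2*I-1) is the real part of C(I) and R(2*I)
C     is the imaginary part.
      INTEGER N
      PARAMETER (N=3)
      REAL R(2*N)
      COMPLEX C(N)
      EQUIVALENCE (R(1), C(1))
      C(1) = (1.0, 2.0)
      PRINT *, R(1), R(2)
      END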

3 Likes

A search for “equivalence” gives no hits in radau5.f or dc_lapack.f.

The “Z” arrays are sections of the array work declared in radau5.f:

C vvv L373 vvv
      IMPLICIT DOUBLE PRECISION (A-H,O-Z)
      DIMENSION Y(N),ATOL(*),RTOL(*),WORK(LWORK),IWORK(LIWORK)

The entry points don’t appear to alias:

C vvv L593 vvv
C ------- PREPARE THE ENTRY-POINTS FOR THE ARRAYS IN WORK -----
      IEZ1=21
      IEZ2=IEZ1+N
      IEZ3=IEZ2+N

The call to the core integrator is:

C vvv L629 vvv
C -------- CALL TO CORE INTEGRATOR ------------
      CALL RADCOR(N,FCN,X,Y,XEND,HMAX,H,RTOL,ATOL,ITOL,
     &   JAC,IJAC,MLJAC,MUJAC,MAS,MLMAS,MUMAS,SOLOUT,IOUT,IDID,
     &   NMAX,UROUND,SAFE,THET,FNEWT,QUOT1,QUOT2,NIT,IJOB,STARTN,
     &   NIND1,NIND2,NIND3,PRED,FACL,FACR,M1,M2,NM1,
     &   IMPLCT,JBAND,LDJAC,LDE1,LDMAS2,WORK(IEZ1),WORK(IEZ2),
     &   WORK(IEZ3), ...

Inside of RADCOR there doesn’t appear to be any aliasing either:

C vvv L667 vvv
      SUBROUTINE RADCOR(N,FCN,X,Y,XEND,HMAX,H,RTOL,ATOL,ITOL,
     &   JAC,IJAC,MLJAC,MUJAC,MAS,MLMAS,MUMAS,SOLOUT,IOUT,IDID,
     &   NMAX,UROUND,SAFE,THET,FNEWT,QUOT1,QUOT2,NIT,IJOB,STARTN,
     &   NIND1,NIND2,NIND3,PRED,FACL,FACR,M1,M2,NM1,
     &   IMPLCT,BANDED,LDJAC,LDE1,LDMAS,Z1,Z2,Z3,
     &   Y0,SCAL,F1,F2,F3,FJAC,E1,E2R,E2I,FMAS,IP1,IP2,IPHES,
     &   CONT,NFCN,NJAC,NSTEP,NACCPT,NREJCT,NDEC,NSOL,RPAR,IPAR)
C ----------------------------------------------------------
C     CORE INTEGRATOR FOR RADAU5
C     PARAMETERS SAME AS IN RADAU5 WITH WORKSPACE ADDED 
C ---------------------------------------------------------- 
C         DECLARATIONS 
C ---------------------------------------------------------- 
      IMPLICIT DOUBLE PRECISION (A-H,O-Z)
      DIMENSION Y(N),Z1(N),Z2(N),Z3(N)

So it looks like crossing array bounds with the 2*i and 2*i-1 indexes is the true culprit.

2 Likes

[Bug Hunting, Continued]

Thanks to @ivanpribec and @wclodius for their perceptive analyses of the problem.

The problem is caused by the fact that Z2(N+1:2*N) occupies the same memory as Z3(1:N), but this can become known only by following the call chain up to the place where Z2 and Z3 are allocated (by computing indices into portions of a large combined work array). Thus, setting values into Z2(N+1:2*N) changes Z3. Not only is the upper array bound of Z2 crossed, but values in Z3 are being changed through the name Z2, which the rules of Fortran forbid. Now we can appreciate why the rule is beneficial.
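To make the overlap concrete, here is a small stand-alone sketch (a hypothetical program, not part of RADAU5, and deliberately nonconforming in the same way as the original code) of how passing WORK(IEZ2) and WORK(IEZ3) as the actual arguments produces dummy arrays whose storage overlaps:

program aliasing_demo
   ! Sketch only: sequence association of WORK slices makes Z2(N+1) and
   ! Z3(1) the same storage location. Writing Z2(N+1) is an out-of-bounds
   ! access, exactly the nonconformance discussed above.
   implicit none
   integer, parameter :: n = 3
   double precision :: work(20 + 3*n)
   integer :: iez1, iez2, iez3
   iez1 = 21
   iez2 = iez1 + n
   iez3 = iez2 + n
   work = 0.0d0
   call show(n, work(iez2), work(iez3))
contains
   subroutine show(n, z2, z3)
      integer, intent(in) :: n
      double precision :: z2(n), z3(n)
      z2(n+1) = 99.0d0              ! past the end of Z2 ...
      print *, 'z3(1) =', z3(1)     ! ... and Z3(1) now reads 99.0
   end subroutine show
end program aliasing_demo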

What the old code did is to utilize the array CONT as scratch space. Its use, along with the two DO loops in the old code, caused the old values of Z3 to be used to set all the N elements of CONT in the first loop. After that first loop was completed, the old Z3 values were no longer needed, so the clobbering of Z3 by Z2(N+1:2*N) in the second loop was harmless. After the call to DGETRS in the old code, new values were put into the Z3 array, and everything worked correctly.

My modification (combining the two DO loops into one) failed because, thinking that Z2 and Z3 were different variables, and thinking that because the old code worked the array overrun was not harmful, I rearranged the order of calculation. My rearrangement would have been correct only if Z2 and Z3 had been distinct, i.e., did not overlap.

The Fortran rules against aliasing give us the freedom to think of Z2 and Z3 as distinct variables. If those rules are violated, strange things can happen, and seemingly harmless changes such as my modification can cause the program to malfunction in mysterious ways!

So, what is the fix? Change the allocation of Z2 to the required size, 2N, by changing the statement

IEZ3=IEZ2+N

in radau5.f to

IEZ3=IEZ2+2*N

and increase the size of work array WORK by N, by changing

4 * nd * nd + 12 * nd + 20

to

4 * nd * nd + 13 * nd + 20
3 Likes

I suppose the extra nd elements are negligible in comparison with the nd**2 part.

I was wondering this morning if you couldn’t just replace the parts which reference values Z2(N+1:2*N) with Z3 directly?

After taking a better look, I noticed @wclodius was right about aliasing through argument passing. The reason the array bounds of Z2 are crossed is that Z2 is used as a complex array just after the loops in question:

C vvv L898 - dc_lapack.f vvv
C ---  B=IDENTITY, JACOBIAN A FULL MATRIX
      DO I=1,N
         S2=-F2(I)
         S3=-F3(I)
         Z1(I)=Z1(I)-F1(I)*FAC1
         Z2(I)=Z2(I)+S2*ALPHN-S3*BETAN
         CONT(I)=Z3(I)+S3*ALPHN+S2*BETAN
      END DO
      DO I=N,1,-1
         Z2(2*I-1)=Z2(I)
         Z2(2*I)=CONT(I)
      END DO
      CALL DGETRS ('No transpose',N,1,E1,LDE1,IP1,Z1,N,IER)
      CALL ZGETRS ('No transpose',N,1,E2R,LDE1,IP2,Z2,N,IER)
      DO I=1,N
         CONT(I)=Z2(2*I)
         Z2(I)=Z2(2*I-1)
      END DO
      DO I=1,N
         Z3(I)=CONT(I)
      END DO
      RETURN

For the right-hand side of the complex system of equations, Z2 in this case, zgetrs expects an argument of type

		complex*16, dimension( ldb, * ) 	B

The indexes 2*i-1 and 2*i access the real and imaginary parts of that complex array. (I'm assuming a complex array is laid out as interleaved pairs r1, i1, r2, i2, …) The loops before the call to ZGETRS pack the right-hand side from a split format (real parts in Z2, imaginary parts in CONT) into that interleaved layout, and the loops afterwards unpack the interleaved solution back into Z2 and Z3.
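As a quick sanity check of that layout assumption, here is a tiny stand-alone program (not part of RADAU5) that reinterprets a COMPLEX*16 array as DOUBLE PRECISION:

program complex_layout
   ! Checks that COMPLEX*16 storage is interleaved (re, im) pairs, so a
   ! DOUBLE PRECISION view of the same storage reads r1, i1, r2, i2, ...
   implicit none
   complex*16       :: c(2)
   double precision :: r(4)
   c = [ (1.0d0, 2.0d0), (3.0d0, 4.0d0) ]
   r = transfer(c, r)            ! reinterpret the storage as 4 reals
   print '(4F6.1)', r            ! expected: 1.0 2.0 3.0 4.0
end program complex_layout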

1 Like

On a different note, for your system of 400 variables, is the Jacobian sparse or banded? The work array section of size 4*400*400 seems like an awful lot of memory.

Both Julia and SciPy allow passing sparse matrices in CSC format, if I'm not mistaken.

1 Like

@ivanpribec,

The old code has other features that may cause problems. The call to SLVRAD in radau5.f contains two double precision 2-D real arrays as actual arguments: E2R and E2I. The corresponding dummy arguments in SLVRAD are a complex*16 array E2R and a real*8 dummy scalar! For this to work, the array E2I must immediately follow E2R in memory, and you have to give up on interface checking.

The Radau5 code can be used with Lapack routines to do the linear algebra, in which case there are some complex variables, or with a “house” linear algebra package (Decsol), which uses only real variables. Very clever, and perhaps a good solution in the early 1970s, when Lapack and BLAS had not yet been born, but troublesome these days.

The Radau code does provide for dense or banded mass and jacobian matrices. For the Medakzo problem, which arises from the spatial discretization of the 1-D diffusion equation with two species, the jacobian has upper and lower bandwidth of 2. I had intended to use the dense solver with a smaller value of N such as 40 before turning to the banded form for larger N.

1 Like

Unfortunately, no. If you look at the code, you will notice that in SLVRAD the array argument Z3 has three roles:

  • an input array, used in forming one of the arguments to ZGETRS,

  • an output array, returned to the caller near the end, and

  • scratch space.

One has to be clear about which of these roles is being played when attempting to improve or speed up the code. The old code kept these roles straight even while Z3 was being clobbered through Z2! The amount of memory available on the old mainframes was so small (128 kwords on the CDC 6600; up to 8 MB on the IBM 360) that such practices were almost necessary to get real work done.

2 Likes

Would changing "IEZ3=IEZ2+N" to "IEZ3=IEZ2+N*2" change the results in the original code, where two DO loops are used?
I would expect not, since all this should do is waste space in the WORK array, unless the aliasing of Z2 and Z3 in the original code is important.

The original second DO loop addresses Z2 outside its declared bounds, overwriting Z3 (possibly only after Z3 has been used); this should have been documented.

      DO I=N,1,-1
         Z2(2*I-1)=Z2(I)
         Z2(2*I)=CONT(I)
      END DO

Is this use of Z2(N+1:2*N) intended to generate the interleaved complex value pairs for Z2? That is more "complex" memory mapping than I have used before in old FORTRAN.

The assumption in the first post, "I felt that I could combine the two loops and eliminate the temporary variables, using this equivalent code:", appears to be invalid, although using "IEZ3=IEZ2+N*2" might fix this?

1 Like

I think that the most serious error in the old code was requiring the actual arguments for Z2 and Z3 to have a particular memory layout (Z3 immediately following Z2), without making that requirement known in comments.

1 Like

However, the rearranged code segment no longer uses CONT for scratch space, so it may be possible to remove CONT as an argument to SLVRAD; there would then be no net change in the memory needed. One would have to double-check that CONT is not required elsewhere, or when different values of IOPT are passed.

1 Like

A significant use of WORK was to provide temporary arrays of arbitrary size.
Where their use is local to a routine, these can now (F90+) be replaced by automatic or allocatable arrays, depending on their size relative to the stack size.
This was a significant improvement in F90+, although many operating systems failed to provide a larger default stack to help with this change.
Using automatic or allocatable arrays in a routine can simplify the understanding and management of WORK, unless there are undocumented uses of these arrays beyond the routine.
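For instance, a routine that currently receives scratch space carved out of WORK could size its own scratch from the argument list. A rough sketch (a hypothetical routine, not Hairer's actual interface):

subroutine pack_unpack_rhs(n, zre, zim)
   ! Sketch only: an automatic array replaces the slice of WORK that the
   ! caller would otherwise have had to provide.
   implicit none
   integer, intent(in)             :: n
   double precision, intent(inout) :: zre(n), zim(n)   ! real and imaginary parts
   double precision :: z2c(2*n)      ! automatic scratch, sized 2*n locally
   integer :: i
   do i = 1, n                       ! pack into the interleaved complex layout
      z2c(2*i-1) = zre(i)
      z2c(2*i)   = zim(i)
   end do
   ! ... a solver such as ZGETRS would operate on z2c here ...
   do i = 1, n                       ! unpack the interleaved result
      zre(i) = z2c(2*i-1)
      zim(i) = z2c(2*i)
   end do
end subroutine pack_unpack_rhs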

2 Likes

Yes, and using local allocatable and automatic arrays will also cure another bad property of old code: too many subprogram arguments, many of which are simply scratch space. If such a passed-around work array is of the wrong size or needs to be changed in size because of a change to an algorithm, changes have to be made in many places. Too many opportunities to err.
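One way to shrink such argument lists, sketched below with hypothetical names (this is not part of RADAU5), is to gather the scratch arrays into a derived type and pass that single object around:

module radau_workspace
   ! Sketch only: one derived-type object replaces many separate scratch
   ! array arguments.
   implicit none
   type :: rad_work
      double precision, allocatable :: z1(:), z2(:), z3(:), cont(:)
   end type rad_work
contains
   subroutine make_work(w, n)
      type(rad_work), intent(out) :: w
      integer, intent(in) :: n
      ! z2 gets 2*n elements, making the size requirement discussed above explicit
      allocate(w%z1(n), w%z2(2*n), w%z3(n), w%cont(n))
   end subroutine make_work
end module radau_workspace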

Some old codes used a Reverse Communication Interface (RCI); MKL's TRNLSP still uses this style. An RCI library routine has to go through the chore of mincing up work space into dozens of arguments to pass down to other subprograms. I have seen cases where such mincing is needed only once, but is repeated thousands of times needlessly.

1 Like

Hi @mecej4 ,

have you managed to refactor the original Fortran code? If so, would you agree to share your code somewhere (GitHub…)? I am currently developing a DAE-compatible (differential-algebraic equations) version of SciPy's Radau, and I am comparing it to the original Fortran code. The modified SciPy code is available here: GitHub - laurent90git/DAE-Scipy: Modifications of Scipy's implicit solvers for the solution of differential-algebraic equations (DAEs).

Regards

1 Like

Great post.
Just curious: after the modifications, is there any speed improvement?

The work array was usually only defined in one place, either in the main program or in an include file.

Typically, I defined two arrays (8-byte real and 4-byte integer) and equivalenced them, so that an argument of the appropriate type could be passed in each subroutine call and needless compiler warnings about mixed-type arguments were minimised. All work-array subscripts had to be carefully matched to the array type being used, which we were capable of getting right.
All these subscript calculations can be placed in a "mem_alloc" routine where this is managed, which basically emulated the ALLOCATE that was to follow; a bare-bones sketch follows.
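Something along these lines (hypothetical names, just to illustrate the idea; real versions also tracked freeing, alignment, and the integer/real split):

module mem_alloc_demo
   ! Sketch only: hand out offsets into one shared work array, emulating
   ! what ALLOCATE later provided natively.
   implicit none
   integer, parameter :: lwork = 10000
   double precision   :: work(lwork)
   integer            :: next = 1              ! first free element of work
contains
   integer function mem_alloc(nelem)
      integer, intent(in) :: nelem
      if (next + nelem - 1 > lwork) stop 'work array exhausted'
      mem_alloc = next        ! caller uses work(mem_alloc : mem_alloc+nelem-1)
      next = next + nelem
   end function mem_alloc
end module mem_alloc_demo

A caller would then write, say, iez2 = mem_alloc(2*n) and index work(iez2) onward, much as RADAU5 does by hand.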

Looking back at these codes, it is amazing how small the work array was.

1 Like