Help me find this gfortran bug for ARM machines

I’m running an older Fortran code on MacOS Big Sur with apple silicon. There is a compiler bug that I can not identify specifically. I’d like to find it, so I can make sure the GNU developers know about it.

Here is the code: GitHub - Nicholaswogan/ImpactAtmosphere at dev

compile with cmake in the standard way:

mkdir build
cd build
cmake ..
make

Run the test

./test

With Gfortran on Apple silicon I get

(base) nicholas@Nicholass-MacBook-Air build % ./test                             
 PRINT1:    1.6867517999999999E+028
 PRINT2:    2.5579538487363607E-013
 PRINT3:    2.5579538487363607E-013

This is wrong! The value should never change. When I run with Intel Fortran or gfortran on a x86 machine I get

(base) nicholas@ build % ./test                           
 PRINT1:    1.6867517999999999E+028
 PRINT2:    1.6867517999999999E+028
 PRINT3:    1.6867517999999999E+028

Here are the print statements in context:

    double precision, dimension(10), intent(in) :: y
    double precision, dimension(10), target :: N
    double precision, pointer :: N_ptr(:)

    ! more code here ....

    do i=1,nspecies
      N(i) = max(y(i),0.d0)
    enddo
    N_ptr => N
    NNN = N ! global NNN
    x(1) = dlog(tau_uv_init) ! intial condtions
    
    print*,'PRINT1: ',N_ptr(1)
    ! solve nonlinear system
    call lmdif2(fcn2,1,1,x,fvec,tol,info,iwa,wa,lwa,10000)
    print*,'PRINT2: ',N_ptr(1)
    if ((info.ne.1) .and. (info.ne.2) .and. (info.ne.3) .and. (info.ne.4)) then
      print*,'Non-linear solver failed in subroutine rhs ',info
      ierr = .true.
    endif
    tau_uv_init = dexp(x(1))

    print*,'PRINT3: ',N_ptr(1)

The subroutine takes y as input. y is copied to N. N_ptr points to N. Finally N is copied to NNN, which a is a global module variable. This is done to “pass” N to another function. Then the code makes a call to MINPACK (lmdif2).

The confusing thing is that N is local to this function (so is N_ptr), so it should not be affected by the call to MINPACK. But it is!!

I am not able to reproduce this behavior with a simpler code. Hoping someone knows this bug or can spot it, so we can send it to those who know how to fix these things.

This bug persists for MacOS ARM and for a ARM Ubuntu virtual machine run on MacOS ARM.

(base) nicholas@ build % gfortran --version
GNU Fortran (Homebrew GCC 11.2.0_3) 11.2.0

Thanks for any ideas!

Are you sure that lmdif2 is bug free? If it has an out of bounds access, it might write to the address of N. Did you run with debug options and/or valgrind?

The procedure argument passed to LMDIF itself ends up calling LMDIF. You have tried to circumvent recursion by using two copies of the Netlib lmdif1.f, each of those calls subroutine LMDIF, and the compiler does not know that subroutine LMDIF itself is recursive. You have to unravel this recursion and add the RECURSIVE attribute as appropriate.

I get similar incorrect output when I use -O2 with Gfortran 11.2 on Windows, so the problem is not specific to MAC-ARM. Compiling source file EvolveAtmFort.f90 with -O instead of -O3 or -O2 appears to suffice to restore the output to what is expected.

1 Like

@kargl . There are experimental releases for MacOS Apple M1. The goal here is to help identify those bugs.

@MarDie valgrind says everything is OK.

@mecej4 I think you are right. There is some recursion going on, which might be messing things up. I’m going to look more carefully at this.

Here are three errors that I found. There are three places in EvolveAtmFort.f90 where workspace arrays WA are allocated. Those should have been declared as type double precision, whereas you have declared them as type integer. As a consequence, the workspace passed in the second and third calls is smaller than what is needed, so some clobbering occurs in the Minpack routines.

Change the three declarations as just stated, and check that your Fortran compiler can compile all routines as recursive even without the attribute having been declared, or look for a compiler option to make all procedures recursive. You can then do away with lmdif2.f and replace all calls to lmdif2 with calls to lmdif1.

On Windows, after I made these fixes and changes, I compiled using Gfortran 11.2 with -O3, and the resulting program behaved normally. The problem that you encountered was caused by programmer error, not a compiler bug.

2 Likes

@mecej4 thanks so much! This is really helpful. After fixing the wa variable, it works on my MacOS M1 machine. I should have written module interfaces for the old minpack routine. This bug would have been caught by the compiler if I had.

I was suspicious of compiler bug because I found one recently: Why am I getting a segmentation fault here (M1 Mac)? - #3 by nicholaswogan

1 Like

Another observation about the algorithmic choices in the program may be helpful; it has nothing to do with the bug caused by the under-allocation of the work array wa.

You are calling lmdif2() from rhs and rhs_verbose. The purpose of each of these calls is to solve a single (scalar) nonlinear equation in one unknown (m=n=1). This is similar to calling the Lapack routine DGESV to solve a single linear equation ax = b in one unknown, instead of setting x = b/a inline. It would be far more efficient to use a more appropriate solver such as FZERO from the Slatec library. If you did so, you would also do away with the recursive calls to the Minpack routines.

1 Like

You don’t need to use the old minpack routines anymore. See:

Also, if you want fzero (or any other root solver), you can use:

Both are available via FPM and all procedures are present in a module, so you get interface checking for free.

3 Likes

You can also use our modern Fortran wrappers for the minpack routines here: GitHub - certik/minpack: Library for solving nonlinear equations and nonlinear least squares problems

1 Like

5 posts were split to a new topic: Moving minpack under fortran-lang

One of the non-linear solves has 1 unknown and the other has 3 unknowns. So I could apply fzero to the one unknown solve.