Do functions check the size/rank of input arrays? If not, why?

Consider this function:

   pure function so2nn_inlined(Hso,Nspin,Norb,L) result(Hnn)
      !! Local version of so2nn_reshape, doing all iω at once
      !! > with high optimization levels it should be inlined
      integer,intent(in)    :: Nspin,Norb,L
      complex(8),intent(in) :: Hso(Nspin*Norb,Nspin*Norb,L)
      complex(8)            :: Hnn(Nspin,Nspin,Norb,Norb,L)
      integer               :: io,is,istride
      integer               :: jo,js,jstride
      do concurrent(is=1:Nspin,js=1:Nspin,io=1:Norb,jo=1:Norb)
         istride = io + (is-1)*Norb  !spin-orbit stride
         jstride = jo + (js-1)*Norb  !spin-orbit stride
         Hnn(is,js,io,jo,:) = Hso(istride,jstride,:)
      end do
   end function so2nn_inlined

The main focus here is on the input array Hso, which has shape [Nspin*Norb, Nspin*Norb, L].

Now consider this calling code:

function grad_delta_replica(a) result(dDelta)
      real(8),dimension(:)                                        :: a
      complex(8),dimension(Nspin,Nspin,Norb,Norb,Ldelta,size(a))  :: dDelta
      .
      .
      .
      complex(8),dimension(Nspin*Norb,Nspin*Norb)                 :: Htmp
      .
      .
      .
      dDelta(:,:,:,:,:,counter)=so2nn_inlined(Htmp,Nspin,Norb,Ldelta)
      .
      .
      .
end function grad_delta_replica

where Nspin, Norb, and Ldelta are module variables, so are known to grad_delta_replica.

Many of you will probably spot immediately that I am calling so2nn_inlined incorrectly: Htmp has the wrong shape (even the wrong rank!). Nevertheless, gfortran does not complain, and at runtime I get no error whatsoever. To be clear, in my test Ldelta=1000, so there can be no subtle confusion between [N,N,1] and [N,N] shapes.

The whole body of grad_delta_replica is actually quite big and dense with things happening (I was gradually refactoring parts of it to achieve faster execution), so I did not catch my error early and struggled a lot to recognize it. What happened was that the final result (dDelta) was “a little off” (of course this is test-case dependent; I have not been lucky here, I guess) and the whole computation just gave slightly different results than before (of the order of 1d-4 on the resulting array, which feeds a cost function whose gradients are computed from this one via the chain rule). The final hint came from hunting down some suspicious floating-point exceptions (IEEE_INVALID, specifically).

The bottom line is that this bug was hard to catch, especially since I never imagined I would need to check for a size mismatch in a function call myself; I thought the compiler would do that for me. Why is this not happening? My compiler flags are:

-O0 -p -g -Wsurprising -Waliasing -fwhole-file -fcheck=all -fbacktrace -fbounds-check -ffree-line-length-none -ffpe-trap=invalid,overflow,zero

From -fcheck=all especially, I expected sizes to be checked no matter what. What am I missing?

https://fortran-lang.org/en/learn/best_practices/arrays/

Explicit-shape arrays and assumed-size arrays do not get their size or rank checked.

That is why modern Fortran recommends assumed-shape arrays.
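
For instance, here is a minimal sketch of an assumed-shape variant of your routine (the name so2nn_shaped and the dropped L argument are just illustrative). Because the rank is part of the explicit interface, passing a rank-2 Htmp to it is rejected at compile time:

   pure function so2nn_shaped(Hso,Nspin,Norb) result(Hnn)
      integer,intent(in)    :: Nspin,Norb
      complex(8),intent(in) :: Hso(:,:,:)   ! assumed shape: rank and extents come from the actual argument
      complex(8)            :: Hnn(Nspin,Nspin,Norb,Norb,size(Hso,3))
      integer               :: io,is,jo,js
      do concurrent(is=1:Nspin,js=1:Nspin,io=1:Norb,jo=1:Norb)
         Hnn(is,js,io,jo,:) = Hso(io+(is-1)*Norb, jo+(js-1)*Norb, :)
      end do
   end function so2nn_shaped

With this interface, so2nn_shaped(Htmp,Nspin,Norb) with a rank-2 Htmp does not compile, which is exactly the kind of help you were expecting from the compiler.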

Checking the sizes would require some tests at runtime (and an explicit interface), which is something that we generally want to avoid: if the routine does only a small amount of computation and is called repeatedly, the runtime tests can severely harm performance. Note that with assumed-shape dummy arguments no size check is performed either (and it is actually impossible to check the sizes in this case).

As for the rank, it could not easily be checked back in the days when no interfaces existed. And the ability to pass a rank-1 array as an actual argument to a routine whose dummy argument is a rank-2 (or higher) array has been used extensively in the past (and is still used, even though there are now other mechanisms to achieve the same thing). With assumed-shape dummy arguments the rank is checked.
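
As an illustration of that legacy pattern (names here are just illustrative), the following is legal precisely because of storage sequence association: a rank-1 actual argument is passed to an explicit-shape rank-2 dummy, and the callee simply reinterprets the same storage as a 3x4 matrix:

   program rank_change_demo
      implicit none
      real(8) :: buf(12)
      buf = 0d0
      call fill2d(buf, 3, 4)   ! rank-1 actual, rank-2 dummy: sequence association
      print *, buf             ! all ones: the flat storage was filled as a 3x4 matrix
   contains
      subroutine fill2d(a, m, n)
         integer, intent(in)    :: m, n
         real(8), intent(inout) :: a(m, n)   ! explicit-shape dummy
         a = 1d0
      end subroutine fill2d
   end program rank_change_demo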

Fair enough, here you go:

program MWE_for_Kargl

   implicit none

   integer, parameter   :: Norb=3, Nspin=2, Lmats=1000, Nk=5

   complex(8)           :: dDelta(Nspin,Nspin,Norb,Norb,Lmats,Nk)
   complex(8)           :: Htmp(Nspin*Norb,Nspin*Norb)
   integer              :: k

   Htmp = dcmplx(0d0,0d0)

   do k=1,Nk
      dDelta(:,:,:,:,:,k)=so2nn_inlined(Htmp,Nspin,Norb,Lmats)
   enddo

contains

   pure function so2nn_inlined(Hso,Nspin,Norb,L) result(Hnn)
      !! Local version of so2nn_reshape, doing all iω at once
      !! > with high optimization levels it should be inlined
      integer,intent(in)    :: Nspin,Norb,L
      complex(8),intent(in) :: Hso(Nspin*Norb,Nspin*Norb,L)
      complex(8)            :: Hnn(Nspin,Nspin,Norb,Norb,L)
      integer               :: io,is,istride
      integer               :: jo,js,jstride
      do concurrent(is=1:Nspin,js=1:Nspin,io=1:Norb,jo=1:Norb)
         istride = io + (is-1)*Norb  !spin-orbit stride
         jstride = jo + (js-1)*Norb  !spin-orbit stride
         Hnn(is,js,io,jo,:) = Hso(istride,jstride,:)
      end do
   end function so2nn_inlined

end program MWE_for_Kargl

This actually fails at runtime (segfault), so yes, I think that is better than what I see in the bigger example. Nevertheless, it confirms that even the rank is not checked at compile time.

$ fpm @debug --target mwe
mwe.f90                                done.
mwe                                    done.
[100%] Project compiled successfully.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7fee6fb153af in ???
#1  0x4012aa in so2nn_inlined
	at test/mwe.f90:30
#2  0x401514 in mwe_for_kargl
	at test/mwe.f90:14
#3  0x401591 in main
	at test/mwe.f90:13
<ERROR> Execution failed for object " mwe "
<ERROR>*cmd_run*:stopping due to failed executions
STOP 1

Thanks a lot! This clarifies a lot for me. I had forgotten about this (and probably did not expect the caveat to apply to the rank too).

Well, this is exactly the point in my original use case: that cost function is called many, many times by the minimizer, so I indeed want to avoid wasting time on anything. That is why I avoided assumed-shape arrays (and a manual runtime check with an if construct); I thought that this way the compiler would have checked for me at build time. My bad.

This is a little perplexing to me: the so2nn_inlined function is a module procedure in my original code (or at least I thought so… it is placed in the contains section of the same module as grad_delta_replica). I thought an explicit interface was in place; am I wrong about this?

I find this interesting anyway, thanks for pointing it out. Could you give an example (or a pointer to where I can find out more)? Thanks again!

Concerning the evidence for what I claim happens in my actual use case, I cannot really dump the whole repository here, but consider this (the whole body of grad_delta_replica; you just have to trust me that so2nn_inlined is placed in the same module and is identical to what I have shown above):

   function grad_delta_replica(a) result(dDelta)
      real(8),dimension(:)                                        :: a
      complex(8),dimension(Nspin,Nspin,Norb,Norb,Ldelta,size(a))  :: dDelta,dDelta_ref
      integer                                                     :: ispin,iorb,jorb,ibath
      integer                                                     :: i,j,k,iw,counter,Nso
      complex(8),dimension(Nspin*Norb,Nspin*Norb)                 :: Htmp,Hbasis_so,Href
      real(8),dimension(Nspin*Norb)                               :: eigenH
      complex(8),dimension(Nspin*Norb,Nspin*Norb,Ldelta)          :: invHso,Haux
      complex(8),dimension(Nspin,Nspin,Norb,Norb,Ldelta)          :: invHnn,invHref
      real(8),dimension(Nbath)                                    :: dummy_Vbath
      type(nsymm_vector),dimension(Nbath)                         :: dummy_lambda
      !
      Nso = Nspin*Norb
      !
      !Get Hs
      counter = 0
      do ibath=1,Nbath
         allocate(dummy_lambda(ibath)%element(Nlambdas))
         !
         counter = counter + 1
         dummy_vbath(ibath) = a(counter)
         dummy_lambda(ibath)%element=a(counter+1:counter+Nlambdas)
         counter=counter+Nlambdas
      enddo
      !
      dDelta=zero
      dDelta_ref=zero
      counter=0
      do ibath=1,Nbath
         Htmp= nn2so_reshape( Hreplica_build(dummy_lambda(ibath)%element) ,Nspin,Norb)
         !
         !! OLD IMPLEMENTATION: loop over many inversions
         !!
         do iw=1,Ldelta
            Haux(:,:,iw) = zeye(Nspin*Norb)*xi*Xdelta(iw) - Htmp
            call inv(Haux(:,:,iw))
            !invHnn(:,:,:,:,iw) = so2nn_reshape(Haux(:,:,iw),Nspin,Norb)
            invHref(:,:,:,:,iw) = so2nn_reshape(Haux(:,:,iw),Nspin,Norb)
         enddo
         !!
         !! NEW IMPLEMENTATION: prediagonalization and optimized matrix products
         !
         call eigh(Htmp,eigenH,jobz='V',uplo='U')
         !
         do concurrent(i=1:Nso, j=1:Nso, iw=1:Ldelta)
            invHso(i,j,iw) = sum(Htmp(i,:)/(xi*Xdelta(iw)-eigenH(:))*conjg(Htmp(j,:)))
         end do
         !
         !Nso 2 NN reshape (local scope, all iω at once)
         invHnn = so2nn_inlined(invHso,Nspin,Norb,Ldelta)
         !
         if(any(abs(invHnn-invHref)>1d-18))then
            call splot('invHnn.dat',Xdelta,invHnn)
            call splot('invHref.dat',Xdelta,invHref)
            error stop "invHnn =/= invHref"
         endif
         !
         !Derivate_Vp
         counter = counter + 1
         dDelta(:,:,:,:,:,counter)=2d0*dummy_Vbath(ibath)*invHnn(:,:,:,:,:)
         dDelta_ref(:,:,:,:,:,counter)=2d0*dummy_Vbath(ibath)*invHref(:,:,:,:,:)
         if(any(abs(dDelta(:,:,:,:,:,counter)-dDelta_ref(:,:,:,:,:,counter))>1d-18))then
            call splot('dDelta.dat',Xdelta,dDelta(:,:,:,:,:,counter))
            call splot('dDelta_ref.dat',Xdelta,dDelta_ref(:,:,:,:,:,counter))
            error stop "dDelta =/= dDelta_ref"
         endif
         print *, 'Derivative of Vp passed'
         !
         !Derivate_lambda_p
         do k=1,Nlambdas
            counter = counter + 1
            Hbasis_so=nn2so_reshape(Hreplica_basis(k)%O,Nspin,Norb)
            do concurrent(iw=1:Ldelta)
               Htmp = matmul(matmul(invHso(:,:,iw) , Hbasis_so) , invHso(:,:,iw))
               Href = matmul(matmul(Haux(:,:,iw) , Hbasis_so) , Haux(:,:,iw))
               ! THESE WOULD BE THE CORRECT CALLS TO ANOTHER RESHAPE FUNCTION
               !dDelta(:,:,:,:,iw,counter)=so2nn_reshape(dummy_Vbath(ibath)**2*Htmp,Nspin,Norb)
               !dDelta_ref(:,:,:,:,iw,counter)=so2nn_reshape(dummy_Vbath(ibath)**2*Href,Nspin,Norb)
            enddo
            if(any(abs(Htmp-Href)>1d-18))then
               error stop "Htmp =/= Href"
            endif
            print *, 'Double matmul passed'
            Htmp = dummy_Vbath(ibath)**2 * Htmp
            Href = dummy_Vbath(ibath)**2 * Href
            ! THESE ARE THE TWO INCRIMINATED LINES
            dDelta(:,:,:,:,:,counter)=so2nn_inlined(Htmp,Nspin,Norb,Ldelta)
            dDelta_ref(:,:,:,:,:,counter)=so2nn_inlined(Href,Nspin,Norb,Ldelta)
            ! if(any(abs(dDelta(:,:,:,:,:,counter)-dDelta_ref(:,:,:,:,:,counter))>1d-18))then
            !   call splot('dDelta.dat',Xdelta,dDelta(:,:,:,:,:,counter))
            !   call splot('dDelta_ref.dat',Xdelta,dDelta_ref(:,:,:,:,:,counter))
            !   error stop "dDelta =/= dDelta_ref"  <-- IF UNCOMMENTED IT WOULD FAIL HERE
            ! endif
            print *, 'Reshape passed'
            ! NEVERTHELESS IF I DON'T CHECK I GET TO THIS PRINT AND EXIT THE FUNCTION
         enddo
         !
      enddo
   end function grad_delta_replica

So, as annotated in the uppercase comments, with the above code I manage to get across the whole function and exit. The result is then nasty (I guess a lot of zeros here and there), and I catch a floating-point exception downstream (in the minimizer’s inner call graph):

$ fpm @debug --target testFasterFit
minimize_krauth.f                      done.
optimize_cgfit_routines.f90            done.
minimize_sascha.f                      done.
VARS_GLOBAL.f90                        done.
mwe.f90                                done.
OPTIMIZATION.f90                       done.
BATH_AUX.f90                           done.
BATH_FIT.f90                           done.
libCGfit.a                             done.
testFasterFit.f90                      done.
testCGfit.f90                          done.
mwe                                    done.
testFasterFit                          done.
testCGfit                              done.
[100%] Project compiled successfully.
Reading bath from hamiltonian.restart
there are        9 lines in +hamiltonian.restart
                 #V_i             Lambda_i1
  1
  -9.917759089942E-02   -2.000245670544E+00
   2.390834374608E-01   -4.367278655013E-01
   1.308364827565E-01   -5.601539875953E-02
   4.908032834208E-02   -4.519910981637E-03
   5.479599716106E-02    6.086421383198E-03
   1.396752577662E-01    6.521317151957E-02
   2.406750847783E-01    4.984563931245E-01
 
( 1.00, 0.00)  ( 0.00, 0.00)  ( 0.00, 0.00)
( 0.00, 0.00)  ( 1.00, 0.00)  ( 0.00, 0.00)
( 0.00, 0.00)  ( 0.00, 0.00)  ( 1.00, 0.00)
 
At line 637 of file test/testFasterFit.f90
Fortran runtime warning: An array temporary was created for argument 'y1' of procedure 'sreada1_rc'
read:     weiss.in
read:     weiss.in
read:     weiss.in
 REFERENCE FIT:
\Chi2 fit with CG-nr and CG-weight: 1 on: weiss
 Fitted functions           9
CG: ftol updated to: 1.00E-05
CG: itmax updated to:  500
CG: istop update to:  0
 Iter,F_n =           1   4.8236829832142269E-005
        gradF:   1.1014166103377348E-005
  -2.5504306323596377E-007
  -7.9976204141721073E-005
  -3.9973400030288891E-006
   4.3119745248945053E-005
   1.1395555172914915E-004
   1.0004361834358329E-004
   1.0435552752348357E-004
  -1.0522610915076655E-004
   1.5794686669033230E-005
  -2.5433889783246183E-005
   1.3068602217302934E-004
   9.5620840808528795E-005
  -1.5100085122048181E-005
 A,B,ftol =   6.8187445330087951E-007   1.4372683502857203E-008   1.0000000000000001E-005
 Converged with (|F_n-F_n-1|/1+|F_n|), ||a_n - a_n-1||^2/1+||a_n||^2 < ftol_:   6.8187445330087951E-007   1.4372683502857203E-008
chi^2|iter =    4.755492249E-05 |     1  <--  All Orbs, All Spins
                 #V_i             Lambda_i1
  1
  -9.916657673332E-02   -2.000245925587E+00
   2.390034612567E-01   -4.367318628413E-01
   1.308796025017E-01   -5.590144320780E-02
   4.918037196042E-02   -4.415555454114E-03
   5.469077105191E-02    6.102216069867E-03
   1.396498238764E-01    6.534385754174E-02
   2.407707056191E-01    4.984412930394E-01
 
( 1.00, 0.00)  ( 0.00, 0.00)  ( 0.00, 0.00)
( 0.00, 0.00)  ( 1.00, 0.00)  ( 0.00, 0.00)
( 0.00, 0.00)  ( 0.00, 0.00)  ( 1.00, 0.00)
 
 NEW FASTER FIT:
\Chi2 fit with CG-nr and CG-weight: 1 on: weiss
 Fitted functions           9
CG: ftol updated to: 1.00E-05
CG: itmax updated to:  500
CG: istop update to:  0
 Derivative of Vp passed
 Double matmul passed
 Reshape passed
 Derivative of Vp passed
 Double matmul passed
 Reshape passed
 Derivative of Vp passed
 Double matmul passed
 Reshape passed
 Derivative of Vp passed
 Double matmul passed
 Reshape passed
 Derivative of Vp passed
 Double matmul passed
 Reshape passed
 Derivative of Vp passed
 Double matmul passed
 Reshape passed
 Derivative of Vp passed
 Double matmul passed
 Reshape passed
 Derivative of Vp passed
 Double matmul passed
 Reshape passed
 Derivative of Vp passed
 Double matmul passed
 Reshape passed
 Derivative of Vp passed
 Double matmul passed
 Reshape passed
 Derivative of Vp passed
 Double matmul passed
 Reshape passed
 Derivative of Vp passed
 Double matmul passed
 Reshape passed
 Derivative of Vp passed
 Double matmul passed
 Reshape passed
 Derivative of Vp passed
 Double matmul passed
 Reshape passed

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7fece5e873af in ???
#1  0x4505e1 in __cgfit_routines_MOD_df1dim
	at ./src/optimize_cgfit_routines.f90:96
#2  0x44e269 in __cgfit_routines_MOD_dbrent_
	at ./src/optimize_cgfit_routines.f90:331
#3  0x451520 in __cgfit_routines_MOD_dlinmin
	at ./src/optimize_cgfit_routines.f90:69
#4  0x435f71 in __optimize_minimize_MOD_fmin_cg_df
	at ./src/OPTIMIZATION.f90:97
#5  0x47c720 in chi2_fitgf_replica
	at ./src/BATH_FIT.f90:176
#6  0x480d66 in __bath_fit_MOD_chi2_fitgf_generic_normal
	at ./src/BATH_FIT.f90:78
#7  0x42b6a9 in testdeltabuild
	at test/testFasterFit.f90:646
#8  0x42bc89 in main
	at test/testFasterFit.f90:573
<ERROR> Execution failed for object " testFasterFit "
<ERROR>*cmd_run*:stopping due to failed executions
STOP 1

As you can see from the stdout prints, I get through the “Derivative of Vp passed; Double matmul passed; Reshape passed” triad many times until exiting the function, and only catch a problem later. Without the -ffpe-trap=invalid,overflow,zero flag it would go even further and finish the program, with very subtle numerical errors in the fitted coefficients.

So yes, I don’t really know why the MWE segfaults inside so2nn_inlined while the whole program does not, but I guess at this point that is kind of irrelevant.

I think this is the fix you’re looking for: link to godbolt

EDIT: uhm, I guess I hadn’t read the lengthy thread well enough. With assumed-size arrays, the subroutine only receives the address of the beginning of the variable and then assumes it has the shape that you have coded; there aren’t many ways it could guess better.

Oh… thanks for spending time on this, that’s indeed the catch (well, actually the output Hnn must drop the frequency dimension too, with the whole call done in a loop over that index). Maybe I should have been clearer: I know (now) what was wrong and how to fix it; I was asking more about why the compiler didn’t help me catch the issue (and I think I have already got the core answer). Sorry to have wasted your time by not being clear that I’m not looking for an actual fix (since I already have one).

Yeah, this is another case of Knuth’s golden rule, “premature optimization is the root of all evil”. You were just trying to code the reshape function yourself!

Oh, to all: sorry, I am actually being silly after all. Crucial info I forgot to add (because I had totally forgotten about it in the first place): Nspin, Norb, and Ldelta are only known at runtime, so no matter what, it was silly to expect the compiler to check anything for me. And, with speed being the main concern here, I surely don’t want runtime checks (at least not with -O3 etc.; maybe in debug mode they would be nice, but I understand if it is not possible to have both).
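
For what it’s worth, one way to keep the fast explicit-shape routine and still get checks in debug builds could be a preprocessor-guarded assertion at the call site. This is just a sketch, assuming a cpp-style DEBUG_CHECKS macro (not something that exists in my actual build):

#ifdef DEBUG_CHECKS
   ! total-size check (rank-agnostic): this would have caught the Htmp bug at runtime
   if (size(Htmp) < Nspin*Norb*Nspin*Norb*Ldelta) &
        error stop "grad_delta_replica: Htmp too small for so2nn_inlined"
#endif
   dDelta(:,:,:,:,:,counter)=so2nn_inlined(Htmp,Nspin,Norb,Ldelta)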

Eheh, maybe. But the whole code has actually been up and working for a few years, so it’s not exactly premature optimization. “Badly executed”, of course, yes. :slight_smile:

I see your struggle. I think that if you want to improve performance you could maybe try refactoring some of your variables to reduce the number of jumps between non-contiguous chunks of memory.

Here, for example, each assignment copies L = size(Hnn,5) values that are far apart in memory:

Hnn(is,js,io,jo,:)
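
For instance, a hypothetical reordering (it would of course ripple through the rest of your code) that puts the frequency index first, so each of these copies runs over contiguous memory:

   pure function so2nn_wfirst(Hso,Nspin,Norb,L) result(Hnn)
      integer,intent(in)    :: Nspin,Norb,L
      complex(8),intent(in) :: Hso(L,Nspin*Norb,Nspin*Norb)   ! frequency index first
      complex(8)            :: Hnn(L,Nspin,Nspin,Norb,Norb)
      integer               :: io,is,jo,js
      do concurrent(is=1:Nspin,js=1:Nspin,io=1:Norb,jo=1:Norb)
         ! both the read and the write now run contiguously over the L frequencies
         Hnn(:,is,js,io,jo) = Hso(:, io+(is-1)*Norb, jo+(js-1)*Norb)
      end do
   end function so2nn_wfirst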

You’re not :slight_smile: … I haven’t been clear, actually: because rank mismatches could not be checked back when no interfaces existed, and because rank mismatches had turned into a feature among users rather than a pitfall, the choice was made to continue allowing them, even though nowadays they would be easy to catch when interfaces are defined.

In the old days, when allocatable and pointer were not part of the standard, there was a popular extension for when one needed dynamic allocation: Cray pointers (which ended up being available in all major compilers). But they had a significant restriction: only 1D arrays could be allocated with them. So when nD arrays were needed, the solution was to allocate a big 1D array and pass it to a lower-level routine where it was declared with the desired rank.
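
A rough sketch of that pattern (illustrative names, gfortran dialect: compile with -fcray-pointer; malloc/free are GNU extensions):

   program cray_workspace_demo
      implicit none
      real(8) :: buf(*)             ! Cray pointee, declared 1D
      pointer (p_buf, buf)          ! non-standard Cray pointer
      integer, parameter :: m = 3, n = 4
      p_buf = malloc(8*m*n)         ! raw allocation of m*n real(8)
      call kernel(buf, m, n)        ! the flat storage is used as an m-by-n matrix inside
      print *, buf(1:m*n)
      call free(p_buf)
   contains
      subroutine kernel(a, m, n)
         integer, intent(in)    :: m, n
         real(8), intent(inout) :: a(m, n)   ! same storage, re-declared with the desired rank
         a = 1d0
      end subroutine kernel
   end program cray_workspace_demo

Nowadays one would of course just use an allocatable array of the desired rank instead.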

Yes, sure. Well, probably my actual sin has been over-optimizing. Profiling with valgrind (callgrind), I saw that switching from calling inv (which is just a high-level wrapper around the suitable BLAS/LAPACK calls to invert a generic matrix) to this “diagonalize, then invert element-wise, then rotate back” scheme already gave quite a nice speedup. The runtime share of the so2nn_reshape calls was actually pretty negligible in comparison, and this attempt to “optimize that too, since I’m here” has been paid for in blood! :smiley:

@rgba, you may want to go through this thread if you haven’t already:

This is extremely on point, and moreover a delightfully rich debate on the topic. Thanks so much for surfacing it here!

In Fortran, when assumed-size, array(*), and explicit-shape, array(m,n), dummy-argument declarations are used, the association between the actual and dummy arguments is through storage sequence association. This is explained in some detail in the language standard for both numeric and character variables, and the convention dates back to f66 for numeric and logical arrays and to f77 for characters. When assumed shape, array(:,:), is used, a different argument association is performed. The former cases can occur with both implicit and explicit interfaces, while the latter occurs only with explicit interfaces. That means the two approaches can be mixed together, but sometimes the association requires the compiler to do copy-in/copy-out behind the scenes. A further complication is that pointer assignments can now also change the rank (since f2008, I think). This is yet another situation that requires knowledge of storage sequence association to really understand how it works.
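
Just to illustrate the rank-changing pointer assignment mentioned above (a minimal F2008 sketch, illustrative names):

   program remap_demo
      implicit none
      real(8), target  :: flat(12) = 0d0
      real(8), pointer :: mat(:,:)
      mat(1:3,1:4) => flat    ! bounds remapping: same storage viewed as a 3x4 matrix, no copy
      mat(2,3) = 1d0          ! writes flat(2 + (3-1)*3), i.e. flat(8)
      print *, flat(8)
   end program remap_demo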
