Unexpected segfault when allocating an unallocated array

Hello everyone,

I am facing the following error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f85ba9f2960 in ???
#1  0x7f85ba9f1ac5 in ???
#2  0x7f85ba6df51f in ???
	at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3  0x7f85ba740ac1 in _int_malloc
	at ./malloc/malloc.c:3937
#4  0x7f85ba742138 in __GI___libc_malloc
	at ./malloc/malloc.c:3329
#5  0x55d084f35ac1 in __grisbolt_hamiltonian_MOD_build_h_sector_fulldiag
	at ../../GRISBOLT/src/GRISBOLT_AIM/GRISBOLT_HAMILTONIAN.f90:60
#6  0x55d084f23c84 in __grisbolt_aim_MOD_diagonalize_aim
	at ../../GRISBOLT/src/GRISBOLT_AIM/GRISBOLT_AIM.f90:117
#7  0x55d084f24e53 in __grisbolt_aim_MOD_solve_aim_problem
	at ../../GRISBOLT/src/GRISBOLT_AIM/GRISBOLT_AIM.f90:48
#8  0x55d084ebca15 in sc_cycle
	at ././src/BETHE_X_GRISBOLT.f90:153
#9  0x55d084eb76cd in __bethe_x_grisbolt_MOD_solve_bethe_x
	at ././src/BETHE_X_GRISBOLT.f90:104
#10  0x55d084ea4a83 in MAIN__
	at app/bethe_x.f90:103
#11  0x55d084ea4c44 in main
	at app/bethe_x.f90:3
Segmentation fault (core dumped)
<ERROR> Execution for object " bethe_x " returned exit code  139
<ERROR> *cmd_run*:stopping due to failed executions
STOP 139

The error appears when allocating an unallocated array in the following subroutine:

  subroutine build_H_sector_fulldiag(aim_problem,SectorI,Hmat,ifrag_)
    type(AIM),allocatable,intent(in)   :: aim_problem
    type(sector)                       :: SectorI
    complex(8),dimension(:,:)          :: Hmat
    integer,optional :: ifrag_
    !
    integer                            :: dim_ib, dim_ii
    integer,dimension(:), allocatable  :: ib_up,ib_dw, ii_up,ii_dw
    integer                            ::  ifrag
[other variables]
    !
    ifrag=1; if(present(ifrag_)) ifrag=ifrag_
    !
    dim_ib = Nbath/Nspin; dim_ii = Nimp/Nspin

    if(allocated(ib_up)) deallocate(ib_up)
    if(allocated(ib_dw)) deallocate(ib_dw)
    allocate(ib_up(dim_ib),ib_dw(dim_ib))

    if(allocated(ii_up)) deallocate(ii_up)
    if(allocated(ii_dw)) deallocate(ii_dw)
    allocate(ii_up(dim_ii))
    allocate(ii_dw(dim_ii))

[a lot of code]

end subroutine build_H_sector_fulldiag

Here Nbath, Nimp and Nspin are integer module variables.
I traced the error to the last allocation, that is, " allocate(ii_dw(dim_ii)) ".
I deallocate the array as a precaution two lines before, and I checked that dim_ii has the proper integer value.
Thank you in advance if you can help me!

P.S.
I am using:

  • fpm 0.10.0 alpha
  • FPM_FFLAGS=" -ffree-line-length-none -fPIC -w -funroll-loops -fcheck=all -g -O0 -fbacktrace -fbounds-check "

I don’t see any issue at first sight. I would recommend creating a minimal reproducible example (MRE) of this. Either it becomes obvious what the problem is, or it’s a gfortran bug, in which case you can report the MRE to them.

If you add stat=istat, errmsg=msgstr clauses to allocate, with proper declaration of these two variables, will it prevent SegFault? And if so, what error message appears in msgstr?
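
A minimal sketch of what this suggestion looks like in practice. The program name, the value of dim_ii and the length of msgstr are assumptions made for illustration; only the stat= and errmsg= specifiers come from the post above.

```fortran
program alloc_stat_demo
  implicit none
  integer, allocatable :: ii_dw(:)
  integer              :: dim_ii, istat
  character(len=256)   :: msgstr   ! length chosen arbitrarily for this sketch

  dim_ii = 4                       ! illustrative value, not taken from the thread
  msgstr = ''

  ! With stat= present, a failure detected by the runtime sets istat
  ! to a nonzero value instead of terminating the program.
  allocate(ii_dw(dim_ii), stat=istat, errmsg=msgstr)
  if (istat /= 0) then
     print '(a,i0,2a)', 'allocate failed, stat=', istat, ': ', trim(msgstr)
     error stop 1
  end if

  print '(a,i0)', 'allocated ii_dw with size ', size(ii_dw)
end program alloc_stat_demo
```

Note that stat= only reports failures the runtime itself detects, such as running out of memory or allocating an already allocated array. If the heap was corrupted earlier and the crash happens inside malloc, the program can still receive SIGSEGV.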

Does the calling program have an explicit interface for this subroutine? I think that is necessary for allocatable dummy arguments. If there is a mismatch, then it can corrupt the allocation status of other arrays, even local ones.

If everything is working correctly, all of the local allocatable arrays in your subroutine should be deallocated upon entry, and automatically deallocated upon exit. Since you are testing the allocation status of some of those arrays and apparently finding some of them allocated upon entry, that is evidence that the allocation tables are being corrupted somehow.

Unfortunately it still segfaults even with the stat and errmsg variables.

I am checking the allocation status, but none of them is allocated; I verified this by printing the values, though I left that out of the code snippet I posted.
I added those lines as a safety measure, to make sure that the problem was not a pre-allocated variable.

I tried to provide an explicit interface in the subroutine that calls “build_H_sector_fulldiag”, as follows:

subroutine diagonalize_aim(aim_problem,state_list,verb_,ifrag_)
    interface
       subroutine build_H_sector_fulldiag(aim_problem,SectorI,Hmat,ifrag_)
         USE GRISBOLT_COMMON, only: AIM
         USE GRISBOLT_FOCKSPACE, only: sector
         type(AIM),allocatable,intent(in)   :: aim_problem
         type(sector)                       :: SectorI
         complex(8),dimension(:,:)          :: Hmat
         integer,optional :: ifrag_
       end subroutine build_H_sector_fulldiag
    end interface
    !> routine to find the GroundState(s) of the AIM
    type(AIM),allocatable,intent(inout)             :: aim_problem
    type(sparse_espace), intent(inout)              :: state_list
    !
[rest of the code]
end subroutine

and now it fails to compile with the following error:

bethe_x.f90                            done.
bethe_x                                failed.
[100%] Compiling...
/usr/bin/ld: build/gfortran_D60D5F445CDA73CD/BETHE_2ORB_GRISOLT/libBETHE_2ORB_GRISOLT.a(.._.._GRISBOLT_src_GRISBOLT_AIM_GRISBOLT_AIM.f90.o): in function `__grisbolt_aim_MOD_diagonalize_aim':
/home/samuele/GRISB/TESTS_GRISBOLT/BETHE_2ORB_GRISBOLT/../../GRISBOLT/src/GRISBOLT_AIM/GRISBOLT_AIM.f90:128: undefined reference to `build_h_sector_fulldiag_'
/usr/bin/ld: /home/samuele/GRISB/TESTS_GRISBOLT/BETHE_2ORB_GRISBOLT/../../GRISBOLT/src/GRISBOLT_AIM/GRISBOLT_AIM.f90:147: undefined reference to `build_h_sector_fulldiag_'
collect2: error: ld returned 1 exit status
<ERROR> Compilation failed for object " bethe_x "
<ERROR> stopping due to failed compilation
STOP 1

So the linker does not find it; maybe I am declaring the interface incorrectly.
In any case, before I added the interface block the code did enter the subroutine.

From the …MOD… names referenced in the backtrace message, we know that build_H_sector_fulldiag is a module procedure. This means that posting “snippets” of code will not help us help you, because too much context is missing. The wheels may be coming off at the last ALLOCATE, but the damage is probably being done a lot earlier, in a different subprogram. I suggest using “-fcheck=all” to recompile the entire program. Also, does valgrind report anything suspicious? If that doesn’t help, come back.
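
To illustrate the point about module procedures: a procedure contained in a module already has an explicit interface wherever the module is used, so no interface block is needed in the caller. The sketch below uses hypothetical names (demo_hamiltonian, build_h) and is not the GRISBOLT code.

```fortran
module demo_hamiltonian
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Module procedure: its interface is explicit in every scope
  ! that accesses it through a use statement.
  subroutine build_h(n, hmat)
    integer, intent(in)                   :: n
    complex(dp), allocatable, intent(out) :: hmat(:,:)
    allocate(hmat(n,n))
    hmat = (0.0_dp, 0.0_dp)
  end subroutine build_h
end module demo_hamiltonian

program demo_use
  use demo_hamiltonian, only: dp, build_h   ! no interface block required
  implicit none
  complex(dp), allocatable :: h(:,:)

  call build_h(3, h)
  print '(a,i0,a,i0)', 'shape: ', size(h,1), ' x ', size(h,2)
end program demo_use
```

The undefined reference to `build_h_sector_fulldiag_` reported earlier in the thread is consistent with this: an interface block written inside the caller declares an external procedure of that name, which is a different symbol from the module procedure.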

Yes, I will try to create an MRE as soon as possible.
In the meantime, I am already using -fcheck=all, and I ran valgrind with the following options:

valgrind   --leak-check=full --show-leak-kinds=all --track-origins=yes -s   fpm run bethe_x

The leak summary and the first two errors are:


==774496== LEAK SUMMARY:
==774496==    definitely lost: 91,911 bytes in 3,877 blocks
==774496==    indirectly lost: 71,794 bytes in 2,099 blocks
==774496==      possibly lost: 454 bytes in 8 blocks
==774496==    still reachable: 495,266 bytes in 5,980 blocks
==774496==         suppressed: 0 bytes in 0 blocks
==774496== 
==774496== ERROR SUMMARY: 596 errors from 538 contexts (suppressed: 0 from 0)
==774496== 
==774496== 1 errors in context 1 of 538:
==774496== Conditional jump or move depends on uninitialised value(s)
==774496==    at 0x4B0FCF2: _gfortran_execute_command_line_i4 (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==774496==    by 0x13FE03: __fpm_filesystem_MOD_run (fpm_filesystem.F90:995)
==774496==    by 0x1374F9: __fpm_MOD_cmd_run (fpm.f90:621)
==774496==    by 0x115F36: MAIN__ (main.f90:78)
==774496==    by 0x11549E: main (main.f90:13)
==774496==  Uninitialised value was created by a stack allocation
==774496==    at 0x13FCED: __fpm_filesystem_MOD_run (fpm_filesystem.F90:949)
==774496==
==774496== 
==774496== 1 errors in context 2 of 538:
==774496== realloc() with size 0
==774496==    at 0x48502F0: realloc (vg_replace_malloc.c:1801)
==774496==    by 0x144D58: __fpm_filesystem_MOD_list_files (fpm_filesystem.F90:442)
==774496==    by 0x16B16B: __fpm_sources_MOD_add_sources_from_dir (fpm_sources.f90:108)
==774496==    by 0x134DBD: __fpm_MOD_build_model (fpm.f90:198)
==774496==    by 0x1366DB: __fpm_MOD_cmd_run (fpm.f90:495)
==774496==    by 0x115F36: MAIN__ (main.f90:78)
==774496==    by 0x11549E: main (main.f90:13)
==774496==  Address 0x5c0b720 is 0 bytes after a block of size 0 alloc'd
==774496==    at 0x484880F: malloc (vg_replace_malloc.c:446)
==774496==    by 0x144CC4: __fpm_filesystem_MOD_list_files (fpm_filesystem.F90:442)
==774496==    by 0x16B16B: __fpm_sources_MOD_add_sources_from_dir (fpm_sources.f90:108)
==774496==    by 0x134DBD: __fpm_MOD_build_model (fpm.f90:198)
==774496==    by 0x1366DB: __fpm_MOD_cmd_run (fpm.f90:495)
==774496==    by 0x115F36: MAIN__ (main.f90:78)
==774496==    by 0x11549E: main (main.f90:13)
==774496== 

I don’t understand whether this has something to do with fpm; I should mention that I am using fpm 0.10.0 alpha.

Before running valgrind, you should just run the executable compiled with -fcheck=all “alone” (i.e. without valgrind)

Thank you PierU. I was already using -fcheck=all in the original post; I have updated it with some more info. I am using:

  • fpm 0.10.0 alpha
  • FPM_FFLAGS=" -ffree-line-length-none -fPIC -w -funroll-loops -fcheck=all -g -O0 -fbacktrace -fbounds-check "

I guess that when using this command, valgrind is analysing the execution of the fpm executable, not the execution of your own executable.

You are right! But something tricky is happening…
I was able to find the executable and I moved it to the main folder.
If I just run the executable, I get the same segfault as in the original post.
If I run it under valgrind, the executable doesn’t stop at the same point and runs until the end without crashing (and “does the job it is meant to do”).
The valgrind output nevertheless reports 23 errors from 9 contexts:

==784700== LEAK SUMMARY:
==784700==    definitely lost: 408 bytes in 8 blocks
==784700==    indirectly lost: 4,150 bytes in 90 blocks
==784700==      possibly lost: 0 bytes in 0 blocks
==784700==    still reachable: 34,264 bytes in 48 blocks
==784700==         suppressed: 0 bytes in 0 blocks
==784700== 
==784700== ERROR SUMMARY: 23 errors from 9 contexts (suppressed: 0 from 0)
==784700== 
==784700== 1 errors in context 1 of 9:
==784700== Conditional jump or move depends on uninitialised value(s)
==784700==    at 0x521A9FA: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==784700==    by 0x10E838: MAIN__ (bethe_x.f90:98)
==784700==    by 0x10EC1E: main (bethe_x.f90:3)
==784700==  Uninitialised value was created by a stack allocation
==784700==    at 0x521A8FE: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==784700== 
==784700== 
==784700== 8 errors in context 2 of 9:
==784700== Invalid write of size 8
==784700==    at 0x121280: __bethe_x_grisbolt_MOD_solve_bethe_x (BETHE_X_GRISBOLT.f90:82)
==784700==    by 0x10EA6F: MAIN__ (bethe_x.f90:103)
==784700==    by 0x10EC1E: main (bethe_x.f90:3)
==784700==  Address 0x7b01098 is 8 bytes after a block of size 64 alloc'd
==784700==    at 0x484880F: malloc (vg_replace_malloc.c:446)
==784700==    by 0x11EA6D: __bethe_x_grisbolt_MOD_solve_bethe_x (BETHE_X_GRISBOLT.f90:59)
==784700==    by 0x10EA6F: MAIN__ (bethe_x.f90:103)
==784700==    by 0x10EC1E: main (bethe_x.f90:3)
==784700== 
==784700== 
==784700== 8 errors in context 3 of 9:
==784700== Invalid write of size 8
==784700==    at 0x121271: __bethe_x_grisbolt_MOD_solve_bethe_x (BETHE_X_GRISBOLT.f90:82)
==784700==    by 0x10EA6F: MAIN__ (bethe_x.f90:103)
==784700==    by 0x10EC1E: main (bethe_x.f90:3)
==784700==  Address 0x7b01090 is 0 bytes after a block of size 64 alloc'd
==784700==    at 0x484880F: malloc (vg_replace_malloc.c:446)
==784700==    by 0x11EA6D: __bethe_x_grisbolt_MOD_solve_bethe_x (BETHE_X_GRISBOLT.f90:59)
==784700==    by 0x10EA6F: MAIN__ (bethe_x.f90:103)
==784700==    by 0x10EC1E: main (bethe_x.f90:3)
==784700== 
==784700== ERROR SUMMARY: 23 errors from 9 contexts (suppressed: 0 from 0)

I hate decoding valgrind outputs :cold_face:… Nonetheless I would closely look at what happens around bethe_x.f90:103

I feel you :laughing: thank you for your help anyway!
Unfortunately bethe_x.f90:103 is just a call to the subroutine that “solves the problem my project is meant to solve”; I was hoping for more fine-grained error output from valgrind :pensive:

I found the bug! It was obviously a stupid mistake: I was writing a big matrix into a smaller one because I had hard-coded a dimension :pensive:
Thank you for the support!

GFortran is usually pretty good at catching such errors at runtime with -fcheck=all. It didn’t catch it this time?

As a user I want the compiler to catch all errors when I compile in Debug mode. It should never segfault.

The code was:

matrix_a( 1 , : , : ) = kronecker_product(matrix_b,matrix_c)

with matrix_b and matrix_c hard-coded as 2x2 and 3x3, while matrix_a is allocatable and was allocated as 2x4x4 (matrix_c should have been 2x2).

This could not be caught at compile time since matrix_a is allocatable, but I would like a clearer runtime message in such situations :confused:

The Intel compiler ifx, the NAG compiler nagfor, and the LLVM compiler flang produce the required clearer message.
fatal Fortran runtime error(arrbug.f90:11): Assign: mismatching element counts in array assignment (to 16, from 36)

program arr_bug
  use iso_fortran_env, only : compiler_version, compiler_options
  real , allocatable :: matrix_a(:,:,:)
  real :: matrix_b(2,2), matrix_c(3,3)

  print '(A,/,A)', compiler_version(), compiler_options()
  allocate (matrix_a(2,4,4))
  matrix_a = -1
  matrix_b = 1
  matrix_c = 1
  matrix_a(1,:,:) = krone(matrix_b,matrix_c)
  print *,matrix_a(1,:,:)
contains
  function krone(x,y) result(z)
    real :: x(:,:), y(:,:), z(size(x,dim=1)*size(y,dim=1),size(x,dim=2)*size(y,dim=2))
    z = 42
  end function krone
end program arr_bug
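
Since gfortran with -fcheck=all did not flag this assignment in the thread above, one possible workaround is to compare shapes by hand before assigning. The following is only a sketch: kron is a hypothetical stand-in for the poster's kronecker_product, and the temporary tmp is an assumption introduced so that the result shape can be inspected.

```fortran
program shape_guard
  implicit none
  real, allocatable :: matrix_a(:,:,:), tmp(:,:)
  real :: matrix_b(2,2), matrix_c(3,3)

  allocate(matrix_a(2,4,4))
  matrix_b = 1
  matrix_c = 1

  ! Assignment to an unallocated allocatable allocates tmp
  ! to the shape of the function result (6x6 here).
  tmp = kron(matrix_b, matrix_c)

  ! Refuse the copy when source and target sections differ in shape.
  if (any(shape(tmp) /= shape(matrix_a(1,:,:)))) then
     print '(a,2(i0,1x),a,2(i0,1x))', 'shape mismatch: source ', &
          shape(tmp), 'target ', shape(matrix_a(1,:,:))
     error stop 1
  end if
  matrix_a(1,:,:) = tmp
contains
  ! Hypothetical Kronecker product: block (i,j) of z is x(i,j)*y.
  function kron(x, y) result(z)
    real, intent(in) :: x(:,:), y(:,:)
    real :: z(size(x,1)*size(y,1), size(x,2)*size(y,2))
    integer :: i, j
    do j = 1, size(x,2)
       do i = 1, size(x,1)
          z((i-1)*size(y,1)+1:i*size(y,1), &
            (j-1)*size(y,2)+1:j*size(y,2)) = x(i,j)*y
       end do
    end do
  end function kron
end program shape_guard
```

With the 2x2 and 3x3 inputs shown, the check fails and the program stops with a message naming both shapes instead of overwriting memory past the 4x4 section. The extra temporary costs one copy, which is usually acceptable in a debug build.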