Stack overflow, heap:arrays and Fortran array functions

I developed a MWE to test possible stack-overflow runtime errors (in Windows) when using the array functions reshape, sum, any, where. My example has three files: mymod1.f90, mymod2.f90 and test_reshape.f90 (attached at the end of this post)

mymod1 is compiled with the flag /heap-arrays0 so that it always runs
mymod2 instead is compiled without /heap-arrays0 to test if a stack overflow occurs
mymod1 and mymod2 are identical and test the following Fortran functions:
reshape, sum, any, where
Note that ALL arrays are declared as allocatable

Nonetheless, there is a stack overflow when calling test2 in mymod2. The stack overflow occurs when the where statement is reached, not with the others. I found this behavior quite strange. Does this mean that where creates some temporary arrays on the stack, while reshape,sum,any do not?

What seems to transpire from this test is that

  1. Declaring ALL arrays as allocatable is not enough to avoid problems with the stack (at least in Windows). You need to compile with /heap-arrays0 to be on the safe side.

  2. But reshape,sum,any are ok even without /heap-arrays0, while where gives the run time error

I was trying to avoid compiling with /heap-arrays0 because in some of my codes (not shown here) there is a big performance penalty, but it seems than in that case it becomes risky to use some Fortran array functions like where which are however quite useful

Any comment greatly appreciated, thanks!

Here is the main program:

! Compilation
! ifort /heap-arrays0 -c mymod1.f90
! ifort -c mymod2.f90
! ifort -c test_reshape.f90
! ifort mymod1.obj mymod2.obj test_reshape.obj -o win_run.exe
! Run program
! .\win_run.exe

program test_reshape
	use mymod1, only: test1
    use mymod2, only: test2
	implicit none
    integer :: n,m

    n = 10000
    m = 2000

    write(*,*) "-----------------------"

    write(*,*) "Calling sub test1.."
	call test1(n,m)

    write(*,*) "-----------------------"

    write(*,*) "Calling sub test2.."
	call test2(n,m)

end program test_reshape

and here are the two modules:

module mymod1
! ifort /heap-arrays0 -c mymod1.f90
implicit none
contains

subroutine test1(n,m)
implicit none
integer, intent(in) :: n, m
real(8), allocatable :: x(:,:), xvec(:)
real(8) :: sum_x

allocate(x(n,m))
call RANDOM_NUMBER(x)

xvec = reshape(x,[n*m])
write(*,*) "reshape is ok"

sum_x = sum(x)
write(*,*) "sum is ok"

if (any(x>0.1d0)) then
    write(*,*) "Some elements of x are larger than 0.1"
endif
write(*,*) "any is ok"

where (xvec<0.5d0)
    xvec = 0d0
elsewhere
    xvec = 1d0
end where
write(*,*) "where is ok"

write(*,*) "Shape of x = ", shape(x)
write(*,*) "Shape of xvec = ", shape(xvec)

end subroutine test1
end module mymod1
module mymod2
! ifort -c mymod2.f90
implicit none
contains

subroutine test2(n,m)
implicit none
integer, intent(in) :: n, m
real(8), allocatable :: x(:,:), xvec(:)
real(8) :: sum_x

allocate(x(n,m))
call RANDOM_NUMBER(x)

xvec = reshape(x,[n*m])
write(*,*) "reshape is ok"

sum_x = sum(x)
write(*,*) "sum is ok"

if (any(x>0.1d0)) then
    write(*,*) "Some elements of x are larger than 0.1"
endif
write(*,*) "any is ok"

where (xvec<0.5d0)
    xvec = 0d0
elsewhere
    xvec = 1d0
end where
write(*,*) "where is ok"

write(*,*) "Shape of x = ", shape(x)
write(*,*) "Shape of xvec = ", shape(xvec)

end subroutine test2
end module mymod2

mymod1.f90 (669 Bytes)
mymod2.f90 (669 Bytes)
test_reshape.f90 (1.6 KB)

The standard requires the compiler to fully evaluate the where condition (if it is an expression) before executing the statements in the construct. In the general case this means that a temporary array is needed. In your case the compiler could possibly identify that the temporary array is not needed, but such analysis can quickly get very complex: my guess is that ifort/ifx uses a temporary array as soon as an array that is present in the condition is also assigned in the statements (you could check that by assigning to another array). Maybe it also depends on the optimization level during teh compilation.

1 Like

On top of that, for the logic presented, the where construct can be replaced (avoiding any possible temporary created) with a do concurrent:

   do concurrent (i = 1 : size(xvec)) shared(xvec)
      if (xvec(i) < 0.5d0) then
         xvec(i) = 0.d0
      else
         xvec(i) = 1.d0
      endif
   enddo
1 Like

Thanks! I replaced the where construct with your do concurrent example and it runs ok without overflows. However, when compiling, I get the following warning:

mymod2.f90(28): warning #5423: Locality information is ignored without one of these command line qualifiers ‘/Qopenmp or Qparallel’
do concurrent (i=1:size(xvec)) shared(xvec)
^

Does it mean that I have to add the compilation flag /Qparallel? Does it improve performance?

For the Intel compilers, DO CONCURRENT does not parallelize unless you use one of those options.

1 Like

@aledinola ,

Your situation is specific to Intel Fortran compiler and as such, you may also want to inquire at the Intel Fortran forum and try to get feedback from current Intel Support staff as to their recommendation.

The issue with WHERE construct is with the mask-expr for which the standard states, “The mask-expr in a WHERE statement, WHERE construct statement, or masked ELSEWHERE statement, is evaluated at most once per execution of the statement” toward which a compiler is likely to create a temporary on the stack, unless it offers a way around.

However, in this situation, an alternative is also the MERGE intrinsic which has ELEMENTAL semantics and can work well and not incur you the penalty due to the overhead of /Qparallel

xvec = merge( 0d0, 1d0, xvec < 0.5d0 )
write(*,*) "merge is ok"
C:\temp>ifort /standard-semantics /free p.f
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.10.0 Build 20230609_000000
Copyright (C) 1985-2023 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.36.32537.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:p.exe
-subsystem:console
p.obj

C:\temp>p.exe
 -----------------------
 Calling sub test1..
 reshape is ok
 sum is ok
 Some elements of x are larger than 0.1
 any is ok
 merge is ok
 Shape of x =  10000 2000
 Shape of xvec =  20000000
 -----------------------
 Calling sub test2..
 reshape is ok
 sum is ok
 Some elements of x are larger than 0.1
 any is ok
 merge is ok
 Shape of x =  10000 2000
 Shape of xvec =  20000000
3 Likes

Thanks! The construct with merge is great. The do concurrent is also ok in that it avoids the stack overflow; however in more complex situations it is slow, perhaps due to the overhead of Qparallel, as you said.

As far as I remember I had better performance with forall rather than do concurrent (but I prefer not to use forall since it is now deprecated)

Regarding the stack size on Windows, I believe it’s 1 MB by default. To increase it, you need to add a linker option: /STACK (Stack allocations) | Microsoft Learn

1 Like

Thanks! So instead of /heap-arrays0 I should increase the size of the stack? How can I modify the compilation instructions shown above to do that?

@aledinola, the following
ifx/ifort -o win_run.exe mymod1.f90 mymod2.f90 test_reshape.f90 -link -stack:<put a suitable value here> should do the job.

1 Like

If I want to set the stack size to 2MB (assuming 1MB is the default), should I compile with the following line:

ifort -o win_run.exe mymod1.f90 mymod2.f90 test_reshape.f90 -link -stack:2

Sorry for the basic question :slight_smile:

The stack size is in bytes.

See here for /F and /STACK: Stack compiler/linker options - Intel Community

2 Likes

If you use visual studio, you can set those stack or heap size in the linker option below,

The largest allowed size is 2^31 - 1 = 2147483647
It probably means 2^31/1024/1024=2048MB=2GB.

1 Like

Thanks! I am wondering if increasing the stack size leads to worse performance… Setting /heap:arrays0 does, at least in my setup (ifort on Windows)

Another trick that I found out is to use heap arrays only on certain files and not on the entire project. But is it possible to do this in Visual Studio? If I set 0 in the field for heap arrays, this applies to the entire project

1 Like

Thanks for reminding this trick :slight_smile:
Sure, you could definitely set heap array 0 for only the files you want, instead of using it for the whole project.

For example, I have a file a called ran.f90 as below, in the project, from the solution explorer in VS, I only right click this file, and in optimization tab, I set the value in “Heap Arrays” as 0,

In this case, only this file ran.f90 is using heap arrays 0.

I think what you suggested, i.e., only setting heap arrays 0 for certain files which really need heap arrays 0, is better than setting the whole project as heap arrays 0.

1 Like

Managing a heap is much more difficult than managing a stack. If you just play around a little with a simple fortran code to mimic this you will see the difference. Is the garbage collection done on every “deallocate” step, or is it done only when an “allocate” step fails? It is a “pay me now” or “pay me more later” situation. In contrast, stack management just requires a couple of integer operations for each push or pop step.

As for performance differences with a larger stack vs. a smaller stack, I don’t think there would be much or any difference until the stack is so large that the program can no longer run. Then there would be an infinite performance difference. A possible exception would be when there is a single pool of memory, and the stack is allocated at the low addresses and the heap allocations occur in the high addresses that are left over. A big stack might then squeeze the heap, requiring, among other things, more frequent garbage collection.

3 Likes

The only disadvantage I know of for setting a large stack is that if it is TOO large, the program won’t start.

3 Likes

It’s really unclear to me what is the default policy of gfortran for the allocations of the automatic/temporary arrays. I’ve made a small program where a routine is repeatedly called, and prints the addresses of 2 local scalars, of an automatic array with different sizes, and an allocatable array. The routine is declared recursive to be sure that no variable is static.

  • basically, everything is allocated in the highest part of the address space
  • if the arrays are 128kiB or larger, they are allocated in the lower part instead (why is that ?)
  • the two scalars are always allocated at the same addresses, 4 bytes apart. They really look like on a stack, as expected.
  • the automatic array is generally allocated at different addresses from call to call, and not that close from the scalars. The allocatation strategy appears to be the same for both the automatic array and the allocatable array, that is on heap.

Code compiled with gfortran 13, no option, on macOS:

program testack
implicit none

integer :: i

do i = 0, 20
   call testloc(i)
end do

CONTAINS

   recursive subroutine testloc(i)
   integer, intent(in) :: i
   real :: s, a(2**i), t
   real, allocatable :: b(:)
   a = 0.0
   allocate( b, source=a )
   print*, storage_size(a) * size(a) / 8, loc(s), loc(t), loc(a), loc(b)

   end subroutine

end program

And the output:

           4      140732877400700      140732877400696      140375574461632      140375574461648
           8      140732877400700      140732877400696      140375574461632      140375574461648
          16      140732877400700      140732877400696      140375574461632      140375574461648
          32      140732877400700      140732877400696      140375574461648      140375574461680
          64      140732877400700      140732877400696      140375574461680      140375574461744
         128      140732877400700      140732877400696      140375574461744      140375574461888
         256      140732877400700      140732877400696      140375574462016      140375574462272
         512      140732877400700      140732877400696      140375574461872      140375574462528
        1024      140732877400700      140732877400696      140375578644480      140375578645504
        2048      140732877400700      140732877400696      140375578646528      140375578648576
        4096      140732877400700      140732877400696      140375578650624      140375578654720
        8192      140732877400700      140732877400696      140375578658816      140375578667008
       16384      140732877400700      140732877400696      140375578675200      140375578691584
       32768      140732877400700      140732877400696      140375578707968      140375578740736
       65536      140732877400700      140732877400696      140375578773504      140375578839040
      131072      140732877400700      140732877400696           4340514816           4340645888
      262144      140732877400700      140732877400696           4340776960           4341039104
      524288      140732877400700      140732877400696           4341301248           4341825536
     1048576      140732877400700      140732877400696           4342349824           4343398400
     2097152      140732877400700      140732877400696           4344446976           4346544128
     4194304      140732877400700      140732877400696           4348641280           4352835584
1 Like

Interesting test! One question: why does storage_size(a)*size(a) always give the same result? At any subroutine call, the dimension of the automatic array a increases, so its size and storage size should increase as well

Not sure what you mean, but it doesn’t: this is the first column, and the value effectively increases at each call

1 Like