Correct, I didn’t read well sorry
The compiler is allowed to allocate b(:)
on the stack like this because b(:)
is a local array, and it must disappear upon subroutine return the same way that an automatic array like a(:)
must disappear. There must be some threshold size where it switches to the low heap addresses for both arrays.
I next modified your code as real, allocatable, save :: b(:)
, and also adding the deallocate statement before return. I was expecting b(:)
to always have a low heap address with that change. However, I get the same result as before. I guess the compiler looks ahead to see the deallocate statement and it doesn’t let me trick it.
@aledinola , there is a compromise on most operating systems between larger stack sizes and the use of heap memory, that is especially the case on Windows OS where users are generally better off writing programs that can work well with the default stack size which is not large.
Is it possible for you to show a small but complete example of “Setting /heap:arrays0 does, at least in my setup (ifort on Windows)”?
The question is, what does tell you that a
and possibly b
are on the stack for small sizes? The two scalars are clearly allocated 4 bytes apart, but on the first 3 lines a
and t
are allocated 140732877400696 - 140375574461632 = 357302939064
bytes (357GB) apart. I though that all addresses (even if virtual addresses) in the stack should be contiguous: maybe I’m wrong?
@FortranFan the code is part of a project with many modules and subprograms, so I think it’s not so easy to come up with a MWE that reproduces the issue. But as soon as I have time I will try to provide an example, if I manage.
The code uses OpenMP and without the flag /heap-arrays0 I get a stack overflow. Then I added /heap-arrays0 to the entire project and it runs but very slow. A co-worker of mine on this project has Mac and she does not have any issue with memory.
How did I solve the problem? I compiled only some files with /heap-arrays0 and not the entire project. Basically what @CRquantum explained. In this way, the code runs without stack overflow and is way faster.
Does she use the Intel compiler as well?
The ultimate way is to modify the code in order to avoid the big allocations on the stack.
-
she uses ifort as well, same version
-
yes, indeed I also changed a bit the code to make it more robust to potential stack overflows. But this means that I had to remove for example “where” statements because they create large automatic arrays for the mask (I guess) and replace them with less concise do loops with nested if conditions (now I learnt that there are better alternatives e.g. merge, as @FortranFan kindly suggested me above)
By default, using Gfortran (and ifort), allocatable arrays (such as b) go on the heap, while automatic arrays (such as a) go on the stack.
Using Gfortran, if selecting using “-fstack-arrays” for large local arrays, it is necessary to also increase the stack size. My link .bat file to achieve this is:
set tce=load_gf.tce
set load=gfortran
set stack_options=-Wl,-stack,536870912,-Map=%program%.map
del %program%.map
del %program%.exe
now >>%tce%
%load% @load_gf.txt -fopenmp -fstack-arrays %stack_options% >>%tce% 2>&1
dir %program%.* /od >>%tce%
notepad %tce%
My file load_gf.txt lists all .o files (in multiple directories) and finally -o program.exe
For a large 64-bit Fortran multi-threaded program, I typically set the stack to 0.5 GBytes (per thread). This requires a large virtual memory address space, but has no direct impact on the physical memory requirement and no significant performance problem.
I think the default stack size for 64-bit is more than 1 MBytes. For contemporary OS, such a small stack provision is wrong.
When you run the above code with your stack_options, what is the output? Am I misinterpreting from the addresses what is a heap address and what is a stack address?
Ron,
I have listed a modified program to give more information on the stack and heap and build using Gfortran 64-bit Ver 11.1 on Win 10.
It demonstrates the use of a 500 Mbyte stack for each thread for variables s,t & a but shared heap among all threads for b.
The memory addresses are reported as page no : byte number, which can be easier to understand…
Note for Gfortran : win-64 implementation, each thread stack defaults to the same stack size as the primary thread. (this may not be teh same default on other OS)
I do have other more detailed stack reporting routines based on the winapi routines “GetCurrentThreadStackLimits” and “GlobalMemoryStatusEx”, but don’t know the equivalent for other OS
The test is carried out on a Ryzen 5900X configured for 12 threads and 64 GBytes of memory so there is no problem with physical or virtual memory addresses.
program testack
implicit none
integer :: i
write (*,*) &
'id th size loc_s : stack loc_t : stack loc_a : stack loc_b : heap'
!$omp parallel do private (i)
do i = 0, 20
call testloc(i)
end do
!$omp end parallel do
CONTAINS
recursive subroutine testloc(i)
integer, intent(in) :: i
integer :: thread
real :: s, a(2**i), t
real, allocatable :: b(:)
integer*4, external :: omp_get_thread_num
character*18, external ::mem_j18
thread = omp_get_thread_num ()
a = 0.0
allocate( b, source=a )
write (*,11) i, thread, storage_size(a) * size(a) / 8, mem_j18(loc(s)), mem_j18(loc(t)), mem_j18(loc(a)), mem_j18(loc(b))
11 format ( i3, i3, i15, 4a19 )
end subroutine
end program
character*18 function mem_j18 (value)
!
! Convert a memory address into a character string +n,nnn,nnn,nnn:n,nnn
!
implicit none
integer*8 :: value
integer*8 :: page, offset
character :: cp*18, co*18
character, external :: bus_j18*18
!
page = value / 4096
offset = value - page*4096
cp = bus_j18 (page)
co = bus_j18 (offset)
mem_j18 = cp(7:18) // ':' // co(14:18)
end function mem_j18
character*18 function bus_j18 (value)
!
! Convert a long integer into a character string +n,nnn,nnn,nnn,nnn
!
implicit none
integer*8 :: value
character :: number*26
character, external :: bus_j26*26
number = bus_j26 (value)
call fix_power_string (number, 8)
bus_j18 = number(9:)
end function bus_j18
character*26 function bus_j26 (value)
!
! Convert a long integer into a character string +n,nnn,nnn,nnn,nnn,nnn,nnn
! 10^19 = 2^64
!
implicit none
integer*8 :: value
!
character :: number*26
integer*8 :: ten = 10, k
integer*4 :: n, digit, zero, minus, comma
!
if (value >= 0) then
k = value
minus = 0
else
k = -value
minus = 1
end if
zero = ichar ('0')
n = len(number) ! last character
comma = mod(n+1,4) ! comma position
number = ' '
do
digit = mod (k,ten)
number(n:n) = char (digit+zero)
n = n-1
k = k/ten
if (k==0) exit
if (n<=minus) then ! overflow
number = repeat ('#',len(number))
exit
end if
if (mod(n,4) == comma) then
number(n:n) = ','
n = n-1
end if
end do
if (value < 0) then
if ( n < 1 ) then
number(1:1) = '#'
else
number(n:n) = '-'
end if
end if
!
bus_j26 = number
!
end function bus_j26
subroutine fix_power_string (number, k)
!
! fix that string fits in number(k+1:)
!
implicit none
character number*(*)
integer :: k, n
if ( number(k:k) == ' ' ) return
n = len(number)
do
if ( number(k:k) == ' ' ) exit
if ( number(n-3:n-3) == ',' ) then
number = ' ' // number(:n-4) // 'k'
else if ( number(n:n) == 'k' ) then
number = ' ' // number(:n-5) // 'm'
else if ( number(n:n) == 'm' ) then
number = ' ' // number(:n-5) // 'g'
else if ( number(n:n) == 'g' ) then
number = ' ' // number(:n-5) // 't'
else
number = repeat ('#',n)
number(1:k) = ' '
exit
end if
end do
end subroutine fix_power_string
set program=%1
set tce=load_gf.tce
set load=gfortran
set options=-g -fimplicit-none -O2 -march=native -ffast-math -fopenmp -fstack-arrays -o %program%.exe
set stack_options=-Wl,-stack,536870912,-Map=%program%.map
del %program%.o
del %program%.map
del %program%.exe
echo #### Start of New Test ##################################################################################### >> %tce%
now >> %tce%
%load% %program%.f90 %options% %stack_options% >>%tce% 2>&1
dir %program%.* /od >> %tce%
echo ================================================================ >> %tce%
now >> %tce%
set g >> %tce%
set options >> %tce%
%program% >>%tce%
notepad %tce%
#### Start of New Test #####################################################################################
It is now Friday, 20 October 2023 at 19:33:38.443
Volume in drive C has no label.
Volume Serial Number is 88FF-836A
Directory of C:\temp\parallel
20/10/2023 07:33 PM 3,663 teststack.f90
20/10/2023 07:33 PM 15,208 teststack.obj
20/10/2023 07:33 PM 3,473,432 teststack.exe
20/10/2023 07:33 PM 891,249 teststack.map
4 File(s) 4,383,552 bytes
0 Dir(s) 232,296,456,192 bytes free
================================================================
It is now Friday, 20 October 2023 at 19:33:38.702
gcc.ver=11.1.0
gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
gcc_path=C:\Program Files (x86)\gcc_eq\gcc_11.1.0\bin;C:\Program Files (x86)\gcc_eq\gcc_11.1.0\libexec\gcc\x86_64-w64-mingw32\11.1.0
options=-g -fimplicit-none -O2 -march=native -ffast-math -fopenmp -fstack-arrays -o teststack.exe
id th size loc_s : stack loc_t : stack loc_a : stack loc_b : heap
6 3 256 786,415:2,956 786,415:2,960 786,415:2,688 264,348: 160
4 2 64 655,343:2,956 655,343:2,960 655,343:2,880 264,348: 432
2 1 16 395,423:2,956 395,423:2,960 395,423:2,928 264,348: 512
0 0 4 132,911:2,748 132,911:2,752 132,911:2,720 264,348: 544
8 4 1024 917,487:2,956 917,487:2,960 917,487:1,920 264,348: 576
10 5 4096 1,048,559:2,956 1,048,559:2,960 1,048,558:2,944 264,349:3,776
7 3 512 786,415:2,956 786,415:2,960 786,415:2,432 395,424: 496
5 2 128 655,343:2,956 655,343:2,960 655,343:2,816 264,350:3,792
3 1 32 395,423:2,956 395,423:2,960 395,423:2,912 264,350:3,936
12 6 16384 1,179,631:2,956 1,179,631:2,960 1,179,627:2,944 395,424:2,128
1 0 8 132,911:2,748 132,911:2,752 132,911:2,720 395,428:2,400
9 4 2048 917,487:2,956 917,487:2,960 917,487: 896 395,429:2,688
11 5 8192 1,048,559:2,956 1,048,559:2,960 1,048,557:2,944 395,430: 656
14 7 65536 1,310,703:2,956 1,310,703:2,960 1,310,687:2,944 395,432: 672
13 6 32768 1,179,631:2,956 1,179,631:2,960 1,179,623:2,944 395,448: 688
15 7 131072 1,310,703:2,956 1,310,703:2,960 1,310,671:2,944 395,429:2,688
16 8 262144 1,441,775:2,956 1,441,775:2,960 1,441,711:2,944 395,429:2,688
17 8 524288 1,441,775:2,956 1,441,775:2,960 1,441,647:2,944 395,429:2,688
18 9 1048576 1,572,847:2,956 1,572,847:2,960 1,572,591:2,944 395,681: 64
19 10 2097152 1,703,919:2,956 1,703,919:2,960 1,703,407:2,944 395,694: 64
20 11 4194304 1,834,991:2,956 1,834,991:2,960 1,833,967:2,944 395,694: 64