Difference between automatic arrays on the heap and allocatable arrays

Is there any practically significant difference between automatic arrays saved on the heap and allocatable arrays?

As an example, consider the following code (edit: I added functions with array results inspired by the comment of @RonShepard).

module s_mod
implicit none
private
public :: sf_auto, sf_alloc, af_auto, af_alloc

contains

function sf_auto(n) result(s)
integer, intent(in) :: n
real :: s
real :: x_auto(n)
x_auto = 1.0
s = sum(x_auto)
end function sf_auto

function sf_alloc(n) result(s)
integer, intent(in) :: n
real :: s
real, allocatable :: x_alloc(:)
allocate (x_alloc(n))
x_alloc = 1.0
s = sum(x_alloc)
deallocate (x_alloc)
end function sf_alloc

function af_auto(n) result(a_auto)
integer, intent(in) :: n
real :: a_auto(n)
a_auto = 1.0
end function af_auto

function af_alloc(n) result(a_alloc)
integer, intent(in) :: n
real, allocatable :: a_alloc(:)
allocate (a_alloc(n))
a_alloc = 1.0
end function af_alloc

end module s_mod

program test
use :: s_mod, only:sf_auto, sf_alloc, af_auto, af_alloc
implicit none
integer :: i
integer, parameter :: n = 20

do i = 1, n
    write (*, *) 2**i, sf_auto(2**i)
end do

do i = 1, n
    write (*, *) 2**i, sf_alloc(2**i)
end do

do i = 1, n
    write (*, *) 2**i, sum(af_auto(2**i))
end do

do i = 1, n
    write (*, *) 2**i, sum(af_alloc(2**i))
end do

end program test

Name this piece of code as test.f90. Compile it with

gfortran test.f90

or

ifort -heap-arrays test.f90

In both cases, x_auto and a_auto will be saved on the heap. Then, is there any practically significant difference between sf_auto and sf_alloc? Is there any scenario in which they will behave noticeably differently? Are they actually “equivalent” on the assembly level with only minor differences? What if the code is compiled using other compilers while ensuring that sf_auto is saved on the heap?

This question is asked mostly out of curiosity, but I hope it is not totally irrelevant, recalling that @awvwgk mentioned in a previous thread that the implementation of stdlib uses automatic arrays where possible.

Thanks.

2 Likes

I would like to hear see what some of the compiler writers say about this issue. Usually, an automatic array is allocated on the stack, and when the function or subroutine returns, the entire stack frame is popped and everything is returned back for subsequent stack allocations. Local (non saved) allocatable arrays are similarly deallocated from the heap upon return.

But a function result must be a little different somehow. In this case, the automatic or allocatable entity must live a little longer since it is used in an expression in the calling program. How exactly does that work, is it different for automatic and allocatable function return results, and is there any practical (e.g. performance) difference?

3 Likes

Thank you for the comments. You are right. This case is even more interesting than the senario where the arrays are only local variables. I have edited my post to include such functions.

I did some brief exploration on compiler explorer. Compiler Explorer

It looks like both compilers treat the two cases differently. But also looks like gfortran unconditionally uses malloc for the automatic array, and ifort has some conditional logic to allow it to possibly go on the stack (I’m not an assembly programmer so I’m not sure).

2 Likes

From the perspective of the program’s operation, there is no difference. Using the heap involves some overhead - how much this matters depends on how often you call the routine and how much work it does - the more work in the routine, the less it matters how the array is allocated.

Using the heap has the practical effect of allowing much larger automatic arrays than you would get using the stack, as the stack is typically limited in size much more than the heap.

ifort allows you to specify a size breakpoint for when it uses the heap, but it requires that the size be known at compile time (there is no run-time check). I recommend leaving the size at zero (or not specifying it at all, as I have yet to see a case where it helps.

3 Likes

It is not uncommon for programs to die at run time because an automatic array, or even a stack allocation of an intermediate within an array expression, exceeds the stack limit. I think most compilers have this feature, not just ifort. It has always seemed to me that this should not be the default case for fortran. It seems like the default should be to attempt the stack allocation, and if that fails to resort to heap allocation, and then only if that fails to kill the process. Then the programmer should have some control, with combinations of compile time and run time settings, to force the compiler to use only stack allocation (e.g. if performance is critical) or heap allocation (e.g. if he knows that the allocation will always exceed stack limits, or if he wants to prioritize precious stack space for other entities). But I think the “default” behavior should be that a standard-conforming program “just works”.

1 Like

I totally agree. The intrinsic support of arrays is a nice feature of Fortran, and it should not be a feature that can easily lead to crashes.

For example, the following code will crash with a segmentation fault on Linux or Mac when i = 10 if compiled with ifort without any options. In today’s applications, a matrix of size 1024-by-1024 is really not that big. I would be surprised if a new user of Fortran is not surprised by this behavior.

Code:

module s_mod
implicit none
private
public :: sf_auto

contains

function sf_auto(n) result(s)
integer, intent(in) :: n
real(kind(0.0D0)) :: s
real(kind(0.0D0)) :: x_auto(n, n)
x_auto = 1.0D0
s = sum(x_auto)
end function sf_auto

end module s_mod

program test_mem
use :: s_mod, only:sf_auto
implicit none
integer :: i

do i = 1, 30
    write (*, *) 2**i, sf_auto(2**i)
end do

end program test_mem

Result:

$ uname -a && ifort --version && ifort test.f90 && ./a.out
Linux zP 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
ifort (IFORT) 2021.8.0 20221119
Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.

           2   4.00000000000000     
           4   16.0000000000000     
           8   64.0000000000000     
          16   256.000000000000     
          32   1024.00000000000     
          64   4096.00000000000     
         128   16384.0000000000     
         256   65536.0000000000     
         512   262144.000000000     
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
libc.so.6          00007F8B6540A520  Unknown               Unknown  Unknown
a.out              0000000000404275  Unknown               Unknown  Unknown
a.out              000000000040415D  Unknown               Unknown  Unknown
libc.so.6          00007F8B653F1D90  Unknown               Unknown  Unknown
libc.so.6          00007F8B653F1E40  __libc_start_main     Unknown  Unknown
a.out              0000000000404075  Unknown               Unknown  Unknown

Thank you @sblionel for the informative explanation.

It is great to hear that there is no practical difference between automatic arrays on the heap and allocatable arrays. This makes me determined to use automatic arrays whenever possible (in combination with compiler options like heap-arrays), following the same preference as the implementation of stdlib. Comparing sf_auto with sf_alloc or af_auto with af_alloc, it is clear to me that the automatic version is cleaner and more intuitive.

The price is that all arrays in the code would have the overhead of using the heap. This is acceptable in my applications if not negligible. In addition, using the stack has the risk of causing segmentation faults. In many applications, I think it is better to stay on the safe side.

Another price is the need for compiler options like -heap-arrays. For those who write code only for themselves, this is not a real problem. It is more of a concern if we aim to release the source code for others to use, since the users may use our code in part rather than as a whole, possibly missing the compiler options we specify. However, since there are many other options needed already, e.g., fp-model=strict, I do not think one more is a huge burden.

There are 2 aspects to this question;

  1. allocation of the array : defining automatic arrays is considered more efficient than allocatable arrays, but this relates more to stack vs heap. I would expect any allocation process for an array on the heap would be similar, so no difference.
  2. use of the array : when using an array, the memory address has little effect, although there may be some issue of short address vs long address. It is expected that small arrays on the stack may benefit from short address, but this is not significant for large arrays. For this reason, I would not place small arrays on the heap.

As arrays get larger, any performance difference would be more difficult to identify.

Another aspect of arrays on the heap has always interested me, especially for multi-thread, and that is alignment. Rather than 8-byte or 16-byte alignment, I would like to be able to select 4kbyte memory page alignment, especially if subsequent arrays are for a different thread. This could possibly affect performance, by seperating memory page use between L1 caches, although the potential gains are difficult to predict.
@sblionel, do you know of any consideration of this page alignment option for performance improvement ?

1 Like

Attention @greenrongreen , please keep the above two comments in mind along with what you may have come across the Priority Support request I had placed twice on behalf of my employer and the teams I have interacted with at work.

These are the teams for whom standard behavior is what they seek off-the-shelf, if at all they decide to use Fortran. That is, they longer want anything to do with the nonstandard legacy of FORTRAN and its dialects and nor do they want to bother any more with the history with IBM specific FORTRAN or the Microsoft days with PowerStation on Windows or DEC/Digital Fortran. As and when they use Fortran, they want the ISO / IEC standard behavior in their Fortran programs by default.

This was the basis of above-mentioned our two Priority Support requests for a driver that comes with Intel oneAPI stand-alone Fortran installer package or as part of oneAPI HPC toolkit. A driver, perhaps named IFS (!), that by default yields standard-conformance. This is as opposed to IFORT and IFX drivers where Intel retains backward compatibility with earlier nonstandard defaults to please its “big” customers and thus the users of IFORT and IFX who seek standard conformance need to apply -standard-semantics or some other combination options.

Regrettably both our Priority Support requests were denied.

Intel Fortran allows you to specify alignment for allocatable arrays, through the ATTRIBUTES ALIGN directive. If you are using AVX instructions, I think it aligns whole arrays on 64-byte boundaries, maybe 256? Of course, individual elements will just be “naturally aligned” based on their type.

If you have an application that needs something like 4K byte alignment, I’d suggest instead using the appropriate OS routine to allocate and use pointers. I’m skeptical that this will make a noticeable difference in typical applications. Indeed, you’re usually better off keeping memory references as close together as possible to improve cache performance.

One way to “avoid” the problem is to let the users deal with it by adding a work array dummy argument:

function sf_user(n,x_work) result(s)
integer, intent(in) :: n
real, intent(out) :: x_work(n)
x_work = 1.0
s = sum(x_work)
end function

Potential benefits of this approach are

  • users have a choice between static, automatic, and allocatable arrays
  • reusing existing allocations can lower memory consumption and pressure on the system memory allocator
  • possibility to use other allocators (e.g. tcmalloc or std::pmr::polymorphic_allocator) and exception-handling conventions, especially if calling the routine from other programming languages
  • allow your algorithm to be used on some embedded system where dynamic allocation is not supported

For Fortran users that prefer convenience over flexibility, you can always offer a light wrapper (or instruct them to write one themselves):

function sf_easy(n) result(s)
   integer, intent(in) :: n
   real :: x_work(n), s
   s = sf_user(n,x_work)
end function

You can look at this from the viewpoint of separation of concerns and orthogonality. The routine should encapsulate the algorithm, and not be concerned with differences between stack/heap allocations, and what happens if the stack limit is exceeded.

2 Likes

As far as the API is concerned, another approach is to make the work(:) array optional. Then the routine can use it if present(), and otherwise it can use is own workspace.

But there is still the question in that second case of whether the workspace is automatic (on the stack) or allocatable (on the heap). I still think a good compiler at run time should first try the stack, and if that fails it should use heap. A standard conforming program should not fail unless the resources are exhausted. A programmer cannot now test for the availability of either stack or heap space, those details are not even recognized by the standard.

2 Likes

Ron, Does such a “good compiler” actually exist ?

All compilers I have used (over a long time) will give a stack overflow if the stack is too small.
I have often asked for this capability, but never observed it available.

Even the option of requesting large heap arrays is only provided at compile time for local arrays of known size, so all automatic arrays either go on the stack or the heap.
I have not succesfully used an option of all arrays (both local and automatic) going on the heap, or an option of allocatable arrays being able to go on the stack.
My expectation is ALLOCATE arrays or save arrays always go on the heap, while local and automatic go on the stack.

1 Like

This is what I have always kept in mind. With these consequences:

  • use allocatable rather than automatic arrays as soon as they can be potentially “large”
  • when array syntax expressions can raise large temporary arrays, use loops instead

But then you will likely duplicate a lot of code depending on which one is used, or you will use a pointer to avoid the duplication, which is not fully satisfactory either.

I agree. But when releasing source code, you have no control on which compiler (a “good one” or not) will be used

It is possible on both Windows and Linux to make memory queries.

Here is the output from a program on a Windows platform, that uses a windows api.

Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel
(R) 64, Version 2021.8.0 Build 20221119_000000
Memory usage 11 %
Total physical 68,412,305,408
Available physical 60,602,290,176
Total page file 78,612,852,736
Available page file 69,079,638,016
Total virtual 140,737,488,224,256
Available virtual 140,733,141,639,168

Here is the equivalent on Linux.

ian@dell-5820:/mnt/c/document/fortran/4th_edition_update> ./ch4304_ifort_icx.out
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel
(R) 64, Version 2021.8.0 Build 20221119_000000
Total ram 68412305408 68,412,305,408
Free ram 60622983168 60,622,983,168
Share ram 0 0
Buffer ram 0 0
Total swap 127830847488 127,830,847,488
Free swap 127825920000 127,825,920,000
Total high 142548992 142,548,992
Free high 278528 278,528
ian@dell-5820:/mnt/c/document/fortran/4th_edition_update>

Jane and I used the C interop facilities to call the relevent C routines from Fortran.

Here is the Windows Fortran source

include ‘integer_kind_module.f90’
include ‘display_with_commas_module.f90’
include ‘memory_module_windows.f90’

program ch4303

use iso_fortran_env
use memory_module_windows
use display_with_commas_module

print *,compiler_version()
print *,’ Memory usage ‘,MemoryLoad(),’ %’
print *,’ Total physical ‘,display_with_commas(TotalPhysical())
print *,’ Available physical ‘,display_with_commas(AvailablePhysical())
print *,’ Total page file ‘,display_with_commas(TotalPageFile())
print *,’ Available page file ‘,display_with_commas(AvailablePageFile())
print *,’ Total virtual ‘,display_with_commas(TotalVirtual())
print *,’ Available virtual ',display_with_commas(AvailableVirtual())

end program ch4303

Here is the Linux Fortran source.

include ‘integer_kind_module.f90’
include ‘ch4304_memory_module_linux.f90’
include ‘display_with_commas_module.f90’

program ch4304

use iso_fortran_env
use memory_module_linux
use display_with_commas_module

print *,compiler_version()
print *,’ Total ram ‘,totalram() ,’ ‘,display_with_commas(totalram())
print *,’ Free ram ‘,freeram() ,’ ‘,display_with_commas(freeram())
print *,’ Share ram ‘,sharedram() ,’ ‘,display_with_commas(sharedram())
print *,’ Buffer ram ‘,bufferram() ,’ ‘,display_with_commas(bufferram())
print *,’ Total swap ‘,totalswap() ,’ ‘,display_with_commas(totalswap())
print *,’ Free swap ‘,freeswap() ,’ ‘,display_with_commas(freeswap())
print *,’ Total high ‘,totalhigh() ,’ ‘,display_with_commas(totalhigh())
print *,’ Free high ‘,freehigh() ,’ ',display_with_commas(freehigh())
end program ch4304

We can provide links to the complete set of source files used if people are interested.

We had the use of a Mac for a while, but couldn’t find any way of doing it on
that platform.