Allocatable vs adjustable

You could have separate memory blocks for integers and reals. That would be legal, but it does increase the amount of book-keeping.

I think that’s only one impediment. Compilers are generally able to multi-version for array contiguity, at least in simple cases, but the choice of when to multi-version depends on flags and internal heuristics. The more serious impediment is the aliasing of pointer targets, although this should be easier to resolve in Fortran, since the target attribute is required for pointer association. One more factor is the scope in which the pointer is used. In a local scope it may be simple to prove that a pointer array does not alias anything. But it is easy to create a context where such a proof is not possible:

module a_mod
   real, pointer :: a(:)
end module

module b_mod
   real, pointer :: b(:)
end module

program aliasing
   use a_mod
   use b_mod
   implicit none
   real, allocatable :: c(:)
   external :: initialization
   allocate(c(1000))
   call initialization(1000)   ! maybe a => b, or b => a, no way of knowing...
   c = a + b
   print *, maxval(c)
end program

One way to resolve such issues of sub-optimal auto-vectorization is to exploit the basic rules of dummy-argument aliasing and perform the work in a subroutine. This has been discussed here before: Fortran pointers, and aliasing - #9 by lkedward. So in the snippet above, we could pack the sum into a subprogram:

call initialization(1000)   ! maybe a => b, or b => a, no way of knowing...
call do_sum(a, b, c)        ! we guarantee that a and b aren't associated
print *, maxval(c)
contains
   subroutine do_sum(a, b, c)
      real, intent(in)  :: a(:), b(:)   ! adding contiguous here can help
      real, intent(out) :: c(:)
      c = a + b
   end subroutine
end program
1 Like

Actually, dummy arguments that are not modified in the routine are allowed to alias. So here a and b could still be associated with each other, but c could not be associated with a or b.

1 Like

Right. In that case I should have picked a slightly more complex example,

a = a + b*c

where the pointer is being written to. But I hope the main idea of using subprograms to provide aliasing guarantees was clear.
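
For completeness, a minimal sketch of that variant (the name do_update is mine): since a is now modified through its dummy argument, the caller is promising that a does not alias b or c.

subroutine do_update(a, b, c)
   real, intent(inout) :: a(:)        ! written to: must not alias b or c
   real, intent(in)    :: b(:), c(:)
   a = a + b*c
end subroutine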

1 Like

Yes, you are aliasing, but without some of the dangers that come with pointers. You are just allowing the compiler to generate another name/symbol for something, and that alias could be restricted to being local to a routine. I think the issues with aliasing come not so much from the ability to provide an alternative name for something as from the fact that, with pointers, you can change the association of that name to another variable within the same scoping unit. I don’t see how what I’m proposing is much different from what is in the majority of pre-F90 CFD codes I’ve seen over the years. Example:

integer nq, ncells
parameter (nq=3, ncells=10)
real work(ncells, nq), f(ncells, 3)

call flux(ncells, work(:,1), work(:,2), work(:,3), f)

end

subroutine flux(nc, rho, rhou, rhoE, f)
integer nc
real rho(nc), rhou(nc), rhoE(nc), f(nc,3)
! compute flux into f
end

Basically, what I’m asking for is something that works more like argument association and less like a pointer. And yes, I know argument association is technically a form of aliasing, but code like my example has worked for 50-plus years without resorting to a CONTIGUOUS statement or pointers.

I can’t really see the difference. At the moment you can use different names to access the same memory locations, and the compiler has a harder time with some optimizations when these memory locations have to be written. Maybe having a single alias can make things easier, but it doesn’t fundamentally change the problem IMHO.

Well, it is different. With argument association, you are promising the compiler that a dummy argument that is modified is not aliased with any other dummy argument. So at the very least a similar contract should exist for an alias feature (e.g. after a(1:5,1:2) => work(1:10), referring to work(1:10) should not be allowed).
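
For reference, a minimal sketch of the pointer remapping used in that example (rank remapping of a contiguous rank-1 target, an F2008 feature):

real, target  :: work(10)
real, pointer :: a(:,:)
a(1:5,1:2) => work(1:10)   ! a(i,j) and work(k) now share storage
a(1,1) = 0.0               ! also visible as work(1)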

I have built a small test program to see the penalty of using an allocatable array vs an automatic array. A function is repeatedly called to allocate and initialize a real array (automatic or allocatable), with sizes from 2^0 to 2^{30} elements. The function returns something just to prevent the compiler from optimizing the allocation out entirely.

program ap
use omp_lib
implicit none

   integer :: n , m, i, k, lu
   real :: s
   double precision :: tic, toc
   
   open(newunit=lu,file="/dev/null")
   
   do k = 0, 30
   
      n = 2**k
      m = 2**(30-k)
      
      write(*,"(A,I2,A,I2)",advance='no') "2**", 30-k, " allocations of array size = 2**", k
      
      s = 0.0
      tic = omp_get_wtime()
      do i = 1, m
         s = s + with_auto_array(n)
      end do
      toc = omp_get_wtime()
   
      write(*,"(A,F8.4,A)",advance='no') " Automatic: ", toc-tic, " sec."
      write(lu,*) s
      
      s = 0.0
      tic = omp_get_wtime()
      do i = 1, m
         s = s + with_alloc_array(n)
      end do
      toc = omp_get_wtime()
  
      write(*,"(A,F8.4,A)",advance='no') " Allocatable: ", toc-tic, " sec."
      write(lu,*) s

      write(*,*)
   end do
   
contains
   
   real function with_auto_array(n) result(r)
      integer, intent(inout) :: n
      real :: tmp(n)
      tmp(:) = 1.0
      r = tmp(n/2)
   end function
   
   real function with_alloc_array(n) result(r)
      integer, intent(in) :: n
      real, allocatable :: tmp(:)
      allocate( tmp(n) )
      tmp(:) = 1.0
      r = tmp(n/2)
   end function

end

Here is what I obtain with gfortran 14 on an old Mac (2011):

$ gfortran -Ofast -march=native -flto -fopenmp allocatable_penalty.f90 && ./a.out 
2**30 allocations of array size = 2** 0 Automatic:   3.3944 sec. Allocatable:  96.4065 sec.
2**29 allocations of array size = 2** 1 Automatic:   1.5807 sec. Allocatable:  48.1466 sec.
2**28 allocations of array size = 2** 2 Automatic:   0.7873 sec. Allocatable:  23.4161 sec.
2**27 allocations of array size = 2** 3 Automatic:   0.3925 sec. Allocatable:  11.8151 sec.
2**26 allocations of array size = 2** 4 Automatic:   0.2377 sec. Allocatable:   5.9888 sec.
2**25 allocations of array size = 2** 5 Automatic:   0.1537 sec. Allocatable:   3.0879 sec.
2**24 allocations of array size = 2** 6 Automatic:   0.1209 sec. Allocatable:   2.2891 sec.
2**23 allocations of array size = 2** 7 Automatic:   0.1084 sec. Allocatable:   1.4512 sec.
2**22 allocations of array size = 2** 8 Automatic:   0.0992 sec. Allocatable:   0.4809 sec.
2**21 allocations of array size = 2** 9 Automatic:   0.1055 sec. Allocatable:   0.2859 sec.
2**20 allocations of array size = 2**10 Automatic:   0.0985 sec. Allocatable:   0.1870 sec.
2**19 allocations of array size = 2**11 Automatic:   0.0929 sec. Allocatable:   0.1438 sec.
2**18 allocations of array size = 2**12 Automatic:   0.0941 sec. Allocatable:   0.1185 sec.
2**17 allocations of array size = 2**13 Automatic:   0.1032 sec. Allocatable:   0.1555 sec.
2**16 allocations of array size = 2**14 Automatic:   0.1523 sec. Allocatable:   0.1626 sec.
2**15 allocations of array size = 2**15 Automatic:   0.1511 sec. Allocatable:   0.1923 sec.
2**14 allocations of array size = 2**16 Automatic:   0.1479 sec. Allocatable:   0.1805 sec.
2**13 allocations of array size = 2**17 Automatic:   0.2161 sec. Allocatable:   0.2179 sec.
2**12 allocations of array size = 2**18 Automatic:   0.1996 sec. Allocatable:   0.1993 sec.
2**11 allocations of array size = 2**19 Automatic:   0.1964 sec. Allocatable:   0.2003 sec.
2**10 allocations of array size = 2**20 Automatic:   0.2213 sec. Allocatable:   0.2327 sec.
2** 9 allocations of array size = 2**21 Automatic:   0.4185 sec. Allocatable:   1.1132 sec.
2** 8 allocations of array size = 2**22 Automatic:   0.5092 sec. Allocatable:   1.1668 sec.
2** 7 allocations of array size = 2**23 Automatic:   0.4943 sec. Allocatable:   1.1686 sec.
2** 6 allocations of array size = 2**24Illegal instruction: 4

There’s a huge overhead with very small allocatable arrays. The times become similar starting from a few thousand elements. Then, above a few million elements, the allocatable arrays take more time again.

gfortran here can’t allocate an automatic array of 2^{24} elements, that is 64 MiB. This is consistent with the macOS stack size, which I’ve set to the maximum possible and which is slightly less than that.

5 Likes

Same on a Linux machine with ifort 21. I set the stack size to “unlimited”. The observations are similar:

% ifort -Ofast -march=native -fopenmp allocatable_penalty.f90 && ./a.out 
2**30 allocations of array size = 2** 0 Automatic:   5.3433 sec. Allocatable:  73.1021 sec.
2**29 allocations of array size = 2** 1 Automatic:   2.9513 sec. Allocatable:  36.6175 sec.
2**28 allocations of array size = 2** 2 Automatic:   1.1157 sec. Allocatable:  18.8595 sec.
2**27 allocations of array size = 2** 3 Automatic:   0.5681 sec. Allocatable:   9.6591 sec.
2**26 allocations of array size = 2** 4 Automatic:   0.1547 sec. Allocatable:   4.3573 sec.
2**25 allocations of array size = 2** 5 Automatic:   0.0876 sec. Allocatable:   2.1702 sec.
2**24 allocations of array size = 2** 6 Automatic:   0.0542 sec. Allocatable:   1.1869 sec.
2**23 allocations of array size = 2** 7 Automatic:   0.0412 sec. Allocatable:   0.5767 sec.
2**22 allocations of array size = 2** 8 Automatic:   0.0413 sec. Allocatable:   0.3104 sec.
2**21 allocations of array size = 2** 9 Automatic:   0.0411 sec. Allocatable:   0.1814 sec.
2**20 allocations of array size = 2**10 Automatic:   0.0411 sec. Allocatable:   0.1103 sec.
2**19 allocations of array size = 2**11 Automatic:   0.0420 sec. Allocatable:   0.0943 sec.
2**18 allocations of array size = 2**12 Automatic:   0.0416 sec. Allocatable:   0.0691 sec.
2**17 allocations of array size = 2**13 Automatic:   0.0525 sec. Allocatable:   0.0992 sec.
2**16 allocations of array size = 2**14 Automatic:   0.0699 sec. Allocatable:   0.0830 sec.
2**15 allocations of array size = 2**15 Automatic:   0.0715 sec. Allocatable:   0.0774 sec.
2**14 allocations of array size = 2**16 Automatic:   0.0702 sec. Allocatable:   0.0729 sec.
2**13 allocations of array size = 2**17 Automatic:   0.0691 sec. Allocatable:   0.0710 sec.
2**12 allocations of array size = 2**18 Automatic:   0.0993 sec. Allocatable:   0.1003 sec.
2**11 allocations of array size = 2**19 Automatic:   0.1748 sec. Allocatable:   0.1755 sec.
2**10 allocations of array size = 2**20 Automatic:   0.1767 sec. Allocatable:   0.1784 sec.
2** 9 allocations of array size = 2**21 Automatic:   0.1884 sec. Allocatable:   0.1816 sec.
2** 8 allocations of array size = 2**22 Automatic:   0.3155 sec. Allocatable:   0.2170 sec.
2** 7 allocations of array size = 2**23 Automatic:   0.4841 sec. Allocatable:   0.4718 sec.
2** 6 allocations of array size = 2**24 Automatic:   0.5471 sec. Allocatable:   0.9742 sec.
2** 5 allocations of array size = 2**25 Automatic:   0.5479 sec. Allocatable:   1.0794 sec.
2** 4 allocations of array size = 2**26 Automatic:   0.5632 sec. Allocatable:   1.5333 sec.
2** 3 allocations of array size = 2**27 Automatic:   0.6129 sec. Allocatable:   2.2037 sec.
2** 2 allocations of array size = 2**28 Automatic:   0.7594 sec. Allocatable:   2.1327 sec.
2** 1 allocations of array size = 2**29 Automatic:   0.9342 sec. Allocatable:   3.2511 sec.
2** 0 allocations of array size = 2**30 Automatic:   1.6200 sec. Allocatable:   2.5826 sec.

@PierU beautiful, thanks for the benchmarks.

If you have time, can you also benchmark small arrays with automatic reallocation of the LHS (the default in GFortran) vs disabling it with -fno-realloc-lhs? Use code that doesn’t need to reallocate. I always wondered whether there is an overhead and how big it is.
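
Something along these lines, perhaps (a sketch, with a subroutine name of my choosing); the assignment never actually triggers a reallocation, so any timing difference between the two compilations would be the cost of the shape check itself:

subroutine bench_assign(x, y, m)
   real, allocatable, intent(inout) :: x(:)   ! pre-allocated to size(y)
   real, intent(in) :: y(:)
   integer, intent(in) :: m
   integer :: i
   do i = 1, m
      x = y   ! with realloc-lhs, the shape check runs on every iteration
   end do
end subroutine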

I’m surprised that dynamic stack allocation is still recommended. It exists in C++, but it’s so dangerous that it is not commonly taught, so few people are aware of alloca(). Stack allocation using the syntax int arr[sz], where sz is a variable, is part of the C99 standard, but in C++ it is only available as a GCC extension.

This is something that the programmer has control over. If the LHS is written as array =, then the compiler must test for the possibility of reallocation. If it is written as array(:) =, then the compiler cannot reallocate, so presumably all such testing is eliminated (unless enabled as a debug tool). The programmer must, of course, know that the LHS is already allocated to the correct size.
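
In other words (a minimal illustration):

real, allocatable :: a(:), b(:)
allocate(a(10), b(10))
b = 1.0
a = b      ! compiler must check shapes and possibly reallocate a
a(:) = b   ! ordinary assignment: a must already have the right shape,
           ! so no reallocation machinery is involved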

2 Likes

I thought that if the LHS is written array= and array is allocatable, then it need not be already allocated.

Yes, that’s correct. But since the behavior may be a (surprising) deviation from what Fortran 90/95 did, some compilers have a flag to enable or disable the feature. Something similar happens with recursion (which became the default in Fortran 2018, but required an explicit keyword in previous standards).

ifort/ifx has a -standard-semantics flag that makes the compiler behave the way the Fortran standard expects. As a side effect, the flag also disables some behavior inherited from DEC (and Microsoft FPS).

When the feature is enabled, reallocation might or might not occur, since the compiler should check for a change in shape.

In my experience, performance loss due to reallocation of small arrays is negligible (as in, a few microseconds).

But if thousands of elements are expected and the final size is unknown, then the performance loss is significant (as in, a few seconds per thousand elements). A dynamic-array pattern can be applied to mitigate the issue.
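
A minimal sketch of that pattern (the names are illustrative): keep the capacity larger than the used length and grow it geometrically, so the amortized cost per appended element stays constant.

subroutine push(arr, nused, val)
   real, allocatable, intent(inout) :: arr(:)
   integer, intent(inout) :: nused
   real, intent(in) :: val
   real, allocatable :: tmp(:)
   if (.not. allocated(arr)) allocate(arr(16))
   if (nused == size(arr)) then
      allocate(tmp(2*size(arr)))    ! double the capacity
      tmp(1:nused) = arr(1:nused)
      call move_alloc(tmp, arr)     ! F2003: transfers the allocation, no copy
   end if
   nused = nused + 1
   arr(nused) = val
end subroutine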

My understanding is that the potential problems with stack allocation in C++ come from alloca(), which can easily be misused, e.g. by having a global pointer manage some stack memory. Such misuse is not possible in Fortran: the scope of any object allocated on the stack is always local to the routine where the object is declared, and the compiler takes care of all the allocations and deallocations on its own.

Right. But it should be noted that for valid Fortran 95 code compiled with an F2003+ compiler, no reallocation on assignment should occur.

! F95-valid code
real, allocatable :: x(:), y(:)
...
x = y ! legal in F95 only if x and y have the exact same shape.
      ! In F2003+, shape conformance is tested, but no reallocation
      ! occurs if the shapes conform.

Not really. The problem is an allocation whose size you cannot predict. It creates nasty bugs that are extremely difficult to find in large code bases.

I don’t know about C++, but in Fortran the only problem with stack allocation is when it results in a stack overflow. My experience with both the Intel compiler and gfortran is that the program crashes as soon as that happens, so it’s not that difficult to diagnose under a debugger.

Nonetheless, as a general rule I would recommend using automatic arrays only for small arrays (say, fewer than 10k elements). In routines that are called recursively with a large recursion depth, it’s probably better not to use automatic arrays at all.

3 Likes

In addition, I think it should be possible to implement a debug check in a compiler for stack arrays, so that you get a nice runtime error with a stack trace instead of a segfault when you run out of stack. That way you know where the problem is and can decide as a user how to handle it, whether with some compiler options or by changing the code.

Regarding x = y, the part that seems not clean is that if those arrays are not allocatable, then I think there is no overhead; just by making them allocatable you suddenly incur a runtime check. The other issue is that by using x = y instead of x(:,:) = y the code might still work, but you might accidentally reallocate the LHS when you didn’t intend to. In other words, it can hide bugs (maybe you meant to allocate it to the correct size, but you made an off-by-one error). This feature seems useful for interactive mode (say, in LFortran) or for some quick testing, but I can’t think of a time when it was ever useful to me in production codes. For performance reasons I always want to control the allocation manually and ensure arrays have the right length before doing computations on them. Maybe it is needed for returning allocatable arrays from a function? I usually prefer a subroutine with an intent(out), allocatable argument, where the semantics are clear and clean.
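
A minimal sketch of that last pattern (the names are mine): an intent(out) allocatable dummy is automatically deallocated on entry, so the subroutine has full control over the allocation.

subroutine compute(res, n)
   real, allocatable, intent(out) :: res(:)   ! deallocated on entry
   integer, intent(in) :: n
   allocate(res(n))
   res = 0.0   ! fill with the actual results
end subroutine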

3 Likes

@PierU
I am getting a different relative performance between automatic and allocatable arrays on Windows 10 with gfortran 14.2.

My modified code is

 module loc_tmp
   integer*8 :: loc_auto, loc_alloc
 end module loc_tmp

program ap
use iso_fortran_env
use omp_lib
use loc_tmp
implicit none

   integer :: n , m, i, k, lu
   real :: s, au(5)
   real, allocatable :: al(:)
   double precision :: tic, toc
   
   write (*,*) compiler_version ()
   write (*,*) compiler_options ()

!z   open(newunit=lu,file="/dev/null")
   open (newunit=lu,file="dev_null.txt")
   
   do k = 0, 30
   
      n = 2**k
      m = 2**(30-k)
      
      write(*,"(A,I2,A,I2)",advance='no') "2**", 30-k, " allocations of array size = 2**", k
      
      s = 0.0
      tic = my_get_wtime()
      do i = 1, m
         s = s + with_auto_array(n)
      end do
      toc = my_get_wtime()
   
      write(*,"(A,F8.4,A)",advance='no') " Automatic: ", toc-tic, " sec."
      write(lu,*) n, s, toc-tic, loc_auto
      
      s = 0.0
      tic = my_get_wtime()
      do i = 1, m
         s = s + with_alloc_array(n)
      end do
      toc = my_get_wtime()
  
      write(*,"(A,F8.4,A)",advance='no') " Allocatable: ", toc-tic, " sec."
      write(lu,*) n, s, toc-tic, loc_alloc

      write(*,*)
   end do

   allocate ( al(5))
   write (lu,*) loc(s), loc(au), loc(al)
   
contains
   
   real function with_auto_array(n) result(r)
      integer, intent(inout) :: n
      real :: tmp(n)
      tmp(:) = 1.0
      loc_auto = loc(tmp)
      r = tmp(n/2)
   end function
   
   real function with_alloc_array(n) result(r)
      integer, intent(in) :: n
      real, allocatable :: tmp(:)
      allocate( tmp(n) )
      tmp(:) = 1.0
      loc_alloc = loc(tmp)
      r = tmp(n/2)
   end function

   real*8 function my_get_wtime ()
     integer*8 :: clock, rate
     call system_clock ( clock, rate )
     my_get_wtime = dble (clock) / dble (rate)
   end function my_get_wtime
  
end

The reported results are:

 GCC version 14.2.0
 -cpp -iprefix C:/Program Files (x86)/gcc_eq/gcc_14.2.0/bin/../lib/gcc/x86_64-w64-mingw32/14.2.0/ -U_REENTRANT -mtune=generic -march=x86-64 -fno-underscoring -fdollar-ok
2**30 allocations of array size = 2** 0 Automatic:  33.9688 sec. Allocatable:  37.5021 sec.
2**29 allocations of array size = 2** 1 Automatic:  17.2286 sec. Allocatable:  18.8676 sec.
2**28 allocations of array size = 2** 2 Automatic:   8.8427 sec. Allocatable:   9.5493 sec.
2**27 allocations of array size = 2** 3 Automatic:   4.6037 sec. Allocatable:   4.8974 sec.
2**26 allocations of array size = 2** 4 Automatic:   2.4854 sec. Allocatable:   2.5678 sec.
2**25 allocations of array size = 2** 5 Automatic:   1.3920 sec. Allocatable:   1.5294 sec.
2**24 allocations of array size = 2** 6 Automatic:   0.8679 sec. Allocatable:   0.9873 sec.
2**23 allocations of array size = 2** 7 Automatic:   0.6288 sec. Allocatable:   0.7521 sec.
2**22 allocations of array size = 2** 8 Automatic:   0.5132 sec. Allocatable:   0.6007 sec.
2**21 allocations of array size = 2** 9 Automatic:   0.4020 sec. Allocatable:   0.5218 sec.
2**20 allocations of array size = 2**10 Automatic:   0.3779 sec. Allocatable:   0.4836 sec.
2**19 allocations of array size = 2**11 Automatic:   0.3513 sec. Allocatable:   0.4638 sec.
2**18 allocations of array size = 2**12 Automatic:   0.3502 sec. Allocatable:   0.4545 sec.
2**17 allocations of array size = 2**13 Automatic:   0.3418 sec. Allocatable:   0.4591 sec.
2**16 allocations of array size = 2**14 Automatic:   0.3487 sec. Allocatable:   0.4519 sec.
2**15 allocations of array size = 2**15 Automatic:   0.3481 sec. Allocatable:   0.4492 sec.
2**14 allocations of array size = 2**16 Automatic:   0.3439 sec. Allocatable:   0.4478 sec.
2**13 allocations of array size = 2**17 Automatic:   0.3455 sec. Allocatable:   0.4470 sec.
2**12 allocations of array size = 2**18 Automatic:   0.9597 sec. Allocatable:   1.0474 sec.
2**11 allocations of array size = 2**19 Automatic:   0.8920 sec. Allocatable:   0.9780 sec.
2**10 allocations of array size = 2**20 Automatic:   0.8650 sec. Allocatable:   0.9525 sec.
2** 9 allocations of array size = 2**21 Automatic:   0.8597 sec. Allocatable:   0.9464 sec.
2** 8 allocations of array size = 2**22 Automatic:   0.8680 sec. Allocatable:   0.9352 sec.
2** 7 allocations of array size = 2**23 Automatic:   0.8531 sec. Allocatable:   0.9442 sec.
2** 6 allocations of array size = 2**24 Automatic:   0.8267 sec. Allocatable:   0.9371 sec.
2** 5 allocations of array size = 2**25 Automatic:   0.8154 sec. Allocatable:   0.9656 sec.
2** 4 allocations of array size = 2**26 Automatic:   0.8510 sec. Allocatable:   0.9558 sec.
2** 3 allocations of array size = 2**27 Automatic:   0.8623 sec. Allocatable:   0.9479 sec.
2** 2 allocations of array size = 2**28 Automatic:   0.8210 sec. Allocatable:   0.9345 sec.
2** 1 allocations of array size = 2**29 Automatic:   0.8515 sec. Allocatable:   0.9458 sec.
2** 0 allocations of array size = 2**30 Automatic:   0.7930 sec. Allocatable:   0.9772 sec.

Notes:
I have no idea what stack size I am getting, but the compiler options I used are reported above.
tic = omp_get_wtime() has an awful implementation on Windows, hence the system_clock-based my_get_wtime().
The run with 2**30 allocations of array size 2**0 produces a lot of errors from r = tmp(n/2) (which is tmp(0) when n = 1).
2**30 automatic reals should be a 4-GByte array on the stack?
The memory addresses reported suggest the automatic arrays are on the heap for my Windows / gfortran 14.2.

Using allocatable arrays is a more robust approach for large arrays. I only use automatic arrays for “small” problems.

1 Like

So, it looks like gfortran on Windows puts all automatic arrays on the heap, regardless of their size…

Ah, yes, I missed that bug. r = tmp(n/2+1) would fix it.

1 Like

Gfortran versions 11.1.0 and 14.2.0 both appear to use the heap for the automatic arrays in this program, even with the new compile options of:

set program=%1

del %program%.exe

set vec=-fimplicit-none -O3 -march=native -ffast-math -fopenmp -fstack-arrays
set basic=-g -fimplicit-none -O2 -march=native -fopenmp

gfortran %program%.f90 %basic% -o %program%.exe

%program% > %program%.log

type %program%.log
type dev_null.txt

I have obtained these builds from equation.com. I am not sure why this is the case.

The stack and the heap are just different memory addresses when using arrays for computation.
If you use -fopenmp on Windows, each thread stack can be in a significantly different location; it is just a virtual memory address!
ifort has a similar approach for thread stacks.
In the past I have set all thread stacks to 500 MBytes, but this move of automatic arrays to the heap could remove that need.
I should resurrect my heap vs stack analysis for private arrays from 2019.