Allocatable vs adjustable

A big difference between automatic and allocatable arrays is that with an automatic array, the allocation is done once, when the procedure is entered (and yes, generally on the stack). Allocatables, on the other hand, are allocated either by an explicit ALLOCATE statement or implicitly when the variable is assigned to in an array assignment. Reallocation to a different shape can also occur during array assignment.

I tend to find allocatables handier to use because of the implicit allocation/reallocation feature. (Especially with character strings.) It can make for much cleaner code. Though when performance is critical, if the arrays will be of reasonable size, automatics have lower overhead.
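
A minimal sketch of both behaviors (the names are invented): the automatic array gets its size once on entry, while the allocatable picks up, and can change, its shape through assignment, which is especially convenient for deferred-length character strings.

program alloc_vs_automatic
    implicit none
    call demo(5)
contains
    subroutine demo(n)
        integer, intent(in) :: n
        real :: tmp(n)                    !! automatic array: sized once, on entry (typically on the stack)
        real, allocatable :: v(:)
        character(:), allocatable :: msg

        tmp = 0.0
        v = [1.0, 2.0, 3.0]               !! allocated implicitly by the assignment
        v = [v, 4.0]                      !! reallocated implicitly to the new shape
        msg = 'v now holds'               !! deferred-length character: no length bookkeeping
        print *, msg, size(v), 'elements, tmp holds', size(tmp)
    end subroutine demo
end program alloc_vs_automatic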

Using the -heap-arrays option of Intel Fortran will prevent the program from crashing in this case if there's not enough space on the stack. Note that intrinsic procedures called on very large arrays will still crash, though.


This can be useful for compiling and running an existing code that was previously built with another compiler that can allocate very large automatic arrays, but I would recommend not writing new code that relies on such very large automatic arrays.

I don't think so… This would be terribly annoying.

I was also surprised by that remark, since I started using the -heap-arrays flag precisely to avoid issues with the pack intrinsic.
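
For context, here is a minimal sketch of the kind of code involved (the size is made up): depending on the compiler and flags, the mask and/or the temporary holding the pack result may be placed on the stack, which is exactly what -heap-arrays is meant to avoid.

program pack_stack_demo
    implicit none
    integer, parameter :: n = 50000000     !! ~200 MB of reals, far larger than a default stack
    real, allocatable :: x(:), y(:)

    allocate( x(n) )
    call random_number( x )
    !! depending on the compiler and flags, the mask and/or the result temporary
    !! may be created on the stack; -heap-arrays moves such temporaries to the heap
    y = pack( x, x > 0.5 )
    print *, size( y )
end program pack_stack_demo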


I struggle with this regularly on Windows, and struggle to find good guidance. I try to avoid heap-arrays because of concerns about reduced speed, and try to manage with larger stack sizes. I know there is some guidance that 100 MB is a reasonable upper stack size limit; however, that still crashes intrinsic procedures for certain problem sets. I have been using 200 MB stack sizes for some programs with no issues so far. My user base also uses engineering workstations that have large RAM sizes, so that may not be the case for everyone.

Sorry, I must have misremembered. This other thread confirms that intrinsic procedures are fine when -heap-arrays is used.

I think that part of the issue is that -heap-arrays is a very simple concept: arrays smaller than the specified limit go on the stack, while the others go on the heap. So there is no control over whether there is space for them, and the developer is left to check whether the worst-case scenario runs, which for numerical codes is user-dependent. Even that is not trivial, since I'm not aware of a tool to monitor stack usage, and therefore I don't know if a procedure brings the stack usage to 99%. Obviously, the solution is to set the limit to 0 and take the performance hit, which is suboptimal. It would be nice if -heap-arrays could be improved to check whether there is space on the stack and, if there is not enough, allocate on the heap regardless of the limit.

There's no free lunch… There is no perfect solution here that fits all use cases. A test on the available stack size also has an overhead.


I'm sure you're correct that checking every time would be a significant performance penalty. It does seem like there could be a way to fail more gracefully, though, where in case of a failure the program retries on the heap instead of exiting.

I'm sure this has been discussed before… and, if I had to guess, not pursued due to the added complexity.

This still means a test for each automatic array and for each call…

Since the behaviors and related options of the compilers tend to vary, let alone those of the OSs (on macOS it's not possible to set a stack size larger than ~64 MB), I tend to follow a conservative approach: automatic arrays only if their expected size is a few thousand bytes at most, and allocatable arrays otherwise.
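
In code, that rule of thumb looks roughly like this (a sketch with invented names, assuming m stays small while n can be large):

subroutine summarize( a, m, n )
    implicit none
    integer, intent(in) :: m, n        !! m is known to stay small, n may be huge
    real,    intent(in) :: a(m, n)
    real :: rowtot(m)                  !! automatic array: at most a few hundred bytes of stack
    real, allocatable :: coltot(:)     !! n-sized scratch: allocatable, so it lives on the heap

    allocate( coltot(n) )
    rowtot = sum( a, dim=2 )           !! one total per row
    coltot = sum( a, dim=1 )           !! one total per column
    print *, maxval( rowtot ), maxval( coltot )
end subroutine summarize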

The 'traditional' Cray machines of the 1970s-1990s only had a single address space for data. So the stack was kept in heap blocks. Normally just a single heap block that was 'big enough' was needed - as determined by the linker. (Add up local variable space in the various call chains, then use the max.) But in the case of recursion, automatic arrays, or the like, additional heap blocks could be allocated for stack use during run time. There were environment variables that could be set for tuning this - if any performance issues were observed.

I often lament that Fortran 77, and even Fortran 66, didn't have automatic arrays and didn't encourage stack-based local variables in general. These features were well known in ALGOL 60 circles, easy to implement, and would have saved a lot of programming headaches on the small-memory, non-virtual machines of the day.

It might be good to get some input into this discussion from compiler writers, but I think checking for stack overflow is just a single comparison of two (unsigned) integer values. That is some overhead, but only minimal. This is basically what was done manually in many f77-era codes that attempted to reuse efficiently whatever workspace memory was available for variable-sized arrays. It is also trivial to query how much stack space is available at any time, and programmers of that era used that information to tune their algorithms to fit the available resources.
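
As an illustration of that f77-era pattern (written here in modern syntax, with invented names): the caller owns one big workspace, each routine is handed a slice plus its length, and the routine checks that its piece fits before using it.

program work_demo
    implicit none
    integer, parameter :: lwork = 1000     !! whatever memory the program could afford
    real :: work(lwork)
    integer :: ierr

    call step( 200, work, lwork, ierr )
    if ( ierr /= 0 ) stop 'workspace too small'
    print *, 'work(1) =', work(1)
contains
    subroutine step( n, w, lw, ierr )
        integer, intent(in)    :: n, lw
        real,    intent(inout) :: w(lw)
        integer, intent(out)   :: ierr
        integer :: need, i

        need = 2*n                         !! this routine needs two length-n scratch vectors
        ierr = 0
        if ( need > lw ) then
            ierr = need                    !! report how much would have been required
            return
        end if
        do i = 1, n
            w(i)   = real(i)               !! first scratch vector
            w(n+i) = 0.0                   !! second scratch vector
        end do
    end subroutine step
end program work_demo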

Heap allocation can be expensive because it might involve searching a list of memory blocks for a free block sufficiently large for the array. Deallocation in turn can involve a form of garbage collection, combining smaller freed blocks into a single large free block to prepare for future allocations.

As for testing every automatic array, that is an implementation detail. One way is indeed to test each automatic array and, if it overflows, invoke a heap allocation. Another way is to compute all of the array addresses on the stack and then test just the last one: if everything fits, just use those stack addresses; if it overflows, fall back to testing and allocating the arrays one at a time, or simply allocate them all at once on the heap. There are probably other approaches too that can reduce the overall testing and heap-allocation effort.


For "allocatable vs adjustable" arrays, I think it may be useful to compare the performance obtained with the most recent compilers, because memory management (stack vs heap, etc.) seems to depend on compilers, optimization levels, and other flags (e.g., -fstack-arrays in gfortran). For example, the following Q/A page has some test code with the timings obtained with the compilers of that time (GCC 4.8 and ifort 14, very old!!).

!! test.f90 

!------------------------------------------------------------------------           
subroutine use_automatic( n )
    integer :: n

    integer :: a( n )   !! local automatic array (with unknown size at compile-time)
    integer :: i

    do i = 1, n
        a( i ) = i
    enddo

    call sub( a )
end

!------------------------------------------------------------------------           
subroutine use_alloc( n )
    integer :: n

    integer, allocatable :: a( : )  !! local allocatable array                      
    integer :: i

    allocate( a( n ) )

    do i = 1, n
        a( i ) = i
    enddo

    call sub( a )

    deallocate( a )  !! not necessary for modern Fortran but for clarity                  
end

!------------------------------------------------------------------------           
program main
    implicit none
    integer :: i, nsizemax, nsize, nloop, foo
    common /dummy/ foo

    nloop = 10**7
    nsizemax = 10

    do i = 1, nloop
        nsize = mod( i, nsizemax ) + 1

        call use_automatic( nsize )
        ! call use_alloc( nsize )                                                   
    enddo

    print *, "foo = ", foo   !! to check if sub() is really called
end

!! sub.f90 

!------------------------------------------------------------------------
subroutine sub( a )
    integer a( * )
    integer foo
    common /dummy/ foo

    foo = a( 1 )
end

I remember an old electronic structure code that uses such a method, i.e., allocating a large chunk of memory only once (via some small C code + a common block) and then reusing parts of it very efficiently (maybe a bit similar to a memory pool?). But the interface was not very readable to me, because the program was written in F77 and non-descriptive variable names were used extensively. With modern Fortran, I guess it might be possible to build a similar memory system that is more modern, easier to use, and also efficient. But at the same time, because recent compilers are also very efficient, I am not sure to what extent such a custom memory system can provide better performance compared to plain use of allocatable or stack arrays (which may also depend on the type of calculation).

That code might have been one of mine. :wink: Of course, I copied that idea from other codes I had seen. One problem with this approach is that you end up aliasing integer dummy arrays to real actual arrays, or vice versa, and that is not, and has never been, legal Fortran. It worked on all the compilers I have ever used, but still, you want to stay within the standard when possible.

Using modern features, you can use pointer arrays of the correct type and use c_f_pointer() to do the aliasing. That still is not strictly legal, but it gets rid of the dozens/hundreds/thousands of compiler errors about mismatched arguments that would otherwise occur. And this approach still lets you query the amount of memory left (or used), something that standard Fortran allocate/deallocate does not support. I still have not settled on a nice interface to package the logic for this.
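
For what it's worth, here is a minimal sketch of what such an interface might look like (module and procedure names are invented; the integer view obtained via c_f_pointer is exactly the not-strictly-conforming aliasing mentioned above, and the sketch assumes default real and default integer each occupy one word; overflow checks are omitted for brevity):

module pool_mod
    use iso_c_binding, only: c_loc, c_f_pointer
    implicit none
    private
    public :: pool_init, pool_get_real, pool_get_int, pool_free_words
    real, target, allocatable :: pool(:)
    integer :: top = 0
contains
    subroutine pool_init( nwords )
        integer, intent(in) :: nwords
        allocate( pool(nwords) )
        top = 0
    end subroutine pool_init

    subroutine pool_get_real( p, n )       !! hand out the next n words as a real slice
        real, pointer, intent(out) :: p(:)
        integer, intent(in) :: n
        p => pool(top+1:top+n)
        top = top + n
    end subroutine pool_get_real

    subroutine pool_get_int( p, n )        !! integer view of the next n words via c_f_pointer
        integer, pointer, intent(out) :: p(:)
        integer, intent(in) :: n
        call c_f_pointer( c_loc( pool(top+1) ), p, [n] )
        top = top + n                      !! assumes default integer and real take one word each
    end subroutine pool_get_int

    integer function pool_free_words()     !! how much of the pool is still unused
        pool_free_words = size( pool ) - top
    end function pool_free_words
end module pool_mod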

Here we would need the input of OS writers in addition to compiler writers. AFAIK, with paged virtual memory + lazy allocation, the allocation step simply checks that the requested size doesn't exceed the virtual memory size (which is fixed). It's only when the elements of the array are accessed (first touch) that memory pages are actually committed. The deallocation step is fast, as it is an asynchronous process: it returns immediately, and everything happens in a background OS process that effectively frees the pages, performs some garbage collection, etc…

I tend to agree, but to be consistent, this checking should be performed for all local variables, scalars and arrays alike, not only for arrays. In the end, it's a lot of additional tests.

In my experience, having a large work array that you allocate once and then reference contiguous slices of as needed, via some kind of argument association and maybe pointer association (with a contiguous attribute), is still faster than allocating/deallocating the same variables each time you enter and leave a procedure.

A few years back I played around with a memory-pool-like structure where I had a private array in a module that could be allocated at the start of the program based on user input, plus module procedures that would return contiguous pointers of the required ranks and sizes that I could "alias" to contiguous segments of the private array for local work variables. At the time CONTIGUOUS was still new in most compilers, and I did have issues in some cases with vectorization. I realized that a dedicated "memory pool" was really not a great idea for my application, so I just made whatever temporary work arrays I needed private module variables that were only referenced by the routines in the module that used them, and had an initialization program that would allocate the variables once at the start of the program. This avoided any name aliasing issues and in the end sped up my code by a factor of about 2x because of improved vectorization.

However, I've always felt Fortran needed a better way of assigning an alternative name (aka alias) to a contiguous slice of an array than using pointers or the ASSOCIATE construct (which can get very unwieldy in a hurry). My idea is to allow an ALIAS attribute on a variable that would let it point to other variables or contiguous slices of an array as an alternative name. However, unlike pointers, they can only (1) be associated once in a scoping unit (procedure or module) and can't be re-associated with a different variable in the same procedure, and (2) be associated with contiguous blocks of data. I.e., something like

USE memory_pool, only: work
USE some_types, only: a_type

real, alias :: a(:,:)
real, alias :: c_t

a(1:5,1:2) => work(1:10)
c_t => a_type%some_long_variable_name

Yes, you can do this now with pointers, but I think that something like this would allow you to use pointer-like aliasing without some of the problems that come with pointers, particularly with respect to vectorization and overall code optimization.
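
For comparison, here is a sketch of how this can be written today with pointers (work is assumed here to be a module-level array with the TARGET attribute; the rank-remapping pointer assignment is F2008, and CONTIGUOUS helps the optimizer):

module memory_pool
    implicit none
    real, target :: work(100)          !! assumed pool; must have the TARGET (or POINTER) attribute
end module memory_pool

program alias_with_pointers
    use memory_pool, only: work
    implicit none
    real, pointer, contiguous :: a(:,:)

    a(1:5, 1:2) => work(1:10)          !! rank-remapping pointer assignment (F2008)
    a = 1.0
    print *, sum( work(1:10) )         !! the same 10 elements, seen through the alias
end program alias_with_pointers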

just my 2 cents


Back in the Olden Days™, before F90 and before Unix, the usual technique was to assume blank common was at the end of memory. (Not always a safe assumption.) For each system/OS, you'd write a small set of procedures, often in assembly language, to ask the OS what the process/job memory limits were and where blank common was, and then extend memory to the limit.

      REAL RPOOL(1)
      INTEGER IPOOL(1)
      COMMON // RPOOL
      EQUIVALENCE (RPOOL, IPOOL)

Then 'overindex' the common block as needed.

Note that equivalencing integers and reals is perfectly standard, especially since the Standard required both to be the same size (i.e., one numeric storage unit). F77 messed this up a bit because it does not allow equivalencing characters with numeric variables. Some compilers do allow it as an extension, but it remained a reason some folks stuck with Hollerith-constant-style character manipulation for a long time (and perhaps still do).

A significant advancement came on Cray systems, which had "Cray pointers". You could call the malloc-like memory manager package, stuff the address into the pointer, and then use the associated array as needed. But since blank common is no longer used, you can't use EQUIVALENCE, so it is often not a simple replacement: you'd have to use a second pointer to the integer array and stuff the same address into both.


With pointers, the vectorization issue is solved by the contiguous attribute. And your alias proposal doesn't avoid aliasing, even if only a single alias is allowed (which may be difficult for the compiler to check in the general case): in your example you still have aliasing between a and work(1:10). This is by design of any alias-like feature.

The electronic structure code GAMESS still does this! I am thinking of refactoring it at some point, and this discussion is quite helpful for thinking about whether it is worth doing.

@rwmsu I am debating the idea of a memory pool for my application, mostly because I would like to design a good methodology for managing arrays and data such that communication between CPUs and GPUs can be done easily and with minimal errors.

Thanks for this discussion, it is quite interesting and gives me ideas!