-frecursive vs. -fmax-stack-var-size vs. -unlimit -s

Dear all,

I have a question about the best way to put arrays on the heap, or to increase the stack size, in gfortran.

I know Intel Fortran has the -heap-arrays option to put arrays on the heap, so there is no need to worry about stack size any more. In my experience, when a program is compiled from multiple .f90 or .f files, -heap-arrays should be applied only to the files that really need it. For files that do not need the flag, just leave it off.

Now, for gfortran, I wanted to know which flags best mimic Intel’s -heap-arrays. After some searching, I found in a warning that gfortran gave for my code that -frecursive seems somewhat similar to Intel’s -heap-arrays. The warning says:

Consider increasing the ‘-fmax-stack-var-size=’ limit (or use ‘-frecursive’, which implies unlimited ‘-fmax-stack-var-size’) - or change the code to use an ALLOCATABLE array. If the variable is never accessed concurrently, this warning can be ignored, and the variable could also be declared with the SAVE attribute. [-Wsurprising]

So it seems that either -fmax-stack-var-size=xxx or -frecursive should be fine.
Has anyone used -frecursive, and does it work well?

I am asking because I define a random number generator function that returns an array of n random numbers, like

function rand(n)
  integer :: n
  real :: rand(n)
  ! ... some stuff ...
  return
end function rand

I use it even for a single random number, as rand(1). I could declare rand as an allocatable array, but if I call the function frequently, the repeated allocate and deallocate might cause performance issues. So I just declare it as rand(n); however, if n is big it will cause a stack overflow. So for the file that contains this function, and the files that use it, it seems I need -heap-arrays.
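One way to sidestep both the array-valued function result and the repeated allocation is to fill an array that the caller owns. The following is only a minimal sketch under that assumption: the names rng_sketch and fill_rand are hypothetical, and the intrinsic random_number stands in for the "some stuff" of the original function.

```fortran
module rng_sketch
  implicit none
contains
  ! Fill a caller-provided array instead of returning an array result,
  ! so no function-result temporary of size n has to be created.
  subroutine fill_rand(x)
    real, intent(out) :: x(:)
    call random_number(x)   ! placeholder for the real generator
  end subroutine fill_rand
end module rng_sketch

program demo_fill_rand
  use rng_sketch, only: fill_rand
  implicit none
  real :: one(1)
  real, allocatable :: many(:)

  call fill_rand(one)          ! single value, no allocation involved
  allocate(many(10000000))     ! allocated once, on the heap
  call fill_rand(many)         ! can be refilled repeatedly without reallocating
  print *, one(1), minval(many), maxval(many)
end program demo_fill_rand
```

With this layout the caller decides where the storage lives, so the large case never depends on the stack limit and the small case costs no allocation.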

On the other hand, I remember at least both @certik and @shahmoradi recommended

-unlimit -s

But, um, how do I use -unlimit -s?
Do I pass it as a flag at the Fortran linking stage, or just type it in the terminal? Is it possible to apply -unlimit -s to my code only, so that the OS stack limit keeps its default value?

Thank you very much in advance!

1 Like

Perhaps my memory is faulty, but I thought gfortran automatically did “heap arrays”.

On Linux, the stack limit is a shell setting, limited by what was specified when the kernel was built. “ulimit” (not unlimit) is a shell command, not a compile or link switch. Some shells use the syntax “limit stacksize unlimited”. Except that both are misnomers - they set the stack size limit to the kernel-defined max.

What’s worse is that Linux does “lazy allocation”, so that you can ask for more address space but it doesn’t get allocated until you touch it. (I’m not sure this applies to the stack, however.)

2 Likes

Hi @RCquantum, on Linux, you can execute ulimit -s unlimited on a bash command line before running your code. This will theoretically resolve all stack overflow problems. On Windows, you won’t have any luck without heap-arrays, as far as I know. If you use gfortran on any platform, set -fmax-stack-var-size=10 (a 10-byte stack maximum) to keep anything large off the stack. As far as I am aware, the -frecursive flag overrides -fmax-stack-var-size and causes all local arrays to be allocated on the stack, so you should not specify both flags simultaneously. In my experience, using heap allocations reduces runtime performance by about 5%, but the flexibility it offers outweighs the potential performance penalty, in my opinion. I remember Julia developers (or the community, hard to differentiate the two in the old days) bragging about the automatic allocation of all arrays on the heap in Julia years ago, when stack overflow was a big deal in Python applications and wrappers. That might have changed by now.

2 Likes

Thank you very much Dr. Fortran @sblionel and @shahmoradi !
I see, Dr. Fortran @sblionel. Yes, I have indeed noticed that gfortran does not seem to need anything like -heap-arrays specified manually, as it apparently does this automatically. For the same code in one particular file of mine, with Intel Fortran I have to specify -heap-arrays or it overflows the stack, whereas with gfortran I do not need any flag.
Thank you @shahmoradi, the usage of ulimit -s unlimited is clearer to me now. I just wish the code would not overflow the stack, LOL. Thank you for the -fmax-stack-var-size=10 trick (which basically acts like Intel’s -heap-arrays) and the explanation of -frecursive.
Yes, -heap-arrays may have some impact on performance. Usually the performance cost is small enough. However, for the FLINT ODE solver,

I do notice that if I apply -heap-arrays to all of its files, performance drops by at least a factor of 10. So for the files in the FLINT solver I definitely do not add any flag like -heap-arrays. Since then, I apply -heap-arrays only to the files that contain functions, subroutines or arrays that really need heap arrays.

PS. More info about gfortran flags might be found here,

2 Likes

Assuming I understand your description of lazy as deferred physical memory allocation, isn’t this “lazy” behavior a good thing?
My approach (for gfortran on Windows) is to allocate a 500 MByte stack, which is “lazy” in that it is a virtual address range that is not backed by physical memory until it is touched; physical memory is then allocated progressively.
My understanding is that code + primary stack cannot exceed 2 GBytes (4 GB?), i.e. 32-bit addressing, as 64-bit is not yet fully 64-bit!

Generally (and more specifically for OpenMP private arrays) it is better to have small arrays on the stack, but once arrays are larger than a memory page (4 kbytes), the heap disadvantage disappears. For OpenMP shared arrays, heap arrays are just as efficient as stack arrays, provided ALLOCATE is not a high-frequency operation.
For much larger arrays (many GBytes), I use ALLOCATE.

There has been a question about the implementation of -fmax-stack-var-size=xxx, namely whether xxx is applied only to local arrays and not to automatic arrays. I prefer to select -fstack-arrays (for, hopefully, local, automatic and private arrays) and then use ALLOCATE to select heap arrays.

You should check the relationship between -fopenmp, -frecursive and -fstack-arrays.
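To make the distinction between the array kinds concrete, here is a small sketch with one array of each kind. The comments give my reading of the gfortran documentation (that -fmax-stack-var-size concerns local arrays with constant bounds and moves oversized ones to static storage, while -fstack-arrays concerns automatic arrays and array temporaries); treat that as an assumption and check the manual for your compiler version. The names array_kinds and work are hypothetical.

```fortran
program array_kinds
  implicit none
  call work(1000)
contains
  subroutine work(n)
    integer, intent(in) :: n
    ! Constant bounds: the kind of array -fmax-stack-var-size is documented
    ! to affect (assumption: oversized ones go to static storage).
    real :: fixed(100000)
    ! Automatic array: size known only at run time; -fstack-arrays is
    ! documented to place these on the stack (assumption, see lead-in).
    real :: auto(n)
    ! Allocatable: explicitly heap-allocated, freed automatically on return.
    real, allocatable :: dyn(:)

    allocate(dyn(n))
    fixed = 1.0
    auto = 2.0
    dyn = 3.0
    print *, sum(fixed) + sum(auto) + sum(dyn)
  end subroutine work
end program array_kinds
```

Compiling the same file with and without the flags under discussion, and watching for the -Wsurprising warning quoted at the top of the thread, is probably the quickest way to see which declarations a given flag actually touches.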

1 Like

Maybe. Let’s say your application does an ALLOCATE of a large array, and checks the status of the allocation, doing some recovery or giving a meaningful message if it fails. With lazy allocation, the ALLOCATE itself will succeed, but the application will get a segfault sometime later when it tries to access the allocated data if it turns out there is insufficient VM available. I would prefer to know earlier that the allocation failed.
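The scenario described above can be sketched as follows. This is only an illustration: the array size is a hypothetical value, and whether the failure shows up at the ALLOCATE or at the first touch depends on the operating system's overcommit behavior, not on anything in the code.

```fortran
program alloc_check
  use, intrinsic :: iso_fortran_env, only: int64, real64, error_unit
  implicit none
  real(real64), allocatable :: a(:)
  integer(int64) :: n
  integer :: ierr
  character(len=256) :: msg

  n = 10000000_int64   ! hypothetical size: 1e7 doubles, about 80 MB
  allocate(a(n), stat=ierr, errmsg=msg)
  if (ierr /= 0) then
    ! The place where a meaningful message or recovery is possible.
    write(error_unit, '(2a)') 'allocation failed: ', trim(msg)
    error stop 1
  end if

  ! With lazy allocation the pages may only be committed here, on first
  ! touch, so a memory shortage can surface at this line instead.
  a = 0.0_real64
  print *, 'allocated and touched ', size(a, kind=int64), ' elements'
end program alloc_check
```

Touching the whole array right after the ALLOCATE, as done here, at least moves any late failure close to the allocation site rather than to some unrelated part of the program.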

2 Likes

I get very few ALLOCATE errors. If an ALLOCATE fails, this is more a logistical problem, that there is insufficient memory installed.
Poor planning, not a code error.

1 Like

So, you would rather get a segfault in some random part of your program rather than a meaningful error at the point of allocation?

2 Likes

I do check for errors with ALLOCATE, but the error status is not a sufficient report.
For my applications, ALLOCATE effectively fails when there is insufficient physical memory. This is not reported.
I have PCs with different amounts of installed memory, and if I “forget” and run or test an application on a PC with insufficient memory, I don’t get a segfault; everything just stops or goes to sleep. A segfault and exit would be a much better outcome. What else can you do?
If there is likely to be insufficient memory, I will run Task Manager to monitor the memory usage, but this should be estimated before running the program.
(Allocate errors are basically for a 32-bit OS.)

Basically, the error reporting for ALLOCATE is not effective, as it compares the allocation against available virtual memory. I haven’t used virtual memory for a long time (the 80s?). You would not combine virtual memory use with OpenMP.

Does anyone use virtual memory for production work?

Steve,

Thanks for your comments on ALLOCATE errors. I don’t want to be contrary, but in using Allocate on 64-bit OS, I am struck by how infrequently I get an error, while coding the error handling for stat /= 0 can become very extensive, but not very effective.

There is also the most frequent (just about the only) problem I experience, which is running a program on the wrong PC, one with insufficient installed memory. In this case a program crash would be far preferable to the slow burn of virtual memory. (I wonder how the old disk-paging minis were ever so acceptable?)

There are a few usages of ALLOCATE which I find could be improved. I wonder if others find this.

  1. ALLOCATE reports a failure if the virtual memory allocation is exceeded. Could there be an option for the limit to be the physical memory limit (assuming there can be a definition that excludes other processes’ allocations)? My strategy of defining large virtual stacks could also add to this complexity.

1a) The use of ALLOCATE on a 64-bit OS vs a 32-bit OS is very different: with 32-bit, virtual memory is mostly smaller than physical memory, but on 64-bit, virtual memory is usually not the issue. I think the Fortran Standard ALLOCATE approach is based more on a 32-bit OS environment.

  2. When using ALLOCATE for private heap arrays in OpenMP, could there be an option such as OPTION=“new_page” to allocate these arrays on a new memory page (perhaps for arrays larger than 10 memory pages)? This suggestion is based on my assumption that not sharing a memory page between the private heap arrays of different threads could improve performance, i.e. fewer changing memory pages in cache that are shared between threads. I have no definite knowledge that this would be effective. No compiler I use provides this option.

OpenMP has introduced some (private) array memory management, although I think it is aimed more at GPUs. I have not tried it yet, as I am not aware of which compilers support the OpenMP 5.1 memory options, and the alternatives offered do not appear to relate to heap vs stack memory.

For me, the use of large allocate arrays is a logistical problem, as I need to use the right PC to make sure there is enough physical memory installed. If I do this correctly, there should be no allocate errors.
My approach is to ensure there is enough installed memory for the solution algorithm I am using. When/until that approach fails, I will then need to look for a new solution algorithm.
Perhaps my memory usage experience is too limited/lazy.

I would be interested if others have similar or different views.

1 Like

I have a similar problem: on Windows with ifort I add the flag /heap-arrays:0 because otherwise I often get a stack overflow. However, in some projects this has led to significant performance losses. The problem is that not using /heap-arrays:0 on Windows is very risky, because sooner or later you get a stack overflow. I find this quite annoying.

Yeah, /heap-arrays:0 is an easy fix.

If we define fixed size array, especially high dimensional array, such as

real(8) :: A(nx,ny,nz)

On Windows, I remember that if nx * ny * nz > 800000000, there will be an overflow error. With high-dimensional arrays especially, it is easy for the size to exceed 800000000 without our noticing immediately. It is perhaps best to find which arrays cause the overflow and use allocatable arrays for them if possible.

1 Like

On Windows, the stack is fixed size, allocated as part of the link process. It shares the 2GB static code and data area, even on 64-bit Windows. The default stack size is 100MB (I think), but you can change this using the Linker > System > Stack reserve size property (/STACK Linker option). If you make this too large, your executable will not start.

Using /heap-arrays gets around stack size limits, but involves an allocate/deallocate step for each array temp. (I’ll note that the value you specify for this option doesn’t do anything useful - may as well use 0.) If your application is creating lots of temps, then yes, that will be slow, but even stack temps involve copying data, so it’s a good idea to try to minimize the number of times these get created during a run.

1 Like

The default might be much smaller, only 1 MB.

2 Likes

Yes, you’re right: 1 MB. I knew this used to be the case, but thought it had been increased. See /STACK (Stack allocations) | Microsoft Learn

2 Likes

I think it is around 8MB, small in any case

1 Like

I think this is going too far into implementation details, which the standard generally avoids. For allocations, Fortran depends on the way the OS handles the allocation requests (e.g. lazy allocation on Linux, which makes all allocations successful even when they exceed the available memory).

That said, some inquiry functions about the available memory (physical, swap…) could make sense to help write portable code. When I am writing code that may require a lot of memory, I query the available RAM before dimensioning the biggest arrays. But this is done in a non-portable way (reading /proc/meminfo on Linux).
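A minimal sketch of such a non-portable query, assuming a Linux system whose /proc/meminfo contains a MemAvailable line (the function name mem_available_kb is hypothetical, and this is not necessarily how the poster does it):

```fortran
program meminfo_sketch
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer(int64) :: avail_kb

  avail_kb = mem_available_kb()
  if (avail_kb < 0) then
    print *, 'MemAvailable not found (not Linux, or an old kernel?)'
  else
    print *, 'MemAvailable (kB): ', avail_kb
  end if

contains

  ! Returns the MemAvailable value from /proc/meminfo in kB, or -1 on failure.
  function mem_available_kb() result(kb)
    integer(int64) :: kb
    character(len=256) :: line
    integer :: u, ios

    kb = -1_int64
    open(newunit=u, file='/proc/meminfo', status='old', action='read', iostat=ios)
    if (ios /= 0) return
    do
      read(u, '(a)', iostat=ios) line
      if (ios /= 0) exit
      if (line(1:13) == 'MemAvailable:') then
        ! The line looks like "MemAvailable:   12345678 kB".
        read(line(14:), *, iostat=ios) kb
        if (ios /= 0) kb = -1_int64
        exit
      end if
    end do
    close(u)
  end function mem_available_kb
end program meminfo_sketch
```

The returned figure can then be compared with the estimated size of the biggest arrays before any ALLOCATE is attempted.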

About a “new_page” option for multithreaded codes, I am expecting the OS to take care of that on their own (i.e. to allocate requests from different threads to different pages).

1 Like

I know of several large Fortran programs that continue to use the old f77-style memory management because of this limitation. I had thought that all of these programs would have switched over to Fortran allocatable and automatic arrays soon after f90 was available, but that has not happened. The practical difference is that with the f77 approach (allocating a single large workspace array, then using argument aliasing and/or pointer assignments to access that memory), the programmer knows at all times how much memory has been used and how much is available, and the running program will not crash because of memory access violations within that workspace. The programmer can examine the available memory and change block size parameters within an algorithm, or even choose from among several algorithms based on workspace availability. With the f90+ allocatable array approach, the programmer cannot inquire how much memory has been used or how much is available for the next task. It is not unusual to allocate some memory, apparently successfully, and then have the program subsequently seg fault from a memory access violation. All of this is beyond the control of the programmer. The reason is that the Fortran processor itself is continually allocating and deallocating temporary work arrays within expressions, or with copy-in/copy-out argument association, or with automatic arrays, and so on. That, combined with lazy allocation, puts all of this outside the control of the programmer.
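The workspace idea can be sketched in modern syntax roughly as follows. This is a simplified illustration, not code from any of the programs mentioned: the module and procedure names are hypothetical, blocks are only handed out and never returned, and pointer assignment replaces the argument aliasing an f77 code would use.

```fortran
module workspace_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  private
  public :: ws_init, ws_take, ws_free_words

  real(real64), allocatable, target :: work(:)   ! the single big workspace
  integer :: next_free = 1                       ! index of first unused element

contains

  subroutine ws_init(nwords)
    integer, intent(in) :: nwords
    allocate(work(nwords))
    work = 0.0_real64      ! touch every page up front so a shortage shows early
    next_free = 1
  end subroutine ws_init

  ! Number of elements still available in the workspace.
  integer function ws_free_words()
    ws_free_words = size(work) - next_free + 1
  end function ws_free_words

  ! Hand out the next n elements, or a null pointer if they do not fit.
  function ws_take(n) result(p)
    integer, intent(in) :: n
    real(real64), pointer :: p(:)
    p => null()
    if (n > ws_free_words()) return
    p => work(next_free:next_free + n - 1)
    next_free = next_free + n
  end function ws_take
end module workspace_sketch

program demo_workspace
  use, intrinsic :: iso_fortran_env, only: real64
  use workspace_sketch
  implicit none
  real(real64), pointer :: a(:), b(:)

  call ws_init(1000)
  a => ws_take(600)
  print *, 'free after a: ', ws_free_words()
  b => ws_take(600)      ! does not fit in the remaining space
  if (.not. associated(b)) print *, 'b does not fit; pick a smaller block size'
end program demo_workspace
```

Because the program can ask ws_free_words at any point, it can adjust block sizes or switch algorithms before running out, which is the property described above.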

I don’t know if there is really a solution to this problem. However, having some standard inquiry functions regarding stack memory, heap memory, virtual memory, physical memory, and so on might help at least in some cases. On some supercomputers for example, the different job submit queues have different memory limitations, so just the ability of a program to query what that limit is in some standard way would help avoid unnecessary job crashes.

1 Like

I remember this from even the 1970s and 1980s. On an IBM mainframe, there might be 1 to 2 MB of physical memory. But each job that ran would only have access to a small fraction of that (maybe 64kB, or 128kB), and the OS would run many of those small jobs simultaneously. The term was “timesharing”, which you don’t hear any more these days. That is what IBM thought virtual memory was. But, for example, a VAX, might have maybe 1 MB of memory, but your job might have large arrays and use several times that amount of memory. The OS would swap parts of your job in and out of the physical memory using the spinning hard disk as backing store. That is what the VAX, and of course many other computers of that era, thought virtual memory was. Unix, and these days Linux, uses this latter idea of virtual memory, with the global swapspace dedicated and configured as the system boots. Of course in modern times, we’re talking about GBs of physical RAM and TBs of swapspace for desktop and even laptop computers.

1 Like

@RonShepard As I said, I am regularly writing some codes that can potentially fill up the whole available memory, because I’m working in a domain where data volumes can be huge. So, today is no different from yesterday: I have to take into account the physical limitations of the machines where the code is running, and it’s not a big deal.

Most of the time, at the beginning of the execution I can predict, at least roughly, how much allocatable memory the code will need, get the physical memory, and compare them to decide what to do: authorize execution, abort execution, or proceed with some chunking when possible. Then, I carefully avoid any syntax that can result in large temporary arrays or in copy-in/copy-out. Again, it’s not a big deal, and not outside of control.

In the case where the needed memory is not predictable at all at the beginning of the execution, it is of course different, and some inquiry functions would be useful. With Fortran as it is, in such cases I would keep track of the total allocated memory during execution.
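Keeping track of the total could look something like the sketch below, which routes allocations through a pair of wrappers that maintain a running byte count. The names are hypothetical, only rank-1 real(real64) arrays are covered, and compiler-generated temporaries are of course not counted.

```fortran
module tracked_alloc_sketch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer(int64), save :: bytes_in_use = 0_int64   ! running total

contains

  subroutine tracked_allocate(a, n, ierr)
    real(real64), allocatable, intent(inout) :: a(:)
    integer(int64), intent(in) :: n
    integer, intent(out) :: ierr
    if (allocated(a)) then
      ierr = -1              ! refuse to silently replace an existing array
      return
    end if
    allocate(a(n), stat=ierr)
    ! storage_size is in bits, hence the division by 8.
    if (ierr == 0) bytes_in_use = bytes_in_use + n * storage_size(a) / 8
  end subroutine tracked_allocate

  subroutine tracked_deallocate(a)
    real(real64), allocatable, intent(inout) :: a(:)
    if (.not. allocated(a)) return
    bytes_in_use = bytes_in_use - size(a, kind=int64) * storage_size(a) / 8
    deallocate(a)
  end subroutine tracked_deallocate
end module tracked_alloc_sketch

program demo_tracking
  use tracked_alloc_sketch
  implicit none
  real(real64), allocatable :: x(:), y(:)
  integer :: ierr

  call tracked_allocate(x, 1000000_int64, ierr)
  call tracked_allocate(y, 2000000_int64, ierr)
  print *, 'bytes in use: ', bytes_in_use
  call tracked_deallocate(x)
  print *, 'bytes in use: ', bytes_in_use
end program demo_tracking
```

The running total can then be compared against whatever memory budget the program has established before each further large allocation.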

One thing that is missing in Fortran and that can be very convenient is the memory map mechanism. There’s no magic, it doesn’t give unlimited memory with guaranteed performance, but again it can be convenient in some cases.

2 Likes