Allocatable vs adjustable

I was trying to define what size the stack should be on a 64-bit OS.

To remove all these stack problems, I think that when you compile with -march=native, the default stack size should be set to the installed memory size, but not more than 2^(48-1) / omp_get_max_threads() bytes.

After all, the stack size is only an address range: any unused part of it is never even allocated a virtual memory page.

It is a shame 64-bit compiler developers do not think this way and do away with most stack overflow errors. There would definitely be some who still try to exceed this limit!

AFAIK the stack size is “decided” by the OS at the moment it loads an executable; it does not depend on how the program was compiled, at least on *nix systems. And the stack space of a process is reserved for, and exclusive to, that process, so you can’t set the default stack size to the total available memory: you could run only one such process. What is a mystery to me, however, is how an “unlimited” stack size is managed…

A resource is “unlimited” not in the mathematical (or philosophical) sense but from the point of view of the computer, so its value is most likely an “unsigned huge” (e.g., 0xffffffffffffffff).

The POSIX symbolic constant is RLIM_INFINITY, defined in <sys/resource.h>.

Which probably means the stack will keep growing until either process completion or crash.
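For the curious, here is a minimal sketch of checking that limit from a Fortran program through C interoperability. It assumes Linux/glibc: RLIMIT_STACK = 3 and the struct rlimit layout (two 64-bit unsigned integers, with RLIM_INFINITY being all ones, i.e. -1 when read as a signed integer) are platform specifics, not portable values.

```fortran
! Minimal sketch: query the soft stack limit via POSIX getrlimit().
! Linux/glibc assumptions: RLIMIT_STACK = 3; struct rlimit holds two
! 64-bit unsigned values (modelled here as signed c_int64_t, since
! Fortran has no unsigned kinds); RLIM_INFINITY is all ones (-1 signed).
program show_stack_limit
   use iso_c_binding, only: c_int, c_int64_t
   implicit none
   type, bind(c) :: rlimit_t
      integer(c_int64_t) :: rlim_cur   ! soft limit, bytes
      integer(c_int64_t) :: rlim_max   ! hard limit, bytes
   end type
   interface
      function getrlimit(resource, rlim) bind(c, name="getrlimit") result(stat)
         import :: c_int, rlimit_t
         integer(c_int), value :: resource
         type(rlimit_t), intent(out) :: rlim
         integer(c_int) :: stat
      end function
   end interface
   integer(c_int), parameter :: RLIMIT_STACK = 3   ! Linux value
   type(rlimit_t) :: rl
   if (getrlimit(RLIMIT_STACK, rl) /= 0) stop "getrlimit failed"
   if (rl%rlim_cur == -1_c_int64_t) then
      print *, "soft stack limit: unlimited (RLIM_INFINITY)"
   else
      print *, "soft stack limit (bytes):", rl%rlim_cur
   end if
end program show_stack_limit
```

On such a system, running `ulimit -s unlimited` in the shell before launching the program is what makes the soft limit report as unlimited.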

This is partly true.
The stack size is “decided” by the linker, but unfortunately it is limited by the OS. The linker may be provided by the compiler or by the OS. The present limits in 64-bit OSes are just not necessary.

The stack “size” sets out a (virtual) address range that is reserved for stack variables. Unfortunately this is far too small in Windows and probably other OSes.
In 64-bit, my googling suggests the virtual address range can be 2^48 bytes (256 terabytes), far greater than the physical memory limit of, say, one terabyte (2^40 bytes).

The important points are:

  1. the 2^48-byte virtual address space is much larger than can ever be used;
  2. virtual memory pages are allocated only to the actual arrays in the program;
  3. physical memory pages are allocated only to the arrays that are initialised/used; and
  4. only the virtual memory page table must be able to support the full address range.

“If” you have a 64 GByte stack size, you still only consume the space that is actually allocated or used by the program, and the stack is much less likely to clash with the heap.
If your arrays in the stack(s) and heap are spread over a terabyte address range, this all still works, provided the allocated memory pages actually required do not exceed the physical memory pages (or the virtual memory storage space).
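A small experiment consistent with points 2 and 3 above (assuming a Linux system with memory overcommit): allocating a huge array normally succeeds immediately, and physical pages are committed only as elements are touched.

```fortran
! A small experiment (Linux with memory overcommit assumed): the
! ALLOCATE reserves ~8 GB of address space and normally returns at
! once; physical pages are committed only for the elements we touch.
program lazy_pages
   implicit none
   integer(8), parameter :: n = 10_8**9   ! 1e9 elements = 8 GB reserved
   real(8), allocatable :: big(:)
   allocate (big(n))
   big(1:1000) = 0d0                      ! only these pages get real RAM
   print *, "reserved", 8*n, "bytes; touched only the first few pages"
end program lazy_pages
```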

If you look at the memory addresses of heap arrays or thread-stack arrays in an ifort OpenMP program, you can see they are spaced far apart with no ill effect.
Depending on the address capacity of the virtual memory page table, stacks of many gigabytes should not be a problem, nor introduce any inefficiency.
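A minimal sketch of that kind of inspection, using the standard c_loc rather than the LOC extension; the actual addresses and their spacing are entirely platform-dependent.

```fortran
! Print the virtual addresses of an automatic (stack) array and an
! allocated (heap) array to see how far apart they land.
program show_addresses
   use iso_c_binding, only: c_loc, c_intptr_t
   implicit none
   integer, parameter :: n = 1000
   real(8), allocatable, target :: heap_arr(:)
   allocate (heap_arr(n))
   call inner(n)
contains
   subroutine inner(m)
      integer, intent(in) :: m
      real(8), target :: auto_arr(m)      ! automatic array -> stack
      auto_arr(1) = 0d0
      print '(a,z16.16)', "stack array at 0x", &
            transfer(c_loc(auto_arr), 0_c_intptr_t)
      print '(a,z16.16)', "heap  array at 0x", &
            transfer(c_loc(heap_arr), 0_c_intptr_t)
   end subroutine
end program show_addresses
```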

By allocating a larger memory address space to the stack, we simply don’t get stack overflows.
Programs that use (say) 100 GBytes of memory should not be limited to a stack size of less than 1 GByte, though they do remain limited by the available physical memory pages.

Stack limits in 64-bit OSes should be changed, especially in Windows.

This discussion has been very useful for an experiment I am interested in. Is it worth it in 2025 to design a Fortran program that uses a memory-pool approach to avoid allocate calls at runtime?

Following what @RonShepard mentioned about legacy programs using a big array and indexing into it in funky ways, nowadays we could design a cool extended type: a memory_pool type with an integer buffer, a double precision buffer, and some smart indexing functions.
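Something like the following rough sketch, perhaps (all names are hypothetical; only the double precision buffer is shown, an integer buffer would be analogous, and there is no bounds checking):

```fortran
! A hypothetical memory_pool type: a flat double-precision buffer is
! carved up stack-fashion, so an "allocation" is just an offset bump.
module memory_pool_mod
   implicit none
   type :: memory_pool
      real(8), allocatable :: buf(:)
      integer :: top = 0
   contains
      procedure :: init
      procedure :: take
      procedure :: reset
   end type
contains
   subroutine init(self, n)
      class(memory_pool), intent(out) :: self
      integer, intent(in) :: n
      allocate (self%buf(n))
   end subroutine
   function take(self, n) result(p)     ! hand out the next n elements
      class(memory_pool), target, intent(inout) :: self
      integer, intent(in) :: n
      real(8), pointer :: p(:)
      p => self%buf(self%top+1 : self%top+n)
      self%top = self%top + n
   end function
   subroutine reset(self)               ! release everything at once
      class(memory_pool), intent(inout) :: self
      self%top = 0
   end subroutine
end module memory_pool_mod
```

Note the trade-off: the pool object must effectively live as a target for the whole run, and the pointers handed out by take() dangle after reset(), which is exactly the pointer-semantics problem discussed further down.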

But is it worth going through this hassle? I have been toying with a small program that allocates some matrices and then does a dgemm with them. So far the performance difference has been quite small, small enough that I would like to avoid going to the trouble of implementing a memory pool.
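For reference, a toy comparison along those lines (matmul stands in for dgemm so the example needs no BLAS; the size and repetition count are arbitrary):

```fortran
! Time repeated allocate/deallocate of work matrices vs. reusing
! preallocated buffers for the same multiplications.
program alloc_vs_reuse
   implicit none
   integer, parameter :: n = 256, reps = 200
   real(8), allocatable :: a(:,:), b(:,:), c(:,:)
   real(8) :: t0, t1, t2, s
   integer :: i
   s = 0d0
   call cpu_time(t0)
   do i = 1, reps                       ! fresh allocations every iteration
      allocate (a(n,n), b(n,n), c(n,n))
      a = 1d0; b = 2d0
      c = matmul(a, b)
      s = s + c(1,1)                    ! keep the optimizer honest
      deallocate (a, b, c)
   end do
   call cpu_time(t1)
   allocate (a(n,n), b(n,n), c(n,n))    ! allocate once, reuse thereafter
   do i = 1, reps
      a = 1d0; b = 2d0
      c = matmul(a, b)
      s = s + c(1,1)
   end do
   call cpu_time(t2)
   print *, "allocate each time:", t1 - t0, "s"
   print *, "allocate once:     ", t2 - t1, "s"
   print *, "checksum:", s
end program alloc_vs_reuse
```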

I think this could be a useful library for quite a few programs.

I’ve got a project that could definitely use such a thing. I profiled the program, which uses a multi-level data structure of derived types with allocatable components of derived type with allocatable components …, and it spends a lot of its time waiting for memory allocation. I even wrote a little mini-app that demonstrates you could save ~30% of the run time if you could avoid all the memory allocations. Unfortunately, the strategy for doing so was terrible from a maintainability standpoint.

If you could design a library that used a memory pool but kept the semantics of allocatable variables, you’d enable efficient and maintainable programs in this style.

As I stated in another thread, I tried to implement a memory pool “class” a few years ago and came to the same conclusion. If the issue being addressed is providing temporary work space inside a procedure (or global data inside a module), I think it’s better to encapsulate both the temporary arrays and the procedures that use them in a module, with the temp arrays declared private. Then at the start of your program you call some kind of initialization routine that allocates the arrays to some specified size, and you change them only when your memory requirements change dynamically. I’ve found this is a lot better and faster than relying on automatic arrays or reallocating the temp arrays each time you enter the procedure. In the application areas I’m familiar with you don’t need to reallocate arrays from an original fixed size unless you are doing some kind of dynamic adaptive mesh refinement. Yes, this is like a memory pool, but without the maintainability problems, in my opinion. Sometimes it’s better to use a specific tool designed for a single purpose instead of trying to build a universal tool applicable to all cases.
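A bare-bones sketch of that module-encapsulated workspace pattern (names are illustrative):

```fortran
! Temp arrays are private to the module and sized once at startup,
! instead of being reallocated on every call.
module solver_work
   implicit none
   private
   public :: init_work, do_solve
   real(8), allocatable :: work(:)
contains
   subroutine init_work(n)
      integer, intent(in) :: n
      if (allocated(work)) deallocate (work)
      allocate (work(n))        ! resize only when requirements change
   end subroutine
   subroutine do_solve(x)       ! assumes size(x) <= size(work)
      real(8), intent(inout) :: x(:)
      work(1:size(x)) = x       ! use the preallocated workspace
      x = 2d0 * work(1:size(x))
   end subroutine
end module solver_work
```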


If this is done in Fortran, then I think it must be done with pointers, not individual allocatable arrays. That means the arrays have the semantics of pointer arrays, not those of allocatable arrays.

One possible way to avoid this limitation would be to generalize the move_alloc() intrinsic subroutine a little to allow pointer arguments. This would allow a pointer array, with all of its baggage, to be transformed into an allocatable array, with all of its advantages, at runtime with only a shallow copy’s worth of effort. This change would eventually have to be made within the Fortran standard itself to be portable.
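For reference, this is what move_alloc does today, allocatable to allocatable only:

```fortran
! The "from" allocatable hands its descriptor to "to" without copying
! data and is left deallocated.  The generalization discussed above
! would additionally accept a pointer as "from".
program move_alloc_demo
   implicit none
   real(8), allocatable :: tmp(:), dest(:)
   allocate (tmp(10))
   tmp = 1d0
   call move_alloc(from=tmp, to=dest)         ! shallow transfer, no copy
   print *, allocated(tmp), allocated(dest)   ! prints: F T
end program move_alloc_demo
```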

This is the sad bit: allocatable arrays are basically memory-safe unless you’re doing some weird things, and losing that is unfortunate. The design question then becomes how to avoid memory leaks while keeping the interface pleasant to use.

For a memory pool, there would probably not be memory leaks, since the pool, once allocated, would live for the rest of the program’s execution. But there are still all the other problems with pointers, including the performance-related ones, such as possible hidden aliases, that allocatable arrays solve so elegantly. The challenge with the move_alloc() generalization would be to keep those semantic advantages of allocatable arguments while shifting the responsibility for avoiding hidden aliases to the programmer in a clear and manageable way.


I am sure that many of us who used FORTRAN before F90/95 already have our own memory pool management library.

Is it worth it in 2025? In most cases I think definitely not.

Why would you avoid ALLOCATE? There aren’t many cases where you would want to.
ALLOCATE uses the heap, and for large arrays the heap is much better than the stack.
The stack (automatic arrays) can be effective for many small arrays, but in most cases this is not significant.

The only feature of my MEM_ALLOC library that I miss is the “allocate the largest array possible and tell me what it is” option, which is typically used when you don’t know the size required. (There is a subsequent resize option to reset the array size once you do know.)
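That probe-style option can be approximated with stat= on ALLOCATE; a hypothetical sketch:

```fortran
! Try the requested size and halve until ALLOCATE succeeds
! (stat= makes a failed allocation non-fatal).
subroutine alloc_largest(a, nmax, n)
   implicit none
   real(8), allocatable, intent(out) :: a(:)
   integer(8), intent(in) :: nmax
   integer(8), intent(out) :: n
   integer :: stat
   n = nmax
   do
      allocate (a(n), stat=stat)
      if (stat == 0) return        ! success: a(1:n) is ours
      n = n / 2                    ! halve and retry
      if (n == 0) error stop "no memory available at all"
   end do
end subroutine alloc_largest
```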

Those of us who used FORTRAN waited a long time for ALLOCATE, and combined with automatic arrays it was one of the best Fortran improvements ever for computational programming. (Just consider all the very old libraries that required you to provide work arrays, because they could not generate their own without risking stack overflow.)


Emphasis on the “probably”! I could see a bad design becoming problematic, but yeah, you’re right. Basically, with this and @JohnCampbell’s response, it seems that spending time on a memory pool might not be the best use of my time.

Thanks so much for the insights!

In the end, I’m still unsure why stack allocations are faster… I’m not a systems guy, so I can only guess at the two steps any allocation involves:

  1. finding an address at the start of a contiguous free block of size N in the virtual memory space
  2. mapping the virtual memory pages to physical memory locations (which implies finding which physical locations are free)

Stack allocation:

  1. step 1 is very fast, as the stack memory is… a stack; the address is simply the top of the stack
  2. for step 2, I don’t know… is the stack memory premapped when the process starts (in which case there would be no cost during the rest of the execution), or not (in which case there would be no difference from a heap allocation)?

Heap allocation:

  1. step 1 can be slow, as the heap memory space is fragmented by nature
  2. is step 2 faster or slower than step 1? I guess it’s usually faster, but it’s just a guess.

“Allocations” in a “manually managed” pool of static memory:

  1. step 1 is fast if the pool is managed as a stack; it can be slow if the pool is managed as a heap
  2. the mapping of step 2 occurs only at the first access to each virtual page, as the pages are allocated once and never released, from the OS’s point of view.

Stack allocations are extremely fast. When entering a new procedure, just bump the stack pointer by the appropriate amount and you are done. Typically, if the program tries to access data beyond the current limit of the stack, a page fault occurs and the OS automagically provides a page to accommodate the memory reference.

The heap is generally kept in a different area of the address space. Fortran ALLOCATE typically just calls malloc, which in turn maintains its own memory pool. Note that on Linux there are ways to tune it, via mallopt and friends. At some point it can even use mmap(2) for large allocations.
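For example, a sketch (glibc assumed) of calling mallopt from Fortran to lower the threshold at which malloc switches to mmap; M_MMAP_THRESHOLD = -3 is the glibc value:

```fortran
! Tune glibc malloc from Fortran: allocations larger than the
! M_MMAP_THRESHOLD value go straight to mmap(2) instead of the arena.
program tune_malloc
   use iso_c_binding, only: c_int
   implicit none
   interface
      function mallopt(param, value) bind(c, name="mallopt") result(ok)
         import :: c_int
         integer(c_int), value :: param, value
         integer(c_int) :: ok       ! nonzero on success
      end function
   end interface
   integer(c_int), parameter :: M_MMAP_THRESHOLD = -3   ! glibc value
   real(8), allocatable :: a(:)
   if (mallopt(M_MMAP_THRESHOLD, 1024*1024) == 0) print *, "mallopt failed"
   allocate (a(10**7))   ! 80 MB: now served by mmap rather than the arena
   a = 0d0
end program tune_malloc
```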

Someone once told me that the stack was mapped to physically faster memory that was “closer” to the CPU. I’ve always been sceptical that this was true. I would think that if it were, the OS would have to be totally aware of what memory you have installed (caches through DRAMs) and malloc would have to be smart enough to take advantage of it. I doubt that is the case, but who knows, I could be wrong.

Not really. But cache locality can be an issue. Generally a smaller memory footprint, which stack allocation encourages, can help with cache locality.

Ages ago, someone once told me she had gone to a class where they were talking about arranging scalar variables together in the declarations to help make programs faster. As I thought about it, it wouldn’t have helped at all with the CDC mainframes we were using. But on an IBM mainframe with a cache (and back then they were only a few K bytes), it might have been a reasonable optimization technique to improve locality of reference. Also perhaps something an optimizing compiler could do automatically, overriding attempts by the programmer to do the same. No idea if any modern compilers bother to do this. L2 and L3 caches are big enough these days that it doesn’t seem like it would be worth the effort.


Maybe some architectures in the past had different levels of RAM, but I’m pretty sure that none of the current architectures (and especially the x86 one) have that.

Aren’t the different levels of cache on modern processors effectively different levels of RAM? They might run at the same speed (I’m not sure about that), but I think Level 1 is closer to the CPU on the die than Level 2 and Level 3, so its access time is (or at least should be) a little faster.

Yes, this feature of almost all modern processors is called nonuniform memory access, or NUMA. The concept extends further to accessing remote memory (coarrays and MPI) and to backing store on external devices such as hard disks or SSDs. Non-uniform memory access - Wikipedia

However, the way I understand it, stack memory is not mapped directly to one of those memory levels, rather stack memory naturally ends up there simply because it is accessed repeatedly by the program.

edit: Another example of NUMA is local GPU memory. This is fast, or at least low-latency, memory that the GPU uses during instruction execution. To use it, data must be copied in from and back out to CPU memory. Newer GPU architectures instead share memory between the GPU and CPU (integrated graphics), avoiding the copy-in/copy-out latency but at the expense of slower memory accesses.


That is all true. Cache algorithms are all about keeping ‘most recently used’ (and most likely to be used/re-used) data as close to the CPU as possible. At the closest L1 level, there are typically separate instruction and data caches. The algorithms can then be optimized for each use case. The data cache doesn’t really distinguish between stack data and heap data.