Why is the stack faster than the heap, and what exactly is the stack?

Dear all,

This question has puzzled me for a while:
Why is the stack faster than the heap, and what exactly is the stack?

I know this problem has been discussed in many places,

But I am still confused.
The main issue is that they say both the stack and the heap are in memory (RAM).
If so, how can the stack be faster than the heap? After all, they are both in memory, and the speed depends on the memory's characteristics, such as bandwidth and CAS latency. I simply do not see any reason why the stack would be much faster than the heap if they are both in RAM.

I mean, if both the stack and the heap are in RAM, then why do we even need the heap? Just put everything on the stack and that is it.

My understanding is that, in order to make the stack really useful, it should be in the CPU cache, so it is small but much faster than RAM.

E.g., my laptop has a Xeon 2186M with
cache sizes: L1 384 KB, L2 1.5 MB, L3 12 MB.
Roughly, L1 is about 10 times faster than L2, L2 about 10 times faster than L3, and L3 about 10 times faster than RAM.

So when using heap-arrays in Intel Fortran, I should set the size smaller than 12 MB.
So, like below, I set it to 10240 KB, i.e., 10 MB.

In some of my code I actually do notice that if I set heap-arrays to 0, it can be 50% slower than setting heap-arrays to 128, 256, 512, …, 10240, etc. I am almost certain the speedup from a non-zero heap-arrays size is that some arrays really are stored in cache rather than in memory.

What do you guys think about the stack and the heap, and how should they be used and configured?

Thank you very much in advance.

2 Likes

You are correct. The stack and the heap come from the same place: the computer's RAM. The difference is that the compiler generally knows where to find variables that are on the stack when it compiles the code. The heap is allocated dynamically as the program runs, so the compiler doesn't know where the variables are coming from, reducing optimization opportunities.

"Stack variables" are generally used often enough that the compiler doesn't even put them on the stack; it just keeps them in a register. "Register spill" is when the compiler has to put variables on the stack because it ran out of registers. Even then, stack variables are often used enough that they are kept in the CPU cache. Stack variables cannot be accessed from another thread, so the compiler can assume they aren't going to change.

Stack variables do not require allocating on the heap, which is managed by a memory allocator, which is a slow process. Stack variables do not need to be "freed", which is also a slow process. These are the major reasons why people say the "stack" is faster than the "heap", even though they all come from the same place physically.

7 Likes

Thank you very much indeed! Your explanation is very clear! I really appreciate that!

I have two questions.

  1. Alright, so, about the subroutine do_cool_stuff(n): what if, inside it, I have an array like
real :: my_array(:)

Is my_array(:) on the heap or the stack, or does it depend, just like stack_or_heap_array(n)?

  2. You mentioned that,

Uhm, what is this "feature"? Do you mean the heap array?
Are you suggesting not using heap-arrays?

I usually set /heap-arrays0 in order to prevent stack overflow, but I never really thought about it too much.
I think you are right. I will try to get rid of heap-arrays and use the system default.
This heap-arrays option really can cause performance drops sometimes.

In the end, could you please, when convenient, give some examples of how to prevent stack overflow?

Thank you very much indeed! Your explanation is great!

Shared resources are slower than private resources because of the overhead of managing the sharing.

The stack area of memory is private to the procedure; the heap is shared by all procedures (currently executing or not).

In addition, there is the lifetime issue: the stack area is recycled at the time of completion of the procedure. Some other procedure will use it. Only one procedure is executing at one time (ignore multithreading for now). Stack use ties lifetime of variables to lifetime of procedures, a nice abstraction. But some Fortran variables are meant to survive completion of procedures (they have the SAVE attribute), so they cannot be put on the stack.
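For example, a minimal sketch of a SAVEd variable, whose value must survive between calls and therefore cannot live in the procedure's stack frame:

subroutine count_calls()
    integer, save :: ncalls = 0   ! keeps its value between calls, so it cannot be a stack variable
    ncalls = ncalls + 1
    print *, 'this procedure has been called', ncalls, 'times'
end subroutine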

It's called a stack because we always recycle the last thing we ordered (it shadows procedure entry and termination). On the heap, recycling is not predictable: any item could be the next to be recycled. This creates "holes", leading to inefficiency: some memory requests cannot be satisfied because no hole is big enough (even though the total hole space is more than enough). The stack doesn't have holes, but it can run out more quickly because it is generally provisioned smaller than the heap by the operating system.

[Added] There's no error handling when using the stack. If it runs out, there's no recovery and your program will crash if you're lucky (and will initiate WW3 if you are not). There is error handling when using the heap, so you can take action (an ALLOCATE with a STAT= specifier will return a nonzero status instead of aborting the program).
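For example, a minimal sketch of recoverable allocation from the heap (the fallback size is just illustrative):

subroutine allocate_work(n)
    integer, intent(in) :: n
    real, allocatable :: work(:)
    integer :: ierr

    allocate(work(n), stat=ierr)        ! heap allocation that can fail gracefully
    if (ierr /= 0) then
        print *, 'could not allocate', n, 'elements; trying a smaller size'
        allocate(work(min(n, 1000)), stat=ierr)
    end if
end subroutine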

6 Likes

Thanks to @dwwork and @themos for their lucid explanations. I have had the same questions as the OP for some time. I found a post What scientists must know about hardware to write fast code by BioJulia Developer Jakob Nybo Nissen:

Inside RAM, data is kept on either the stack or the heap. The stack is a simple data structure with a beginning and end, similar to a Vector in Julia. The stack can only be modified by adding or subtracting elements from the end, analogous to a Vector with only the two mutating operations push! and pop!. These operations on the stack are very fast. When we talk about "allocations", however, we talk about data on the heap. Unlike the stack, the heap has an unlimited size (well, it has the size of your computer's RAM), and can be modified arbitrarily, deleting any objects.

Intuitively, it may seem obvious that all objects need to be placed in RAM, must be able to be retrieved and deleted at any time by the program, and therefore need to be allocated on the heap. And for some languages, like Python, this is true. However, this is not true in Julia and other efficient, compiled languages. Integers, for example, can often be placed on the stack.

Why do some objects need to be heap allocated, while others can be stack allocated? To be stack-allocated, the compiler needs to know for certain that:

  • The object is a reasonably small size, so it fits on the stack. This is needed for technical reasons for the stack to operate.
  • The compiler can predict exactly when it needs to add and destroy the object so it can destroy it by simply popping the stack (similar to calling pop! on a Vector). This is usually the case for local variables in compiled languages.

2 Likes

Sorry I wasn't clear. I haven't been in Fortran for a while, so I sometimes forget what some things are called.

The "feature" is the "automatic array". This is the stack_or_heap_array in the example I gave. This is when you declare a new array in a function/subroutine with a size that depends on an integer argument to that function.


[Edit: Turns out what I say in this section about automatic arrays is incorrect. It seems compilers do not implement them in the fashion I describe here. See my reply below]

You are telling the compiler: I need an array of size N in this function. It is then the compiler's job to decide where to put that array: on the stack, or on the heap. The compiler has a predefined limit 'L'. If N < L, it puts it on the stack; if N > L, it puts it on the heap. What I dislike is that it introduces ambiguity into the code. You don't know what the compiler will do. There is nothing inherently wrong with allocating arrays on the heap; there is something wrong with not being in control of when that will occur. But I guess I'm a control freak.


As far as a stack overflow goes, you have to work REALLY hard to make that happen. In modern operating systems, stack sizes are quite large. Don't worry about it unless it actually happens. If it does, the fix is not to adjust some compiler flag, but to fix the code so it doesn't happen. This comes down to basically two things: avoiding large automatic arrays and avoiding recursion that never terminates.
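For example, a large automatic array can be replaced by an allocatable one, which goes to the heap regardless of compiler flags (a minimal sketch):

subroutine process(n)
    integer, intent(in) :: n
    ! real :: work(n)               ! automatic array: may land on the stack and overflow it
    real, allocatable :: work(:)    ! allocatable array: always the heap, and ALLOCATE can report errors
    allocate(work(n))
    work = 0.0
    ! ... do the actual work ...
end subroutine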

[Edit: Since the way I thought compilers implemented automatic arrays was wrong, it becomes much more important to understand how your specific compiler handles them and how to control them with compiler flags, because refactoring code to remove them can be painful.]


I think you will find that real :: my_array(:) doesn't compile. See here: Compiler Explorer
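A deferred-shape array like that is only valid with the ALLOCATABLE (or POINTER) attribute, roughly:

real, allocatable :: my_array(:)   ! deferred shape: legal, storage comes later from ALLOCATE
real, pointer     :: my_ptr(:)     ! also legal, but must be pointed at existing storage
! real            :: my_array(:)   ! not legal as a plain local variable: no size, no attribute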

1 Like

@dwwork, I am not sure that some of your claims apply to most Fortran compilers.
Certainly the use of stack vs heap very much depends on the size of typical arrays being used.

A predefined limit 'L', as you describe, is not typically available by default.
Generally, the compiler makes this decision at compile time, so if the array size is not available then, the options are limited (I am not aware of any exceptions, but they may exist). So local arrays can be sent to the stack or the heap, but for automatic arrays there is generally no such choice.

You can use a compile option like -fmax-stack-var-size=n, but this is applied at compile time to arrays whose size is known.
When creating arrays on entry to a subroutine or function, there are 3 main types of arrays:
local_arrays(n), of a defined size as n is a parameter, so the compiler has options.
automatic_arrays(n), of an unknown size as n is a routine argument, so off to the heap
allocatable :: array(:), which will always go to the heap.

At compile time, local_arrays typically go to the stack, but can be directed to the heap if "L" has been set in a compile option like -fmax-stack-var-size.
As automatic_array size is not known at compile time, it is usually directed to the heap, unless an option like -fstack-arrays is selected.
I do not know of a compile option where this decision between stack or heap is delayed to at run time.
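To make the three cases concrete (a sketch; the flags named in the comments are the gfortran options discussed above, and other compilers use different ones):

subroutine three_kinds(n)
    integer, intent(in) :: n
    integer, parameter :: m = 1000
    real :: local_array(m)         ! size fixed at compile time: normally the stack,
                                   ! unless redirected by e.g. -fmax-stack-var-size
    real :: automatic_array(n)     ! size known only on entry: stack or heap depending
                                   ! on the compiler and options such as -fstack-arrays
    real, allocatable :: alloc_array(:)
    allocate(alloc_array(n))       ! explicit ALLOCATE: always the heap
end subroutine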

The issue of whether the stack size is adequate really depends on the size of the arrays you typically use. A "stack overflow" after a lot of calculation has been completed is very annoying! (And the recovery path can be difficult.)
In a 64-bit OS with OpenMP, I set all stacks (1 main stack + 1 for each additional thread) to 500 MBytes and expect to have no problems (it's all just virtual memory; no physical memory is allocated until used), but I still have to direct some larger arrays to the heap using ALLOCATE. However, most Fortran users have to be aware that the stack is of limited use for large local arrays.

To an earlier point, it is true that heap allocation is slower than stack allocation, but how much that matters depends on how frequently the allocate/deallocate operations take place in comparison to the calculation phase. Heap arrays are a useful feature.
Stack vs. the caches is another issue again: generally the stack is bigger than the cache, but having the small, active part of the stack resident in L1 cache is a preferred outcome.
As with your other preferences for stack over heap, it really depends on the memory-management vs. compute-time profile of your code. For a routine that is called many (billions of) times, heap use would be a problem, so stack use would be preferred; but for my style of calculation, where use of the array is the dominant phase, the heap is easier to manage for arrays of unknown size.

It all depends on how you cope with a stack overflow and how often it occurs!
If you are running a program to produce project results, that is not the time for a stack overflow.

2 Likes

You're right, I did blindly make some claims about automatic arrays, so I've put together some simple investigations on godbolt. I made a subroutine that takes in a length parameter, creates an automatic array, and then passes that to an external subroutine.
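Something along these lines (a sketch of that test, not the exact code from the links; use_array is just a placeholder name):

subroutine auto_test(n)
    integer, intent(in) :: n
    real :: a(n)               ! automatic array whose placement (stack vs. heap) we want to inspect
    call use_array(a, n)       ! external call so the array cannot be optimized away
end subroutine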

gfortran's default seems to be to call malloc() for the automatic array: Compiler Explorer
If I add -fstack-arrays, it doesn't, and just adjusts the stack pointer: Compiler Explorer
Intel's default is to put it on the stack: Compiler Explorer
Adding the -heap-arrays flag shows that it now calls for_allocate() (whatever that is): Compiler Explorer. Changing the size parameter of -heap-arrays has no effect on the assembly.

So it seems that I was wrong. The two compilers that I have access to do NOT make a run-time decision about where to put the arrays. That doesn't preclude a compiler from doing so (it could very well be done). The decision is currently made at compile time through the use of compiler flags.

However, I still stand by my conclusion that automatic arrays should be avoided, precisely because of this ambiguity. Different compilers do different things, and I don't really like that.

1 Like

[Edit: The original post was incorrect in describing how automatic arrays are implemented. I couldn't edit it, so I've deleted it and I'm reposting with more accurate information. Sorry everything is out of order now.]

I didn't answer the second part of your question: the stack is a region of memory that is set aside for your program by the operating system. In theory, all variables declared in your functions are located "on the stack". As your program enters and leaves functions, the amount of stack your program uses expands and contracts. All this is rigged up by the compiler at compile time. As such, the compiler has a good deal of information about the location of these variables in memory, and can optimize accordingly.

Things that are not known at compile time have to be allocated on "the heap". This requires going through a memory allocator, which is a complex and slow beast. This includes any allocatable arrays or variables. One important special case is when you declare an array whose size is based on an integer dummy argument. The example below shows the 3 main possibilities.

subroutine do_cool_stuff(n)
    integer, intent(in) :: n    ! stack variable

    real :: stack_array(4)
    real, allocatable :: heap_array(:)
    real :: stack_or_heap_array(n)
end subroutine

The first array, stack_array, has a size that is known at compile time, so the compiler will put it on the stack. The size of heap_array is not known at compile time, so when you allocate it, it will be placed on the heap. As I've mentioned, this is slow and should be done as little as possible.
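One common way to keep that cost down is to allocate once and reuse the array, rather than allocating inside a hot loop (a minimal sketch):

subroutine repeated_work(n, nsteps)
    integer, intent(in) :: n, nsteps
    real, allocatable :: work(:)
    integer :: i

    allocate(work(n))              ! one heap allocation up front
    do i = 1, nsteps
        work = real(i)             ! reuse the same storage every iteration
        ! ... compute with work ...
    end do
    deallocate(work)               ! one deallocation at the end (also happens automatically at scope exit)
end subroutine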


[Edit: My original incorrect description :slightly_frowning_face:]
The size of stack_or_heap_array is not known at compile time, so what will the compiler do? In this case, it actually is possible to allocate an unknown amount on the stack; however, the stack is a finite resource. If you pass in n = 437289473289, you could very well exhaust the stack and produce a 'stack overflow' error (see where the name comes from?). To prevent this, most Fortran compilers set a limit, so that if n is larger than some value, the array is allocated on the heap instead of on the stack.

In my opinion, this "feature" of Fortran is counter-productive and should not be used, as it can produce sudden performance drops without telling you why. It is analogous to 'variable-length arrays' in C, which are frowned upon in some circles.

And don't worry about toggling all those little parameters; the defaults should be good enough.


[Edit: The correct description :grinning:]
The size of stack_or_heap_array is not known at compile time, so what will the compiler do? This is compiler dependent. Some compilers choose to allocate it on the stack, others on the heap. If placed on the stack, performance will be fast, but it runs the risk of causing a stack overflow, since the stack size is limited by the operating system. If placed on the heap, allocation/deallocation of these arrays will be slow, but it avoids a stack overflow.

If you use automatic arrays, it is important to know how your compiler will handle them and what the performance implications are. You will need to understand the compiler-specific flags that control these behaviors, like the stack size and whether the arrays are put on the stack or the heap. If on the stack, you will need to know how to set the stack size of your program. (Don't ask me, I don't know how to do it.)

I would still avoid automatic arrays, as I don't really want to have to read about each compiler that I compile with and learn all the tricky little flags. You can avoid that headache by just not using automatic arrays.


1 Like

The x86 and x64 architectures provide a dedicated ESP/RSP register as the stack pointer, but there have been a number of widely used architectures with no dedicated stack pointer: the IBM 370 and many RISC architectures, including PPC and MIPS, are examples. On such machines, the "stack" is simply a data structure, and different compilers might use different registers as their stack pointer! On the old mainframes where Fortran grew to prominence, often all variables were static, and there was no need/use for the kind of stack that we are familiar with today.

There have also been large computers that had no registers accessible to users, namely the so-called stack machines, such as the Burroughs 6000 series.

The Fortran language is high level, and generally avoids specifying how language features, such as automatic arrays, should be implemented, instead being content with specifying what properties such arrays are to have. The only occurrence of "stack" in the Fortran 2018 standard is as a variable name in an example.

3 Likes

Btw, on Apple M1, here is the behavior of ulimit:

$ ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8176
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       2666
-n: file descriptors                2560
$ ulimit -s unlimited
$ ulimit -a          
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             65520
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       2666
-n: file descriptors                2560
$ ulimit -s 70000    
ulimit: value exceeds hard limit

So it seems the maximum stack size that is allowed is 65520 KB ~ 64 MB.

Consequently, it would seem that one cannot have arrays larger than that allocated on the stack, but I have not tested it.
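A quick test could look something like this (a sketch; 100 million default reals is roughly 400 MB, well above the 64 MB limit, so it should overflow the stack if the automatic array really is placed there):

program stack_probe
    call big_frame(100000000)
contains
    subroutine big_frame(n)
        integer, intent(in) :: n
        real :: a(n)          ! automatic array: stack or heap depending on compiler and flags
        a(1) = 1.0
        a(n) = 2.0
        print *, a(1) + a(n)
    end subroutine
end program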

2 Likes

Well, this young gun barely remembers 32-bit computers, and I've never programmed for anything other than x86_64, so it's good to know to watch for stuff like this on other architectures.

1 Like

The operating systems I have used are very primitive with regard to stack management.
What should happen AT RUN TIME is that, if an array is large or the stack is exhausted, the array should be allocated on the heap or in a stack extension, rather than reporting a stack overflow and stopping the program.
Imagine that environment!! The web site StackOverflow might not even exist.
Why can't the OS manage this failure in a better way?
I should point out that for multi-threaded processing, the availability of a separate stack for each thread is a good design feature for OpenMP usage, so some improvements to the stack have been made.

As I indicated, in a 64-bit OS using gfortran, I can allocate each stack as 1 GByte in size and mitigate a lot of these problems. Each stack has a virtual memory address, and physical memory is only allocated as each stack grows. It is a solution that suits my analysis calculations, but it may not suit other programs that can benefit from a local/near stack address.
I think stack management could have been given a make-over when 64-bit OSes were introduced.

Also, @dwwork, you should reconsider automatic arrays in Fortran, as they provide a very useful function, being more flexible than local arrays and easier to use than ALLOCATE arrays. The performance penalty of using either type of array is usually insignificant, except in special cases, such as high-frequency ALLOCATE/DEALLOCATE.

1 Like

This Apple M1 64-bit OS certainly needs a rework of its stack design. 64 MBytes is hardly 64-bit usage.
The existing design is based on 32-bit program usage.
The stack size (or a stack extension) could (should) be at least 20% of the virtual memory address space, something like 2^44 of the 2^48 addressable virtual memory (or a virtual address range beyond the size of physical memory).
To say the address space is "unlimited" shows their "limited" understanding of how 64-bit actually is used.

1 Like