Automatic arrays and intrinsic array operations: to use or not to use?

It is great to hear that the developers of stdlib have the same preference as me. However, what if automatic arrays do lead to stack overflows and disasters happen on the user’s side? Should we just remind the user about the correct compiler options?

I see. Thank you for the explanation. In my homemade eye, I implement eye1(.) and eye2(.,.) and then bind them under a generic interface named eye, so that automatic arrays can be used in both cases.

1 Like

I don’t know the answers to all of your questions, but one thing to note is that stack allocation can only be used when the compiler can see both the creation and the destruction of the object. Temporary arrays used within expression evaluation, or automatic arrays declared in a subprogram, are a couple of examples. If the object has the SAVE attribute, so the compiler cannot see when destruction occurs, then it must use heap (or static) allocation. Stack allocation is much cheaper than heap allocation, by anywhere from one to two orders of magnitude, especially if garbage collection is invoked during the heap allocation or deallocation step. On the other hand, stack space is often limited in size, so the optimal choice is not always obvious to the compiler.
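A minimal sketch of the three cases, with a made-up routine named demo (the standard itself never mentions stack or heap; this is just how compilers typically behave):

```fortran
module storage_demo_mod
  implicit none
contains
  function demo(n) result(total)
    integer, intent(in) :: n
    real :: total
    real :: auto(n)                 ! automatic: lifetime is one call, a stack candidate
    real, save :: kept(10) = 2.0    ! SAVE: persists across calls, static/heap storage
    real, allocatable :: big(:)
    auto = 1.0
    allocate (big(n))               ! explicit heap allocation, freed on return
    big = auto + kept(1)
    total = sum(big)
  end function demo
end module storage_demo_mod

program storage_demo
  use storage_demo_mod
  implicit none
  print *, demo(5)   ! 5 * (1.0 + 2.0) = 15.0
end program storage_demo
```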

I’d guess that other languages don’t need to worry about this the same way we do in fortran because they are already so inherently inefficient, it doesn’t matter if they use heap for everything, or if they attempt to use stack, fail, and fall back to heap allocation. All of that happens with less effort than parsing the next line of user code, so it is all hidden by the nature of interpreted languages. Instead these languages rely on library routines, often written in fortran ironically, to do the heavy lifting.

However, even in scientific programming we use library routines to do our heavy lifting too, instead of writing explicit do-loops for common operations. The level-1 BLAS has been available since the 1970s for exactly this purpose, and the level-2 and level-3 routines were added over the following decade. This relieves the compiler of the burden of optimizations such as matrix sub-blocking and strip mining of loops. The array syntax of Fortran helps with this too, but as you point out, there is “many a slip, twixt cup and lip” in how well these work. The FORALL syntax for array operations was notorious for incurring unnecessary memory-allocation overhead, and oftentimes the equivalent do-loop performed better. I think FORALL is now obsolescent (since Fortran 2018) partly for this reason.
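For example, a sketch of the obsolescent FORALL next to its modern replacement (the array sizes here are arbitrary):

```fortran
program forall_vs_do
  implicit none
  real :: a(100), b(100)
  integer :: i
  b = 1.0
  ! FORALL is really a masked array assignment, so compilers may build a
  ! temporary for the whole right-hand side before storing it
  ! (FORALL is obsolescent since Fortran 2018).
  forall (i = 1:100) a(i) = 2.0 * b(i)
  ! DO CONCURRENT keeps ordinary loop semantics while asserting that the
  ! iterations are independent; it is usually at least as fast.
  do concurrent (i = 1:100)
    a(i) = 2.0 * b(i)
  end do
  print *, sum(a)   ! 200.0
end program forall_vs_do
```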

3 Likes

I guess you are talking about arrays as dummy arguments. I agree that automatic arrays (maybe “explicit-shape arrays” is more precise?) are not ideal in that case; assumed-shape arrays should be used instead. Some may not agree, but that is the suggestion from Fortran Best Practices. On my side, I totally agree with it and follow it strictly.

However, aren’t automatic arrays nice if you use them as local variables within procedures? They are the most natural representation/declaration of matrices.
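A minimal sketch of what I mean, with a made-up routine scale_rows that combines an assumed-shape dummy with an automatic local array:

```fortran
module scale_mod
  implicit none
contains
  ! Assumed-shape dummies: the caller's shape information travels with
  ! the array, so no sizes need to be passed explicitly.
  subroutine scale_rows(a, s)
    real, intent(inout) :: a(:, :)
    real, intent(in)    :: s(:)
    real :: tmp(size(a, 2))   ! automatic local: a naturally row-sized workspace
    integer :: i
    do i = 1, size(a, 1)
      tmp = a(i, :) * s(i)
      a(i, :) = tmp
    end do
  end subroutine scale_rows
end module scale_mod

program demo
  use scale_mod
  implicit none
  real :: m(2, 2)
  m = 1.0
  call scale_rows(m, [2.0, 3.0])
  print *, m   ! row 1 scaled by 2.0, row 2 by 3.0
end program demo
```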

I always use allocatable arrays, specifically because of the uncertainty of whether an automatic array might cause a crash. But also, often enough to just default to it, the sizes of the arrays are reliant on something conditional, or that may not be present (like in a previous example). So my defaults for arrays are that they are either:

  • Assumed shape dummy arguments
  • Allocatable

There are some exceptions, but they aren’t worth thinking about until you’ve really got a solid foundational understanding of a lot of other things.
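A minimal sketch of the conditional-size situation I mean (the names are made up):

```fortran
program conditional_sizes
  implicit none
  real, allocatable :: precond(:, :)
  logical :: use_precond
  integer :: n
  n = 4
  use_precond = .true.
  ! The workspace only exists when the option is actually requested,
  ! something a fixed-size or automatic array cannot express.
  if (use_precond) then
    allocate (precond(n, n))
    precond = 0.0
  end if
  if (allocated(precond)) print *, 'preconditioner workspace:', size(precond)
end program conditional_sizes
```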

3 Likes

Thank you for the clear response. So, automatic arrays are not preferred in your code, if ever used.

Assuming Fortran automatic arrays were as robust as arrays in other popular languages, would you still prefer allocatable arrays, even when the sizes are readily available at declaration?

(Sorry, I am afraid that a feature easily leading to stack overflows cannot be claimed robust — or it is at least not robust for naive users like me. Some may argue that pointers in C, for example, can also lead to disasters easily. But such disasters normally occur only if the code is invalid. With automatic arrays, however, is there any good reason to deem real :: A(10, 10) as valid but real :: A(1000, 1000) as invalid, provided that the latter gives you a crash?)

1 Like

Thank you very much for the insights.

To everybody, I would like to ask the following question: Which one do you prefer, (capable of) being efficient at the expense of occasional crashes, OR being robust at the expense of occasional inefficiency?

Recall that one of the slogans of Julia is “Fast as Fortran, Beautiful as Python”, or even “Faster than Fortran, cleaner than Numpy”. There are two possibilities.

Possibility 1. The Julia people are lying. Efficiency and elegance simply cannot coexist. They eat each other.

Possibility 2. It is possible to achieve efficiency (with the help of Fortran) while staying elegant, maybe not to the extent that the Julia people claim. If this is the case, why can other languages achieve such a nice balance with the help of Fortran, but Fortran itself cannot? Is there a paradox here?

Disclaimer: I am not a Julia user. My knowledge about this language comes completely from the current forum. I use only Fortran and MATLAB, both heavily.

For my personal projects I’m very much in line with Brad’s approach: I use assumed-shape dummy arguments and allocatable arrays. Usually, I structure my code so that only the main program or a central driver has to perform allocations; the possibility of passing unallocated arrays to optional dummy arguments removes a lot of clutter from the procedures, and checking for a feature with present feels more natural. This, in my opinion, makes an allocatable array much more powerful than an automatic one.
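A minimal sketch of the pattern, using a made-up routine eval; it relies on the Fortran 2008 rule that an unallocated allocatable actual argument is treated as absent by an optional dummy:

```fortran
module eval_mod
  implicit none
contains
  subroutine eval(x, f, grad)
    real, intent(in)  :: x(:)
    real, intent(out) :: f
    real, intent(out), optional :: grad(:)
    f = sum(x**2)
    if (present(grad)) grad = 2.0 * x   ! gradient only computed when requested
  end subroutine eval
end module eval_mod

program driver
  use eval_mod
  implicit none
  real :: f
  real, allocatable :: g(:)
  call eval([1.0, 2.0], f, g)   ! g unallocated: grad is absent, work skipped
  allocate (g(2))               ! the driver owns the allocation...
  call eval([1.0, 2.0], f, g)   ! ...and now the gradient is filled in
  print *, f, g                 ! 5.0   2.0 4.0
end program driver
```

The driver decides whether to allocate; the procedure itself never needs to know why the gradient was or wasn’t requested.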

So far I have had one application for an explicit-shape dummy argument, which is (as expected) lying about the array bounds to save a couple of reshape operations and a potential copy in a hot loop. The same routine also makes use of an automatic array for similar performance reasons (the dimensions are small enough to always be safe).

1 Like

There are a lot of assumptions in this paragraph.

Do these other languages provide the opportunity to pay such attention?

Do you know if MATLAB/Python/Julia/R create temporary arrays or not? Temporary arrays can be on the stack or on the heap. I agree that large temporary arrays on the stack can be a problem, but these can be managed by Fortran compiler options. Fortran does have options!

Regarding Fortran having options for more efficient approaches via libraries which are often tuned to specific matrix attributes: is that a bad thing, or possibly an advantage?

Before you trust too much of “my reading”, I suggest you test the single thread MATMUL in gfortran. My experience is it performs very well.

DGEMM is provided by a number of libraries (MKL, LAPACK etc). It has been a strength of Fortran that these libraries are available. I only wish their optimisation approaches could be better documented for us Fortran users to learn and adapt to other coding.

You say DO loops are not mathematical, but are series expansions mathematical? Perhaps Taylor was not a mathematician by this interpretation.

Lots of other languages claim to be very efficient; for the types of calculations their intrinsic functions have been tuned for, this may well be the case.
Fortran provides the framework for efficient computation, although the continual changes in hardware require changed programming approaches. I am always trying to adapt to these changes, and Fortran provides a framework for this adaptation.

2 Likes

Thank you @JohnCampbell for your very detailed and insightful response.

First, I acknowledge that my claim about “the performance of matmul is sometimes inferior to a naive triple loop” is WRONG . My apologies to everyone who read this and many thanks to John for pointing it out.

Second, I would like to emphasize that I am more than happy to hear about the advantages of Fortran, particularly intrinsics like matmul, because I always use them. I rely on them.


Now, let me share my views on your other comments. Since I know only MATLAB (in addition to Fortran), I will focus on it.

Yes, MATLAB does. In MATLAB, you can choose which implementation of BLAS to link to. However, the good news is that you almost never need to do so unless you have quite particular hardware. The default setting works perfectly in most cases. This is why most users do not even know the existence of such options.

I do not know, but probably they do create temporary arrays. However, MATLAB is never supposed to crash due to the creation of temporary arrays — whatever options you use when running your code. If a crash does happen, everybody (the user and the vendor) will agree that it is a bug of MATLAB that should be fixed.

Sure. But the advantage will become much bigger if it does not depend on options, which vary from one compiler to another. In MATLAB, such options are not needed unless, again, you have quite particular hardware.

Totally agree. Thank you for pointing this out.

For your reference, here is what I said:

I am sorry, but according to my limited and humble knowledge of mathematics, Taylor expansion is not considered a matrix/vector operation in normal contexts. I will be very happy to learn from a different opinion.

Totally agree. Regarding MATLAB, it turns out that “the types of calculations their intrinsic functions have been tuned for” means almost all calculations that MATLAB users normally do. What interests me is why we couldn’t tune the Fortran intrinsic functions to the same level of optimality, without dependence on options. If such a requirement is too high, avoiding stack overflows might be a reasonable starting point.

Sorry, I do not see quite well the relation of this point to the questions under discussion, but I totally agree.

Many thanks again for the detailed response and helpful perspectives.

1 Like

Thank you for this information. Being a bit curious, may I ask why the stdlib contributors agreed on the opposite preference between automatic and allocatable arrays? What is the consideration behind it, despite the possibility of stack overflows under some circumstances?

I see. Won’t this sometimes lead to passing some variables unnecessarily? Suppose that a subroutine other than the main program or central driver needs some arrays that are not shared with others; then it seems that such arrays should be declared/allocated locally. Sorry if I have misunderstood something.

1 Like

I’m mostly using this for requesting properties, like a derivative:

If they are requested the main driver only has to do an allocation

This makes it easier to pass through the request for gradients to other containers

There have been discussions on this topic, mainly in the review of the linalg functions. These two might offer some insights:

Potentially there are other threads on this at the stdlib repo, but I cannot find them now.

1 Like

Thanks for your clarification. We are somewhat discussing different things here, but DO loops can be very important in Fortran. For my use of OpenMP, loops are the main construct I use, so they are a key computing approach.

I should explain: I have never used MATMUL in production code. However, matrix multiplication (MATMUL / DGEMM / do loops) is a computation that is very easy to program in OpenMP, and I have used this calculation to identify coding approaches that are more effective for OpenMP. These include:

  • loop order
  • matrix partitioning
  • matrix size
  • vector/AVX instructions for inner loop

One of the main problems with making MATMUL more efficient in OpenMP is increasing the amount of computation per block of data read from memory. This has been highlighted in another post by @themos when referring to the Roofline model. There is much to learn from this.
https://fortran-lang.discourse.group/t/does-the-speed-bandwidth-of-memory-matter/3154
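As an illustration only, a minimal OpenMP sketch of the kind of experiment I mean; the j-k-i loop order keeps the innermost loop stride-1 in column-major storage (compile with -fopenmp or equivalent; without it, the directives are ignored and the loop runs serially):

```fortran
program omp_matmul_sketch
  implicit none
  integer, parameter :: n = 200
  real(8) :: a(n, n), b(n, n), c(n, n)
  integer :: i, j, k
  a = 1.0d0
  b = 2.0d0
  c = 0.0d0
  ! Parallelise over columns of C; the i loop is innermost so that
  ! consecutive iterations touch consecutive memory locations.
  !$omp parallel do private(i, k)
  do j = 1, n
    do k = 1, n
      do i = 1, n
        c(i, j) = c(i, j) + a(i, k) * b(k, j)
      end do
    end do
  end do
  !$omp end parallel do
  print *, maxval(abs(c - matmul(a, b)))   ! should be ~0
end program omp_matmul_sketch
```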

I would also like to highlight alternative ways of defining arrays in Fortran, as their use gives Fortran some flexibility. These are:

  • Allocatable arrays, which are best for large arrays and are placed on the heap.
  • Local arrays of a fixed, declared size, usually small and placed on the stack.
  • Automatic arrays, whose size is defined by a routine argument, placed on the stack.
  • Temporary arrays, generated by the compiler, which can cause unwanted stack-overflow problems.

Unfortunately, these temporary arrays can be a consequence of Fortran array syntax and cause many problems, which DO loops and allocatable arrays can avoid.
It is a strength of Fortran that it provides the flexibility for managing these different kinds of arrays.

1 Like

I use automatic arrays whenever possible.

It’s the compiler’s job to make sure it does not run out of stack and segfault. I am aware that many compilers will just segfault for larger arrays, but to me, from the user’s perspective, that seems like a bug in the compiler.

Now, how to fix it? I am not sure; we have to explore options. In the worst case the compiler can allocate them on the heap, which should not be slower than using allocate, but there might be faster options, such as inserting a quick check when the function starts to see whether it is going to run out of stack. I guess there are two issues:

  • Nice error messages instead of a segfault when we run out of stack (applies for infinite recursion as well)
  • Actually allowing to use larger arrays that would not fit into stack

For the second issue it seems the array must probably be allocated outside of the stack by the compiler. Possibly make this configurable; I believe GFortran allows allocating even allocatable arrays on the stack with an option (-fstack-arrays). On Linux one can change the stack size with ulimit. On macOS I believe you can’t increase the stack beyond quite a small limit. So perhaps the choice depends on the platform and on the user’s willingness to call ulimit by hand. If the stack is small, then one could imagine instructing the compiler to insert some checks and/or use heap for automatic arrays. And the default should be not to segfault (so the user must add compiler options to speed things up if that can result in a possible segfault).
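For reference, a sketch of these knobs on Linux (exact behaviour varies by compiler version; consult the gfortran manual):

```shell
# Raise the stack limit for the current shell (Linux).
ulimit -s unlimited

# Force eligible array temporaries and allocatables onto the stack.
gfortran -fstack-arrays prog.f90

# The opposite direction: move locals larger than 64 KiB off the stack.
gfortran -fmax-stack-var-size=65536 prog.f90
```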

2 Likes

Intel ifort has a -heap-arrays compiler option that lets you set a threshold on the size of an array that gets put on the stack. Anything below the threshold goes on the stack, anything above gets put on the heap. I think NVIDIA/PGI always puts automatic arrays on the heap by default. To get them on the stack you have to use the -Mstack-arrays option.
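A sketch of how these options are invoked (flag spellings as given above; check the compiler manuals for the exact semantics):

```shell
# ifort: automatic/temporary arrays larger than 10 KB go on the heap.
ifort -heap-arrays 10 prog.f90

# ifort: with no threshold, all such arrays go on the heap.
ifort -heap-arrays prog.f90

# NVIDIA/PGI: put automatic arrays back on the stack.
nvfortran -Mstack-arrays prog.f90
```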

4 Likes

A problem with deciding between stack arrays and heap arrays is when the decision is made.
If the decision is made at compile time, then the array size must be known, so automatic arrays cannot be selected.
If the decision is made at run time, then provision for either a stack or a heap array must be compiled into the code, plus the provision for its release.

Another issue highlighted in these posts is “but there might be faster options”. This is not easy to assess.
In OpenMP it is often better to have private stack arrays on each thread’s stack, rather than heap arrays on a shared heap; but for a single-threaded process, the difference between stack and heap is just a different address and memory page, so the preference is debatable.
It can be that many small arrays show a preference for the stack, but placing larger arrays on the stack can dilute the benefit of a small local stack in L1 cache. Not a clear answer.

1 Like

Great to know this! I take exactly the same point of view. I hope many others share this view as well.

This is exactly what I hope.

1 Like

I think this is the right approach for compiler writers. Namely, there should at least be compiler options that specify stack, heap, or the option at runtime to try stack first and fall back to heap if that fails. That last option should probably be the default, with the other two there to allow the programmer to tune the code.

I guess a question would be whether to allow the programmer to specify any of that in the fortran code itself, either with compiler directives or with specification statements of some kind. I have mixed feelings about this approach. Remember that the fortran standard itself never mentions stacks, heaps, registers, or cache, so introducing specification statements to control those things would be an entirely new direction for the language. And if it were done in a standard way with compiler directives, that would also be a new direction for the language.

3 Likes

I think this gets to the point about automatic arrays vs statically sized arrays. I’d suspect neither real :: A(10, 10) nor real :: A(1000, 1000) will cause any crash. However, all bets are off with something like

program example
  call sub(10)
  call sub(10000)
contains
  subroutine sub(n)
    integer, intent(in) :: n
    real :: A(n, n)
  end subroutine
end program

In this simple case the compiler will probably get things right, since it can see everything. The problem is that if the procedure and the call are in separate translation units, there’s no way the compiler can know the best approach for A. The most efficient code is (usually) to just assume it will fit on the stack and always put it there. Anything else would require extra overhead for something that might not be needed. Fortran compilers are often pressured to “go fast at all costs”, so that’s what they default to.

2 Likes

Thanks to everyone for the interesting and insightful discussion.

Up to now, there has been a good discussion on automatic arrays, but not as much on the segfault of intrinsic procedures. Let me give an example. It may take a few minutes to run, depending on your hardware.

! test_matmul.f90

module maprod_mod
    implicit none
    private
    public :: maprod, rk
    integer, parameter :: rk = kind(0.0D0)

contains

    function maprod(X, Y) result(Z)
        real(rk), intent(in) :: X(:, :), Y(:, :)
        real(rk), allocatable :: Z(:, :)
        integer :: i, j, k

        if (size(X, 2) /= size(Y, 1)) then
            write (*, *) "Wrong size"
            stop
        end if

        allocate (Z(size(X, 1), size(Y, 2)))
        Z = 0.0_rk
        do j = 1, size(Y, 2)
            do i = 1, size(X, 1)
                do k = 1, size(X, 2)
                    Z(i, j) = Z(i, j) + X(i, k) * Y(k, j)
                end do
            end do
        end do
    end function maprod

end module maprod_mod

program test_matmul
    use, non_intrinsic :: maprod_mod, only : maprod, rk
    implicit none
    integer, parameter :: n = 2000
    real(rk), allocatable :: A(:, :), B(:, :)

    write (*, *) 'Initialize'
    allocate (A(n, n), B(n, n))
    A = 0.0_rk

    write (*, *) 'Loop'
    B = maprod(A, A)
    write (*, *) 'Succeed'

    write (*, *) 'MATMUL'
    B = matmul(A, A)
    write (*, *) 'Succeed'
end program test_matmul

This is what I obtained on my laptop (Ubuntu 20.04, intel(R) Core™ i7-10610U CPU @ 1.80GHz) with ifort.

$ uname -a
Linux zX18 5.11.15-051115-generic #202104161034 SMP Fri Apr 16 10:40:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ ulimit -s 
8192
$ ifort -v
ifort version 2021.6.0
$ ifort -g -warn all test_matmul.f90 && ./a.out
 Initialize
 Loop
 Succeed
 MATMUL
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
a.out              000000000040666A  Unknown               Unknown  Unknown
libpthread-2.31.s  00007FC387AC8420  Unknown               Unknown  Unknown
a.out              000000000040573A  Unknown               Unknown  Unknown
a.out              0000000000403822  Unknown               Unknown  Unknown
libc-2.31.so       00007FC3878E4083  __libc_start_main     Unknown  Unknown
a.out              000000000040372E  Unknown               Unknown  Unknown

N.B.:

  1. The code does not involve automatic arrays.

  2. Without special compiler options, gfortran and nagfor did not segfault.

  3. I am well aware of the reason behind the segfault and how to resolve it. However, I take a viewpoint similar to @certik : a segfault of an intrinsic function (under the default settings) is hardly desirable for whatever reason. Imagine that we are not talking about Fortran but MATLAB or Python — or even C. How does this become OK when it comes to (a major implementation of) Fortran?

P.S.: This issue was reported to Intel more than 10 years ago (e.g., matmul segmentation fault for huge matrices - Intel Communities). Since the behavior persists, it seems that not many people regard it as a bug.

3 Likes

Does this happen because the result of MATMUL() is stored in a temporary array on the stack before being copied to the result B? Or is it because internally MATMUL is generating temporary arrays?

If it is the former, then maybe the general solution is to propose to the language committee that a subroutine interface be added to the language for matrix-matrix products. A simplified version of the BLAS xGEMM() subroutine would be a good starting point. This would also address the problems associated with unnecessary temporary arrays for operations like MATMUL(TRANSPOSE(A), B).
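A minimal sketch of what such a subroutine interface might look like (the name matmul_sub is made up; a real proposal would presumably mirror xGEMM’s transpose and scaling arguments):

```fortran
module matmul_sub_mod
  implicit none
contains
  ! C is supplied by the caller, so no temporary is needed for the result.
  subroutine matmul_sub(a, b, c)
    real, intent(in)  :: a(:, :), b(:, :)
    real, intent(out) :: c(:, :)
    integer :: i, j, k
    c = 0.0
    do j = 1, size(b, 2)
      do k = 1, size(a, 2)
        do i = 1, size(a, 1)
          c(i, j) = c(i, j) + a(i, k) * b(k, j)
        end do
      end do
    end do
  end subroutine matmul_sub
end module matmul_sub_mod

program demo
  use matmul_sub_mod
  implicit none
  real :: a(2, 3), b(3, 2), c(2, 2)
  a = 1.0
  b = 2.0
  call matmul_sub(a, b, c)
  print *, c   ! every entry is sum over k=1..3 of 1.0*2.0, i.e. 6.0
end program demo
```

Because the caller owns C, the implementation is free to write the result in place, on heap or stack, without any hidden temporary for the function result.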

2 Likes