Automatic arrays and intrinsic array operations: to use or not to use?

zaikunzhang · July 25, 2022, 4:42pm

Question: Is it considered a good practice to use automatic arrays and intrinsic array operations (matmul, transpose, etc)?

The reasonable answer, I guess and I hope, is yes, it is a good practice.

For automatic arrays, let us focus on their usage as local variables. I do not intend to discuss what kind of arrays should be used as dummy variables. You may see Fortran Best Practices for such discussions.

Now let me explain why I have this question.

For me, one of the core strengths of Fortran is its native support for matrices and matrix/vector operations. By “native support for matrices”, I mean particularly automatic arrays — to get a matrix, you do not need to do anything more than declare it. As for matrix/vector operations, I mean intrinsic array operations such as matmul, transpose, dot_product, and broadcasting.

Thanks to such support, we can write scientific computing code in a way that is very similar to how the algorithms are formulated in mathematics. Without such support, a language should not be called a “Formula Translator”.

Take matrix/vector operations as an example. Without intrinsic support for them, we have to either re-invent these wheels ourselves, or code them as loops whenever such operations appear in an algorithm. However, as a mathematician, I can tell that mathematicians rarely consider matrix/vector operations as loops — loops are only how we calculate them, but not how we define/understand them. Loops are used only when we describe truly iterative algorithms, e.g., CG, steepest descent, …

Not surprisingly, most other popular languages in scientific computing are also strong at supporting matrices and matrix/vector operations. My most familiar language is MATLAB — if you call it a language. It supports almost all matrix/vector operations you can name. A well-knowledged MATLAB programmer can code a sophisticated algorithm without using any loop but only using matrix/vector operations.

Indeed, as a common principle in MATLAB programming, you should never code a matrix/vector operation in loops. It is a bad practice to re-invent any tool that the language intrinsically provides. You may argue that this is because MATLAB is slow at loops as a scripting language. But I understand it from a more philosophical perspective: why does the language provide the functionality if we are not encouraged to use it?

So, it seems that the only reasonable answer to my question is “yes, it is a good practice; do use automatic arrays and intrinsic array operations as long as you can”.

However, is it really the case?

Take automatic arrays as an example. These arrays are often allocated on the stack. In that case, a double-precision matrix of modest size will lead you to a stack overflow (I know that “stack” is not a specification of the language standard but a choice of the compiler implementations, but should we expect all users to know this? Anyway, such knowledge does not relieve the hurt of a stack overflow). On macOS, this size is about 256*256. Of course, we can tell the compiler not to save automatic arrays at the stack, or do so only for small arrays, or simply use allocatable arrays instead of automatic ones. However, let us acknowledge that it is an inconvenience, which essentially does not exist in other popular languages that have native support for matrices. Imagine that you are using a piece of sophisticated code that contains automatic arrays, then you will encounter disasters if you do not specify the correct compiler options. Surely, the author of the code should document the compiler options clearly, and the user should read the documentation carefully. But why don’t users of MATLAB/Python/Julia/R need to pay such attention when using intrinsic functionalities of the languages?

Now take matmul as another example. Consider the code

A = B + matmul(C, D)

A couple of problems may occur if you code Fortran in that way. First, a temporary array will likely be created for matmul(C, D); in addition to the performance penalty, this temporary array may again be allocated on the stack, which may again lead to a mysterious segmentation fault if you are unlucky even when the matrices are not so big (see my reply below for a simple example of such a segmentation fault). Second, according to my reading and testing, the performance of matmul ~~is sometimes inferior to a naive triple loop~~ (This is wrong, thank @JohnCampbell for pointing it out, and my apologies for misinformation), let alone optimized subroutines. Surely, if you are knowledged enough, you can choose the correct compiler option and link to the correct library so that matmul is translated to, e.g., gemm. But, again, why don’t users of MATLAB/Python/Julia/R need to pay such attention when using intrinsic functionalities of the languages?

Indeed, in MATLAB, A = B + C*D is almost the only reasonable way to code the matrix calculation illustrated above. Any code involving loops is essentially wrong unless these matrices have special structures. In addition, MATLAB users normally (but not always) do not need to worry about stack overflow or which library to link to. They simply rely on the intrinsic functionality of the language.

With these questions in mind, I checked the code of stdlib. Very interestingly, it seems that the contributors have diverse opinions.

Example 1: stdlib_stats_corr.fypp, Lines 119–226:

          do i = 1, size(x, 2)
            center(:, i) = x(:, i) - mean_
          end do
          #:if t1[0] == 'r'
            res = matmul( center, transpose(center))
          #:else
            res = matmul( center, transpose(conjg(center)))
          #:endif

We see matmul and transpose, which I like very much. However, we also see a loop that can be replaced with broadcasting. There must be some careful consideration behind the code, but it is simply interesting to see code of two different “styles” side by side.

Example 2: stdlib_linalg.fypp, diag versus eye:

! Part of the code for `diag` (vector to matrix):
 #:for k1, t1 in RCI_KINDS_TYPES
      module function diag_${t1[0]}$${k1}$(v) result(res)
        ${t1}$, intent(in) :: v(:)
        ${t1}$ :: res(size(v),size(v))
      end function diag_${t1[0]}$${k1}$
    #:endfor

! Code for `eye`
    pure function eye(dim1, dim2) result(result)

        integer, intent(in) :: dim1
        integer, intent(in), optional :: dim2
        integer(int8), allocatable :: result(:, :)

        integer :: dim2_
        integer :: i

        dim2_ = optval(dim2, dim1)
        allocate(result(dim1, dim2_))
        
        result = 0_int8
        do i = 1, min(dim1, dim2_)
            result(i, i) = 1_int8
        end do

    end function eye

We see that diag declares its result as an automatic array, while eye declares it as an allocatable array. Why are they different?

Before the end, let me stress that the native support for matrices and matrix/vector operations is a core strength of Fortran. It would be extremely wired if we are indifferent about its performance and even its occasional failure under default settings.

Sorry for this long post. What is your opinion on my question (see the very top line)? Thank you very much for your attention.

Related posts (to the authors of the posts, I apologize for bugging you with an @, but I just thought you might be interested in this topic since you discussed similar things before):

Reply to “Is the Fortran Standard crucial for the future of the Fortran language?” (see the comment on “Decent array support”) by @Pap
Mapping matrix & vector arithmetic to BLAS calls by @nshaffer
Automatic vs. allocatable arrays for function results by @milancurcic and replies by @Beliavsky @certik @zoziha @themos
Reply to “Best practice of allocating memory in Fortran” by @JohnCampbell
When to use allocatable arrays? by @fortran4r
Why a function returns an array is much slower than a subroutine returns an array? (real MWE included) by @CRquantum
Reply to “Why abandon Fortran for Linear Algebra?” by @Beliavsky
Julia: Fast as Fortran, Beautiful as Python by @certik
Fortran: An ideal language for writing “Templates” of numerical algorithms (?) by zaikunzhang

Back in the good old days – the “Golden Era” of computers, it was easy to separate the men from the boys (sometimes called “Real Men” and “Quiche Eaters” in the literature). During this period, the Real Men were the ones that understood computer programming, and the Quiche Eaters were the ones that didn’t. A real computer programmer said things like “DO 10 I=1,10” and “ABEND” (they actually talked in capital letters, you understand) …

In the 1950’s von Neumann … was confronted with the FORTRAN concept; John Backus remembered von Neumann being unimpressed and that he asked “why would you want more than machine language?”

awvwgk · July 25, 2022, 4:49pm

In stdlib we actually prefer automatic arrays where possible, even if this can lead to stack overflows with some compilers. The reason for using an allocatable return value in eye is that the second dimension is optional and we cannot reliably declare the shape of the automatic array anymore.

zaikunzhang · July 25, 2022, 5:04pm

It is great to hear that the developers of stdlib have the same preference as me. However, what if automatic arrays do lead to stack overflows and disasters happen on the user’s side? Should we just remind the user about the correct compiler options?

I see. Thank you for the explanation. In my homemade eye, I implement eye1(.) and eye2(.,.) and then declare eye as a module procedure, so that automatic arrays can be used in both cases.

RonShepard · July 25, 2022, 10:38pm

I don’t know the answers to all of your questions, but one thing to note is that stack allocation can only be used when the compiler can see both the creation and the destruction of the object. Temporary arrays used within an expression evaluation or within a subprogram declaration of an automatic array are a couple of examples. If the object has the SAVE attribute, where it cannot see when destruction occurs, then it must use heap allocation. Stack allocation is much cheaper than heap allocation, by anywhere from one to two orders of magnitude, especially if garbage collection is invoked during the heap allocation or deallocation step. On the other hand, stack allocation is often limited in size, so the optimal choice is not always obvious to the compiler.

I’d guess that other languages don’t need to worry about this the same way we do in fortran because they are already so inherently inefficient, it doesn’t matter if they use heap for everything, or if they attempt to use stack, fail, and fall back to heap allocation. All of that happens with less effort than parsing the next line of user code, so it is all hidden by the nature of interpreted languages. Instead these languages rely on library routines, often written in fortran ironically, to do the heavy lifting.

However, even in scientific programming we use library routines to do our heavy lifting too instead of writing explicit do-loops for common operations. The level-1 BLAS has been available since the 1970s for exactly this purpose, and the level-2 and level-3 routines were added over the next decade. This relieves the compiler of the burden of such optimizations as matrix sub blocking and strip mining of loops. The array syntax of fortran helps with this too, but as you point out, there is “many a slip, twixt cup and lip” in how well these work. The FORALL syntax for array operations was notoriously bad at resulting in unnecessary memory allocation overhead, and often times the equivalent do-loop performed better. I think FORALL is now deprecated because of this feature.

zaikunzhang · July 26, 2022, 1:35am

I guess you are talking about arrays as dummy variables. I agree that automatic arrays (maybe “explicit-shape arrays” is more precise?) are not ideal in that case. Assumed-shape arrays should be used instead. Some may not agree but it is the suggestion from Fortran Best Practices. On my side, I totally agree with it and follow it strictly.

However, aren’t automatic arrays nice if you use them as local variables within procedures? They are the most natural representation/declaration of matrices.

everythingfunctional · July 26, 2022, 2:07am

I always use allocatable arrays, specifically because of the uncertainty of whether an automatic array might cause a crash. But also, often enough to just default to it, the sizes of the arrays are reliant on something conditional, or that may not be present (like in a previous example). So my defaults for arrays are that they are either:

Assumed shape dummy arguments
Allocatable

There are some exceptions, but they aren’t worth thinking about until you’ve really got a solid foundational understanding of a lot of other things.

zaikunzhang · July 26, 2022, 2:40am

Thank you for the clear response. So, automatic arrays are not preferred in your code, if ever used.

Assume that the Fortran automatic arrays were as robust as in other popular languages, would you still prefer allocatable arrays, even if the sizes are readily available when you declare them?

(Sorry, I am afraid that a feature easily leading to stack overflows cannot be claimed robust — or it is at least not robust for naive users like me. Some may argue that pointers in C, for example, can also lead to disasters easily. But such disasters normally occur only if the code is invalid. With automatic arrays, however, is there any good reason to deem real :: A(10, 10) as valid but real :: A(1000, 1000) as invalid, provided that the latter gives you a crash?)

zaikunzhang · July 26, 2022, 3:33am

Thank you very much for the insights.

To everybody, I would like to ask the following question: Which one do you prefer, (capable of) being efficient at the expense of occasional crashes, OR being robust at the expense of occasional inefficiency?

Recall that one of the slogans of Julia is “Fast as Fortran, Beautiful as Python”, or even “Faster than Fortran, cleaner than Numpy”. There are two possibilities.

Possibility 1. The Julia people are lying. Efficiency and elegance simply cannot coexist. They eat each other.

Possibility 2. It is possible to achieve efficiency (with the help of Fortran) while keeping elegant, maybe not as much as the Julia people claim. If this is the case, why could other languages achieve such a nice balance with the help of Fortran, but Fortran itself could not? Is there a paradox here?

Disclaimer: I am not a Julia user. My knowledge about this language comes completely from the current forum. I use only Fortran and MATLAB, both heavily.

awvwgk · July 26, 2022, 7:34am

For my personal projects I’m very much in line with Brad’s approach, I use assumed-shape dummy arguments and allocatable arrays. Usually, I tend to structure my code in a way that only the main program or a central driver has to perform allocations, using the possibility to pass unallocated arrays to optional dummy arguments removes a lot of clutter from the procedures and the check for a feature with present feels more natural. This makes in my opinion an allocatable array much more powerful than an automatic one.

So far I had one application for an explicit-shape dummy argument, which is (like expected) lying about the array boundaries to safe a couple of reshape operations and a potential copy in a hot loop. The same routine also makes use of an automatic array for similar performance reasons (the dimensions are small enough to be always safe).

JohnCampbell · July 26, 2022, 7:41am

zaikunzhang:

A couple of problems may occur if you code Fortran in that way. First, a temporary array will likely be created for matmul(C, D); in addition to the performance penalty, this temporary array may again be allocated at the stack, which may again lead to a mysterious segmentation fault if you are unlucky even when the matrices are not so big. Second, according to my reading and testing, the performance of matmul is sometimes inferior to a naive triple loop, let alone optimized subroutines. Surely, if you are knowledged enough, you can choose the correct compiler option and link to the correct library so that matmul is translated to, e.g., gemm. But, again, why don’t users of MATLAB/Python/Julia/R need to pay such attention when using intrinsic functionalities of the languages?

There are a lot of assumptions in this paragraph.

Do these other languages provide the opportunity to pay such attention ?

Do you know if MATLAB/Python/Julia/R create temporary arrays or not ? Temporary arrays can be on the stack or on the heap. I agree that large temporary arrays on the stack can be a problem, but these can be managed by Fortran compiler options. Fortran does have options !

Regarding Fortran having options for more efficient approaches via libraries which are often tuned to specific matrix attributes, is that a bad thing ? or possibly an advantage ?

Before you trust too much of “my reading”, I suggest you test the single thread MATMUL in gfortran. My experience is it performs very well.

DGEMM is provided by a number of libraries (MKL, LAPACK etc). It has been a strength of Fortran that these libraries are available. I only wish their optimisation approaches could be better documented for us Fortran users to learn and adapt to other coding.

You say DO loops are not mathematical, but are series expansions mathematical ? Perhaps Taylor was not a mathematician by this interpretation.

Lots of other languages claim to be very efficient. For the types of calculations their intrinsic functions have been tuned for; this may well be the case.
Fortran provides the framework for efficient computation, although the continual changes in hardware require changed programming approaches. I am always trying to adapt to these changes.
Fortran provides a framework for this adaption.

zaikunzhang · July 26, 2022, 9:53am

Thank you @JohnCampbell for your very detailed and insightful response.

First, I acknowledge that my claim about “the performance of matmul is sometimes inferior to a naive triple loop” is WRONG . My apologies to everyone who read this and many thanks to John for pointing it out.

Second, I would like to emphasize that I am more than happy to hear advantages of Fortran, particularly intrinsic like matmul etc, because I always use them. I rely on them.

Now, let me share my views on your other comments. Since I know only MATLAB (in addition to Fortran), I will focus on it.

Yes, MATLAB does. In MATLAB, you can choose which implementation of BLAS to link to. However, the good news is that you almost never need to do so unless you have quite particular hardware. The default setting works perfectly in most cases. This is why most users do not even know the existence of such options.

I do not know, but probably they do create temporary arrays. However, MATLAB is never supposed to crash due to the creation of temporary arrays — whatever options you use when running your code. If a crash does happen, everybody (the user and the vendor) will agree that it is a bug of MATLAB that should be fixed.

Sure. But the advantage will become much bigger if it does not depend on options, which vary from one compiler to another. In MATLAB, such options are not needed unless, again, you have quite particular hardware.

Totally agree. Thank you for pointing this out.

For your reference, here is what I said:

I am sorry, but according to my limited and humble knowledge of mathematics, Taylor expansion is not considered a matrix/vector operation in normal contexts. I will be very happy to learn from a different opinion.

Totally agree. Regarding MATLAB, it turns out that “the types of calculations their intrinsic functions have been tuned for” mean almost all calculations that MATLAB users normally do. What interests me is why we couldn’t tune the Fortran intrinsic functions also up to the same optimality, without dependence on options. If such a requirement is too high, avoiding stack overflows might be a reasonable starting point.

Sorry, I do not see quite well the relation of this point to the questions under discussion, but I totally agree.

Many thanks again for the detailed response and helpful perspectives.

zaikunzhang · July 26, 2022, 10:09am

Thank you for this information. Being a bit curious, may I ask why in stdlib the contributors agree to take the opposite preference between automatic and allocatable arrays? What is the consideration behind it, despite the possibility of stack overflows under some circumstances?

I see. Will this sometimes lead to passing some variables unnecessarily? Suppose that a subroutine other than the main program or central diver has some need some arrays that are not shared with others, then it seems that such arrays should be declared/allocated locally. Sorry if I understand something wrongly.

awvwgk · July 26, 2022, 10:27am

I’m mostly using this for requesting properties, like a derivative:

github.com

tblite/tblite/blob/5db81cc/src/tblite/xtb/singlepoint.f90#L82-L85


      
          !> Gradient with respect to cartesian coordinates
          real(wp), contiguous, intent(out), optional :: gradient(:, :)
          !> Strain derivatives with respect to strain deformations
          real(wp), contiguous, intent(out), optional :: sigma(:, :)

If they are requested the main driver only has to do an allocation

github.com

tblite/tblite/blob/5db81cc/app/driver_run.f90#L128-L130


      
          if (config%grad) then
             allocate(gradient(3, mol%nat), sigma(3, 3))
          end if

This makes it easier to pass through the request for gradients to other containers

github.com

tblite/tblite/blob/5db81cc/src/tblite/xtb/singlepoint.f90#L167-L168


      
          call calc%interactions%update(mol, icache)
          call calc%interactions%get_engrad(mol, icache, eint, gradient, sigma)

There have been discussions on this topic, mainly around in the review of linalg functions. Those two might offer some insights:

Potentially there are other threads on this at the stdlib repo, but I cannot find them now.

JohnCampbell · July 26, 2022, 12:20pm

Thanks for your clarification. We are somewhat discussing different things here, but DO LOOPS can be very important in Fortran. For my use of OpenMP, loops are the main construction that I use, so a key computing approach.

I should explain, I have never used MATMUL in production code. However matrix multiplication (MATMUL / DGEMM / do loops) is a computation that is very easy to program in OpenMP and I have used this calculation to identify coding approaches that are more effective for OpenMP. These can include:

loop order
matrix partitioning
matrix size
vector/AVX instructions for inner loop

One of the main problems with making MATMUL more efficient in OpenMP is increasing the amount of computation per block of data read from memory. This has been highlighted in another post by @themos when referring to the Roofline model. There is much to learn from this.
https://fortran-lang.discourse.group/t/does-the-speed-bandwidth-of-memory-matter/3154

I would also like to highlight alternative ways of defining arrays in Fortran, as their use gives Fortran some flexibility. These are:

ALLOCATE arrays, which are best for large arrays and placed on the heap.
Local arrays, which are of a defined size, usually small and placed on the stack
Automatic arrays, which are a size defined by a routine argument and placed on the stack.
Temporary arrays, generated by the compiler, that can cause unwanted stack overflow problems.

Unfortunately, these temporary arrays can be a consequence of Fortran array syntax and cause many problems, which DO loops and allocate arrays can avoid.
It is a strength of Fortran as it can provide flexibility for manageing these different arrays.

certik · July 26, 2022, 1:25pm

I use automatic arrays whenever possible.

It’s the compiler job to make sure it does not run out of stack and not segfault. I am aware of the fact that many compilers will just segfault for larger arrays, but to me, from the user’s perspective, that seems like a bug in the compiler.

Now, how to fix it? I am not sure, we have to explore options. In the worst case the compiler can allocate them on the heap, which should not be slower than using allocate, but there might be faster options, such as inserting some quick check when the function runs if it is going to run out of stack. I guess there are two issues:

Nice error messages instead of a segfault when we run out of stack (applies for infinite recursion as well)
Actually allowing to use larger arrays that would not fit into stack

For the second issue it seems the array probably must be allocated outside of the stack by the compiler. Possibly make this configurable, I believe GFortran allows to allocate even allocatable arrays in stack with an option (-fstack-arrays). On linux one can change the stack size with ulimit. On macOS I believe you can’t increase the stack beyond quite a small limit. So perhaps the user can choose on the platform and the willingness to call ulimit by hand. If the stack is small, then one could imagine instructing the compiler to insert some checks and/or use heap for automatic arrays. And the default should be not to segfault (so the user must insert compiler options to speedup if it can result in a possible segfault).

rwmsu · July 26, 2022, 1:53pm

Intel ifort has a -heap-arrays compiler option that lets you set a threshold on the size of an array that gets put on the stack. Anything below the threshold goes on the stack, anything above gets put on the heap. I think NVIDIA/PGI always puts automatic arrays on the heap by default. To get them on the stack you have to use the -Mstack-arrays option.

JohnCampbell · July 26, 2022, 2:17pm

A problem with deciding to use stack arrays or heap arrays is when is the decision made.
If the decision is made at compile time, then the array size must be known, so automatic arrays can not be selected.
If the decision is made at run time, then provision for either stack or heap array must be provided in the code at compile time, plus the provision for their release.

Anothe issue that is highlighted in posts is “but there might be faster options”. This is not easy to assess.
In OpenMP it is often better to have private stack arrays on a thread stack, rather than as heap arrays on a shared heap, but for a single thread process the difference between stack and heap is just a different address and memory page so the preferred can be debatable.
It can be that many small arrays might show a preference for the stack, but placing larger arrays on the stack can dilute the benefit of a small local stack in L1 cache. Not a clear answer.

zaikunzhang · July 26, 2022, 4:03pm

Great to know this! I take exactly the same point of view. I hope many others share this view as well.

This is exactly what I hope.

RonShepard · July 26, 2022, 4:46pm

I think this is the right approach for compiler writers. Namely, there should at least be compiler options that specify stack, heap, or the option at runtime to try stack first and fall back to heap if that fails. That last option should probably be the default, with the other two there to allow the programmer to tune the code.

I guess a question would be whether to allow the programmer to specify any of that in the fortran code itself, either with compiler directives or with specification statements of some kind. I have mixed feelings about this approach. Remember that the fortran standard itself never mentions stacks, heaps, registers, or cache, so introducing specification statements to control those things would be an entirely new direction for the language. And if it were done in a standard way with compiler directives, that would also be a new direction for the language.

everythingfunctional · July 26, 2022, 5:44pm

I think this gets to the point about automatic arrays vs statically sized arrays. I’d suspect neither real :: A(10, 10) nor real :: A(1000, 1000) will cause any crash. However, all bets are off with something like

program example
  call sub(10)
  call sub(10000)
contains
  subroutine sub(n)
    integer, intent(in) :: n
    real :: A(n, n)
  end subroutine
end program

In this simple case the compiler will probably get things right, since it can see everything. The problem is if the procedure and call are in separate translation units, there’s no way the compiler can know what is the best approach for A. The most efficient code is (usually) to just assume it will fit on the stack and always put it on the stack. Anything else would require extra overhead, for something that might not be needed. Fortran compilers are often pressured to “go fast at all costs”, so that’s what they default to.

Topic		Replies	Views
Fortran: Array Language (video) Advocacy	20	1020	February 3, 2024
Testing the performance of `matmul` under default compiler settings Help	37	2781	August 11, 2022
Array programming languages	7	1149	June 22, 2021
LFortran now supports all intrinsic functions Announcements	44	1451	January 5, 2025
Weekly Blog Posts for "Supporting Arrays and Allocatables in LFortran" GSoC-2021	20	2069	August 16, 2021

Automatic arrays and intrinsic array operations: to use or not to use?

Related topics