Automatic arrays and intrinsic array operations: to use or not to use?

Thank you @JohnCampbell for your very detailed and insightful response.

First, I acknowledge that my claim that “the performance of matmul is sometimes inferior to a naive triple loop” is WRONG. My apologies to everyone who read it, and many thanks to John for pointing it out.

Second, I would like to emphasize that I am more than happy to hear about the advantages of Fortran, particularly intrinsics like matmul, because I use them all the time. I rely on them.


Now, let me share my views on your other comments. Since I know only MATLAB (in addition to Fortran), I will focus on it.

Yes, MATLAB does. In MATLAB, you can choose which implementation of BLAS to link to. However, the good news is that you almost never need to do so unless you have quite particular hardware. The default setting works perfectly in most cases. This is why most users do not even know the existence of such options.

I do not know, but probably they do create temporary arrays. However, MATLAB is never supposed to crash due to the creation of temporary arrays — whatever options you use when running your code. If a crash does happen, everybody (the user and the vendor) will agree that it is a bug of MATLAB that should be fixed.

Sure. But the advantage would be much bigger if it did not depend on options, which vary from one compiler to another. In MATLAB, such options are not needed unless, again, you have quite particular hardware.

Totally agree. Thank you for pointing this out.

For your reference, here is what I said:

I am sorry, but according to my limited and humble knowledge of mathematics, Taylor expansion is not considered a matrix/vector operation in normal contexts. I will be very happy to learn from a different opinion.

Totally agree. Regarding MATLAB, it turns out that “the types of calculations their intrinsic functions have been tuned for” means almost all the calculations that MATLAB users normally do. What interests me is why we couldn’t tune the Fortran intrinsic functions to the same level of optimality, without dependence on options. If that requirement is too high, avoiding stack overflows might be a reasonable starting point.

Sorry, I do not see quite well the relation of this point to the questions under discussion, but I totally agree.

Many thanks again for the detailed response and helpful perspectives.

1 Like

Thank you for this information. Being a bit curious, may I ask why the stdlib contributors agreed on the opposite preference between automatic and allocatable arrays? What is the consideration behind it, despite the possibility of stack overflows in some circumstances?

I see. Will this sometimes lead to passing some variables unnecessarily? Suppose a subroutine other than the main program or central driver needs some arrays that are not shared with others; it seems that such arrays should then be declared/allocated locally. Sorry if I have misunderstood something.

1 Like

I’m mostly using this for requesting properties, like a derivative:

If they are requested, the main driver only has to do an allocation.

This makes it easier to pass the request for gradients through to other containers.

There have been discussions on this topic, mainly in the review of the linalg functions. Those two might offer some insights:

Potentially there are other threads on this at the stdlib repo, but I cannot find them now.

1 Like

Thanks for your clarification. We are somewhat discussing different things here, but DO loops can be very important in Fortran. For my use of OpenMP, loops are the main construct I use, so they are a key computing approach.

I should explain: I have never used MATMUL in production code. However, matrix multiplication (MATMUL / DGEMM / do loops) is a computation that is very easy to program in OpenMP, and I have used it to identify coding approaches that are more effective for OpenMP. These include:

  • loop order
  • matrix partitioning
  • matrix size
  • vector/AVX instructions for inner loop
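
To make the loop-order point concrete, here is a minimal sketch of my own (not John's production code; all names are invented) of an OpenMP-parallel triple-loop multiply, compiled with e.g. gfortran -fopenmp:

```fortran
! Sketch only: loop order chosen for Fortran's column-major storage.
program omp_matmul_demo
  implicit none
  integer, parameter :: n = 200
  real :: a(n, n), b(n, n), c(n, n)
  call random_number(a)
  call random_number(b)
  call matmul_omp(a, b, c, n)
  ! Difference from the intrinsic MATMUL: roundoff-level only.
  print *, maxval(abs(c - matmul(a, b)))
contains
  subroutine matmul_omp(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in)  :: a(n, n), b(n, n)
    real, intent(out) :: c(n, n)
    integer :: i, j, k
    ! Parallelise over columns of C; the innermost i loop then reads
    ! A(:,k) and writes C(:,j) with stride 1, which suits the
    ! vector/AVX units.
    !$omp parallel do private(i, k)
    do j = 1, n
       c(:, j) = 0.0
       do k = 1, n
          do i = 1, n
             c(i, j) = c(i, j) + a(i, k) * b(k, j)
          end do
       end do
    end do
    !$omp end parallel do
  end subroutine matmul_omp
end program omp_matmul_demo
```

Swapping the i and k loops in this sketch typically costs a large factor on big matrices, because the inner loop then strides through memory instead of walking down a column.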

One of the main problems with making MATMUL more efficient in OpenMP is increasing the amount of computation per block of data read from memory. This has been highlighted in another post by @themos when referring to the Roofline model. There is much to learn from this.
https://fortran-lang.discourse.group/t/does-the-speed-bandwidth-of-memory-matter/3154

I would also like to highlight alternative ways of defining arrays in Fortran, as their use gives Fortran some flexibility. These are:

  • ALLOCATABLE arrays, which are best for large arrays and are placed on the heap.
  • Local arrays, which have a defined size, are usually small, and are placed on the stack.
  • Automatic arrays, whose size is defined by a routine argument and which are placed on the stack.
  • Temporary arrays, generated by the compiler, which can cause unwanted stack overflow problems.

Unfortunately, these temporary arrays can be a consequence of Fortran array syntax and cause many problems, which DO loops and allocatable arrays can avoid.
It is nevertheless a strength of Fortran that it provides the flexibility for managing these different kinds of arrays.
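
As a minimal illustration of these four kinds (a sketch with invented names, not code from this thread):

```fortran
program array_kinds
  implicit none
  real :: v(5) = [1.0, 2.0, 3.0, 4.0, 5.0]
  call demo(size(v), v)
contains
  subroutine demo(n, v)
    integer, intent(in) :: n
    real, intent(in) :: v(n)
    real, allocatable :: big(:)   ! ALLOCATABLE array: explicit ALLOCATE, on the heap
    real :: small(8)              ! local array of defined size, on the stack
    real :: work(n)               ! automatic array: size from an argument, on the stack
    allocate (big(n))
    small = 0.0
    ! The right-hand side below may be evaluated into a compiler-generated
    ! temporary (the fourth kind) before being copied into work.
    work = v(n:1:-1) + v
    big = work
    print *, big(1)               ! 6.0 for the data above
  end subroutine demo
end program array_kinds
```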

1 Like

I use automatic arrays whenever possible.

It’s the compiler’s job to make sure the program does not run out of stack and does not segfault. I am aware that many compilers will simply segfault for larger arrays, but from the user’s perspective that seems like a bug in the compiler.

Now, how to fix it? I am not sure; we have to explore the options. In the worst case the compiler can allocate them on the heap, which should not be slower than using allocate, but there might be faster options, such as inserting a quick check when the function starts to see whether it is about to run out of stack. I guess there are two issues:

  • Nice error messages instead of a segfault when we run out of stack (applies for infinite recursion as well)
  • Actually allowing to use larger arrays that would not fit into stack

For the second issue, it seems the array probably must be allocated outside the stack by the compiler. Possibly make this configurable; I believe GFortran allows even allocatable arrays to be placed on the stack with an option (-fstack-arrays). On Linux one can change the stack size with ulimit. On macOS I believe you can’t increase the stack beyond quite a small limit. So perhaps the choice depends on the platform and on the user’s willingness to call ulimit by hand. If the stack is small, one could imagine instructing the compiler to insert checks and/or use the heap for automatic arrays. And the default should be not to segfault (so the user must add compiler options to speed things up if doing so can result in a segfault).
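
For reference, the knobs mentioned above look roughly like this on Linux (flag spellings from the gfortran manual; the size is an arbitrary example):

```shell
# Show / raise the stack limit for the current shell.
ulimit -s            # typically prints 8192 (KiB)
ulimit -s unlimited  # where the OS allows it

# GFortran placement options:
gfortran -fstack-arrays prog.f90              # put array temporaries on the stack
gfortran -fmax-stack-var-size=65536 prog.f90  # keep only locals up to 64 KiB on the stack
```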

2 Likes

Intel ifort has a -heap-arrays compiler option that lets you set a threshold on the size of an array that gets put on the stack. Anything below the threshold goes on the stack, anything above gets put on the heap. I think NVIDIA/PGI always puts automatic arrays on the heap by default. To get them on the stack you have to use the -Mstack-arrays option.

4 Likes

A problem with deciding between stack arrays and heap arrays is when the decision is made.
If the decision is made at compile time, then the array size must be known, so automatic arrays cannot be selected.
If the decision is made at run time, then provision for either a stack or a heap array must be generated in the code at compile time, plus provision for its release.

Another issue highlighted in the posts is “but there might be faster options”. This is not easy to assess.
In OpenMP it is often better to have private stack arrays on a thread’s stack rather than heap arrays on a shared heap, but for a single-threaded process the difference between stack and heap is just a different address and memory page, so the preference is debatable.
Many small arrays might favour the stack, but placing larger arrays on the stack can dilute the benefit of a small local stack in L1 cache. There is no clear answer.
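
A minimal sketch of the thread-private case (an invented example; each thread's private copy of work lives on its own thread stack, whose size is governed by OMP_STACKSIZE):

```fortran
program thread_private
  implicit none
  integer, parameter :: n = 8, m = 1000
  real :: a(m, n), s(n)
  real :: work(m)         ! with private(work), one copy per thread stack
  integer :: j
  call random_number(a)
  !$omp parallel do private(work)
  do j = 1, n
     work = 2.0 * a(:, j) ! each thread fills its own stack copy
     s(j) = sum(work)
  end do
  !$omp end parallel do
  print *, s(1)
end program thread_private
```

Without -fopenmp the directives are plain comments and the single work array is used serially, so the same source covers both cases.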

1 Like

Great to know this! I take exactly the same point of view. I hope many others share this view as well.

This is exactly what I hope.

1 Like

I think this is the right approach for compiler writers. Namely, there should at least be compiler options that specify stack, heap, or the option at runtime to try stack first and fall back to heap if that fails. That last option should probably be the default, with the other two there to allow the programmer to tune the code.

I guess a question would be whether to allow the programmer to specify any of that in the fortran code itself, either with compiler directives or with specification statements of some kind. I have mixed feelings about this approach. Remember that the fortran standard itself never mentions stacks, heaps, registers, or cache, so introducing specification statements to control those things would be an entirely new direction for the language. And if it were done in a standard way with compiler directives, that would also be a new direction for the language.

3 Likes

I think this gets to the point about automatic arrays vs statically sized arrays. I’d suspect neither real :: A(10, 10) nor real :: A(1000, 1000) will cause any crash. However, all bets are off with something like

program example
  call sub(10)
  call sub(10000)
contains
  subroutine sub(n)
    integer, intent(in) :: n
    real :: A(n, n)
  end subroutine
end program

In this simple case the compiler will probably get things right, since it can see everything. The problem is if the procedure and call are in separate translation units, there’s no way the compiler can know what is the best approach for A. The most efficient code is (usually) to just assume it will fit on the stack and always put it on the stack. Anything else would require extra overhead, for something that might not be needed. Fortran compilers are often pressured to “go fast at all costs”, so that’s what they default to.

2 Likes

Thanks to everyone for the interesting and insightful discussions.

Up to now there has been a good discussion of automatic arrays, but not as much about segfaults in intrinsic procedures. Let me give an example. It may take a few minutes to run, depending on your hardware.

! test_matmul.f90

module maprod_mod
implicit none
private
public :: maprod, rk
integer, parameter :: rk = kind(0.0D0)

contains

function maprod(X, Y) result(Z)
real(rk), intent(in) :: X(:, :), Y(:, :)
real(rk), allocatable :: Z(:, :)
integer :: i, j, k

if (size(X, 2) /= size(Y, 1)) then
    write (*, *) "Wrong size"
    stop
end if

allocate (Z(size(X, 1), size(Y, 2)))
Z = 0.0_rk
do j = 1, size(Y, 2)
    do i = 1, size(X, 1)
        do k = 1, size(X, 2)
            Z(i, j) = Z(i, j) + X(i, k) * Y(k, j)
        end do
    end do
end do
end function maprod

end module maprod_mod

program test_matmul
use, non_intrinsic :: maprod_mod, only : maprod, rk
implicit none
integer, parameter :: n = 2000
real(rk), allocatable :: A(:, :), B(:, :)

write (*, *) 'Initialize'
allocate (A(n, n), B(n, n))
A = 0.0_rk

write (*, *) 'Loop'
B = maprod(A, A)
write (*, *) 'Succeed'

write (*, *) 'MATMUL'
B = matmul(A, A)
write (*, *) 'Succeed'
end program test_matmul

This is what I obtained on my laptop (Ubuntu 20.04, intel(R) Core™ i7-10610U CPU @ 1.80GHz) with ifort.

$ uname -a
Linux zX18 5.11.15-051115-generic #202104161034 SMP Fri Apr 16 10:40:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ ulimit -s 
8192
$ ifort -v
ifort version 2021.6.0
$ ifort -g -warn all test_matmul.f90 && ./a.out
 Initialize
 Loop
 Succeed
 MATMUL
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
a.out              000000000040666A  Unknown               Unknown  Unknown
libpthread-2.31.s  00007FC387AC8420  Unknown               Unknown  Unknown
a.out              000000000040573A  Unknown               Unknown  Unknown
a.out              0000000000403822  Unknown               Unknown  Unknown
libc-2.31.so       00007FC3878E4083  __libc_start_main     Unknown  Unknown
a.out              000000000040372E  Unknown               Unknown  Unknown

N.B.:

  1. The code does not involve automatic arrays.

  2. Without special compiler options, gfortran and nagfor did not segfault.

  3. I am well aware of the reason behind the segfault and how to resolve it. However, I take a viewpoint similar to @certik’s: a segfault in an intrinsic function (under the default settings) is hardly desirable, whatever the reason. Imagine we were talking not about Fortran but about MATLAB or Python, or even C. How does this become OK when it comes to (a major implementation of) Fortran?

P.S.: This issue was reported to Intel more than 10 years ago (e.g., matmul segmentation fault for huge matrices - Intel Communities). Since the behavior persists, it seems that not many people regard it as a bug.

3 Likes

Does this happen because the result of MATMUL() is stored in a temporary array on the stack before being copied to the result B? Or is it because internally MATMUL is generating temporary arrays?

If it is the former, then maybe the general solution is to propose to the language committee that a subroutine interface should be added to the language for matrix-matrix products. A simplified version of the BLAS xGEMM() subroutine would be a good starting point. This would also address the problems associated with unnecessary temporary arrays for operations like MATMUL(TRANSPOSE(A),B).
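
A hypothetical sketch of what such an interface could look like (the name matmul_sub and its argument list are invented here, loosely following a simplified xGEMM; this is not proposal text). Because the caller supplies C, no hidden temporary is needed, and a transa flag subsumes MATMUL(TRANSPOSE(A), B) without materialising the transpose:

```fortran
program matmul_sub_demo
  implicit none
  real :: a(4, 3), b(4, 2), c(3, 2)
  call random_number(a)
  call random_number(b)
  call matmul_sub(a, b, c, transa=.true.)           ! C = A**T * B
  ! Roundoff-level difference from the intrinsic form:
  print *, maxval(abs(c - matmul(transpose(a), b)))
contains
  subroutine matmul_sub(a, b, c, transa)
    real, intent(in)  :: a(:, :), b(:, :)
    real, intent(out) :: c(:, :)                    ! caller-supplied result
    logical, intent(in), optional :: transa
    integer :: i, j, k
    logical :: ta
    ta = .false.
    if (present(transa)) ta = transa
    c = 0.0
    do j = 1, size(c, 2)
       do i = 1, size(c, 1)
          if (ta) then                              ! read A transposed in place
             do k = 1, size(a, 1)
                c(i, j) = c(i, j) + a(k, i) * b(k, j)
             end do
          else
             do k = 1, size(a, 2)
                c(i, j) = c(i, j) + a(i, k) * b(k, j)
             end do
          end if
       end do
    end do
  end subroutine matmul_sub
end program matmul_sub_demo
```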

2 Likes

Try the heap-arrays flag to compile:

ifort -heap-arrays test_matmul.f90 && ./a.out

OR gfortran without it:

gfortran test_matmul.f90 && ./a.out
2 Likes

Thank you very much for raising this. I hope it will help others who encounter the same bug, but it does not solve the problem itself.

I am not sure. This is what I obtained from valgrind ./a.out:

 Initialize
 Loop
 Succeed
 MATMUL
==324261== Warning: client switching stacks?  SP change: 0x1ffeffb4e0 --> 0x1ffd176ce0
==324261==          to suppress, use: --max-stackframe=32000000 or greater
==324261== Invalid write of size 8
==324261==    at 0x4050E9: MAIN__ (test_matmul.f90:47)
==324261==  Address 0x1ffd176ce0 is on thread 1's stack
==324261== 
==324261== Warning: ignored attempt to set SIGKILL handler in sigaction();
==324261==          the SIGKILL signal is uncatchable
==324261== Warning: ignored attempt to set SIGSTOP handler in sigaction();
==324261==          the SIGSTOP signal is uncatchable
==324261== Warning: ignored attempt to set SIGRT32 handler in sigaction();
==324261==          the SIGRT32 signal is used internally by Valgrind
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
a.out              000000000040601A  Unknown               Unknown  Unknown
libpthread-2.31.s  00000000049AF420  Unknown               Unknown  Unknown
a.out              00000000004050E9  Unknown               Unknown  Unknown
a.out              0000000000403822  Unknown               Unknown  Unknown
libc-2.31.so       00000000049E2083  __libc_start_main     Unknown  Unknown
a.out              000000000040372E  Unknown               Unknown  Unknown

and by gdb ./a.out:

Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
 Initialize
 Loop
 Succeed
 MATMUL

Program received signal SIGSEGV, Segmentation fault.
0x00000000004050e9 in test_matmul () at test_matmul.f90:47
47	B = matmul(A, A)
(gdb) bt
#0  0x00000000004050e9 in test_matmul () at test_matmul.f90:47
#1  0x0000000000403822 in main ()
#2  0x00007ffff7c85083 in __libc_start_main (main=0x4037f0 <main>, argc=1, argv=0x7fffffff9968, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffff9958)
    at ../csu/libc-start.c:308
#3  0x000000000040372e in _start ()
(gdb) 

It seems to me that the segfault occurred when writing the result of matmul to B, but I am not sure.
(BTW, why doesn’t ifort print the call stack of the error even when -g is specified?)

This sounds truly interesting. It is something I hope for. How could we make it happen?

We need an extendable stack, much like the heap.
I will often set the stack to 500 MBytes (on Windows), but even with that, you still have to be careful. For OpenMP, I find this essential.
It should not be this way.

With 64-bit programs, the stack design is not fit for purpose and should be changed.

I do not know what other OSes provide. Do any offer an extendable stack?

2 Likes

I agree. However, pushing some change on the OS level seems even more difficult than on the language level as suggested by @RonShepard , which sounds like a reasonable proposal no matter what happens on the OS side.

Has someone proven that the Fortran specification requires allocating temporaries for TRANSPOSE and MATMUL, and that it is not just poor-quality compilers doing this? I see no reason why compilers can’t use “as if” semantics and eliminate temporaries that have no detectable impact on program execution. If I write B = B + transpose(A), the compiler can optimize the statement as a whole into loops that require no temporary.
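
For instance, a compiler honouring “as if” semantics could lower that statement to plain loops with the indices swapped, as in this sketch:

```fortran
program no_temp
  implicit none
  integer, parameter :: n = 3
  real :: a(n, n), b(n, n)
  integer :: i, j
  call random_number(a)
  b = 0.0
  ! What a compiler could emit for  B = B + TRANSPOSE(A):
  ! the transpose is realised purely by swapping the indices,
  ! so no temporary array is ever created.
  do j = 1, n
     do i = 1, n
        b(i, j) = b(i, j) + a(j, i)
     end do
  end do
  print *, maxval(abs(b - transpose(a)))   ! exactly 0.0
end program no_temp
```

The one case that genuinely needs care is aliasing, e.g. B = B + transpose(B), where the naive in-place loops would read already-updated elements.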

2 Likes

Technically, I agree with the approach you specify.

I do not want to judge the quality of a compiler by the performance of matmul alone. Indeed, all the compilers I have tested are respectable, high-quality ones. It is just a bit surprising to me that such a failure has been permitted in (a) high-quality compiler(s) for so long, even after bug reports, and that not many people seem to care.

1 Like

I have used c_loc()/loc() to look at addresses for this situation, and I have seen cases with intent(in) dummy arguments and transpose(A) actual arguments where it was clear that the underlying data had the same addresses. This was with user-written code, not with the intrinsic matmul(), so it depended on optimization level and so on. Ironically, later versions of the same compilers sometimes did not do this optimization. I’ve wondered why it was dropped. The intrinsic matmul(), which can also be inlined directly into the code and then optimized accordingly, would have even more opportunity to recognize and optimize this case.

I have advocated adding some operations like SHALLOW_TRANSPOSE() to the language that would expose this capability to the programmer. This is a pointer kind of operation that would just fiddle around with the metadata and would not actually copy the underlying data. I think the result could be modifiable in the language, so it would work for any intent() in addition to being available, for example, within an expression. After a pointer assignment such as

AT => SHALLOW_TRANSPOSE(A)

then A(i,j) would reference the same memory as AT(j,i), and AT could be used as an actual argument in a subroutine call and associated with an assumed shape dummy array argument of any intent. The operation would always be O(1) effort rather than O(N**2) effort and with no associated memory allocation.

Something like matmul(AT, B) or matmul(SHALLOW_TRANSPOSE(A), B) would then work without the explicit transpose operation or the prior memory allocation and copy that might otherwise be required.

It could also be used in mixed-language situations where the rows and columns of the two languages could be brought into agreement.
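
For what it's worth, the transa/transb flags of BLAS GEMM already give this no-copy effect for multiplication specifically, reading A transposed in place; a sketch (requires linking a BLAS, e.g. gfortran prog.f90 -lblas):

```fortran
program shallow_via_dgemm
  implicit none
  integer, parameter :: m = 3, n = 2, k = 4
  double precision :: a(k, m), b(k, n), c(m, n)
  external :: dgemm                      ! from the linked BLAS library
  call random_number(a)
  call random_number(b)
  ! C = A**T * B without ever materialising TRANSPOSE(A):
  ! the 'T' flag tells DGEMM to index A transposed in place.
  call dgemm('T', 'N', m, n, k, 1.0d0, a, k, b, k, 0.0d0, c, m)
  ! Roundoff-level difference from the intrinsic form:
  print *, maxval(abs(c - matmul(transpose(a), b)))
end program shallow_via_dgemm
```

This covers only multiplication, of course; a SHALLOW_TRANSPOSE in the language would make the O(1) view available to any consumer of the array.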

1 Like