Automatic arrays and intrinsic array operations: to use or not to use?

Thank everyone for the interesting and insightful discussions.

Up to now, there has been a good discussion on automatic arrays, but not as much on the segfault of intrinsic procedures. Let me give an example. It may take a few minutes to run, depending on your hardware.

! test_matmul.f90

module maprod_mod
implicit none
private
public :: maprod, rk
integer, parameter :: rk = kind(0.0D0)

contains

function maprod(X, Y) result(Z)
real(rk), intent(in) :: X(:, :), Y(:, :)
real(rk), allocatable :: Z(:, :)
integer :: i, j, k

if (size(X, 2) /= size(Y, 1)) then
    write (*, *) "Wrong size"
    stop
end if

allocate (Z(size(X, 1), size(Y, 2)))
Z = 0.0_rk
do j = 1, size(Y, 2)
    do i = 1, size(X, 1)
        do k = 1, size(X, 2)
            Z(i, j) = Z(i, j) + X(i, k) * Y(k, j)
        end do
    end do
end do
end function maprod

end module maprod_mod

program test_matmul
use, non_intrinsic :: maprod_mod, only : maprod, rk
implicit none
integer, parameter :: n = 2000
real(rk), allocatable :: A(:, :), B(:, :)

write (*, *) 'Initialize'
allocate (A(n, n), B(n, n))
A = 0.0_rk

write (*, *) 'Loop'
B = maprod(A, A)
write (*, *) 'Succeed'

write (*, *) 'MATMUL'
B = matmul(A, A)
write (*, *) 'Succeed'
end program test_matmul

This is what I obtained on my laptop (Ubuntu 20.04, intel(R) Core™ i7-10610U CPU @ 1.80GHz) with ifort.

$ uname -a
Linux zX18 5.11.15-051115-generic #202104161034 SMP Fri Apr 16 10:40:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ ulimit -s 
8192
$ ifort -v
ifort version 2021.6.0
$ ifort -g -warn all test_matmul.f90 && ./a.out
 Initialize
 Loop
 Succeed
 MATMUL
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
a.out              000000000040666A  Unknown               Unknown  Unknown
libpthread-2.31.s  00007FC387AC8420  Unknown               Unknown  Unknown
a.out              000000000040573A  Unknown               Unknown  Unknown
a.out              0000000000403822  Unknown               Unknown  Unknown
libc-2.31.so       00007FC3878E4083  __libc_start_main     Unknown  Unknown
a.out              000000000040372E  Unknown               Unknown  Unknown

N.B.:

  1. The code does not involve automatic arrays.

  2. Without special compiler options, gfortran and nagfor did not segfault.

  3. I am well aware of the reason behind the segfault and how to resolve it. However, I take a viewpoint similar to @certik : a segfault of an intrinsic function (under the default settings) is hardly desirable for whatever reason. Imagine that we are not talking about Fortran but MATLAB or Python — or even C. How does this become OK when it comes to (a major implementation of) Fortran?

P.S.: This issue was reported to Intel more than 10 years ago (e.g., matmul segmentation fault for huge matrices - Intel Communities). Since the behavior persists, it seems that not many people regard it as a bug.

3 Likes

Does this happen because the result of MATMUL() is stored in a temporary array on the stack before being copied to the result B? Or is it because internally MATMUL is generating temporary arrays?

If it is the former, then maybe the general solution is to propose to the language committee that a subroutine interface should be added to the language for matrix-matrix products. A simplified version of the BLAS xGEMM() subroutine would be a good starting point. This would also address the problems associated with unnecessary temporary arrays for operations like MATMUL(TRANSPOSE(A),B).

2 Likes

Try the heap-arrays flag to compile:

ifort -heap-arrays test_matmul.f90 && ./a.out

OR gfortran without it:

gfortran test_matmul.f90 && ./a.out
2 Likes

Thank you very much for raising this. Hope this will help others who encounter the same bug, but it does not solve the problem itself.

I am not sure. This is what I obtained by valgrin ./a.out:

 Initialize
 Loop
 Succeed
 MATMUL
==324261== Warning: client switching stacks?  SP change: 0x1ffeffb4e0 --> 0x1ffd176ce0
==324261==          to suppress, use: --max-stackframe=32000000 or greater
==324261== Invalid write of size 8
==324261==    at 0x4050E9: MAIN__ (test_matmul.f90:47)
==324261==  Address 0x1ffd176ce0 is on thread 1's stack
==324261== 
==324261== Warning: ignored attempt to set SIGKILL handler in sigaction();
==324261==          the SIGKILL signal is uncatchable
==324261== Warning: ignored attempt to set SIGSTOP handler in sigaction();
==324261==          the SIGSTOP signal is uncatchable
==324261== Warning: ignored attempt to set SIGRT32 handler in sigaction();
==324261==          the SIGRT32 signal is used internally by Valgrind
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
a.out              000000000040601A  Unknown               Unknown  Unknown
libpthread-2.31.s  00000000049AF420  Unknown               Unknown  Unknown
a.out              00000000004050E9  Unknown               Unknown  Unknown
a.out              0000000000403822  Unknown               Unknown  Unknown
libc-2.31.so       00000000049E2083  __libc_start_main     Unknown  Unknown
a.out              000000000040372E  Unknown               Unknown  Unknown

and by gdb ./a.out:

Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
 Initialize
 Loop
 Succeed
 MATMUL

Program received signal SIGSEGV, Segmentation fault.
0x00000000004050e9 in test_matmul () at test_matmul.f90:47
47	B = matmul(A, A)
(gdb) bt
#0  0x00000000004050e9 in test_matmul () at test_matmul.f90:47
#1  0x0000000000403822 in main ()
#2  0x00007ffff7c85083 in __libc_start_main (main=0x4037f0 <main>, argc=1, argv=0x7fffffff9968, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffff9958)
    at ../csu/libc-start.c:308
#3  0x000000000040372e in _start ()
(gdb) 

It seems to me that the segfault occurred when writing the result of matmul to B, but I am not sure.
(BTW, why doesn’t ifort print the call stack of the error even when -g is specified?)

This sounds truly interesting. It is something I hope for. How could we make it happen?

We need an extendable stack, much like the heap.
I will often set the stack to 500 MBytes (on Windows), but even with that, you still have to be careful. For OpenMP, I find this essential.
It should not be this way.

With 64-bit programs, the stack design is not fit for purpose and should be changed.

I do not know what other OS provide. Do any provide an extendable stack ?

2 Likes

I agree. However, pushing some change on the OS level seems even more difficult than on the language level as suggested by @RonShepard , which sounds like a reasonable proposal no matter what happens on the OS side.

Has someone proven that the Fortran specification requires allocating temporaries for TRANSPOSE and MATMUL, and that it is not just poor quality compilers doing this? I see no reason why compilers can’t use “as if” semantics and eliminate temporaries that have no detectable impact on program execution. If I write B += transpose(A), the compiler can optimize this statement as a whole to loops that require no temporary.

2 Likes

Technically, I agree with the approach you specify.

I do not want to judge the quality of a compiler by only the performance of matmul. Indeed, all the compilers I have tested are respectable and high-quality ones. It is just a bit surprising to me that such a failure is permitted by the high-quality compiler(s) for such a long time even after bug reports, yet not many people seem to care.

1 Like

I have used c_loc()/loc() to look at addresses for this situation, and I have seen cases with intent(in) dummy arguments and transpose(A) actual arguments where it was clear that the underlying data had the same addresses. This was with user-written code, not with the intrinsic matmul(), so it depended on optimization level and so on. Ironically, later versions of the same compilers sometimes did not do this optimization. I’ve wondered why it was dropped. The intrinsic matmul(), which can also be inlined directly into the code and then optimized accordingly, would have even more opportunity to recognize and optimize this case.

I have advocated adding some operations like SHALLOW_TRANSPOSE() to the language that would expose this capability to the programmer. This is a pointer kind of operation that would just fiddle around with the metadata and would not actually copy the underlying data. I think the result could be modifiable in the language, so it would work for any intent() in addition to being available, for example, within an expression. After a pointer assignment such as

AT => SHALLOW_TRANSPOSE(A)

then A(i,j) would reference the same memory as AT(j,i), and AT could be used as an actual argument in a subroutine call and associated with an assumed shape dummy array argument of any intent. The operation would always be O(1) effort rather than O(N**2) effort and with no associated memory allocation.

Something ike matmul(AT,B) or matmul(SHALLOW_TRANSPOSE(A),B) would then work without the explicit transpose operation or without the prior memory allocation and copy operation that might otherwise be required.

It could also be used in mixed-language situations where the rows and columns of the two languages could be brought into agreement.

1 Like