As @zaikunzhang has shown before, the stack problem occurs on Linux too, even for expressions which seemingly don’t involve temporaries:
allocate (A(n, n), B(n, n))
B = matmul(A, A) ! <-- segfault with ifort and ifx if n is big
The segfault remains even when using the option -standard-realloc-lhs
, or replacing B
with B(:,:)
. It appears the problem is a temporary created for the expression on the left-hand side.
Not all problems can be fitted nicely into terms of dense linear algebra,
The matrix A could be sparse, or we may even deal with more general problems of the form,
where the Op(*) transformation potentially requires more elaborate (or indirect) data structures. At a high-level, we can always build abstractions.
The Fortran community also has an affinity for parallel programming. To do so they depend on OpenMP’s loop-related directives. The same goes for the OpenACC directive !$acc parallel loop
.
Here’s an example taken from a PDE solver:
integer, parameter :: m = 500, n = 500
real, dimension(m,n,3) :: u,v,h
real, dimension(m,n) :: f
!---begin time marching loop
do step 1,nsteps
old = mod(istep+2, 3) +1
mid = mod(istep+3, 3) +1
new = mod(istep+1, 3) + 1
! ------ interior calculations ------
u(:,:,new) = u(:,:,old) - (2*dt) * ( u(:,:,mid)*d_dx(u(:,:,mid)) + v(:,:,mid)*d_dy(u(:,:,mid)) - f(:,:)*v(:,:,mid) + g*d_dx(h (:,:,mid)) )
v(:,:,new) = v(:,:,old) - (2*dt) * u(:,:,mid)*d_dx(v(:,:,mid)) + v(:,:,mid)*d_dy(v(:,:,mid)) + f(:,:)*u(:,:,mid) + g*d_dy(h(:,:,mid)))
h(:,:,new) = h(:,:,old) - (2*dt) * u(:,:,mid)*d_dx(h(:,:,mid)) + v(:,:,mid)*d_dy(h(:,:,mid)) + h(:,:,mid)*( d_dx(u(:,:,mid)) + d_dy(v(:,:,mid)) )
!------ boundary calculations ------
u(1,:,new) = 0
u(m,:,new) = 0
v(:,1,new) = 0
h(:,1,new) = 0
v(:,n,new) = 0
h(:,n,new) = 0
end do
contains
function d_dx(array)
real, intent(in) :: array(:,:)
real :: dx, c, d_dx(size(array,dim=1),size(array,dim=2))
integer :: i, j
i = size(array,dim=1)
j = size(array,dim=2)
c = 1.0d0 / dx
d_dx(2:i-1,1:j) = c*(array(3:i,1:j) - array(1:i-2,1:j))
d_dx(1,1:j) = 2*c*(array(2,1:j) - array(1,1:j))
d_dx(i,1:j) = 2*c*(array(i,1:j) - array(i-1,1:j)
end function
! ... repeat for d_dy, and d_dz that use different discretizations
I guess it’s a matter of taste if you like this type of syntax or not. Personally, I prefer to use loops and indexes for such computations as it matches the notation for stencils used on paper. In principle it’s possible to parallelize this computation using the OpenMP workshare
construct, but the results tend to be worse compared to the loop directives.
Here’s another example taken from calculating the spectral derivative of a 2-d function:
! Loop version
do j = 1, n
do i = 1, ny-1
jj = j - 1
if (j > n/2) jj = n - j + 1
y(i,j) = -((i-1)**2 + jj**2) * (2.0*pi)**2 * y(i,j)
end do
end do
! F2023 version
do concurrent ( j=1:n, i=1:ny-1)
ii = i - 1
jj = (j > n/2) ? (n - j + 1) : (j - 1)
y(i,j) = -(ii**2 + jj**2) * (2.0*pi)**2 * y(i,j)
end do
! Array version
id = spread([(i,i=1,ny)],dim=2,ncopies=n) - 1
jd = spread([(merge(n-j+2,j,j>n/2),j=1,n)],dim=1,ncopies=ny-1) - 1
y = -(id**2 + jd**2) * (2.0*pi)**2 * y ! will the temporary be elided?
The array version is less lines, and we could even inline it (although it would again destroy readability). However the performance and storage implications are very different. The array version creates much more memory traffic and performance suffers. Even if we cache the index arrays for future computations, we’d be still be accessing more memory than is really needed.
I’m not familiar enough with NumPy, to say if this would be the equivalent,
ny = n // 2 + 1
x = np.array((ny, n),order='F')
# ... initialize x
# Forward FFT
y = f_fft2d(x)
# Spectral derivative
i, j = np.ogrid[:ny-1,:n]
j = np.where(j > n/2, n - j, j)
y[:ny-1,:n] *= (i**2 + j**2) * (2*pi)**2 # will broadcasting do the magic?
# Inverse FFT
dxx = f_ifft2d(y)
The notation is nice and succinct, but it looks like it would still involved the overhead of creating and multiplying the broadcast arrays? How would I parallelize it? Replace NumPy with CuPy?
Perhaps you are trying to advocate that we need to operate at a higher-level of abstraction than just raw-loops? This would include adopting code generators for PDE solvers (e.g. PyFR, Dedalus, FireDrake, …), and in the domain of linear algebra an abstract representations of tensors (including vectors and matrices) and just-in-time compilation concepts like TACO, The Tensor Algebra Compiler, or the MLIR Sparsifer do, to name just a few examples. These types of things tend to be easier to use in dynamic languages and I can agree there is a threat that Fortran is being left behind.