Fortran: Array Language (video)

This video does a great job of explaining in <6 minutes that Modern Fortran, like APL, is an array language. It covers implied do-loops (list comprehension), shape, reshape, pack, where, pure elemental functions, whole array operations, sum, and do concurrent, and that the aliasing rules of Fortran are conducive to optimization. The author, Asher Mancinelli, has a number of videos on “array programming languages” and features Fortran among 4 languages in another video, 1 Problem, 4 Languages, 2 Paradigms. Unlike some other video authors, Mancinelli comes across as calm and not histrionic. His site says he “work[s] on the NVHPC C, C++ and Fortran compilers at NVIDIA.”

16 Likes

Array Language — a very precise and good name.

I do believe that one of the core strengths of Fortran is its native support for matrices and matrix/vector operations. This strength is super important and advantageous for a computing-oriented language, because most numerical algorithms are essentially a sequence of matrix/vector operations.

However, unfortunately, the Fortran community seems to be indifferent to the performance of Fortran’s intrinsic matrix/vector operations and even their occasional failures under default compiler settings. This is bizarre. See the following for example.

It seems quite “fashionable” in Fortran to code in loops instead of matrix/vector operations, while other computing-oriented languages like MATLAB/Python are all promoting the latter. Surely, this is partially because these scripting languages are inefficient with loops, but the good news is that their matrix/vector operations are as efficient as Fortran loops (thanks to the underlying C / Fortran libraries). The question is, why couldn’t / shouldn’t Fortran do the same? It would make the code much more understandable, maintainable, and expressive, and make the language much more accessible and usable to common users.
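To make the contrast concrete, here is a toy sketch (hypothetical arrays, not taken from any real code) of the same computation written both ways:

program loop_vs_array
  implicit none
  integer, parameter :: n = 5
  real :: b(n), y_loop(n), y_array(n)
  integer :: i

  b = [1.0, 2.0, 3.0, 4.0, 5.0]

  ! Loop style
  do i = 1, n
     y_loop(i) = 2.0*b(i) + 1.0
  end do

  ! Whole-array style: the same computation in one statement
  y_array = 2.0*b + 1.0

  print *, maxval(abs(y_loop - y_array))   ! expect 0.0
end program loop_vs_array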

PRIMA uses matrix/vector operations as much as possible. No loop appears unless it is the only solution. When proposing to integrate PRIMA into SciPy, a comment I received from the SciPy community about the source code of PRIMA was as follows.

The syntax is clear, it uses vectorized assignments and Python-style (“modern”) slices when passing subarrays to a subroutine and so on

I am glad that my choice and effort are appreciated.

Those languages are not inefficient because of loops per se, but because they are interpreted. An operation as simple as accessing an element a[i,j] may involve calling a (polymorphic) method (__getitem__) on an instance of a polymorphic type (which can potentially change in the middle of the loop). This introduces a lot of overhead. In contrast, an expression like a @ b will ideally be dispatched to a function from a C library almost directly.

For example, consider the numpy.vdot(a,b) function, which forms the dot product of two complex vectors. Underneath the Python facade are clunky, manually “unrolled” loops in C:

void
CDOUBLE_vdot(char *ip1, npy_intp is1, char *ip2, npy_intp is2,
             char *op, npy_intp n, void *NPY_UNUSED(ignore))
{
// ...
    {
        double sumr = (double)0.0;
        double sumi = (double)0.0;
        npy_intp i;

        for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
            const double ip1r = ((double *)ip1)[0];
            const double ip1i = ((double *)ip1)[1];
            const double ip2r = ((double *)ip2)[0];
            const double ip2i = ((double *)ip2)[1];

            sumr += ip1r * ip2r + ip1i * ip2i;
            sumi += ip1r * ip2i - ip1i * ip2r;
        }
        ((double *)op)[0] = sumr;
        ((double *)op)[1] = sumi;
    }
}

Note the convention of prefixing input arguments with i (e.g. ip1, is1), and output arguments with o.

We can implement the same function in Fortran:

program demo_vdot

use, intrinsic :: iso_c_binding, only: c_int64_t, c_double, c_ptr
implicit none

integer, parameter :: npy_intp = c_int64_t    
integer, parameter :: dp = c_double
complex(dp), allocatable :: a(:), b(:)
complex(dp) :: zdotc, c
external :: zdotc
type(c_ptr) :: c_dum

a = [(1,2),(3,4)]
b = [(5,6),(7,8)]

! Fortran
print *, dot_product(a,b)

! "NumPy"
call cdouble_vdot(a,1_npy_intp,b,1_npy_intp,c,2_npy_intp,c_dum)
print *, c

! BLAS (MKL)
print *, zdotc(2,a,1,b,1)

contains

    ! Fortran analogue of the NumPy C kernel above; here the strides
    ! is1, is2 are counted in elements rather than bytes.
    subroutine cdouble_vdot(ip1,is1,ip2,is2,op,n,unused) bind(c)
       integer(npy_intp), value :: is1, is2, n
       complex(c_double), intent(in) :: ip1(is1*n), ip2(is2*n)
       complex(c_double), intent(out) :: op
       type(c_ptr), value :: unused
       op = dot_product(ip1(1::is1),ip2(1::is2))
    end subroutine

end program
$ ifort demo_vdot.f90 -qmkl
$ ./a.out
 (70.0000000000000,-8.00000000000000)
 (70.0000000000000,-8.00000000000000)
 (70.0000000000000,-8.00000000000000)

Big bang for the buck.

2 Likes

Right. This is the point. Then my question is: why don’t we promote the same matrix/vector operations in Fortran, instead of still preferring to write loops?

1 Like

Last year I updated some work and showed a comparison with loops.

whole array programming demonstration

2 Likes

Compilers will unexpectedly overflow the stack when matrix/vector operations are used. They generate temporary arrays, so in general matrix/vector operations will be slower than do loops in Fortran. The “array language” is just a beautiful dream.

2 Likes

This is exactly my concern. How beautiful it would be if it were not only a dream! What a pity, and how strange, that it remains (to some degree) only a dream for the moment, and that the community seems to be fine with it (indeed, indifferent to it, because everyone seems to love loops so much)!

1 Like

Exactly, that’s the point. In my codes (I use Windows, not sure if this matters), whenever I use the array syntax I sooner or later run into a stack overflow. I believe I had a post on this forum testing this issue, and it turned out that to use array syntax safely you have to set the heap-arrays option to 0 and/or increase the size of the stack, but still…

1 Like

This comes at a cost too, compared to the do loop code. Heap allocation is much more expensive than stack allocation. The do loop code typically requires no extra memory allocation at all, neither stack nor heap.

I think when the semantics of array operations was set back in the fortran 8x period, the expectation was that the fortran compilers would optimize away all of the unnecessary memory allocations, and, when possible, produce the optimal code (i.e. equivalent to strip-mined, unrolled, do loops). I think with multilevel caches, simd operations, and later gpu offloading, it all got too complicated for compilers to achieve that.
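As a minimal sketch (not from any particular code) of where the semantics force this, consider a shifted copy: the array assignment below formally needs a temporary for the overlapping sections, while the hand-written loop, run backwards, needs no extra storage at all.

program shift_example
  implicit none
  integer, parameter :: n = 10
  real :: a(n)
  integer :: i

  a = [(real(i), i = 1, n)]

  ! Array syntax: the whole right-hand side is (semantically) evaluated before
  ! any element of the left-hand side is stored, so the overlapping sections
  ! generally imply a compiler-generated temporary unless it can be elided.
  a(2:n) = a(1:n-1)

  ! Hand-written do loop: iterating backwards gives the same result with no
  ! extra storage at all.
  a = [(real(i), i = 1, n)]
  do i = n, 2, -1
     a(i) = a(i-1)
  end do

  print *, a
end program shift_example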

One of the reasons that vendors like Convex opposed the fortran 8x standard was because of array notation. They felt that introducing array notation would allow their competitors to achieve maximal performance without having to invest the kind of compiler development effort in do-loop optimization that they had done. This is ironic in hindsight, since array notation itself has not fulfilled its potential even after almost four decades.

2 Likes

Thank you @RonShepard for the historical remarks. Yes, this is exactly what I used to expect and now hope Fortran could achieve with array operations.

I do not think this is trivial, but it seems that MATLAB, for example, has partially (not fully) achieved this, thanks to the C/Fortran libraries under the hood.

The achievement may not be full, but the trend is important: scripting languages like MATLAB are always encouraging array operations and discouraging loops, telling the users that “write your code in array operations, and we will do the rest”. They may not be doing it in the best way yet, but they are doing it better and better.

This is even more ironic given that scripting languages like MATLAB have partially achieved this using Fortran, while Fortran itself is still far away.

2 Likes

The scripting languages only need to have their array operations exceed the performance of their slow, interpreted, loops. The analogous loops in fortran are expected to, and usually do, achieve much better performance than an interpreted loop. In fortran, at least in the present time, I would say that array syntax is more for the benefit and convenience of the programmer than to improve performance. One would hope in the future that might not be the case, and superior performance would also be a result of the use of array syntax (as feared by those 80s-era compiler vendors).

1 Like

They may never achieve the best performance, but they are always improving, and they are encouraging array operations and discouraging loops. There is a clear trend, which I believe is important.

Totally agree. What a pity that it is the truth! Even worse, array operations seem to be discouraged in the Fortran community, where people much prefer writing loops. This trend is BAD if Fortran is ever to become an easily usable and powerful “array language”.

I also hope so. However, given the trend mentioned above, I do not think this will happen in the next 40 years, or ever; it has already been 40 years since the 80s.

To put it bluntly: the Fortran community seems to have a strange love for loops and a strange hatred for array operations, although array operations are supposed to be one of the core strengths of a modern “Formula Translator”. This goes against the modern tide, and it is becoming yet another nail in the coffin of Fortran. Sorry to say this.

1 Like

I find `pure elemental` routines disappointing. They are incredibly restrictive and I have NEVER seen them compile to anything faster than a naive loop over the same elements.

I will always welcome counter examples, I would love to be wrong about this.
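
For concreteness, this is the kind of comparison I mean (a hypothetical elemental function, not a benchmark):

program elemental_vs_loop
  implicit none
  integer, parameter :: n = 1000
  real :: x(n), y(n)
  integer :: i

  call random_number(x)

  ! Elemental invocation: one statement over the whole array
  y = smooth(x)

  ! The naive loop it is being compared against
  do i = 1, n
     y(i) = smooth(x(i))
  end do

  print *, sum(y)

contains

  pure elemental function smooth(t) result(s)
     real, intent(in) :: t
     real :: s
     s = t / (1.0 + abs(t))
  end function smooth

end program elemental_vs_loop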

Oh it’s also definitely a fault of language design that the standard permits intrinsic function implementations in compilers to segfault on valid input (stack overflow). Insane.

1 Like

Everyone is complaining about stack overflow issues with intrinsic functions. I get it, I have been in the same boat. However, this is generally a Windows-only issue due to its 1 MB default stack size. Increase the stack size in the linker to a more reasonable value and these issues typically go away.

1 Like

Not sure if it is only 1 MB, but it is difficult to understand how 64-bit Windows can provide such a small stack. It appears Microsoft has no concept of large computing, which is what a 64-bit OS enables?
Why is stack re-sizing so inaccessible?

My own programming style includes, where possible, using array syntax to replace the inner loops, so the compiler can identify where AVX instructions are appropriate. It is an easy way to bias towards efficient computation, which is what array syntax does provide; see the sketch below.
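
As a sketch of what I mean (hypothetical arrays, not from my codes), the outer loop stays and only the inner loop becomes an array section:

program inner_loop_array_syntax
  implicit none
  integer, parameter :: m = 1000, n = 1000
  real :: a(m,n), b(m,n), c(m,n), alpha
  integer :: j

  alpha = 0.5
  call random_number(a)
  call random_number(b)

  ! Outer loop kept, inner loop replaced by an array section: each statement
  ! is a contiguous, stride-1 operation the compiler can map to SIMD/AVX lanes.
  do j = 1, n
     c(:,j) = a(:,j) + alpha*b(:,j)
  end do

  print *, maxval(c)
end program inner_loop_array_syntax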

2 Likes

On modern operating systems, default stack sizes never seem to be bigger than a few megabytes.

1 Like

@Beliavsky ,

Thanks very much for this link.

I have watched the video and was very surprised by the DO CONCURRENT syntax (including REDUCE). This got me searching through the latest F2023 standard!

I wonder when this expanded syntax will be available in the compilers I use?
There is certainly a lot of new syntax that is only relevant when multiple threads are enabled, which implies that if multiple threads are not available, the syntax will not be adequately tested. (No thread count option?)
I do hope this can be the path to better support for OpenMP features in Fortran.

That’s already pretty good in my opinion, albeit only for the cases where it works. It saves you the trouble of declaring the loop counters, and you can always replace a scalar with an array if need be. The part I find the most restrictive is that you can’t use them as callback functions.

Does that really have to do with the language standard? In my eyes it’s just the (poor) choice of one vendor to put temporary array expressions on the stack. The language itself has no notion of stack or heap. The only place it alludes to the existence of a heap is in the section on runtime environments (J3/18-007r1, C.12.1), where it says that:

heap allocation/deallocation (e.g., (DE)ALLOCATE in a Fortran subprogram and malloc/free in a C function) can be performed without interference

I think the stack problems are due to the lazy OS not allowing an extendible stack.
I am just as lazy: only in a few well-developed projects do I define a 500 MByte stack, but in most other smaller or temporary projects I simply put large arrays in COMMON or MODULEs and avoid temporary arrays. If all else fails, I copy the script to change the stack (which is not well documented?).

In a 64-bit OS, large stack sizes are just virtual memory addresses, although you can compromise on near and far addressing.

One positive with stack design is in OpenMP, where each thread has a separate stack. (Again, defining large stack sizes is poorly documented and varies considerably between compilers.)

As @zaikunzhang has shown before, the stack problem occurs on Linux too, even for expressions which seemingly don’t involve temporaries:

allocate (A(n, n), B(n, n))
B = matmul(A, A)                ! <-- segfault with ifort and ifx if n is big

The segfault remains even when using the option -standard-realloc-lhs, or replacing B with B(:,:). It appears the problem is a temporary created for the expression on the left-hand side.
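
One way to sidestep it (a sketch, assuming double-precision arrays and a linked BLAS such as MKL, e.g. via -qmkl as above) is to call gemm directly, which writes straight into B and so needs no large compiler-generated temporary:

program gemm_instead_of_matmul
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2000
  real(dp), allocatable :: A(:,:), B(:,:)

  allocate (A(n, n), B(n, n))
  call random_number(A)

  ! Same result as B = matmul(A, A), but BLAS dgemm accumulates directly
  ! into B, so no large temporary array is required.
  call dgemm('N', 'N', n, n, n, 1.0_dp, A, n, A, n, 0.0_dp, B, n)

  print *, B(1,1), B(n,n)
end program gemm_instead_of_matmul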


Not all problems can be cast nicely in terms of dense linear algebra,

$$y = A x, \qquad C = A B$$

The matrix A could be sparse, or we may even deal with more general problems of the form,

$$y = \mathrm{Op}(x), \qquad C = \mathrm{Op}(B)$$

where the Op(*) transformation potentially requires more elaborate (or indirect) data structures. At a high level, we can always build abstractions.

The Fortran community also has an affinity for parallel programming. For this, they depend on OpenMP’s loop-related directives. The same goes for the OpenACC directive !$acc parallel loop.

Here’s an example taken from a PDE solver:

integer, parameter :: m = 500, n = 500, nsteps = 1000
integer :: istep, old, mid, new

real, dimension(m,n,3) :: u,v,h
real, dimension(m,n) :: f
real :: dt, dx, dy, g

!---begin time marching loop
do istep = 1, nsteps
    old = mod(istep+2, 3) + 1
    mid = mod(istep+3, 3) + 1
    new = mod(istep+1, 3) + 1

! ------ interior calculations ------
    u(:,:,new) = u(:,:,old) - (2*dt) * ( u(:,:,mid)*d_dx(u(:,:,mid)) + v(:,:,mid)*d_dy(u(:,:,mid)) - f(:,:)*v(:,:,mid) + g*d_dx(h(:,:,mid)) )
    v(:,:,new) = v(:,:,old) - (2*dt) * ( u(:,:,mid)*d_dx(v(:,:,mid)) + v(:,:,mid)*d_dy(v(:,:,mid)) + f(:,:)*u(:,:,mid) + g*d_dy(h(:,:,mid)) )
    h(:,:,new) = h(:,:,old) - (2*dt) * ( u(:,:,mid)*d_dx(h(:,:,mid)) + v(:,:,mid)*d_dy(h(:,:,mid)) + h(:,:,mid)*( d_dx(u(:,:,mid)) + d_dy(v(:,:,mid)) ) )
!------ boundary calculations ------
    u(1,:,new) = 0
    u(m,:,new) = 0
    v(:,1,new) = 0
    h(:,1,new) = 0
    v(:,n,new) = 0
    h(:,n,new) = 0
end do

contains

! x-derivative: centered differences inside, one-sided at the boundaries
! (the grid spacing dx is host-associated)
function d_dx(array)
    real, intent(in) :: array(:,:)
    real :: c, d_dx(size(array,dim=1),size(array,dim=2))
    integer :: i, j
    i = size(array,dim=1)
    j = size(array,dim=2)
    c = 1.0d0 / dx
    d_dx(2:i-1,1:j) = c*(array(3:i,1:j) - array(1:i-2,1:j))
    d_dx(1,1:j) = 2*c*(array(2,1:j) - array(1,1:j))
    d_dx(i,1:j) = 2*c*(array(i,1:j) - array(i-1,1:j))
end function

! ... repeat for d_dy, and d_dz that use different discretizations

I guess it’s a matter of taste whether you like this type of syntax or not. Personally, I prefer to use loops and indexes for such computations, as it matches the notation for stencils used on paper. In principle it’s possible to parallelize this computation using the OpenMP workshare construct, but the results tend to be worse compared to the loop directives.
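
For reference, the two parallelization styles being compared look roughly like this (a sketch on a simple stencil update, not the solver above):

program workshare_vs_loop
  implicit none
  integer, parameter :: m = 512, n = 512
  real :: u(m,n), unew(m,n), nu
  integer :: i, j

  nu = 0.1
  call random_number(u)
  unew = 0.0

  ! Loop directive: iterations of the outer loop are divided among threads.
  !$omp parallel do private(i)
  do j = 1, n
     do i = 2, m-1
        unew(i,j) = u(i,j) + nu*(u(i+1,j) - 2.0*u(i,j) + u(i-1,j))
     end do
  end do
  !$omp end parallel do

  ! Workshare construct: the array assignment itself is divided among threads.
  !$omp parallel workshare
  unew(2:m-1,:) = u(2:m-1,:) + nu*(u(3:m,:) - 2.0*u(2:m-1,:) + u(1:m-2,:))
  !$omp end parallel workshare

  print *, maxval(unew)
end program workshare_vs_loop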

Here’s another example taken from calculating the spectral derivative of a 2-d function:

  ! Loop version
  do j = 1, n
    do i = 1, ny-1
      jj = j - 1
      if (j > n/2) jj = n - j + 1
      y(i,j) = -((i-1)**2 + jj**2) * (2.0*pi)**2 * y(i,j)
    end do
  end do

! F2023 version
do concurrent ( j=1:n, i=1:ny-1 ) local(ii, jj)
    ii = i - 1
    jj = ( j > n/2 ? n - j + 1 : j - 1 )
    y(i,j) = -(ii**2 + jj**2) * (2.0*pi)**2 * y(i,j)
end do

! Array version
id = spread([(i,i=1,ny-1)],dim=2,ncopies=n) - 1
jd = spread([(merge(n-j+2,j,j>n/2),j=1,n)],dim=1,ncopies=ny-1) - 1
y(1:ny-1,:) = -(id**2 + jd**2) * (2.0*pi)**2 * y(1:ny-1,:)    ! will the temporary be elided?

The array version is fewer lines, and we could even inline it (although that would again destroy readability). However, the performance and storage implications are very different. The array version creates much more memory traffic and performance suffers. Even if we cache the index arrays for future computations, we’d still be accessing more memory than is really needed.

I’m not familiar enough with NumPy to say if this would be the equivalent,

ny = n // 2 + 1
x = np.empty((ny, n), order='F')
# ... initialize x

# Forward FFT
y = f_fft2d(x)

# Spectral derivative
i, j = np.ogrid[:ny-1,:n]
j = np.where(j > n/2, n - j, j)
y[:ny-1,:n] *= -(i**2 + j**2) * (2*np.pi)**2     # will broadcasting do the magic?

# Inverse FFT
dxx = f_ifft2d(y)

The notation is nice and succinct, but it looks like it would still involve the overhead of creating and multiplying the broadcast arrays? How would I parallelize it? Replace NumPy with CuPy?

Perhaps you are trying to advocate that we need to operate at a higher level of abstraction than just raw loops? This would include adopting code generators for PDE solvers (e.g. PyFR, Dedalus, FireDrake, …), and, in the domain of linear algebra, abstract representations of tensors (including vectors and matrices) and just-in-time compilation concepts like TACO, the Tensor Algebra Compiler, or the MLIR Sparsifier, to name just a few examples. These types of things tend to be easier to use in dynamic languages, and I can agree there is a threat that Fortran is being left behind.

2 Likes