Numerical recipes on array syntax

I had a look at the Fortran 90 version of the Numerical Recipes textbook. They write that array syntax in Fortran 90 can enable parallelism, and give some examples. I was a bit sceptical since in my experiance replacing loops with array syntax expressions is most of the time bad for performance. Therefore I came up with a test. Here is the example from the book:

We have a vector x with n elements and the result is another vector w with the same number of elements.

Here is the first implementation with loops only:

do i=1,n
    w(i) = 0.0_dp
    do j=1,n
        w(i) = w(i) + abs(x(i)+x(j))
    enddo
enddo

and here is the vectorized, Matlab-style implementation where all loops have been removed:

! a is a matrix with dimensions (n,n)
allocate(a(n,n))
a = spread(x,dim=2,ncopies=n)+spread(x,dim=1,ncopies=n)
! Sum over rows to get vector w (I could sum over columns as well)
w = sum(abs(a),dim=1)

To test performance, I wrote this simple program:

program main
   implicit none
   integer, parameter :: dp = kind(1.0d0)

   integer, allocatable :: n_vec(:)
   integer :: n, n_c, iostat
   real(dp), allocatable :: x(:), w1(:), w2(:)
   real(dp) :: t1_start, t1_end, t2_start, t2_end
   real(dp), parameter :: atol = 1.0e-12_dp
   real(dp), parameter :: rtol = 1.0e-12_dp
   logical :: ok

   !-----------------------------------------
   ! Values of n to test
   !-----------------------------------------
   n_vec = [100, 1000, 10000, 30000 ]

   !-----------------------------------------
   ! Print table header
   !-----------------------------------------
   write(*,*) "---------------------------------------------------------------------"
   write(*,*) "     n        |   sub_loops time (s)   |   sub_vec time (s)   | tolerance check"
   write(*,*) "---------------------------------------------------------------------"

   !-----------------------------------------
   ! Loop over n values
   !-----------------------------------------
   do n_c = 1, size(n_vec)

      n = n_vec(n_c)

      allocate(x(n),w1(n),w2(n),source=0.0_dp,stat=iostat)
      if (iostat /= 0) then
          write(*,*) "Error allocating arrays of size ", n
          stop  
      endif
      
      call RANDOM_NUMBER(x)

      ! Run sub_loops
      call cpu_time(t1_start)
      call sub_loops(n, x, w1)
      call cpu_time(t1_end)

      ! Run sub_vec
      call cpu_time(t2_start)
      call sub_vec(n, x, w2)
      call cpu_time(t2_end)

      ! Tolerance-based comparison
      ok = maxval(abs(w1 - w2) / (atol + rtol * abs(w1))) <= 1.0_dp

      ! Print row
      if (ok) then
          write(*,'(I10,3X,F12.6,3X,F12.6,5X,A)') n, t1_end - t1_start, t2_end - t2_start, "OK"
      else
          write(*,'(I10,3X,F12.6,3X,F12.6,5X,A)') n, t1_end - t1_start, t2_end - t2_start, "FAIL"
      endif


      deallocate(x, w1, w2)

   enddo

contains

   !--------------------------------------------------
   ! First method: Loops
   !--------------------------------------------------
   subroutine sub_loops(n, x, w)
      implicit none
      integer, intent(in)   :: n
      real(dp), intent(in)  :: x(n)
      real(dp), intent(out) :: w(n)
      integer :: i, j
  
    do i=1,n
        w(i) = 0.0_dp
        do j=1,n
            w(i) = w(i) + abs(x(i)+x(j))
        enddo
    enddo

   end subroutine sub_loops

   !--------------------------------------------------
   ! Second method: Array syntax
   !--------------------------------------------------
   subroutine sub_vec(n, x, w)
      implicit none
      integer, intent(in)   :: n
      real(dp), intent(in)  :: x(n)
      real(dp), intent(out) :: w(n)
      real(dp), allocatable :: a(:,:)
      integer :: i

    ! a is a matrix with dimensions (n,n)
    allocate(a(n,n))
    a = spread(x,dim=2,ncopies=n)+spread(x,dim=1,ncopies=n)
    ! Sum over rows to get vector w (I could sum over columns as well)
    w = sum(abs(a),dim=1)
    !do concurrent (i=1:n)
    !    w(i) = sum( abs( x(i) + x ) )
    !enddo 
      
   end subroutine sub_vec

end program main 

I compiled and ran this on Windows with ifx and got a stackoverflow problem:

So not good… Then I set heap arrays 0 and got these results:

This shows that loops are much faster than array syntax, as I expected. So I would like to check with the community if my example makes sense and my conclusion is warranted. Could it be that the massive slowdown of the array version is due to allocating all arrays on the heap? So this would not mean that array syntax is bad per se…

Thanks for any feedback on this!

You are comparing different things in your benchmark.

You should have compared the purely loop-based algorithm, with the algorithm that the Numerical Recipes authors gave “for a small-scale parallel machine” (i.e. their first alternative to coding the loops). Then you would have seen that the cost is identical.

The algorithm with the spread intrinsic is an O(n²), instead of an O(n), algorithm (memory-requirements wise), and the authors actually point this out in their text. (Hence, allocation of the temporary array a that it requires comes at a cost.)

1 Like

I think this would be correct. Try moving the allocation outside the n_c loop, and then use the
subset x(1:n), w1(1:n), and w2(1:n) within the loop. Memory allocation is an expensive operation.

This may not make the array code faster, but it should at least make it more equal to the loop code. As far as array syntax allowing parallelism and other compiler optimization magic, this was a fairly common belief in the late 1980s and 1990s, but it never really worked that way in practice. I don’t know if this was because of over optimism about compiler and hardware technology at that time, or if it was because of a basic design flaw in fortran array syntax, or what.

1 Like

Thanks @RonShepard

I rewrote the code as you suggested (also moved the two subtoutines to a module for clarity) and the timings are similar.

module m 
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  
  contains 
    
  !--------------------------------------------------
   ! First method: Loops
   !--------------------------------------------------
   subroutine sub_loops(n, x, w)
      implicit none
      integer, intent(in)   :: n
      real(dp), intent(in)  :: x(n)
      real(dp), intent(out) :: w(n)
      integer :: i, j
  
    do i=1,n
        w(i) = 0.0_dp
        do j=1,n
            w(i) = w(i) + abs(x(i)+x(j))
        enddo
    enddo

   end subroutine sub_loops

   !--------------------------------------------------
   ! Second method: Array syntax
   !--------------------------------------------------
   subroutine sub_vec(n, x, w)
      implicit none
      integer, intent(in)   :: n
      real(dp), intent(in)  :: x(n)
      real(dp), intent(out) :: w(n)
      real(dp), allocatable :: a(:,:)
      integer :: i

    ! a is a matrix with dimensions (n,n)
    allocate(a(n,n))
    a = spread(x,dim=2,ncopies=n)+spread(x,dim=1,ncopies=n)
    ! Sum over rows to get vector w (I could sum over columns as well)
    w = sum(abs(a),dim=1)
    !do concurrent (i=1:n)
    !    w(i) = sum( abs( x(i) + x ) )
    !enddo 
      
   end subroutine sub_vec
  
  
end module m

    
program main
    use m, only: dp, sub_loops, sub_vec
   implicit none

   integer, allocatable :: n_vec(:)
   integer :: n, n_c, iostat
   real(dp), allocatable :: x(:), w1(:), w2(:)
   real(dp) :: t1_start, t1_end, t2_start, t2_end
   real(dp), parameter :: atol = 1.0e-12_dp
   real(dp), parameter :: rtol = 1.0e-12_dp
   logical :: ok

   !-----------------------------------------
   ! Values of n to test
   !-----------------------------------------
   n_vec = [100, 1000, 10000, 30000 ]
   
   allocate(x(n_vec(4)),w1(n_vec(4)),w2(n_vec(4)),source=0.0_dp,stat=iostat)
   if (iostat /= 0) then
       write(*,*) "Error allocating arrays of size ", n_vec(4)
       stop  
   endif
   
   call RANDOM_NUMBER(x)

   !-----------------------------------------
   ! Print table header
   !-----------------------------------------
   write(*,*) "---------------------------------------------------------------------"
   write(*,*) "     n        |   sub_loops time (s)   |   sub_vec time (s)   | tolerance check"
   write(*,*) "---------------------------------------------------------------------"

   !-----------------------------------------
   ! Loop over n values
   !-----------------------------------------
   do n_c = 1, size(n_vec)

      n = n_vec(n_c)

      ! Run sub_loops
      call cpu_time(t1_start)
      call sub_loops(n, x(1:n), w1(1:n))
      call cpu_time(t1_end)

      ! Run sub_vec
      call cpu_time(t2_start)
      call sub_vec(n, x(1:n), w2(1:n))
      call cpu_time(t2_end)

      ! Tolerance-based comparison
      ok = maxval(abs(w1 - w2) / (atol + rtol * abs(w1))) <= 1.0_dp

      ! Print row
      if (ok) then
          write(*,'(I10,3X,F12.6,3X,F12.6,5X,A)') n, t1_end - t1_start, t2_end - t2_start, "OK"
      else
          write(*,'(I10,3X,F12.6,3X,F12.6,5X,A)') n, t1_end - t1_start, t2_end - t2_start, "FAIL"
      endif


      

   enddo
   
   deallocate(x, w1, w2)

end program main 

I’m not sure I fully understand your comment. I wanted to compare the relative performance of the (1) nested loops (coded in subroutine sub_loops) vs (2) equivalent implementation without loops (coded in subroutine sub_vec). It turns out that on my Windows machine with ifx, (2) is always slower than (1). One possible reason is that (2) is more memory-intensive than the loop-based approach.

I’m not sure if my example generalizes to other compilers and/or other operating systems. However, @RonShepard pointed out that the belief that array syntax helps parallelism was widespread in the 80-90s but does not work in practice.

Fortran 8x’s (i.e. the later Fortran 90’s) array syntax originated with hardware vendors that back then produced SIMD machines, i.e. either vector computers, or massively parallel machines with very simple processors that acted in lockstep.

The latter machines have gone out of favor. But vector processing has found its way into modern commercial processor designs, as a form of (limited) SIMD processing. This is what array syntax was designed for. Not for parallelization on (today’s predominantly) MIMD computers.

2 Likes

I don’t know what was proposed re. the standard at the time but the first (and only) form of vector syntax I saw was in the CDC Cyber 203 and 205 system compilers circa mid-1980s. While the CDC compilers claimed to “auto-vectorize” loops like Cray’s did I could never get them to work as well as Cray’s compilers. In order to reach the full vectorization potential of the 203/5 you usually had to rewrite your code using their vector syntax. This led to a lot of non-transportable code and might have been one of the motivations for introducing vector syntax into the langauge. However, if I remember correctly IF you could take full advantage of the 203/205 hardware which required long vector lengths the CDC machines were faster for some applications than the competing Cray systems at the time. As far as support for parallel systems, I think that might have been the Connection Machines but I don’t remember if they supported vector syntax. I think things like FORALL might have originated with the Connection Machines.

2 Likes

For those who are interested in (Fortran) history, here’s a link to a Connection Machine’s manual of CM Fortran.

That array syntax looks rather familiar.

3 Likes

I only know a little about modern GPU architectures but aren’t they essentially a type of “Streaming Multi-Processors” ie lots of simple processors. I think the Connection Machines were also classed as SMPs. I doubt the architectures are anywhere near alike but I wonder if Nvidia “borrowed” some of their ideas for their GPUs from CM.

2 Likes

Yes, I should have added that I was referring to CPU (not GPU) architectures.

1 Like

I’ve found this interesting article on Medium on the topic, that compares the Connection Machines’ architecture to that of modern Nvidia GPUs (which intermix SIMD and MIMD processing).

What I found fascinating is that the CM architecture seems to be about to make a comeback (closer to its original form) in the Cerebras wafer scale integration “chip”, that is further discussed in this Forbes article.

According to the latter article, Cerebras provide “high-level frameworks like PyTorch” to program this chip. (The chip is apparently not only intended for AI workloads. It has been claimed to run also fluid dynamics simulations).

I am not familiar with PyTorch, but from what I’ve seen it could be described as “yet another iteration on array syntax”. So I wouldn’t be surprised if Cerebras were to rediscover Fortran compilers and their array syntax in order to market their chip also to the HPC community in the future.

Perhaps @ivanpribec can tell us more, since the Leibniz Supercomputing Center has recently ordered a Cerebras system.

1 Like