I had a look at the Fortran 90 version of the Numerical Recipes textbook. They write that array syntax in Fortran 90 can enable parallelism, and give some examples. I was a bit sceptical since in my experiance replacing loops with array syntax expressions is most of the time bad for performance. Therefore I came up with a test. Here is the example from the book:
We have a vector x with n elements and the result is another vector w with the same number of elements.
Here is the first implementation with loops only:
do i=1,n
w(i) = 0.0_dp
do j=1,n
w(i) = w(i) + abs(x(i)+x(j))
enddo
enddo
and here is the vectorized, Matlab-style implementation where all loops have been removed:
! a is a matrix with dimensions (n,n)
allocate(a(n,n))
a = spread(x,dim=2,ncopies=n)+spread(x,dim=1,ncopies=n)
! Sum over rows to get vector w (I could sum over columns as well)
w = sum(abs(a),dim=1)
To test performance, I wrote this simple program:
program main
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, allocatable :: n_vec(:)
integer :: n, n_c, iostat
real(dp), allocatable :: x(:), w1(:), w2(:)
real(dp) :: t1_start, t1_end, t2_start, t2_end
real(dp), parameter :: atol = 1.0e-12_dp
real(dp), parameter :: rtol = 1.0e-12_dp
logical :: ok
!-----------------------------------------
! Values of n to test
!-----------------------------------------
n_vec = [100, 1000, 10000, 30000 ]
!-----------------------------------------
! Print table header
!-----------------------------------------
write(*,*) "---------------------------------------------------------------------"
write(*,*) " n | sub_loops time (s) | sub_vec time (s) | tolerance check"
write(*,*) "---------------------------------------------------------------------"
!-----------------------------------------
! Loop over n values
!-----------------------------------------
do n_c = 1, size(n_vec)
n = n_vec(n_c)
allocate(x(n),w1(n),w2(n),source=0.0_dp,stat=iostat)
if (iostat /= 0) then
write(*,*) "Error allocating arrays of size ", n
stop
endif
call RANDOM_NUMBER(x)
! Run sub_loops
call cpu_time(t1_start)
call sub_loops(n, x, w1)
call cpu_time(t1_end)
! Run sub_vec
call cpu_time(t2_start)
call sub_vec(n, x, w2)
call cpu_time(t2_end)
! Tolerance-based comparison
ok = maxval(abs(w1 - w2) / (atol + rtol * abs(w1))) <= 1.0_dp
! Print row
if (ok) then
write(*,'(I10,3X,F12.6,3X,F12.6,5X,A)') n, t1_end - t1_start, t2_end - t2_start, "OK"
else
write(*,'(I10,3X,F12.6,3X,F12.6,5X,A)') n, t1_end - t1_start, t2_end - t2_start, "FAIL"
endif
deallocate(x, w1, w2)
enddo
contains
!--------------------------------------------------
! First method: Loops
!--------------------------------------------------
subroutine sub_loops(n, x, w)
implicit none
integer, intent(in) :: n
real(dp), intent(in) :: x(n)
real(dp), intent(out) :: w(n)
integer :: i, j
do i=1,n
w(i) = 0.0_dp
do j=1,n
w(i) = w(i) + abs(x(i)+x(j))
enddo
enddo
end subroutine sub_loops
!--------------------------------------------------
! Second method: Array syntax
!--------------------------------------------------
subroutine sub_vec(n, x, w)
implicit none
integer, intent(in) :: n
real(dp), intent(in) :: x(n)
real(dp), intent(out) :: w(n)
real(dp), allocatable :: a(:,:)
integer :: i
! a is a matrix with dimensions (n,n)
allocate(a(n,n))
a = spread(x,dim=2,ncopies=n)+spread(x,dim=1,ncopies=n)
! Sum over rows to get vector w (I could sum over columns as well)
w = sum(abs(a),dim=1)
!do concurrent (i=1:n)
! w(i) = sum( abs( x(i) + x ) )
!enddo
end subroutine sub_vec
end program main
I compiled and ran this on Windows with ifx and got a stackoverflow problem:
So not good… Then I set heap arrays 0 and got these results:
This shows that loops are much faster than array syntax, as I expected. So I would like to check with the community if my example makes sense and my conclusion is warranted. Could it be that the massive slowdown of the array version is due to allocating all arrays on the heap? So this would not mean that array syntax is bad per se…
Thanks for any feedback on this!


