`target` attribute seemingly affecting performance

Not necessarily; it all depends on what you are doing with that pointer afterwards. Note that an alternative is an ASSOCIATE construct:

! p does not have to be declared beforehand
associate( p => var%a%b(2)%c%d )
   ...
end associate

It looks similar but it’s definitely not, as p is not a pointer here.
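
To make the contrast concrete, here is a minimal self-contained sketch (the derived-type chain is invented for illustration): the pointer version must be declared and needs the TARGET attribute somewhere, while the associate name is simply another name for the component:

program pointer_vs_associate
implicit none
type :: leaf
   real :: d = 1.0
end type
type :: mid
   type(leaf) :: c
end type
type :: node
   type(mid) :: b(2)
end type
type :: root
   type(node) :: a
end type
type(root), target :: var   ! TARGET is required for the pointer version
real, pointer :: p

! pointer: p must be declared, and var needs TARGET (or POINTER)
p => var%a%b(2)%c%d
p = 2.0
print *, var%a%b(2)%c%d   ! prints 2.0; p writes through to the component

! associate: no declaration, no TARGET needed, and q is not a pointer
associate( q => var%a%b(2)%c%d )
   q = 3.0
end associate
print *, var%a%b(2)%c%d   ! prints 3.0
end program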

Thanks :slightly_smiling_face:

It is mostly a social problem. The person first introducing a POINTER p for a scalar intrinsic type may have checked that performance did not suffer with the available compilers. But what would a subsequent maintainer do? Would they think that there is a deep reason why POINTER was used instead of a temporary? Would they be tempted to use more POINTERs? Would they check performance effects? Soon, you might end up with code riddled with POINTERs that nobody could account for.

The matmul() intrinsic is overloaded so that it works with both 2D and 1D arrays. This allows matrix-matrix, matrix-vector, vector-matrix, and vector-vector (outer product) operations to be done through the same generic matmul() function reference. But if one of the operands in the matmul() call is really a (1:1,1:N) 2D array, then the specific function with the matrix argument is used instead of the specific function with the vector argument. Consequently, there could be a significant difference in timings with a pointer argument like pt2=>array(1:1,1:N) rather than pt1=>array(1,1:N), even though the computed results are numerically the same. Also, the matrix version of matmul() might detect that its argument is really a vector and branch at runtime to the appropriate vector-specific function. The same considerations apply to an actual argument written as array(1,1:N) versus array(1:1,1:N), or to the analogous situation within an ASSOCIATE block. So there are a lot of uncertainties in the timings and the performance with all of these possibilities.
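
To illustrate the rank resolution (a minimal sketch; the array and pointer names here are mine, chosen to match the pt1/pt2 notation above):

program matmul_rank_demo
implicit none
real, target :: array(1,5)
real :: b(5,5)
real, pointer :: pt1(:)    ! rank-1 view of the row
real, pointer :: pt2(:,:)  ! rank-2 (1 x 5) view of the same row
call random_number(array)
call random_number(b)
pt1 => array(1,1:5)
pt2 => array(1:1,1:5)
! vector * matrix: the vector-specific matmul is selected, result is rank 1
print *, shape(matmul(pt1, b))   ! prints: 5
! (1 x 5) matrix * matrix: the matrix-specific matmul is selected, result is rank 2
print *, shape(matmul(pt2, b))   ! prints: 1 5
end program

The two calls compute the same numbers; only the selected specific function and the rank of the result differ.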


I noticed this recently for the Level 2 BLAS SGER/DGER operation, which computes the rank-1 update:

A = A + \alpha x y^T

The matrix A is m \times n, the vector x is m \times 1, and the vector y is n \times 1. In Fortran there are multiple ways to perform this (assuming you don’t just use BLAS):

program rank1_update
implicit none
real :: A(3,4)
real :: x(3), y(4)
real :: alpha
integer :: i

alpha = 1.0
x = [1,2,3]
y = [1,2,3,4]

A = 0
call sger_dc(alpha,A,x,y)
print '(4(F8.3))', (A(i,:), i=1,3)
print '()'

A = 0
call sger_arr(alpha,A,x,y)
print '(4(F8.3))', (A(i,:), i=1,3)
print '()'

A = 0
call sger_mm(alpha,A,x,y)
print '(4(F8.3))', (A(i,:), i=1,3)
print '()'

A = 0
call sger_reshape(alpha,A,x,y)
print '(4(F8.3))', (A(i,:), i=1,3)

contains

    ! loops
    subroutine sger_dc(alpha,a,x,y)
    real, intent(in) :: alpha, x(:), y(:)
    real, intent(inout) :: a(:,:)
    integer :: i, j
    do concurrent(i=1:size(a,1),j=1:size(a,2))
        a(i,j) = a(i,j) + alpha * x(i) * y(j)
    end do
    end subroutine

    ! array syntax
    subroutine sger_arr(alpha,a,x,y)
    real, intent(in) :: alpha, x(:), y(:)
    real, intent(inout) :: a(:,:)
    a = a + alpha * spread(x,2,size(a,2)) &
                  * spread(y,1,size(a,1))
    end subroutine

    ! matmul
    subroutine sger_mm(alpha,a,x,y)
    real, intent(in) :: alpha
    real, intent(in), target :: x(:), y(:)
    real, intent(inout) :: a(:,:)
    real, pointer :: x2(:,:), y2T(:,:)
    x2(1:size(x),1:1) => x
    y2T(1:1,1:size(y)) => y
    a = a + alpha * matmul(x2,y2T)
    end subroutine

    ! reshape
    subroutine sger_reshape(alpha,a,x,y)
    real, intent(in) :: alpha
    real, intent(in) :: x(:), y(:)
    real, intent(inout) :: a(:,:)

    associate(x2 => reshape(x,shape=[size(x),1]), &
              y2 => reshape(y,shape=[size(y),1]))
        a = a + alpha * matmul(x2,transpose(y2))
    end associate

   end subroutine
end program

Guess which one is the most efficient in practice? The loop one; for the other variants the compilers create temporary arrays.

Edit: I was curious how the generated procedures differ in size (or how bloated they are):

$ gfortran -Wall -O3 -march=skylake -fno-inline -c rank1_update.f90 
$ nm --print-size --size-sort --radix=d rank1_update.o | grep "sger" | awk '{print $2, $4}'
0000000000000193 sger_dc.3.constprop.0.isra.0
0000000000000197 sger_mm.1.constprop.0.isra.0
0000000000000698 sger_arr.2.constprop.0.isra.0
0000000000003307 sger_reshape.0.constprop.0.isra.0

(note that gfortran performed constant propagation, hence the .constprop suffixes)

$ ifort -warn all -O3 -xSKYLAKE -inline-level=0 -c rank1_update.f90 
$ nm --print-size --size-sort --radix=d rank1_update.o | grep "IP_sger" | awk '{print $2, $4}'
0000000000000256 rank1_update_IP_sger_dc_
0000000000000272 rank1_update_IP_sger_arr_
0000000000000368 rank1_update_IP_sger_mm_
0000000000001232 rank1_update_IP_sger_reshape_

I guessed the loop one, since it’s the most straightforward for the compiler to implement. However, with good optimizations I don’t see why the “array syntax” version couldn’t be rewritten by the compiler to be equivalent: the (i,j) element of spread(x,2,n) * spread(y,1,m) is just x(i) * y(j), so the whole expression can be evaluated element-wise without any temporaries. The matmul version could also in principle be rewritten. This is something that as a user I have always wanted the compiler to do, and I want to attempt these kinds of optimizations with LFortran after we reach beta.


This rank-1 operation occurs in LU factorization (trailing panel update):

  real :: A(m,n)
  integer :: i, j, k

  ! ...
  ! k = pivot

  do j = k+1,n
      do i = k+1,m
          A(i,j) = A(i,j) - A(i,k)*A(k,j)
      end do
  end do

This loop is quite tricky for a compiler and not straightforward to vectorize. But it is possible, in principle, by using do concurrent or an array expression (see the sketch after the BLAS call below). Alternatively, the trick is to use the procedure boundary as a means of stating “no aliasing here”:

  lda = m
  call sger(m-k,n-k,-one,a(k+1,k),1,a(k,k+1),lda,a(k+1,k+1),lda)
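
For comparison, here is a sketch of the do concurrent and array-expression forms mentioned above (using the same A, m, n, and k as in the loop version; the array-expression form relies on the rank-2 k:k sections discussed earlier in the thread):

  ! do concurrent asserts that the iterations are independent
  do concurrent (j = k+1:n, i = k+1:m)
      A(i,j) = A(i,j) - A(i,k)*A(k,j)
  end do

  ! array expression: the k:k sections are rank 2, so the generic
  ! matmul resolves to the matrix-matrix specific function
  A(k+1:m,k+1:n) = A(k+1:m,k+1:n) - matmul(A(k+1:m,k:k), A(k:k,k+1:n))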