Implied-do array declarations in a subroutine

Hello all,

I have a subroutine that operates on a matrix, and outputs an augmented version of the given one:

subroutine foo(b, a)
 real, intent(out), allocatable :: b(:,:)
 real, intent(in) :: a(:,:)
 ! ...
end subroutine foo

Matrix a is squared, say nxn, and b is for example j*n x k*n, for some integers j, k. Some operation is performed on each new slice of b (multiplied by a scalar, added some function value, etc).
The way I am doing this is by accessing a by columns, a(:,i) repeateadly. Since this slice [1, 2, ...] is always the same, I thought of creating a parameter array:

integer, parameter :: slice(*) = [(i, i = 1, size(a,1))]

inside the subroutine foo, since a has a different shape each time it’s called. Is that parameter array not a valid code? I get an error when compiling (gfortran v10.2.0)

139 |         integer, parameter :: slice(*) = [(t, t = 1, size(a,1))]
    |                                                                1
Error: Invalid character in name at (1)

If I use a constant value (e.g. 10) instead of size(a,1) the code compiles. I could also pass this slice array as an argument, but that seems to be slower.

Thanks

1 Like

A named constant (i.e. declared with parameter) must be initialized with a constant expression. Constant expressions cannot contain variables (which includes dummy arguments). Thus no, that code is not conforming. I’d recommend doing something like the following.

associate(slice => [(i, i = 1, size(a,1))])
  ...
end associate

That effectively creates a “run time constant” (i.e. once set its value cannot be changed). That seems to be what you’re after; i.e. a value that is constant within the scope of a procedure, but depends on the procedure’s argument(s).

edit: That explanation is a little sloppy. In reality it’s a bit more nuanced than that. That code would be valid if a were explicit size. So specifically for size, it’s not a constant expression if its argument is assumed size, assumed shape, or deferred shape (i.e. allocatable with : for the dimensions).

Thank you for the quick reply. So passing an extra argument to the subroutine

subroutine foo(b, a, n)
! ...
integer, parameter :: slice(*) = [(i, i = 1, n)]

won’t work either. That was my second option. I will test with associate and with passing the slice array to see which is better.

What’s the reasoning behind not letting dummy arguments be used for initializing parameters?

Named constants (PARAMETERs) are compile-time values and must be given a value with a constant expression. A dummy argument is an execution-time thing.

1 Like

Just out of curiosity, is the slice array used for referencing into array a as in a(slice,i)?

No, it is for the augmented matrix b. Say slice = [1, 2, 3], and b has 2 times the size of a in the first dimension (column), then

b(slice,i) = a(:, i) * (some operation)  ! could also use a(slice,i)
slice = slice + 3 ! 3 = size of slice
b(slice,i) = a(:,i) * (some other operation)

In fact it is something similar to a kronecker product of matrices, where each submatrix of b would be a(i,j) * b, but instead of just a(i,j) it is a more complicated expression.

If it’s performance sensitive part of your code, you might prefer to use array references directly:

integer, parameter :: ns = 3
integer :: n

n = size(a,1)
b(1:n,i) = a(:,i) * ( ... )
b(1+ns:n+ns,i) = a(:,i) * ( ... )

assuming your operations follow a predictable pattern. It does however ruin legibility arguably.

In the version with the slice array of indexes created at runtime there could be extra referencing/de-referencing instructions involved, instead of simple pointer arithmetic, depending on how good your optimizing compiler is. This is not to say there is anything wrong with your original way of doing it.

1 Like

I see. I have tested other ideas (such as using spread for expanding a into b and then doing the required operations in b), but not that one.
The access pattern is indeed predictable, it is always by columns so (i-1)*ns+1 : i*ns i.e. 1:3, 4:6... and do that for several columns. I will try, thank you.

I also though of writing chunks (submatrix) directly into b, but that would require writing data in different columns of b; I haven’t tested it but I assume it will be slower since it needs to jump from column to column (at least for large matrices).

EDIT: A quick test using @ivanpribec suggestion does indeed seem to help. The subroutine alone is about 15% faster (with relatively small matrices ~50x50). Great.
EDIT 2: For very small matrices (less than ~20 elements) using an array slice seems faster. But for those sizes the timings are in the order of 1e-6, so testing this subroutine alone is pointless.

Interesting approaches but array syntax is not always a better solution, albeit one of my favorite Fortran features. If you are creating an array that just contains values calculated by an expression, a DO loop is often clearer in my mind, and for large problems eliminates the need for a large scratch array. So in the simplest case If I want a row of B(:,: ) to contain the values 1,2,3,4, … instead of creating a vector A(: ) that contains those values and setting a row of B(:,: ) to that array, simply doing

do i=1,size(B,dim=1)
   B(i:)=expression_using_i
enddo

I have found more than one case where that is faster as well as thriftier. It would be interesting to see more of the code to make sure that applies in this case, but at least as a footnote to the discussion, that is sometimes an appropriate approach (DO loops are not completely obsolete!) . It would be interesting to time the various approaches versus the simplest change (just removing PARAMETER from the statement so it would be dynamically allocated on each call).

My code in particular is a specific case that probably cannot be generalized. But as I mentioned in a previous post, the problem is similar to a Kronecker product of matrices.

So for that case, I have created a few different subroutines with all the ideas presented in this thread: In all cases, the matrices are a(n_a x m_a) and b(n_b x m_b). Also to refer to matrix elements the indices (i,j) are used, and the loops variables are i_a, j_a, i_b, j_b, i_c, j_c (hopefully that helps to see in which direction -row/column- of what matrix -a/b/c- the loop operates on).

How the code works: the matrix c is filled in by columns, in chucks of n_b elements.

  • First column: c(1:n_b,1) = a(1,1)*b(:,1), then c(n_b+1:2*n_b,1) = a(2,1)*b(:,1), etc, all the way to c((n_a-1)*n_b+1:n_a*n_b) =:a(n_a,1)*b(:,1).
  • Second column: same elements a(1,1), a(2,1) ... a(n_a,1) multiplying the second column of b: b(:,2).
  • This process is repeated m_b times, with the same column of a, until matrix c has been filled to column n_b.
  • Repeat this process m_a times, successively using a(:,2), a(:,3), ..., a(:,m_a).

With a slice, allocatable, implied-do loop

    subroutine kronecker_product_slice_v1(c, a, b)
        integer, intent(out), allocatable :: c(:,:)
        integer, intent(in)                :: a(:,:), b(:,:)

        integer :: tmp
        integer :: n_a, m_a, n_b, m_b
        integer :: i_a, j_a, j_b, j_c
        integer :: t

        integer(ik), allocatable :: slice(:)

        n_a = size(a, dim = 1)
        m_a = size(a, dim = 2)
        n_b = size(b, dim = 1)
        m_b = size(b, dim = 2)

        allocate(slice(n_b))
        allocate(c(n_a * n_b, m_a * m_b))

        j_c = 1
        do j_a = 1, m_a
            do j_b = 1, m_b
                slice = [(t, t = 1, n_b)]
                do i_a = 1, n_a
                    tmp = a(i_a, j_a)
                    c(slice, j_c) = tmp * b(:, j_b)
                    slice = slice + n_b
                end do
                j_c = j_c + 1
            end do
        end do
    end subroutine kronecker_product_slice_v1

With slice, passed as an argument

    subroutine kronecker_product_slice_v2(c, a, b, array)
        integer, intent(out), allocatable :: c(:,:)
        integer, intent(in) :: a(:,:), b(:,:)
        integer, intent(in) :: array(:)

        integer :: n_a, n_b, m_a, m_b
        integer :: i_a, j_a, j_b. j_c
        integer :: t

        integer, allocatable :: slice(:)

        n_a = size(a, dim = 1)
        m_a = size(a, dim = 2)
        n_b = size(b, dim = 1)
        m_b = size(b, dim = 2)

        allocate(c(n_a * n_b, m_a * m_b))

        j_c = 1

        do j_a = 1, m_a
            do j_b = 1, m_b
                slice = array
                do i_a = 1, n_a
                    c(slice, j_c) = a(i_a, j_a) * b(:, j_b)
                    slice = slice + n_b
                end do
                j_c = j_c + 1
            end do
        end do
    end subroutine kronecker_product_slice_v2

Referencing, without slice array, v1 (as suggested by @ivanpribec if I understood correctly)

 subroutine kronecker_product_ref_v1(c, a, b)
        integer, intent(out), allocatable :: c(:,:)
        integer, intent(out) :: a(:,:), b(:,:)

        integer :: tmp
        integer :: n_a, m_a, n_b, m_b  ! rows/cols of matrices a, b
        integer :: i_a, j_a, j_b, i_b, j_c


        n_a = size(a, dim = 1)
        m_a = size(a, dim = 2)
        n_b = size(b, dim = 1)
        m_b = size(b, dim = 2)

        allocate(c(n_a * n_b, m_a * m_b))

        j_c = 1

        do j_a = 1, m_a
            do j_b = 1, m_b
                do i_a = 1, n_a
                    tmp = a(i_a, j_a)
                    do i_b = 1, n_b
                        c((i_a - 1) * n_b + i_b, j_c) = tmp * b(i_b, j_b)
                    end do
                end do
                j_c = j_c + 1
            end do
        end do
    end subroutine kronecker_product_ref_v1

Referencing, without slice array, v2 (similar to v1, but using variables instead of computing slice)

    subroutine kronecker_product_ref_v2(c, a, b)
        integer, intent(out), allocatable :: c(:,:)
        integer, intent(out) :: a(:,:), b(:,:)

        integer :: tmp
        integer :: n_a, m_a, n_b, m_b
        integer :: i_a, j_a, j_b, j_c
        integer :: low, upp


        n_a = size(a, dim = 1)
        m_a = size(a, dim = 2)
        n_b = size(b, dim = 1)
        m_b = size(b, dim = 2)

        allocate(c(n_a * n_b, m_a * m_b))

        low = 1
        upp = n_b
        j_c = 1

        do j_a = 1, m_a
            do j_b = 1, m_b
                do i_a = 1, n_a
                    tmp = a(i_a, j_a)
                    c(low : upp, j_c) = tmp * b(:, j_b)
                    low = low + n_b
                    upp = upp + n_b
                end do
                j_c = j_c + 1
                low = 1
                upp = n_b
            end do
        end do
    end subroutine kronecker_product_ref_v2

An extra DO loop, v1 (as suggested by @urbanjost, computing the slice)

    subroutine kronecker_product_do_v1(c, a, b)
        integer, intent(out), allocatable :: c(:,:)
        integer, intent(out) :: a(:,:), b(:,:)

        integer    :: tmp
        integer :: n_a, m_a, n_b, m_b
        integer :: i_a, j_a, j_b, i_b, j_c

        n_a = size(a, dim = 1)
        m_a = size(a, dim = 2)
        n_b = size(b, dim = 1)
        m_b = size(b, dim = 2)

        allocate(c(n_a * n_b, m_a * m_b))

        j_c = 1

        do j_a = 1, m_a
            do j_b = 1, m_b
                do i_a = 1, n_a
                    tmp = a(i_a, j_a)
                    do i_b = 1, n_b
                        c((i_a - 1) * n_b + i_b, j_c) = tmp * b(i_b, j_b)
                    end do
                end do
                j_c = j_c + 1
            end do
        end do
    end subroutine kronecker_product_do_v1

An extra DO loop, v2 (with variables instead of computing the slice)

    subroutine kronecker_product_do_v2(c, a, b)
        integer, intent(out), allocatable :: c(:,:)
        integer, intent(out) :: a(:,:), b(:,:)

        integer :: tmp
        integer :: n_a, m_a, n_b, m_b
        integer :: i_a, j_a, j_b, i_b, j_c
        integer :: skip

        n_a = size(a, dim = 1)
        m_a = size(a, dim = 2)
        n_b = size(b, dim = 1)
        m_b = size(b, dim = 2)

        allocate(c(n_a * n_b, m_a * m_b))

        skip = 0
        j_c = 1

        do j_a = 1, m_a
            do j_b = 1, m_b
                do i_a = 1, n_a
                    tmp = a(i_a, j_a)
                    do i_b = 1, n_b
                        c(skip + i_b, j_c) = tmp * b(i_b, j_b)
                    end do
                    skip = skip + n_b
                end do
                j_c = j_c + 1
                skip = 0
            end do
        end do
    end subroutine kronecker_product_do_v2

There are more options to test (e.g. passing the sizes n_a, n_b, etc) and possibly others may be faster, but that’s all I have tested so far. In most of this subroutines the variable tmp (that stores the value a(i_a, j_a)) is not strictly needed, but it seemed to help (i.e. be faster). I could be wrong.

From my tests, the do and ref versions are faster than the ones using the slice.
Between those two, it’s hard to say (same with v1 or v2). ref seems to be faster, but I cannot confirm, and it also depends on the sizes of a and b.
Something else to test could be the use of integer, real and complex. Maybe the fastest version differs between types.

A test problem for any of these subroutines is as follows:

program test
    implicit none

    integer, parameter :: ik = selected_int_kind(r = 35)    ! for timing
    integer, parameter :: rk = selected_real_kind(r = 3000)    ! quad precision

    integer(ik) :: ts, tf, tr       ! timing
    integer :: i

    integer, parameter :: array(*) = [(i, i = 1, XX)]   ! slice, substitute XX with size(b,1)

    integer, allocatable :: a(:,:), b(:,:), c(:,:)

    a = reshape([1, 1, 1, 3, 0, 2, 2, 0, 2], [3, 3])
    b = reshape([0, 5, 1, 5, 0, 1], [3, 2])

    call system_clock(ts, tr)
    call kronecker_product_XYZ(c, a, b)   ! substitute XYZ
    call system_clock(tf)
    print *, "time: ", real(tf - ts, rk) / tr

    do i = 1, size(a, 1) * size(b, 1)
        print *, c(i,:)
    end do

contains
    ! ...
end program test

PS: There may be bugs in the subroutine codes

3 Likes

Was curious what range you were seeing. These were the numbers I got with a
quick run (in no way claiming to be rigorous):

For the given problem size (performance and resource usage may vary dramatically
for different problem sizes) x 1 000 000, executions …

   Factor of 118 from slowest debug mode to fastest optimized, but a much
   tighter grouping when common optimization flags are used.
   
   Some interesting flip-flops but in general the DO and REF_V1 methods
   appeared to produce equivalent wall-clock and be the fastest; but only
   tried the two primary compiler switch combinations provided by fpm(1)
   and default compiler switches for two compilers.
   
   method    gfortran_release  ifort_release  gfortran_debug  ifort_debug  gfortran_default  ifort_default
   do        0.02300000        0.08830000     0.72700000      1.36609995   0.46700001        0.08670000
   do_v2     0.02300000        0.08890000     0.72299999      1.34549999   0.44299999        0.08560000
   ref_v1    0.02500000        0.08960000     0.75199997      1.37660003   0.39800000        0.08670000
   ref_v2    0.02400000        0.16440000     0.56300002      1.41470003   0.32800001        0.13429999
   slice2    0.03900000        0.59189999     0.77600002      2.69339991   0.43799999        0.47510001
   slice_v1  0.32200000        0.54030001     1.11000001      2.71889997   0.78799999        0.47940001

main.f90 (9.2 KB)

So your results match mine, with do and ref being ahead.

For the given problem size

Do you mean, using the matrices given in the example? (sizes 3x3 and 3x2).
Have you tested larger matrices? For example:

   b = reshape([(i, i = 1, 1200)], [20, 60])
   a = reshape([(-i, i = 1, 3500)], [70, 50])

Although in this case, a good portion of the time will be spent moving data around.

Also, all these subroutines write data by columns. I imagined that given how Fortran stores matrices, that’s the best way. However there is the option of writing blocks (submatrices) every step, so a(1,1)*b, the whole b matrix is written into c, that is n_b rows and m_b columns.
Maybe I will test that next.

Just ran your case; wanted to get an idea of the magnitudes of the differences. I do suspect size will make a significant difference. You did a nice job; was thinking it would make a useful example with a little more variations for those concerned with performance; maybe showing how a function versus a subroutine can affect speed, and so on. Compilers have gotten so much better at optimizing it is sometimes interesting to look at the intermediate code; it shows why you want to develop with debug flags on but create optimized production versions (usually) and so on. I like what you did.

Did some tests with writing blocks (the whole b matrix) every step. Surprisingly I did not see a big slowdown compared to the do/ref versions.

By blocks, columnwise, v1

    subroutine kronecker_product_blocks_v1(c, a, b)
        integer, intent(out), allocatable :: c(:,:)
        integer, intent(in) :: a(:,:), b(:,:)

        real :: tmp
        integer :: n_a, m_a, n_b, m_b
        integer :: i_a, j_a, j_b

        n_a = size(mat_a, dim = 1)
        m_a = size(mat_a, dim = 2)
        n_b = size(mat_b, dim = 1)
        m_b = size(mat_b, dim = 2)

        allocate(mat_c(n_a * n_b, m_a * m_b))

        j_b = 1
        do j_a = 1, m_a
            do i_a = 1, n_a
                tmp = a(i_a, j_a)
                c((i_a - 1) * n_b + 1 : i_a * n_b, (j_b - 1) * m_b + 1 : j_b * m_b) = tmp * b
            end do
            j_b = j_b + 1
        end do
    end subroutine kronecker_product_blocks_v1

By blocks, columnwise, v2

    subroutine kronecker_product_blocks_v2(c, a, b)
        integer, intent(out), allocatable :: c(:,:)
        integer, intent(in) :: a(:,:), b(:,:)

        real :: tmp
        integer :: n_a, m_a, n_b, m_b
        integer :: i_a, j_a, j_b
        integer :: row, col

        n_a = size(mat_a, dim = 1)
        m_a = size(mat_a, dim = 2)
        n_b = size(mat_b, dim = 1)
        m_b = size(mat_b, dim = 2)

        allocate(mat_c(n_a * n_b, m_a * m_b))

        col = 0
        do j_a = 1, m_a
            row = 0
            do i_a = 1, n_a
                tmp = a(i_a, j_a)
                c(row + 1 : row + n_b, col + 1 : col + m_b) = tmp * b
                row = row + n_b
            end do
            col = col + m_b
        end do
    end subroutine kronecker_product_blocks_v2

I tested a rowwise version, but it was definitely slower, so it is not included (it is as simple as switching the loop variables i_a, j_a and the row/col updates).

Using reals instead of ìntegers made the code about an order of magnitude slower, but the do/ref version stay ahead.

Differences between v1 and v2 are basically the same as with the do/ref versions i.e. hardly noticeable change.

All in all, these blockwise subroutines are perhaps easier to read, since only two (instead of three/four) loops are needed. And it seems they offer the same performance.

That’s because Fortran arrays are laid out column-wise. Having the first array dimension match the inner-most loop gives you a better cache access pattern.

IMO, the last version makes the Kronecker-like block product quite clear.

Here’s another variation using associate:

do j = 1, size(a,1)
  associate(jl => (j-1)*m_b + 1, ju => j*m_b)
    do i = 1, size(a,2)
      associate(il => (i-1)*n_b + 1, iu => i*n_b)
        c(il:iu, jl:ju) = a(i,j)*b
      end associate
    end do
  end associate
end do

I’d expect the performance to be comparable with the last two versions you posted.