Matrix index-pointer confusion

Hello. I have been using BLAS/LAPACK with Fortran for some time now, and recently I used it from C, which led me to some confusion about how it is called from Fortran.

In Fortran, if you want to call a subroutine such as dgemm for a matrix-matrix multiply and you want to use a submatrix of a matrix A (for example the submatrix A(2:, 2:)) rather than the entire matrix, the usual process is to specify the top-left entry A(2, 2) of the submatrix together with the submatrix size and the leading dimension. An example is given in the Fortran code below, where the multiplication C = A(2:, 2:) * B is performed.

I just took this as a way of life in Fortran, but when I recently used BLAS/LAPACK from C it became much clearer why you specify the first entry of the submatrix in the subroutine call. Below the Fortran code is an equivalent C code (I am storing C matrices as 1D arrays in column-major order). When calling dgemm in C, you pass a pointer to the top-left entry of the submatrix (in the example code this is &A[1 + lda*1]), and this makes sense to me since an array in C decays to a pointer to its first element: you hand the routine a pointer to the first element of the submatrix, and the leading dimension tells it how far to jump in memory to reach adjacent entries.

Thinking back to Fortran, I don't really know why you must also specify the first entry of the submatrix. Are arrays in Fortran also stored essentially as pointers? I tried looking through Modern Fortran Explained, but I haven't been able to find an answer (it could be there and I just missed it). How would I write my own code with a BLAS/LAPACK-style interface? To me an argument such as A(2, 2) looks like just a double, so I would expect compiler errors when trying to index a double inside the subroutine. If there is a reference for how arrays in Fortran are stored and passed, that would be appreciated as well.

Thanks for the time.

program main
    implicit none

    double precision :: A(3, 3), B(2, 2), C(2, 2)

    A = reshape([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0], shape=[3, 3])
    B = reshape([1.5, 2.5, 3.5, 4.5], shape=[2, 2])
    ! A = 1.0   4.0   7.0   A(2:, 2:) = 5.0   8.0
    !     2.0   5.0   8.0               6.0   9.0
    !     3.0   6.0   9.0
    !
    ! B = 1.5   3.5
    !     2.5   4.5

    ! C = 1.0 * A(2:, 2:) * B
    call dgemm('n', 'n', 2, 2, 2, 1.0d0, A(2, 2), 3, B, 2, 0.0d0, C, 2)
    ! C = 27.5   53.5
    !     31.5   61.5
    
endprogram main

#include <stdlib.h>

void dgemm_(char* transa, char* transb, int* m, int* n, int* k,
           double* alpha, double* A, int* lda, double* B, int* ldb,
           double* beta, double* C, int* ldc);

int main()
{
    int m, n, k;
    int lda, ldb, ldc;
    m = 2; n = 2; k = 2;
    lda = 3; ldb = 2; ldc = 2;

    char transa, transb;
    transa = 'n'; transb = 'n';

    double alpha, beta;
    alpha = 1.0; beta = 0.0;

    double *A, *B, *C;
    A = (double*) malloc(lda * lda * sizeof(double));
    B = (double*) malloc(ldb * ldb * sizeof(double));
    C = (double*) malloc(ldc * ldc * sizeof(double));

    A[0 + lda*0] = 1.0; A[0 + lda*1] = 4.0; A[0 + lda*2] = 7.0;
    A[1 + lda*0] = 2.0; A[1 + lda*1] = 5.0; A[1 + lda*2] = 8.0;
    A[2 + lda*0] = 3.0; A[2 + lda*1] = 6.0; A[2 + lda*2] = 9.0;

    B[0 + ldb*0] = 1.5; B[0 + ldb*1] = 3.5;
    B[1 + ldb*0] = 2.5; B[1 + ldb*1] = 4.5;

    dgemm_(&transa, &transb, &m, &n, &k,
           &alpha, &A[1 + lda*1], &lda, B, &ldb,
           &beta, C, &ldc);

    free(A); free(B); free(C);

    return 0;
}

Your description of the C argument association is exactly what happens in Fortran. Before array syntax was introduced in Fortran 90, passing A(2,2) was the only way to pass the address of the first element of the submatrix. The LDA argument of LAPACK is how many storage elements to skip to get to the next column, i.e. to element A(2,3) of the actual array argument. Arrays in Fortran are stored in column-major order, the opposite of C and C-like languages, but the address and offset ideas are the same in both languages.
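To make the offset arithmetic concrete, here is a tiny sketch (the program and the helper function pos are made up for illustration): in a column-major array with leading dimension lda, element (i, j) sits at linear storage position i + (j-1)*lda, so consecutive columns are exactly lda storage elements apart.

program lda_offsets
    implicit none
    integer, parameter :: lda = 3   ! leading dimension of the 3x3 matrix A from the question
    ! positions of A(2,2) and A(2,3) in the storage sequence: 5 and 8, exactly lda apart
    print *, pos(2, 2), pos(2, 3)
contains
    pure integer function pos(i, j)
        integer, intent(in) :: i, j
        pos = i + (j - 1)*lda       ! column-major linear index
    end function pos
end program lda_offsets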


I think this is how a subarray can be passed to a subroutine in old-style Fortran.
The subroutine's dummy argument is associated with the elements [5, 6, 7, 8, 9] and is interpreted as

5 8 
6 9
7 ?

The last row is incomplete, but you are not supposed to work with these elements anyway.

program matrix
    implicit none
    integer :: mat(3,3), i

    interface
        subroutine sub(a, lda, n)
            implicit none
            integer :: lda, n, a(lda,*)
        end subroutine
    end interface

    mat = reshape([(i,i=1,9)], [3,3])
    do i = 1,3
        print *, mat(i,:)
    end do
    print *
    call sub(mat(2,2),lda=3,n=2)
end program matrix

subroutine sub(a,lda,n)
    implicit none
    integer :: lda, n, a(lda,*)
    integer :: i
    do i = 1,n
        print *, a(i,1:n)
    end do
    print *, a(3,1) ! ok, but not needed
    print *, a(3,2) ! illegal, can cause segfault
end subroutine sub

If you search for “storage sequence association” you’ll find several posts about this sort of argument passing.

Here’s one recent one: Reshaping arrays via explicit length arguments - #3 by ivanpribec

Thanks for the keyword. Using the J3 interpretation document and the reference BLAS, I was able to put together a few test codes that worked as expected and got the hang of the notation.

Storage sequence association appears to be in the deprecated features part of Modern Fortran Explained (section A.8). Is there a reason for this? To illustrate the point, one of the test codes I made up was a subroutine to print a matrix. Its declaration looks like:

subroutine print_matrix(A, lda, m, n)
    integer, intent(in) :: lda, m, n
    double precision, intent(in) :: A(lda, *)
    ...
endsubroutine print_matrix

Calling it for, say, a 3 x 2 submatrix starting at position (3, 4) of the 8 x 8 matrix A would then look like call print_matrix(A(3, 4), 8, 3, 2). If I were to write a more "modern" version of this code, I would write:

subroutine print_matrix(A, m, n)
    integer, intent(in) :: m, n
    double precision, intent(in) :: A(m, n)
    ...
endsubroutine print_matrix

And call this for the same submatrix like call print_matrix(A(3:5, 4:5), 3, 2).
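For what it's worth, here is a hedged sketch of how the two elided bodies and their call sites could look side by side (the names print_seq and print_slice are mine, just so the two versions can coexist in one program):

program print_demo
    implicit none
    double precision :: A(8, 8)
    integer :: i
    A = reshape([(dble(i), i = 1, 64)], [8, 8])
    call print_seq(A(3, 4), 8, 3, 2)      ! sequence-association version
    call print_slice(A(3:5, 4:5), 3, 2)   ! array-section version
contains
    subroutine print_seq(A, lda, m, n)
        integer, intent(in) :: lda, m, n
        double precision, intent(in) :: A(lda, *)
        integer :: i
        do i = 1, m
            print '(*(f6.1))', A(i, 1:n)
        end do
    end subroutine print_seq
    subroutine print_slice(A, m, n)
        integer, intent(in) :: m, n
        double precision, intent(in) :: A(m, n)
        integer :: i
        do i = 1, m
            print '(*(f6.1))', A(i, :)
        end do
    end subroutine print_slice
end program print_demo

Both calls print the same 3 x 2 block; the difference is only in how the submatrix reaches the subroutine.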

To me the second subroutine seems a lot more wasteful. I could be wrong on this, but since I am calling the second one with a slice, it would first create a copy of the section, work with that copy, and then throw it away at the end. The first subroutine just works with the one array A and does not create or destroy any storage. I know the example isn't perfect (it could work with the whole array A and take two starting indices for the print), but I think it gets the point across.

I could be misunderstanding something here, but I feel the storage sequence association method is better than using array slices, and I don’t understand why it is deprecated.

Opinions differ on this matter. I suppose one problem is that this mode of argument passing is intrinsically unsafe: an off-by-one mistake will easily cause an out-of-bounds violation. It also isn't "rank-safe".

The “modern” approach of using assumed-shape arrays is safer in that sense, but it can introduce some performance penalties (contiguous vs non-contiguous code paths).

In performance-oriented codes, explicit-shape array arguments (which allow sequence association) are still common. Here’s a related discussion: What's the purpose of array size inside subroutine arguments? - #8 by ivanpribec

In your second example I think it makes more sense to compare versus assumed-shape arguments:

subroutine print_matrix(A)
    real(dp), intent(in) :: A(:,:)
    ...
end subroutine print_matrix

This allows you to pass a non-contiguous array section (a so-called slice) without making an extra copy. In practice, compilers pass an array descriptor (a "dope vector") that carries the array's rank, shape, and strides.

Edit: here’s a small program to demonstrate the concept.

program demo
implicit none
integer, parameter :: dp = kind(1.0d0)
real(dp) :: A(8,8)
call print_contig(A(:,:3))
call print_contig(A(3:5, 4:5))
contains
    subroutine print_contig(A)
        real(dp), intent(in) :: A(:,:)
        print *, is_contiguous(A)
    end subroutine
end program

Edit 2: I've always thought Fortran is easier than C++ with its complicated argument-passing semantics: pass by value (which includes pointers) and pass by reference, not to mention the whole story with lvalues and rvalues. In Fortran it's the array dummy-argument semantics that are confusing (but powerful).


A.8, in my copy, is titled "Non-default mapping for implicit typing". MFE deprecates, in A.1, "Storage association", which is a different beast from "Sequence association". The deprecated storage association is effected mainly by EQUIVALENCE and COMMON statements.

"Sequence association" is how you had to pass arrays in pre-Fortran 90 days, when there was no mechanism to create array-valued expressions. It allowed argument association between an array element and an array dummy: the array that became associated was the portion of the actual array starting at the specified element, followed by the remaining elements in strict "array element order". This enabled implementations to simply pass a memory address for that argument, and that is how the misconception that Fortran mandates passing memory addresses around took hold.
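A rough illustration of the difference, since the two terms are easy to mix up (this little program is mine, not from MFE): EQUIVALENCE creates storage association, while passing an array element to an explicit-shape dummy relies on sequence association.

program assoc_kinds
    implicit none
    double precision :: x(4), y(2, 2)
    equivalence (x, y)            ! storage association (the deprecated kind): x and y share storage
    x = [1.0d0, 2.0d0, 3.0d0, 4.0d0]
    print *, y(2, 1)              ! 2.0: same storage, column-major order
    call takes_vector(x(2), 3)    ! sequence association: the dummy sees elements x(2:4)
contains
    subroutine takes_vector(v, n)
        integer, intent(in) :: n
        double precision, intent(in) :: v(n)
        print *, v                ! 2.0 3.0 4.0
    end subroutine takes_vector
end program assoc_kinds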


With explicit-shape array dummy arguments, the shape and rank of the actual argument are not checked against the dummy. This can lead to hard-to-diagnose out-of-bounds access errors or, in the case of multidimensional arrays, strange storage sequence association. For assumed-size arguments (with * as the last dimension) the problem is even worse, because you can't even check that the procedure stays within the bounds declared for the dummy argument, since it doesn't have any. Using assumed-shape dummy arguments is considered "safer", because the compiler and runtime can do better checks. This can come with some performance penalty, but that matters far less often than you might think.

Consider the following program as an example:

program array_args
    implicit none
    integer :: arr(3,4)
    call fill_it(arr, size(arr))
    call print_it(arr)
    call set_part(arr(2,3), 2, 2)
    call print_it(arr)
contains
    subroutine fill_it(arr, n)
        integer, intent(in) :: n
        integer, intent(out) :: arr(n)

        integer :: i

        do concurrent (i = 1 : n)
            arr(i) = i
        end do
    end subroutine
    subroutine print_it(arr)
        integer, intent(in) :: arr(:,:)

        integer :: i, j

        do j = 1, size(arr, 2)
            print '(*(i2,1x))', (arr(i,j),i=1,size(arr,1))
        end do
    end subroutine
    subroutine set_part(arr, n, m)
        integer, intent(in) :: n, m
        integer, intent(out) :: arr(n,m)

        arr(1:n,1:m) = 42
    end subroutine
end program

Before looking at the output (shown below), can you predict which elements in the call to set_part will be overwritten?

Output
 1  2  3
 4  5  6
 7  8  9
10 11 12
 1  2  3
 4  5  6
 7 42 42
42 42 12

If the fact that it wasn’t the bottom right corner surprised you, it may be worth reconsidering any use of explicit shape or assumed size, especially for multi-dimensional arrays.

If you print it in its proper shape, there is no surprise: the dummy starts at the (2, 3) element, the next element is the next one in memory, but the next column is forced to be 2 elements away (lda = 2) instead of the underlying 3, so the columns are shifted up by one. Hence I don't see your point about assumed shape. That's an arithmetic error, not a programmatic one. This is one of the most common errors with LAPACK code when you are passing a slice, regardless of whether it's C or Fortran.

           1           4           7          10
           2           5           8          11
           3           6           9          12
 ==================
           1           4           7          42
           2           5          42          42
           3           6          42          12
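To spell out the arithmetic behind that output (a throwaway check program, not part of the example): the dummy arr(2,2) starts at the storage position of arr(2,3), which is 2 + (3-1)*3 = 8 in the 3 x 4 actual array, so its four elements land on storage positions 8, 9, 10 and 11, which hold exactly the values 8, 9, 10 and 11 that were overwritten with 42.

program mapping_check
    implicit none
    integer :: i, j
    integer, parameter :: start = 2 + (3 - 1)*3   ! linear position of arr(2,3) in the 3x4 array
    do j = 1, 2
        do i = 1, 2
            ! dummy element (i,j) with lda = 2 maps to this position of the actual storage
            print *, 'dummy(', i, ',', j, ') -> storage position', start + (i - 1) + (j - 1)*2
        end do
    end do
end program mapping_check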

In fact this is a counter-argument to your point, in that you are not able to send a slice, because set_part can only work with contiguous arrays.

EDIT: Well it is not your point, my apologies. I’ve read a bit too fast.

However, with assumed shape, when you are six functions deep and trying to understand the code, you are in for a treat troubleshooting that codebase.

I cannot imagine any place where an assumed-shape array would result in better code ergonomics than explicit size declarations, unless there is some other fundamental problem.

But it can be checked, i.e., there is nothing preventing compilers from checking it. Unless I am mistaken.

Note that it doesn't compile with LFortran by default; you get a type mismatch error. To compile it, you have to provide the --legacy-array-sections flag. It seems that one should not use legacy array sections in modern code, so we disabled it by default.

Well… subroutines with many dummy arguments become unwieldy. You'll know if you have ever had to call the LAPACK eigenvalue routines. It's not the right level of abstraction for every situation, which is why products like MATLAB and NumPy/SciPy were developed.

I think we can all agree it's nicer to call matmul(A,B) than call dgemm(...), that is, until you are writing your own recursive factorization algorithm and the BLAS interface begins to shine.

Ultimately we need to have both, assumed-shape when ergonomics are key and it’s useful to pass slices, and explicit-shape when you want precise control over layout. There are many shades of grey.
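One pattern that tries to get both, sketched below with made-up routine names: an assumed-shape entry point for ergonomics that forwards to an explicit-shape kernel, with the usual caveat that a non-contiguous section gets copied in and out at that boundary.

module scale_mod
    implicit none
contains
    subroutine scale_matrix(a, alpha)           ! ergonomic, slice-friendly entry point
        double precision, intent(inout) :: a(:, :)
        double precision, intent(in) :: alpha
        call scale_kernel(a, size(a, 1), size(a, 2), alpha)
    end subroutine scale_matrix

    subroutine scale_kernel(a, m, n, alpha)     ! explicit shape: contiguous, easy to optimize
        integer, intent(in) :: m, n
        double precision, intent(inout) :: a(m, n)
        double precision, intent(in) :: alpha
        a = alpha * a
    end subroutine scale_kernel
end module scale_mod

With this split you can call scale_matrix(A(3:5, 4:5), 2.0d0) on a slice, while the kernel keeps the tightly specified layout that is easy to reason about and optimize.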

Sometimes even assumed-shape arrays don’t help reduce verbosity, because the problems we are trying to solve are just so complex. For instance, the ECMWF CLOUDSC mini-app contains the following driver subroutine:

  SUBROUTINE CLOUDSC_DRIVER( &
     & NUMOMP, NPROMA, NLEV, NGPTOT, NGPTOTG, &
     & KFLDX, PTSPHY, &
     & PT, PQ, TENDENCY_CML, TENDENCY_TMP, TENDENCY_LOC, &
     & PVFA, PVFL, PVFI, PDYNA, PDYNL, PDYNI, &
     & PHRSW,    PHRLW, &
     & PVERVEL,  PAP,      PAPH, &
     & PLSM,     LDCUM,    KTYPE, &
     & PLU,      PLUDE,    PSNDE,    PMFU,     PMFD, &
     & PA,       PCLV,     PSUPSAT,&
     & PLCRIT_AER,PICRIT_AER, PRE_ICE, &
     & PCCN,     PNICE,&
     & PCOVPTOT, PRAINFRAC_TOPRFZ, &
     & PFSQLF,   PFSQIF ,  PFCQNNG,  PFCQLNG, &
     & PFSQRF,   PFSQSF ,  PFCQRNG,  PFCQSNG, &
     & PFSQLTUR, PFSQITUR, &
     & PFPLSL,   PFPLSN,   PFHPSL,   PFHPSN, &
     & YDOMCST, YDOETHF, YDECLDP )
    ! Driver routine that performs the parallel NPROMA-blocking and
    ! invokes the CLOUDSC kernel

    USE YOECLDP  , ONLY : TECLDP
    USE YOMCST   , ONLY : TOMCST
    USE YOETHF   , ONLY : TOETHF

    INTEGER(KIND=JPIM), INTENT(IN)    :: NUMOMP, NPROMA, NLEV, NGPTOT, NGPTOTG
    INTEGER(KIND=JPIM), INTENT(IN)    :: KFLDX
    REAL(KIND=JPRB),    INTENT(IN)    :: PTSPHY       ! Physics timestep
    REAL(KIND=JPRB),    INTENT(IN)    :: PT(:,:,:)    ! T at start of callpar
    REAL(KIND=JPRB),    INTENT(IN)    :: PQ(:,:,:)    ! Q at start of callpar
    TYPE(STATE_TYPE),   INTENT(IN)    :: TENDENCY_CML(:) ! cumulative tendency used for final output
    TYPE(STATE_TYPE),   INTENT(IN)    :: TENDENCY_TMP(:) ! cumulative tendency used as input
    TYPE(STATE_TYPE),   INTENT(OUT)   :: TENDENCY_LOC(:) ! local tendency from cloud scheme
    REAL(KIND=JPRB),    INTENT(IN)    :: PVFA(:,:,:)  ! CC from VDF scheme
    REAL(KIND=JPRB),    INTENT(IN)    :: PVFL(:,:,:)  ! Liq from VDF scheme
    REAL(KIND=JPRB),    INTENT(IN)    :: PVFI(:,:,:)  ! Ice from VDF scheme
    REAL(KIND=JPRB),    INTENT(IN)    :: PDYNA(:,:,:) ! CC from Dynamics
    REAL(KIND=JPRB),    INTENT(IN)    :: PDYNL(:,:,:) ! Liq from Dynamics
    REAL(KIND=JPRB),    INTENT(IN)    :: PDYNI(:,:,:) ! Ice from Dynamics
    REAL(KIND=JPRB),    INTENT(IN)    :: PHRSW(:,:,:) ! Short-wave heating rate
    REAL(KIND=JPRB),    INTENT(IN)    :: PHRLW(:,:,:) ! Long-wave heating rate
    REAL(KIND=JPRB),    INTENT(IN)    :: PVERVEL(:,:,:) !Vertical velocity
    REAL(KIND=JPRB),    INTENT(IN)    :: PAP(:,:,:)   ! Pressure on full levels
    REAL(KIND=JPRB),    INTENT(IN)    :: PAPH(:,:,:)  ! Pressure on half levels
    REAL(KIND=JPRB),    INTENT(IN)    :: PLSM(:,:)    ! Land fraction (0-1)
    LOGICAL        ,    INTENT(IN)    :: LDCUM(:,:)   ! Convection active
    INTEGER(KIND=JPIM), INTENT(IN)    :: KTYPE(:,:)   ! Convection type 0,1,2
    REAL(KIND=JPRB),    INTENT(IN)    :: PLU(:,:,:)   ! Conv. condensate
    REAL(KIND=JPRB),    INTENT(INOUT) :: PLUDE(:,:,:) ! Conv. detrained water
    REAL(KIND=JPRB),    INTENT(IN)    :: PSNDE(:,:,:) ! Conv. detrained snow
    REAL(KIND=JPRB),    INTENT(IN)    :: PMFU(:,:,:)  ! Conv. mass flux up
    REAL(KIND=JPRB),    INTENT(IN)    :: PMFD(:,:,:)  ! Conv. mass flux down
    REAL(KIND=JPRB),    INTENT(IN)    :: PA(:,:,:)    ! Original Cloud fraction (t)
    REAL(KIND=JPRB),    INTENT(IN)    :: PCLV(:,:,:,:) 
    REAL(KIND=JPRB),    INTENT(IN)    :: PSUPSAT(:,:,:)
    REAL(KIND=JPRB),    INTENT(IN)    :: PLCRIT_AER(:,:,:) 
    REAL(KIND=JPRB),    INTENT(IN)    :: PICRIT_AER(:,:,:) 
    REAL(KIND=JPRB),    INTENT(IN)    :: PRE_ICE(:,:,:) 
    REAL(KIND=JPRB),    INTENT(IN)    :: PCCN(:,:,:)     ! liquid cloud condensation nuclei
    REAL(KIND=JPRB),    INTENT(IN)    :: PNICE(:,:,:)    ! ice number concentration (cf. CCN)

    REAL(KIND=JPRB),    INTENT(INOUT) :: PCOVPTOT(:,:,:) ! Precip fraction
    REAL(KIND=JPRB),    INTENT(OUT)   :: PRAINFRAC_TOPRFZ(:,:) 
    ! Flux diagnostics for DDH budget
    REAL(KIND=JPRB),    INTENT(OUT)   :: PFSQLF(:,:,:)  ! Flux of liquid
    REAL(KIND=JPRB),    INTENT(OUT)   :: PFSQIF(:,:,:)  ! Flux of ice
    REAL(KIND=JPRB),    INTENT(OUT)   :: PFCQLNG(:,:,:) ! -ve corr for liq
    REAL(KIND=JPRB),    INTENT(OUT)   :: PFCQNNG(:,:,:) ! -ve corr for ice
    REAL(KIND=JPRB),    INTENT(OUT)   :: PFSQRF(:,:,:)  ! Flux diagnostics
    REAL(KIND=JPRB),    INTENT(OUT)   :: PFSQSF(:,:,:)  !    for DDH, generic
    REAL(KIND=JPRB),    INTENT(OUT)   :: PFCQRNG(:,:,:) ! rain
    REAL(KIND=JPRB),    INTENT(OUT)   :: PFCQSNG(:,:,:) ! snow
    REAL(KIND=JPRB),    INTENT(OUT)   :: PFSQLTUR(:,:,:) ! liquid flux due to VDF
    REAL(KIND=JPRB),    INTENT(OUT)   :: PFSQITUR(:,:,:) ! ice flux due to VDF
    REAL(KIND=JPRB),    INTENT(OUT)   :: PFPLSL(:,:,:) ! liq+rain sedim flux
    REAL(KIND=JPRB),    INTENT(OUT)   :: PFPLSN(:,:,:) ! ice+snow sedim flux
    REAL(KIND=JPRB),    INTENT(OUT)   :: PFHPSL(:,:,:) ! Enthalpy flux for liq
    REAL(KIND=JPRB),    INTENT(OUT)   :: PFHPSN(:,:,:) ! Enthalp flux for ice

One level deeper, the assumed-shape arrays get passed to a subroutine with explicit-shape arguments. Sometimes I wonder how (in)efficient this whole nested argument-passing section is… I'm guessing it is just a tiny fraction of the total computation time, but I think it would be interesting to combine these arrays into a couple of derived types (structs) and see if it makes any difference.
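A rough sketch of what such a bundling might look like, using a handful of the flux-diagnostic fields from the listing above (the module, derived type, and kind parameter here are mine, not part of CLOUDSC):

module cloudsc_fields_sketch
    implicit none
    integer, parameter :: jprb = selected_real_kind(13, 300)   ! stand-in for the real JPRB kind
    type :: flux_diagnostics
        real(jprb), allocatable :: pfsqlf(:,:,:)   ! flux of liquid
        real(jprb), allocatable :: pfsqif(:,:,:)   ! flux of ice
        real(jprb), allocatable :: pfplsl(:,:,:)   ! liq+rain sedimentation flux
        real(jprb), allocatable :: pfplsn(:,:,:)   ! ice+snow sedimentation flux
    end type flux_diagnostics
end module cloudsc_fields_sketch

The driver could then take a single type(flux_diagnostics) argument in place of a dozen separate array arguments.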

But the standard says it’s a valid call even if the shapes/ranks don’t match, so if you check it and report an error, you’re rejecting (potentially) valid code.

This

A big thing is avoiding future bugs. A thing I sometimes see happen is something like

integer, allocatable :: arr(:)
...
allocate(arr(10))
...
call something(arr, ..., 10, ...)
...

And then in the future somebody changes the allocation without realizing there are (potentially lots of) other places they now need to go and change to keep things consistent. Those are hard-to-diagnose bugs in large, complex code bases, and I've seen people lose weeks over them.
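For comparison, a minimal sketch of the assumed-shape alternative (reusing the placeholder name something from the snippet above, with the other elided arguments dropped): the size travels with the argument, so changing the allocation cannot silently fall out of sync with the call sites.

subroutine something(arr)
    integer, intent(inout) :: arr(:)   ! assumed shape: the caller's size comes along automatically
    integer :: i
    do i = 1, size(arr)
        arr(i) = i
    end do
end subroutine something

The call site then becomes simply call something(arr).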


I'm not sure how it shines, but I can see that it can sometimes be useful. But how often do you need to write recursive code compared to regular usage?

However, I would also add that most of these things happen because the tools force us into this predicament. C or Fortran does not matter here; we just don't have a low-level language where a function argument carries its metadata into the call. In Python you can pass a slice or a view and the called function will accept the view as meaningful input, so stuff like this works:

>>> A = np.random.rand(6, 6)
>>> np.linalg.eigvals(A[1:5:2, 1:5:2])
array([0.45919692, 1.01877888])

but obviously this convenience is only available at the Python level, so all the disadvantages follow. You can still write recursive code with it, etc. But no low-level language has adopted this type of convenience yet, and it is sorely missed.

I would say this typically happens when you don't have proper data structures, like dictionaries or ergonomic dataclasses, to quickly bundle things together and pass them in. I've been translating such code over the last year. It is really astonishing how far authors will go to make the code flexible by making the signatures mind-bogglingly crowded. That's why syntax ergonomics matter a lot; it's not just a matter of taste.


Thanks to @FedericoPerini we have this type of convenience in Fortran too:

program demo
use stdlib_linalg, only: eigvals
implicit none
real, allocatable :: a(:,:)
complex, allocatable :: clambda(:)
allocate(a(6,6))
call random_number(a)
clambda = eigvals(a(1:5:2,1:5:2))
print *, clambda
end program

Link to the example: Compiler Explorer


In Fortran, signatures carry important optimization considerations too, most importantly the absence of aliasing, which would otherwise inhibit vectorization. I agree that encapsulating data that belongs together in classes (derived types) is desirable.
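A small illustration of the aliasing point (the routine name is made up): because the standard forbids an actual argument that gets modified from overlapping another argument, a compiler may vectorize the statement below without any run-time overlap check, something a C compiler cannot assume for raw pointers without restrict.

subroutine axpy_like(n, x, y)
    integer, intent(in) :: n
    double precision, intent(in) :: x(n)
    double precision, intent(inout) :: y(n)
    ! the standard forbids calling this with x and y overlapping (since y is modified),
    ! so the compiler does not have to worry about aliasing here
    y = y + 2.0d0*x
end subroutine axpy_like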

I was referring to this procedure specifically: dgetrf2. It's constructed very cleverly. I've only had to write my own specialized factorization routines once or twice, so it's not a very common use case. The BLAS/LAPACK/ScaLAPACK authors did a great job with the tools at their disposal.

Right, I've used the fact in the past that you can reshape the dimensions inside a subroutine. Such usage is in the minority, but it is sometimes useful even in modern code. That is separate from the legacy array sections, though. As a user I would still like the compiler to check the code. One approach might be to tell the compiler, through pragmas, compiler options, or some other syntax, that a given subroutine reshapes the dimensions; otherwise the compiler should check ranks exactly.

That's already wonderful, but the point I was trying to make is that you don't pass a copy but a view. Otherwise you can always copy first and then pass a contiguous array.

Come to think of it, eigvals was a bad choice for the example, because that also copies behind the scenes to send the data to Fortran.

Oh, sorry if I missed that. Isn't NumPy forced to make a copy too when it relies on LAPACK?

In Fortran you could force a contiguous copy outside with a small change (extra parentheses):

clambda = eigvals((A(1:5:2,1:5:2)))   ! (!) Array temporary

But if I haven't misunderstood, what you are trying to say is that the BLAS/LAPACK interfaces don't accept strided arrays, meaning you are always forced to make a copy? That is absolutely true, and I guess it might be a concern in some cases. Sometimes the performance improvement from contiguous access can amortize the cost of making the copy. If you think of Level 3 BLAS routines, optimized implementations copy and transpose sub-blocks of the arrays to make better use of the CPU caches.
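For illustration, a hedged sketch of that packing idea (not how any particular BLAS implements it): copy an mb x nb block of A into a contiguous buffer that the inner kernel then works on.

subroutine pack_block(a, lda, i0, j0, mb, nb, buf)
    integer, intent(in) :: lda, i0, j0, mb, nb
    double precision, intent(in) :: a(lda, *)
    double precision, intent(out) :: buf(mb, nb)   ! contiguous working copy for the kernel
    integer :: j
    do j = 1, nb
        buf(:, j) = a(i0:i0 + mb - 1, j0 + j - 1)
    end do
end subroutine pack_block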


In BLIS the authors made the decision to support both row and column strides, and cater to both Fortran and C array ordering. It is a bit complicated to call from Fortran, but the complexity can be hidden under a wrapper layer (they already offer a BLAS compatibility layer).

No, you are quite right. My example derailed the discussion into linear algebra ops. What I meant to emphasize is that you can take a slice or an arbitrary part of an array, with varying strides, and the receiving function still sees a full-blown array without any contiguity requirement. I should not have used LAPACK to make the point, because it cluttered what I was hoping to say.