Writing wrappers for LAPACK and BLAS routines

Hi there!

I am trying to write a short RK4 routine for some numerical calculation. I have written some wrapper functions to call the LAPACK and BLAS functions for some of the steps in my code.

For example, here are the wrapper functions I wrote:

module helpers
    use, intrinsic ::iso_fortran_env, only:dp=>real64
    implicit none
    private

    public zmulmv, zmulmm

    contains

    ! wrapper for ZGEMV
    subroutine zmulmv(A, X, Y)
        complex(dp), dimension(:,:), intent(in) :: A
        complex(dp), dimension(:), intent(in) :: X
        complex(dp), dimension(:), intent(out) :: Y
        integer :: m, n, incx, incy, lda
        character :: transa
        complex(dp) :: alpha, beta
        lda = size(A,1)
        incx = 1
        incy = 1
        m = size(A,dim=1)
        n = size(A,dim=2)
        alpha = (1.0d0, 0.0d0)
        beta = (1.0d0, 0.0d0)
        transa = 'N'
        call zgemv(transa, m, n, alpha, A, lda, X, incx, alpha, Y, incy)
    end subroutine  

    ! wrapper for ZGEMM
    subroutine zmulmm(A, B, C)
        complex(dp), dimension(:, :), intent(in) :: A, B
        complex(dp), dimension(:, :), intent(out) :: C
        integer :: l, m, n, lda, ldb, ldc
        character :: transa, transb, transc
        complex(dp) :: alpha, beta 
        alpha=(1.0d0, 0.0d0)
        beta=(1.0d0, 0.0d0)
        transa = 'N'
        transb = 'N'
        transc = 'N'
        lda = size(A,1)
        ldb = size(B,1)
        ldc = size(C,1)
        l = size(A,1)
        n = size(B,1)
        m = size(C,1)
        call zgemm(transa, transb, l, n, m, alpha, a, lda, b, ldb, beta, c, ldc)
    end subroutine
end module helpers

So, for some reason these functions don't work as expected and are failing the matrix-matrix multiplication tests I wrote for them.

What am I missing here?

I think you should have

beta=(0.0d0, 0.0d0)

because the matrix C is output only.


Hello,

Without the full test code it can be difficult to tell…

However, what you are doing here is C = C + A.B, not C = A.B, and since C is uninitialized you probably get garbage. Your beta should be equal to zero, not to one. Also, in your call to zgemv you are passing alpha twice instead of alpha and then beta.
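For reference, a minimal sketch of the corrected calls (assuming non-transposed products; note also that ZGEMM's m/n/k arguments have to come from the actual matrix shapes, which the original wrapper only gets right for square matrices):

    alpha = (1.0d0, 0.0d0)
    beta  = (0.0d0, 0.0d0)   ! Y and C are pure outputs
    ! Y := alpha*A*X + beta*Y
    call zgemv('N', size(A,1), size(A,2), alpha, A, size(A,1), X, 1, beta, Y, 1)
    ! C := alpha*A*B + beta*C, with C of shape (size(A,1), size(B,2))
    call zgemm('N', 'N', size(A,1), size(B,2), size(A,2), alpha, A, size(A,1), &
               B, size(B,1), beta, C, size(C,1))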


Thanks! I completely missed the coefficients. This fixes everything.


Note that your wrappers are likely to degrade performance. Your matrices are declared as assumed-shape arguments, which can be discontiguous. The BLAS routines don't have explicit interfaces, so the compiler has to assume that they expect contiguous arrays. The risk is that the compiler generates copy-in/copy-out for all the arrays, regardless of whether they are actually contiguous or not. And in the case where you are multiplying matrices made of array sections (thus possibly discontiguous), you are not taking advantage of the ability of the BLAS routines to handle many of them without copy-in/copy-out.
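As an aside, the lda/ldb/ldc arguments are precisely what lets BLAS work on sub-blocks in place. A sketch (assuming the usual implicit BLAS interface and a contiguous parent array A, so that passing A(1,1) works by sequence association; m, n, alpha, beta, X, Y are illustrative):

    ! multiply only the top-left m-by-n block of a larger array A, without a copy:
    ! the block starts at A(1,1), and consecutive columns of the block are
    ! size(A,1) elements apart in memory, which is exactly what lda expresses
    call zgemv('N', m, n, alpha, A(1,1), size(A,1), X, 1, beta, Y, 1)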


Ok, that makes sense. Do you suggest passing the dimensions of the matrices as integer inputs to the wrapper functions?

Or is there some other way, using allocatable arrays?


Maybe adding the contiguous attribute to the array declarations would solve this potential performance issue?

If only contiguous arrays are passed in, then yes, the contiguous attribute will do.


Ok, if you don’t mind could you please elaborate on this?

If you declare your arrays like this, you enforce their contiguity:

complex(dp), dimension(:,:), contiguous, intent(in) :: A

Then the compiler knows that it can pass them directly to the BLAS routines, without temporary copies.

Now the problem possibly arises when you call your own routines:

complex(dp), allocatable :: A(:,:)
...
allocate( A(n,m) )
...
call zmulmv(A, X, Y)

is fine (at least for A): the compiler knows that zmulmv expects a contiguous array, the standard requires allocatable arrays to be contiguous, so the compiler knows that no copy-in is needed.

complex(dp), allocatable :: A(:,:)
...
allocate( A(n,m) )
...
call zmulmv(A(1:n/2,:), X, Y)

is less fine: A(1:n/2,:) is a non-contiguous array section. Then the compiler has to generate a copy-in to a temporary array.


The dummy arguments in the BLAS routines are declared as assumed size (e.g. (*) or (lda,*)), which implies contiguous arguments. So the contiguity problem is not caused by the lack of an explicit interface; it would still be there even if an explicit interface were provided. The other comments about copy-in/copy-out are correct.

The only way to eliminate this problem entirely would be to redefine the blas interface so that it uses assumed shape declarations for the array dummy arguments. This would require an explicit interface that is not backward compatible, so there is no evolutionary step from assumed size to assumed shape.
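For concreteness, an explicit interface that matches the reference BLAS would still have to declare the arrays assumed size. A sketch for ZGEMV (dp here is assumed to be the real64 kind parameter from the wrapper module):

    interface
        subroutine zgemv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
            import :: dp
            character, intent(in) :: trans
            integer, intent(in) :: m, n, lda, incx, incy
            complex(dp), intent(in) :: alpha, beta
            complex(dp), intent(in) :: a(lda,*), x(*)   ! assumed size implies contiguous
            complex(dp), intent(inout) :: y(*)
        end subroutine zgemv
    end interface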


Sure. But what the compiler really sees in the above case is the absence of an interface.

Looking at the control-flow graph in Compiler Explorer, there is one path that is free of mallocs, and I speculate this is the case where the matrix is contiguous. In any case, compilers are able to detect the generation of array copies and report it; in gfortran the option is -fcheck=array-temps.

It's good to be aware of the problem, especially if the code is meant to run on a large machine and for lots of iterations. But people make dozens of copies all the time in MATLAB or SciPy, and very few seem to care. For example, in scipy.linalg.solve (a wrapper for ?GESV) the input matrix and right-hand side are not overwritten by default, and users let the module make copies all the time.

In the case of gfortran, writing BLAS wrappers can be avoided in favor of the intrinsic matmul plus the compiler flag -fexternal-blas. More information can be found here: Mapping matrix & vector arithmetic to BLAS calls - #6 by ivanpribec
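For instance (a sketch; myprog.f90 is a placeholder, and you still have to link an actual BLAS library such as OpenBLAS):

    ! compile with:  gfortran -O2 -fexternal-blas myprog.f90 -lopenblas
    ! gfortran then dispatches matmul to ?GEMM (here ZGEMM) above a size
    ! threshold that can be tuned with -fblas-matmul-limit=N
    C = matmul(A, B)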


But would that be preferred over calling MKL when compiling with ifort?

I guess the issue with contiguous arrays that @PierU mentioned may also be compiler specific.

The gfortran compiler has the option to map Fortran intrinsics to BLAS calls. I'm not sure whether the Intel compiler also has this option (I think it doesn't). If it did, the Intel compiler would map the intrinsics to the BLAS implementations within the MKL library.

There are many ways that a programmer can write the code, and there are many possibilities for which library routines eventually get called. That flexibility is good in some ways, but if there are large performance differences for these various possibilities, then it can also make the programmer’s task more difficult. Especially when different compilers (or different versions of a given compiler) perform differently, then the programmer must do a lot of tuning and benchmarking to see which combinations of code and compiler options are best. This is how conditional compilation blocks with lots of tests to select the best options end up in production codes.

One desirable feature is for the compiler to do as much of this testing and block selection as possible at compile time. The matmul() intrinsic already does a lot of this. It can compute vector-matrix products, matrix-vector products, vector-vector outer products, and matrix-matrix products, and it knows at compile time which of these it is supposed to do; it does not need to wait until run time to make those decisions. In principle it can also make decisions, at both compile time and run time, based on array contiguity or other memory features, and if the arrays are not contiguous, it knows if and when to make contiguous temporary copies. User-written routines do not have that full flexibility, so they naturally require more run-time testing. I don't know how it could be done, but it would be good if some of those advantages of intrinsic functions like matmul() could be transferred over and made available to user-written routines.

It depends, of course, on how much performance matters and whether this is a critical part of the computation.

If I declare my matrices as assumed-shape arrays (even with contiguous) within the wrapper functions, for some reason I run into a stack overflow for matrices larger than 1000.

I am not sure why this is happening: why would the arguments of the wrapper functions be stored on the stack?


To go further we need to see more code, if possible a minimal reproducible example. That said:

This is how they are declared in your original post, so what have you changed?

Yes, the subroutines that wrap the BLAS routines in my code use assumed-shape arrays. As mentioned in the original post, I use these wrapper functions in my RK4 subroutine (see the code below).

    subroutine rungekutta(func, y0, ti, tf, dt, print_nstep, outfile)
        use blas_wrappers
        use, intrinsic :: iso_fortran_env, only:dp=>real64
        implicit none
        complex(dp), intent(in) :: func(:,:)
        complex(dp), intent(in) :: y0(:)
        real(dp), intent(in) :: ti, tf, dt
        integer, intent(in) :: print_nstep
        character (len=*), intent(in) :: outfile
       
        complex(dp), allocatable :: yi(:)
        complex(dp), allocatable, dimension(:) :: k1, k2, k3, k4 
        complex(dp), allocatable, dimension(:) :: kt

        
        complex(dp) :: expt, norm, autocorr
        real(dp) :: t
        integer :: n, m
        integer :: tstep
        
        n = size(func, dim=2)
        m = size(y0)
        ! some consistency checks
        if (n .ne. m) then
            print*, "!!ERROR: Dimensions of func and y0 don't match."
            stop
        end if

        allocate(k1(n), k2(n), k3(n), k4(n))
        allocate(kt(n))
        allocate(yi(n))
    
        t = ti
        tstep = 0
        yi = y0
        open(100, file=trim(outfile))
        ! writing for t0
        call zmul_zdotc(yi, yi, norm)
        ! expectation values
        call zmul_mv(func, yi, kt)
        call zmul_zdotc(yi,kt, expt)
        ! autocorr
        call zmul_zdotc(y0, yi, autocorr)
        write(100,'(4f32.16)') t, abs(norm), abs(autocorr), abs(expt)  

        do while (t < tf)
            call zmul_mv(func, yi, k1)
            call zmul_mv(func, (yi + (dt*0.5d0)*k1), k2)
            call zmul_mv(func, (yi + (dt*0.5d0)*k2), k3)
            call zmul_mv(func, (yi + (dt*1.0d0)*k3), k4)
            yi = yi + (dt/6.0d0)*(k1 + 2.0d0*k2 + 2.0d0*k3 + k4)
            t = t + dt
            tstep = tstep + 1
            if (modulo(tstep, print_nstep) == 0) then
                ! norm
                call zmul_zdotc(yi, yi, norm)
                ! intermediate normalisation only for itp
                yi = 1/(norm)**(0.5d0) * yi
                call zmul_zdotc(yi, yi, norm)
                ! expectation values
                call zmul_mv(func, yi, kt)
                call zmul_zdotc(yi,kt, expt)
                ! autocorr
                call zmul_zdotc(y0, yi, autocorr)
                write(100,'(4f32.16)') t, abs(norm), abs(autocorr), abs(expt)  
            end if
        end do
        close(100)
    end subroutine rungekutta

Note: I just renamed helpers to blas_wrappers, and the only other change is beta = (0.0d0, 0.0d0).

So, when I try to test rungekutta with any func matrix larger than 1000, I run into a stack overflow, but I am not sure why this happens.

This is probably the same problem, but higher in the call sequence. In rungekutta the compiler cannot know whether func is contiguous or not, so it is likely making a temporary copy when calling zmul_mv. Some compilers (e.g. ifort/ifx) allocate temporary arrays on the stack, which has a limited size. If func is too large, this causes a runtime error.
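Two possible ways out, sketched under the assumption that func is always contiguous at the call site: declare it contiguous in rungekutta as well, so that no temporary is needed when calling zmul_mv, or move the temporaries off the stack:

    ! in rungekutta: promise the compiler that func is contiguous
    complex(dp), contiguous, intent(in) :: func(:,:)

    ! alternatively, with ifort/ifx, allocate temporaries on the heap:
    !     ifort -heap-arrays myprog.f90
    ! (gfortran heap-allocates large temporaries unless -fstack-arrays is in effect)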

Which compiler are you using?