Writing wrappers for LAPACK and BLAS routines

A matrix of order 10^6 (i.e. 10^6 x 10^6) in double precision (8 bytes per entry) would require roughly 7450 GB of storage. Assuming that only 2% of the entries are non-zero, storage of the matrix values drops to 150 GB. Add another 75 GB for the non-zero pattern in CSR format. We are still within the 250 GB of memory you said a single (multi-core) CPU has available. Storage of the vectors y_i, k1-k4, and kt, 10^6 elements each, is merely 46 MB and very likely fits entirely into the L3 cache.
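
For reference, here is the back-of-the-envelope arithmetic behind those numbers, with N = 10^6 (the 4-byte integers for the CSR column indices and row pointers are my assumption):

$$
\begin{aligned}
\text{dense:}\quad & N^2 \times 8\,\mathrm{B} = 8\times 10^{12}\,\mathrm{B} \approx 7450\,\mathrm{GiB} \\
\text{values (2\% fill):}\quad & 0.02\,N^2 \times 8\,\mathrm{B} = 1.6\times 10^{11}\,\mathrm{B} \approx 150\,\mathrm{GiB} \\
\text{CSR indices:}\quad & 0.02\,N^2 \times 4\,\mathrm{B} + (N{+}1)\times 4\,\mathrm{B} \approx 8\times 10^{10}\,\mathrm{B} \approx 75\,\mathrm{GiB} \\
\text{6 work vectors:}\quad & 6\,N \times 8\,\mathrm{B} = 4.8\times 10^{7}\,\mathrm{B} \approx 46\,\mathrm{MiB}
\end{aligned}
$$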

Regarding the question of performance advantages of sparse over dense matrix methods, I’ve found the following HPCwire article and the paper it references to be an interesting read. It echoes my suspicion that advances in hardware and the availability of large, cheap pools of memory make the complexity and overhead of sparse methods a less viable option than just using dense methods on GPUs or multi-core CPUs for several classes of problems (such as the machine learning problem described in the article). A comparison of sparse vs dense methods for a wider range of problems on modern hardware would be helpful.

I find the article title misleading, because the hypothesis that led to the improved performance was:

The researcher suspected the matrices might not be as sparse as originally assumed [emphasis mine] (hence they emphasize “assumed sparse” in their discussions), and they knew the benefits on memory savings in such cases are diminished as they can be easily overwhelmed by the additional costs of matrix operations.

So being aware of your matrix structure appears to pay off.


Totally agree about knowing the matrix structure. That’s why I think a similar analysis comparing sparse vs dense matrix methods for, say, large-scale static finite element problems across a variety of hardware architectures would be a useful addition to the knowledge base. On the other hand, just what is the figure of merit, in terms of the ratio of zero to non-zero elements, that one would use to define what “sparse” means: 40% zeros, 20% zeros, etc.?

It varies, depending upon the structure and the hardware you are trying to target. A vague definition would be “enough that it is worth exploiting”. In cuSPARSE and MKL there are block-sparse formats, where some zeros may be stored, but in exchange there are nice “dense” regions where you can do dense computation. When matrices are symmetric, A = A^T, or the algorithms involve calculation of matrix powers, y = A^p x (matrix polynomials, s-step Krylov methods), there are more tricks which can be exploited.
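
For instance, when only the upper triangle of a symmetric matrix is stored in CSR, each stored entry can serve both its own row and the mirrored one. A minimal serial sketch (the array names are mine, and it is not tuned in any way):

    ! y = A*x for symmetric A, with only the upper triangle (incl. diagonal)
    ! stored in the CSR arrays row_ptr / col_ind / val
    subroutine csr_symv(row_ptr, col_ind, val, x, y)
        implicit none
        integer, parameter :: dp = kind(1.0d0)
        integer,  intent(in)  :: row_ptr(:), col_ind(:)
        real(dp), intent(in)  :: val(:), x(:)
        real(dp), intent(out) :: y(:)
        integer :: i, j, k
        y = 0.0_dp
        do i = 1, size(row_ptr) - 1
            do k = row_ptr(i), row_ptr(i+1) - 1
                j = col_ind(k)
                y(i) = y(i) + val(k)*x(j)                ! upper-triangle term A(i,j)*x(j)
                if (j /= i) y(j) = y(j) + val(k)*x(i)    ! mirrored term A(j,i)*x(i)
            end do
        end do
    end subroutine csr_symv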

I’ve seen some work from the libCEED and MFEM communities, where matrix-free approaches beat assembled matrices by large margins. See for example Fig. 7 in this work: [2204.01722] Performance Portable Solid Mechanics via Matrix-Free $p$-Multigrid, or other works by Jed Brown. Another person working heavily in this area, specifically on matrix-free discontinuous Galerkin methods, is Martin Kronbichler.

AFAIK, the advantage of these algorithms is that, because they perform more flops per degree of freedom, they are better suited to today’s “imbalanced” architectures, where you can do far more flops than memory accesses.

(Image taken from the NHR PerfLab Seminar: A Review of Processor Advances Over the Past Thirty Years)


In some application fields, a sparse matrix is one where the number of nonzero elements scales as O(N) rather than O(N**2). In others, it means that the fraction of nonzeros relative to N**2 is a small number, like 1% or 10%.


Nope. In Fortran all arrays are born (free and equal in rights, one might say) contiguous; this is a requirement of the standard.

As suggested above, using a parallel BLAS implementation could help. Or you can write your own multithreaded matrix-matrix multiplication; with OpenMP this is quite easy.
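
For example, a naive OpenMP version could look like the sketch below (loop order chosen for column-major access; a tuned BLAS gemm will still be much faster thanks to blocking and vectorization):

    ! naive C = A*B with the outermost (column) loop threaded by OpenMP
    subroutine omp_matmul(a, b, c)
        implicit none
        integer, parameter :: dp = kind(1.0d0)
        real(dp), intent(in)  :: a(:,:), b(:,:)
        real(dp), intent(out) :: c(:,:)
        integer :: i, j, k
        !$omp parallel do default(none) shared(a, b, c) private(i, j, k)
        do j = 1, size(b, 2)
            c(:, j) = 0.0_dp
            do k = 1, size(a, 2)
                do i = 1, size(a, 1)
                    c(i, j) = c(i, j) + a(i, k)*b(k, j)
                end do
            end do
        end do
        !$omp end parallel do
    end subroutine omp_matmul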

The right way if your matrices are sparse is to use a sparse matrix library.

Or, as suggested by @ivanpribec, to not explicitly store the matrix. Imagine you want to perform a 3-point sliding sum on a long vector of size n: mathematically it can be expressed by a matrix-vector multiplication y=M.x, where M is an n*n tridiagonal matrix with the constant 1.0/3.0 along its 3 diagonals. But forming the matrix and performing a classical matrix-vector multiplication would be totally inefficient (and not even possible for large n). Instead one writes a routine that just does “as-if”, without forming the matrix at all:

subroutine sliding_average_3(x,y)
   implicit none
   real, intent(in)  :: x(:)
   real, intent(out) :: y(:)
   integer :: i, n
   n = size(x)
   y(1) = x(1)+x(2)          ! first row of M has only 2 nonzeros
   do i = 2, n-1
      y(i) = sum(x(i-1:i+1)) ! interior rows: 3-point sum
   end do
   y(n) = x(n-1) + x(n)      ! last row of M has only 2 nonzeros
   y(:) = y(:) / 3.0         ! common 1/3 factor applied once at the end
end subroutine

Sliding sums/averages can also be computed by keeping track of the current value, subtracting off the “first” element and adding the new “last” element to compute the next value. No matter the width of the window, there are always only those two floating point operations for each new element. This algorithm can suffer from roundoff-error accumulation, so it is a good situation for using compensated summation or an extended-precision intermediate, or you can simply restart every so often to keep the errors from propagating too far.
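
A sketch of that running-sum idea for a centered moving average (the window width w is assumed odd; compensated summation and the periodic restart are left out for brevity):

    ! centered moving average of odd width w using a running sum:
    ! each new output costs one add and one subtract, independent of w
    subroutine running_average(x, y, w)
        implicit none
        integer, parameter :: dp = kind(1.0d0)
        real(dp), intent(in)  :: x(:)
        real(dp), intent(out) :: y(:)
        integer,  intent(in)  :: w
        integer  :: i, h, n
        real(dp) :: s
        n = size(x)
        h = w/2
        s = sum(x(1:min(1+h, n)))              ! initial window centered on i = 1
        y(1) = s
        do i = 2, n
            if (i+h <= n)   s = s + x(i+h)     ! element entering the window
            if (i-h-1 >= 1) s = s - x(i-h-1)   ! element leaving the window
            y(i) = s
        end do
        y = y/real(w, dp)                      ! same 1/w factor as in the 3-point example
    end subroutine running_average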

Regarding the choice of explicit matrix computations or “operator” formulations, one aspect to consider is the number of times each matrix element will be used. If it is only used once, or just a few times, then it might be a good situation for an “operator” formulation. If it is used many times, then the explicit matrix computation and storage might be indicated.

My bad, I should’ve given more details. I am trying to use RK4 for solving the time-dependent Schrödinger equation for different kinds of Hamiltonians. func could be of different forms, and usually results from an FEM analysis of the problem. What started out as a simple task, writing up an RK4 routine using BLAS functions for larger matrices, has metamorphosed into all these problems.

I’ve tried implementing a matrix-free approach by computing my matrix elements on the fly, and it seems to work. However, I believe this should only be the case when func is dense and not sparse.

I think what you’re referring to is similar to an n-point stencil form used to approximate derivatives in finite element methods. That way func would be limited to a band matrix, in which case I guess using zgbmv would be more efficient.
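
For what it’s worth, for a tridiagonal (complex, Hermitian) Hamiltonian the zgbmv call would look roughly like the sketch below; d, e, and the subroutine name are placeholders of mine, and the band storage follows the usual convention ab(ku+1+i-j, j) = H(i,j):

    ! y = H*x for a tridiagonal Hermitian H (kl = ku = 1) via the banded BLAS routine zgbmv
    subroutine tridiag_hmul(d, e, x, y)
        implicit none
        integer, parameter :: dp = kind(1.0d0)
        complex(dp), intent(in)  :: d(:)       ! diagonal of H, size n
        complex(dp), intent(in)  :: e(:)       ! superdiagonal of H, size n-1
        complex(dp), intent(in)  :: x(:)
        complex(dp), intent(out) :: y(:)
        complex(dp), allocatable :: ab(:,:)
        integer :: n, j
        external :: zgbmv
        n = size(d)
        allocate(ab(3, n))                     ! kl + ku + 1 = 3 rows of band storage
        ab = (0.0_dp, 0.0_dp)
        do j = 1, n
            if (j > 1) ab(1, j) = e(j-1)       ! superdiagonal H(j-1, j)
            ab(2, j) = d(j)                    ! diagonal      H(j,   j)
            if (j < n) ab(3, j) = conjg(e(j))  ! subdiagonal   H(j+1, j) = conjg(H(j, j+1))
        end do
        call zgbmv('N', n, n, 1, 1, (1.0_dp, 0.0_dp), ab, 3, x, 1, &
                   (0.0_dp, 0.0_dp), y, 1)
    end subroutine tridiag_hmul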

I tried to implement this with MKL’s sparse routines plus some OpenMP helper kernels, and I’ve attached snippets from my code below.

I am not able to get it to run across all the threads allocated to the job. For example, even with OMP_NUM_THREADS=32 on a machine with 256 threads, it seems to use only 24 threads at best.

    ! parallelised dot product of two vectors
    subroutine omp_dotprod(psi_i, psi_j, result)
        implicit none
        real(dp), dimension(:), intent(in)  :: psi_i, psi_j
        real(dp), intent(out) :: result
        integer :: u
        integer :: ndim
        result = 0.0d0
        ndim = size(psi_i)
        !$omp parallel default(none) private(u) shared(ndim, psi_i, psi_j, result)
        !$omp do reduction ( + : result )
        do u = 1, ndim
            result = result + psi_i(u) * psi_j(u)
        end do
        !$omp end do
        !$omp end parallel
    end subroutine

    ! parallelised constant times a vector plus a vector ( with double precision) 
    subroutine omp_daxpy(a, x_array, y_array, z_array)
        implicit none
        real(dp), intent(in) :: a
        real(dp), dimension(:), intent(in)  :: x_array, y_array
        real(dp), dimension(:), intent(out) :: z_array
        integer :: i, ndim
        ndim = size(x_array)
        z_array = 0.0d0
        !$omp parallel default(none) private(i) shared(ndim, a, x_array, y_array, z_array)
        !$omp do
        do i=1,ndim
            z_array(i) = a*x_array(i) + y_array(i)
        end do
        !$omp end do
        !$omp end parallel
    end subroutine

    ! parallelised vector normalisation
    subroutine omp_normalize(psi)
        real(dp), intent(inout) :: psi(:)
        real(dp) :: norm, inv_norm
        integer :: u, ndim
        ndim = size(psi)
        call omp_dotprod(psi, psi, norm)
        inv_norm = 1.0d0/sqrt(norm)   ! sqrt avoids the default-real exponent 0.5
        !$omp parallel default(none) private(u) shared(ndim, inv_norm, psi)
        !$omp do
        do u=1,ndim
            psi(u) = inv_norm * psi(u)
        end do
        !$omp end do 
        !$omp end parallel
    end subroutine omp_normalize

    ! wrapper for MKL SPARSE DSYMV
    subroutine csr_dmul_mv(A, arow, acol, X, Y)
        implicit none
        integer, dimension(:), intent(in)   :: arow, acol
        real(dp), dimension(:), intent(in)  :: A, X
        real(dp), dimension(:), intent(out) :: Y   ! Y is written by mkl_dcsrsymv, so not intent(in)
        integer :: m
        character (len=1) :: uplo
        uplo = 'U'
        m = size(arow) - 1
        call mkl_dcsrsymv(uplo, m, A, arow, acol, X, Y)
    end subroutine csr_dmul_mv

    ! wrapper for DDOT
    subroutine dmul_ddot(X, Y, res)
        real(dp), allocatable, dimension(:), intent(in) :: X, Y 
        real(dp), intent(out) :: res
        real(dp), external :: ddot
        integer :: incx, incy, ndim
        ndim = size(X, dim=1)
        incx = 1
        incy = 1
        res = ddot(ndim, x, incx, y, incy)
    end subroutine
    ! imaginary time propagation for a sparse hamiltonian
    subroutine RK4(h_array, h_row, h_col, psi0, dt, print_nstep, etol, Ei, tstep)
        implicit none
        real(dp), allocatable, intent(inout) :: h_array(:)
        integer, allocatable, intent(in) :: h_row(:)
        integer, allocatable, intent(in) :: h_col(:)
        real(dp), allocatable, intent(inout) :: psi0(:)
        integer,  intent(in)    :: print_nstep
        integer, intent(inout)  :: tstep
        real(dp), intent(inout) :: Ei
        real(dp), intent(in)    :: dt
        real(dp), intent(in)    :: etol
        real(dp) :: t  ! time
        real(dp), allocatable, dimension(:) :: psi_i
        real(dp), allocatable, dimension(:) :: k1, k2, k3, k4 
        real(dp), allocatable, dimension(:) :: kt
        real(dp) :: const ! constant for daxpy operations
        real(dp) :: norm, autocorr, E0, dE
        real(dp) :: a
        integer :: ndim

        ! allocating arrays
        ndim = size(psi0)
        a = -1.0d0
        allocate(k1(ndim), k2(ndim), k3(ndim), k4(ndim))
        allocate(kt(ndim))
        allocate(psi_i(ndim))
        
        ! initiating some variables
        t = 0.0d0
        tstep = 0
        const = 0.0d0
        psi_i = psi0
        E0 = 0.0d0
        ! changing H = -1.0 * H
        h_array = -1.0d0* h_array
        ! csr_dmul_mv with the negated h_array gives -(\hat{H}.\psi)
        call csr_dmul_mv(h_array, h_row, h_col, psi_i, kt)
        call dmul_ddot(kt, psi_i, Ei)   ! Ei = <\psi|-\hat{H}|\psi>
        Ei = -1.0d0*Ei                    ! multiply -1.0d0 to get correct energy
        open(100, file='imag_tprop.out')
        ! OUTFILE headers
        write(100,*) ""
        write(100,*) "Imaginary Time Propagation"
        write(100,*) "--------------------------"
        write(100,*) ""
        ! writing headers
        write(100,'(4a32)') 'Imag. time (a.u.)', '< psi_i | psi_i >', '< psi_i | psi_0 >', 'Energy (hartree)'
        do 
            ! print output to itp.out
            if (modulo(tstep, print_nstep) == 0) then
                ! norm
                call dmul_ddot(psi_i, psi_i, norm)
                ! autocorr
                call dmul_ddot(psi0, psi_i, autocorr)
                write(100,'(4f32.16)') t, norm, autocorr, Ei 
            end if
            ! RK4 step
            ! computing k1
            call csr_dmul_mv(h_array, h_row, h_col, psi_i, k1)
            ! computing k2
            call omp_daxpy(0.5d0*dt, k1, psi_i, kt)
            call csr_dmul_mv(h_array, h_row, h_col, kt, k2)
            ! computing k3
            call omp_daxpy(0.5d0*dt, k2, psi_i, kt)
            call csr_dmul_mv(h_array, h_row, h_col, kt, k3)
            ! computing k4
            call omp_daxpy(1.0d0*dt, k3, psi_i, kt)
            call csr_dmul_mv(h_array, h_row, h_col, kt, k4)
            ! parallelising: psi = psi + (dt/6)*(k1 + 2*k2 + 2*k3 + k4)
            call omp_daxpy(dt/6.0d0, k1, psi_i, kt)
            call omp_daxpy(dt/3.0d0, k2, kt, psi_i)
            call omp_daxpy(dt/3.0d0, k3, psi_i, kt)
            call omp_daxpy(dt/6.0d0, k4, kt, psi_i)

            ! intermediate normalisation only for itp
            call omp_normalize(psi_i)
            ! time step increment
            t = t + dt
            tstep = tstep + 1
            
            ! calculating energy (Ei)
            call csr_dmul_mv(h_array, h_row, h_col, psi_i, kt)
            call dmul_ddot(kt, psi_i, Ei)   ! Ei = <\psi|-\hat{H}|\psi>
            Ei = -1.0d0*Ei                    ! multiply -1.0d0 to get correct energy
            dE = abs(E0 - Ei)
            E0 = Ei
            ! checking for Energy convergence
            if (dE < etol) then
                ! norm
                call dmul_ddot(psi_i, psi_i, norm)
                ! autocorr
                call dmul_ddot(psi0, psi_i, autocorr)
                write(100,'(4f32.16)') t, norm, autocorr, Ei
                psi0 = psi_i
                exit
            end if
        end do    
        write(100,*) ""
        write(100,*) "Summary of Imaginary time propagation"
        write(100,*) "-------------------------------------"
        write(100,'(a,i16)')   "n       = ", n
        write(100,'(a,f16.8)') "dt      = ", dt
        write(100,'(a,i16)')   "nstep   = ", tstep
        write(100,'(a,f16.8)') "E_conv  = ", etol
        write(100,'(a,f16.8)') "Ei      = ", Ei
        write(100,*) ""
        close(100)
    end subroutine RK4 


How do you know? Is that by observing the output of the top command? If yes, what you are seeing is an average occupation of the cores, which is affected by the serial parts of your code.

Yes, this is with both top and htop; the average seems to be ~2400% (~24 threads) even when OMP_NUM_THREADS=32.

I’ve observed similar behavior before, where MKL wouldn’t use more than 4 cores (8 threads), despite 8 cores (16 threads) being available. My uninformed guess is that MKL has some internal heuristic (perhaps depending on the number of nonzeros and the sparsity pattern?) to select a suitable number of threads.

Maybe you can find some information here: Managing Performance and Memory. I think it also depends on which Sparse BLAS backend you are using, e.g. pthreads, OpenMP, or the TBB backend.

In one of the threads (admittedly, quite old) on the Intel Software Forum, an employee from Intel wrote:

but it should be noted, that behavior will depends on input size task. Sometimes, the decreasing the number of active threads will produce the better performance because threads overhead with small task

From a second thread:

since the sparsity of the second matrix is very and very low it’s impossible to provide efficient parallelization of sparse matrix-vector multiply for a matrix with roughly 70K of non-zeros simply because it’s not enough computational work for many threads. This explains why only 1 thread is faster than 2 or 4 threads in this case.

And a third one:

Now, you should use (for a lot of high performance data intensive operations like found in math libraries) as many threads as there are physical cores. So if you have hyperthreading turned on, you should probably only use half the available threads (=the number of physical cores).

Can you try running a larger problem and check if the number of threads increases?

It’s also very important which of the 24 threads on the 256 thread machine it is using and how they are bound to the cores and NUMA domains. I’m guessing you are using some type of four-to-eight socket Xeon server system?

You can launch your application with OMP_DISPLAY_ENV=VERBOSE ./executable or add export OMP_DISPLAY_ENV=VERBOSE to your batch script to get some information about the thread affinity settings.

Btw, when it comes to this linear combination, I think you could get some minor improvement from a fused kernel, similar to the fused vector operations in the SUNDIALS NVector. The CPU should be able to handle 5 independent memory streams and find more opportunities for overlapping the operations. It would also reduce overhead by opening the OpenMP parallel region once instead of four times.
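
Something along the lines of the sketch below, using the same array names as your RK4 routine (dp is defined locally just so the snippet is self-contained):

    ! fused update psi = psi + (dt/6)*(k1 + 2*k2 + 2*k3 + k4) in one sweep,
    ! instead of four separate daxpy-style passes over the arrays
    subroutine omp_rk4_update(dt, k1, k2, k3, k4, psi)
        implicit none
        integer, parameter :: dp = kind(1.0d0)
        real(dp), intent(in)    :: dt
        real(dp), intent(in)    :: k1(:), k2(:), k3(:), k4(:)
        real(dp), intent(inout) :: psi(:)
        real(dp) :: c
        integer  :: i
        c = dt/6.0_dp
        !$omp parallel do default(none) shared(k1, k2, k3, k4, psi, c) private(i)
        do i = 1, size(psi)
            psi(i) = psi(i) + c*(k1(i) + 2.0_dp*k2(i) + 2.0_dp*k3(i) + k4(i))
        end do
        !$omp end parallel do
    end subroutine omp_rk4_update

This would replace the four omp_daxpy calls in that step, and since psi_i is updated in place, the kt temporary is not needed for this particular combination.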

Speaking of OpenMP parallel regions, it would be an interesting exercise to push the parallel region out of the main iteration loop (but perhaps not worth it at this stage when the potential benefit is still unclear).
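
The shape of that would be roughly as in the toy sketch below; in the real code the SpMV, the I/O, and the convergence test would have to sit inside single/master sections, and exiting the loop based on a shared convergence flag needs some extra care:

    ! one parallel region spanning the whole iteration loop; the worksharing
    ! "omp do" constructs reuse the existing thread team instead of
    ! forking/joining a team at every step
    subroutine iterate_hoisted(a, x, y, nsteps)
        implicit none
        integer, parameter :: dp = kind(1.0d0)
        real(dp), intent(in)    :: a
        real(dp), intent(inout) :: x(:), y(:)
        integer,  intent(in)    :: nsteps
        integer :: i, step
        !$omp parallel default(none) shared(a, x, y, nsteps) private(i, step)
        do step = 1, nsteps              ! the sequential loop is replicated on every thread
            !$omp do
            do i = 1, size(x)
                y(i) = a*x(i) + y(i)     ! stand-in for one of the vector updates
            end do
            !$omp end do
            ! (implicit barrier above keeps the steps in order)
            !$omp single
            x = y                        ! stand-in for the per-step serial bookkeeping
            !$omp end single
        end do
        !$omp end parallel
    end subroutine iterate_hoisted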

Addendum: studying the daxpy and fused RK4 kernels a little in Compiler Explorer, it appears that both kernels can be scheduled in a pipeline such that only 1 cycle per vector length is needed.

Thanks for the suggestion. I’ve been trying to profile my code using Intel VTune, and I found that the omp_daxpy routine has an imbalance: there is about 10-30% idle time (depending on the size of the matrix) while executing these steps.

Additionally, there’s a lot of wait/idle time during the sparse matrix-vector multiplication step that uses mkl_dcsrsymv. However, I am not sure whether this is due to MKL itself or to the fact that I’m using Intel MKL on an AMD machine.