A matrix of rank 10^6 in double precision (8 bytes per entry) would require roughly 7450 GB of storage. Assuming that only 2 % of the entries are non-zero, storage of the matrix entries drops to 150 GB. Add another 75 GB for the non-zero pattern in CSR format. We are still within the 250 GB of memory you said a single (multi-core) CPU has available. Storage of the vectors y_i, k1-k4, and kt, 10^6 elements each, is merely 46 MB, and very likely fits entirely into the L3 cache.
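For reference, the arithmetic behind those figures (binary GB, and assuming 4-byte integers for the CSR column indices):

dense entries: 10^6 × 10^6 × 8 B = 8 × 10^12 B ≈ 7450 GB
2 % non-zero:  0.02 × 8 × 10^12 B ≈ 150 GB, i.e. 2 × 10^10 stored values
CSR indices:   2 × 10^10 × 4 B ≈ 75 GB (the row-pointer array of 10^6 + 1 integers is negligible)
work vectors:  6 × 10^6 × 8 B = 48 MB ≈ 46 MB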
Regarding the question of performance advantages of sparse over dense matrix methods, I’ve found the following HPCwire article and the paper it references to be an interesting read. It echoes my suspicion that advances in hardware and the availability of large, cheap pools of memory make the complexity and overhead of sparse methods less attractive than simply using dense methods on GPUs or multi-core CPUs for several classes of problems (such as the machine-learning problem described in the article). A comparison of sparse vs. dense methods for a wider range of problems on modern hardware would be helpful.
I find the article title misleading, because the hypothesis that led to the improved performance was:
The researcher suspected the matrices might not be as sparse as originally assumed [emphasis mine] (hence they emphasize “assumed sparse” in their discussions), and they knew that the benefits of memory savings in such cases are diminished, as they can easily be overwhelmed by the additional cost of the matrix operations.
So being aware of your matrix structure appears to pay off.
Totally agree about knowing the matrix structure. That’s why I think a similar analysis comparing sparse vs. dense matrix methods for, say, large-scale static finite element problems across a variety of hardware architectures would be a useful addition to the knowledge base. On the other hand, just what is the figure of merit, in terms of the number of zero vs. non-zero elements, that one would use to define what “sparse” means? 40 % zeros, 20 % zeros, etc.?
It varies, depending upon the structure and the hardware you are trying to target. A vague definition would be “enough that it is worth exploiting”. In cuSPARSE and MKL there are block-sparse formats, where some zeros may be stored, but in exchange you get nice “dense” blocks on which you can do dense computation. When matrices are symmetric, A = A^T, or the algorithms involve the calculation of matrix powers y = A^p x (matrix polynomials, s-step Krylov methods), there are more tricks which can be exploited.
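For intuition, the kernel of a plain CSR matrix-vector product y = A*x only ever touches the stored entries, which is all that “exploiting” the zeros means in the simplest case. A minimal sketch, with hypothetical array names val, col and rowptr:

subroutine csr_matvec(n, rowptr, col, val, x, y)
   ! y = A*x for A stored in CSR:
   !   val(k)    = k-th stored nonzero (row by row)
   !   col(k)    = its column index
   !   rowptr(i) = position of the first stored entry of row i; rowptr(n+1) = nnz + 1
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, intent(in)   :: n, rowptr(:), col(:)
   real(dp), intent(in)  :: val(:), x(:)
   real(dp), intent(out) :: y(:)
   integer :: i, k
   do i = 1, n
      y(i) = 0.0_dp
      do k = rowptr(i), rowptr(i+1) - 1
         y(i) = y(i) + val(k) * x(col(k))
      end do
   end do
end subroutine csr_matvec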
I’ve seen some work from the libCEED and MFEM communities where matrix-free approaches beat assembled matrices by large margins. See for example Fig. 7 in this work: [2204.01722] Performance Portable Solid Mechanics via Matrix-Free $p$-Multigrid, or other works by Jed Brown. Another person working heavily in this area, specifically on matrix-free discontinuous Galerkin methods, is Martin Kronbichler.
AFAIK, the advantage of these algorithms is that, because they perform more flops per degree of freedom, they are better suited to today’s “imbalanced” architectures, where you can do far more flops than memory accesses.
(Image taken from the NHR PerfLab Seminar: A Review of Processor Advances Over the Past Thirty Years)
In some application fields, a sparse matrix is one where the number of nonzero elements scales as O(N) rather than O(N^2). In others, it means that the nonzeros are a small fraction of the N^2 entries, like 1 % or 10 %.
Nope. In Fortran, all arrays are born free and remain... contiguous; this is a requirement of the standard.
As suggested above, using a parallel BLAS implementation could help. Or you can write a multithreaded matrix-matrix multiplication yourself; with OpenMP this is quite easy, along the lines of the sketch below.
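A minimal sketch (plain triple loop, no blocking or other cache tricks, so a tuned parallel dgemm will still be considerably faster; this only shows how little is needed to get the threading going):

subroutine omp_matmul(A, B, C)
   ! C = A*B, with the columns of C distributed over the OpenMP threads.
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   real(dp), intent(in)  :: A(:,:), B(:,:)
   real(dp), intent(out) :: C(:,:)
   integer :: i, j, k, n, m, p
   n = size(A,1); p = size(A,2); m = size(B,2)
   !$omp parallel do default(none) shared(A, B, C, n, m, p) private(i, j, k)
   do j = 1, m
      C(:,j) = 0.0_dp
      do k = 1, p
         do i = 1, n               ! innermost loop is stride-1 in A and C
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
         end do
      end do
   end do
   !$omp end parallel do
end subroutine omp_matmul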
The right way if your matrices are sparse is to use a sparse matrix library.
Or, as suggested by @ivanpribec, to not explicitly store the matrix at all. Imagine you want to perform a 3-point sliding sum on a long vector of size n: mathematically it can be expressed as a matrix-vector multiplication y = M.x, where M is an n*n tridiagonal matrix with the constant 1.0/3.0 along the three diagonals. But forming the matrix and performing a classical matrix-vector multiplication would be totally inefficient (and not even possible for large n). Instead, one writes a routine that just does “as if”, without ever forming the matrix:
subroutine sliding_average_3(x, y)
   ! 3-point sliding average applied "matrix-free":
   ! equivalent to y = M.x with M tridiagonal, but M is never formed.
   implicit none
   real, intent(in)  :: x(:)
   real, intent(out) :: y(:)
   integer :: i, n
   n = size(x)
   y(1) = x(1) + x(2)            ! boundary: only two terms available
   do i = 2, n-1
      y(i) = sum(x(i-1:i+1))     ! interior: 3-point sum
   end do
   y(n) = x(n-1) + x(n)          ! boundary: only two terms available
   y(:) = y(:) / 3.0
end subroutine
Sliding sums/averages can also be computed by keeping track of the current value, and then subtracting off the “first” element and adding the new “last” element to compute the next value. No matter the width of the window, there are always only those two floating-point operations for each new element. There can be problems with roundoff errors with this algorithm, so this is a good situation for using a compensated summation or an extended-precision intermediate, or you can just restart every so often to keep the errors from propagating too far.
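For the 3-point case the saving is marginal, but here is a sketch of that running-sum formulation (same edge convention as the routine above), just to make the idea concrete:

subroutine sliding_average_3_running(x, y)
   ! 3-point sliding average via a running sum: drop the oldest sample,
   ! add the newest one; two additions per output element, independent
   ! of the window width.
   implicit none
   real, intent(in)  :: x(:)
   real, intent(out) :: y(:)
   real    :: s
   integer :: i, n
   n = size(x)
   y(1) = (x(1) + x(2)) / 3.0       ! same edge handling as above
   s = x(1) + x(2) + x(3)
   y(2) = s / 3.0
   do i = 3, n-1
      s = s - x(i-2) + x(i+1)       ! update the window sum
      y(i) = s / 3.0
   end do
   y(n) = (x(n-1) + x(n)) / 3.0
end subroutine sliding_average_3_running

For wide windows the periodic restart mentioned above is easy to add: every few thousand steps, recompute s from scratch to flush the accumulated round-off.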
Regarding the choice of explicit matrix computations or “operator” formulations, one aspect to consider is the number of times each matrix element will be used. If it is only used once, or just a few times, then it might be a good situation for an “operator” formulation. If it is used many times, then the explicit matrix computation and storage might be indicated.
My bad, I should’ve given more details. I am trying to use RK4 to solve the time-dependent Schrödinger equation for different kinds of Hamiltonians. func could be of different forms, and usually results from an FEM analysis of the problem. What started out as a simple task of writing up RK4 routines using BLAS functions for larger matrices has metamorphosed into all these problems.
I’ve tried implementing a matrix-free approach by computing my matrix elements on the fly, and it seems to work. However, I believe this should be the case only when func is dense and not sparse.
I think what you’re referring to is similar to an n-point stencil form for approximating derivatives in finite element methods. In that case func would be limited to a band matrix, and I guess using zgbmv would be more efficient.
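For reference, the banded storage that dgbmv/zgbmv expect packs a(i,j) into ab(ku+1+i-j, j). A small self-contained sketch with the real-valued dgbmv and a tridiagonal operator as a stand-in:

program band_mv_example
   ! Apply the tridiagonal operator diag(-1, 2, -1) with the banded BLAS
   ! kernel dgbmv. kl/ku are the numbers of sub-/super-diagonals.
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, parameter :: n = 5, kl = 1, ku = 1, lda = kl + ku + 1
   real(dp) :: ab(lda, n), x(n), y(n)
   integer :: j
   external :: dgbmv
   ab = 0.0_dp
   do j = 1, n
      ab(ku+1, j) = 2.0_dp                  ! diagonal element a(j,j)
      if (j > 1) ab(ku,   j) = -1.0_dp      ! super-diagonal  a(j-1,j)
      if (j < n) ab(ku+2, j) = -1.0_dp      ! sub-diagonal    a(j+1,j)
   end do
   x = 1.0_dp
   call dgbmv('N', n, n, kl, ku, 1.0_dp, ab, lda, x, 1, 0.0_dp, y, 1)
   print *, y      ! expect 1 0 0 0 1 for this stencil applied to a vector of ones
end program band_mv_example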
I tried to implement this, and I’ve attached snippets from my code below.
I am not able to run it on all the threads allocated to the job. For example, even with OMP_NUM_THREADS=32 on a machine with 256 threads, it seems to use up only 24 threads at best.
! parallelised dot product of two vectors
subroutine omp_dotprod(psi_i, psi_j, result)
   implicit none
   real(dp), dimension(:), intent(in) :: psi_i, psi_j
   real(dp), intent(out) :: result
   integer :: u
   integer :: ndim
   result = 0.0d0
   ndim = size(psi_i)
   !$omp parallel default(none) private(u) shared(ndim, psi_i, psi_j, result)
   !$omp do reduction ( + : result )
   do u = 1, ndim
      result = result + psi_i(u) * psi_j(u)
   end do
   !$omp end do
   !$omp end parallel
end subroutine
! parallelised constant times a vector plus a vector (with double precision)
subroutine omp_daxpy(a, x_array, y_array, z_array)
   implicit none
   real(dp), intent(in) :: a
   real(dp), dimension(:), intent(in) :: x_array, y_array
   real(dp), dimension(:), intent(out) :: z_array
   integer :: i, ndim
   ndim = size(x_array)
   z_array = 0.0d0
   !$omp parallel default(none) private(i) shared(ndim, a, x_array, y_array, z_array)
   !$omp do
   do i = 1, ndim
      z_array(i) = a*x_array(i) + y_array(i)
   end do
   !$omp end do
   !$omp end parallel
end subroutine
! parallelised vector normalisation
subroutine omp_normalize(psi)
   implicit none
   real(dp), intent(inout) :: psi(:)
   real(dp) :: norm, inv_norm
   integer :: u, ndim
   ndim = size(psi)
   call omp_dotprod(psi, psi, norm)
   inv_norm = 1.0d0/sqrt(norm)
   !$omp parallel default(none) private(u) shared(ndim, inv_norm, psi)
   !$omp do
   do u = 1, ndim
      psi(u) = inv_norm * psi(u)
   end do
   !$omp end do
   !$omp end parallel
end subroutine omp_normalize
! wrapper for MKL SPARSE DSYMV (mkl_dcsrsymv, symmetric CSR matrix-vector product)
subroutine csr_dmul_mv(A, arow, acol, X, Y)
   implicit none
   integer, allocatable, dimension(:), intent(in) :: arow, acol
   real(dp), allocatable, dimension(:), intent(in) :: A
   real(dp), allocatable, dimension(:), intent(in) :: X
   real(dp), allocatable, dimension(:), intent(inout) :: Y   ! written by mkl_dcsrsymv
   integer :: m
   character(len=1) :: uplo
   uplo = 'U'
   m = size(arow) - 1
   call mkl_dcsrsymv(uplo, m, A, arow, acol, X, Y)
end subroutine csr_dmul_mv
! wrapper for DDOT
subroutine dmul_ddot(X, Y, res)
   implicit none
   real(dp), allocatable, dimension(:), intent(in) :: X, Y
   real(dp), intent(out) :: res
   real(dp), external :: ddot
   integer :: incx, incy, ndim
   ndim = size(X, dim=1)
   incx = 1
   incy = 1
   res = ddot(ndim, x, incx, y, incy)
end subroutine
! imaginary time propagation for a sparse hamiltonian
subroutine RK4(h_array, h_row, h_col, psi0, dt, print_nstep, etol, Ei, tstep)
   implicit none
   real(dp), allocatable, intent(inout) :: h_array(:)
   integer, allocatable, intent(in) :: h_row(:)
   integer, allocatable, intent(in) :: h_col(:)
   real(dp), allocatable, intent(inout) :: psi0(:)
   integer, intent(in) :: print_nstep
   integer, intent(inout) :: tstep
   real(dp), intent(inout) :: Ei
   real(dp), intent(in) :: dt
   real(dp), intent(in) :: etol
   real(dp) :: t                                    ! time
   real(dp), allocatable, dimension(:) :: psi_i
   real(dp), allocatable, dimension(:) :: k1, k2, k3, k4
   real(dp), allocatable, dimension(:) :: kt
   real(dp) :: const                                ! constant for daxpy operations (currently unused)
   real(dp) :: norm, autocorr, E0, dE
   real(dp) :: a                                    ! (currently unused)
   integer :: ndim
   ! allocating arrays
   ndim = size(psi0)
   a = -1.0d0
   allocate(k1(ndim), k2(ndim), k3(ndim), k4(ndim))
   allocate(kt(ndim))
   allocate(psi_i(ndim))
   ! initialising some variables
   t = 0.0d0
   tstep = 0
   const = 0.0d0
   psi_i = psi0
   E0 = 0.0d0
   ! changing H -> -H, so that csr_dmul_mv below gives -(\hat{H}.\psi)
   h_array = -1.0d0*h_array
   call csr_dmul_mv(h_array, h_row, h_col, psi_i, kt)
   call dmul_ddot(kt, psi_i, Ei)   ! Ei = <\psi|-\hat{H}|\psi>
   Ei = -1.0d0*Ei                  ! multiply by -1.0d0 to get the correct energy
   open(100, file='imag_tprop.out')
   ! OUTFILE headers
   write(100,*) ""
   write(100,*) "Imaginary Time Propagation"
   write(100,*) "--------------------------"
   write(100,*) ""
   ! column headers
   write(100,'(4a32)') 'Imag. time (a.u.)', '< psi_i | psi_i >', '< psi_i | psi_0 >', 'Energy (hartree)'
   do
      ! print output to imag_tprop.out
      if (modulo(tstep, print_nstep) == 0) then
         ! norm
         call dmul_ddot(psi_i, psi_i, norm)
         ! autocorr
         call dmul_ddot(psi0, psi_i, autocorr)
         write(100,'(4f32.16)') t, norm, autocorr, Ei
      end if
      ! RK4 step
      ! computing k1
      call csr_dmul_mv(h_array, h_row, h_col, psi_i, k1)
      ! computing k2
      call omp_daxpy(0.5d0*dt, k1, psi_i, kt)
      call csr_dmul_mv(h_array, h_row, h_col, kt, k2)
      ! computing k3
      call omp_daxpy(0.5d0*dt, k2, psi_i, kt)
      call csr_dmul_mv(h_array, h_row, h_col, kt, k3)
      ! computing k4
      call omp_daxpy(1.0d0*dt, k3, psi_i, kt)
      call csr_dmul_mv(h_array, h_row, h_col, kt, k4)
      ! parallelised update: psi = psi + (dt/6)*(k1 + 2*k2 + 2*k3 + k4)
      call omp_daxpy(dt/6.0d0, k1, psi_i, kt)
      call omp_daxpy(dt/3.0d0, k2, kt, psi_i)
      call omp_daxpy(dt/3.0d0, k3, psi_i, kt)
      call omp_daxpy(dt/6.0d0, k4, kt, psi_i)
      ! intermediate normalisation (only for imaginary time propagation)
      call omp_normalize(psi_i)
      ! time step increment
      t = t + dt
      tstep = tstep + 1
      ! calculating energy (Ei)
      call csr_dmul_mv(h_array, h_row, h_col, psi_i, kt)
      call dmul_ddot(kt, psi_i, Ei)   ! Ei = <\psi|-\hat{H}|\psi>
      Ei = -1.0d0*Ei                  ! multiply by -1.0d0 to get the correct energy
      dE = abs(E0 - Ei)
      E0 = Ei
      ! checking for energy convergence
      if (dE < etol) then
         ! norm
         call dmul_ddot(psi_i, psi_i, norm)
         ! autocorr
         call dmul_ddot(psi0, psi_i, autocorr)
         write(100,'(4f32.16)') t, norm, autocorr, Ei
         psi0 = psi_i
         exit
      end if
   end do
   write(100,*) ""
   write(100,*) "Summary of Imaginary time propagation"
   write(100,*) "-------------------------------------"
   write(100,'(a,i16)') "ndim   = ", ndim
   write(100,'(a,f16.8)') "dt     = ", dt
   write(100,'(a,i16)') "nstep  = ", tstep
   write(100,'(a,f16.8)') "E_conv = ", etol
   write(100,'(a,f16.8)') "Ei     = ", Ei
   write(100,*) ""
   close(100)
end subroutine RK4
How do you know? Is that from observing the display of the top command? If so, what you are seeing is an average occupation of the cores, which is affected by the serial parts of your code.
Yes, this is with both top and htop; the average seems to be ~2400 % (~24 threads) even when OMP_NUM_THREADS=32.
I’ve observed similar behavior before, where MKL wouldn’t use more than 4 cores (8 threads), despite having 8 available (16 threads). My uninformed guess is that MKL has some internal heuristic (perhaps depending on the number of nonzeros, and the sparsity pattern?) to select a suitable number of threads.
Maybe you can find some information here: Managing Performance and Memory. I think it also depends on which Sparse BLAS backend you are using, e.g. pthreads, OpenMP, or the TBB backend.
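If it is such a heuristic, it may be possible to override it with the MKL service routines (mkl_set_dynamic, mkl_set_num_threads, mkl_get_max_threads from the mkl_service module); treat the sketch below as an assumption to check against the MKL documentation for your version:

program check_mkl_threads
   use mkl_service
   implicit none
   call mkl_set_dynamic(0)          ! ask MKL not to reduce the thread count on its own
   call mkl_set_num_threads(32)
   print *, 'MKL will use at most ', mkl_get_max_threads(), ' threads'
end program check_mkl_threads

Setting MKL_DYNAMIC=false and MKL_NUM_THREADS=32 in the environment should have the same effect.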
In one of the threads (admittedly, quite old) on the Intel Software Forum, an employee from Intel wrote:
but it should be noted that the behavior will depend on the size of the input task. Sometimes decreasing the number of active threads will produce better performance, because of the threading overhead with a small task
From a second thread:
since the sparsity of the second matrix is very, very low, it’s impossible to provide efficient parallelization of a sparse matrix-vector multiply for a matrix with roughly 70K non-zeros, simply because there is not enough computational work for many threads. This explains why only 1 thread is faster than 2 or 4 threads in this case.
And a third one:
Now, you should use (for a lot of high-performance, data-intensive operations like those found in math libraries) as many threads as there are physical cores. So if you have hyperthreading turned on, you should probably only use half the available threads (= the number of physical cores).
Can you try running a larger problem and check if the number of threads increases?
It’s also very important which of the 24 threads on the 256 thread machine it is using and how they are bound to the cores and NUMA domains. I’m guessing you are using some type of four-to-eight socket Xeon server system?
You can launch your application with OMP_DISPLAY_ENV=VERBOSE ./executable or add export OMP_DISPLAY_ENV=VERBOSE to your batch script to get some information about the thread affinity settings.
Btw, when it comes to this linear combination, I think you could get some minor improvements from a fused kernel, similar to the SUNDIALS NVector fused operations. The CPU should be able to handle 5 independent memory streams and find more opportunities for overlapping the operations. It would also reduce the overhead from the OpenMP parallel region, which would be opened once instead of four times.
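A sketch of such a fused update (assuming the same real(dp) kind as the snippets above), which replaces the four omp_daxpy calls and touches each array exactly once:

subroutine omp_rk4_update(dt, k1, k2, k3, k4, psi)
   ! psi <- psi + (dt/6)*(k1 + 2*k2 + 2*k3 + k4), in a single pass
   ! over memory and a single OpenMP parallel region.
   implicit none
   real(dp), intent(in)    :: dt
   real(dp), intent(in)    :: k1(:), k2(:), k3(:), k4(:)
   real(dp), intent(inout) :: psi(:)
   integer :: i
   !$omp parallel do default(none) shared(dt, k1, k2, k3, k4, psi) private(i)
   do i = 1, size(psi)
      psi(i) = psi(i) + (dt/6.0d0)*(k1(i) + 2.0d0*k2(i) + 2.0d0*k3(i) + k4(i))
   end do
   !$omp end parallel do
end subroutine omp_rk4_update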
Speaking of OpenMP parallel regions, it would be an interesting exercise to push the parallel region out of the main iteration loop (but perhaps not worth it at this stage when the potential benefit is still unclear).
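A minimal illustration of what that would look like, using an orphaned !$omp do inside a kernel that is called from a parallel region opened outside the loop (a toy sketch, not a drop-in replacement for the RK4 driver):

module orphaned_kernels
   implicit none
   integer, parameter :: dp = kind(1.0d0)
contains
   subroutine scale_in_place(a, s)
      real(dp), intent(inout) :: a(:)
      real(dp), intent(in)    :: s
      integer :: i
      !$omp do                        ! orphaned work-sharing construct
      do i = 1, size(a)
         a(i) = s * a(i)
      end do
      !$omp end do                    ! implicit barrier keeps the steps in order
   end subroutine scale_in_place
end module orphaned_kernels

program hoisted_region
   use orphaned_kernels
   implicit none
   real(dp), allocatable :: v(:)
   integer :: step
   allocate(v(1000000)); v = 1.0_dp
   !$omp parallel default(shared) private(step)
   do step = 1, 10                    ! the loop control is executed by every thread
      call scale_in_place(v, 0.5_dp)  ! the threads share the work inside
   end do
   !$omp end parallel
   print *, v(1)                      ! 0.5**10
end program hoisted_region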
Addendum: after studying the daxpy and fused RK4 kernels a little bit in Compiler Explorer, it appears that both kernels can be scheduled in a pipeline such that only 1 cycle per vector length is needed.
Thanks for the suggestion. I’ve been trying to profile my code using Intel VTune, and I found that the omp_daxpy routine has an imbalance: there is about 10-30 % idle time (depending on the size of the matrix) while executing these steps.
Additionally, there’s a lot of wait/idle time during the sparse matrix-vector multiplication step that uses mkl_dcsrsymv. However, I am not sure whether this is due to MKL itself or whether the slowdown comes from the fact that I’m using Intel MKL on an AMD machine.