Specialized matrix decomposition

Radial basis function generated finite differences are a young numerical method for solving PDEs or performing interpolation on unstructured points in multiple dimensions. An approachable description of the method for solving PDEs can be found in the book by Fornberg and Flyer, A Primer on Radial Basis Functions with Applications to the Geosciences, SIAM, 2015, or alternatively, the article by the same authors, Solving PDEs with Radial Basis Functions, Acta Numerica, 24, 2015

(If I’m not mistaken the idea of using radial basis functions for function approximation is originally from MJD Powell , who is also the original author of COBYLA and other optimization algorithms available in PRIMA.)

Modern variations of the RBF-method involve solving dozens of small linear systems for the shape factors in the irregular stencils. The linear systems have the form,

\begin{bmatrix} A & P \\ P^T & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} L \phi \\ L p \end{bmatrix}

where A is a matrix with entries A_{ij} = \phi(|\boldsymbol{x}_i - \boldsymbol{x}_j|), that are the radial basus functions \phi(\bullet) (for instance \phi(r) = r^3), and P is a polynomial augmenation matrix. In 2-d this will usually be a Vandermonde-type matrix,

P = \begin{bmatrix} 1 & x_1 & y_1 & x_1^2 & x_1 y_1 & y_1^2 & \dots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_n & y_n & x_n^2 & x_n y_n & y_n^2 & \dots \end{bmatrix}

where the maximum order of the polynomial basis, n_P, determines the accuracy of the approximation.

In Fortran pseudo-code, the setup of the matrix for a 2-d problem looks something like this:

integer, parameter :: npoly = 6
integer, parameter :: ix(npoly) = [0,1,0,2,1,0], &
                      iy(npoly) = [0,0,1,0,1,2]

real(wp) :: A(n,n), P(n,npoly), M(n+npoly,n+npoly)

 ! RBF part, A_ij = phi(|x_i - x_j|)
 A = phi( hypot(spread(x,1,n)-spread(x,2,n), &
                spread(y,1,n)-spread(y,2,n) )

 ! Polynomial part, P
 do k = 1, npoly
    P(1:n,k) = (x**ix(k)) * (y**iy(k))
 end do

 ! Full matrix, M = [A P; P^T 0]
 M(1:n,1:n) = A
 M(1:n,n+1:n+npoly) = P
 M(n+1:n+npoly,1:n) = transpose(P)
 M(n+1:n+npoly,n+1:n+npoly) = 0

The next step is to solve the RBF-FD system. In MATLAB this can be achieved with the backslash operator - M\rhs. In Fortran I’m using LAPACK directly:

! ... prepare M
np = n + npoly
call dgetrf(np, np, M, np, ipiv, info) ! factorize

! ... prepare RHS 
call dgetrs('N', np, 1, M, ldm, ipiv, rhs, np, info) ! solve

In practice, the dimensions of the matrices are n from 9 to 60, and npoly in the range 6-28 (sufficient for a sixth order accurate method in 2-d), so the individual matrices are not that large.

Since these operations are to be performed millions of times during the course of the computation, I’ve been wondering if it’s possible to exploit the known block structure and other matrix properties to increase the performance of the factorization and solution steps?

I posted the same question two years ago on SciComp Stack Exchange, but to no avail:


I believe E.J. Kansa was the first to try to apply RBFs to solving PDEs but I could be wrong. I believe Powell’s work just focused on interpolation. Search for Kansa’s papers on multi-quadrics. Most meshless solution methods (Moving Least Squares (MLS), Smooth Particle Hydrodynamics (SPH), Reproducing Kernel Particle Methods (RKPM) and Generalized Finite Difference Methods) share some of the problems you are facing so you might want to search the literature in those areas (to extensive to list here). If your particles are fixed in space then you might be able to use some kind of blocking technique (maybe based on quadtrees and octrees to bin points into fixed locations). If the points are moving I would think you have do a nearest neighbor seach of some kind. Also, how do you know your resulting Vandermonde matrix is not ill-conditioned (they are notorious for having condition problems)

There are some theorems which guarantee the RBF + poly system is uniquely solvable, as long as the points are unisolvent. This has to do with the (conditional) positive-definiteness of the RBFs. There are a few works explaining why the RBF + poly approach is more robust than classic WLS/MLS: Comparison of Moving Least Squares and RBF+poly for Interpolation and Derivative Approximation | Journal of Scientific Computing

That’s correct. Kansa’s work is mainly on the global RBF approach. I’m more interested in the local RBF-FD approach, where I have many smaller stencils, resulting in a sparse operator, instead of a dense global once.

I know that Sandia has a code named Compadre (GitHub - sandialabs/compadre: Compadre (Compatible Particle Discretization and Remap)). The inversion of the shape-function matrices appears to be handled by batched Kokkos and KokkosKernels, which I assume are just calling the LAPACK least-squares routines underneath the layers of C++ templates, but I admit I haven’t verified this or tested the performance of that library against my naive Fortran+LAPACK+OpenMP approach (without any kind of batching). Edit: the strategy used by Compared is documented in the Wiki:

We now use a KokkosKernels implementation of QR+Pivoting, removing any reliance on LAPACK or CUSOLVER + CUBLAS.

A couple of years ago I worked in the group that authored a meshfree PDE solver library named Medusa (also in C++) and they were using Eigen with standard LU factorization with pivoting to handle their stencil computations.

cuBLAS (and I believe a few other BLASes too) has batched kernels for doing multiple (small) factorizations and solves simultaneously, i.e. cublasgetrfBatched() and cublasgetrsBatched(), but they are still based on a general LU factorization, without any further knowledge of the matrix structure.

In the static case, the shape function calculation only has to be done once, so it’s not really worth trying to optimize things any further. But in the dynamic-case, the re-assembly of the global operators can indeed by expensive, which is why I’d like to do it faster if possible.

I suppose I’m looking for something closer to ReLAPACK that uses tailored block algorithms, or something similar to the factored approach for finite element operators used in libCEED.

For the 2-by-2 block matrix above, the block decomposition of the RBF-FD is given in a work by Victor Bayona. Multiplying the first row of the system above by P^T A^{-1} gives

\begin{bmatrix} P^T & P^T A^{-1}P \\ P^T & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} P^T A^{-1} L \phi \\ L p \end{bmatrix}

subtracting the rows to find \beta,

\beta = (P^T A^{-1}P)^{-1} (P^T A^{-1} L \phi - L p)

and substituting it back into the original system back to find an explicit formula for \alpha,

\begin{aligned} \alpha &= A^{-1}L\phi - A^{-1}P (P^T A^{-1}P)^{-1} (P^T A^{-1} L \phi - L p) \\ &= [I - WP^T]A^{-1}L\phi + W Lp \end{aligned}

where W = A^{-1}P (P^T A^{-1}P)^{-1}.

But I don’t see any computational benefit at the moment.

Maybe I’m being naive, but why not solving the second equation for \alpha, and then the first for \beta?

P^T \alpha = L p
P \beta = L \phi - A \alpha

This way, you wouldn’t have to store M, and only need to factorize the much smaller P.

P isn’t square, but of size n-by-n_P. The number of points in the stencil n will generally be higher than the polynomial degree.

Of course, then you are looking for a least-squares solution, for example based on the QR or SVD factorization of P.

I think it’s not the same. The full RBF+poly system is uniquely solvable so least squares are not needed. You can see from the block decomposition that \alpha is a linear combination of both the RBF and poly parts meaning it combines the properties of two different functional spaces.

But the idea of the Moving Least Squares method that @rwmsu mentioned, is essentially just to do polynomial interpolation, P^T \alpha = f, using the (weighted) least squares solution of the system.

1 Like

How about factorizing P=QR? Thus:

\begin{bmatrix} Q & 0 \\ 0 & I_{n^P} \end{bmatrix} \begin{bmatrix} Q^T A Q & R \\ R^T & 0 \end{bmatrix} \begin{bmatrix} Q^T & 0 \\ 0 & I_{n^P} \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} L\phi \\ L p \end{bmatrix}

which implies

\begin{bmatrix} Q^T A Q & R \\ R^T & 0 \end{bmatrix} \begin{bmatrix} Q^T\alpha \\ \beta \end{bmatrix} = \begin{bmatrix} Q^TL\phi \\ L p \end{bmatrix}

You could then follow @ricriv 's suggestion but instead solving

R^T Q\alpha=Lp

with just the upper triangular part of R (assuming that’s non-singular) and set the remaining elements to zero.

Extracted from of “Michael J. D. Powell. 29 July 1936—19 April 2015”, Biogr. Mems Fell. R. Soc.64, 341–366 (2018):

The next subject-matter are multivariate approximations using radial basis functions. Although subjected to a great deal of research by others, Mike was at that time one of the main persons in that area beginning with the review paper [41] that already contained new proofs of important results. As Yuan Xu said in an article in the Journal of Approximation Theory there was “a flow of preprints [on radial basis functions] pouring out of Powell’s group in Cambridge”.

[41] “Radial basis functions for multivariable interpolation: a review” , in Algorithms for Approximation (J.C. Mason and M.G. Cox, eds), Oxford University Press (Oxford), 143–167 (1987).

Thanks for the QR suggestion, I’ll need to study it a bit closer. From a naive first look the formula solving R^T Q \alpha = Lp, would also be the least-squares solution of just the polynomial part. I think the problem is that R is rectangular or something else about the dimensions is not necessarily consistent.

Today I read an article by Demmel, Higham, and Schreiber (1992) on Block LU Factorization (PDF, 997 KB). For a 2-by-2 block matrix, we can use the decomposition:

\begin{bmatrix} A & P \\ P^T &0 \end{bmatrix} = \begin{bmatrix} I & \\ L_{21} & I \end{bmatrix} \begin{bmatrix} U_{11} & U_{21} \\ & U_{22} \end{bmatrix}

Here the blocks U_{11}, U_{22}, and L_{21} are dense and not upper-triangular like they are in the LAPACK variation of this algorithm.

Based on the block decomposition I’ve sketched the following algorithm:

! Factorization of the symmetric positive-definite system
!   | A    P |  = | I     | | U11 U12 |
!   | P^T  0 |    | L21 I | |     U22 |
! using a block LU decomposition.
! We can compute the block factors this way:
! Step 1: U11 = A, U12 = P
! Step 2: Solve L21 U11 = P^T for L21
!         This is equivalent to solving A^T L21^T = P, which
!         can be achieved using the LU decomposition of A 
! Step 3: Calculate U22 = 0 - L21 U12 
!         * for the solution phase, we can calculate the 
!           inverse U22^(-1) and store it in-place
! Once the blocks factors are known we can solve the system M x = b
!       M       x      b
!   | A    P | |x1| = |b1| 
!   | P^T  0 | |x2|   |b2|
! using the following passes (can be done in-place)
! Step 1: Forward pass
!   f1 = b1
!   f2 = b2 - L21 f1
! Step 2: Backward pass
!   x2 = U22^(-1) f1
!   x1 = U11^(-1) (f1 - U12 x2)
!      = A^(-1) (f1 - P x2); (use existing factorization of A)

Here is a proof-of-concept code using QR factorization. I incorrectly stated above that the remaining elements should be set to zero, however they also have to be solved-for because (as you say) the solution is unique.

program test
implicit none
integer, parameter :: np=3,n=10,nnp=n+np
integer l,info,ipiv(nnp)
real(8) a(n,n),p(n,np)
real(8) x(n),y(np)
real(8) c(n,n),d(n,n)
real(8) e(n),f(n),g(n),h(n)
real(8) m(nnp,nnp),t(nnp)
real(8) tau(np),work(n)

! n × n matrix A
call random_number(a(:,:))
! n × nₚ matrix P
call random_number(p(:,:))
! vector Lϕ of length n
call random_number(x(:))
! vector Lp of length nₚ
call random_number(y(:))

!     Solution using full matrix M     !
! form the matrix M
! form the vector t = [ Lϕ, Lp ]
! solve the linear system
call dgesv(nnp,1,m,nnp,ipiv,t,nnp,info)

!     Solution using Q R factorization     !
! factorize P = Q R
call dgeqrf(n,np,p,n,tau,work,n,info)
! C = Qᵀ A
call dormqr('L','T',n,n,np,p,n,tau,c,n,work,n,info)
! D = Qᵀ A Q
call dormqr('R','N',n,n,np,p,n,tau,d,n,work,n,info)
! form Qᵀ Lϕ
call dormqr('L','T',n,1,np,p,n,tau,g,n,work,n,info)
! solve Rᵀ f = Lp for the first nₚ elements
call dtrsv('U','T','N',np,p,n,f,1)
! find the remaining n - nₚ elements of f
call dgemv('N',l,np,1.d0,d(np+1,1),n,f,1,0.d0,e,1)
call dgesv(l,1,d(np+1,np+1),n,ipiv,f(np+1),n,info)
! compute α = Q f
call dormqr('L','N',n,1,np,p,n,tau,f,n,work,n,info)
! form Qᵀ A α
call dgemv('N',n,n,1.d0,a,n,f,1,0.d0,h,1)
call dormqr('L','T',n,1,np,p,n,tau,h,n,work,n,info)
! set g = Qᵀ Lϕ - Qᵀ A α
! solve R β = g
call dtrsv('U','N','N',np,p,n,g,1)

! determine the difference between the two solutions (should be zero)

print *,'norm of difference',norm2(t(:))

end program


This particular problem looks very similar to the linear problem we have to solve when using a predictor-corrector scheme in incompressible fluid dynamics (with the difference that there L\beta = 0).
You can get away with only two LU (or Cholesky if A is sym. pos. def.), one for the n x n matrix \mathbf{A} and one for the npoly x npoly matrix \mathbf{P}^* \mathbf{A}^{-1} \mathbf{P}. You’d proceed in two steps:

  1. Get \beta by solving \mathbf{P}^* \mathbf{A}^{-1} \mathbf{P} \beta = \mathbf{P}^* \mathbf{A}^{-1} L\phi - Lp using the LU/Cholesky factorization of \mathbf{P}^* \mathbf{A}^{-1} \mathbf{P}.
  2. Get \alpha = \mathbf{A}^{-1} \left( L\phi - \mathbf{P}\beta \right) where the rhs is computed using the LU/Cholesky factorization of \mathbf{A}.

I don’t think you can make it any more efficient than that.

Thank you @jkd2022 for the neat example! That should fit nicely within my existing program. I will give it a test and report the performance/accuracy versus the whole-matrix LU approach.

@loiseaujc, thanks for the steps, I believe your approach aligns with the block decomposition approach I described above. The U_{22} block of the decomposition corresponds to P^* A^{-1} P.

In fluid mechanics there is also the similar saddle-point matrix originating from the discretization of the Stokes equations: K = [ A, B^T; B, -C], where A is typically a (vector) Laplacian, B^T and B are rectangular divergence matrices and -C is (a possibly 0) stabilization matrix. Typically the blocks will be sparse and Krylov solvers can exploit this block structure on top.

Yes, it is the same thing as the LU block formulation you’ve posted simply with the expressions of the blocks made clear. It indeed is the same as the Stokes saddle point problem. For Navier-Stokes with implicitation of the viscous terms, A no longer is the Laplacian but the Helmholtz operator (i.e. A = I - dt*L/Re).

Thank you @jkd2022 for the neat example! That should fit nicely within my existing program. I will give it a test and report the performance/accuracy versus the whole-matrix LU approach.

I’d be interested to know the results. The code is not particularly efficient; for example, only part of the matrix D is used.

Also, the Vandermonde structure of P is not exploited, and there exist fast QR factorizations of this type of matrix. See here and here.

I imagine it’s possible to exploit the structure in the 1-d case where polynomial belong to the basis p_j(x) \in \{1,x,x^2,...\} and the order is high. But for low polynomial orders (2 to 6), and in 2- or 3- dimensions , p_j(x) \in \{1,x,y, xy, x^2, xy, y^2, ...\} there is not much to gain. The savings at high orders come from the recursive evaluation,

subroutine poly2(x,y,p)
real(dp), intent(in) :: x, y
real(dp), intent(out) :: p(6)
  p(1) = 1
  p(2) = x
  p(3) = y
  p(4) = x**2
  p(5) = x*y
  p(6) = y**2
end subroutine

subroutine poly3(x,y,p)
real(dp), intent(in) :: x, y
real(dp), intent(out) :: p(10)
  p( 1) = 1
  p( 2) = x
  p( 3) = y
  p( 4) = x**2
  p( 5) = x*y
  p( 6) = y**2
  p( 7) = (x**2)*x ! reuse
  p( 8) = (x*y)*x  ! reuse
  p( 9) = (x*y)*y  ! reuse
  p(10) = (y**2)*y ! reuse
end subroutine

I don’t need to go any higher than 6. order, and the procedures are easy to generate and I expect them to be inlined where needed.