Issues with Pseudo-Inverse implementation using SVD

I’m attempting to implement the pseudo-inverse of a matrix using singular value decomposition (dgesvd). I have created a GitHub repository (gha3mi/pinverse) to test this function. You can find the implementation of the pinv function there. However, I am encountering two issues:

  • In test1, I compared the results of the pinv function in MATLAB with my implementation. The results match when using the Intel compilers (ifort (IFORT) 2021.8.0 20221119 and ifx (IFORT) 2023.0.0 20221201), but there are differences when using gfortran (GNU Fortran (GCC) 13.1.0). Do you have any suggestions to address this issue?
  • In test2, I would like to test the performance of pinv for large matrices, such as A(100000,10000). Are there any suggestions you can provide to improve the performance of the pseudo-inverse for such large matrices?

That is going to be a very expensive O(n^3) SVD operation. As with regular inverse-matrix computations, you should look for the possibility that you don’t really need the inverse, but rather you can settle for something else, such as the solution of linear equations. If you are solving a least-squares problem, for example, then there may be other more efficient ways to compute rank-deficient solutions (e.g. iterative methods that scale as O(n^2) per iteration).

[edit] I misread the matrix dimensions when I posted this originally. I thought you had a square matrix. For a rectangular matrix of dimension A(m,n), the SVD will scale as m*n^2. An iterative method would scale as m*n per iteration.


Thank you for your answer, @RonShepard.
In this case, is DGELS a good choice?

I would suggest that you start with that, and then decide how it scales and whether there are other approaches that are better. DGELS uses QR factorization rather than SVD, so it will be cheaper for that reason, but DGELS will still scale as m*n^2, which will be costly for your large matrix dimensions. If you could find an iterative method (like a preconditioned GMRES) that converges in, say, 100 iterations, then the effort would scale as 100*m*n, but finding such a method, if it exists, is problem-specific. You might also benefit from parallelization, so you should explore other linear algebra packages such as ScaLAPACK (Scalable Linear Algebra PACKage). If your matrix is sparse, meaning the number of nonzero elements is <<m*n, then that feature might also help, especially within an iterative solution method. In any case, when you start to scale up the matrix dimensions with DGELS, you will see then what kind of problems you face.
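As a sketch of this suggestion, a least-squares solve with DGELS (QR-based, no explicit pseudo-inverse is ever formed) could look like the following; the matrix sizes and names here are made up for illustration, and the two-call workspace-query pattern is the usual LAPACK idiom:

```fortran
program dgels_demo
   implicit none
   integer, parameter :: wp = kind(1.0d0)
   integer, parameter :: m = 5, n = 3
   real(wp) :: A(m, n), b(m)
   real(wp) :: work1(1)
   real(wp), allocatable :: work(:)
   integer :: lwork, info

   call random_number(A)
   call random_number(b)

   ! Workspace query: lwork = -1 returns the optimal size in work1(1).
   call dgels('N', m, n, 1, A, m, b, m, work1, -1, info)
   lwork = nint(work1(1))
   allocate (work(lwork))

   ! Solve min ||A*x - b||_2; on exit the first n entries of b hold x.
   call dgels('N', m, n, 1, A, m, b, m, work, lwork, info)
   if (info /= 0) print *, 'dgels failed, info =', info
   print *, 'solution x =', b(1:n)
end program dgels_demo
```

Note that DGELS assumes full rank; for rank-deficient problems DGELSD or DGELSY would be the LAPACK alternatives.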

  1. Use optimized LAPACK/BLAS routines (if you aren’t already). For example: OpenBLAS, ATLAS, GotoBLAS, or Intel MKL. These can give a 2–10× speedup over the reference Fortran BLAS.
  2. Use a parallelized BLAS/LAPACK library. The speedup will depend on the problem, and may be anywhere between 0.5× and NUMBER_OF_CORES× at a guess.
  3. Exploit any structure in the input matrix: sparsity; blocks, in particular large zero, identity, diagonal, and structured blocks.

(1) and (2) are relatively easy - usually just install the appropriate libraries for your system, change the linker flags and perhaps set some environment variables. (3) requires some work.
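With fpm, for example, switching between (1) and (2) is often just a different link flag; a possible setup (assuming OpenBLAS is installed system-wide, with exact flags varying by platform) might be:

```
# Reference LAPACK/BLAS:
fpm test --profile release --flag "-llapack -lblas"

# OpenBLAS (bundles BLAS + LAPACK, can run threaded):
fpm test --profile release --flag "-lopenblas"

# Control the number of BLAS threads at run time:
export OPENBLAS_NUM_THREADS=8
```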

I have also used pseudo-inverse solutions. I am not an expert in linear algebra, just a user, but out of curiosity I added my own version of the pseudo-inverse under my fork and tested the performance difference at:

(Because I am using a Windows-MSYS2 environment, M_stopwatch is not available there.)

> fpm test --compiler gfortran --flag "-Wno-line-truncation -llapack -lblas" test2 --profile release
 Elapsed CPU time: 1.32812500
 Elapsed CPU time: 9.40625000
 test2 completed.
> ifort with Visual-Studio (release mode)
 Elapsed CPU time:    0.5937500
 Elapsed CPU time:    3.531250
 test2 completed.

It is faster to call LAPACK’s API directly when possible. (-Ofast can produce unexpected results, so it is not used here.)

@RonShepard, thank you. The matrices are sparse.
I’m going to start with DGELS as you mentioned. Then maybe I will consider using ScaLAPACK, although I have not been successful in using it before.

@DavidB, thank you for your suggestions. I’m currently using Intel MKL, I will try the other libraries.

You may not get much improvement over Intel MKL. I’d look elsewhere.

If your matrices are sparse you can also consider ARPACK. You can try it from SciPy in Python to see if it fits your needs (scipy.sparse.linalg.svds with solver='arpack'). If it does, there are some discussions and sources for ARPACK in the Modernizing ARPACK thread.
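As a quick way to try this, a SciPy prototype of a truncated (rank-k) pseudo-inverse via svds could look like the following; the matrix size, density, rank k, and truncation threshold are arbitrary choices for the example:

```python
import numpy as np
from scipy.sparse import random as sprandom
from scipy.sparse.linalg import svds

# A random sparse test matrix (200 x 50, 5% nonzeros).
A = sprandom(200, 50, density=0.05, random_state=0, format="csr")

# Largest k singular triplets via ARPACK (k must be < min(A.shape)).
k = 20
U, s, Vt = svds(A, k=k, solver="arpack")

# Truncated pseudo-inverse: pinv(A) ~ V * diag(1/s) * U^T,
# keeping only singular values above a relative threshold.
keep = s > 1e-10 * s.max()
A_pinv = (Vt[keep].T / s[keep]) @ U[:, keep].T
print(A_pinv.shape)  # (n, m) = (50, 200)
```

For a sparse matrix this only ever touches k singular triplets, instead of the full min(m, n) that dgesvd computes.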

This recent paper may be a start into the literature.

Jung, J., Sael, L. Fast and accurate pseudoinverse with sparse matrix reordering and incremental approach. Mach Learn 109, 2333–2347 (2020).


I may be way off base, but if the matrix is sparse, would adding a statement

if( U(j, irank).eq.0.) cycle

before the inner loop save n multiply/add/divide operations for each zero U(j, irank) element, at the small cost of one test per element?

@zoziha, thank you for your comparison.
The pseudo-inverse without SVD (full rank) is obviously faster, but I think using SVD provides more accurate solutions. Here are the relative errors:

using pinv: 6.273484339772714E-015
using linv: 4.771404614080608E-014

In some cases, SVD also allows for the use of low rank, which can lead to faster computations (I guess). I will test this solution.
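The accuracy difference between the two routes can be reproduced in a small NumPy prototype before committing to a Fortran implementation: the normal-equations formula inv(A^T A) A^T squares the condition number, while the SVD route does not. The sizes and conditioning below are illustrative, not taken from the repository:

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 200, 50

# Build an ill-conditioned test matrix with known singular values.
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
s0 = np.logspace(0, -7, n)          # condition number ~1e7
A = U0 @ np.diag(s0) @ V0.T

# SVD route: pinv(A) = V * diag(1/s) * U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
pinv_svd = (Vt.T / s) @ U.T

# Normal-equations route (full column rank assumed): inv(A^T A) * A^T
pinv_ne = np.linalg.solve(A.T @ A, A.T)

# Residual of the Moore-Penrose identity A * pinv(A) * A = A.
err_svd = np.linalg.norm(A @ pinv_svd @ A - A)
err_ne = np.linalg.norm(A @ pinv_ne @ A - A)
print(err_svd, err_ne)
```

On an ill-conditioned matrix like this one the SVD residual stays near machine precision while the normal-equations residual degrades by several orders of magnitude, which matches the pinv/linv error gap reported above.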

I have noticed that using a larger lwork parameter in the dgesvd function yields slightly better performance.

dgesvd('A', 'A', m, n, A, m, S, U, m, VT, n, work, lwork, info)

Here are the results:
With lwork = huge(lwork): cpu= 25.03 user= 24.55 sys= 0.48 wall= 2.89

And with a smaller lwork: cpu= 28.70 user= 28.24 sys= 0.46 wall= 3.34

@hkvzjal, thank you for your suggestion. I will investigate it.

@feenberg You mean like this?

  do irank = 1, min(n, m)
     do j = 1, m
        if (U(j, irank) == 0.) cycle
        do i = 1, n
           Apinv(i, j) = Apinv(i, j) + VT(irank, i) * U(j, irank) / S(irank)
        end do
     end do
  end do

The SVD of a sparse matrix does not necessarily result in zero left singular vectors.

@zoziha, after making some improvements, I achieved better performance with SVD. Here are the results for A(10000,1000):

linv (without SVD): Elapsed CPU time: 11.50686
pinv (SVD)        : Elapsed CPU time: 7.912435

for A(100000,1000):

linv (without SVD):  Elapsed CPU time:    162.8467    
pinv (SVD)        :  Elapsed CPU time:    60.14035

The updated version is available on GitHub.


I’m not sure what that exactly means. Did you allocate a work(:) array of length huge(lwork)?

The optimal value of lwork is returned by DGELS() in the first element of the work array when you input lwork = -1. One typically does something like

real(wp) :: work1(1)
real(wp), allocatable :: work(:)
call dgels(..., work1, -1, ...)   ! memory allocation query.
lwork = nint(work1(1))
allocate( work(lwork) )
call dgels(..., work, lwork, ...) ! actual solution.

This is kind of clunky, but it is flexible. It allows, for example, a single workspace allocation to be reused for several solutions, so it minimizes the allocation/deallocation overhead.

[edit] I just noticed that you were talking about DGESVD() rather than DGELS(), but the same memory allocation approach works in either case.
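For DGESVD the same two-call pattern would read as below; the array names mirror the call quoted earlier in the thread, and wp is assumed to be the working real kind:

```fortran
real(wp) :: work1(1)
real(wp), allocatable :: work(:)
integer :: lwork, info

! Workspace query: lwork = -1 writes the optimal size into work1(1).
call dgesvd('A', 'A', m, n, A, m, S, U, m, VT, n, work1, -1, info)
lwork = nint(work1(1))
allocate (work(lwork))

! Actual decomposition with the optimal workspace.
call dgesvd('A', 'A', m, n, A, m, S, U, m, VT, n, work, lwork, info)
```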

In any case, I would suggest using the LAPACK routines to get a feeling for how the problems scale and what kinds of accuracies to expect with various singular value thresholds, and then explore the various sparse matrix options with that reference. So even if you end up not using the LAPACK code, those initial experiments are not wasted effort.


@RonShepard, I didn’t know that setting lwork=-1 calculates only the optimal size of the work array. I have corrected it.