FEM assembly using stdlib sparse COO format

Hi,

I am trying to perform an assembly of finite element stiffness matrices using stdlib’s coo matrix format, like so:

   function assembly() result(rslt)
      use stdlib_kinds, only: dp
      use stdlib_sparse_kinds, only: COO_dp_type

      implicit none

      logical :: rslt

      integer           :: el
      integer           :: dofs_el(3)
      real(dp)          :: K_el(3,3)
      type(COO_dp_type) :: COO

      call COO%malloc(5,5,17)

      do el = 1, 2
         call calcKel(el, dofs_el, K_el)
         call COO%add(dofs_el, dofs_el, K_el)
      end do

      rslt = .true.

   contains

      subroutine calcKel(i, dofs, K)
         integer, intent(in)     :: i
         integer, intent(out)    :: dofs(3)
         real(dp), intent(out)   :: K(3,3)

         select case(i)
          case(1)
            dofs = [1,2,3]
            K(1,1) = -1._dp
            K(1,2) = 2._dp
            K(1,3) = 3._dp
            K(2,1) = 2._dp
            K(2,2) = 1._dp
            K(2,3) = 1._dp
            K(3,1) = 1._dp
            K(3,2) = 1._dp
            K(3,3) = 1._dp
          case default
            dofs = [3,4,5]
            K(1,1) = 2._dp
            K(1,2) = -1._dp
            K(1,3) = 3._dp
            K(2,1) = 1._dp
            K(2,2) = 2._dp
            K(2,3) = -1._dp
            K(3,1) = 3._dp
            K(3,2) = 2._dp
            K(3,3) = 1._dp
         end select
      end subroutine

   end function

I must be misunderstanding something, since I thought I could (1) allocate an empty global COO sparse matrix and (2) populate it by adding the element stiffness matrices in a do loop.

However, it seems that the COO matrix isn't initialized correctly and hence cannot be used to add the element matrices.

Am I misunderstanding how the stdlib COO matrix should be used?

Also, can MPI be used with COO%add(), so that multiple processes are used in parallel for the assembly?

The malloc call only allocates raw memory for the COO%index and COO%data buffers. At this point you haven't really created the COO matrix corresponding to the FEM mesh. The add method only works if you already have a fully created sparse matrix, as it internally looks up the proper index at which to add each individual entry.
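
To make this concrete, here is an untested sketch of what "creating" the matrix means before add can do anything useful. I am using the COO%index / COO%data components and the %add method as I remember them from the stdlib_sparse spec, so please double-check the exact interfaces there:

   program coo_pattern_sketch
      ! untested sketch: %malloc only reserves the buffers, the sparsity
      ! pattern in %index has to be defined before %add can locate entries
      use stdlib_kinds, only: dp
      use stdlib_sparse_kinds, only: COO_dp_type
      implicit none

      type(COO_dp_type) :: COO
      real(dp)          :: K(2,2)

      call COO%malloc(3, 3, 5)       ! 3x3 matrix, 5 non-zeros: raw buffers only

      ! define the pattern explicitly, one (row,col) pair per non-zero,
      ! ordered by row and column
      COO%index(1:2,1) = [1, 1]
      COO%index(1:2,2) = [1, 2]
      COO%index(1:2,3) = [2, 1]
      COO%index(1:2,4) = [2, 2]
      COO%index(1:2,5) = [3, 3]
      COO%data = 0._dp

      ! only now can %add find where to accumulate a 2x2 block on dofs [1,2]
      K = reshape([1._dp, 2._dp, 2._dp, 4._dp], [2, 2])
      call COO%add([1, 2], [1, 2], K)
   end program coo_pattern_sketch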

An implementation that goes from a homogeneous mesh (same number of nodes per cell/element) to a sparse matrix can be found here: FSPARSE/src/conversion/fsparse_cells2sparse.f90 at main · jalvesz/FSPARSE · GitHub (it has to be adapted to use the sparse types from stdlib, which should be quite straightforward since stdlib_sparse was adapted from FSPARSE).
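
The idea, adapted to the stdlib types, looks roughly like this (untested sketch; coo2ordered is, if I remember correctly, the stdlib routine that sorts the entries and merges duplicates, so please verify the name and behavior against the stdlib_sparse documentation):

   subroutine mesh2coo(connectivity, ndof, COO)
      ! untested sketch of the cells2sparse idea: collect every (dof_i, dof_j)
      ! pair of every element, duplicates included, then sort/merge them
      use stdlib_kinds, only: dp
      use stdlib_sparse
      integer, intent(in)            :: connectivity(:,:)  ! (dofs per element, nelem)
      integer, intent(in)            :: ndof
      type(COO_dp_type), intent(out) :: COO

      integer :: el, i, j, k, nper, nelem

      nper  = size(connectivity, 1)
      nelem = size(connectivity, 2)
      call COO%malloc(ndof, ndof, nper*nper*nelem)   ! duplicates allowed for now

      k = 0
      do el = 1, nelem
         do j = 1, nper
            do i = 1, nper
               k = k + 1
               COO%index(1:2, k) = [connectivity(i, el), connectivity(j, el)]
            end do
         end do
      end do
      COO%data = 0._dp

      call coo2ordered(COO)   ! sort the pattern and merge duplicated entries
   end subroutine mesh2coo

For your two-element example, merging the 18 collected pairs leaves exactly the 17 unique entries that match the malloc(5,5,17) in your snippet.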

When you use MPI to parallelize your program, you are using a distributed-data framework, as opposed to shared-memory parallelism where you parallelize the loops rather than the data (keeping the explanation simple). That means you have multiple copies of your program running, with synchronization steps between them. Each copy holds its own chunk of the data, and the indexes are local to the MPI process. So yes, you can use MPI, but you should be aware that each process works on its own partition of your FEM mesh, and it is up to you to create these partitions beforehand or at the beginning of the program. This also implies that when you declare a variable, for instance the matrix, you actually get multiple instances, one per process, and every operation is local to the process, including the add.
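
Purely as an illustration of the "one local matrix per process" idea (untested; the trivial block partition and the commented-out calls are placeholders, a real code would use a mesh partitioner and a local dof numbering):

   program mpi_local_assembly
      ! untested sketch: each MPI rank owns a chunk of the elements and
      ! assembles into its own local matrix; every %add call is purely local
      use mpi_f08
      use stdlib_sparse_kinds, only: COO_dp_type
      implicit none

      type(COO_dp_type) :: COO_local    ! one instance per MPI process
      integer :: rank, nprocs, nelem, el, el_first, el_last

      call MPI_Init()
      call MPI_Comm_rank(MPI_COMM_WORLD, rank)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs)

      nelem    = 1000                        ! illustrative element count
      el_first = rank*nelem/nprocs + 1       ! trivial block partition
      el_last  = (rank + 1)*nelem/nprocs

      ! 1) build the COO pattern of THIS rank's partition (local numbering)
      ! 2) loop over THIS rank's elements only
      do el = el_first, el_last
         ! call calcKel(el, dofs_el, K_el)                 ! placeholder
         ! call COO_local%add(dofs_loc, dofs_loc, K_el)    ! local operation
      end do

      call MPI_Finalize()
   end program mpi_local_assembly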

As a side note: you might prefer to assemble not on the COO matrix but on a CSR matrix. So the workflow would be: FEM mesh > COO > CSR, then use the CSR matrix for the assembly and the subsequent algebraic operations.
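
With stdlib that could look roughly like this (untested; coo2csr, the CSR %add method and spmv are the names I recall from the stdlib_sparse spec, so double-check them there):

   program coo_to_csr_sketch
      ! untested sketch: define the pattern once in COO, convert it to CSR,
      ! then assemble and do the algebra on the CSR matrix
      use stdlib_kinds, only: dp
      use stdlib_sparse
      implicit none

      type(COO_dp_type) :: COO
      type(CSR_dp_type) :: CSR
      real(dp) :: K_el(3,3), x(3), y(3)
      integer  :: i, j, k

      ! 1) COO pattern (here simply a dense 3x3 pattern, for brevity)
      call COO%malloc(3, 3, 9)
      k = 0
      do i = 1, 3
         do j = 1, 3
            k = k + 1
            COO%index(1:2, k) = [i, j]
         end do
      end do
      COO%data = 0._dp

      ! 2) convert the pattern once
      call coo2csr(COO, CSR)

      ! 3) assemble the element contributions directly on the CSR matrix
      K_el = 1._dp
      call CSR%add([1, 2, 3], [1, 2, 3], K_el)

      ! 4) reuse the CSR matrix for the algebraic operations
      x = 1._dp
      y = 0._dp
      call spmv(CSR, x, y)   ! y = CSR * x
   end program coo_to_csr_sketch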


Thank you for making me aware that there is more to this implementation problem than I initially thought.

I will research OpenMP and CSR instead of OpenMPI and COO 🙂

Even if stdlib gives you a path to a multi-threaded solution, I would first concentrate on getting a single-threaded solution working correctly.

I am not aware of the type of sparsity that “coo” supports, nor of the sparsity associated with the type of FE problem you are addressing, but I have found skyline storage to be effective for structural FE problems when using a direct solver.
This must be complemented by an appropriate equation re-ordering approach.
Multi-threaded direct skyline solvers are better suited to a large shared-memory OpenMP approach than to a distributed MPI approach.

After reading the “coo” and “csr” descriptions, a “csc” format may be more effective. During the reduction process, the csc format would typically degenerate to full skyline storage, so I would be interested in how effective sparsity turns out to be for your FE problems.
Again, the type of FE problem being solved is a significant influence. Problems that are better suited to a coo or csc description, rather than skyline, are not typical.

I agree here: focus on getting the sequential version working and efficient first. Only afterwards, investigate where the bottlenecks are and see how to parallelize.

The discussion about parallelism is wide and it could go on for a while. At this stage, a few things are worth noting (with some oversimplifications):

  • OpenMPI is an implementation library of the MPI standard. Other implementations are also available depending on the OS.
  • With MPI you’ll be able to run the same application on large clusters where the memory is distributed. With multi-threading one is confined to a single shared-memory node. One can always try to mix both, but that’s a topic on its own. (For the curious: http://www.idris.fr/media/eng/formations/hybride/form_hybride_en.pdf)
  • MPI is capable of handling “shared” memory whilst the programmer sees the code as if it were distributed. In most applications I’m aware of, MPI parallelism outperforms OpenMP even within a single socket. OpenMP’s typical advantage is a lower total memory consumption, since there is no need for halo regions to synchronize data. But this comes at the price that each thread has to handle larger memory regions, so the chance of cache misses is higher.

Just as with many other topics: it depends! Can the problem be solved with direct solvers or does it require iterative solvers? Only the application can tell. From my experience, for iterative solvers the CSR matvec is more efficient than the CSC matvec, but for direct solvers a CSC representation can indeed be useful. I think @ggerber is also using MUMPS; it has a few nice interfaces that let you pass the matrix in COO format, and the library then handles internally the transformations needed for the direct solver. MUMPS also offers several renumbering algorithms that can be used to improve performance.
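
For reference, the structure of a MUMPS call with a centralized COO (triplet) input looks roughly like this (condensed and untested, adapted from the example in the MUMPS user guide; field names such as NNZ versus the older NZ depend on the version, so check the guide of the release you use):

   program mumps_coo_sketch
      ! untested sketch of handing a COO (IRN/JCN/A triplet) matrix to MUMPS
      implicit none
      include 'mpif.h'
      include 'dmumps_struc.h'
      type(DMUMPS_STRUC) :: id
      integer :: ierr

      call MPI_Init(ierr)

      id%COMM = MPI_COMM_WORLD
      id%SYM  = 0        ! 0 unsymmetric, 1 SPD, 2 general symmetric
      id%PAR  = 1        ! the host also takes part in the factorization
      id%JOB  = -1       ! initialize the MUMPS instance
      call DMUMPS(id)

      if (id%MYID == 0) then
         ! centralized COO (triplet) input on the host: small 2x2 example
         id%N   = 2
         id%NNZ = 3
         allocate(id%IRN(3), id%JCN(3), id%A(3), id%RHS(2))
         id%IRN = [1, 2, 2]
         id%JCN = [1, 1, 2]
         id%A   = [4.0d0, 1.0d0, 3.0d0]
         id%RHS = [5.0d0, 4.0d0]
      end if

      id%JOB = 6         ! analysis + factorization + solve
      call DMUMPS(id)    ! on exit, the host's id%RHS holds the solution

      id%JOB = -2        ! release the instance
      call DMUMPS(id)
      call MPI_Finalize(ierr)
   end program mumps_coo_sketch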

The COO format is typically a good pivotal format: to convert from one representation to another or to do topology transformations, but not to do arithmetic operations.

For such operations, other formats (CSR, CSC, SELL-sigma, ELLPACK, skyline) are recommended, depending on the problem and the hardware (CPU or GPU).
