FEM assembly using stdlib sparse COO format

Hi,

I am trying to perform an assembly of finite element stiffness matrices using stdlib’s coo matrix format, like so:

   function assembly() result(rslt)
      use stdlib_kinds, only: dp
      use stdlib_sparse_kinds, only: COO_dp_type
      implicit none

      logical :: rslt

      integer           :: el
      integer           :: dofs_el(3)
      real(dp)          :: K_el(3,3)
      type(COO_dp_type) :: COO

      call COO%malloc(5,5,17)

      do el = 1, 2
         call calcKel(el, dofs_el, K_el)
         call COO%add(dofs_el, dofs_el, K_el)
      end do

      rslt = .true.

   contains

      subroutine calcKel(i, dofs, K)
         integer, intent(in)     :: i
         integer, intent(out)    :: dofs(3)
         real(dp), intent(out)   :: K(3,3)

         select case(i)
          case(1)
            dofs = [1,2,3]
            K(1,1) = -1._dp
            K(1,2) = 2._dp
            K(1,3) = 3._dp
            K(2,1) = 2._dp
            K(2,2) = 1._dp
            K(2,3) = 1._dp
            K(3,1) = 1._dp
            K(3,2) = 1._dp
            K(3,3) = 1._dp
          case default
            dofs = [3,4,5]
            K(1,1) = 2._dp
            K(1,2) = -1._dp
            K(1,3) = 3._dp
            K(2,1) = 1._dp
            K(2,2) = 2._dp
            K(2,3) = -1._dp
            K(3,1) = 3._dp
            K(3,2) = 2._dp
            K(3,3) = 1._dp
         end select
      end subroutine

   end function

I must be misunderstanding something, since I thought I could (1) allocate an empty global coo sparse matrix and (2) populate it by adding the element stiffness matrices in a do loop.

However, it seems that the coo matrix isn’t initialized correctly and hence cannot be used to add the element matrices.

Am I misunderstanding how the stdlib coo matrix should be used?

Also, can MPI be used with COO%add(), so that multiple processes are used in parallel for the assembly?

The COO%malloc call only allocates raw memory for the coo%index and coo%data buffers. At this point you haven’t really created the COO matrix corresponding to the FEM mesh. The add method only works if you already have a fully created sparse matrix, as it internally looks for the proper index at which to add the individual entries.

An implementation that goes from a homogeneous mesh (same number of nodes per cell/element) to a sparse matrix can be found here: FSPARSE/src/conversion/fsparse_cells2sparse.f90 at main · jalvesz/FSPARSE · GitHub (it has to be adapted to use the sparse types from stdlib, which should be quite straightforward since stdlib_sparse was adapted from FSPARSE).
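To illustrate the idea on the small example above, here is a sketch only: it assumes stdlib’s from_ijv and coo2ordered conversion routines and the block add method, so check the stdlib_sparse specification for the exact module names and signatures before using it.

program coo_from_connectivity
   use stdlib_kinds, only: dp
   use stdlib_sparse   ! assumed to re-export COO_dp_type, from_ijv, coo2ordered
   implicit none

   integer, parameter :: nelt = 2, nnode = 3   ! 2 elements with 3 dofs each
   integer  :: conn(nnode, nelt)               ! element connectivity (dof lists)
   integer, allocatable :: rows(:), cols(:)
   integer  :: el, i, j, k
   real(dp) :: K_el(nnode, nnode)
   type(COO_dp_type) :: COO

   conn(:,1) = [1,2,3]
   conn(:,2) = [3,4,5]

   ! 1) every (i,j) dof pair of every element is a potential non-zero
   allocate(rows(nelt*nnode*nnode), cols(nelt*nnode*nnode))
   k = 0
   do el = 1, nelt
      do j = 1, nnode
         do i = 1, nnode
            k = k + 1
            rows(k) = conn(i, el)
            cols(k) = conn(j, el)
         end do
      end do
   end do

   ! 2) create the COO structure from the (row, col) pairs
   call from_ijv(COO, rows, cols)
   ! 3) sort the entries and merge duplicates so that add() can locate them
   call coo2ordered(COO, .true.)

   ! 4) now the element loop with the block add works as intended
   do el = 1, nelt
      K_el = 1._dp                              ! placeholder element matrix
      call COO%add(conn(:,el), conn(:,el), K_el)
   end do
end program coo_from_connectivity

This is essentially what the cells2sparse routine linked above automates for a homogeneous mesh.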

When you use MPI to parallelize your program, you are using a distributed-data framework (as opposed to shared-memory parallelism, where you parallelize the loops rather than the data; this is a simplification, but it will do here). That means you have multiple copies of your program with synchronization steps. Each copy holds its own chunk of the data, and the indexes are local to the MPI process. So yes, you can use MPI, but be aware that each process works on its own partition of the FEM mesh. It is up to you to create these partitions beforehand or at the beginning of the program. This also implies that when you declare a variable, for instance the matrix, you actually have multiple instances, one per process, and every operation, including the add, is local to the process.

As a side note: you might prefer to assemble not on the COO matrix but on a CSR matrix. So the workflow is: FEM mesh > COO > CSR; then use the CSR matrix for the assembly and the subsequent algebraic operations.
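Continuing the sketch above, the COO > CSR step would look roughly like this (again assuming the coo2csr conversion and the add method on the CSR type, to be checked against the stdlib_sparse spec):

   type(CSR_dp_type) :: CSR

   call coo2csr(COO, CSR)        ! copy the ordered topology (and data) into CSR
   CSR%data = 0._dp              ! reset the values, keep the structure

   do el = 1, nelt
      K_el = 1._dp               ! placeholder element matrix
      call CSR%add(conn(:,el), conn(:,el), K_el)   ! assemble directly into CSR
   end do

After that, the CSR matrix is the one you keep for the algebraic operations.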

1 Like

Thank you for making me aware that there is more to this implementation problem than I initially thought.

I will research OpenMP and CSR, instead of OpenMPI and COO 🙂

If stdlib gives you a path to a multi-thread solution, I would first concentrate on getting a single thread solution working correctly.

I am not aware of the type of sparsity that “coo” supports, nor of the sparsity associated with the type of FE problem you are addressing, but I have found skyline storage to be effective for structural FE problems when using a direct solver.
This must be complemented by an appropriate equation re-ordering approach.
Multi-threaded direct skyline solvers are more suited to a large shared memory OpenMP approach, rather than a distributed MPI approach.

After reading the “coo” and “csr” descriptions, a “csc” format may be more effective. During the reduction process, the csc format would typically degenerate to full skyline storage, so I would be interested in the effectiveness of your application of sparsity to your FE problems.
Again, the type of FE problem being solved is a significant influence. Those that are more suited to coo or csc description, rather than skyline, are not typical.

I agree here, focus on getting the sequential version working and efficient. Just after research on where are the bottle necks and see how to parallelize.

Discussions about parallelism are wide-ranging and could go on for a while. At this stage, a few things are worth noting (with some oversimplifications):

  • OpenMPI is one implementation of the MPI standard. Other implementations are also available, depending on the OS.
  • With MPI you’ll be able to run the same application on large clusters where the memory is distributed. With multi-threading one is confined to a single shared-memory node. One can always try to mix both, but that’s a topic of its own (for the curious: http://www.idris.fr/media/eng/formations/hybride/form_hybride_en.pdf).
  • MPI is capable of handling “shared” memory whilst the programmer sees the code as if it were distributed. In most applications I’m aware of, MPI parallelism outperforms OpenMP even within a single socket. OpenMP’s typical advantage is lower total memory consumption, since there is no need for halo regions to synchronize data. But this comes at the price that each thread has to handle larger memory regions, so the chance of cache misses is higher.

Just as with many other topics: it depends! Can the problem be solved with direct solvers, or does it require iterative solvers? Only the application can tell. From my experience, for iterative solvers the CSR matvec is more efficient than the CSC matvec. But for direct solvers, a CSC representation can indeed be useful. I think @ggerber is also using MUMPS; it has a few nice interfaces that let you pass the matrix in COO format, and the library then handles internally the transformations needed for the direct solver. MUMPS also offers several renumbering algorithms that can be used to improve performance.
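To give an idea, here is a minimal sketch of the centralized COO (assembled triplet) input to MUMPS for a tiny 2x2 system. Field names are those of dmumps_struc.h; ICNTL(5)=0 selects the assembled format (ICNTL(5)=1 selects the elemental format). Error checking is omitted, so check the MUMPS user guide for the details.

subroutine mumps_coo_demo(comm)
   use stdlib_linalg_constants, only: dp
   implicit none
   include 'dmumps_struc.h'

   integer, intent(in) :: comm       ! e.g. MPI_COMM_WORLD%MPI_VAL
   type(dmumps_struc)  :: id

   id%COMM = comm
   id%SYM = 0; id%PAR = 1
   id%JOB = -1; call DMUMPS(id)      ! initialize the instance

   id%ICNTL(5) = 0                   ! assembled (COO triplet) input
   if (id%MYID == 0) then            ! centralized input: the host provides the triplet
      id%N   = 2
      id%NNZ = 3                     ! NZ in older MUMPS releases
      allocate(id%IRN(3), id%JCN(3), id%A(3), id%RHS(2))
      id%IRN = [1, 2, 2]             ! row indices
      id%JCN = [1, 1, 2]             ! column indices
      id%A   = [2._dp, -1._dp, 3._dp]
      id%RHS = [2._dp, 1._dp]        ! overwritten by the solution
   end if

   id%JOB = 6; call DMUMPS(id)       ! analysis + factorization + solve
   if (id%MYID == 0) print *, 'x =', id%RHS

   id%JOB = -2; call DMUMPS(id)      ! free MUMPS internal data
end subroutine mumps_coo_demo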

The COO format is typically a good pivot format: convenient for converting from one representation to another or for topology transformations, but not for arithmetic operations.

For such operations, other formats (CSR, CSC, SELL-sigma, ELLPACK, skyline) are recommended, depending on the problem and the hardware (CPU? GPU?).
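For instance, a CSR matrix-vector product with the stdlib spmv kernel would look like this (a sketch; the routine names follow the stdlib_sparse spec, but the exact optional arguments should be checked there):

program csr_spmv_demo
   use stdlib_kinds, only: dp
   use stdlib_sparse    ! assumed to provide COO/CSR types, dense2coo, coo2csr, spmv
   implicit none

   real(dp) :: dense(3,3), x(3), y(3)
   type(COO_dp_type) :: COO
   type(CSR_dp_type) :: CSR

   dense = reshape([ 2._dp, -1._dp,  0._dp,  &
                    -1._dp,  2._dp, -1._dp,  &
                     0._dp, -1._dp,  2._dp], [3,3])

   call dense2coo(dense, COO)   ! dense -> COO (keeps only the non-zeros)
   call coo2csr(COO, CSR)       ! COO -> CSR

   x = 1._dp
   y = 0._dp
   call spmv(CSR, x, y)         ! y = CSR*x, expected [1, 0, 1]
   print *, y
end program csr_spmv_demo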

1 Like

With regard to “it depends”: it depends on the type of FEM solution you are attempting, a direct or an iterative solver.

My structural FE analysis uses a skyline storage - direct solution approach. For this analysis, the full system stiffness matrix is assembled (in a csc format) and reduced. Equation reordering is essential to minimise the storage and operations count.
I use an OpenMP multi-threaded approach, where this large matrix (of many gigabytes, much larger than the cache memory) is available to all threads. For this multi-threaded approach the bottleneck is usually the memory <> cache transfer capacity. During the reduction, improved performance is targeted by sharing cache memory between most threads. This can involve sharing equations in the active front between threads.

OpenMP is well suited to a large shared memory approach.

For iterative solvers (with which I have limited experience), I expect an MPI approach would be more suitable. Each distributed process would address a small local subset of the mesh/equations, so it would not use the full stiffness matrix. The sparse storage approach would be chosen to suit this local subset solution.
This is a more complex problem definition, and I would expect the efficiency of this approach to depend on the selection of the local sub-problems. I would expect this to be used for repeated analysis of established problems.

The choice of sparsity addressing type is not the most significant issue.

1 Like

I inherited a serial fortran pipeflow code that implemented its own frontal solver.

Mumps sounds like a nice modular building block to replace the original serial frontal solver.

At this moment it seems that openmp and mumps’s elemental-level assembly can complement each other.

I am therefore trying to write my first openmp proof of concept where openmp creates a concatenated array of element matrices with a corresponding concatenated array of element degrees of freedom in parallel using “!$omp do”.

Then I think it is simply a matter of feeding mumps those two arrays…
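If I read the MUMPS user guide correctly, for two 2-node bar elements with dofs (1,2) and (2,3) those arrays would look like this (ICNTL(5)=1, unsymmetric, so full element matrices stored column by column):

   ELTPTR = [1, 3, 5]                 ! element i occupies ELTVAR(ELTPTR(i) : ELTPTR(i+1)-1)
   ELTVAR = [1,2,  2,3]               ! concatenated element dof lists
   A_ELT  = [k,-k,-k,k,  k,-k,-k,k]   ! concatenated element matrices, column-major

   ! N = 3 and NELT = 2 complete the description; MUMPS sums the
   ! overlapping contributions (dof 2 here) internally.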

2 Likes

Hi,

I managed to implement a very basic linear elastic finite element solver using OpenMPI, OpenMP and MUMPS.

I am new to OpenMPI and OpenMP and was wondering if someone has an opinion on my OpenMP implementation. MUMPS requires MPI, while my element assembly uses OpenMP.

main.f90



program main
   use debug_test, only: test
   use veggies, only: result_t
   use mpi_f08, only: MPI_INIT, MPI_FINALIZE, MPI_SUCCESS

   implicit none

   type(result_t)                :: rslt ! comment this line out when doing: fpm run ...
   integer                       :: ierr


   CALL MPI_INIT(ierr)
   if (ierr /= MPI_SUCCESS) error stop "MPI_INIT failed"

   ! print *, "Hello mumps"
   rslt = test() ! comment this line out when doing: fpm run ...

   CALL MPI_FINALIZE(IERR)
   if (ierr /= MPI_SUCCESS) error stop "MPI_FINALIZE failed"
   print *, "bye"
end program main

debug_test.f90

module debug_test
   use veggies, only: result_t, assert_equals, assert_equals_within_absolute
   use stdlib_linalg_constants, only: dp

   implicit none

   private
   public:: test

contains

   function test() result(rslt)
      type(result_t)                :: rslt
      rslt = check_openmp_mumps_assembly_bc()
   end function test

   subroutine getElem(el, k, topo, A, rhs)
      ! element stiffness matrix, global degrees of freedom list and rhs
      integer, intent(in):: el
      real(dp), intent(in):: k
      integer, intent(out):: topo(2)
      real(dp), intent(out):: A(2,2)
      real(dp), intent(out):: rhs(2)

      real(dp):: sum
      integer::i

      ! do some intensive computation (busy work, to give each element a visible cost)
      sum = 0._dp
      do i = 1, 10000000
         sum = sum + real(i, dp)**2   ! real arithmetic: i**2 would overflow default integers
      end do

      ! element stiffness matrix and rhs
      topo = [el,el+1]
      A = k*reshape([1,-1,-1,1],[2,2])
      rhs = [0._dp,0._dp]
   end subroutine

   function check_openmp_mumps_assembly_bc() result(rslt)
      ! elastic bars of equal length and stiffness k connected in series; left side fixed; right side force Fx applied
      use mumpsSolver, only: solve_elemental
      use mpi_f08, only: MPI_COMM_WORLD ! MPI_COMM_SELF when using 'fpm test'

      type(result_t) :: rslt

      integer:: n ! number of nodes
      integer:: nelt ! number of elements
      integer:: ndofs ! number of dofs per element
      integer              :: el ! element counter
      integer              :: i ! row counter
      integer              :: j ! column counter
      integer              :: itime ! time-step counter

      real(dp):: k ! element stiffness
      real(dp):: Fx3 ! axial force
      integer:: bcdof

      integer, allocatable    :: eltptr_ref(:) ! indices of each element's 1st dof in elt_var
      integer, allocatable    :: eltptr(:) ! indices of each element's 1st dof in elt_var
      integer, allocatable    :: eltvar_ref(:) ! concatenated list of each element's dofs
      integer, allocatable    :: eltvar(:) ! concatenated list of each element's dofs
      real(dp), allocatable   :: A_elt_ref(:)
      real(dp),allocatable    :: A_elt(:)
      integer, allocatable    :: A_rows(:)
      integer, allocatable    :: indices(:)
      real(dp), allocatable   :: rhs_ref(:)
      real(dp), allocatable   :: rhs(:)
      real(dp), allocatable   :: A_el(:,:) ! element stiffness matrix
      real(dp),allocatable    :: rhs_el(:) ! element rhs
      integer, allocatable    :: topo(:) ! element global dof numbers

      nelt = 100
      n = nelt + 1
      ndofs = 2

      k = 1
      Fx3 = 1
      bcdof = 1

      ! Allocate
      allocate(eltptr(nelt+1), eltptr_ref(nelt+1))
      allocate(eltvar(nelt*ndofs), eltvar_ref(nelt*ndofs))
      allocate(A_elt(nelt*ndofs*ndofs), A_elt_ref(nelt*ndofs*ndofs), A_rows(nelt*ndofs*ndofs))
      allocate(rhs(n), rhs_ref(n))

      allocate(A_el(ndofs,ndofs))
      allocate(rhs_el(ndofs))
      allocate(topo(ndofs))

      ! benchmark test values
      do el = 1, nelt
         eltptr_ref(el) = 2*el-1
         eltvar_ref(2*el-1:2*el) = [el,el+1]
         A_elt_ref(4*el-3:4*el) = k*[1,-1,-1,1]
         rhs_ref(el) = Fx3/k*(el-1)
      end do
      eltptr_ref(nelt+1) = nelt*ndofs+1
      rhs_ref(nelt+1) = nelt*Fx3/k
      A_elt_ref(3) = 0._dp

      ! Time loop iteration
      do itime = 1, 2

         ! Initialize
         rhs = 0._dp

         ! Assemble
         !$omp parallel do
         do el = 1, nelt
            call getElem(el, k, topo, A_el, rhs_el)

            eltptr(el) = (el-1)*ndofs+1
            do j = 1, ndofs
               do i = 1, ndofs
                  A_elt((el-1)*ndofs*ndofs+(j-1)*ndofs+i) = A_el(i,j)
                  A_rows((el-1)*ndofs*ndofs+(j-1)*ndofs+i) = topo(i)
               end do
               eltvar((el-1)*ndofs+j) = topo(j)
               rhs(topo(j)) = rhs(topo(j)) + rhs_el(j)
            end do
         end do
         !$omp end parallel do

         eltptr(nelt+1) = nelt*ndofs+1

         ! Apply BCs
         indices = pack([(i,i=1,size(A_rows))], A_rows==bcdof)
         rhs(bcdof) = 0._dp
         rhs(nelt+1) = Fx3
         A_elt(indices) = 0._dp
         A_elt(indices(1)) = 1._dp

         ! Solve
         call solve_elemental(MPI_COMM_WORLD%MPI_VAL, n, nelt, eltptr, eltvar, A_elt, rhs) ! MPI_COMM_SELF
      end do

      ! Assert
      rslt = assert_equals_within_absolute(A_elt_ref, A_elt, 1.e-13_dp, "A_els") .and. &
         assert_equals_within_absolute(rhs_ref, rhs, 1.e-12_dp, "rhs") .and. &
         assert_equals(eltvar_ref, eltvar, "eltvar") .and. &
         assert_equals(eltptr_ref, eltptr,"eltptr")
      print *, 'bye bye'
   end function check_openmp_mumps_assembly_bc

end module debug_test

mumpsSolver.f90

module mumpsSolver
   use stdlib_linalg_constants, only: dp
   use iso_fortran_env, only: stdout=>output_unit

   implicit none
   private
   public :: solve_elemental

contains

   subroutine solve_elemental(comm, n, nelt, eltptr, eltvar, A_elt, rhs)
      ! use mpi_f08, only: MPI_COMM_WORLD

      IMPLICIT NONE
      INCLUDE 'dmumps_struc.h'

      type(dmumps_struc)      :: mumps_par
      integer                 :: i
      integer, intent(in)     :: comm ! MPI_COMM_WORLD%MPI_VAL or MPI_COMM_SELF%MPI_VAL
      integer, intent(in)     :: n
      integer, intent(in)     :: nelt
      integer, intent(in)     :: eltptr(:) ! index of each element's first dof in eltvar
      integer, intent(in)     :: eltvar(:) ! concatenated list of each element's dofs
      real(dp), intent(in)    :: a_elt(:)
      real(dp), intent(inout) :: rhs(:)

      mumps_par%COMM = comm! MPI_COMM_WORLD%MPI_VAL

      mumps_par%JOB = -1 ! initialize mumps instance
      mumps_par%SYM = 0 ! A is unsymmetric
      mumps_par%PAR = 1 ! host process also involved in the factorization and solve phases

      CALL DMUMPS(mumps_par)

      IF (mumps_par%INFOG(1) < 0) THEN ! if init failed
         WRITE(stdout,'(A,A,I6,A,I9)') " ERROR RETURN: ",&
            "  mumps_par%INFOG(1)= ", mumps_par%INFOG(1),&
            "  mumps_par%INFOG(2)= ", mumps_par%INFOG(2)
      else ! if init succeeded
         IF (mumps_par%MYID == 0) THEN
            mumps_par%n = n
            mumps_par%nelt = nelt
            allocate(mumps_par%eltptr, source=eltptr)
            allocate(mumps_par%eltvar, source=eltvar)
            allocate(mumps_par%a_elt, source=a_elt)
            allocate(mumps_par%rhs, source=rhs)
         END IF

         mumps_par%icntl(1:4) = -1 ! suppress solver output
         mumps_par%ICNTL(5) = 1 ! Elemental format
         mumps_par%JOB = 6 ! perform jobs 1 (analysis), 2 (factorization) and 3 (solve)
         CALL DMUMPS(mumps_par)

         IF (mumps_par%INFOG(1) < 0) THEN ! if solve failed
            WRITE(stdout,'(A,A,I6,A,I9)') " ERROR RETURN: ",&
               "  mumps_par%INFOG(1)= ", mumps_par%INFOG(1),&
               "  mumps_par%INFOG(2)= ", mumps_par%INFOG(2)
         else ! if solve succeeded
            IF (mumps_par%MYID == 0) THEN
               WRITE(stdout,*) 'Solution is ',(mumps_par%RHS(i),i=1,mumps_par%N)
               rhs = mumps_par%rhs
            END IF

            mumps_par%JOB = -2 ! terminate mumps instance
            CALL DMUMPS(mumps_par)
            IF (mumps_par%INFOG(1) < 0) THEN ! if terminate failed
               WRITE(stdout,'(A,A,I6,A,I9)') " ERROR RETURN: ",&
                  "  mumps_par%INFOG(1)= ", mumps_par%INFOG(1),&
                  "  mumps_par%INFOG(2)= ", mumps_par%INFOG(2)
            END IF
         END IF
      END IF
   end subroutine solve_elemental

end module mumpsSolver

fpm.toml

name = "mumps2"
version = "0.1.0"
license = "license"
author = "Jane Doe"
maintainer = "jane.doe@example.com"
copyright = "Copyright 2025, Jane Doe"

[build]
auto-executables = true
auto-tests = true
auto-examples = true
module-naming = false
link = ["dmumps"]

[install]
library = false
test = false

[fortran]
implicit-typing = true
implicit-external = true
source-form = "free"

[dependencies]
mpi = "*"
stdlib = "*"
rojff = { git = "https://gitlab.com/everythingfunctional/rojff.git", tag = "v1.0.1" }
[dev-dependencies]
veggies = { git = "https://gitlab.com/everythingfunctional/veggies.git", tag = "v1.0.0" }

I installed the MUMPS solver from the Ubuntu packages:

sudo apt install libmumps-headers-dev libmumps-dev mumps-test

The command I use to run it:

fpm run --flag '-I/usr/include' --runner-args=' -np 3'

topo, A_el, rhs_el should all be private in the OpenMP parallel region.
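Something along these lines (a sketch of the directive only; the loop body stays as it is):

   !$omp parallel do private(topo, A_el, rhs_el)
   do el = 1, nelt
      call getElem(el, k, topo, A_el, rhs_el)
      ! ... same packing of eltptr/eltvar/A_elt and rhs as in your code ...
   end do
   !$omp end parallel do

Note also that neighbouring elements share a node, so the update of the shared rhs(topo(j)) may additionally need an !$omp atomic update (or a reduction) to be race-free.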

Did you test your code with and without OpenMP enabled and compare the results?

2 Likes

Thank you PierU,

I have declared those variables as private and timed the results.

with

fpm run --flag '-I/usr/include' --runner-args=' -np 1'

the cpu_time is 2.7 s

with

fpm run --flag '-I/usr/include' --runner-args=' -np 3'

I get three cpu_times of (3.6 s, 3.454 s and 3.661 s).

It seems my code is improperly structured, such that it runs 3 times because of the use of OpenMPI. The MUMPS solver, however, depends on OpenMPI.

I will need to research how to use OpenMPI and OpenMP in the same program, without unnecessary duplicated work.
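One building block I came across for mixing the two is to request the MPI threading level explicitly at start-up, instead of a plain MPI_INIT (just a sketch of the init call, probably not the whole story):

   use mpi_f08, only: MPI_Init_thread, MPI_THREAD_FUNNELED
   integer :: provided

   ! OpenMP threads allowed, but only the master thread makes MPI calls
   call MPI_Init_thread(MPI_THREAD_FUNNELED, provided)
   if (provided < MPI_THREAD_FUNNELED) error stop "insufficient MPI threading support"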

OpenMPI and OpenMP are best for different solution approaches.
Are you hoping to use a direct or iterative solver for your FEM application?

If it is a direct solver where you assemble the structure stiffness matrix, reduce this matrix to upper triangular, then solve for multiple load conditions based on the initial stiffness matrix, then OpenMP is the way to go. With a direct solver, all threads need access to the structure stiffness matrix, which can be a very large array (many Gigabytes), so OpenMP shared memory is a good option.
If you are modifying the structure stiffness matrix after each load case (non-linear solution), then an iterative solver may offer some benefit, using OpenMPI. In general, MPI solutions have longer run times for more complex problem definitions.

Whatever you do, you should confirm that the results you are achieving with multiple threads or cores are comparable to the single thread/core results, allowing for round-off errors. Multi-threaded solutions will differ slightly but should agree to about 12 significant figures. Confirming the accuracy of the solution is very important. This can be difficult for more complex non-linear problem definitions.

1 Like

Indeed, looking at your code, I don’t see a data-splitting strategy at any point. This means that when you run with, say, mpirun -n 3 ..., you are not running a parallel execution of your problem; you are running the same problem 3 times.

I would strongly advise that before going any further, you get yourself acquainted with Flynn's taxonomy - Wikipedia and terms such as SPMD, SIMD, SIMT, MISD, etc.

In order to profit from a distributed parallelism framework, you have to actually split the data that the program will ingest before calling the solver. In FEM codes, one would typically split the mesh such that each process assembles its local matrix. You could also decide to keep the mesh together, assemble only on “process 0”, and then split the matrix arithmetically between processes. I wouldn’t recommend the latter for any serious, intensive-use code though.
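For instance, the simplest split is to give each MPI rank a contiguous block of elements and let it assemble only those. A sketch, reusing the variable names of your test (how the local pieces are then fed to the solver depends on the input mode you choose, e.g. gathering on the host for MUMPS’ centralized elemental input, or switching to its distributed assembled input; check the user guide for which combinations are allowed):

   use mpi_f08, only: MPI_COMM_WORLD, MPI_Comm_rank, MPI_Comm_size
   integer :: rank, nprocs, chunk, el_lo, el_hi

   call MPI_Comm_rank(MPI_COMM_WORLD, rank)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs)

   chunk = (nelt + nprocs - 1) / nprocs      ! elements per rank (the last rank may get fewer)
   el_lo = rank*chunk + 1
   el_hi = min((rank + 1)*chunk, nelt)

   do el = el_lo, el_hi                      ! each rank assembles only its own elements
      call getElem(el, k, topo, A_el, rhs_el)
      ! ... pack into this rank's local eltptr/eltvar/A_elt/rhs ...
   end do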

Here is a highly recommended reference: High-Performance Computing: Dos and Don’ts | IntechOpen

1 Like