Domain decomposition with the Hivemind

Hello everyone.

First of all, let me thank everyone for this incredible community, which makes me learn something new every day about my computational mother tongue.

In the last six months I have worked a lot with MPI, handling data exchange in a domain-decomposition environment implemented with a Z-pencil strategy. Specifically, I use the code NEMO to run some idealized ocean simulations. This code has a finite-difference core and employs a one-point halo/overlap/ghost-cell region to facilitate the computation of the finite differences. My task in the past months was to collect extra points in this halo region in order to compute operations with larger support (in my case, a Gaussian filter and a discrete fast wavelet transform).

To do so, I have implemented two kinds of routines. The first kind, in several variants, enlarges my halo region by a given number h of points in each direction, meaning that from an n \times m rectangle I can go to an (n+2h) \times (m+2h) rectangle. The second kind communicates only one side of the operation: right to left, left to right, top to bottom, or bottom to top. I know that I could construct the first kind of routine from the second, but I developed the second kind because it was incorrect to apply the first kind for the wavelet transform (details upon request). A sketch of what I mean by the second kind is given below.
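To give an idea, here is a minimal sketch of a single-side exchange (interior points travelling left to right, filling the left halo of the right neighbour) using MPI_Sendrecv. The names, the array layout, and the use of mpi_f08 are illustrative assumptions, not my actual NEMO routines:

```fortran
! Illustrative sketch only: a "second kind" routine that communicates a single
! side (interior points travel left -> right, filling the right neighbour's
! left halo). Array layout, names and the width h are assumptions.
subroutine exchange_left_to_right(field, h, left, right, comm)
   use mpi_f08
   implicit none
   real,           intent(inout) :: field(:,:)   ! (1:ni+2h, 1:nj+2h), halo included
   integer,        intent(in)    :: h            ! halo width for this operation
   integer,        intent(in)    :: left, right  ! neighbour ranks (MPI_PROC_NULL at a boundary)
   type(MPI_Comm), intent(in)    :: comm

   integer :: ni, nj
   real, allocatable :: sendbuf(:,:), recvbuf(:,:)
   type(MPI_Status)  :: status

   ni = size(field, 1) - 2*h
   nj = size(field, 2)

   ! Pack the h rightmost interior columns (columns ni+1 .. ni+h, since
   ! columns 1..h are the left halo); they become the right neighbour's left halo.
   sendbuf = field(ni+1:ni+h, :)
   allocate(recvbuf(h, nj))

   call MPI_Sendrecv(sendbuf, size(sendbuf), MPI_REAL, right, 0, &
                     recvbuf, size(recvbuf), MPI_REAL, left,  0, &
                     comm, status)

   ! Unpack into my left halo only if a neighbour actually sent something.
   if (left /= MPI_PROC_NULL) field(1:h, :) = recvbuf
end subroutine exchange_left_to_right
```

In this spirit, the first kind of routine would just apply the same pattern to all four sides (plus the corners) to grow the halo by h everywhere.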

I was in a hurry, so my routines are definitely not optimal, and I have learned MPI with a sink-or-swim approach: if it works and it is not slower than the sequential version, it is a win. Now I would like to gather the collective intelligence of this hivemind regarding domain-decomposition handling, given the following problem:

[Figure: North-Sea-proxy — a domain with irregular boundaries tiled by equal-size square MPI subdomains]

Consider this geometry as a non-trivial example of a domain with irregular boundaries. Each square is a subdomain handled with MPI. Each of these subdomains has the same dimensions, say N_i \times N_j \times N_k, where i denotes longitude, j latitude, and k the vertical. The MPI parallelization is done only in i and j, and you can assume that these two correspond to the first and second dimensions of a Fortran array. Inside each subdomain there will be room for OpenMP, OpenACC, Fortran do concurrent, or CUDA kernels for basic parallel operations on the array. An overlap of h points is given, where h can be tuned by the operator according to their needs. The end goal would be to solve an elliptic PDE on this domain, so at some point there is also a linear system to solve, but how to solve it is not the main discussion.
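To make the setting concrete, this is roughly the process-grid bookkeeping I picture, sketched with MPI's built-in Cartesian topology. All names are made up, and a masked geometry like the one above would additionally need land-only subdomains to be handled (NEMO, for instance, can eliminate land-only subdomains), so this is only a starting point:

```fortran
! Sketch, not production code: build a 2D (i,j) process grid and find the
! four neighbours of each rank. All names are illustrative.
program decomp_sketch
   use mpi_f08
   implicit none
   integer        :: nranks, myrank
   integer        :: dims(2), coords(2)
   logical        :: periods(2)
   integer        :: west, east, south, north
   type(MPI_Comm) :: cart_comm

   call MPI_Init()
   call MPI_Comm_size(MPI_COMM_WORLD, nranks)

   dims    = 0              ! let MPI factor nranks into an npi x npj grid
   periods = .false.        ! closed boundaries, no wrap-around
   call MPI_Dims_create(nranks, 2, dims)
   call MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, .true., cart_comm)

   call MPI_Comm_rank(cart_comm, myrank)
   call MPI_Cart_coords(cart_comm, myrank, 2, coords)

   ! Neighbour ranks; MPI_PROC_NULL is returned at the edge of the domain,
   ! which turns the corresponding halo exchange into a no-op.
   call MPI_Cart_shift(cart_comm, 0, 1, west,  east)   ! i (longitude)
   call MPI_Cart_shift(cart_comm, 1, 1, south, north)  ! j (latitude)

   print '(a,i0,a,2i4,a,4i6)', 'rank ', myrank, ' at (i,j)=', coords, &
         ' neighbours (W,E,S,N)=', west, east, south, north

   call MPI_Finalize()
end program decomp_sketch
```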

I would like to discuss how you would approach the management of multiple subdomains: their connections, their numbering, what a good data structure for them would be, and so on. I know my base software, NEMO, quite well, and I am starting to look into CROCO. I have read Milan's book almost from cover to cover, so I have an idea of the general picture, as he implemented a similar thing with a regular domain and Fortran coarrays. However, I would like to have a discussion with people more experienced than me regarding this task, their experience, and their workarounds. No lines of code are expected, but rather a discussion of strategies for an efficient handling of the problem. Feel free to share papers on the topic.

Ocean and climate modeling is one of the fields where Fortran is still the main language; even when it comes to porting to GPU, most of the effort goes into adapting existing Fortran codes. So I hope this topic will interest you and pose a design challenge. On my side, I will be very happy to read every contribution, summarize the replies from time to time, and come back with more specific topics.

In the best-case scenario, I would like to summarize at some point what the Fortran community regards as best practice for such a task. (Needless to say, whatever comes out of this discussion is not going to be monetized.)

I look forward to reading your thoughts.


I have not read your post yet, but this article might be of some use (as a first guess 🙂):

Gabriel Rodriguez-Canal, Nick Brown, Tim Dykes, Jessica R. Jones, and Utz-Uwe Haus, "Fortran High-Level Synthesis: Reducing the Barriers to Accelerating HPC Codes on FPGAs", EPCC, The University of Edinburgh / HPC/AI EMEA Research Lab, Hewlett Packard Enterprise, arXiv:2308.13274 [cs.DC], 25 Aug 2023.

As far as data structures go, I would look at the CFD literature related to cut-cell Cartesian and immersed boundary methods; I think some of it discusses data structures. One issue is how the underlying grid was generated: is it structured (as in a Cartesian or tensor-product grid), or is it a collection of unstructured hex or quad cells? Also, do you plan on using Adaptive Mesh Refinement (AMR)? If so, you want to look at Marsha Berger's work on block-structured AMR. If the grid is structured, the biggest issue is usually defining the connectivity between blocks and/or individual cells. My decomposition of the problem would be a set of data structures (derived types) that define the following: the global domain, the subblocks in the domain, the cells in the blocks, and the connectivity (both physical and in terms of parallel communication). A possible sketch is given below.
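To make that concrete, here is a rough sketch of what those derived types could look like; every name and component is purely illustrative, not a prescription:

```fortran
! Illustrative sketch of the hierarchy suggested above: global domain ->
! blocks -> connectivity. All names and components are assumptions.
module decomp_types
   implicit none

   ! One face-to-face connection between two blocks (used for halo exchange).
   type :: connection_t
      integer :: neighbour_block = -1     ! global block id (-1: physical boundary)
      integer :: neighbour_rank  = -1     ! MPI rank owning that block
      integer :: side            = 0      ! 1=W, 2=E, 3=S, 4=N, for example
   end type connection_t

   ! A single structured block (subdomain) owned by one rank.
   type :: block_t
      integer :: id                        ! global block number
      integer :: ni, nj, nk                ! interior extents
      integer :: h                         ! halo width currently allocated
      integer :: ilo_glob, jlo_glob        ! offset of this block in the global grid
      logical :: is_land = .false.         ! masked-out block (no computation)
      type(connection_t) :: conn(4)        ! W/E/S/N connectivity
      real, allocatable  :: field(:,:,:)   ! allocated as (1-h:ni+h, 1-h:nj+h, nk)
   end type block_t

   ! The global picture: how many blocks, and who owns which one.
   type :: domain_t
      integer :: ni_glob, nj_glob, nk_glob
      integer :: nblocks
      integer, allocatable :: owner(:)         ! MPI rank of each global block
      type(block_t), allocatable :: blocks(:)  ! blocks local to this rank
   end type domain_t

end module decomp_types
```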

Although you stated that solvers are not your current interest, I think you should look at the following book.

Note: some people use the phrase "domain decomposition" to refer to mesh partitioning, but they are NOT the same thing (although one leads to the other).

Perhaps this article will be useful to you.

If the mesh topology is going to be multi-block structured, some data structures to handle it are presented in the Handbook of Grid Generation: https://www.amazon.it/Handbook-Grid-Generation-Joe-Thompson/dp/0849326877/ref=tmm_hrd_swatch_0?_encoding=UTF8&qid=&sr=
The 2nd edition linked above is a well-known reference for multi-block structured meshing.

I will revive this post, as I will soon start modernizing a code with the aim of making it MPI-aware. The code uses a Cartesian three-dimensional grid and performs stencil operations on it. I will follow @milancurcic's worked example for the tsunami simulator to design a derived data type (DDT). So I want to collect your suggestions on what this derived data type should provide. To me, the two main things needed are

  • halo exchange in non-blocking fashion through MPI (this is the requirement for my work)
  • I/O from and to NetCDF files, with automated rebuild (i.e. from np different output files, one for each subdomain, to one big file for the global domain)

but it can be extended:

  • possibly supporting, within the same DDT, other paradigms for the halo exchange (such as coarrays)
  • other file types would also be of interest (e.g. ParaView, ASCII, or whatever else)

The specific code will simulate a particular kind of ocean dynamics, but I think this could be of interest in other fields as well and could provide a nice tool to have in the shed for the finite-difference people.
Let me know what your wishlist for such a DDT would be; a rough sketch of the interface I have in mind is below.
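Purely as a conversation starter, here is a hypothetical sketch of the public interface I imagine for such a DDT. All names are placeholders, the bodies are stubs, and the NetCDF part would sit behind whatever I/O library ends up being used:

```fortran
! Hypothetical interface for the decomposed-field DDT; names are placeholders.
module decomposed_field_mod
   use mpi_f08
   implicit none
   private
   public :: decomposed_field_t

   type :: decomposed_field_t
      real, allocatable :: data(:,:,:)               ! local block + halo
      integer           :: h                         ! halo width
      integer           :: west, east, south, north  ! neighbour ranks
      type(MPI_Comm)    :: comm
      type(MPI_Request) :: requests(8)               ! pending non-blocking exchanges
   contains
      procedure :: start_halo_exchange    ! post MPI_Isend/MPI_Irecv pairs
      procedure :: finish_halo_exchange   ! MPI_Waitall + unpack halos
      procedure :: write_netcdf           ! one file per rank (or parallel I/O)
      procedure :: rebuild_netcdf         ! gather np files into one global file
   end type decomposed_field_t

contains

   subroutine start_halo_exchange(self)
      class(decomposed_field_t), intent(inout) :: self
      ! ... pack boundary strips, call MPI_Isend/MPI_Irecv for each side,
      !     store the handles in self%requests ...
   end subroutine start_halo_exchange

   subroutine finish_halo_exchange(self)
      class(decomposed_field_t), intent(inout) :: self
      ! ... MPI_Waitall on self%requests, then copy receive buffers into halos ...
   end subroutine finish_halo_exchange

   subroutine write_netcdf(self, filename)
      class(decomposed_field_t), intent(in) :: self
      character(*), intent(in) :: filename
      ! ... per-subdomain output; a coarray or parallel-NetCDF backend could
      !     slot in here without changing the interface ...
   end subroutine write_netcdf

   subroutine rebuild_netcdf(self, filenames, global_file)
      class(decomposed_field_t), intent(in) :: self
      character(*), intent(in) :: filenames(:), global_file
      ! ... stitch the np per-rank files back into one global-domain file ...
   end subroutine rebuild_netcdf

end module decomposed_field_mod
```

The split into start/finish calls is meant to allow overlapping computation on interior points with the communication, which is the main reason I want the exchange to be non-blocking.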
