Hello everyone.
First of all, let me thank everyone in this incredible community, which makes me learn something new every day about my computational mother tongue.
In the last 6 months I have worked a lot with MPI to handle data exchange in a domain-decomposition environment implemented with a Z-pencil strategy. Specifically, I use the NEMO code for some idealized ocean simulations. This code has a finite-difference core and employs a one-point halo (also called overlap or ghost cells) to facilitate the computation of the finite differences. My task in the past months has been to collect extra points in this halo in order to compute operations with larger support (in my case, a Gaussian filter and a discrete fast wavelet transform).
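For context, the baseline exchange I start from looks more or less like the sketch below (a minimal illustration, not the actual NEMO routine; the names and argument list are invented). With a one-point halo, any operator whose stencil is wider than one point needs either repeated exchanges or a wider halo, which is what motivated the routines described next.

```fortran
! Minimal sketch of a one-point halo exchange in the j (north-south)
! direction for a 2D slice fld(0:ni+1, 0:nj+1). rank_north/rank_south are
! MPI_PROC_NULL on closed boundaries, comm is the horizontal communicator.
! All names here are hypothetical.
subroutine exchange_halo_j(fld, ni, nj, rank_south, rank_north, comm)
   use mpi
   implicit none
   integer, intent(in)    :: ni, nj, rank_south, rank_north, comm
   real(8), intent(inout) :: fld(0:ni+1, 0:nj+1)
   integer :: ierr

   ! Send my northernmost interior row, receive into my southern halo row
   call MPI_Sendrecv(fld(:, nj), ni+2, MPI_DOUBLE_PRECISION, rank_north, 0, &
                     fld(:, 0),  ni+2, MPI_DOUBLE_PRECISION, rank_south, 0, &
                     comm, MPI_STATUS_IGNORE, ierr)
   ! Send my southernmost interior row, receive into my northern halo row
   call MPI_Sendrecv(fld(:, 1),    ni+2, MPI_DOUBLE_PRECISION, rank_south, 1, &
                     fld(:, nj+1), ni+2, MPI_DOUBLE_PRECISION, rank_north, 1, &
                     comm, MPI_STATUS_IGNORE, ierr)
end subroutine exchange_halo_j
```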
To do so, I have implemented two kinds of routines. The first kind, in several variants, enlarges my halo region by a given number h of points in each direction, so that an n \times m rectangle becomes an (n+2h) \times (m+2h) rectangle. The second kind communicates only one side of the operation: only right to left, or left to right, top to bottom, or bottom to top. I know that I could build the first kind out of the second, but I developed the second kind because applying the first kind to the wavelet transform was incorrect (details upon request). A sketch of what I mean by the second kind follows.
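Just to fix the vocabulary, here is a rough sketch of the second kind of routine (hypothetical names, simplified argument list; not my production code): a single shift of h points from west to east. The first kind can then be thought of as four such shifts plus the corner exchanges.

```fortran
! Sketch of the "one side only" exchange: every rank sends its h easternmost
! interior i-columns to its eastern neighbour, which stores them in its
! western halo. Closed boundaries are handled by MPI_PROC_NULL neighbours.
! All names are assumptions for illustration.
subroutine shift_west_to_east(fld, ni, nj, nk, h, rank_west, rank_east, comm)
   use mpi
   implicit none
   integer, intent(in)    :: ni, nj, nk, h, rank_west, rank_east, comm
   real(8), intent(inout) :: fld(1-h:ni+h, 1-h:nj+h, nk)
   real(8) :: sendbuf(h, 1-h:nj+h, nk), recvbuf(h, 1-h:nj+h, nk)
   integer :: ierr

   ! Pack my eastern interior strip (the i index is strided, hence the copy)
   sendbuf = fld(ni-h+1:ni, :, :)
   call MPI_Sendrecv(sendbuf, size(sendbuf), MPI_DOUBLE_PRECISION, rank_east, 0, &
                     recvbuf, size(recvbuf), MPI_DOUBLE_PRECISION, rank_west, 0, &
                     comm, MPI_STATUS_IGNORE, ierr)
   ! Unpack what my western neighbour sent into my western halo
   if (rank_west /= MPI_PROC_NULL) fld(1-h:0, :, :) = recvbuf
end subroutine shift_west_to_east
```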
I was in a hurry, so my routines are definitely not optimal, and I learned MPI with a sink-or-swim approach: if it works and it is not slower than the sequential code, it is a win. Now I would like to gather the collective intelligence of this hivemind on domain-decomposition handling, given the following problem:
Consider this geometry as a non-trivial example of a domain with irregular boundaries. Each square is a subdomain handled by an MPI rank. Each subdomain has the same dimensions, say N_i \times N_j \times N_k, where i is longitude, j latitude and k the vertical. The MPI parallelization is done only in i and j, and you can assume that these two correspond to the first and second dimensions of a Fortran array. Inside each subdomain there will be room for OpenMP, OpenACC, Fortran do concurrent or CUDA kernels for basic parallel operations on the array. An overlap of h points is given, where h can be tuned by the operator following her/his/their needs. The end goal is to solve an elliptic PDE on this domain, so at some point there is also a linear system to solve, but how to solve it is not the main discussion.
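Just to make the data-structure question below concrete, a per-rank derived type along these lines is what I currently picture (all names are hypothetical, and this is only a sketch to anchor the discussion, not a proposal):

```fortran
! One possible per-rank description of a tile of the irregular domain.
! Names and layout are assumptions, meant only to make the question concrete.
module subdomain_mod
   use mpi
   implicit none

   type :: subdomain_t
      integer :: ni, nj, nk     ! interior size N_i x N_j x N_k
      integer :: h              ! current halo width
      integer :: iglo, jglo     ! global (i,j) offset of the first interior point
      integer :: comm           ! communicator gathering only the active (ocean) ranks
      ! Cartesian neighbours; set to MPI_PROC_NULL where the tile touches a closed boundary
      integer :: rank_w, rank_e, rank_s, rank_n
      ! Corner neighbours, which matter as soon as h > 1 and the stencil is full
      integer :: rank_sw, rank_se, rank_nw, rank_ne
   end type subdomain_t
end module subdomain_mod
```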
I would like to discuss how you would approach the management of multiple subdomains: their connections, their numbering, what a good data structure for them would be, and so on. I know my base software, NEMO, quite well, and I am starting to look into CROCO. I have read Milan's book almost from cover to cover, so I have an idea of the general picture, as he implemented a similar thing with a regular domain and Fortran coarrays. However, I would like to have a discussion with people more experienced than me about this task, their experience and their workarounds. No lines of code are expected, but rather a discussion of strategies for handling the problem efficiently. Feel free to share papers on the topic.
Ocean and climate modeling is one of the fields where Fortran is still the main language; even when it comes to porting to GPU, most of the effort goes into adapting existing Fortran codes. I therefore hope this topic will interest you and pose a design challenge. On my side, I will be very happy to read every contribution and summarize them from time to time, and I will come back with more specific topics.
In the best-case scenario, I would like to summarize at some point what the Fortran community regards as best practice for this topic. (Needless to say, whatever comes out of this discussion is not going to be monetized.)
I look forward to reading your thoughts.