Periodic boundary conditions for stencil algorithms

What is your preferred approach for implementing periodic boundary conditions in stencil-based algorithms?

  • Halo cells / ghost nodes
  • cshift(u, 1, dim=1)
  • ip= i + 1; if (ip > N) ip = 1
  • ip = mod(i,N) + 1
  • Other (explain in your reply)

If any of the methods is not clear, let me know and I will add an example to this post.


So far, my own preferred approach has been to use the mod function.

The benefits are:

  • it’s simple,
  • no extra memory is needed,
  • no extra step is needed to fill halos,
  • it’s compatible with canned ODE solvers.

On the downside, I’ve realized the indirect access hampers the compiler when it comes to generating vector instructions:


! Using remainder
subroutine dxx1(n,u,uxx,h)
integer, intent(in) :: n
real, intent(in) :: u(n), h
real, intent(out) :: uxx(n)

integer :: i, ip, im
real :: a

a = 1.0/h**2

do i = 1, n
   ip = mod(i,n) + 1
   im = mod(n+i-2,n) + 1
   uxx(i) = a * (u(ip) - 2*u(i) + u(im))     ! <-- indirect access
end do

end subroutine

! Using halos
subroutine dxx2(n,u,uxx,h)
integer, intent(in) :: n
real, intent(in) :: u(0:n+1), h
real, intent(out) :: uxx(0:n+1)
integer :: i
real :: a

a = 1.0/h**2

do i = 1, n
   uxx(i) = a * (u(i+1) - 2*u(i) + u(i-1))    ! <-- vectorizable
end do

end subroutine

It’s also possible to update the boundary cells separately, but it’s not very practical:

! The first and last elements are dealt with separately,
uxx(1) = a * (u(2) - 2*u(1) + u(n))
! so the inner part can be easily vectorized
do i = 2, n-1
   uxx(i) = a * (u(i+1) - 2*u(i) + u(i-1))    ! <-- vectorizable
end do
uxx(n) = a * (u(1) - 2*u(n) + u(n-1))

If you’re using MPI, you’ll probably want halos anyway for the halo exchange. If you’re not using MPI, your problem is probably small enough that the choice doesn’t really matter.


Pretty similar to what @stillyslalom said. For me, most of the time the halo infrastructure is already there, so it’s more convenient to use it to implement periodic BCs.

In my humble opinion, it strongly depends on whether you are going to use a domain decomposition strategy at some point down the line. In that case, you would handle the periodic conditions with some communication strategy, and the overall cost of the operation won’t be too different from what happens in the interior of the domain.

In any case, I’m going to implement some stencil operations soon and I wanted to ask here for a bunch of suggestions. Do you mind if we take this thread as a general topic for stencil operations?

(Edit: I started this post before seeing that two people had basically said the same thing, sorry for the repetition.)

I was expecting to hear that the halo approach is the one most compatible with MPI parallelization. Do you have any tips on how you deal with halos, especially for stencils which require access beyond the nearest neighbours?

A simple 2D halo-packing routine looks something like this:

   subroutine periodic_halo(nx,ny,con)
      integer, intent(in) :: nx, ny
      real(wp), intent(inout) :: con(0:nx+1,0:ny+1)

      ! Fill corners
      con(0,0) = con(nx,ny)
      con(nx+1,0) = con(1,ny)
      con(0,ny+1) = con(nx,1)
      con(nx+1,ny+1) = con(1,1)

      ! Fill east and west sides
      con(0,1:ny) = con(nx,1:ny)
      con(nx+1,1:ny) = con(1,1:ny)

      ! Fill north and south sides
      con(1:nx,0) = con(1:nx,ny)
      con(1:nx,ny+1) = con(1:nx,1)

   end subroutine
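
For wider stencils, I imagine the same routine could be generalized by making the halo width a parameter. A rough sketch (just my guess, assuming a hypothetical width nh; not code I actually use) could look like:

   ! Sketch only: periodic fill for a halo of width nh, so that stencils
   ! reaching nh cells past the boundary need no special cases.
   subroutine periodic_halo_wide(nx,ny,nh,con)
      integer, intent(in) :: nx, ny, nh
      real(wp), intent(inout) :: con(1-nh:nx+nh,1-nh:ny+nh)

      ! Fill east and west halos for the interior rows
      con(1-nh:0,    1:ny) = con(nx-nh+1:nx, 1:ny)
      con(nx+1:nx+nh,1:ny) = con(1:nh,       1:ny)

      ! Fill south and north halos over the full extended x-range,
      ! which copies the already-filled x-halos into the corner blocks
      con(:, 1-nh:0)     = con(:, ny-nh+1:ny)
      con(:, ny+1:ny+nh) = con(:, 1:nh)

   end subroutine

Filling the x-direction first and then the y-direction over the full extended range avoids writing the corner copies explicitly.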

This brings me to another question:

  • Is there any ODE library (like Sundials) which is aware of halos, or does everyone implement their own time integrator?

I have nothing against it; if we end up too far away from boundary conditions, it can be split later.

The only code that I’ve worked with that uses high-order stencil schemes (CROCO-ocean, which is based on ROMS if I recall correctly) adjusts the size of the halo in order to have all the data necessary for the stencil. Basically, you set the halo width as a variable and write the MPI exchanges for a variable halo width. I had to implement this procedure inside NEMO-ocean.
I still have to dig into the paper (see sections 5.2 and 5.3) and the code, but the latest implementation of Oceananigans.jl uses a similar approach: it gathers a very large halo and integrates for several time steps, each one reducing the size of the halo, until the interior domain is reached. At that point, a new halo exchange is performed. To my recollection, that code is the only one adopting this strategy in the ocean-modelling community, and the procedure is only used in the so-called “barotropic sub-stepping”, which is basically a 2D shallow-water model inside a 3D code.

Concerning the hijack of your thread:
my next project is to implement the suite of Arakawa grids in order to run a multilayer shallow-water model, handle the domain decomposition with MPI, and provide basic stencil operations over such grids. These basic operations should also be extremely fast to compute, possibly with multithreading or on GPU. All of this should be quite standard, so here’s where I want to challenge you all: I would like stencil operations to be composable (nested). Take as an example the advection of velocity u along x, which is usually computed as u_i * ( (u_i - u_{i-1})/dx + (u_{i+1} - u_i)/dx ) / 2, that is, a combination of averaging and differentiation. I would like to write a function x_avg and a function x_diff such that the previous operation could be written as easily as u * x_avg(x_diff(u)). I don’t have a precise idea of how to do that, or even whether it is possible, but after discretizing an anisotropic diffusion operator this is my wildest dream.
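
To make the idea a bit more concrete, here is a rough 1D sketch (the names x_diff and x_avg are the ones above; the one-cell-halo convention, the kinds, the extra dx argument and the interfaces are only my assumptions):

! Rough sketch only: composable 1D operators on arrays carrying one halo
! cell on each side (indices 0:n+1).  Conventions are assumptions, not an
! established design.
module stencil_ops
   implicit none
   integer, parameter :: wp = kind(1.0d0)
contains

   ! Forward difference, living at the staggered location i+1/2
   pure function x_diff(u, dx) result(du)
      real(wp), intent(in) :: u(0:), dx
      real(wp) :: du(0:size(u)-1)
      integer :: i
      du = 0.0_wp
      do i = 0, size(u) - 2
         du(i) = (u(i+1) - u(i)) / dx
      end do
   end function x_diff

   ! Average of two staggered values back to the cell centre i
   pure function x_avg(v) result(va)
      real(wp), intent(in) :: v(0:)
      real(wp) :: va(0:size(v)-1)
      integer :: i
      va = 0.0_wp
      do i = 1, size(v) - 2
         va(i) = 0.5_wp * (v(i) + v(i-1))
      end do
   end function x_avg

end module stencil_ops

With these, the interior points 1:n of the advection term come out as adv = u * x_avg(x_diff(u, dx)); the obvious worry with this style is the array temporaries it creates.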

Below you can see what I did in NEMO. is, js, ie, je are the indices of the first and last points in the interior of the domain. I use the suffixes int and ext to denote the interior and exterior fields that are exchanged (i.e. the interior field of one domain goes into the exterior part of the neighbouring domain), for left, right, bottom and top. I can draw a sketch if that is not clear.

   SUBROUTINE MPI_collect_r2l( kt, field_in, ni, nj, field_out, is, js, ie, je, hold)
      !!----------------------------------------------------------------------
      !!                    ***  ROUTINE MPI_collect_r2l  ***
      !!
      !! ** Purpose :  Collect information from RIGHT to LEFT (r2l) 
      !!
      !! ** Method  :   
      !!
      !! ** Action  :   
      !!
      !!-----------------------------------------------------------------------
      INTEGER(i4),                              INTENT(in   ) :: kt
      INTEGER(i4),                              INTENT(in   ) :: ni, nj
      REAL(wp),   DIMENSION( ni, nj),           INTENT(in   ) :: field_in
      LOGICAL,                                  INTENT(in   ) :: hold
      INTEGER(i4),                              INTENT(in   ) :: is
      INTEGER(i4),                              INTENT(in   ) :: js
      INTEGER(i4),                              INTENT(in   ) :: ie
      INTEGER(i4),                              INTENT(in   ) :: je
      REAL(wp),   DIMENSION( ie-is+1, je-js+1), INTENT(  out) :: field_out
      REAL(wp),   DIMENSION( ie-is+1, je-js+1)                :: lint, rext 
      !
      ! MPI variables
      INTEGER(i4) :: ierror, tag, my_rank
      INTEGER     :: status(MPI_STATUS_SIZE)   ! MPI_Recv expects a status array, not a scalar
      !
      lint = 0._wp
      rext = 0._wp
      field_out = 0._wp
      ! 
      ! Copy array into contiguous memory space to avoid MPI problems
      !
      lint = field_in( is : ie, js : je)
      tag = 65464
      !
      ! Get my rank and do the corresponding job
      CALL MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
      !
      ! East to West
      !
      IF ( MOD(my_rank, jpni) .ne.      0 ) CALL MPI_send(lint, size(lint), MPI_double_precision, my_rank-1,         tag, MPI_COMM_WORLD, ierror)
      IF ( MOD(my_rank, jpni) .ne. jpni-1 ) CALL MPI_Recv(rext, size(rext), MPI_double_precision, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierror)
      !
      ! Safely wait
      !
      IF ( hold ) call MPI_BARRIER(MPI_COMM_WORLD, ierror)
      field_out = rext
      !
   END SUBROUTINE MPI_collect_r2l 


   SUBROUTINE MPI_collect_l2r( kt, field_in, ni, nj, field_out, is, js, ie, je, hold)
      !!----------------------------------------------------------------------
      !!                    ***  ROUTINE MPI_collect_l2r  ***
      !!
      !! ** Purpose :  Collect information from LEFT to RIGHT (l2r) 
      !!
      !! ** Method  :   
      !!
      !! ** Action  :   
      !!
      !!-----------------------------------------------------------------------
      INTEGER(i4),                              INTENT(in   ) :: kt
      INTEGER(i4),                              INTENT(in   ) :: ni, nj
      REAL(wp),   DIMENSION( ni, nj),           INTENT(in   ) :: field_in
      LOGICAL,                                  INTENT(in   ) :: hold
      INTEGER(i4),                              INTENT(in   ) :: is
      INTEGER(i4),                              INTENT(in   ) :: js
      INTEGER(i4),                              INTENT(in   ) :: ie
      INTEGER(i4),                              INTENT(in   ) :: je
      REAL(wp),   DIMENSION( ie-is+1, je-js+1), INTENT(  out) :: field_out
      REAL(wp),   DIMENSION( ie-is+1, je-js+1)                :: rint, lext 
      !
      ! MPI variables
      INTEGER(i4) :: ierror, tag, my_rank
      INTEGER     :: status(MPI_STATUS_SIZE)   ! MPI_Recv expects a status array, not a scalar
      !
      rint = 0._wp
      lext = 0._wp
      ! 
      ! Copy array into contiguous memory space to avoid MPI problems
      !
      rint = field_in( is : ie, js : je )
      tag = 65464
      !
      ! Get my rank and do the corresponding job
      CALL MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
      !
      ! West to East
      !
      IF ( MOD(my_rank, jpni) .ne. jpni-1 ) CALL MPI_send(rint, size(rint), MPI_double_precision, my_rank+1,         tag, MPI_COMM_WORLD, ierror)
      IF ( MOD(my_rank, jpni) .ne.      0 ) CALL MPI_Recv(lext, size(lext), MPI_double_precision, my_rank-1, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierror)
      !
      ! Safely wait
      !
      IF ( hold ) call MPI_BARRIER(MPI_COMM_WORLD, ierror)
      field_out = lext
      !
   END SUBROUTINE MPI_collect_l2r 


   SUBROUTINE MPI_collect_t2b( kt, field_in, ni, nj, field_out, is, js, ie, je, hold)
      !!----------------------------------------------------------------------
      !!                    ***  ROUTINE MPI_collect_t2b  ***
      !!
      !! ** Purpose :  Collect information from TOP to BOTTOM (t2b) 
      !!
      !! ** Method  :   
      !!
      !! ** Action  :   
      !!
      !!-----------------------------------------------------------------------
      INTEGER(i4),                              INTENT(in   ) :: kt
      INTEGER(i4),                              INTENT(in   ) :: ni, nj
      REAL(wp),   DIMENSION( ni, nj),           INTENT(in   ) :: field_in
      LOGICAL,                                  INTENT(in   ) :: hold
      INTEGER(i4),                              INTENT(in   ) :: is
      INTEGER(i4),                              INTENT(in   ) :: js
      INTEGER(i4),                              INTENT(in   ) :: ie
      INTEGER(i4),                              INTENT(in   ) :: je
      REAL(wp),   DIMENSION( ie-is+1, je-js+1), INTENT(  out) :: field_out
      REAL(wp),   DIMENSION( ie-is+1, je-js+1)                :: bint, text 
      !
      ! MPI variables
      INTEGER :: ierror, tag, my_rank
      INTEGER :: status(MPI_STATUS_SIZE)   ! MPI_Recv expects a status array, not a scalar
      !
      bint = 0._wp
      text = 0._wp
      !
      ! Copy array into contiguous memory space to avoid MPI problems
      !
      bint = field_in( is : ie, js : je)
      tag = 65464
      !
      ! Get my rank and do the corresponding job
      CALL MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
      !
      ! North to South
      !
      IF ( INT(my_rank/jpni)    .ne.    0 ) CALL MPI_send(bint, size(bint), MPI_double_precision, my_rank-jpni,         tag, MPI_COMM_WORLD, ierror)
      IF ( INT(my_rank/jpni) +1 .ne. jpnj ) CALL MPI_Recv(text, size(text), MPI_double_precision, my_rank+jpni, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierror)
      !
      ! Safely wait
      !
      IF ( hold ) call MPI_BARRIER(MPI_COMM_WORLD, ierror)
      field_out = text
      !
   END SUBROUTINE MPI_collect_t2b


   SUBROUTINE MPI_collect_b2t( kt, field_in, ni, nj, field_out, is, js, ie, je, hold)
      !!----------------------------------------------------------------------
      !!                    ***  ROUTINE MPI_collect_b2t  ***
      !!
      !! ** Purpose :  Collect information from BOTTOM to TOP (b2t) 
      !!
      !! ** Method  :   
      !!
      !! ** Action  :   
      !!
      !!-----------------------------------------------------------------------
      INTEGER(i4),                              INTENT(in   ) :: kt
      INTEGER(i4),                              INTENT(in   ) :: ni, nj
      REAL(wp),   DIMENSION( ni, nj),           INTENT(in   ) :: field_in
      LOGICAL,                                  INTENT(in   ) :: hold
      INTEGER(i4),                              INTENT(in   ) :: is
      INTEGER(i4),                              INTENT(in   ) :: js
      INTEGER(i4),                              INTENT(in   ) :: ie
      INTEGER(i4),                              INTENT(in   ) :: je
      REAL(wp),   DIMENSION( ie-is+1, je-js+1), INTENT(  out) :: field_out
      REAL(wp),   DIMENSION( ie-is+1, je-js+1)                :: tint, bext 
      !
      ! MPI variables
      INTEGER :: ierror, tag, my_rank
      INTEGER :: status(MPI_STATUS_SIZE)   ! MPI_Recv expects a status array, not a scalar
      !
      tint = 0._wp
      bext = 0._wp
      !
      ! Copy array into contiguous memory space to avoid MPI problems
      !
      tint = field_in( is : ie, js : je)
      tag = 65464
      !
      ! Get my rank and do the corresponding job
      CALL MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
      !
      ! South to North
      !
      IF ( INT(my_rank/jpni) +1 .ne. jpnj ) CALL MPI_send(tint, size(tint), MPI_double_precision, my_rank+jpni,         tag, MPI_COMM_WORLD, ierror)
      IF ( INT(my_rank/jpni)    .ne.    0 ) CALL MPI_Recv(bext, size(bext), MPI_double_precision, my_rank-jpni, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierror)
      !
      ! Safely wait
      !
      IF ( hold ) call MPI_BARRIER(MPI_COMM_WORLD, ierror)
      field_out = bext
      !
   END SUBROUTINE MPI_collect_b2t

I am familiar only with the third option:

ip = i + 1
if (ip > N) ip = 1

Halo cells are something I have heard of in MPI programming.
I don’t know the other two.

I have always just used wider halo regions. Two or three cells are enough for fourth order. For higher orders one may want to use compact differences anyway, or completely different approaches (spectral, DG).
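
For example (my own illustration, not part of the original post): the standard fourth-order central stencil for the second derivative reaches two cells on each side, so a two-cell halo is enough.

! Illustration only: fourth-order second derivative, assuming a two-cell
! periodic halo u(-1:n+2) has already been filled.
subroutine dxx4(n,u,uxx,h)
   integer, intent(in) :: n
   real, intent(in) :: u(-1:n+2), h
   real, intent(out) :: uxx(n)
   integer :: i
   real :: a

   a = 1.0/(12.0*h**2)

   do i = 1, n
      uxx(i) = a * (-u(i-2) + 16*u(i-1) - 30*u(i) + 16*u(i+1) - u(i+2))
   end do

end subroutine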

I use my own integrator. Firstly, it is really a small part of the code, and adjusting everything for a library may easily be more complicated. Secondly, the fractional-step method for incompressible flow is not just a system of ODEs; there is a DAE part as well. But I do not see how that is relevant here: the integrator does not care about the halos at all. In the ODE part it just requires that the halo regions be updated. It is true that for implicit methods you need to make the evaluation aware of the right boundary conditions, but that is true for all boundary conditions, not just the periodic ones.

I found some interesting experiments with stencil and halo syntax for Fortran in the following two works:

The first work would be relevant to the discussion “Fortran Programmers: How do you want to offload to GPU accelerators in the next 5 years?”

! Envisioned stencil syntax
pure CONCURRENT subroutine Laplacian(U)
   real, HALO(:,:) :: U
   U(0,0) =            U(0,+1)           &
          + U(-1,0) - 4*U(0,0) + U(+1,0) &
          +            U(0,-1)
end subroutine Laplacian

! Declaration syntax including halo
real, allocatable, dimension(:,:), codimension[:,:], &
   HALO(1:*:1,1:*:1) :: U

! Launching the kernel
do while (.not. converged)
   do concurrent (i=1:M, j=1:N) [[device]]
      call Laplacian( U(i,j)[device] )
   end do
   call HALO_TRANSFER(U, BC=CYCLIC)
end do

It’s a pity we don’t see more such experimentation in Fortran.

In Python there are a few stencil generators. For example, using pystencils one can do the following:

import pystencils as ps

# 1. field definitions
src_field, dst_field = ps.fields("src, dst:[2D]")

# 2. define update rule
update_rule = ps.Assignment(lhs=dst_field[0, 0],
                             rhs=(src_field[1, 0] + src_field[-1, 0] +
                                  src_field[0, 1] + src_field[0, -1] - 
                                  4*src_field[0,0]))

config_opts = {
    "default_number_int": "int32",
    "cpu_openmp": True, 
    "ghost_layers": 1,
    "function_name": "laplace_kernel"
}

# 3. compile update rule to function
kernel_ast = ps.create_kernel(update_rule,
    config=ps.CreateKernelConfig(**config_opts))

# 4. Show generated code
ps.show_code(kernel_ast)

When executed, this emits the following kernel (I’ve folded the output to be narrower):

$ python laplace.py
FUNC_PREFIX void laplace_kernel(double * RESTRICT  _data_dst, 
                                double * RESTRICT const _data_src, 
                                int64_t const _size_dst_0, 
                                int64_t const _size_dst_1, 
                                int64_t const _stride_dst_0, 
                                int64_t const _stride_dst_1, 
                                int64_t const _stride_src_0, 
                                int64_t const _stride_src_1)
{
   #pragma omp parallel
   {
      #pragma omp for schedule(static)
      for (int64_t ctr_0 = 1; ctr_0 < _size_dst_0 - 1; ctr_0 += 1)
      {
         for (int64_t ctr_1 = 1; ctr_1 < _size_dst_1 - 1; ctr_1 += 1)
         {
            _data_dst[_stride_dst_0*ctr_0 + _stride_dst_1*ctr_1] = 
    -4.0*_data_src[_stride_src_0*ctr_0 + _stride_src_1*ctr_1] 
    + _data_src[_stride_src_0*ctr_0 + _stride_src_0 + _stride_src_1*ctr_1] 
    + _data_src[_stride_src_0*ctr_0 + _stride_src_1*ctr_1 + _stride_src_1] 
    + _data_src[_stride_src_0*ctr_0 + _stride_src_1*ctr_1 - _stride_src_1] 
    + _data_src[_stride_src_0*ctr_0 - _stride_src_0 + _stride_src_1*ctr_1];
         }
      }
   }
}

To make this callable from Fortran, all that needs to be done is:

#include <ISO_Fortran_binding.h>

#define FUNC_PREFIX __attribute__ ((noinline))
#define RESTRICT __restrict

// ... <place kernel code here> ...

#define data(a) ((double *) a->base_addr)
#define size(a,d) a->dim[d].extent
#define stride(a,d) (a->dim[d].sm / a->elem_len)

void laplace(CFI_cdesc_t *src, CFI_cdesc_t *dst) 
{
    laplace_kernel(data(dst),data(src),
            size(dst,0),size(dst,1),stride(dst,0),stride(dst,1),
            stride(src,0),stride(src,1));
}
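
On the Fortran side, the matching interface could look something like this (a sketch; the bind(c) name and the use of c_double are my assumptions):

! Sketch of the Fortran-side interface for the C wrapper above.  Assumed-shape
! dummies in a bind(c) interface are passed as CFI descriptors (Fortran 2018),
! which is what laplace() receives.
interface
   subroutine laplace(src, dst) bind(c, name="laplace")
      use, intrinsic :: iso_c_binding, only: c_double
      real(c_double), intent(in)    :: src(:,:)
      real(c_double), intent(inout) :: dst(:,:)   ! only the interior is written
   end subroutine laplace
end interface

so that, once the halo cells are filled, the call is simply call laplace(src, dst).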

Probably one could also adapt pystencils to generate the glue code automatically or even print Fortran directly. The downside of having the kernel in C is that options like -fcheck=bounds won’t work, although one could insert some debugging assertions.

Disclaimer 1: I only took a glimpse at the first paper
Disclaimer 2: Stream of consciousness here

I truly do not understand that piece of code though. I understand that they want to have a specific syntax to generate a region around a given array, but then how is this region used?

IMHO, the two problems of stencil operations (from the efficiency point of view) are:

  • How to organize the stencil operation in order to access memory efficiently (for example: is it possible to do better than applying the classic 5-point stencil at each point? Can strides be used to perform the stencil “column-wise” and thus get a better memory-access pattern?)
  • How to organize the halo (on the same array or detached) so as to optimize the transfer across nodes (possibly overlapping the halo communication with the interior computation, which is doable when the halos live in a separate variable)

Can this be organized correctly in a simple derived datatype?
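
As a first stab at the derived-type question (entirely my own guess, nothing established): keeping the interior and the halo buffers as separate components is the layout that makes it easiest to overlap communication with computation on the interior.

! Very rough sketch: a field with detached halo buffers, so the halo
! exchange (e.g. posting MPI_Isend/MPI_Irecv on these buffers) can be
! overlapped with computation on the interior.  All names are hypothetical;
! wp is the working-precision kind used elsewhere in this thread.
type :: field2d
   integer :: nx, ny                                  ! interior extents
   integer :: nh = 1                                  ! halo width
   real(wp), allocatable :: interior(:,:)             ! (nx, ny)
   real(wp), allocatable :: halo_w(:,:), halo_e(:,:)  ! (nh, ny)
   real(wp), allocatable :: halo_s(:,:), halo_n(:,:)  ! (nx, nh)
end type field2d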