# Periodic boundary conditions for stencil algorithms

What is your preferred approach for implementing periodic boundary conditions in stencil-based algorithms?

• Halo cells / ghost nodes
• cshift(u, 1, dim=1)
• ip= i + 1; if (ip > N) ip = 1
• ip = mod(i,N) + 1
0 voters

If any of the methods is not clear, let me know and I will add an example to this post.

2 Likes

So far, my own preferred approach, was to use the `mod` function.

The benefits are:

• itâ€™s simple,
• no extra memory is needed,
• no extra step needed to fill halos
• compatible with canned ODE solvers

On the downside, Iâ€™ve realized the indirect access hampers the compiler when it comes to generating vector instructions:

``````
! Using remainder
subroutine dxx1(n,u,uxx,h)
integer, intent(in) :: n
real, intent(in) :: u(n), h
real, intent(out) :: uxx(n)

integer :: i
real :: a

a = 1.0/h**2

do i = 1, n
ip = mod(i,n) + 1
im = mod(n+i-2,n) + 1
uxx(i) = a * (u(ip) - 2*u(i) + u(im))     ! <-- indirect access
end do

end subroutine

! Using halos
subroutine dxx2(n,u,uxx,h)
integer, intent(in) :: n
real, intent(in) :: u(0:n+1), h
real, intent(out) :: uxx(0:n+1)
integer :: i
real :: a

a = 1.0/h**2

do i = 1, n
uxx(i) = a * (u(i+1) - 2*u(i) + u(i-1))    ! <-- vectorizable
end do

end subroutine
``````

Itâ€™s also possible to update the boundary cells separately, but itâ€™s not very practical:

``````! First and last element are dealt with separately,
uxx(1) = a * (u(2) - 2*u(1) + u(n))
! so the inner part can be easily vectorized
do i = 2, n-1
uxx(i) = a * (u(i+1) - 2*u(i) + u(i-1))    ! <-- vectorizable
end do
uxx(n) = a * (u(1) - 2*u(n) + u(n-1))``````

If youâ€™re using MPI, youâ€™ll probably want halos anyways for halo exchange. If youâ€™re not using MPI, your problemâ€™s probably small enough that the choice doesnâ€™t really matter.

2 Likes

Pretty similar to what @stillyslalom said. For me most of the times the halo infrastructure is already there so itâ€™s more convenient to use it to implement periodic BCs.

In my humble opinion, it strongly depends if you are going to use a domain decomposition strategy at some point down the implementation. In that case, you would handle periodic conditions with some communication strategy and the overall weight of the operation wonâ€™t be too different from what happens in interior domains.

In any case, Iâ€™m going to implement some stencil operations soon and I wanted to ask here for a bunch of suggestions, do you mind if we take this thread as a general topic for stencil operations?

(Edit: started the post before seeing that two people basically said the same thing, sry for replication)

I was expecting to hear that the halo approach is the one most compatible with MPI parallelization. Do you have any tips how you deal with haloâ€™s, especially for stencils which require access beyond the nearest neighbours?

A simple 2D halo-packing routine looks something like this:

``````   subroutine periodic_halo(nx,ny,con)
integer, intent(in) :: nx, ny
real(wp), intent(inout) :: con(0:nx+1,0:ny+1)

! Fill corners
con(0,0) = con(nx,ny)
con(Nx+1,0) = con(1,ny)
con(0,Ny+1) = con(nx,1)
con(Nx+1,Ny+1) = con(1,1)

! Fill east and west sides
con(0,1:ny) = con(nx,1:ny)
con(Nx+1,1:ny) = con(1,1:ny)

! Fill north and south sides
con(1:nx,0) = con(1:nx,Ny)
con(1:nx,ny+1) = con(1:nx,1)

end subroutine
``````

This brings me to another question:

• Is there any ODE library (like Sundials) which is aware of halos, or does everyone implement their own time integrator?

I have nothing against it; if we end up too far away from boundary conditions, it can be split later.

The only code that Iâ€™ve worked with that uses high order stencil schemes (CROCO-ocean, which is based on ROMS if I recall correctly) adjusts the size of the halo in order to have all the data necessary for the stencil. Basically, you set the halo width as a variable and write the MPI exchanges for a possible variable halo width. I had to implement this procedure inside NEMO-ocean.
I still have to dig into the paper (see section 5.2 and 5.3) and code, but the latest implementation of oceananigans.jl uses a similar approach, it gathers a halo that is very large and integrates for several time steps, each one reducing the size of the halo, until the interior domain is reached. There, a new halo exchange is performed. To my recollection, that code is the only one adopting this strategy in the ocean modelling community, and the procedure is only adopted in the so called â€śbarotropic sub steppingâ€ť which is basically a 2D shallow water model inside a 3D code.

my next project is to implement the suit of Arakawa grids, in order to run some multilayer shallow water model. Handle a domain decomposition with MPI and provide basic stencil operations to operate over such grids. These basic operations should also be extremely fast to compute, possibly multithreading or to GPU. All of this should be quite standard, so hereâ€™s where I want to challenge you all: I would like for stencil operations to be nested: take as an example advection of velocity u along x, this is usually computed as u_i ( (u_{i} - u_{i-1})/dx + (u_{i+1} - u_{i})/dx ) /2, that is a combination of average and differentiation. I would like to write a function `x_avg` and a function `x_diff` such that the previous operation could be as easy as `u * x_avg(x_diff(u))`. I donâ€™t have a precise idea on how to do that or even if it is possible, but after discretizing an anisotropic diffusion operator this is my wildest dream.

Below you can see what I did in NEMO, `is,js,ie,je` are indexes to the first and last points in the interior of the domain. I use the suffix `int,ext` to address the interior and exterior fields that are exchanged (i.e. the interior field of some domain goes into the exterior part of the neighboring domain), for left, right, bottom and top. I can draw a sketch if that is not clear.

``````   SUBROUTINE MPI_collect_r2l( kt, field_in, ni, nj, field_out, is, js, ie, je, hold)
!!----------------------------------------------------------------------
!!                    ***  ROUTINE MPI_collect_r2l  ***
!!
!! ** Purpose :  Collect information from RIGHT to LEFT (r2l)
!!
!! ** Method  :
!!
!! ** Action  :
!!
!!-----------------------------------------------------------------------
INTEGER(i4),                              INTENT(in   ) :: kt
INTEGER(i4),                              INTENT(in   ) :: ni, nj
REAL(wp),   DIMENSION( ni, nj),           INTENT(in   ) :: field_in
LOGICAL,                                  INTENT(in   ) :: hold
INTEGER(i4),                              INTENT(in   ) :: is
INTEGER(i4),                              INTENT(in   ) :: js
INTEGER(i4),                              INTENT(in   ) :: ie
INTEGER(i4),                              INTENT(in   ) :: je
REAL(wp),   DIMENSION( ie-is+1, je-js+1), INTENT(  out) :: field_out
REAL(wp),   DIMENSION( ie-is+1, je-js+1)                :: lint, rext
!
! MPI variables
INTEGER(i4) :: ierror, status, tag, my_rank
!
lint = 0._wp
rext = 0._wp
field_out = 0._wp
!
! Copy array into contiguous memory space to avoid MPI problems
!
lint = field_in( is : ie, js : je)
tag = 65464
!
! Get my rank and do the corresponding job
CALL MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
!
! East to West
!
IF ( MOD(my_rank, jpni) .ne.      0 ) CALL MPI_send(lint, size(lint), MPI_double_precision, my_rank-1,         tag, MPI_COMM_WORLD, ierror)
IF ( MOD(my_rank, jpni) .ne. jpni-1 ) CALL MPI_Recv(rext, size(rext), MPI_double_precision, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierror)
!
! Safely wait
!
IF ( hold ) call MPI_BARRIER(MPI_COMM_WORLD, ierror)
field_out = rext
!
END SUBROUTINE MPI_collect_r2l

SUBROUTINE MPI_collect_l2r( kt, field_in, ni, nj, field_out, is, js, ie, je, hold)
!!----------------------------------------------------------------------
!!                    ***  ROUTINE MPI_collect_l2r  ***
!!
!! ** Purpose :  Collect information from LEFT to RIGHT (l2r)
!!
!! ** Method  :
!!
!! ** Action  :
!!
!!-----------------------------------------------------------------------
INTEGER(i4),                              INTENT(in   ) :: kt
INTEGER(i4),                              INTENT(in   ) :: ni, nj
REAL(wp),   DIMENSION( ni, nj),           INTENT(in   ) :: field_in
LOGICAL,                                  INTENT(in   ) :: hold
INTEGER(i4),                              INTENT(in   ) :: is
INTEGER(i4),                              INTENT(in   ) :: js
INTEGER(i4),                              INTENT(in   ) :: ie
INTEGER(i4),                              INTENT(in   ) :: je
REAL(wp),   DIMENSION( ie-is+1, je-js+1), INTENT(  out) :: field_out
REAL(wp),   DIMENSION( ie-is+1, je-js+1)                :: rint, lext
!
! MPI variables
INTEGER(i4) :: ierror, status, tag, my_rank
!
rint = 0._wp
lext = 0._wp
!
! Copy array into contiguous memory space to avoid MPI problems
!
rint = field_in( is : ie, js : je )
tag = 65464
!
! Get my rank and do the corresponding job
CALL MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
!
! West to East
!
IF ( MOD(my_rank, jpni) .ne. jpni-1 ) CALL MPI_send(rint, size(rint), MPI_double_precision, my_rank+1,         tag, MPI_COMM_WORLD, ierror)
IF ( MOD(my_rank, jpni) .ne.      0 ) CALL MPI_Recv(lext, size(lext), MPI_double_precision, my_rank-1, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierror)
!
! Safely wait
!
IF ( hold ) call MPI_BARRIER(MPI_COMM_WORLD, ierror)
field_out = lext
!
END SUBROUTINE MPI_collect_l2r

SUBROUTINE MPI_collect_t2b( kt, field_in, ni, nj, field_out, is, js, ie, je, hold)
!!----------------------------------------------------------------------
!!                    ***  ROUTINE MPI_collect_t2b  ***
!!
!! ** Purpose :  Collect information from TOP to BOTTOM (t2b)
!!
!! ** Method  :
!!
!! ** Action  :
!!
!!-----------------------------------------------------------------------
INTEGER(i4),                              INTENT(in   ) :: kt
INTEGER(i4),                              INTENT(in   ) :: ni, nj
REAL(wp),   DIMENSION( ni, nj),           INTENT(in   ) :: field_in
LOGICAL,                                  INTENT(in   ) :: hold
INTEGER(i4),                              INTENT(in   ) :: is
INTEGER(i4),                              INTENT(in   ) :: js
INTEGER(i4),                              INTENT(in   ) :: ie
INTEGER(i4),                              INTENT(in   ) :: je
REAL(wp),   DIMENSION( ie-is+1, je-js+1), INTENT(  out) :: field_out
REAL(wp),   DIMENSION( ie-is+1, je-js+1)                :: bint, text
!
! MPI variables
INTEGER :: ierror, status, tag, my_rank
!
bint = 0._wp
text = 0._wp
!
! Copy array into contiguous memory space to avoid MPI problems
!
bint = field_in( is : ie, js : je)
tag = 65464
!
! Get my rank and do the corresponding job
CALL MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
!
! North to South
!
IF ( INT(my_rank/jpni)    .ne.    0 ) CALL MPI_send(bint, size(bint), MPI_double_precision, my_rank-jpni,         tag, MPI_COMM_WORLD, ierror)
IF ( INT(my_rank/jpni) +1 .ne. jpnj ) CALL MPI_Recv(text, size(text), MPI_double_precision, my_rank+jpni, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierror)
!
! Safely wait
!
IF ( hold ) call MPI_BARRIER(MPI_COMM_WORLD, ierror)
field_out = text
!
END SUBROUTINE MPI_collect_t2b

SUBROUTINE MPI_collect_b2t( kt, field_in, ni, nj, field_out, is, js, ie, je, hold)
!!----------------------------------------------------------------------
!!                    ***  ROUTINE MPI_collect_b2t  ***
!!
!! ** Purpose :  Collect information from BOTTOM to TOP (b2t)
!!
!! ** Method  :
!!
!! ** Action  :
!!
!!-----------------------------------------------------------------------
INTEGER(i4),                              INTENT(in   ) :: kt
INTEGER(i4),                              INTENT(in   ) :: ni, nj
REAL(wp),   DIMENSION( ni, nj),           INTENT(in   ) :: field_in
LOGICAL,                                  INTENT(in   ) :: hold
INTEGER(i4),                              INTENT(in   ) :: is
INTEGER(i4),                              INTENT(in   ) :: js
INTEGER(i4),                              INTENT(in   ) :: ie
INTEGER(i4),                              INTENT(in   ) :: je
REAL(wp),   DIMENSION( ie-is+1, je-js+1), INTENT(  out) :: field_out
REAL(wp),   DIMENSION( ie-is+1, je-js+1)                :: tint, bext
!
! MPI variables
INTEGER :: ierror, status, tag, my_rank
!
tint = 0._wp
bext = 0._wp
!
! Copy array into contiguous memory space to avoid MPI problems
!
tint = field_in( is : ie, js : je)
tag = 65464
!
! Get my rank and do the corresponding job
CALL MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
!
! South to North
!
IF ( INT(my_rank/jpni) +1 .ne. jpnj ) CALL MPI_send(tint, size(tint), MPI_double_precision, my_rank+jpni,         tag, MPI_COMM_WORLD, ierror);
IF ( INT(my_rank/jpni)    .ne.    0 ) CALL MPI_Recv(bext, size(bext), MPI_double_precision, my_rank-jpni, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierror)
!
! Safely wait
!
IF ( hold ) call MPI_BARRIER(MPI_COMM_WORLD, ierror)
field_out = bext
!
END SUBROUTINE MPI_collect_b2t

``````

I am familiar only with the third option:

ip = i + 1
if (ip > N) ip = 1

Halo cells are something I heard of in MPI programming.
Donâ€™t know the other two.

I have always just use wider halo regions. 2 or 3 cells is enough for the fourth order. For higher orders one may want to use compact differences anyway. Or completely different approaches (spectral, DG).

I use my own integrator. Firstly, it is really a small part of the code, adjusting everything for a library may easily be more complicated, and secondly, because fractional-step method for incompressible flow is not just a system for ODEs, but there is this DAE part as well, but I do not see how that is relevant. The integrator does not care about the halos at all. In the ODE part it just requires that the halo regions be updated. It is true that for implicit methods you need to make the evaluation aware of the right boundary conditions, but that is true for all boundary conditions, not just the periodic ones.

I found some interesting experiments with stencil and halo syntax for Fortran in the following two works:

The first work would be relevant to the discussion on Fortran Programmers : How do you want to offload to GPU accelerators in the next 5 years?

``````! Envisioned stencil syntax
pure CONCURRENT subroutine Laplacian(U)}
real, HALO(:,:) :: U
U(0,0) = U(0,+1) &
+ U(-1,0) - 4*U(0, 0) + U(+1,0) &
+ U(0,-1)
end subroutine Laplacian

! Declaration syntax including halo
real, allocatable, dimension(:,:), codimension[:,:] &
HALO(1:*:1,1:*:1) :: U

! Launching the kernel
do while (.not. converged)
do concurrent (i=1:M, j=1:N) [[device]]
call Laplacian( U(i,j)[device] )
end do
call HALO_TRANSFER(U, BC=CYCLIC)
end do
``````
1 Like

Itâ€™s a pity we donâ€™t see more such experimentation in Fortran.

In Python there are a few stencil generators. For example using pystencils, one can do the following:

``````import pystencils as ps

# 1. field definitions
src_field, dst_field = ps.fields("src, dst:[2D]")

# 2. define update rule
update_rule = ps.Assignment(lhs=dst_field[0, 0],
rhs=(src_field[1, 0] + src_field[-1, 0] +
src_field[0, 1] + src_field[0, -1] -
4*src_field[0,0]))

config_opts = {
"default_number_int": "int32",
"cpu_openmp": True,
"ghost_layers": 1,
"function_name": "laplace_kernel"
}

# 3. compile update rule to function
kernel_ast = ps.create_kernel(update_rule,
config=ps.CreateKernelConfig(**config_opts))

# 4. Show generate code
ps.show_code(kernel_ast)
``````

When executed this emits the following kernel (Iâ€™ve folded the output to be narrower):

``````\$ python laplace.py
FUNC_PREFIX void laplace_kernel(double * RESTRICT  _data_dst,
double * RESTRICT const _data_src,
int64_t const _size_dst_0,
int64_t const _size_dst_1,
int64_t const _stride_dst_0,
int64_t const _stride_dst_1,
int64_t const _stride_src_0,
int64_t const _stride_src_1)
{
#pragma omp parallel
{
#pragma omp for schedule(static)
for (int64_t ctr_0 = 1; ctr_0 < _size_dst_0 - 1; ctr_0 += 1)
{
for (int64_t ctr_1 = 1; ctr_1 < _size_dst_1 - 1; ctr_1 += 1)
{
_data_dst[_stride_dst_0*ctr_0 + _stride_dst_1*ctr_1] =
-4.0*_data_src[_stride_src_0*ctr_0 + _stride_src_1*ctr_1]
+ _data_src[_stride_src_0*ctr_0 + _stride_src_0 + _stride_src_1*ctr_1]
+ _data_src[_stride_src_0*ctr_0 + _stride_src_1*ctr_1 + _stride_src_1]
+ _data_src[_stride_src_0*ctr_0 + _stride_src_1*ctr_1 - _stride_src_1]
+ _data_src[_stride_src_0*ctr_0 - _stride_src_0 + _stride_src_1*ctr_1];
}
}
}
}
``````

To make this callable from Fortran, all that needs to be done is:

``````#include <ISO_Fortran_binding.h>

#define FUNC_PREFIX __attribute__ ((noinline))
#define RESTRICT __restrict

// ... <place kernel code here> ...

#define size(a,d) a->dim[d].extent
#define stride(a,d) (a->dim[d].sm / a->elem_len)

void laplace(CFI_cdesc_t *src, CFI_cdesc_t *dst)
{
laplace_kernel(data(dst),data(src),
size(dst,0),size(dst,1),stride(dst,0),stride(dst,1),
stride(src,0),stride(src,1));
}
``````

Probably one could also adapt pystencils to generate the glue code automatically or even print Fortran directly. The downside of having the kernel in C is that options like `-fcheck=bounds` wonâ€™t work, although one could insert some debugging assertions.

Disclaimer 1: I only took a glimpse at the first paper
Disclaimer 2: Stream of consciousness here

I truly do not understand that piece of code though. I understand that they want to have a specific syntax to generate a region around a given array, but then how is this region used?

IMHO, the two problems of stencil operations (from the efficiency point of view) are

• How to organize the stencil operation in order to access the memory efficiently (example: is it possible to write something better than the classic 5 point stencil applied at each point? Can strides be used efficiently to perform the stencil â€ścolumn-wiseâ€ť and thus access the memory better?)
• How to organize the halo (if on the same array or detached) so to optimize the transfer across nodes (possibly overlapping the communication of the halo with the interior computation, doable with halos in a separate variable)

Can this be organized correctly in a simple derived datatype?