Swapping arrays during time-stepping

For time-stepping large PDE problems, it’s common to read from one array (the previous time step), and write to a second array (the result at the new time step). Afterward the arrays are swapped. This is also known as the “flip-flop” approach, or ping-ponging. The separate arrays also make the update step easier to parallelise (no race conditions).

What is your favourite implementation approach? (Scroll down for examples)

  • Separate arrays with parity flag
  • Array with extra dimension
  • Pointer swap
  • Overwrite old array
  • Other (describe in your reply)

Separate arrays with parity flag

real, allocatable :: f1(:), f2(:)
integer :: flipflop

! ... initialize f1

flipflop = 1

do step = 1, nsteps

    select case(flipflop)
    case(1)
        call sweep(fin=f1,fout=f2)
    case(2)
        call sweep(fin=f2,fout=f1)
    end select
    flipflop = 3 - flipflop

end do

If zero-based indexing is used, the parity flag can be inverted using

! 0-based indexing
flipflop = 1 - flipflop

Array with extra dimension

This one relies on array slicing or dummy-argument association. Integer indices are used to select the old and new slices of the array.

real, allocatable :: f(:,:)
integer :: tmp, old, new

allocate(f(n,2))

! ... initialize f
old = 1
new = 2

do step = 1, nsteps

    call sweep(fin=f(:,old),fout=f(:,new))

    tmp = old
    old = new
    new = tmp

end do

Alternatively, the integer flags can be calculated before the update like this:

    old = mod(step+1,2) + 1
    new = mod(step+2,2) + 1

or after the update, like this:

    old = 3 - old
    new = 3 - new

Pointer swap

real, allocatable, target :: f1(:), f2(:)
real, pointer :: fold(:), fnew(:), ftmp(:)

fold => f1
fnew => f2

do step = 1, nsteps

    call sweep(fin=fold,fout=fnew)

    ftmp => fold
    fold => fnew
    fnew => ftmp

end do

Overwrite old array

real, allocatable :: fold(:), fnew(:)

! ... allocate fold and fnew, initialize fold

do step = 1, nsteps
    call sweep(fin=fold,fout=fnew)
    fold = fnew
end do
4 Likes

My first pass is somewhere between overwrite and swap, as I like to be able to write something like this at a high level.

state = state + d_dt(state) * dt

For very large problems maybe this isn’t efficient, but I bet the compilers these days are pretty good at playing some tricks to optimize a statement like that.
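
For illustration, here is a minimal sketch of how that high-level update might sit inside a stepping loop, assuming d_dt is a pure array-valued function of the state (all names here are illustrative, not code from an actual solver):

real, allocatable :: state(:)
real :: dt
integer :: step, nsteps

! ... allocate and initialize state, set dt and nsteps

do step = 1, nsteps
    ! whole-array update; a temporary may be created for d_dt(state),
    ! but compilers can often fuse or elide it
    state = state + d_dt(state)*dt
end do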

Just for completeness, I’ve also seen this done in the following way:

do step = 1, nsteps/2
   call sweep(fin=f(:,old),fout=f(:,new))
   call sweep(fin=f(:,new),fout=f(:,old))
end do
2 Likes

For most problems, I prefer the pointer swap approach rather than managing indices or adding if/case branches. When using this method, I always add the contiguous attribute to the declaration of the pointer variables, in the expectation that this will benefit from compiler optimizations. Like this:

real, pointer, contiguous :: fold(:), fnew(:), ftmp(:)

For small-scale problems, such as one-dimensional ones, I use the overwrite-old-array approach.

The main other approach that is often better (assuming the arrays are small enough) is to use a higher order integrator that uses more than 2 state arrays. Once you are using a higher order integrator, you need to store a bunch of stages anyway, and the output is a linear combination of these stages, so it’s a somewhat moot point since you no longer have 2 arrays to swap.
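
As an illustration of that point, here is a minimal sketch of a two-stage (Heun) step with explicit stage arrays; d_dt is again an assumed right-hand-side function and all names are illustrative:

real, allocatable :: f(:), k1(:), k2(:)

! ... allocate and initialize f, set dt and nsteps

do step = 1, nsteps
    k1 = d_dt(f)               ! stage 1
    k2 = d_dt(f + dt*k1)       ! stage 2
    f  = f + 0.5*dt*(k1 + k2)  ! new state: a linear combination of the stages
end do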

The “flip-flop” or “ping-ponging” approach to time-stepping in large PDE (Partial Differential Equation) problems is indeed a common and effective strategy, especially in parallel computing environments. My preferred implementation approach would depend on a few factors, such as the specific PDE being solved, the computing environment (e.g., CPU vs. GPU), and the programming language being used. However, a general approach I favor involves a combination of clarity, efficiency, and adaptability to parallel computing. Here’s a broad outline:

  1. Use of High-Level Abstractions (when applicable): In environments where high-level abstractions are available and performance is not severely impacted (like Python with NumPy or Julia), using these can greatly simplify code and make it more readable. This is especially true for operations that are inherently parallel, like array operations.
  2. Explicit Buffer Swapping: Maintain two distinct buffers or arrays, one for the current time step and one for the next. After each time step, swap these buffers instead of copying data between them. This can be as simple as swapping pointers or references to the arrays, which is much more efficient than copying array contents (see the sketch after this list).
  3. Parallelism-Friendly Code: Write the update step in a way that is amenable to parallel execution. This usually means avoiding dependencies between different parts of the array during the update. For instance, if using a language like C++ with OpenMP or Python with Numba, you would ensure that the loop iterations are independent so they can be safely parallelized.
  4. Memory Considerations: Especially in GPU computing, but also relevant for CPU, consider how memory access patterns affect performance. Strive for coalesced memory accesses in GPUs and cache-friendly patterns in CPUs.
  5. Error Handling and Validation: Ensure that your implementation includes robust error handling and validation checks. This is crucial for debugging and for ensuring that parallelization does not introduce subtle bugs.
  6. Profiling and Optimization: Use profiling tools to understand where bottlenecks lie and optimize those parts of the code. This might involve different strategies for different architectures (e.g., SIMD optimizations for CPUs, warp-shuffle operations for GPUs).
  7. Scalability and Adaptability: Design the implementation to be scalable to different problem sizes and adaptable to different PDEs or boundary conditions. This often means abstracting the core computational routines from the specifics of the PDE being solved.
  8. Documentation and Comments: Maintain clear documentation and comments, especially for parts of the code that handle the complexities of parallelization and memory management.
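
As a concrete (hypothetical) Fortran sketch of points 2 and 3: a stencil update whose iterations are independent, parallelised here with OpenMP, followed by a pointer swap instead of a copy. The stencil, coefficient, and variable names are purely illustrative:

real, allocatable, target :: f1(:), f2(:)
real, pointer :: fold(:), fnew(:), ftmp(:)
real :: alpha
integer :: i, step, n, nsteps

! ... allocate f1 and f2 with size n, initialize f1, set alpha and nsteps

fold => f1
fnew => f2

do step = 1, nsteps

    !$omp parallel do
    do i = 2, n - 1
        ! each iteration writes only fnew(i): no race conditions
        fnew(i) = fold(i) + alpha*(fold(i-1) - 2*fold(i) + fold(i+1))
    end do
    !$omp end parallel do

    ! ... apply boundary conditions to fnew(1) and fnew(n)

    ! swap the references instead of copying the data
    ftmp => fold
    fold => fnew
    fnew => ftmp

end do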

In summary, my favorite approach balances performance with readability and maintainability, leveraging high-level abstractions when possible but also diving into lower-level optimizations where they are most beneficial. This approach should be adaptable to various types of PDEs and computing environments.

2 Likes

I use array overwriting, which for time stepping in large non-linear FEM problems is essentially free compared to the time required to solve the problem itself.

With a two-time-step approach it would, globally speaking, look like:

  1. M * dX = Residual( loads_{t+dt} , X_t, X_{t-dt} )
  2. X_{t+dt} = coef_1 * X_t + coef_2 * X_{t-dt} + coef_3 * dX
  3. X_{t-dt} = X_t
  4. X_t = X_{t+dt}

Where in (1) the matrix M can be linear or non-linear, so Newton-like iterations might be needed within. When parallelizing in an SPMD framework, with certain d.o.f. being shared across processes, one “just” needs to check that dX is synchronized at the end of (1); steps (2) to (4) can then be done without any regard to communication.
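
In code, the overwrite steps (2) to (4) might look roughly like this (a sketch only; x_prev, x_curr, x_new, dx, and the coefficients are placeholder names, and the solve in step (1) is omitted):

! step (1): solve M * dx = residual(loads, x_curr, x_prev)  (omitted here)

! step (2): new state as a linear combination
x_new = coef_1*x_curr + coef_2*x_prev + coef_3*dx

! steps (3) and (4): shift the time levels by overwriting
x_prev = x_curr
x_curr = x_new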

To add: in problems involving unitary dynamics it is also common to employ exponential integrators. In these cases I go with something like

dt = (tend - tstart)/real(nt - 1)
do it = 1, nt
  t = tstart + dt*(it - 1)
  tev = matmul(tev, matrix_exp(-cmplx_i*H(t)*dt))
enddo

This approach also requires no swapping.

In my main code I do the equivalent of the pointer swap, but with move_alloc(); I do not use pointers unless I really have to.

2 Likes

There are low-storage higher-order Runge-Kutta solvers available. One needs only one or maybe two extra arrays per variable, but one still has an array with the result of the stage and an array with the original value, and these need to be swapped somehow. It doesn’t matter whether it is 1st-order Euler or 5th-order RK; the problem remains the same.

So like this?

real, allocatable, dimension(:) :: ftmp, fold, fnew

call move_alloc(from=fold, to=ftmp)
call move_alloc(from=fnew, to=fold)
call move_alloc(from=ftmp, to=fnew)

Seems like the language could introduce a generic swap_alloc intrinsic function.

2 Likes

Like that, yes, I have a subroutine for that.
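
Presumably something along these lines (a sketch only; the actual routine, its name, and the ranks/types it handles may differ):

subroutine swap_alloc(a, b)
    real, allocatable, intent(inout) :: a(:), b(:)
    real, allocatable :: tmp(:)
    ! move_alloc transfers the allocations without copying any data
    call move_alloc(from=a, to=tmp)
    call move_alloc(from=b, to=a)
    call move_alloc(from=tmp, to=b)
end subroutine swap_alloc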

Yes, this. It’s as fast as swapping pointers, but it can result in faster code when operating on the arrays (depending on context), because they won’t have the target attribute and the optimizer shouldn’t have to worry about possible aliasing. That’s been my experience, fwiw.

2 Likes

Interesting point… I have seen several of them in real codes, and others are new to me. With simplicity and maintainability in mind, and assuming all of them provide reasonable performance, I would vote for overwriting arrays.

Nice thread @ivanpribec !

Thanks @manuel.arenaz. I also find it useful to have these kinds of application-specific patterns collected in one place.


A few more remarks on this topic:

  • The method suggested by @RonShepard is essentially loop unrolling (see PWR049 in the Codee Open Catalog). Personally I’d prefer to modify the loop counter like this:

    do step = 0, nsteps, 2
        call sweep(fin=f1,fout=f2)
        call sweep(fin=f2,fout=f1)
    end do
    
  • The various approaches can be combined. For example use of parity and pointers:

    do step = 0, nsteps
    
       if (modulo(step,2) == 0) then
          pold => p1
          pnew => p2
       else
          pold => p2
          pnew => p1
       end if
    
       call sweep(pold, pnew)
    end do
    
  • The approach with move_alloc also works perfectly fine with arrays mapped to OpenMP target devices, as well as with OpenACC and CUDA managed memory.

  • Swapping pointer arrays or swapping allocatable arrays using move_alloc is very similar in terms of the amount of instructions required. (Not that it really matters, because the bulk of work will be in the other subroutines within the loop body.)

In other programming languages swapping looks like this:

// C
void swap(double **a, double **b) // swaps the pointers!
{
    double *tmp = *a;
    *a = *b;
    *b = tmp;
}

// C++
#include <algorithm>
std::swap(a, b);

# Python (and also Julia)
a, b = b, a

% MATLAB
% https://www.mathworks.com/matlabcentral/answers/362680-how-to-swap-values-of-two-variables

% Option 1
[b, a] = deal(a,b);

% Option 2
[b, a] = swap(a,b)

function [b, a] = swap(a, b)
% This function has no body!

Edit, 05/04/2024: I added the link which helped me realize that doing two time steps is equivalent to loop unrolling.

This approach also works when you have multiple array swaps, e.g. when you are using a circular buffer workspace. It can be done with the last index of a multidimensional array, with array offsets, with pointers, or with allocatable arrays and move_alloc().
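
For example, a three-level circular rotation with move_alloc might look like this (all names are illustrative):

real, allocatable :: f_old(:), f_now(:), f_new(:), ftmp(:)

! rotate the three time levels; the storage of the oldest level is
! reused to receive the next result
call move_alloc(from=f_old, to=ftmp)
call move_alloc(from=f_now, to=f_old)
call move_alloc(from=f_new, to=f_now)
call move_alloc(from=ftmp,  to=f_new)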

Some care must be taken when using array offsets or pointers to hide the possible memory aliases. The compiler can produce the optimal code when it knows that aliases cannot occur. One way to do this is with a subroutine call, as above. Once inside the subroutine, the compiler knows that its dummy arguments cannot be aliased. I’m curious if the same semantics occurs when the subroutine call is replaced with an ASSOCIATE block? Can the compiler know that the associate block renames are not aliased?