Swapping arrays during time-stepping

For time-stepping large PDE problems, it’s common to read from one array (the previous time step), and write to a second array (the result at the new time step). Afterward the arrays are swapped. This is also known as the “flip-flop” approach, or ping-ponging. The separate arrays also make the update step easier to parallelise (no race conditions).

What is your favourite implementation approach? (Scroll down for examples)

  • Separate arrays with parity flag
  • Array with extra dimension
  • Pointer swap
  • Overwrite old array
  • Other (describe in your reply)
0 voters

Separate arrays with parity flag

real, allocatable :: f1(:), f2(:)
integer :: flipflop

! ... initialize f

flipflop = 1

do step = 1, nsteps

    select case(flipflop)
        call sweep(fin=f1,fout=f2)
        call sweep(fin=f2,fout=f1)
    end select
    flipflop = 3 - flipflop

end do

If zero-based indexing is used, the parity flag can be inverted using

! 0-based indexing
flipflop = 1 - flipflop

Array with extra dimension

This one relies on array slicing or dummy argument association. Integer flags are used to denote the old and new arrays.

real, allocatable :: f(:,:)
integer :: tmp, old, new


! ... initialize f
old = 1
new = 2

do step = 1, nsteps

    call sweep(fin=f(:,kold),fout=f(:,knew))

    tmp = old
    old = new
    new = tmp

end do

Alternatively, the integer flags can be calculated before the update like this:

    old = mod(step+1,2) + 1
    new = mod(step+2,2) + 1

or after the update, like this:

    old = 3 - old
    new = 3 - new

Pointer swap

real, allocatable, target :: f1(:), f2(:)
real, pointer :: fold(:), fnew(:), ftmp(:)

fold => f1
fnew => f2

do step = 1, nsteps

    call sweep(fin=fold,fout=fnew)

    ftmp => fold
    fold => fnew
    fnew => ftmp

end do

Overwrite old array

real, allocatable :: fold(:), fnew(:)

do step = 1, nsteps
    call sweep(fin=fold,fout=fnew)
    fold = fnew
end do

My first pass is somewhere between overwrite and swap, as I like to be able to write something like this at a high level.

state = state + d_dt(state) * dt

For very large problems maybe this isn’t efficient, but I bet the compilers these days are pretty good at playing some tricks to optimize a statement like that.

Just for completeness, I’ve also seen this done in the following way:

do step = 1, nsteps/2
   call sweep(fin=f(:,kold),fout=f(:,knew))
   call sweep(fin=f(:,knew),fout=f(:,kold))
end do



For most problems, I prefer the pointer swap approach rather than managing indexes or increasing if and case branches. When using this method, I always add a contiguous attribute to the declaration of the pointer variables to expect this will benefit from compiler optimizations. Thus, like that:

real, pointer, contiguous :: fold(:), fnew(:), ftmp(:)

For small-scale problems such as one-dimensional, I use the overwrite old array approach.

The main other approach that is often better (assuming the arrays are small enough) is to use a higher order integrator that uses more than 2 state arrays. Once you are using a higher order integrator, you need to store a bunch of stages anyway, and the output is a linear combination of these stages, so it’s a somewhat moot point since you no longer have 2 arrays to swap.

The “flip-flop” or “ping-ponging” approach to time-stepping in large PDE (Partial Differential Equation) problems is indeed a common and effective strategy, especially in parallel computing environments. My preferred implementation approach would depend on a few factors, such as the specific PDE being solved, the computing environment (e.g., CPU vs. GPU), and the programming language being used. However, a general approach I favor involves a combination of clarity, efficiency, and adaptability to parallel computing. Here’s a broad outline:

  1. Use of High-Level Abstractions (when applicable): In environments where high-level abstractions are available and performance is not severely impacted (like Python with NumPy or Julia), using these can greatly simplify code and make it more readable. This is especially true for operations that are inherently parallel, like array operations.
  2. Explicit Buffer Swapping: Maintain two distinct buffers or arrays, one for the current time step and one for the next. After each time step, swap these buffers instead of copying data between them. This can be as simple as swapping pointers or references to the arrays, which is much more efficient than copying array contents.
  3. Parallelism-Friendly Code: Write the update step in a way that is amenable to parallel execution. This usually means avoiding dependencies between different parts of the array during the update. For instance, if using a language like C++ with OpenMP or Python with Numba, you would ensure that the loop iterations are independent so they can be safely parallelized.
  4. Memory Considerations: Especially in GPU computing, but also relevant for CPU, consider how memory access patterns affect performance. Strive for coalesced memory accesses in GPUs and cache-friendly patterns in CPUs.
  5. Error Handling and Validation: Ensure that your implementation includes robust error handling and validation checks. This is crucial for debugging and for ensuring that parallelization does not introduce subtle bugs.
  6. Profiling and Optimization: Use profiling tools to understand where bottlenecks lie and optimize those parts of the code. This might involve different strategies for different architectures (e.g., SIMD optimizations for CPUs, warp-shuffle operations for GPUs).
  7. Scalability and Adaptability: Design the implementation to be scalable to different problem sizes and adaptable to different PDEs or boundary conditions. This often means abstracting the core computational routines from the specifics of the PDE being solved.
  8. Documentation and Comments: Maintain clear documentation and comments, especially for parts of the code that handle the complexities of parallelization and memory management.

In summary, my favorite approach balances performance with readability and maintainability, leveraging high-level abstractions when possible but also diving into lower-level optimizations where they are most beneficial. This approach should be adaptable to various types of PDEs and computing environments.


I use array overwriting which for time stepping in large non-linear FEM problems is just transparent compared to the time required for solving the problem.

With a two time steps approach it would globally speaking look like:

  1. M * dX = Residual( loads_{t+dt} , X_t, X_{t-dt} )
  2. X_{t+dt} = coef_1 * X_t + coef_2 * X_{t-dt} + coef_3 * dX
  3. X_{t-dt} = X_t
  4. X_t = X_{t+dt}

Where in (1) the matrix M can be linear or non-linear so Newton-like iterations might be needed within. When parallelizing in a SPMD framework, with certain d.o.f. being shared across processes, one “just” needs to check that dX is synchronized at the end of (1), (2)-to-(4) can be done without any regards to communication.

To add: in problems involving unitary dynamics it is also common to employ exponential integrators. In these cases I go with something like

dt = (tend - tstart)/real(nt - 1)
do it = 1, nt
  t = tstart + dt*(it - 1)
  tev = matmul(tev, matrix_exp(-cmplx_i*H(t)*dt))

This approach also requires no swapping.

In my main code I swap the pointers, but with move_alloc(), I do not use pointers unless I really have to.


There are low-storage higher order Runge-Kutta solvers available. One has one or maybe two extra array per variable, but one has an array with a result of the stage and the array with the original value. And you need to somehow swap these. It doesn’t matter if it i 1st order Euler, or 5th order RK, the problem still remains the same.

So like this?

real, allocatable, dimension(:) :: ftmp, fold, fnew

call move_alloc(from=fold, to=ftmp)
call move_alloc(from=fnew, to=fold)
call move_alloc(from=ftmp, to=fnew)

Seems like the language could introduce a generic swap_alloc intrinsic function.

1 Like

Like that, yes, I have a subroutine for that.

Yes, this. Fast like swapping pointers, but can result in faster code when operating on the arrays (depending on context) because they won’t have the target attribute and the optimizer shouldn’t have to worry about possible aliasing. That’s been my experience, fwiw.