Indeed, streaming in the classic version of LBM is implemented the way you've shown. Here's an excerpt from a separate subroutine of mine that does the same operation as yours:
!$omp workshare
fdst(:,:,0) = fsrc(:,:,0)          ! index 0 does not stream; it is copied as-is
!$omp end workshare

!$omp do schedule(static)
do x = 1, nx
   xp1 = mod(x, nx) + 1            ! periodic neighbour at x+1 (not used in this excerpt)
   xm1 = mod(nx + x - 2, nx) + 1   ! periodic neighbour at x-1
   do y = 1, ny
      fdst(y,x,1) = fsrc(y,xm1,1)  ! "pull" direction 1 from the x-1 neighbour
   end do
end do
!$omp end do
The only difference is that I’m using a non-local read (a “pull” pattern), while yours has a non-local write (a “push” pattern); I wonder whether this makes any measurable speed difference. I have also unrolled the lattice directions, so I’m working on one “spatial” slice of PDFs at a time. After streaming I swap the arrays using integer pointers, along the lines of the sketch below.
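For what it’s worth, here is a minimal self-contained sketch of what I mean by swapping via integer pointers; the array names, shapes, and the dummy stream routine are made up purely for this example:

program swap_demo
   implicit none
   ! Sizes chosen only for the example.
   integer, parameter :: nx = 8, ny = 8, nq = 9, nsteps = 10
   real, allocatable  :: f(:,:,:,:)      ! (y, x, direction, buffer)
   integer :: isrc, idst, itmp, step

   allocate(f(ny, nx, 0:nq-1, 2))
   f = 0.0

   isrc = 1        ! integer "pointers" selecting source/destination buffers
   idst = 2

   do step = 1, nsteps
      ! stream from the isrc buffer into the idst buffer
      call stream(f(:,:,:,isrc), f(:,:,:,idst))
      ! swap the roles of the two buffers instead of copying any data
      itmp = isrc
      isrc = idst
      idst = itmp
   end do

contains

   subroutine stream(fsrc, fdst)
      real, intent(in)  :: fsrc(:,:,0:)
      real, intent(out) :: fdst(:,:,0:)
      ! stand-in for the actual streaming step shown above
      fdst = fsrc
   end subroutine stream

end program swap_demo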
The subroutine in the original post is for a variation of LBM called DUGKS (Discrete Unified Gas Kinetic Scheme). The main idea is that, since the collision in LBM is discretized by a Crank-Nicolson scheme, the advective part should also be evaluated at the temporal midpoint t + \Delta t/2. However, this results in a method with quite different numerical properties. The biggest change is that the Courant number can be varied, which gives a bit more freedom in the choice of discretization parameters. On the other hand, the method requires more storage and more FLOPs, and hence achieves lower effective bandwidth (there is more work per time step). I’m interested to evaluate whether it pays off somehow.
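Roughly, the idea can be sketched as follows (a schematic only, assuming a BGK collision operator \Omega = -(f - f^{eq})/\tau and a finite-volume cell of volume |V|, with F^{n+1/2} denoting the advective flux through the cell faces at the half step):

f^{n+1} - f^{n} + \frac{\Delta t}{|V|} F^{n+1/2} = \frac{\Delta t}{2} \left( \Omega^{n+1} + \Omega^{n} \right)

The trapezoidal (Crank-Nicolson) collision term on the right is made explicit by tracking an auxiliary distribution \tilde{f} = f - \frac{\Delta t}{2} \Omega, while the flux has to be reconstructed at t + \Delta t/2, which is where the additional storage and work per step come in.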