Fixing an OpenMP race condition

Indeed, the classic streaming version of LBM is implemented as you’ve shown. Here’s an excerpt from a separate subroutine of mine which does the same operation as yours:

      !$omp workshare
      fdst(:,:,0) = fsrc(:,:,0)
      !$omp end workshare

      !$omp do schedule(static)
      do x = 1, nx
         xp1 = mod(x, nx) + 1
         xm1 = mod(nx + x - 2, nx) + 1
         do y = 1, ny
            fdst(y,x,1) = fsrc(y,xm1,1)
         end do
      end do
      !$omp end do

The only difference is I’m using a non-local read, while yours has a non-local write. I wonder if this makes any speed difference. I have also unrolled the lattice directions, so I’m working on one “spatial” slice of PDF’s at a time. After streaming I swap the arrays using integer pointers.

The subroutine in the original post is for a variation of LBM called DUGKS (Discrete Unified Gas Kinetic Scheme). The main idea is that since the collision in LBM is discretized by a Crank-Nicholson scheme, the advective part should be evaluated at the temporal midpoint t + \Delta t/2 too. This results however in a method with quite different numerical properties. The biggest change is that the Courant number can be varied, which gives a bit more freedom in the choice of discretization parameters. However the method also requires more storage & more FLOPS, hence has lower bandwidth (more work per time step). I’m interested to evaluate whether it pays off somehow.

2 Likes