I thought it would be instructive to consider a simple, but not entirely trivial, computational example and compare parallel implementations using coarrays and openmp (and MPI too).

The example repeatedly applies a Laplace smoothing operator (3-point, nearest-neighbor stencil) to 1D data. Here is the serial implementation:

```
program main
integer :: j, n, num_itr
character(16) :: arg
real, allocatable :: x(:), y(:)
call get_command_argument(1, arg)
read(arg,*) n
call get_command_argument(2, arg)
read(arg,*) num_itr
allocate(x(0:n+1), y(n))
call random_number(x(1:n)) ! initial state
x(0) = 0.0; x(n+1) = 0.0 ! fixed boundary values
do j = 1, num_itr
call apply_laplacian_smoothing(x, y)
x(1:n) = y
end do
write(*,'((i0,1x,g0))') (j, x(j), j=1,n)
contains
subroutine apply_laplacian_smoothing(x, y)
real, intent(in) :: x(0:)
real, intent(out) :: y(:)
integer :: j
real, parameter :: theta = 0.25
do j = 1, size(y)
y(j) = (1-2*theta)*x(j) + theta*(x(j-1) + x(j+1))
end do
end subroutine
end program
```

And here is my coarray implementation with interspersed comments to help explain what is going on:

```
program main
integer :: j, num_pts, num_itr
integer :: np, me, n, r, offset
character(16) :: arg
real, allocatable :: x(:), y(:)
real :: xleft[*], xright[*]
call get_command_argument(1, arg)
read(arg,*) num_pts
call get_command_argument(2, arg)
read(arg,*) num_itr
np = num_images()
me = this_image()
! Partition the points as equally as possible amongst the images.
n = num_pts / np
r = modulo(num_pts, np) ! extras
if (me <= r) n = n + 1 ! first r images get an extra point
! This image works on points offset+1 to offset+n (computed next),
! but the global index doesn't really matter; for the purposes of
! this image it is more convenient to use local indices 1 to n.
offset = n*(me-1)
if (me > r) offset = offset + r
! To compute the smoothed value at index j we need data at j-1 and j+1
! thus our data array needs to extend to local indices 0 and n+1. Those
! are our boundary points: either a partition boundary or global boundary.
allocate(x(0:n+1), y(n))
call random_number(x(1:n)) ! initial state
! Fixed boundary values
if (me == 1) x(0) = 0.0 ! first image has the left boundary point
if (me == np) x(n+1) = 0.0 ! last image has the right boundary point
do j = 1, num_itr
! Copy the value at my first and last point to coarrays
! for my neighbor images to retrieve.
if (me > 1) xleft = x(1)
if (me < np) xright = x(n)
! Set my partition boundary values using data from my neighbor images
sync all
if (me > 1) x(0) = xright[me-1]
if (me < np) x(n+1) = xleft[me+1]
call apply_laplacian_smoothing(x, y)
x(1:n) = y
end do
! Each image writes its values one at a time and in image order
if (me > 1) sync images (me-1)
write(*,'((i0,1x,g0))') (offset+j, x(j), j=1,n)
if (me < np) sync images (me+1)
contains
! This is exactly the same subroutine as in the serial version.
subroutine apply_laplacian_smoothing(x, y)
real, intent(in) :: x(0:)
real, intent(out) :: y(:)
integer :: j
real, parameter :: theta = 0.25
do j = 1, size(y)
y(j) = (1-2*theta)*x(j) + theta*(x(j-1) + x(j+1))
end do
end subroutine
end program
```

Note that the two programs donât give the same results because they start with different state; easily fixed by just assigning a constant value.

Can someone provide an openmp implementation (@adenchfi maybe?); Iâve only ever used openmp vicariously through students. I expect it to be far simpler than my coarray version in this simple case. I havenât given any thought yet to what an MPI version would be, but I expect it to be *much* more complex (syntactically at least) than the coarray version.