Parallel Fortran Coarrays Longer CPU Time Than Serial Fortran

@gareth Strangely, the time per iteration increased when I switched to buffers :joy: The extra time is probably because I now have to assign the values to the buffer first. You can track the changes here: Comparing main...buffers · obdwinston/Parallel-Fortran · GitHub

Before using the buffer (recap):

is = tiled_indices_start
ie = tiled_indices_end
nt = time_step_iterations
allocate(U(n_cells)[*]) ! U field (unstructured mesh), allocated as a coarray so U(i)[1] is valid
...
do n = 1, nt
...
    do i = is, ie
        ... ! operations with U
        U(i)[1] = U(i) ! gathering back to image 1 (one remote put per cell)
        sync all ! synchronizes every image once per cell
    end do
...
end do
 Elapsed time per iteration (cpu_time):    1.9869999999999610E-003
 Elapsed time per iteration (system_clock):    2.0000000000000000E-003
 Estimated time remaining (h:m:s):            0           0          47

After using the buffer:

is = tiled_indices_start
ie = tiled_indices_end
nt = time_step_iterations
allocate(U(n_cells)[*]) ! U field (unstructured mesh), allocated as a coarray so U(is:ie)[1] is valid
allocate(B(is:ie)) ! U field buffer
...
do n = 1, nt
...
    do i = is, ie
        ... ! operations with U buffer
        B(i) = ... ! assign result to U buffer
    end do
    U(is:ie)[1] = B(is:ie) ! gathering back to image 1 (single array-section put per time step)
    sync all ! synchronizes every image once per time step
...
end do
Elapsed time per iteration (cpu_time):    8.0100000000129512E-003
Elapsed time per iteration (system_clock):    8.0000000000000002E-003
Estimated time remaining (h:m:s):            0           0           0
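
To check my guess that the buffer assignment is what costs the extra time, the three phases of the buffered version (fill, put, sync) could be timed separately. Below is a minimal, self-contained sketch of that idea, not the code from the repository: `n_cells`, `nt`, the even tiling, and the stand-in update `B(i) = i + n` are all assumptions made for illustration. Depending on the coarray implementation, the put may complete lazily, in which case part of its cost may show up under `sync all` instead.

```fortran
program time_buffer_gather
    ! Sketch only: n_cells, nt and the even tiling below are illustrative
    ! assumptions, not the values or decomposition used in the real solver.
    use iso_fortran_env, only: int64, real64
    implicit none

    integer, parameter :: n_cells = 100000, nt = 100
    real(real64), allocatable :: U(:)[:]   ! coarray, so U(...)[1] is valid
    real(real64), allocatable :: B(:)      ! local buffer for this image's tile
    integer :: is, ie, i, n, me, np, chunk
    integer(int64) :: t0, t1, t2, t3, rate
    real(real64) :: t_fill, t_put, t_sync

    me = this_image()
    np = num_images()

    ! Hypothetical even tiling of 1..n_cells across the images.
    chunk = (n_cells + np - 1) / np
    is = (me - 1) * chunk + 1
    ie = min(me * chunk, n_cells)

    allocate(U(n_cells)[*])
    allocate(B(is:ie))
    U = 0.0_real64
    sync all   ! make sure image 1 has finished zeroing U before any put

    call system_clock(count_rate=rate)
    t_fill = 0.0_real64
    t_put  = 0.0_real64
    t_sync = 0.0_real64

    do n = 1, nt
        call system_clock(t0)
        do i = is, ie
            B(i) = real(i + n, real64)   ! stand-in for the real update
        end do
        call system_clock(t1)
        U(is:ie)[1] = B(is:ie)           ! gather this tile back to image 1
        call system_clock(t2)
        sync all
        call system_clock(t3)

        t_fill = t_fill + real(t1 - t0, real64) / real(rate, real64)
        t_put  = t_put  + real(t2 - t1, real64) / real(rate, real64)
        t_sync = t_sync + real(t3 - t2, real64) / real(rate, real64)
    end do

    if (me == 1) then
        ! After the last sync, image 1 should hold every tile's final values.
        do i = 1, n_cells
            if (U(i) /= real(i + nt, real64)) error stop 'gather mismatch'
        end do
        print '(a)', 'gather ok'
        print '(a, es12.4)', 'buffer fill per iteration (s): ', t_fill / nt
        print '(a, es12.4)', 'remote put per iteration (s):  ', t_put / nt
        print '(a, es12.4)', 'sync all per iteration (s):    ', t_sync / nt
    end if
end program time_buffer_gather
```

The timings are printed for image 1 only; printing them from another image (or from all images) would show whether the imbalance sits in the waiting at `sync all` rather than in the buffer fill itself.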