Parallel Fortran Coarrays Take Longer CPU Time Than Serial Fortran

Hello everyone, I am new to parallel computing with Fortran and I just finished reading Modern Fortran by Milan Curcic. I managed to incorporate what I learned from the book into a personal CFD flow solver for the 2D Euler equations, and everything works correctly. The only problem is that when I run on more than 1 image, the clocked CPU time for each time step is much higher. It depends on the mesh size, but roughly speaking it is at least twice as long for every additional image.

As I am completely new to this, I am probably not doing it right. You can view the full solver here, but to save you time, the general algorithm structure is given in main.f90, while mod_solve.f90 contains all the heavy-lifting procedures. Running the shell script (./run.sh) from the project root runs the entire program. Of note, I am working with an unstructured grid, and because of that I allocate the full-sized coarrays for the necessary flow variables on each image, but each image only works on the tiled indices of the full array. Once done, all the updated values are gathered into the full array on image 1, and the cycle continues. I am not sure whether allocating the full-sized array on every image is slowing down the process, but I do not think it should significantly affect the compute time, since each image only works on a small subset of the full array anyway, and the allocation is only done once at the start of the program (before the time loop).

Any advice to improve the parallelisation will be greatly appreciated! Cheers!

Try measuring the elapsed wall-clock time (SYSTEM_CLOCK) for each time step.

For more than 1 image, you need to understand what CPU_TIME and SYSTEM_CLOCK are reporting.

Intel Document:

CPU_TIME

NOTE:

If you want to estimate performance or scaling of multithreaded applications, you should use intrinsic subroutine SYSTEM_CLOCK

SYSTEM_CLOCK

To get the elapsed time, you must call SYSTEM_CLOCK twice, and subtract the starting time value from the ending time value.
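
A minimal sketch of the two-call pattern (variable names are just for illustration):

integer :: t_start, t_end, count_rate
real :: elapsed_seconds

call system_clock(t_start, count_rate)   ! starting count and tick rate
! ... the work you want to time ...
call system_clock(t_end)                 ! ending count
elapsed_seconds = real(t_end - t_start) / real(count_rate)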

Thanks everyone for the suggestions. I have measured with system_clock, and the elapsed time is still higher, which makes sense: the iterations really are taking longer in wall-clock time with the parallel coarrays than without.

Serial

Elapsed time per iteration (cpu_time):   0.16091200000000150     
Elapsed time per iteration (system_clock):   0.16100000000000000     
Estimated time remaining (h:m:s):           11          10          19

Parallel (3 images)

 Elapsed time per iteration (cpu_time):   0.50230600000000036     
 Elapsed time per iteration (system_clock):   0.50200000000000000     
 Estimated time remaining (h:m:s):           34          52          53

That said, there may be an issue with the way I am parallelising the operations. I just want to verify: is it normal for the coarray to be allocated full-sized on each image, with each image operating on its tiled indices before gathering on image 1 (see example below)? It seems like there is a lot of overhead, but the example given in the book (Modern Fortran by Milan Curcic) works on a structured mesh, which makes it a lot easier to use a smaller-sized U (e.g. of size tiled_indices_end - tiled_indices_start + 1, plus halo points). I am not sure if this is the cause of the longer parallel compute times.

For example,

allocate(U(n_cells)) ! U field (unstructured mesh)
...
do n = 1, n_iterations
...
    do i = tiled_indices_start, tiled_indices_end
        ... ! operations with U
        U(i)[1] = U(i) ! gathering back to image 1
        sync all
    end do
...
end do

Welcome @obdwinston,

I haven’t tried to run your code, but it looks like there is lots of “fine-grained” communication going on in mod_solve.f90 – with puts and gets of small amounts of data inside loops, a bit like in your example above.

In general, I’d expect this to lead to poor performance. My experience with other codes is that it’s faster to minimise the number of communications (e.g. once per timestep, with a larger block of data in the communication). Compilers aren’t generally smart enough to do this for you.

I would expect this to be a bigger issue than the replication of data over all images (although one would avoid that too if trying to get great performance).

While coarrays provide very nice syntax, it’s probably best to view communications as expensive unless proven otherwise, and avoid the fine-grained approach.

These are just my Bayesian priors, and it would be better to profile your code to check.

Yeah, this is not an efficient way to handle the parallelism. You’re doing domain decomposition for the computations, but doing so on global arrays. The arrays themselves need to be decomposed and distributed across the images with each image only holding a piece of the global problem. This amounts to each image holding the mesh/data for a subdomain, plus a layer of ghost cells around the periphery to hold data the image needs for its calculations but are produced by other images working on neighboring subdomains. Images then communicate with neighboring images instead of all communication passing through image 1 as you’ve done.
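
As a rough sketch of that layout (n_local, n_ghost, and halo_exchange are hypothetical names, not from your code):

! each image allocates only its own subdomain plus a layer of ghost cells
allocate(U(n_local + n_ghost))

do n = 1, n_iterations
    call halo_exchange(U)   ! neighbour-to-neighbour puts/gets for the ghost cells, then sync
    do i = 1, n_local
        ! ... update U(i) using local values and ghost-cell values ...
    end do
end do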

All this is fairly complicated to implement, especially for an unstructured mesh. The virtue of what you’ve done is to avoid that complexity, but I’m not at all surprised that it performs poorly. And as someone else mentioned, you appear to be doing lots of fine-grained communication (i.e., single values) which I think probably is much less efficient than communicating blocks of data.

I’ll also add that in my experience coarrays perform very poorly (Intel and gfortran) for the type of communication needed in domain decomposition of PDEs. (NAG being an exception, where it is comparable to MPI.)

Thanks @gareth and @nncarlson for your comments. I was indeed trying to avoid unnecessary complexity with the coarrays, but I guess there is no free lunch. With regard to the fine-grained communication, I understand that it refers to the frequent, small data transfers between images. But even with halo points, such fine-grained communication will still be required, no? Albeit with a smaller-sized coarray. That said, do you also have any Fortran parallel computing resources or books that you can recommend? Perhaps on how to better design parallel programs?

You might consider taking a look at https://github.com/nncarlson/index-map and the finite volume example, which is closest to what you’re wanting to do. If you have questions feel free to message me directly.

You can also check out Parallel Programming with Co-arrays

With regard to the fine-grained communication, I understand that it refers to the frequent, small data transfers between images. But even with halo points, such fine-grained communication will still be required, no? Albeit with a smaller-sized coarray.

The fine-grained communication can be avoided by packing data to be communicated into contiguous communication buffers. Then you can do a few calls like

do i = 1, number_of_images_to_send_to
    ! Send all data that receive_image(i) requires
    ! from this_image, in a contiguous chunk
    receive_buffer(receive_start(i):receive_end(i))[receive_image(i)] = &
        send_buffer(send_start(i):send_end(i))
end do

Ideally you’d do this once per time-step (or at least, as infrequently as you can).

Notice this is not the same as looping over the indices to be sent. Although logically equivalent, with current compiler technology it seems better to do few, large communications.

There’s a bit of book-keeping work involved to pack and unpack the buffers to the right place.
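
Roughly, that book-keeping might look like this (send_index and recv_index are hypothetical index maps you would build once during setup):

! pack: copy the scattered cells this image must send into a contiguous buffer
do k = 1, n_send
    send_buffer(k) = U(send_index(k))
end do

! ... one large put per neighbouring image (as above), then sync ...

! unpack: copy the received values into the ghost cells of the local field
do k = 1, n_recv
    U(recv_index(k)) = receive_buffer(k)
end do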

I also concur with @nncarlson that the performance of coarray implementations can vary quite a bit. In our nested grid shallow water solver, I originally used coarrays to get it working, but later introduced options to use MPI instead, which is currently better on the intel clusters we use.

I’d suggest you write parallel code so that all the parallel communication steps are hidden from the logic of the main program (behind subroutines). Then you have all the coarray communication in one place, and it’s not so difficult to change it to MPI or something else later if needed.
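
For example, a minimal sketch of that idea (module and routine names are made up):

module mod_comms
    implicit none
contains
    subroutine exchange_halos(U)
        real, intent(inout) :: U(:)[*]
        ! all coarray puts/gets and syncs live in here, so the solver never
        ! touches communication directly; swapping this body for MPI calls
        ! later does not change main.f90
    end subroutine exchange_halos
end module mod_comms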

Here’s an example of the main parallel communication module in that shallow water solver I mentioned. You can see there’s preprocessing to allow coarrays or various flavours of MPI.

@nncarlson @hkvzjal Thanks for the suggestions, I will give these a look!

Ah okay, I get it; thanks for the clear, concise explanation. I could definitely cut down on the fine-grained communication with a buffer. But some of it is unavoidable because the next subroutine depends on the updated values.

Will do! Modularising the parallel logic definitely helps, like you mentioned, to switch between, say, CAF and MPI if needed. Thanks for the SWE code suggestion, I will take a look!

@gareth strangely, when I use buffers, the time per iteration increased :joy: The additional time is probably because I need to assign the values to the buffer first. You can track the changes here: Comparing main...buffers · obdwinston/Parallel-Fortran · GitHub

Before using buffer (recap):

is = tiled_indices_start
ie = tiled_indices_end
nt = time_step_iterations
allocate(U(n_cells)) ! U field (unstructured mesh)
...
do n = 1, nt
...
    do i = is, ie
        ... ! operations with U
        U(i)[1] = U(i) ! gathering back to image 1
        sync all
    end do
...
end do

 Elapsed time per iteration (cpu_time):    1.9869999999999610E-003
 Elapsed time per iteration (system_clock):    2.0000000000000000E-003
 Estimated time remaining (h:m:s):            0           0          47

After using buffer:

is = tiled_indices_start
ie = tiled_indices_end
nt = time_step_iterations
allocate(U(n_cells)) ! U field (unstructured mesh)
allocate(B(is:ie)) ! U field buffer
...
do n = 1, nt
...
    do i = is, ie
        ... ! operations with U buffer
        B(i) = ... ! assign result to U buffer
    end do
    U(is:ie)[1] = B(is:ie) ! gathering back to image 1
    sync all
...
end do

Elapsed time per iteration (cpu_time):    8.0100000000129512E-003
Elapsed time per iteration (system_clock):    8.0000000000000002E-003
Estimated time remaining (h:m:s):            0           0           0

Interesting that it’s even slower! Possible issues are

  • While you’ve introduced buffers, in some places you haven’t removed the previous communications (such as line 64 or 93 of mod_solve.f90). Try to remove such things (e.g. some cases could be eliminated by doing the addition on the receiving process, while others could be replaced with contiguous communication outside of a loop).
  • The buffers aren’t leading to contiguous communication. For example line 96 of mod_solve.f90 has a non-contiguous block in the communication (remember Fortran’s array memory order convention, x(i:j,:) is non-contiguous unless i:j is the entire first dimension). Depending on how smart your compiler is, it might split these up into lots of smaller communications.
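
For example, a small sketch of the contiguity point (x, lo, and hi are hypothetical):

real :: x(4, n_cells)[*]
...
! non-contiguous: only part of the first dimension, so the memory is strided
x(1:2, lo:hi)[1] = x(1:2, lo:hi)
! contiguous: the full first dimension over a range of the second
x(:, lo:hi)[1] = x(:, lo:hi)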

Maybe these things will improve the speed (or not, just a guess).

But irrespective, you’re unlikely to get good performance while all the communication is passing through image 1 alone.

I myself have experienced slow execution of coarray programs on CPUs for a few years now, using gfortran/OpenCoarrays as well as ifort (and now ifx). Earlier versions of these compilers showed much higher runtime performance for coarray programs on a CPU.

My assumption is that the low coarray performance is somehow related to MPICH.

Earlier versions of gfortran did not have the -fallow-argument-mismatch flag: https://groups.google.com/g/comp.lang.fortran/c/vITMz5e1nHQ .

When I run MPICH’s ./configure on Linux to build MPICH for an already-installed gfortran compiler, I see two outcomes regarding this flag:

  1. If I have gfortran 9 installed, MPICH’s ./configure runs without any problems.

  2. If I have gfortran 11 or a later version installed, MPICH’s ./configure aborts with the following error message:

mpich 4.0.1 ./configure:

checking whether gfortran allows mismatched arguments... yes, with -fallow-argument-mismatch
configure: error: The Fortran compiler gfortran does not accept programs that call the same routine with arguments of different types without the option -fallow-argument-mismatch. Rerun configure with FFLAGS=-fallow-argument-mismatch and FCFLAGS=-fallow-argument-mismatch

But whenever I run the MPICH configure with this -fallow-argument-mismatch flag, the end result is an installation that yields very low performance for coarray programs on a CPU.

The one setup that I still have on another computer is gfortran 9.2.1, MPICH 3.2.1 (later versions of MPICH also worked), and OpenCoarrays 2.8.0, just as proof that coarray programs can be executed with very high performance on a CPU. Older versions of ifort had the same high coarray performance on my computer.

Only the allocation of coarrays takes quite a long time (several seconds) with gfortran/OpenCoarrays, but once allocated, execution itself is very fast with the above setup.

regards

It’s been a few years since I tried using co-arrays on a typical commodity AMD/Intel 8- or 16-core processor. I gave up because of the dismal performance (but I will admit to being somewhat jaded by using co-arrays on Cray HPC systems, which have the hardware necessary to support PGAS-type programs with Cray’s compilers). I never investigated this, but I think both MPICH and openMPI can be built to default to using shared memory instead of the TCP/IP stack on multi-core shared-memory nodes. The MPI implementations on most large HPC systems I’ve used appeared to do shared-memory communication on a node and only used the switch/interconnect to reach off-node processes. It’s been a very long time since I built either MPICH or openMPI from scratch, and I wonder if the current default is to build for shared memory on multi-core processors. I think ifort at one time allowed you to specify whether you wanted to use shared memory for co-arrays, but don’t quote me on that. Again, most of my co-array experience is on Crays, which have the hardware to make co-arrays competitive with pure MPI. I just don’t know if that’s the case on your average workstation PC.

Edit

Also, I would think you would only need to use -fallow-argument-mismatch if you are using the legacy MPI include file (mpif.h) instead of the Fortran 2008 module (use mpi_f08). Unfortunately, because the standards committee appears to have no interest in defining a transportable module format, the mpi_f08 module is compiler specific, meaning you will have to build a separate version of MPI for each compiler if you want to use the mpi_f08 module.
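
For reference, a minimal sketch of the module route in user code (a hypothetical hello-world; the explicit, type-checked interfaces in mpi_f08 are also why the .mod file is compiler specific):

program hello_mpi
    use mpi_f08          ! explicit interfaces; ierror arguments are optional
    implicit none
    integer :: rank
    call MPI_Init()
    call MPI_Comm_rank(MPI_COMM_WORLD, rank)
    print '(a, i0)', 'Hello from rank ', rank
    call MPI_Finalize()
end program hello_mpi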

Thanks for giving coarrays a try and for reading my book. I’d love to be able to say that I’ve used coarrays more than the academic exercise in my book, but alas I have not. As a sanity check to ensure your software stack is set up correctly: are you able to run the tsunami example and see the speedup as you increase the number of images? For a sufficiently large domain (I think 1000x1000 should be big enough), you should be able to see close to a 4-times speedup when going from 1 to 4 images, for example. I haven’t done this since 2019, but back then it worked.

Hello Prof. Curcic,

I had a blast reading your book! It really made me appreciate Fortran more and helped me refactor/modularise my flow solver :grin: Unfortunately, I did not have much luck with Fortran coarrays due to their poor performance (I have since removed the repository for the parallel version because it ran slower than the serial one).

I tried running the final tsunami program from GitHub and got slower run times with increasing images. I’m not sure if it’s because I’m running on Apple silicon (M3 chip), which may not be optimised for CAF. The results are as follows:

Running on 1 image (1001 x 1001 grid):

./src/final/tsunami  17.65s user 5.06s system 96% cpu 23.601 total

Running on 2 images (1001 x 1001 grid):

cafrun -n 2 src/final/tsunami  39.83s user 7.12s system 184% cpu 25.426 total

Running on 4 images (1001 x 1001 grid):

cafrun -n 4 src/final/tsunami  269.98s user 8.16s system 390% cpu 1:11.26 total

Hi @gareth, thanks for looking through my code!

Ah, I never knew this! However, as per my reply to Prof. Curcic’s comment, I think there is perhaps an optimisation issue with OpenMPI and OpenCoarrays running on Apple silicon. For the final program in his book, I am getting much higher runtimes with more images.