Coarray version of MPI_Alltoall

Happy Tuesday, Fortraners!
I am refactoring a legacy spaghetti code written in Fortran 77/90 into modern Fortran. The original code is parallelized with both MPI and OpenMP. I am trying to use as many “new features” as I can while modernizing the code, and my wish is to replace MPI with coarray Fortran.

The difficulty I am facing is that the original code involves many distributed matrix transpositions, so MPI_Alltoall is used extensively. I did some research and found that coarrays may have the potential to outperform a regular MPI_Alltoall by overlapping communications (Pekurovsky D., 2012 and Robert Fiedler, et al., 2013). I am not a computer science major, but from reading these papers my impression is that a co_alltoall is entirely possible, though it might require CS knowledge that I do not possess.

So here comes the question (or questions): Does a coarray version of MPI_Alltoall (for example, co_alltoall) already exist? I looked for an open-source solution but had no luck. If not, is it worth implementing a co_alltoall, how hard would it be, and where should I start?


I don’t know if there’s a library, but Robert Numrich tackles this problem in section 5.7 of his “Parallel Programming with Co-Arrays” book.


I happen to have a copy of this book but have only read the first four chapters! Guess it’s time to continue :)


@han190 , thanks for mentioning those references. I’m not familiar with them and should take a look.

Before you plunge into replacing MPI with coarrays, I would strongly advise that you create a representative test bed for your particular usage of MPI – focusing on MPI_Alltoall to start – and confirm that coarrays are competitive with MPI and portable across the compilers/platforms you use. My own experience is that coarrays are not even close to being competitive, for the most part. But your experience may be quite different.

My own test bed focused on MPI_Alltoallv used for a halo exchange. You might find the coarray replacements there instructive.
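As a starting point for such a test bed, here is a minimal sketch of a naive coarray all-to-all with a correctness check and a timing loop. The routine name `co_alltoall` is the hypothetical name used in this thread, not an existing intrinsic or library routine, and the block size `blk` and repeat count `nrep` are arbitrary illustrative values. It uses plain one-sided puts bracketed by `sync all`, so it does not implement the overlapped algorithms from the papers cited above; it only provides a baseline to compare against MPI_Alltoall on the same machine.

```fortran
module alltoall_sketch
  implicit none
contains
  ! Hypothetical helper, not part of the Fortran standard or of any library.
  ! Block j of sendbuf on image i ends up as block i of recvbuf on image j.
  subroutine co_alltoall(sendbuf, recvbuf, blk)
    real, intent(in)    :: sendbuf(:)      ! blk*num_images() elements
    real, intent(inout) :: recvbuf(:)[*]   ! blk*num_images() elements
    integer, intent(in) :: blk
    integer :: me, n, step, dest

    me = this_image()
    n  = num_images()
    sync all                               ! every recvbuf may now be overwritten
    do step = 0, n - 1
      dest = mod(me - 1 + step, n) + 1     ! stagger targets to spread traffic
      recvbuf((me-1)*blk+1 : me*blk)[dest] = sendbuf((dest-1)*blk+1 : dest*blk)
    end do
    sync all                               ! all one-sided puts have completed
  end subroutine co_alltoall
end module alltoall_sketch

program alltoall_testbed
  use alltoall_sketch
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: blk = 1024, nrep = 100   ! illustrative values only
  real, allocatable :: sendbuf(:), recvbuf(:)[:]
  integer :: me, n, j, rep
  integer(int64) :: t0, t1, rate

  me = this_image()
  n  = num_images()
  allocate(sendbuf(blk*n), recvbuf(blk*n)[*])

  ! Tag each block with its (source, destination) pair so the result can be checked.
  do j = 1, n
    sendbuf((j-1)*blk+1 : j*blk) = real(1000*me + j)
  end do

  call system_clock(t0, rate)
  do rep = 1, nrep
    call co_alltoall(sendbuf, recvbuf, blk)
  end do
  call system_clock(t1)

  ! Block j of recvbuf must have come from image j and been addressed to this image.
  do j = 1, n
    if (any(recvbuf((j-1)*blk+1 : j*blk) /= real(1000*j + me))) error stop "mismatch"
  end do
  if (me == 1) print *, "seconds per all-to-all:", real(t1 - t0) / real(rate) / nrep
end program alltoall_testbed
```

Running the same block size and image count through an equivalent MPI_Alltoall loop would give the comparison suggested above. The two `sync all` statements are the simplest safe choice; finer-grained synchronization is where any overlap gains would have to come from.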


@milancurcic Thanks again for directing me to the book! I will have to read chapters 9 and 10 thoroughly before I can draw any serious conclusions. BTW, is it possible for me to accept more than one reply as a solution? Currently I am only allowed to select one.

@nncarlson Yes, I read your post about coarrays and definitely understand your concern, but coarrays are a very attractive feature for me (and probably for many other domain scientists) because they are standard-conforming and the syntax is just elegant. Personally, I am inclined to use them as long as the performance penalty is acceptable. I will come back to the testing part once I have my co_alltoall implemented. Thank you for your suggestions!

alltoall is an expensive operation (it seems to be more or less an $\mathcal{O}(n^2)$ operation, where $n$ is the number of CPU cores involved), especially if it has to occur frequently.
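To make that count explicit, assume a naive implementation in which every ordered pair of distinct ranks exchanges exactly one message, while a scatter from a single root sends one message to each other rank (real MPI libraries may use smarter algorithms, so this is only a rough message count, not a timing model):

$$
M_{\text{alltoall}} = n(n-1) = \mathcal{O}(n^2), \qquad M_{\text{scatter}} = n-1 = \mathcal{O}(n).
$$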

What I mean is that I would usually have a rank 0 core, which I call the boss core. It tries its best to distribute an equal amount of work to each core, then each core does its job independently and finally sends its results back to the rank 0 core. In that case you only need things like scatter, scatterv, gather, gatherv, or broadcast, which are all $\mathcal{O}(n)$ operations, and there is no need for alltoall, which is expensive.
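In coarray terms, that boss/worker pattern could be sketched roughly as follows. This is only an illustration under stated assumptions: the chunk size and the "work" (a sum of squares) are made up, image 1 plays the role of the rank 0 boss, and the results are combined with the Fortran 2018 collective `co_sum`, which is a reduction (closer to MPI_Reduce than to MPI_Gather).

```fortran
program boss_worker_sketch
  implicit none
  integer, parameter :: chunk = 4      ! illustrative chunk size per image
  real    :: work(chunk)[*]            ! each image's share of the job
  real    :: total
  integer :: me, n, img, k

  me = this_image()
  n  = num_images()

  ! "Scatter": the boss image puts one chunk into every image, n puts in total.
  if (me == 1) then
    do img = 1, n
      work(:)[img] = [(real((img-1)*chunk + k), k = 1, chunk)]
    end do
  end if
  sync all                             ! workers wait until their chunk has arrived

  ! Each image works on its own chunk independently.
  total = sum(work**2)

  ! Combine the partial results on the boss image only.
  call co_sum(total, result_image=1)
  if (me == 1) print *, "sum of squares:", total
end program boss_worker_sketch
```

Whether this pattern applies depends on the algorithm: a distributed matrix transposition needs every image's data on every other image, which is exactly what the boss/worker layout avoids.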

So, my personal naïve opinion is that you may want to check whether alltoall is absolutely necessary. If you are able to use things like scatter, scatterv, or broadcast instead of alltoall, that may give you more speedup than implementing a coarray version of alltoall.

Coarrays seem to be implemented more or less on top of MPI, so I would expect a coarray version of alltoall to perform similarly to MPI’s built-in alltoall, if not better.