I’m investigating the use of coarrays for performing a halo exchange operation. I’ve only just started, and I have plans for several approaches as well as some reference implementations using MPI, but my very first performance results are so surprising that I wanted to share what I’ve got now and get some feedback on what might be going on.
Here is a sample data point. The halo exchange for a vector with 1.6M elements distributed across 12 images (one per core) takes 0.29 sec using gfortran with OpenCoarrays. (What is being communicated between images is just the overlap or halo, not all vector elements.)
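To make the operation concrete, here is a bare-bones sketch of the kind of halo pull being timed. This is not the benchmark code (the real code works through a DT describing the communication pattern), and the names and sizes here are made up:

! Illustrative sketch only, not the benchmark code: each image owns a block
! of the vector plus one halo cell on each side, and pulls the boundary
! values it needs from its left/right neighbor.
program halo_sketch
  implicit none
  integer, parameter :: n = 1000      ! local block size (arbitrary)
  real, allocatable :: u(:)[:]        ! owned block plus halo cells, as a coarray
  integer :: me, np
  me = this_image()
  np = num_images()
  allocate(u(0:n+1)[*])               ! u(0) and u(n+1) are the halo cells
  u(1:n) = real(me)                   ! fill the owned block with something
  sync all                            ! neighbors must have written u before we read it
  if (me > 1)  u(0)   = u(n)[me-1]    ! pull the rightmost owned value from the left neighbor
  if (me < np) u(n+1) = u(1)[me+1]    ! pull the leftmost owned value from the right neighbor
  sync all
end program halo_sketch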
The same test using the NAG compiler takes only 0.000086 sec, which I can scarcely believe, but if it is even remotely correct, NAG is wicked fast. It is perhaps plausible: unlike Intel and OpenCoarrays, which are based on MPI, NAG uses its own proprietary “Co-SMP” shared-memory software directed specifically at coarrays (though it is limited to single-node, shared-memory use cases).
But at the other end of the spectrum is the Intel compiler where the same test takes a horrendous 8.9 sec!
I’ve created a repo on GitHub where you can find the code and lots of additional details.
These timings are from a Linux desktop running Fedora 33 with a 12-core Threadripper 2920X CPU and 32 GB of memory.
Running 1 image would be an interesting case I hadn’t thought of. In this case there is no halo to exchange and the “gather” operation being timed is effectively a no-op (though it goes through the motions). Nevertheless I did try it. Here are the compile lines used:
ifort -coarray=single …
gfortran -fcoarray=single …
nagfor -coarray=single …
And as expected, all returned minuscule times.
Here’s another data point with just 2 images: gfortran 0.11 sec, NAG 4.5e-5 sec, Intel 12 sec (even longer than with 12 images!). Something wacky is going on with Intel that I don’t understand.
Edit: I went back to the original “-coarray=shared” builds, but just set the number of images to 1, and still got minuscule times for all. That was probably a better test than recompiling with “-coarray=single”.
I don’t think that can be the case. The “halo” for an image isn’t looking at the same memory as the neighboring images, but is its own separate copy of the data. So there has to be an actual copying of data from one memory location to another happening in the exchange.
Okay, I’m pretty sure I’ve identified the cause behind the terrible performance of the Intel executable. I brought up a network monitor, and every time I run the executable it hammers the network.
What I don’t know is why. This is the first time I’ve used Intel’s MPI; I’ve only ever used OpenMPI and MPICH before. If someone knows how to tell Intel’s MPI that I’m just running on the local node please let me know. (I am using the “-coarray=shared” flag and not “-coarray=distributed”)
Setting the environment variables I_MPI_FABRICS and I_MPI_DEVICE to shm eliminated the network traffic for me and substantially improved the timings for Intel, though they are still significantly worse than gfortran/OpenCoarrays.
Understanding the benefits of Coarrays vs OpenMP is an interest I share.
I would suggest a test that has a more substantial/meaningful calculation phase. OpenMP on gfortran takes 10 to 20 microseconds to initiate a parallel region, so I would expect multiple coarray processes could take even longer. Coarrays would need a substantial calculation to justify their use.
I am not familiar with NAG’s single-node, shared memory coarray equivalent. Is it more OpenMP like? Do they also support OpenMP?
I am more familiar with OpenMP on Ryzen, but am wondering if OpenCoarrays might provide an alternative Fortran conforming approach. Utilising GPUs is another possible alternative.
OpenMP on Threadripper with more memory channels would be interesting.
I would expect any testing of Coarrays must be dependent on the hardware selected; lots of hardware complexity, which is why I started with OpenMP.
A coarray program is very different from a program that uses OpenMP. When a coarray program is launched, multiple images of the entire program are launched and run at the same time, each in its own address space, just as when using MPI. The images communicate with each other through special coarray variables, which allow one image to read/write the corresponding variable on another image. The MPI equivalent would be sending/receiving messages. What I’m interested in here is whether coarrays can be a good alternative to MPI. For that (insofar as performance goes), including a substantial calculation phase in the test is entirely irrelevant; the whole question is how costly the communication is. (As far as usage goes – but I’m just starting – coarrays are remarkably simple to use compared with MPI.)
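Here is the model in miniature (a made-up toy, not from my benchmark): every image executes this whole program, and the square-bracket syntax is what lets one image touch another image’s copy of the variable.

program coarray_model
  implicit none
  integer :: x[*]     ! every image has its own x; the [*] makes it remotely addressable
  x = this_image()
  sync all            ! ensure every image has assigned x before anyone reads it remotely
  if (this_image() == 1) print *, 'image 1 reads x from the last image:', x[num_images()]
end program coarray_model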
I am not familiar with NAG’s single-node, shared memory coarray equivalent. Is it more OpenMP like?
Perhaps a misunderstanding here. Coarrays are part of the standard, and NAG supports them just like Intel and gfortran – the same code works with all compilers. The issue is how a compiler implements the underlying communication between images: Intel and gfortran (with OpenCoarrays) use MPI, while NAG uses its own proprietary implementation. This is mostly transparent to the user, but can easily impact performance. (Actually OpenCoarrays supports other backends besides MPI, but I think those are more experimental.) And NAG does support OpenMP, though I’ve never used it.
Understanding the benefits and trade-offs of MPI, coarrays, OpenMP, etc. is an important, but much bigger, issue than my immediate concern. The advantage of OpenMP as you’ve observed is the ability to take an otherwise serial code and have the compiler parallelize sections of it for you (or off-load to coprocessors/GPUs). Whereas with MPI and coarrays the entire program is parallel from the outset.
The NAG run times are remarkable. I found the same network overhead as you with the Intel compiler. I will look into this in a bit more depth when I have time.
Thanks @cmappic for taking the time to run the tests! I was happy to see that your results were consistent with mine, especially in regard to Intel. As I noted on the webpage, Intel misidentifies the layout of my processor, seeing it as having 6 cores with 4 threads each instead of 12 cores with 2 threads each. I believe I have set the placement of images appropriately using environment variables, but I wasn’t sure whether my Intel timings were the result of a messed-up configuration.
Btw, late last night I committed an MPI reference version of the benchmark. Not surprisingly it is the fastest (in my tests), but the NAG compiler is within a factor of 2. Given that those times are so small, if there were even a modest amount of real computation going on between halo exchanges, that difference would be completely insignificant.
Well yes, but not significantly so in my opinion. Remember that in a parallel algorithm the communication costs are hopefully a small fraction of the overall cost, and what is being measured here is just the communication cost. (But gfortran and Intel being 10^4–10^5 times slower is an entirely different matter.) Also keep in mind that this is my initial simple-minded coarray implementation – and the code is remarkably simple. I’m working on a new version that replaces the communication of scattered values with communication of contiguous blocks of data (sketched below). We’ll see if that improves things.
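For the record, here is roughly what the two patterns look like (illustrative only – the index lists and buffer handling in my actual code are different):

program scattered_vs_block
  implicit none
  integer, parameter :: nh = 4
  integer :: k, nbr
  integer :: idx(nh) = [3, 7, 11, 15]   ! scattered halo indices (made up)
  real :: halo(nh)
  real, allocatable :: u(:)[:], buf(:)[:]
  allocate(u(20)[*], buf(nh)[*])
  u = real(this_image())
  buf(:) = u(idx)                       ! each image packs its boundary values locally
  sync all
  nbr = merge(this_image()+1, 1, this_image() < num_images())
  ! (a) scattered: one small remote reference per halo element
  do k = 1, nh
    halo(k) = u(idx(k))[nbr]
  end do
  ! (b) contiguous: a single larger remote reference for the whole block
  halo(:) = buf(:)[nbr]
  sync all
end program scattered_vs_block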
Here are my current results for the 12-core test. You can find these on the repo webpage with more explanation that I won’t repeat here.
Very nice. I don’t know – you start from a small problem (70K) but go up to 4372K, which seems quite large, doesn’t it? And NAG is still 50% slower than your MPI version. However, is there a way to run NAG+MPI? It could be that the gfortran you used is simply faster.
I actually did that too, but didn’t bother to record the timings since they were essentially the same as gfortran+MPI. I think the reason for that is that there is very little Fortran involved – nearly all the work in the gather operation happens in MPI and in both cases MPI was built using gcc.
And yes, the small problem is rather too small to be decomposed across 12 cores – there is a definite limit to strong scaling here. Likewise the largest problem is bigger than ideal for only 12 cores. The sweet spot probably lies somewhere between B2 and B3 for 12 cores.
@certik you’ll be interested to learn that NAG+coarrays can be made faster than NAG+MPI. The gather procedure being timed uses a local coarray. The key was to use a persistent coarray, thus taking the coarray allocation/deallocation out of the procedure. Here are the timings (time in µs):
test-#image      B0-12   B1-12   B2-12   B3-12   B4-12
NAG + coarray      6.8     8.8      18      39     120
NAG + MPI          9.2      13      24      48      94
The change made no difference whatsoever for gfortran/OpenCoarrays or Intel; their coarray allocation/deallocation time is completely dwarfed by whatever else they are doing.
I saw two ways to get a persistent coarray (no difference in timings between them). Unfortunately, neither of them is acceptable in practice, IMO:
1. Make the coarray a module variable.
2. Make the coarray an allocatable component of the DT that holds all the data describing the communication pattern.
I think it’s clear why 1 is bad. 2 would be fine except that the DT is then not allowed to be associated with an intent(out) dummy argument. One aspect of the coarray philosophy (which I’m still trying to wrap my head around) is that allocations/deallocations of coarrays must be explicit and never hidden, and allocatable components of a derived type passed to an intent(out) dummy argument are deallocated on entry; hence the restriction. My practice with DTs is to have a type-bound init procedure that instantiates an instance (as best we can in Fortran), where the passed object is intent(out). That’s not something I’d easily give up.
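To show what I mean, here is a sketch of option 2 with invented names (and a plain module procedure rather than the type-bound init I actually use – the intent issue is the same either way):

module gather_mod
  implicit none
  type :: gather_type
    integer, allocatable :: send_index(:)
    real,    allocatable :: buffer(:)[:]    ! the persistent coarray, as a component
  end type
contains
  subroutine init(this, indices)
    ! intent(out) is not permitted for THIS: it would deallocate the allocatable
    ! components on entry, and that implicit coarray deallocation is exactly
    ! what the standard disallows, so intent(inout) it has to be.
    type(gather_type), intent(inout) :: this
    integer, intent(in) :: indices(:)
    this%send_index = indices
    if (.not. allocated(this%buffer)) allocate(this%buffer(size(indices))[*])
  end subroutine init
end module gather_mod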