I’m investigating the use of coarrays for performing a halo exchange operation. I’ve only just started, and I have plans for several approaches as well as some reference implementations using MPI, but my very first performance results are so surprising that I wanted to share what I’ve got now and get some feedback on what might be going on.
Here is a sample data point. The halo exchange for a vector with 1.6M elements distributed across 12 images (one per core) takes 0.29 sec using gfortran with OpenCoarrays. (What is being communicated between images is just the overlap or halo, not all vector elements.)
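To make the operation concrete, here is a bare-bones sketch of the kind of halo pull being timed. This is not the benchmark code (the real code works through a DT describing the communication pattern), and the names and sizes here are made up:

! Illustrative sketch only, not the benchmark code: each image owns a block
! of the vector plus one halo cell on each side, and pulls the boundary
! values it needs from its left/right neighbor.
program halo_sketch
  implicit none
  integer, parameter :: n = 1000      ! local block size (arbitrary)
  real, allocatable :: u(:)[:]        ! owned block plus halo cells, as a coarray
  integer :: me, np
  me = this_image()
  np = num_images()
  allocate(u(0:n+1)[*])               ! u(0) and u(n+1) are the halo cells
  u(1:n) = real(me)                   ! fill the owned block with something
  sync all                            ! neighbors must have written u before we read it
  if (me > 1)  u(0)   = u(n)[me-1]    ! pull the rightmost owned value from the left neighbor
  if (me < np) u(n+1) = u(1)[me+1]    ! pull the leftmost owned value from the right neighbor
  sync all
end program halo_sketch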
The same test using the NAG compiler takes only 0.000086 sec, which I can scarcely believe, but if it is even remotely correct, NAG is wicked fast. It is perhaps plausible: unlike Intel and OpenCoarrays, which are based on MPI, NAG uses its own proprietary “Co-SMP” shared-memory software directed specifically at coarrays (though it is limited to single-node, shared-memory use cases).
But at the other end of the spectrum is the Intel compiler where the same test takes a horrendous 8.9 sec!
I’ve created a repo on GitHub where you can find the code and lots of additional details.
These timings are from a Linux desktop running Fedora 33 with a 12-core Threadripper 2920X CPU and 32 GB of memory.
Running 1 image would be an interesting case I hadn’t thought of. In this case there is no halo to exchange and the “gather” operation being timed is effectively a no-op (though it goes through the motions). Nevertheless I did try it. Here are the compile lines used:
ifort -coarray=single …
gfortran -fcoarray=single …
nagfor -coarray=single …
And as expected, all returned minuscule times.
Here’s another data point with just 2 images: gfortran 0.11 sec, NAG 4.5e-5 sec, Intel 12 sec (even longer than with 12 images!). Something wacky is going on with Intel that I don’t understand.
Edit: I went back to the original “-coarray=shared” builds, but just set the number of images to 1, and still got minuscule times for all. That was probably a better test than recompiling with “-coarray=single”.
I don’t think that can be the case. The “halo” for an image isn’t looking at the same memory as the neighboring images, but is its own separate copy of the data. So there has to be an actual copying of data from one memory location to another happening in the exchange.
Okay, I’m pretty sure I’ve identified the cause behind the terrible performance of the Intel executable. I brought up a network monitor, and every time I run the executable it hammers the network.
What I don’t know is why. This is the first time I’ve used Intel’s MPI; I’ve only ever used OpenMPI and MPICH before. If someone knows how to tell Intel’s MPI that I’m just running on the local node please let me know. (I am using the “-coarray=shared” flag and not “-coarray=distributed”)
Setting the environment variables I_MPI_FABRICS and I_MPI_DEVICE to shm eliminated the network traffic for me and substantially improved the timings for Intel, though they are still significantly worse than gfortran/OpenCoarrays.
Understanding the benefits of Coarrays vs OpenMP is an interest I share.
I would suggest a test that has a more substantial/meaningful calculation phase. OpenMP on gfortran takes 10 to 20 microseconds to initiate a parallel region, so I would expect multiple coarray processes could take even longer. Coarrays would need a substantial calculation to justify their use.
I am not familiar with NAG’s single-node, shared memory coarray equivalent. Is it more OpenMP like? Do they also support OpenMP?
I am more familiar with OpenMP on Ryzen, but am wondering if OpenCoarrays might provide an alternative Fortran conforming approach. Utilising GPUs is another possible alternative.
OpenMP on Threadripper with more memory channels would be interesting.
I would expect any testing of Coarrays must be dependent on the hardware selected; lots of hardware complexity, which is why I started with OpenMP.
A coarray program is very different from a program that uses OpenMP. When a coarray program is launched, multiple images of the entire program are launched and run at the same time, each in its own address space, just as when using MPI. The images communicate with each other through special coarray variables, which allow one image to read/write the corresponding variable on another image. The MPI equivalent would be sending/receiving messages. What I’m interested in here is whether coarrays can be a good alternative to MPI. For that (insofar as performance goes), including a substantial calculation phase in the test is entirely irrelevant; the whole question is how costly the communication is. (As far as usage goes – but I’m just starting – coarrays are remarkably simple to use compared with MPI.)
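Here is the model in miniature (a made-up toy, not from my benchmark): every image executes this whole program, and the square-bracket syntax is what lets one image touch another image’s copy of the variable.

program coarray_model
  implicit none
  integer :: x[*]     ! every image has its own x; the [*] makes it remotely addressable
  x = this_image()
  sync all            ! ensure every image has assigned x before anyone reads it remotely
  if (this_image() == 1) print *, 'image 1 reads x from the last image:', x[num_images()]
end program coarray_model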
I am not familiar with NAG’s single-node, shared memory coarray equivalent. Is it more OpenMP like?
Perhaps a misunderstanding here. Coarrays are part of the standard, and NAG supports them just like Intel and gfortran – the same code works with all compilers. The issue is how a compiler implements the underlying communication between images: Intel and gfortran (with OpenCoarrays) use MPI, while NAG uses its own proprietary implementation. This is mostly transparent to the user, but can easily impact performance. (Actually OpenCoarrays supports other backends besides MPI, but I think those are more experimental.) And NAG does support OpenMP, though I’ve never used it.
Understanding the benefits and trade-offs of MPI, coarrays, OpenMP, etc. is an important, but much bigger, issue than my immediate concern. The advantage of OpenMP as you’ve observed is the ability to take an otherwise serial code and have the compiler parallelize sections of it for you (or off-load to coprocessors/GPUs). Whereas with MPI and coarrays the entire program is parallel from the outset.
The NAG run times are remarkable. I found the same network overhead as you with the Intel compiler. I will look into this in a bit more depth when I have time.
Thanks @cmappic for taking the time to run the tests! I was happy to see that your results were consistent with mine, especially in regard to Intel. As I noted on the webpage, Intel misidentifies the layout of my processor, seeing it as having 6 cores with 4 threads each instead of 12 cores with 2 threads each. I believe I have set the placement of images appropriately using environment variables, but I wasn’t sure whether my Intel timings were the result of a messed-up configuration.
Btw, late last night I committed an MPI reference version of the benchmark. Not surprisingly it is the fastest (in my tests), but the NAG compiler is within a factor of 2. Given that those times are so small, if there were even a modest amount of real computation going on between halo exchanges, that difference would be completely insignificant.
Well yes, but not significantly so in my opinion. Remember that in a parallel algorithm the communication costs are hopefully a small fraction of the overall cost, and what is being measured here is just the communication cost. (But gfortran and Intel being 10^4–10^5 times slower is an entirely different matter.) Also keep in mind that this is my initial simple-minded coarray implementation – and the code is remarkably simple. I’m working on a new version that replaces the communication of scattered values with communication of contiguous blocks of data (sketched below). We’ll see if that improves things.
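For the record, here is roughly what the two patterns look like (illustrative only – the index lists and buffer handling in my actual code are different):

program scattered_vs_block
  implicit none
  integer, parameter :: nh = 4
  integer :: k, nbr
  integer :: idx(nh) = [3, 7, 11, 15]   ! scattered halo indices (made up)
  real :: halo(nh)
  real, allocatable :: u(:)[:], buf(:)[:]
  allocate(u(20)[*], buf(nh)[*])
  u = real(this_image())
  buf(:) = u(idx)                       ! each image packs its boundary values locally
  sync all
  nbr = merge(this_image()+1, 1, this_image() < num_images())
  ! (a) scattered: one small remote reference per halo element
  do k = 1, nh
    halo(k) = u(idx(k))[nbr]
  end do
  ! (b) contiguous: a single larger remote reference for the whole block
  halo(:) = buf(:)[nbr]
  sync all
end program scattered_vs_block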
Here are my current results for the 12-core test. You can find these on the repo webpage with more explanation that I won’t repeat here.
Very nice. I don’t know – you start from a small problem (70K) but go up to 4372K, which seems quite large, doesn’t it? And NAG is still 50% slower than your MPI version. However, is there a way to run NAG+MPI? It could be that the gfortran you used is simply faster.
I actually did that too, but didn’t bother to record the timings since they were essentially the same as gfortran+MPI. I think the reason for that is that there is very little Fortran involved – nearly all the work in the gather operation happens in MPI and in both cases MPI was built using gcc.
And yes, the small problem is rather too small to be decomposed across 12 cores – there is a definite limit to strong scaling here. Likewise the largest problem is bigger than ideal for only 12 cores. The sweet spot probably lies somewhere between B2 and B3 for 12 cores.
@certik you’ll be interested to learn that NAG+coarrays can be made faster than NAG+MPI. The gather procedure being timed uses a local coarray. The key was to use a persistent coarray, thus taking the coarray allocation/deallocation out of the procedure. Here are the timings (time in µs):
test-#image      B0-12   B1-12   B2-12   B3-12   B4-12
NAG + coarray      6.8     8.8      18      39     120
NAG + MPI          9.2      13      24      48      94
The change made no difference whatsoever for gfortran/OpenCoarrays or Intel; their coarray allocation/deallocation time is completely dwarfed by whatever else they are doing.
I saw two ways to get a persistent coarray (no difference in timings between them). Unfortunately, neither of them is acceptable in practice, IMO:
1. Make the coarray a module variable.
2. Make the coarray an allocatable component of the DT that holds all the data describing the communication pattern.
I think it’s clear why 1 is bad. 2 would be fine except that the DT is then not allowed to be associated with an intent(out) dummy argument. One aspect of the coarray philosophy (which I’m still trying to wrap my head around) is that allocations/deallocations of coarrays must be explicit and never hidden, and allocatable components of a derived type passed to an intent(out) dummy argument are deallocated on entry; hence the restriction. My practice with DTs is to have a type-bound init procedure that instantiates an instance (as best we can in Fortran), where the passed object is intent(out). That’s not something I’d easily give up.
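To show what I mean, here is a sketch of option 2 with invented names (and a plain module procedure rather than the type-bound init I actually use – the intent issue is the same either way):

module gather_mod
  implicit none
  type :: gather_type
    integer, allocatable :: send_index(:)
    real,    allocatable :: buffer(:)[:]    ! the persistent coarray, as a component
  end type
contains
  subroutine init(this, indices)
    ! intent(out) is not permitted for THIS: it would deallocate the allocatable
    ! components on entry, and that implicit coarray deallocation is exactly
    ! what the standard disallows, so intent(inout) it has to be.
    type(gather_type), intent(inout) :: this
    integer, intent(in) :: indices(:)
    this%send_index = indices
    if (.not. allocated(this%buffer)) allocate(this%buffer(size(indices))[*])
  end subroutine init
end module gather_mod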