Coarrays: Not ready for prime time

A provocative title, I’ll admit, and I’ll say from the start that others’ experiences may be quite different. But from my perspective coarrays are not ready for serious use. I work mostly with domain decomposition methods for PDEs, and the question for me is whether Fortran coarrays can be a competitive parallel programming alternative to MPI. Computational performance is the key concern, but so is performance portability – I expect things to work reasonably uniformly across multiple compilers, and I am not comfortable with solutions that limit me, practically speaking, to a single compiler/platform. To answer that question I’ve been comparing several coarray implementations, and a reference MPI implementation, of a core halo exchange operation. So far I’ve been able to test using the NAG, Intel, and GNU Fortran (with OpenCoarrays) compilers. The results have been quite surprising and discouraging. Here are some sample results for a typical test case:

  • The ratio of the time of a compiler’s best coarray implementation to the best reference MPI time: gfortran: 1690; Intel: 5.42; NAG: 1.15.

The Intel result doesn’t seem half bad until you consider this result for the same test case:

  • The ratio of the time of a compiler’s worst coarray implementation to the time of its best coarray implementation: gfortran: 3.7; NAG: 9; Intel: 12700!

Of the 4 coarray implementations, Intel performed roughly on par with gfortran on 3 of them, namely 3 to 4 orders of magnitude worse than MPI, and only on 1 did it approach the performance of MPI. Such wild variability isn’t acceptable to me.

So of the 3 compilers, only NAG’s coarray implementation was competitive with MPI; in fact, with some small tweaking it could be made faster than MPI.
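To give a sense of the kernel being timed, here is a bare-bones 1D coarray halo exchange. This is not the repository code – just an illustration of the pattern, with arbitrary names and sizes:

    program halo_demo
      ! Bare-bones 1D halo exchange sketch (illustrative only).
      implicit none
      integer, parameter :: n = 1000       ! interior cells per image
      real :: u(0:n+1)[*]                  ! interior plus two halo cells
      integer :: me, np

      me = this_image()
      np = num_images()
      u(1:n) = real(me)                    ! stand-in for the local solution

      sync all                             ! neighbors' interiors are now valid
      if (me > 1)  u(0)   = u(n)[me-1]     ! pull right edge of left neighbor
      if (me < np) u(n+1) = u(1)[me+1]     ! pull left edge of right neighbor
      sync all                             ! all gets complete before interiors change

      if (me == 1) print *, 'halo exchange done on', np, 'images'
    end program halo_demo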

I’ve created a repository for the code and tests (mentioned in an earlier post). You can find detailed results and extensive explanation there. I’d truly welcome any feedback you might have – create an issue or start a discussion there.

There’s one final issue regarding the usability of coarrays that greatly concerns me, and that is whether they can cooperate with MPI and work in mixed-language contexts. There’s an existing topic about using a coarray library in a non-Fortran program. Top of my list of needs is being able to use an MPI-parallel library (Hypre) from a Fortran coarray program. Since the Intel and gfortran coarray implementations are built on MPI, I think this is most likely doable. However, the NAG coarray implementation – the only one I find truly usable – does something different, and I have serious doubts whether it is doable with NAG.
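For what it’s worth, the pattern I have in mind is roughly the following. Whether it works at all is implementation-dependent – it may with the MPI-based coarray runtimes, and I doubt it with NAG’s – so treat it purely as a sketch:

    program caf_plus_mpi
      ! Sketch only: mixing a coarray program with an MPI-parallel library.
      ! Nothing in the standard guarantees this works, or that image k maps
      ! to MPI rank k-1; at best it is implementation-dependent behavior.
      use mpi_f08
      implicit none
      logical :: mpi_running
      integer :: rank

      ! If the coarray runtime is itself built on MPI, MPI may already be
      ! initialized when the images start up; only initialize it if not.
      call MPI_Initialized(mpi_running)
      if (.not. mpi_running) call MPI_Init()

      call MPI_Comm_rank(MPI_COMM_WORLD, rank)
      print *, 'image', this_image(), 'sees MPI rank', rank

      ! ... here one would hand image-local data to an MPI-parallel
      ! library such as Hypre ...

      if (.not. mpi_running) call MPI_Finalize()
    end program caf_plus_mpi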

10 Likes

The current implementations of Coarray Fortran suffer from a circular problem: people hesitate to use them because of the performance issues mentioned, and compiler developers do not seem to prioritize coarray performance enhancements because people hesitate to use them.

I have implemented the same algorithms using coarrays and MPI, and the Coarray Fortran version looks concise and beautiful, nearly perfect. But exemplary, performant implementations, comparable to MPI, also matter. Otherwise, the usage remains limited to educational parallel computing activities.

It is easy to criticize something to which I have contributed zero. I appreciate the efforts of Damian Rouson and the Sourcery Institute (for their impressive OpenCoarrays library, which had the best coarray performance in my tests) and the Intel compiler team for their full implementation of Fortran 2018 coarrays. It’s a feat. But some (performance and auxiliary) improvements appear essential to see coarrays more in production code.

p.s. I mentioned OpenCoarrays and Intel ifort because these are the two coarray implementations that I have frequently used and tested so far. The NAG compiler’s implementation looks quite promising, especially if it offers flexible interoperation with MPI/OpenMP. I look forward to testing it soon.

7 Likes

nncarlson’s results appear to reflect the findings of Shterenlikht and Cebamanos

but contradict the findings of Garain, Balsara, and Reid

Did the OP look at possible hardware-related issues, such as the amount of memory per core, whether hyper-threading is being used, etc.? I’ve found that for MPI you need a minimum of 2 Gbytes per core on most Linux systems. Also, on every large HPC system I’ve run on (thousands of cores), hyperthreading is turned off. Remember that CAF was originally developed for and targeted at systems with hundreds to thousands of processors/cores, and at systems that had the hardware to support PGAS-type operations. I’ve never expected CAF to outperform MPI-3 one-sided communications, only to be competitive with MPI-2 puts and gets. The strength of CAF was never its performance but its promise of a friendly syntax for developing parallel applications.

4 Likes

These tests were all done on a single node (single socket, multi-core CPU).

MPICH 3.3.2. My impression from the OpenCoarrays website was that this was their preferred MPI. I’m assuming it is smart enough in my situation to be using shared memory transport. But that’s something I should investigate. Your question prompted me to build a version of OpenCoarrays using OpenMPI (which is the MPI I normally use). It built okay, but I get MPI errors when running the tests, even with a single image. I need to dig into that.

I’d like to do some much larger multi-node tests on my institutional HPC clusters, but I’ve got to figure out how to launch the tests correctly via slurm.

2 Likes

@nncarlson, you might also consider running the TAU profiler. It supposedly supports CAF, but that might just be for Cray systems, so it might not work on a single CPU and/or with Intel and gfortran/OpenCoarrays. The link to TAU is:

1 Like

Thanks @rwmsu for those references!

I have plenty of memory, but hyperthreading is turned on. I have been concerned about the placement of the images and have monitored it while the tests are running. With Intel I was able to pin images to specific cores, and I expect with gfortran I should be able to do the same, but there the OS seems to distribute the images appropriately (though it may be migrating them). However with NAG’s implementation it looks like one has no control and must rely on the OS. But you make an excellent point – I need to disable hyperthreading in the BIOS and rerun the tests.

Now that’s an interesting statement! I do really like the coarray syntax, and having a complete parallel programming capability built into the language is quite attractive to me. But a lot of work goes into designing and developing an SPMD program beyond dealing with the syntax. I’d happily give up some performance (a factor of 2?) to gain friendlier, more maintainable code, but there’s a limit. If it’s not reasonably close, I find it very hard to argue for using coarrays over MPI.

2 Likes

@nncarlson, I completely agree with you about the work required to design and develop SPMD programs. My experience with distributed memory parallel codes goes back to using PVM (Parallel Virtual Machine) to run a CFD code in parallel on multiple DEC/ALPHA and SGI workstations spread around the Georgia Tech campus. I then moved on to MPI. I guess my point regarding CAF is that it’s great for people who think MPI is too hard to master, but don’t expect miracles. I’ve taught MPI programming in the past, so I know that you really only have to learn about 15 functions to do 99 percent of most parallel programming tasks, and most people will build their own wrappers around those calls to provide a simpler interface. Some work, yes, but you only have to do it once.
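A minimal sketch of the kind of wrapper I mean – the module and routine names are made up purely for illustration:

    module parallel_env
      ! Sketch of an application-level layer over a handful of MPI calls;
      ! names and scope are illustrative, not a real library.
      use mpi_f08
      implicit none
      private
      public :: par_init, par_finalize, par_sum
      integer, public, protected :: nproc = 1, my_rank = 0
    contains
      subroutine par_init()
        call MPI_Init()
        call MPI_Comm_size(MPI_COMM_WORLD, nproc)
        call MPI_Comm_rank(MPI_COMM_WORLD, my_rank)
      end subroutine par_init

      subroutine par_finalize()
        call MPI_Finalize()
      end subroutine par_finalize

      ! Global sum of a scalar; the application never touches MPI directly.
      subroutine par_sum(x)
        real, intent(inout) :: x
        call MPI_Allreduce(MPI_IN_PLACE, x, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD)
      end subroutine par_sum
    end module parallel_env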

Plus, all the papers I’ve seen showing CAF competing with MPI were done on Cray systems that had the hardware support to do CAF justice. I don’t know of any that did a comparison on a typical modern multi-core desktop, so you might be a “pioneer”.

3 Likes

Exactly. Developing and using an application-specific layer over MPI – and not MPI directly – is highly recommended. In fact I’ve approached CAF in the same way. And in some sense the coarray collectives introduced in F2018 are the same – there are no coarrays involved in their interfaces at all.
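For example, in the toy program below the argument to co_sum is an ordinary scalar, not a coarray:

    program collective_demo
      ! F2018 collective: the argument to co_sum is an ordinary variable.
      implicit none
      real :: x
      x = real(this_image())
      call co_sum(x)           ! sum of x over all images, result on every image
      if (this_image() == 1) print *, 'sum over images =', x
    end program collective_demo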

First of all, thank you @nncarlson for these tests. For those who don’t know Neil, he is an expert in developing and maintaining large multi-physics Fortran codes, recently retired from LANL. And he is a big fan of Fortran. If even he can’t get coarrays to perform well, then they are not ready for prime time.

I was surprised by this statement too. What is stopping CAF from performing as well as MPI?

I have not seriously played with coarrays myself yet, but I was led to believe that CAF can do anything that MPI can do (in practice) and that it performs as well or better. Your statement is contrary to what I was led to believe. You might very well be right.

4 Likes

@certik I probably should have said its biggest strength for ME is its promise of a friendly syntax. However, as I said previously, I think most (if not all) of the work that shows coarrays performing as well as MPI was done on Cray systems that have the hardware to support PGAS puts and gets. I don’t know of a paper that shows that for an 8-core desktop. CAF started life on distributed memory machines, and I suspect (particularly for the implementations built on top of MPI) that there are some issues related to shared memory that aren’t being addressed – probably like the memory placement/processor affinity problems in the early releases of OpenMP. Also, the only way I can see any implementation built on top of MPI being faster than MPI is if the CAF implementation limits the amount of data buffering and synchronization by default (something you would have to be proactive about doing yourself in a standard MPI implementation). Plus, many of the CAF vs MPI papers were based on MPI-2, which didn’t have the one-sided communication procedures available in MPI-3. Don’t get me wrong, I’m a big fan of CAF for the kinds of problems it was originally designed for, i.e. distributed memory codes on large multi-processor (now multi-node/multi-core) systems. I just think the jury is still out on whether it’s a viable competitor, in terms of performance, on much smaller systems. If I were developing a new code from scratch, I would first implement it in CAF and only move back to MPI if the performance on the target hardware was poor.

5 Likes

I still have high hopes for coarrays. I found this article about MPI:

5 Likes

@nncarlson I’m far from being a performance expert and agree with you that the quality of implementations has been uneven. My experiences have been better, however, than is reported here and there are several other reports of encouraging performance at scale in the literature. I cite a few below, but I think all of these studied distributed-memory performance. It’s difficult to draw broad conclusions without studying a range of applications, platforms, problem sizes, problem configurations, and even runtime settings (Alessandro Fanfarillo did an interesting study of using AI to discover the best MPI installation configuration settings), in addition to varying the compilers. I don’t have the expertise to diagnose what might explain the issues you’re experiencing, but if you haven’t tried performance analysis tools like TAU, for example, you might get some useful insights.

Although the standard doesn’t guarantee that mixing Coarray Fortran with other parallel programming models will work, I believe that was the aim of the committee and the aim of most compiler vendors. Mixing MPI with CAF should work in most cases. Regarding comparisons, my hope has always been that even when CAF uses MPI under the hood, the compiler and runtime would generate higher-performing MPI than most application developers would write themselves. We at least found this to be true in the last of the papers cited below.

Speedup of 33% relative to MPI-2 on 80K cores for European weather model: Mozdzynski, G., Hamrud, M., & Wedi, N. (2015). A Partitioned Global Address Space implementation of the European Centre for Medium Range Weather Forecasts Integrated Forecasting System. International Journal of High Performance Computing Applications, 1094342015576773.

Performance competitive with MPI-3 for several applications: Garain, S., Balsara, D. S., & Reid, J. (2015). Comparing Coarray Fortran (CAF) with MPI for several structured mesh PDE applications. Journal of Computational Physics.

Speedup of 50% relative to MPI-2 for plasma fusion code on 130,000 cores: Preissl, R., Wichmann, N., Long, B., Shalf, J., Ethier, S., & Koniges, A. (2011, November). Multithreaded global address space communication techniques for gyrokinetic fusion applications on ultra-scale platforms. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (p. 78). ACM.

50% parallel efficiency in strong scaling for an atmospheric model on 100,000 cores: Rouson, D., Gutmann, E. D., Fanfarillo, A., & Friesen, B. (2017, November). Performance portability of an intermediate-complexity atmospheric research model in coarray Fortran. In Proceedings of the Second Annual PGAS Applications Workshop (pp. 1-4).

CAF supported by either one-sided MPI-3 or OpenSHMEM outperforming two-sided MPI: Rasmussen, S., Gutmann, E. D., Friesen, B., Rouson, D., Filippone, S., & Moulitsas, I. (2018). Development and performance comparison of MPI and Fortran Coarrays within an atmospheric research model. Proceedings of PAW-ATM 18.

5 Likes

@rouson, the issue is not whether CAF can compete with or outperform MPI on large distributed memory machines. The issue is performance on single-CPU, shared memory, multicore desktop or workstation systems. I, and I think others here, would appreciate a formal study along the lines of what Neil is attempting. I’m particularly interested in the following issues:

  1. How do the default MPI configurations used by Intel and gfortran/OpenCoarrays affect performance on desktop systems?

  2. What changes in MPI environment variables are needed to improve performance?

  3. What underlying or unresolved issues related to shared memory access, etc., do the MPI implementations have when running on a single shared memory node?

  4. What hardware requirements (memory, turning off hyperthreading, etc.) give the best performance?

  5. Is the strong and weak scaling shown on 100K-core systems approached on much smaller systems?

I also have a general question about how the current and next generations of hybrid (big core/little core) chips, such as Alder Lake and AMD’s rumored hybrid architecture, will affect both MPI and CAF.

Edit

I would also like to see a comparison with the OpenSHMEM implementation. If I remember correctly, Cray’s original MPI implementation on the T3E sat on top of SHMEM.

5 Likes

@rwmsu these are great questions and an excellent way to frame the discussion. Answering such questions in a thoughtful and reproducible way could take a considerable amount of time and could be the basis of a great funding proposal or maybe at least a Google Summer of Code proposal. @everythingfunctional and I have one project in which shared-memory parallelism will play a central role so possibly we’ll be able to perform such studies and answer your questions at some point.

FWIW, one of the goals of OpenCoarrays was to present an interface that is agnostic about the underlying parallel programming model. The penultimate paper cited in my previous post benefited greatly from the ability to swap OpenSHMEM for MPI at link-time, i.e., without rewriting or even recompiling the Fortran source code. I always imagined that flexibility as the strongest argument for Coarray Fortran.

More recently, I’m focused on what I hope will be the successor to OpenCoarrays: Caffeine. The first back-end for Caffeine is yet another parallel programming model: GASNet-EX, which may outperform MPI on some combinations of application and platform.

7 Likes

@nncarlson, @rouson, or anyone interested in such comparisons: would it be possible to add one or more simple comparisons written with a diverse set of readers in mind, many of whom may not have explored parallel programming options as much?

It may help readers understand what coarrays, as part of standard Fortran, bring to the table vis-a-vis approaches outside the language standard such as MPI.

An immediate example that comes to mind is the “canonical” one used to illustrate MPI: the calculation of pi using the Monte Carlo method (e.g., see here). It should be straightforward to write the same using coarrays (and standard intrinsics only) toward the comparisons of interest.
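Something along these lines, perhaps – a quick, untested sketch using only standard intrinsics, with arbitrary sample sizes:

    program pi_monte_carlo
      ! Quick sketch of a coarray Monte Carlo estimate of pi (untested).
      use, intrinsic :: iso_fortran_env, only: int64, real64
      implicit none
      integer(int64), parameter :: n_per_image = 10000000_int64
      integer(int64) :: i, hits
      real :: x, y
      real(real64) :: pi_est

      ! F2018: give each image its own random sequence.
      call random_init(repeatable=.false., image_distinct=.true.)

      hits = 0
      do i = 1, n_per_image
        call random_number(x)
        call random_number(y)
        if (x*x + y*y <= 1.0) hits = hits + 1
      end do

      call co_sum(hits)          ! total hits across all images (F2018 collective)
      if (this_image() == 1) then
        pi_est = 4.0_real64 * real(hits, real64) &
                 / (real(n_per_image, real64) * num_images())
        print *, 'pi estimate on', num_images(), 'images:', pi_est
      end if
    end program pi_monte_carlo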

3 Likes

Yes, a simple but “real” numerical parallel solver would be a great benchmark. The current benchmark that Neil did is mostly communication. All that is needed I think is to add some actual computation for each image, and compute something physical.
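Something like a Jacobi relaxation, perhaps. A 1D sketch, with arbitrary sizes and boundary values, where each image does real arithmetic between halo exchanges:

    program jacobi_1d
      ! Sketch: 1D Jacobi relaxation with a coarray halo exchange, so that
      ! each image performs computation between communication steps.
      implicit none
      integer, parameter :: n = 1000, nsweeps = 500
      real :: u(0:n+1)[*], unew(n)
      integer :: me, np, k

      me = this_image()
      np = num_images()
      u = 0.0
      if (me == 1)  u(0)   = 1.0           ! left boundary condition
      if (me == np) u(n+1) = 0.0           ! right boundary condition

      do k = 1, nsweeps
        sync all                           ! neighbors' interiors are up to date
        if (me > 1)  u(0)   = u(n)[me-1]   ! halo exchange with left neighbor
        if (me < np) u(n+1) = u(1)[me+1]   ! halo exchange with right neighbor
        sync all                           ! gets complete before interiors change
        unew = 0.5*(u(0:n-1) + u(2:n+1))   ! Jacobi update of the interior
        u(1:n) = unew
      end do

      if (me == 1) print *, 'completed', nsweeps, 'sweeps on', np, 'images'
    end program jacobi_1d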

In reference to hyperthreading, it should be mentioned for those not aware of it that hyperthreading on Intel HPC systems is often disabled because of security issues rather than performance issues (as was the case in the past). Also, on at least two common commodity platforms you can leave it enabled in the BIOS and still get results indistinguishable from having it off, simply by offlining the additional logical cores – which is much easier than reconfiguring and rebooting if you are testing on a single Linux box, for example.

On Linux, you can often enable/disable cores by echoing values into /sys/devices/system/cpu/cpu*/online as root.

Depending on your platform configuration, power modes can impact all parallel methods, especially if your code alternates extensively between parallel and scalar regions. So if you are timing on an out-of-the-box Linux installation at home, note that such systems are rarely tuned for parallel HPC-like applications, and you might want to tune them before doing any timing. HPC machines usually have context switching, system process limits, power modes, swappiness, etc. set very differently (or at least they should be) from most default Linux configurations, which are typically set up for quick interactive response rather than heavy computation.

@FortranFan are you asking more about comparing the source code for instructional purposes or comparing the performance? If you’re mainly interested in how the source code compares in terms of complexity, clarity, etc., I would be glad to pair program the translation of that program with you. It looks like a quick task and shouldn’t take more than 30 minutes.

A performance comparison is more difficult than meets the eye for lots of reasons. So much depends on problem choice, problem parameters, system characteristics, and familiarity with the best practices for a particular approach. Any MPI I write would be novice-level so it might not be a fair comparison.

Thanks very much, Damian, for your reply. I was thinking of both:

  • Many readers will be interested, I think, in instructions on how to write good SPMD programs using coarrays in standard Fortran,
  • But Fortranners being Fortranners, they will then immediately want to know about performance too!

As is generally the case, a collaborative effort might serve well. Your expertise and experience with coarrays could provide an excellent introduction and instruction for many on how to author an SPMD program, with the PI calculation using Monte Carlo as a possible example. Perhaps there are other, better example options?

@nncarlson et al. with MPI expertise may be able to provide suitable MPI example(s) toward the same and help with performance benchmarks as well, as shown in the original post.

My request is mainly to consider several examples that first and foremost bring in simplicity whilst paying attention to more than one aspect of SPMD parallel programming which is what coarrays are meant for.

Thanks,

1 Like

I seem to remember that the Rice University folks did a CAF version of the NAS parallel benchmarks. I don’t know if NASA ever did one. Maybe Cray did their own implementation; Bill Long will probably know. Those might be a useful first step towards a test suite. Another good source of potential test problems is the old IBM MPI Redbook, RS/6000 SP: Practical MPI Programming. I don’t think it’s on the IBM Redbook site anymore, but a copy can be downloaded from here:

It has several 1D, 2D, and 3D finite difference (halo-exchange) examples (basically solving the Laplace equation), along with a FEM example, all of which should be straightforward to translate to CAF.

Getting back to Neil’s original premise: I think demonstrating and documenting usable parallel performance on desktops and workstations is critical both for the further development of CAF and for getting people who are reluctant to embrace modern Fortran features to give the language a chance.

1 Like