@rouson Hi, just a thought: since gcc is also used to compile OpenCoarrays, can we call OpenCoarrays from C? If yes, are there any simple examples to help understand the transfer of command line inputs?
In theory the answer is yes, but if you’re programming in C, I don’t see how it would be any more ergonomic (if it was even close) than just using MPI directly.
The closest you are going to get to coarrays in C is Unified Parallel C (UPC). Many vendors (Cray, IBM, etc.) at one time had implementations of UPC on their hardware. I think there is a Clang implementation, but don’t quote me. However, as @everythingfunctional points out, you are probably better off just using MPI, particularly the one-sided communication functions in MPI-3.
UPC is an extension to C99 that doesn’t conform to C and isn’t semantically equivalent to coarrays. For example, UPC’s one-sided remote memory deallocation is just awful. UPC is also not really active anymore, so your support options are quite limited unless you buy a Cray machine.
OpenSHMEM is a better match because it’s a library, not a language extension, so you don’t need a special compiler. There are a bunch of OpenSHMEM implementations out there, including one based on MPI (OSHMPI), which means it works everywhere.
MPI one-sided (RMA) works but is not simple to use. You have to know which subset of RMA to use. Looking at examples may help:
- ARMCI-MPI uses MPI RMA properly, although the GMR abstraction layer obfuscates things a bit.
- The original version of OSHMPI is pretty simple and uses MPI RMA properly. Look at the “sheap” path because the stuff for globals is only there because of how people wrote SHMEM in the 20th century.
To summarize: only use `MPI_Win_allocate`, then call `MPI_Win_lock_all`. Do all your RMA stuff with `flush` or `wait` operations, then call `MPI_Win_unlock_all` and `MPI_Win_free`. Do not use `MPI_Win_fence` or PSCW. Do not use exclusive locks. They aren’t locks. If you need a lock, you build one with `MPI_Compare_and_swap`, for example.
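For illustration, here is a minimal sketch of that whole recipe using the mpi_f08 bindings (the program structure and variable names are mine, error handling is omitted, and a real code would of course do more inside the epoch):

```fortran
program rma_sketch
  use mpi_f08
  use, intrinsic :: iso_c_binding, only: c_ptr, c_f_pointer
  implicit none
  integer :: me, np, right
  type(MPI_Win) :: win
  type(c_ptr)   :: baseptr
  real(8), pointer :: local(:)
  real(8) :: val
  integer(MPI_ADDRESS_KIND) :: winsize, disp

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, me)
  call MPI_Comm_size(MPI_COMM_WORLD, np)

  ! 1. Allocate the window memory (one real(8) per rank) with MPI_Win_allocate.
  winsize = 8_MPI_ADDRESS_KIND
  call MPI_Win_allocate(winsize, 8, MPI_INFO_NULL, MPI_COMM_WORLD, baseptr, win)
  call c_f_pointer(baseptr, local, [1])   ! local view of the window memory

  ! 2. Open a single passive-target access epoch covering all ranks.
  call MPI_Win_lock_all(0, win)

  ! 3. Do the RMA operations, each completed with a flush (request-based
  !    MPI_Rput/MPI_Rget would instead be completed with MPI_Wait).
  right = mod(me + 1, np)
  val   = real(me, 8)
  disp  = 0_MPI_ADDRESS_KIND
  call MPI_Put(val, 1, MPI_DOUBLE_PRECISION, right, disp, 1, MPI_DOUBLE_PRECISION, win)
  call MPI_Win_flush(right, win)          ! the put is now complete at the target

  ! 4. Close the epoch and free the window; no fence, no PSCW, no exclusive locks.
  call MPI_Barrier(MPI_COMM_WORLD)        ! everyone has completed its put
  call MPI_Win_unlock_all(win)
  call MPI_Win_free(win)
  call MPI_Finalize()
end program rma_sketch
```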
I think you can say the same thing about MPI vs coarrays. I have been a fan of coarrays since I was first introduced to them on Cray systems 20+ years ago now. However, other than a somewhat easier-to-learn syntax (and MPI is not as hard as a lot of folks make it out to be; you only need to learn 15 or so function calls to do the majority of the interprocessor control and communication needed for a distributed memory parallel code), coarrays have no other advantage over MPI (and probably OpenSHMEM). As I’ve stated before, coarrays can be a viable alternative to MPI on large HPC systems (Crays in particular) that have the hardware to support GAS/PGAS. My attempts to run them on small 8-16 core workstations have been a major disappointment. Based on threads here and what I’ve read elsewhere, only NAG’s shared-memory implementation yields anywhere near the expected performance. I think you are always better off using MPI (or OpenMP on small shared memory systems) in the long run.
As a side note, I started out doing parallel programming almost 30 years ago now using Parallel Virtual Machine (PVM). There was a lot (and still is) to like about the simplicity of PVM. Its focus was on heterogeneous systems, and it was designed from the start to allow Fortran and C codes to exchange data. This was great for Beowulf or NOW (network of workstations) systems, but it was at a performance disadvantage on the Cray, IBM, etc. big iron that used large numbers of the same processors. PVM is still available from netlib but hasn’t been updated in more than a decade. Given the rise of GPU computing, I’ve wondered recently whether something like PVM would be a better match than MPI or OpenMP for running code kernels on GPUs.
The last version of PVM (around 2011) is available from netlib. This paper from 1996 compares PVM and MPI.
Yeah, I think you are alluding to the potential upside of coarrays. If implemented effectively for shared-memory specifically, which I can only assume is what NAG does, then coarray references will map precisely to load-store, without any function call overhead. The reason you don’t see this with Intel or GCC is that both call MPI unconditionally, because it’s completely portable and avoids some very difficult consistency issues with mixing MPI RMA and load-store (which NAG avoids by not supporting distributed memory at all).
Because MPI is explicit, it strongly encourages programmers to pack data into large transfers. In contrast, it is possible, and perhaps even likely, that some coarray usage patterns lead to doing one or more MPI operations for every element of an array, and the “or more” here would be synchronization. One of the things I helped the Intel Fortran team with 8 or 9 years ago was to reduce MPI synchronization in their coarray implementation, which improved the performance 10-100x in some workloads.
I don’t think there are many easy answers here, but one thing users can do is to write coarray code like MPI and try to ensure that every communication operation moves as much data as possible. This means, e.g., using array notation rather than a loop over element references. Of course, there is no guarantee the compiler does the former well or the latter poorly, but it can’t hurt.
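As a toy illustration (the array names and sizes here are made up, and whether the loop actually degenerates into per-element transfers depends entirely on the compiler and runtime):

```fortran
program pack_transfers
  implicit none
  real    :: a(1000)[*], b(1000)
  integer :: i, right

  b = real(this_image())
  right = this_image() + 1
  if (right > num_images()) right = 1

  ! Element by element: potentially one communication (plus synchronization)
  ! per element on implementations that turn each reference into an MPI call.
  do i = 1, 1000
     a(i)[right] = b(i)
  end do

  ! Array notation: one logical transfer the compiler/runtime can aggregate.
  a(:)[right] = b(:)

  sync all
end program pack_transfers
```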

> I don’t think there are many easy answers here, but one thing users can do is to write coarray code like MPI and try to ensure that every communication operation moves as much data as possible.
Couldn’t agree more. Back when I would teach MPI classes, one of the things I tried to emphasize was the importance of sending as much data as possible in a single message to offset the startup and synchronization costs, which on slow network/interconnect hardware can be substantial compared to how fast a processor can crunch numbers.
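To put rough, purely illustrative numbers on that with the usual latency-bandwidth model: at 1 µs of startup cost per message and 10 GB/s of bandwidth, 1000 separate 8-byte messages cost about 1000 × (1 µs + 0.8 ns) ≈ 1 ms, whereas a single 8000-byte message costs about 1 µs + 0.8 µs ≈ 1.8 µs, a factor of roughly 500.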
From what we can already see, PGAS (SPMD) support will come practically everywhere: CPU(multicore), GPU, FPGA, Ultra Ethernet (https://ultraethernet.org/), etc.
Coarrays are different from OpenSHMEM or MPI in that they provide a simpler interface for the application developer at a higher level, and they support unified acceleration (UXL) programming (https://uxlfoundation.org/) through a generic interface on top of OpenSHMEM, MPI RMA, or something else in the future. For example, Fortran’s SYNC MEMORY statement could be mapped to shmem_quiet, an RMA flush, or something else in the future. Thus, even today’s coarray codes should be very much future proof, also due to the ISO standardization. Application developers won’t be able to afford developing different codes for each single device/vendor in the next few years. The main advantages of coarrays for application developers are simplicity, flexibility, and the ISO standardization.
I am using coarrays to develop new ways of synchronizing fine-grained data transfers. After dissecting a traditional synchronization process into its constituent parts, I placed some of these parts differently at the application (!) level: a blocking spin-wait loop can be replaced by a single non-blocking parallel loop that embraces not only the synchronization points but also the complete kernel execution. This already enables single-image asynchronous code execution, i.e. multiple simultaneously executing “threads” on each coarray image, to massively and dynamically increase the workload on each coarray image. This leads to fine-grained data transfer, through several non-atomic coarrays for each “thread”, with a large number of synchronization points (i.e. successful atomic coarray data transfers). That large number of synchronization points is not itself a problem, because they are non-blocking (no spin-wait loop). One problem to solve was the extremely redundant execution of network synchronization statements (i.e. SYNC MEMORY) that would be required with each single synchronization point. To solve this, I no longer execute SYNC MEMORY regularly with each synchronization point, but only if the (non-atomic) data transfer did not complete successfully (which I can test easily). (Another current assumption is that this still leads back into ordered execution segments.) A first prototype solution in Fortran turned out to be quite simple.
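To make that a bit more concrete, here is a minimal, hypothetical sketch of the kind of check described above (all names are invented here, and this is only one plausible shape of the idea, not the actual implementation): the data travels through a non-atomic coarray, an atomic coarray marks the synchronization point, and SYNC MEMORY is executed only when the transfer has not yet completed, instead of spin-waiting.

```fortran
module channel_sketch
  use, intrinsic :: iso_fortran_env, only: atomic_int_kind
  implicit none
  integer(atomic_int_kind) :: ready[*] = 0   ! atomic synchronization point
  real :: payload(100)[*]                    ! non-atomic data coarray
contains
  subroutine send(dest, buf)
    integer, intent(in) :: dest
    real,    intent(in) :: buf(100)
    payload(:)[dest] = buf                               ! fine-grained one-sided put
    call atomic_define(ready[dest], 1_atomic_int_kind)   ! mark the transfer
  end subroutine send

  subroutine try_receive(buf, done)
    real,    intent(out) :: buf(100)
    logical, intent(out) :: done
    integer(atomic_int_kind) :: flag
    call atomic_ref(flag, ready)             ! non-blocking test, no spin-wait loop
    if (flag /= 0) then
      buf  = payload                         ! the data has already arrived
      call atomic_define(ready, 0_atomic_int_kind)
      done = .true.
    else
      sync memory                            ! only when the transfer is incomplete
      done = .false.                         ! the caller retries later in its own loop
    end if
  end subroutine try_receive
end module channel_sketch
```

The caller would invoke try_receive from inside its single parallel loop over “threads” and simply move on to other work whenever done comes back .false.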
The essence of this story could be that optimizations at the application (developer) level could become crucial to building the future of programming. And as complicated as this may appear and sound, I am also preparing a simple but fully functional coarray-based channel implementation that could work as a project for kids: only a few lines of code and a simple syntax using coarrays.