Several people have asked about simple coarray examples. The “pi” example being discussed uses a single scalar coarray variable to sum the partial results from the images. Another simple example is computing a prefix sum [wikipedia]. Here each image holds a scalar value, and the problem is for each image to compute the sum of the values of the preceding images, including itself. This is a standard parallel collective operation; MPI has MPI_Scan, for example. I’ve implemented a “fast” O(log2 N) algorithm here. It also uses just a single scalar coarray, but it is a very interesting exercise in using sync images.
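For anyone who wants to see the general shape of such a scan, here is a minimal sketch of an O(log2 N) inclusive prefix sum over images. Like the version described above it uses a single scalar coarray, but for brevity it synchronizes with plain sync all barriers instead of the pairwise sync images bookkeeping; all names are illustrative, and this is not the linked implementation.

```fortran
program prefix_sum_demo
  implicit none
  real    :: x[*]        ! each image's value, overwritten with its inclusive prefix sum
  real    :: tmp
  integer :: me, n, step

  me = this_image()
  n  = num_images()
  x  = real(me)          ! example data: the image index itself

  step = 1
  do while (step < n)
    tmp = 0.0
    sync all                                 ! the previous step's updates are complete
    if (me - step >= 1) tmp = x[me - step]   ! read the partial sum held by image me-step
    sync all                                 ! all reads are done before anyone overwrites x
    x = x + tmp
    step = 2*step
  end do

  write(*,*) 'image', me, 'prefix sum =', x  ! expect me*(me+1)/2
end program prefix_sum_demo
```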
I certainly do not agree with the title of this thread, because we can already use CAF to prepare for the approaching exascale era, to try out new types of parallel programming models, and to learn about strategies for massively improving performance (with OpenCoarrays/gfortran, at least). I am using both ifort and OpenCoarrays/gfortran successfully to develop an advanced channel/coroutine implementation (based on CAF) that is already adapted to a specific parallel programming model and may be adapted to other models as well. OpenCoarrays/gfortran offers great functionality and shows high performance (with MPICH), even on my laptop with oversubscribed cores. Ifort has low performance (apparently in executing the purely local code) and a number of bugs (ifort 2021.4.0), but I was able to adapt my codes so that they work the same with both compilers. In case it helps, here are some of the ifort bugs that I encountered:
My codes follow an important recommendation for CAF programming; see Modern Fortran Explained, chapter 17.1, p. 326: “Since it is desirable for most references to data objects in a parallel program to be local, coarray syntax should appear only in isolated parts of the source code.” To me, this appears to be the best way to make CAF programming feasible both now and in the future, to keep the codes maintainable, and to avoid hard-to-resolve coding errors.
The following code snippet is from my current prototyping and shows how CAF code may look with extended functionality and with the coarray syntax isolated away (I am currently preparing a new GitHub repository to explain these codes in more detail):
module procedure frob03_01_FragmentedMethod3_CE_SM2
!----------------------------------------------------------------------------------------
! (1) subtask1 block:
subtask1_if: if (i_Channel2Status == fo % enum_Channel2Status % subtask1) then
  subtask1_block: block
    integer(glob_kint) :: i_TestValue
    real(glob_krea) :: r_TestValue
    integer(glob_kint), dimension(1:3) :: ia1_TestArray
    subtask1_select: select case (i_ImageType)
    !========================================================
    ! control coroutine:
    case (enum_ImageType % ControlImage) ! on the control image
      control_coroutine_subtask1: block
        i_TestValue = 22
        r_TestValue = 2.222
        ia1_TestArray = (/22, 222, 22222/)
        call chnl_Channel2 % fill (i_val = i_TestValue, r_val = r_TestValue, &
                                   ia1_val = ia1_TestArray)
        call chnl_Channel2 % send (i_chstat = i_Channel2Status)
        ! this image is ready for the next task:
        i_Channel2Status = fo % enum_Channel2Status % subtask2
        !*** sending to frob03_01_FragmentedMethod3_CE_SM3 using Channel4:
        ! (we can use multiple channels in the same block for sending)
        if (i_Channel4Status == fo % enum_Channel4Status % subtask1) then
          r_TestValue = 4.444
          call chnl_Channel4 % fill (r_val = r_TestValue)
          call chnl_Channel4 % send (i_chstat = i_Channel4Status)
          ! this image is ready for the next task for Channel4:
          i_Channel4Status = fo % enum_Channel4Status % e_xit
        end if
      end block control_coroutine_subtask1
    !========================================================
    ! execute coroutine:
    case (enum_ImageType % ExecuteImage) ! on the execute images
      execute_coroutine_subtask1: block
        integer(glob_kint), dimension(1:1) :: ia1_ScalarInteger
        real(glob_krea), dimension(1:1) :: ra1_ScalarReal
        integer(glob_kint), dimension(1:3, 1:1) :: ia2_IntegerArray1D
        ! always use only a single channel within a block with IsReceive !
        ! (otherwise the data transfer through a channel won't synchronize successfully) !
        if (chnl_Channel2 % IsReceive (i_chstat = i_Channel2Status)) then
          call chnl_Channel2 % get (ia1_ScalarInteger = ia1_ScalarInteger, &
                                    ra1_ScalarReal = ra1_ScalarReal, &
                                    ia2_Integer1D = ia2_IntegerArray1D)
          i_TestValue = ia1_ScalarInteger(1)
          r_TestValue = ra1_ScalarReal(1)
          ia1_TestArray = ia2_IntegerArray1D(:,1)
          write(*,*) 'from channel2: ', i_TestValue
          write(*,*) 'from channel2: ', r_TestValue
          write(*,*) 'from channel2: ', ia1_TestArray
          ! IsReceive was successful, this image is ready for the next task:
          i_Channel2Status = fo % enum_Channel2Status % subtask2
          call system_clock(count = i_Time1) ! reset the timer
        end if
      end block execute_coroutine_subtask1
    !========================================================
    ! error: unclassified image
    case default
      return
    end select subtask1_select
  end block subtask1_block
end if subtask1_if
! (1) end subtask1 block
!----------------------------------------------------------------------------------------
!
We can use CAF to implement the missing low-level features we need, and we can use it to implement new types of sophisticated distributed object models: CAF is a fully featured parallel programming language with a dedicated parallel runtime.
Recent research and development with MPI still plays a central role in approaching the exascale era, and CAF may hopefully profit from that as well. From my own CAF coding I can only say that the authors of the paper below ask what may be the most relevant questions with regard to data transfer with new types of parallel programming techniques and models; see section II (Background), B (Messaging in Exascale Applications):
Thanks @Federchen for taking the time to respond and give your perspective as someone who has been doing a lot of very interesting work with coarrays. I mostly agree with what you had to say. While I think there is more room for improvement in the functionality of CAF (e.g., I think it needs more collective procedures, or perhaps a “standard library” that supplies them), it is fully functional and “ready for primetime” from that perspective. The issue for me is primarily (1) whether it is performance-competitive with MPI, and (2) whether that performance is portable across compilers and platforms. I gather from @rouson and @rwmsu that it is performance-competitive for very large problems on large distributed-memory machines. But my interest is mainly in how it performs at the other end of the HPC spectrum: a single shared-memory many-core node up to tens of nodes of a distributed-memory machine. Using a halo exchange operation as a test case, my very preliminary results show that coarrays are performance-competitive if you use the NAG compiler, but not even close using OpenCoarrays/gfortran. The results with Intel varied wildly depending on the variation of coarray implementation.
So my performance experience with OpenCoarrays/gfortran seems to be very different from yours. In my halo exchange test case it is more than 5000 times slower than MPI (measuring just the communication). I would love someone from the OpenCoarrays camp to take a look at what I’ve done and, if I’ve done something badly or wrong, to let me know. The link to my repo is in the OP. I do very much like CAF, am continuing to experiment and explore, and believe it has potential.
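For readers who haven’t seen one, the operation being timed is conceptually just the following (a deliberately simplified 1D sketch, not the code from my repo, which uses indirect index maps; all names here are made up):

```fortran
program halo_demo
  implicit none
  integer, parameter :: nl = 64            ! interior cells per image (illustrative)
  real    :: u(0:nl+1)[*]                  ! u(0) and u(nl+1) are ghost (halo) cells
  integer :: me, np, step

  me = this_image();  np = num_images()
  u  = real(me)                            ! fake field data

  do step = 1, 10                          ! a few "time steps"
    sync all                               ! neighbors finished writing u in the last step
    if (me > 1)  u(0)    = u(nl)[me-1]     ! left ghost  <- right edge of left neighbor
    if (me < np) u(nl+1) = u(1)[me+1]      ! right ghost <- left edge of right neighbor
    sync all                               ! ghosts filled before the interior is updated
    u(1:nl) = 0.5*(u(0:nl-1) + u(2:nl+1))  ! purely local stencil update using the halos
  end do

  if (me == 1) write(*,*) 'done'
end program halo_demo
```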
Btw, I am in complete agreement with the advice from MFE that you highlighted (it applies equally well to MPI) and follow it in my codes as well.
My take from this thread is that current CAF implementations that use MPI as the transport layer are at the mercy of the quality of the MPI implementation they are built on. That’s why I hope @rouson and company continue to look at alternatives like OpenSHMEM and GASNet-EX. The ability of CAF to be competitive for large problems on Cray systems with hardware support for PGAS puts and gets has been known for about 20 years, so that was no surprise to me. I am somewhat surprised by its erratic performance on smaller systems. However, my opinion is that this is a function of the underlying MPI implementation, how shared memory is managed on commodity processors, and probably (on Linux-based systems) how the underlying kernel is configured.
I don’t think so. MPICH is a perfectly great MPI implementation even for smaller systems, but coarrays on gfortran still have some issues as discussed above.
To each his own. I’ve always considered MPICH to be inferior to Open MPI, and that’s coming from someone who has been using MPI since about 1995.
@rwmsu Thanks for your interest in MPI alternatives. The main branch of Caffeine now has provisional support for the following features using GASNet-EX:
- this_image()
- num_images()
- error stop
- sync all
- co_sum, co_min, co_max, co_broadcast, and co_reduce

Next up will be supporting teams, coarrays, and event_type, not necessarily in that order.
Using Caffeine currently requires explicitly invoking Caffeine procedures. Hopefully we’ll convince some compiler development teams, starting with flang, to adopt Caffeine so that users can just write standard Fortran without explicitly invoking Caffeine procedures.
I am not familiar with coarrays, but it looks like coarrays actually use the MPI library, right?
So it seems coarrays and MPI are the same thing, and a coarray implementation is perhaps a wrapper around MPI to make the parallel syntax easier for everyone to use.
In terms of performance, at best coarrays could perform the same as MPI if both are well written. The bottleneck should be the speed and latency of the interconnect that carries the communication between the cores.
However, since it is a wrapper around MPI, I guess in some cases it just cannot perform as well as raw MPI.
In fact, I do hope Fortran can make MPI intrinsic, so that in the code we only need something like
use iso_MPI
to use MPI. This would really help, because then we would not need to configure MPI on our computers anymore.
Currently, besides installing Intel oneAPI, the easiest way I have found to get MPI and Fortran installed without any further configuration is, on Ubuntu or another Linux, to just type something like
sudo apt install gfortran mpich
Then we automatically have mpif90; no other configuration is needed.
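For example (a minimal sketch, assuming the default MPICH wrappers), the following program can then be built with mpif90 hello_mpi.f90 -o hello_mpi and run with mpiexec -n 4 ./hello_mpi:

```fortran
! hello_mpi.f90
program hello_mpi
  use mpi
  implicit none
  integer :: rank, nprocs, ierr
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  write(*,*) 'hello from rank', rank, 'of', nprocs
  call MPI_Finalize(ierr)
end program hello_mpi
```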
But I still wish Fortran could make MPI intrinsic.
No, coarrays are a feature of the language. MPI is a standard with many implementations. Coarrays may be implemented using MPI, but they’re at a higher abstraction level than MPI.
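To make the abstraction gap concrete, here is a hedged sketch (illustrative names, error handling omitted) of moving one value to another image both ways: in CAF it is a one-sided assignment, while raw MPI needs a matched send/receive pair on the two ranks.

```fortran
! Coarray version: a one-sided "put" is just an assignment.
program put_caf
  implicit none
  real :: a[*]
  a = 0.0
  if (this_image() == 1 .and. num_images() > 1) a[2] = 3.14
  sync all
  if (this_image() == 2) write(*,*) 'image 2 got', a
end program put_caf
```

```fortran
! MPI version: both ranks must participate explicitly.
program put_mpi
  use mpi
  implicit none
  real    :: a
  integer :: rank, ierr, status(MPI_STATUS_SIZE)
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  if (rank == 0) then
    a = 3.14
    call MPI_Send(a, 1, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
    call MPI_Recv(a, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, status, ierr)
    write(*,*) 'rank 1 got', a
  end if
  call MPI_Finalize(ierr)
end program put_mpi
```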
If it were necessary or desirable, I would agree. It is neither.
The MPI specification runs to roughly 1000 pages, which is more than the entire Fortran standard. The cost of incorporating it would represent many years of work that could be spent on the language proper. The cost of implementing it in compilers would be non-trivial (think about semantic checks, optimizations, test cases, documentation, managing alternatives). All user code (from the last 20+ years) would also need to be revised.
There is a great advantage in having a feature in the language, as opposed to having it in an external library. Just imagine how difficult it would be to do optimizations if instead of
x=y+1
you had to specify
call add_XYZ(y, 1, x)
The compiler would have to be aware of the value semantics of every subroutine and function. That is why people sometimes say that MPI is at the level of assembly language; it is best generated by an optimizing compiler, not a human.
A similar situation occurs with I/O in Fortran and C. In Fortran, I/O is part of the language and you need a particular kind of statement (READ/WRITE) to do I/O. In C, it is just a call to the standard library. Communication between images is very much like I/O. Values that do not affect the output or are not communicated do not have to be computed, and Fortran compilers apply that optimization. C compilers do not know which calls do I/O and which do not, and in any case C is a systems programming language where all of memory is essentially considered visible to other parts of the system, so such optimizations are not allowed.
Coarrays may not be perfect, but they are a great leap forward from MPI, comparable to the introduction of FORTRAN in 1957.
Thanks @themos.
My point is simple. I try to look at it just from a user’s perspective (not a very advanced user, just a regular user).
First, I like that the idea of coarrays is to make it easy for people to do parallelization in Fortran. This is a good direction, no doubt.
Second, (if I understand it correctly) since coarrays are a Fortran intrinsic feature, as @milancurcic pointed out, how about we just make them really intrinsic?
I mean, after I install Fortran, I should not need to install anything else and can just use coarrays. At least for me that would be fantastic, just like using the intrinsic function sin() in Fortran.
I do not care whether coarrays are based on MPI or something else. However, if using coarrays requires me to install MPI myself, and the performance is slower than MPI’s, then I do not see much point in using coarrays in performance-critical code.
In conclusion: OpenMP is already included in the compiler, right?
I know it can be a lot of work, but I personally feel that including MPI in the compiler would be useful. Especially considering that parallelization is a great feature of Fortran, there is no reason not to make this part even better, not only from the performance aspect but also from the user-experience aspect.
In short, as a user, I just want an installation button. After clicking this button and installing Fortran, I want MPI and coarrays to already be there, with no need to configure them anymore. That is all.
This is a follow-up to my OP. There I linked to a test repo that compared MPI and coarray implementations of a parallel halo exchange operation. The tests were pure communication, and several posters, very reasonably, were interested in seeing tests that incorporated some numerical computation.
I’ve finally gotten back to doing just that with a couple of example programs from my index-map repository. The examples solve the heat equation on the unit disk using a finite volume or finite element discretization with explicit time stepping (to avoid involving a linear solver). They exhibit the indirect data access patterns typical of unstructured mesh methods, coupled with parallel halo exchange operations. Here are sample results for the finite volume example (207K cells) comparing the MPI and CAF versions when compiled with the NAG and GNU Fortran compilers (the latter using OpenCoarrays). Times are µsec per time step, averaged over 10K steps.
| processes | 1 | 2 | 4 | 6 | 8 |
|-----------|-----|------|------|------|------|
| NAG-MPI   | 547 | 267  | 138  | 95   | 76   |
| NAG-CAF   | 362 | 184  | 100  | 75   | 66   |
| GNU-MPI   | 325 | 155  | 92   | 70   | 56   |
| GNU-CAF   | 452 | 2060 | 4100 | 4500 | 4487 |
See the link above for more details. For those interested, this is the time step loop being timed, and here and here (here for GNU) is the specific source for the gather_offp halo exchange call.
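For anyone curious about the measurement itself, the pattern is roughly the following (a hedged, self-contained sketch with placeholder stubs; the real loop in the linked repo times the gather_offp call together with the local update):

```fortran
program step_timing
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: nsteps = 10000
  integer(int64) :: t0, t1, rate
  integer :: n
  call system_clock(t0, rate)
  do n = 1, nsteps
    call gather_offp_stub()   ! stands in for the halo exchange (communication)
    call local_update_stub()  ! stands in for the purely local explicit update
  end do
  call system_clock(t1)
  write(*,*) 'microseconds per step:', 1.0e6*real(t1 - t0)/real(rate)/real(nsteps)
contains
  subroutine gather_offp_stub()
  end subroutine
  subroutine local_update_stub()
  end subroutine
end program step_timing
```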
@rouson, would it be possible for you to take a quick look at this and provide any comments regarding the results with OpenCoarrays and gfortran in the above table? Something looks amiss given how poor the times are in this situation. Given your vast experience with OpenCoarrays and gfortran, your guidance would be very useful.
Just a wild (and probably wrong) guess, but I wonder if GNU-CAF is doing a lot of synchronization (i.e., MPI_WAIT, etc.) that it doesn’t need to do. I don’t know if you can display them or not, but it would be interesting to see the underlying MPI calls used by OpenCoarrays and compare them with the native MPI implementation. Also, @nncarlson, I saw your post on the Intel Fortran Forum. Did you ever try using a different MPICH implementation with ifort, as suggested there?
Yes I did (and I need to reply there), but no luck. The program segfaults at what seems to be a different place than where it does when using Intel’s MPI, which I think is due to the hydra_pmi_proxy process just running out of memory.
Ifort and gfortran/OpenCoarrays both show the same low coarray performance pattern with (Intel) MPICH. To my current understanding, this is related to the way MPICH is configured to allow mismatched arguments with the Fortran compilers (an MPICH requirement for its functions with void-like arguments?).
Starting with the gfortran 10.0.0 release, argument mismatches are detected differently by the compiler, and MPICH must be configured accordingly. To my current understanding, this new way of configuring MPICH with the Fortran compilers leads to a very low coarray performance pattern with gfortran. (I observe exactly the same low coarray performance pattern with ifort and Intel MPICH.) I tried recent gfortran (with OpenCoarrays) with different MPICH versions, resulting in the same poor coarray performance.
A simple trick to achieve high coarray performance with MPICH is to use older Fortran compiler releases: gfortran releases prior to 10.0.0; earlier ifort releases also gave much higher coarray performance (I just don’t recall before which ifort version it was).
@Federchen, wouldn’t the same mismatched-argument problem show up in an MPI implementation? That doesn’t appear to be the case. Also, can you give an example of what you mean by “mismatched arguments”? MPI has always had a problem with Fortran assumed-shape arrays, if that’s what you mean. Fixing this was the prime motivation for introducing the ISO_Fortran_binding.h facilities in the recent extension of Fortran’s C interoperability, which provides a means of passing assumed-shape arrays to C without the risk of creating a temporary array and doing a copy-in. For those unfamiliar with the assumed-shape issues with MPI: the problem occurred with non-blocking sends and/or blocking sends whose message sizes are smaller than the eager limit, which controls whether a message is buffered or sent straight from memory. In the latter case, if a temporary array is created via a copy-in and for some reason the MPI function does not block until the message transfer is completed, you can exit the calling routine, which can lead to the temporary array created by the copy-in being deleted before the message transfer is completed. However, I thought this problem had been fixed in recent MPI implementations. Also, the assumed-shape array issue usually resulted in a deadlock or another MPI error, not degraded performance.
I’m not sure how changes affecting the MPICH Fortran interface would have any impact on gfortran CAF, since the OpenCoarrays library is C code. But I rebuilt MPICH 3.3.2 and OpenCoarrays with GCC 9.3.0, recompiled and ran my test with gfortran 9.3.0, and I get the same results. 9.3 is about as old as I can go, though, due to gfortran bugs that I trip in earlier versions.
@nncarlson, one thing you might try is building MPICH to use just the gforker process manager instead of hydra. I was going to try that, but I’m having issues building MPICH with Intel oneAPI icc. It keeps dying while trying to link in stdc++ even though I’ve told configure -no-cxxlib, etc. gforker is specific to single-node shared-memory applications; hydra is really for multi-node HPC-like applications. On the MPICH configure line, add --with-pm=gforker.
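For reference, a from-source build along those lines might look something like ./configure --prefix=$HOME/mpich-gforker --with-pm=gforker CC=gcc FC=gfortran followed by make and make install (the prefix path and compiler choices here are just illustrative).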
When compiling MPICH 4.0.1 with GCC 11.2.0 (on Ubuntu), the configuration aborts with the following error message from MPICH’s configure:
checking whether gfortran allows mismatched arguments… yes, with -fallow-argument-mismatch
configure: error: The Fortran compiler gfortran does not accept programs that call the same routine with arguments of different types without the option -fallow-argument-mismatch. Rerun configure with FFLAGS=-fallow-argument-mismatch and FCFLAGS=-fallow-argument-mismatch
This was already discussed among gfortran/OpenCoarrays team members some time ago:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91731
Whatever MPICH version I tried with gfortran/OpenCoarrays, the use of FFLAGS=-fallow-argument-mismatch with MPICH’s configure always resulted in an extremely poor coarray performance pattern: the resulting coarray programs not only run slowly, they also show a certain pattern where execution appears to stall several times (with my test cases). Exactly the same execution pattern occurs with ifort.
With gfortran/OpenCoarrays, compiling MPICH without FFLAGS=-fallow-argument-mismatch in ./configure always delivered incredibly high coarray performance with my test cases (with a highly optimized PGAS cost function, and with performance measured only for code execution within coarray teams, not counting the time for allocating coarrays or any blocking synchronization).
So far, avoiding
./configure FFLAGS=-fallow-argument-mismatch FCFLAGS=-fallow-argument-mismatch
with MPICH was all I had to do to achieve high coarray performance with gfortran/OpenCoarrays.
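As a hedged side note: if you are unsure how an existing MPICH installation was configured, running the mpichversion utility that ships with MPICH should print the configure options it was built with, so you can check whether -fallow-argument-mismatch ended up in FFLAGS/FCFLAGS.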