Coarrays: Not ready for prime time

@kargl

On an 8-core Ryzen 7 5800X system (Linux Mint 19.3), I get the following results using Intel oneAPI 2021.5. I ran with 1, 5, and 7 cores and they all gave the same results.

Pi hex digit computation
position = 2 + 1
1.246371928724869
3F123B108698

I wrote the coarray version for exposition purposes, so I skipped some extra iterations that the original performs. That was more than two years ago, and I forget the details. All that matters is the leading digit anyway. My version probably starts going wrong before the original does, but it should be accurate enough up to 10 million positions. Alternatively, you could start with the original and split the iterations among images in a similar way to what I have done.

Thanks for the quick reply. I’ve used MPI for a few applications and have decided to take a deeper dive into coarray Fortran. I am hoping your example will simplify the initial jump. I need to look under the hood to see what Bailey’s code is doing compared to yours.

This coarray version adds the extra iterations from the original code and should produce digit strings closer to the original. Of the 12 digits displayed, the first has the best accuracy bound; the rest become progressively more error-prone. This can be checked by invoking the program with N and N±1.

Original, reporting “position = 1000000”:
6C65E52CB4BD
Coarray version, reporting “position = 1000000 + 1”:
6C65E52CB442

Wow. Thanks. The results that you show are what I was originally expecting (based on comments in Bailey’s code). I’ll study your new code to learn more about coarrays.

@themos, depending on the Fortran processor, this may be either a rather minor “editorial” comment or a significant data-race issue that adversely impacts your coarray version: avoid SAVE statements, and also DATA statements with their “implied save” behavior, in Fortran parallel programs.

At first glance, it appears you can employ named constant arrays for both your tp table and your hx character list. This comment may be of interest here.
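As a hypothetical illustration of that suggestion (the names and values below are placeholders, not the actual tables from the pi program):

program named_constants_demo
  ! A local table initialized with DATA carries the implied SAVE attribute:
  !     real(dp) :: tp(3)
  !     data tp / 1.0_dp, 2.0_dp, 4.0_dp /
  ! A named constant conveys the same values but is immutable on every image,
  ! so no data race is possible.
  implicit none
  integer, parameter :: dp = selected_real_kind(15)
  real(dp), parameter :: tp(3) = [1.0_dp, 2.0_dp, 4.0_dp]
  character(len=*), parameter :: hx = '0123456789ABCDEF'

  print *, 'tp =', tp
  print *, 'hx = ', hx
end program named_constants_demo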

It is pretty much minimalist (“embarrassingly parallel”), as there is only one coarray variable (a double precision scalar, “s”). Each image (or MPI rank, or OpenMP thread, if you go down that route) adds its contribution to the partial sums. I wrote the final reduction without collectives because the NAG compiler did not have them until the latest release. The four calls to series can also be split up, which was an easy win for the OpenMP SECTIONS construct.
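In case a bare-bones version of that pattern helps newcomers, here is a minimal sketch: a single scalar coarray holds each image’s partial sum, the iterations are distributed cyclically over the images, and image 1 performs the final reduction by hand instead of calling co_sum. The series term is just a placeholder, not Bailey’s formula; only the structure matters.

program partial_sums
  implicit none
  integer, parameter :: dp = selected_real_kind(15)
  real(dp) :: s[*]        ! the single scalar coarray: one partial sum per image
  real(dp) :: total
  integer :: k, img

  ! cyclic distribution of the iterations over the images
  s = 0.0_dp
  do k = this_image(), 100000, num_images()
     s = s + 1.0_dp/(real(k,dp)**2)   ! placeholder term, not the actual series
  end do

  ! manual final reduction on image 1 (no co_sum), as described above
  sync all
  if (this_image() == 1) then
     total = 0.0_dp
     do img = 1, num_images()
        total = total + s[img]
     end do
     print '(a,f18.14)', ' sum of partial sums = ', total
  end if
end program partial_sums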

Several people have asked about simple coarray examples. The “pi” program being discussed uses a single scalar coarray variable to sum the partial results from the images. Another simple example is computing a prefix sum [wikipedia]. Here each image holds a scalar value, and the problem is for each image to compute the sum of the values of the preceding images, including itself. This is a standard parallel collective operation; MPI has MPI_Scan, for example. I’ve implemented a “fast” O(log2 N) algorithm here. It also uses just a single scalar coarray, but is a very interesting exercise in using sync images.
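Below is a sketch of one way to arrange such a scan with a single scalar coarray and pairwise sync images; it is only an illustration of the handshaking involved, not necessarily the linked implementation.

program prefix_sum
  ! Inclusive prefix sum (scan) across images using one scalar coarray.
  ! In each round, image me adds the value that image me-d held at the end
  ! of the previous round, with d doubling each time.
  implicit none
  integer :: x[*]          ! each image holds one value; overwritten by its prefix sum
  integer :: tmp, d, me, n

  me = this_image()
  n  = num_images()
  x  = me                  ! example data: image i contributes the value i

  d = 1
  do while (d < n)
     tmp = 0
     if (me - d >= 1) then
        sync images (me - d)   ! wait until the source has finished the previous round
        tmp = x[me - d]        ! fetch its value from the previous round
        sync images (me - d)   ! allow the source to overwrite its value
     end if
     if (me + d <= n) then
        sync images (me + d)   ! signal the reader that our value is ready
        sync images (me + d)   ! wait until the reader has taken a copy
     end if
     x = x + tmp
     d = 2*d
  end do

  write(*,'(a,i0,a,i0)') 'image ', me, ': prefix sum = ', x
end program prefix_sum

The doubled sync images calls form a simple handshake: the first tells the reading image that the value from the previous round is ready, and the second tells the writing image that it is now safe to overwrite that value.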

2 Likes

I will certainly not agree with the title of this thread: we can already use CAF to prepare for the approaching exascale era, to try out new types of parallel programming models, and to learn strategies for massively improving performance (with OpenCoarrays/gfortran so far). I am using both ifort and OpenCoarrays/gfortran successfully to develop an advanced channel/coroutine implementation (based on CAF) that is already adapted to a specific parallel programming model and that may be adapted to other models as well. OpenCoarrays/gfortran offers great functionality and shows high performance (with MPICH), even on my laptop with oversubscribed cores. Ifort shows low performance (apparently in executing the purely local code) and a number of bugs (ifort 2021.4.0), but I was able to adapt my codes so that they work the same with both compilers. In case it helps, here are some of the ifort bugs that I encountered:

My codes follow an important recommendation for CAF programming; see Modern Fortran Explained, chapter 17.1, p. 326: “Since it is desirable for most references to data objects in a parallel program to be local, coarray syntax should appear only in isolated parts of the source code.” To me, this appears to be the best way to make CAF programming feasible both now and in the future, to keep the codes maintainable, and to avoid hard-to-resolve coding errors.
The following code snippet is from my current prototyping and shows what CAF code may look like with extended functionality and the coarray syntax isolated away (I am currently preparing a new GitHub repository to explain these codes in more detail):

module procedure frob03_01_FragmentedMethod3_CE_SM2

!----------------------------------------------------------------------------------------
! (1) subtask1 block:
subtask1_if: if (i_Channel2Status == fo % enum_Channel2Status % subtask1) then
subtask1_block: block
  integer(glob_kint) :: i_TestValue
  real(glob_krea) :: r_TestValue
  integer(glob_kint), dimension(1:3) :: ia1_TestArray

  subtask1_select: select case (i_ImageType)
  !========================================================
  ! control coroutine:
  case (enum_ImageType % ControlImage) ! on the control image

    control_coroutine_subtask1: block
      i_TestValue = 22
      r_TestValue = 2.222
      ia1_TestArray = (/22,222,22222/)
      call chnl_Channel2 % fill (i_val = i_TestValue, r_val = r_TestValue, &
                                 ia1_val = ia1_TestArray)
      call chnl_Channel2 % send (i_chstat = i_Channel2Status)
      ! this image is ready for the next task:
      i_Channel2Status = fo % enum_Channel2Status % subtask2

      !*** sending to frob03_01_FragmentedMethod3_CE_SM3 using Channel4:
      ! (we can use multiple channels in the same block for sending)
      if (i_Channel4Status == fo % enum_Channel4Status % subtask1) then
        r_TestValue = 4.444
        call chnl_Channel4 % fill (r_val = r_TestValue)
        call chnl_Channel4 % send (i_chstat = i_Channel4Status)
        ! this image is ready for the next task for Channel4:
        i_Channel4Status = fo % enum_Channel4Status % e_xit
      end if

    end block control_coroutine_subtask1
  !========================================================
  ! execute coroutine:
  case (enum_ImageType % ExecuteImage) ! on the execute images

    execute_coroutine_subtask1: block
      integer(glob_kint), dimension (1:1) :: ia1_ScalarInteger
      real(glob_krea), dimension (1:1) :: ra1_ScalarReal
      integer(glob_kint), dimension(1:3, 1:1) :: ia2_IntegerArray1D
      ! always use only a single channel within a block with IsReceive !
      ! (otherwise the data transfer through a channel won't synchronize successfully) !
      if (chnl_Channel2 % IsReceive (i_chstat = i_Channel2Status)) then
        call chnl_Channel2 % get (ia1_ScalarInteger = ia1_ScalarInteger, &
                                  ra1_ScalarReal = ra1_ScalarReal, &
                                  ia2_Integer1D = ia2_IntegerArray1D)
        i_TestValue = ia1_ScalarInteger (1)
        r_TestValue = ra1_ScalarReal (1)
        ia1_TestArray = ia2_IntegerArray1D (:,1)
        write(*,*) 'from channel2:', i_TestValue
        write(*,*) 'from channel2: ', r_TestValue
        write(*,*) 'from channel2: ', ia1_TestArray
        ! IsReceive was successful, this image is ready for the next task:
        i_Channel2Status = fo % enum_Channel2Status % subtask2
        call system_clock(count = i_Time1) ! reset the timer
      end if
    end block execute_coroutine_subtask1
  !========================================================
  ! error: unclassified image
  case default
    return
  end select subtask1_select

end block subtask1_block
end if subtask1_if
! (1) end subtask1 block
!----------------------------------------------------------------------------------------
!

We can use CAF to implement the missing low-level features that we need, and we can use it to implement new types of sophisticated distributed object models: CAF is a fully featured parallel programming language with a dedicated parallel runtime.

Recent research and development with MPI still plays a central role in approaching the exascale era, and CAF will hopefully profit from that as well. From my own CAF coding I can only say that the authors of the paper below ask what may be the most relevant questions regarding data transfer with new types of parallel programming techniques and models; see section II (Background), B (Messaging in Exascale Applications):

4 Likes

Thanks @Federchen for taking the time to respond and give your perspective as someone who has been doing a lot of very interesting work with coarrays. I mostly agree with what you had to say. While I think there is room for improvement in the functionality of CAF (e.g., it needs more collective procedures, or perhaps a “standard library” that supplies them), it is fully functional and “ready for prime time” from that perspective. The issue for me is primarily 1) whether it is performance-competitive with MPI, and 2) whether that performance is portable across compilers and platforms. I gather from @rouson and @rwmsu that it is performance-competitive for very large problems on large distributed-memory machines. But my interest is mainly in how it performs at the other end of the HPC spectrum: a single shared-memory many-core node up to tens of nodes of a distributed-memory machine. Using a halo exchange operation as a test case, my very preliminary results show that coarrays are performance-competitive if you use the NAG compiler, but not even close using OpenCoarrays/gfortran. The results with Intel varied wildly depending on the variant of the coarray implementation.
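For readers who have not seen one, the operation itself is simple; a schematic 1D coarray version (just an illustration, not the index-map implementation in my repo) looks like this:

program halo_1d
  ! Schematic 1D halo exchange: each image owns n cells plus one ghost cell on
  ! each side, filled from the neighbouring images with one-sided gets.
  implicit none
  integer, parameter :: n = 8
  real :: u(0:n+1)[*]              ! u(1:n) owned; u(0) and u(n+1) are ghosts
  integer :: me, np

  me = this_image()
  np = num_images()
  u = 0.0
  u(1:n) = real(me)                ! made-up owned data

  sync all                         ! neighbours must have defined their data before we read it
  if (me > 1)  u(0)   = u(n)[me-1] ! gather the right edge of the left neighbour
  if (me < np) u(n+1) = u(1)[me+1] ! gather the left edge of the right neighbour
  sync all                         ! all ghosts filled before anyone overwrites its owned cells

  ! ... a compute step would now update u(1:n) using u(0:n+1) ...
  print *, 'image', me, 'ghosts:', u(0), u(n+1)
end program halo_1d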

So my performance experience with OpenCoarrays/gfortran seems to be very different from yours. In my halo exchange test case it is more than 5000 times slower than MPI (measuring just the communication). I would love someone from the OpenCoarrays camp to take a look at what I’ve done and let me know if I’ve done something badly or wrong. The link to my repo is in the OP. I do very much like CAF, am continuing to experiment and explore, and believe it has potential.

Btw, I am in complete agreement with the advice from MFE that you highlighted (it applies equally well to MPI) and follow it in my codes as well.

5 Likes

My take from this thread is that current CAF implementations that use MPI as the transport layer are at the mercy of the quality of the MPI implementation they are built on. That’s why I hope @rouson and company continue to look at alternatives like OpenSHMEM and GASNet-EX. The ability of CAF to be competitive for large problems on Cray systems with hardware support for PGAS/GASNet puts and gets has been known for about 20 years, so that was no surprise to me. I am somewhat surprised by its erratic performance on smaller systems. However, my opinion is that this is a function of the underlying MPI implementation, how shared memory is managed on commodity processors, and probably (on Linux-based systems) how the underlying kernel is configured.

I don’t think so. MPICH is a perfectly great MPI implementation even for smaller systems, but coarrays on gfortran still have some issues as discussed above.

To each his own. I’ve always considered MPICH to be inferior to Open MPI, and that’s coming from someone who has been using MPI since about 1995.

1 Like

@rwmsu Thanks for your interest in MPI alternatives. The main branch of Caffeine now has provisional support for the following features using GASNet-EX:

  • this_image(),
  • num_images(),
  • error stop,
  • sync all, and
  • co_sum, co_min, co_max, co_broadcast, and co_reduce.

Next up will be supporting teams, coarrays, and event_type, not necessarily in that order.

Using Caffeine currently requires explicitly invoking Caffeine procedures. Hopefully we’ll convince some compiler development teams, starting with flang, to adopt Caffeine so that users can just write standard Fortran without explicitly invoking Caffeine procedures.

2 Likes

I am not familiar with coarrays, but it looks like coarrays actually use an MPI library, right?
So it seems coarrays and MPI are the same thing, and coarrays are perhaps a wrapper around MPI that makes the parallel syntax easier for everyone to use.
In terms of performance, at best coarrays could perform the same as MPI if both are well written. The bottleneck should be the bandwidth and latency of the interconnect that carries the communication between the cores.

However, since it is a wrapper around MPI, I guess in some cases it just cannot perform as well as raw MPI.

In fact, I do hope Fortran can make MPI intrinsic, so that in the code we would only need something like

use iso_MPI

to use MPI.
That would really help, because then we would not need to configure MPI on our computers anymore.
Currently, besides installing Intel oneAPI, the easiest way I have found to get MPI and Fortran installed without any further configuration is, on Ubuntu or another Linux distribution, just to type something like

sudo apt install gfortran mpich

Then we automatically have mpif90; no further configuration is needed.

But I still wish Fortran would make MPI intrinsic.

No, coarrays are a feature of the language. MPI is a standard with many implementations. Coarrays may be implemented using MPI, but they’re at a higher abstraction level than MPI.

4 Likes

If it were necessary or desirable, I would agree. It is neither.

The MPI specification runs to roughly 1000 pages, which is more than the entire Fortran standard. The cost of incorporating it would represent many years of work that could be spent on the language proper. The cost of implementing it in compilers would be non-trivial (think of semantic checks, optimizations, test cases, documentation, managing alternatives). All user code from the last 20+ years would also need to be revised.

There is a great advantage in having a feature in the language, as opposed to having it in an external library. Just imagine how difficult it would be to do optimizations if instead of

x=y+1

you had to specify

Call add_XYZ(y,1,x)

The compiler would have to be aware of the value semantics of every subroutine and function. That is why people sometimes say that MPI is at the level of assembly language; it is best generated by an optimizing compiler, not a human.

A similar situation occurs with I/O in Fortran and C. In Fortran, I/O is part of the language and you need a particular kind of statement (READ/WRITE) to do I/O. In C, it is just a call to the standard library. Communication between images is very much like I/O. Values that do not affect the output and are not communicated do not have to be computed; Fortran compilers apply that optimization. C compilers do not know which calls are I/O and which are not, and in any case C is a systems programming language where all of memory is essentially considered visible to other parts of the system, so such optimizations are not allowed.

Coarrays may not be perfect but they are a great leap forward from MPI, comparable to the introduction of FORTRAN in 1957.

7 Likes

Thanks, @themos.
My point is simple: I try to look at it just from a user’s perspective (not a very advanced user, just a regular user).

First, I like the idea that coarrays make it easy for people to do parallelization in Fortran. This is a good direction, no doubt.

Second, if I understand correctly, since coarrays are intrinsic to Fortran, as @milancurcic pointed out, how about making them really intrinsic?
I mean that after I install a Fortran compiler there is no need to install anything else, and I can just use coarrays. At least for me that would be fantastic, just like using the intrinsic function sin() in Fortran.

I do not care whether coarrays are based on MPI or something else. However, if using coarrays requires me to install MPI myself, and the performance is slower than MPI, then I do not see much point in using coarrays in performance-critical code.

In conclusion: OpenMP is already included in the compiler, right?
I know it could be a lot of work, but I personally feel that including MPI in the compiler would be useful. Especially considering that parallelization is a great feature of Fortran, there is no reason not to make this part even better, not only in terms of performance but also in terms of user experience.

In short, as a user, I just want an installation button. After clicking this button and installing Fortran, I want MPI and coarrays to already be there, with no need to configure them. That is all.

This is a follow-up to my OP. There I linked to a test repo that compared MPI and coarray implementations of a parallel halo exchange operation. The tests were pure communication, and several posters, very reasonably, were interested in seeing tests that incorporated some numerical computation.

I’ve finally gotten back to doing just that with a couple of example programs from my index-map repository. The examples solve the heat equation on the unit disk using a finite volume or finite element discretization with explicit time stepping (to avoid involving a linear solver). They exhibit the indirect data access patterns typical of unstructured mesh methods, coupled with parallel halo exchange operations. Here are sample results for the finite volume example (207K cells) comparing the MPI and CAF versions when compiled with the NAG and GNU Fortran compilers (the latter using OpenCoarrays). Times are µs per time step, averaged over 10K steps.

processes      1      2      4      6      8
NAG-MPI      547    267    138     95     76
NAG-CAF      362    184    100     75     66
GNU-MPI      325    155     92     70     56
GNU-CAF      452   2060   4100   4500   4487

See the link above for more details. For those interested, this is the time step loop being timed, and here and here (here for GNU) are the specific sources for the gather_offp halo exchange call.

1 Like

@rouson, would it be possible for you to take a quick look at this and provide any comments regarding the results with OpenCoarrays and gfortran in the above table? Something looks amiss, given how poor the times are in this situation. With your vast experience with OpenCoarrays and gfortran, your guidance would be very useful.