Coarrays: Not ready for prime time

@certik if you want something simple and numerical, please check out the Burgers PDE solver in Chapter 12 of my book. It’s a 1D, unsteady finite-difference solver that’s object-oriented and parallel using coarrays in just a few hundred lines of code. My collaborators and I have also published a couple of papers about that solver, including one showing roughly 87% parallel efficiency in weak scaling out to 16,384 cores. A link to an open-access article is below.


Also, Bob Numrich, the inventor of Coarray Fortran, published a book on Coarray Fortran a few years ago. This dialogue is a good reminder that I want to order an electronic copy. I’m doing that now.

Numrich, R. W. (2018). Parallel Programming with Co-arrays. Chapman and Hall/CRC.

@FortranFan The pi calculation code might make a nice example in Caffeine. I’ll consider adding a coarray translation to the repository.

The OpenCoarrays test suite has lots of examples that might be useful to people too, especially the integration tests:


I almost forgot. The CloverLeaf mini-app from UK-MAC has a CAF version (along with MPI, GPU, etc.). It basically solves the 2D compressible Euler equations on a Cartesian grid. See:

https://uk-mac.github.io/CloverLeaf

and

Thanks @rouson. If we have time, we should add such parallel benchmarks to https://github.com/fortran-lang/benchmarks and maintain CAF, MPI, and other-language versions. If the results reveal that CAF needs to be improved, then we can work on that. I have high hopes for Caffeine.

@certik it’s probably not suitable for the benchmarks repository. The fact that it’s solving a 1D PDE makes it very different from most PDE solvers. In particular, the halo exchanges involve moving scalars on either end of a 1D interval, so the conclusions that one can draw from this code are limited.
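
For context, here is a minimal sketch (not taken from the book’s solver) of the kind of 1D halo exchange described above: each image pulls a single scalar from each neighbouring image, so very little data moves per exchange.

    Program halo_1d_sketch
      ! Minimal sketch, not from the book's solver: a 1D halo exchange where
      ! each image copies one scalar from each neighbouring image.
      Implicit None
      Integer, Parameter :: n = 100            ! local interior points (assumed)
      Real :: u(0:n+1)[*]                      ! local interval plus two halo cells
      Integer :: me, np
      me = this_image()
      np = num_images()
      u = real(me)                             ! dummy field data
      Sync All                                 ! neighbours' data must be defined first
      If (me > 1)  u(0)   = u(n)[me-1]         ! left halo from the left neighbour
      If (me < np) u(n+1) = u(1)[me+1]         ! right halo from the right neighbour
      Sync All                                 ! all halos received before reuse
    End Program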

I know there is this repository that could be used as a starting point to create some benchmarks: https://github.com/ParRes/Kernels.

@rouson if you have some good benchmark idea that would be a better fit, definitely let me know.

A very interesting repo. But judging from the few I looked at at random, these are all small examples that poke at various parallel operations, so I don’t think they fall into the category of benchmarks. Indeed, the repo’s own README says “These programs should not be used as benchmarks.”

The mini-app suggested by @rwmsu looks like exactly the kind of thing one should be looking for in a benchmark. But we need a suite of such codes covering the spectrum of parallel usage patterns. Ideally these would be mini-apps derived from real application codes that preserve their characteristic computational complexity, memory-access and communication patterns, etc.


I have posted a paper (and a presentation) that may be of interest for this discussion in that post:


@vmagnin, from the graphs in the presentation, CAF is slower than MPI. Is that correct? I am not sure I am reading them correctly.

Yes for small problems, but for big problems coarrays seem equivalent to MPI. This is the abstract of the paper:

The MPI is the most widespread data exchange interface standard used in parallel programming for clusters and supercomputers with many computer platforms. The primary means of the MPI communication between processes is passing messages based on basic point-to-point blocking and non-blocking routines. The choice of the optimal implementation of exchanges is essential to minimize the idle and transmission times to achieve parallel algorithm efficiency. We used three realizations of data exchange processes based on blocking, non-blocking point-to-point MPI routines and new features of the Coarray Fortran technique to determine the most efficient parallelization strategy. For the study, the two-dimensional wave equation was used as a test problem. During the experiments, the problem size and the approaches to the data exchange for transferring data between processes were changed. For each version, we measured the computation time and the acceleration factor. The research carried out shows that the larger the problem size, the greater the benefits of delayed non-blocking routines and Coarray Fortran. The efficiency of delayed non-blocking operations is due to overlapping the data transfer in the computations background. The Coarray Fortran acceleration is achieved by using Coarray variables with shared memory. The Coarray approach starts to win with the growth of problem size.

And I don’t know which compiler was used; I have no access to the Springer paper.
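
To make the abstract’s “delayed non-blocking” point more concrete, here is a rough sketch (not taken from the paper, whose source is not available) of the usual pattern in Fortran with MPI: post the halo sends/receives, compute the interior while the messages are in flight, and wait only before the halo values are actually needed. All routine and variable names here are placeholders.

    Program overlap_sketch
      ! Rough sketch, not from the paper: overlap communication and computation
      ! with non-blocking MPI. Post MPI_Irecv/MPI_Isend for the two halo scalars,
      ! do interior work while the messages are in flight, then wait before the
      ! received values are used.
      Use mpi
      Implicit None
      Integer :: rank, np, left, right, req(4), ierr, i
      Double Precision :: send_l, send_r, halo_l, halo_r, work
      Call MPI_Init(ierr)
      Call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      Call MPI_Comm_size(MPI_COMM_WORLD, np, ierr)
      left  = merge(rank-1, MPI_PROC_NULL, rank > 0)
      right = merge(rank+1, MPI_PROC_NULL, rank < np-1)
      send_l = rank; send_r = rank                       ! stand-in boundary values
      Call MPI_Irecv(halo_l, 1, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, req(1), ierr)
      Call MPI_Irecv(halo_r, 1, MPI_DOUBLE_PRECISION, right, 1, MPI_COMM_WORLD, req(2), ierr)
      Call MPI_Isend(send_l, 1, MPI_DOUBLE_PRECISION, left,  1, MPI_COMM_WORLD, req(3), ierr)
      Call MPI_Isend(send_r, 1, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, req(4), ierr)
      work = 0
      Do i = 1, 1000000                                  ! placeholder for the interior update
        work = work + 1
      End Do
      Call MPI_Waitall(4, req, MPI_STATUSES_IGNORE, ierr)
      ! ... boundary update using halo_l and halo_r would go here ...
      Call MPI_Finalize(ierr)
    End Program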


Intel 19.0

I don’t doubt the authors’ results at all, but the problem with papers like this is that the results are hard for anyone else to reproduce. It looks like there may be enough information and references in the paper for someone else to recreate, with significant effort, something close to their code, but what is really needed is access to the code itself.

They have used the Intel compiler and I would assume also Intel MPI.

The codes were assembled by the Intel (R) Fortran Intel (R) 64 Compiler to run applications on Intel (R) 64 (Version 19.0.3.199 Build 20190206) from the Intel Parallel Studio XE Cluster Edition for Linux (all tools) 2019 package.


It may also be worth comparing to the DVM system and to XcalableMP, both of which support Fortran. It is unclear whether many Fortran users will take advantage of advanced MPI features. Perhaps the initial emphasis should be on ease of writing and maintaining the resulting parallel code, and then on obstacles to performance optimization.

Tuning of parallel codes is quite architecture-specific. If GASNet or MPI libraries are used, then some comparison with the performance of those libraries on the system being used, for example with the OSU benchmarks, should also be done.


There is a more impressive pi-digits calculation program by David Bailey, here. It directly calculates the N-th fractional hexadecimal digit of pi (=3.243F6A888…).

I have a coarray modification to it. The loop iterations are distributed to the images and the partial sums are accumulated. Just one scalar coarray is needed (named “s”). Giving the program the argument 1 will produce the 2nd fractional digit, “4”, as the first digit of the last printed line. The ten-millionth fractional hex digit of pi is “1”, after a few seconds of calculation.

Code
    Module prec
      Use iso_fortran_env
      Implicit None
      Integer wp
      Parameter (wp=real64)
    End Module

    Program piqp_coarray

!   ORIGINAL COMMENT BLOCK
!   This program implements the BBP algorithm to generate a few hexadecimal
!   digits beginning immediately after a given position id, or in other words
!   beginning at position id + 1.  On most systems using IEEE 64-bit floating-
!   point arithmetic, this code works correctly so long as d is less than
!   approximately 1.18 x 10^7.  If 80-bit arithmetic can be employed, this limit
!   is significantly higher.  Whatever arithmetic is used, results for a given
!   position id can be checked by repeating with id-1 or id+1, and verifying
!   that the hex digits perfectly overlap with an offset of one, except possibly
!   for a few trailing digits.  The resulting fractions are typically accurate
!   to at least 11 decimal digits, and to at least 9 hex digits.

!   The function subprogram "second" below is a stub that may be replaced with
!   a CPU time function.

!   David H. Bailey     2006-09-08
!   ORIGINAL COMMENT BLOCK END


!   modified for coarrays by Themos Tsikas, 2022

      Use prec
      Implicit None
      Real (Kind=wp) :: pid, s1, s2, s3, s4, s[*]
      Integer :: i, length, status, ic, nhx
      Parameter (nhx=12)
      Character chx(nhx)
      Character (Len=256) arg
      ! ic is the hex digit position -- output begins at position ic + 1.

      If (command_argument_count()/=1) Then
        ! Report the usage on image 1, then terminate all images so that no
        ! image goes on to read a missing argument.
        If (this_image()==1) Write (error_unit, '(A)') '<prog> hex-digit-position'
        Error Stop 1
      End If
      Call get_command_argument(1, arg, length, status)
      Read (arg, '(I256)') ic

      s1 = series(1, ic)
      s2 = series(4, ic)
      s3 = series(5, ic)
      s4 = series(6, ic)
      s = 4.0_wp*s1 - 2.0_wp*s2 - s3 - s4

      Sync All
      If (this_image()==1) Then
        Do i = 2, num_images()
          s = s + s[ i ]
        End Do
        pid = mod(s, 1.0_wp) + 1.0_wp
        Call aihex(pid, nhx, chx)
        Write (output_unit, 100) ic, pid, chx
      End If
      Stop
100   Format ('Pi hex digit computation', /, 'position =', I12, ' + 1', /, &
          F20.15, /, 12A1)

    Contains
      Subroutine aihex(x, nhx, chx)
        ! This returns, in chx, the first nhx hex digits of the fraction of x.
        Real (Kind=wp) :: x, y
        Integer nhx, i
        Character (Len=1) chx(nhx), hx(0:15)
        Data hx/'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', &
          'C', 'D', 'E', 'F'/

        y = abs(x)
        Do i = 1, nhx
          y = 16.0_wp*(y-aint(y))
          chx(i) = hx(int(y))
        End Do
        Return
      End Subroutine

      Function series(m, ic) Result(s)

        ! This routine evaluates the series  sum_k 16^(ic-k)/(8*k+m)
        ! using the modular exponentiation technique.
        Integer m, ic, k
        Real (Kind=wp) :: ak, p, s, t, zero, one
        Parameter(zero=0._wp, one=1._wp)

        s = zero

        ! Sum the series up to ic.
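        ! (The iterations are split across images in round-robin fashion with a stride of num_images().)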
        Do k = this_image() - 1, ic, num_images()
          ak = 8 * k + m
          p = ic - k
          if (p == zero) then
            t = one
          else if (ak == one) then
            t = zero
          else
            t = expm1(p, ak)
          end if
          s = mod(s + t / ak, one)
        End Do
        Return
      End Function

      Function expm1(p, ak)
        ! expm1 = 16^p mod ak.  This routine uses the left-to-right binary
        ! exponentiation scheme.  It is valid for  ak <= 2^24.
        Integer :: ntp, i
        Parameter (ntp=30)
        Real (Kind=wp) :: ak, expm1, p, p1, pt, r, tp(ntp), zero, one, two, &
          half, sixteen
        Parameter (zero=0.0_wp, one=1.0_wp, two=2.0_wp, half=0.5_wp, &
          sixteen=16.0_wp)
        Save tp
        Data tp/ntp*0.0_wp/

        If (ak==one) Then
          expm1 = zero
          Return
        End If
        ! If this is the first call to expm1, fill the power of two table tp.
        If (tp(1)==zero) Then
          tp(1) = one
          Do i = 2, ntp
            tp(i) = two*tp(i-1)
          End Do
        End If
        ! Find the greatest power of two less than or equal to p.
        Do i = 1, ntp
          If (tp(i)>p) Exit
        End Do
        pt = tp(i-1)
        p1 = p
        r = one
        ! Perform binary exponentiation algorithm modulo ak.
        Do
          If (p1>=pt) Then
            r = mod(sixteen*r, ak)
            p1 = p1 - pt
          End If
          pt = half*pt
          If (pt>=one) Then
            r = mod(r*r, ak)
            Cycle
          Else
            Exit
          End If
        End Do
        expm1 = mod(r, ak)
        Return
      End Function
    End Program

@kargl

On an 8-core Ryzen 7 5800X system (Linux Mint 19.3), I get the following results using Intel oneAPI 2021.5. I ran with 1, 5, and 7 cores, and they all gave the same results:

    Pi hex digit computation
    position = 2 + 1
    1.246371928724869
    3F123B108698

I wrote the coarray version for exposition purposes, so I skipped some extra iterations that are in the original. This was done more than two years ago, and I have forgotten the details. All that matters is the leading digit anyway. My version probably loses accuracy sooner than the original, but it should be accurate enough for 10 million positions. Alternatively, you could start with the original and split the iterations among images in the same way I have done.

This coarray version adds the extra iterations from the original code and should produce digit strings closer to the original. Of the 12 digits displayed, the first has the best accuracy bound; the rest become progressively more error-prone. This can be checked by invocations with N and N+/-1.

Original, reporting “position = 1000000”:
6C65E52CB4BD
Coarray version, reporting “position = 1000000 + 1”:
6C65E52CB442

@themos, depending on the Fortran processor, this may be either a rather minor “editorial” comment or a significant data-race issue that adversely impacts your coarray version: it is best to avoid SAVE statements, and also DATA statements with their “implied SAVE” behavior, in parallel Fortran programs.

At first glance, it appears you can employ named constant arrays for both your tp table and your hx character list. This comment may be of interest here.
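
A minimal sketch of what that might look like, assuming the wp and ntp values from the posted code, with both tables turned into named constants so there is no run-time initialization (and no implied SAVE) at all:

    Program named_constant_sketch
      ! Minimal sketch (assumes wp and ntp as in the posted code): the SAVEd
      ! DATA tables become named constants, so no first-call initialization
      ! and no implied SAVE are involved.
      Use iso_fortran_env, Only: real64
      Implicit None
      Integer, Parameter :: wp = real64, ntp = 30
      Integer :: i
      Character (Len=1), Parameter :: hx(0:15) = ['0', '1', '2', '3', '4', '5', &
        '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F']
      Real (Kind=wp), Parameter :: tp(ntp) = [(2.0_wp**(i-1), i=1, ntp)]
      Print *, hx(10), tp(5)   ! 'A' and 16.0
    End Program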

It is pretty much minimalist (“embarrassingly parallel”), as there is only one coarray variable (a double-precision scalar, “s”). Each image (or MPI rank, or OpenMP thread, if you go down that route) adds its contribution to the partial sum. I wrote the final reduction without collectives because the NAG compiler did not have them until the latest release. The four calls to series can also be split up, and that was an easy win for the OpenMP SECTIONS directive.
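
For what it’s worth, with a compiler that does support the Fortran 2018 collectives, the manual loop over images could presumably be replaced by co_sum; a minimal sketch (not part of the posted program) is:

    Program co_sum_sketch
      ! Minimal sketch, not part of the posted program: reduce a per-image
      ! partial sum onto image 1 with the Fortran 2018 co_sum collective.
      Use iso_fortran_env, Only: real64, output_unit
      Implicit None
      Integer, Parameter :: wp = real64
      Real (Kind=wp) :: s
      s = real(this_image(), wp)       ! stand-in for this image's partial sum
      Call co_sum(s, result_image=1)   ! sum across all images, result on image 1
      If (this_image()==1) Write (output_unit, *) 'sum over images =', s
    End Program

Note that with co_sum the reduced variable does not even need to be a coarray.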