Learning coarrays, collective subroutines and other parallel features of Modern Fortran

So I may have misunderstood your first message, as you said it could “in threaded applications, reduce performance”?

No, you didn’t misunderstand. If your goal is to compare performance, then avoiding ifort’s RANDOM_NUMBER could make sense (or doing something like drawing a lot of values in one batch). If your goal is to learn about coarrays, then performance is less of a concern.
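
As an illustration of the batching idea, here is a rough sketch (not from the benchmark repository; the batch size, variable names, and point count are invented), where a single random_number call fills a whole array of coordinates so that any per-call overhead in the library is amortized over many points:

```fortran
! Sketch only: batch the random draws instead of calling random_number()
! once per point. Batch size and variable names are illustrative.
program pi_batch
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: i8 = selected_int_kind(18)
    integer(i8), parameter :: n = 1000000000_i8    ! total number of points
    integer(i8), parameter :: nbatch = 1000000_i8  ! points drawn per call
    real(dp), allocatable :: xy(:,:)
    integer(i8) :: k, done, m, i

    allocate(xy(2, nbatch))
    k = 0
    done = 0
    do while (done < n)
        m = min(nbatch, n - done)
        call random_number(xy(:, 1:m))        ! one call fills 2*m values
        do i = 1, m
            if (xy(1, i)**2 + xy(2, i)**2 <= 1.0_dp) k = k + 1
        end do
        done = done + m
    end do
    print '(a, f17.15)', 'Pi ~ ', 4.0_dp * real(k, dp) / real(done, dp)
end program pi_batch
```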

1 Like

Thanks. Coarrays are the primary goal, comparing performance is a secondary one… We can learn a lot both about our own code and about the way compilers work.

That should not matter in the case of coarray parallelism, or does it? Is there a difference between compiling for shared-memory coarrays vs. distributed coarrays with ifort?
My naive guess is that random_number() should not affect coarray performance even in shared-memory coarray applications, since each core runs a distinct image of the program.

No - I was remarking on the earlier experiments with OpenMP.

1 Like

Thanks! I haven’t yet encountered collective subroutines, as I am only at chapter 8 of Milan’s book…
I have merged your PR, but I have renamed your coarray versions with the “co_sum” suffix to keep my own coarray version: although its results are not as good, there are still some things to try to understand…

This is my updated benchmark including your co_sum versions:

| Version         | gfortran | ifort  |
|-----------------|----------|--------|
| Serial          | 19.9 s   | 34.8 s |
| OpenMP          | 9.9 s    | 93.0 s |
| Coarrays        | 16.2 s   | 14.4 s |
| Coarrays steady | 33.2 s   | 35.9 s |
| Co_sum          | 13.0 s   | 14.1 s |
| Co_sum steady   | 17.1 s   | 17.1 s |

co_sum improves the results compared to plain coarrays, especially in the version that regularly prints intermediate results: I don’t understand why the difference is so big, since in both cases we are (just 20 times) accessing and summing the values of the k variable of each image…

But the co_sum versions reveal something else: in the “Coarrays steady” version the intermediate results are wrong (from the very first print we should get something around 3.14, not 5!), whereas in the “co_sum steady” version they are OK. I don’t know what mistake I made.

It’s also surprising that the same compiler (gfortran) is 31% faster with OpenMP, although the OpenMP directives and the co_sum method do essentially the same thing (from the user’s perspective).

1 Like

The synchronization and serial summation loop seem to make a significant difference (even with so few images). The co_sum intrinsic can be a fair bit smarter than that (especially if you only need the answer on a single image). Synchronization and communication are serious bottlenecks for any parallel algorithm.
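
To make the comparison concrete, here is a minimal sketch of the two reduction styles being discussed (not the actual benchmark code; the variable names are invented). In (a), image 1 must wait at sync all and then performs one remote get per image in a serial loop; in (b), co_sum lets the runtime choose the reduction algorithm and only image 1 asks for the result:

```fortran
! Minimal sketch (not the benchmark code) comparing a hand-written coarray
! reduction with co_sum. Here k stands in for each image's Monte Carlo count.
program reduction_styles
    implicit none
    integer, parameter :: i8 = selected_int_kind(18)
    integer(i8) :: k[*]                    ! per-image partial count
    integer(i8) :: total_manual, total_co_sum
    integer :: i

    k = this_image()                       ! dummy per-image value

    ! (a) Hand-written reduction: synchronize, then image 1 reads k[i] serially.
    sync all                               ! required before reading remote k
    if (this_image() == 1) then
        total_manual = 0
        do i = 1, num_images()
            total_manual = total_manual + k[i]   ! one remote get per image
        end do
        print *, 'manual sum:', total_manual
    end if

    ! (b) Collective reduction: no explicit sync is written, the runtime
    !     chooses the algorithm, and k would not even need to be a coarray.
    total_co_sum = k
    call co_sum(total_co_sum, result_image=1)
    if (this_image() == 1) print *, 'co_sum:    ', total_co_sum
end program reduction_styles
```

With only a few images the serial loop is cheap, but the explicit sync all and the one-by-one remote gets are exactly the kind of cost that grows with the image count, while the collective is free to use a tree-shaped reduction.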

1 Like

Isn’t it because of shared-memory access in OpenMP? I doubt OpenCoarrays uses any shared memory for data access. @rouson could likely add more insight into this.

1 Like

If a collective subroutine solves your problem, always choose it over writing your own algorithm using coarrays. For me, this is part of a general preference for intrinsic procedures over custom algorithms, both for clarity and potentially for performance, although clarity is so important to me that I would even accept a small performance penalty if necessary. As my book co-author Jim Xia says, “Let the compiler do its job.” In the specific case of collective subroutines, Jim’s advice is especially important: doctoral dissertation chapters, and possibly even whole dissertations, have been written on optimizing the sorts of parallel algorithms that collective subroutines embody.

I have considerable experience writing my own versions of collective subroutines. I first dove into parallel Fortran in 2012, when the Intel and Cray compilers supported coarrays and the WG5 standards committee was working on a draft of “TS 18508 Additional Parallel Features in Fortran,” which defined the collective subroutines. So I started writing subroutines that emulated the collectives, to make it easier to migrate to Fortran 2018 once the compilers started supporting it. I can tell you from that experience that the collective subroutines provided by the language generally outperform even reasonably sophisticated user-defined coarray algorithms that accomplish the same things.

Writing efficient collective communication requires accounting for a range of factors that go well beyond what I would have the stomach to attempt to get right myself: network topology, bandwidth, latency, message size, etc. Moreover, the standard does not require the synchronizations that most naive developers would employ to get the communication right. This is really important, because any form of synchronization implies waiting, and waiting hurts performance.
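
As a small illustration of “letting the compiler do its job” (this is not code from this thread): even a custom reduction can be handed to the runtime with co_reduce instead of being hand-coded with coarrays. For a maximum one would of course just call co_max; the point here is only the mechanism of passing a pure user function:

```fortran
! Illustration only: a custom reduction delegated to the runtime via co_reduce.
program custom_reduction
    implicit none
    integer :: biggest

    biggest = this_image()            ! some per-image value
    call co_reduce(biggest, my_max)   ! runtime-chosen reduction algorithm
    if (this_image() == 1) print *, 'max over images:', biggest

contains

    pure function my_max(a, b) result(c)
        integer, intent(in) :: a, b
        integer :: c
        c = max(a, b)
    end function my_max

end program custom_reduction
```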

I have a deep backlog of publications that I need to get out the door soon and plan to submit over the next several months. Hopefully one such publication will put some data behind the above statements. I have some of the data in slides that I’ll post if I can find a moment to dig them up.

8 Likes

Performance analysis is a tricky art, and that goes tenfold for parallel performance analysis. There are too many subtleties to make generalizations without studying the code in detail. For example, there are shared-memory coarray implementations; the NAG compiler is one example. A student has been working on shared-memory coarray support for gfortran. He’s far enough along that I’d hoped it would appear this year in gfortran 11, but I don’t think his work has been accepted into the 11 branch yet, so I suspect it will appear next year in gfortran 12.

Regarding OpenCoarrays, the main goal is to define an application binary interface that makes no reference to the underlying parallel programming model, which can be MPI, OpenSHMEM, or GASNet. There is nothing precluding the exploitation of shared-memory features by any of these and there are MPI implementations that map each MPI rank to a thread. For example, this is the approach of MPC and I assume that means it can exploit shared-memory hardware while also handling distributed-memory communication.

4 Likes

As additional input (run on a system with Intel Broadwell processors, using the Cray / HPE compiler), here are results for the serial reference and the preferred co_sum coarray version:

```
ftn pi_monte_carlo_serial.f90
srun -n1 -CBW28 time ./a.out
srun: job 2786931 queued and waiting for resources
srun: job 2786931 has been allocated resources

4 * 785408025 / 1000000000
Pi ~ 3.141632100000000
10.34user 0.00system 0:10.36elapsed 99%CPU (0avgtext+0avgdata 3928maxresident)k
0inputs+0outputs (1major+612minor)pagefaults 0swaps
```

```
ftn pi_monte_carlo_co_sum.f90
srun -n4 -CBW28 time ./a.out
srun: job 2786933 queued and waiting for resources
srun: job 2786933 has been allocated resources
2/ 4 images
I will compute 250000000 points
3/ 4 images
I will compute 250000000 points
4/ 4 images
I will compute 250000000 points
1/ 4 images
I will compute 250000000 points

4 * 785401357 / 1000000000
Pi ~ 3.141605428000000
3.88user 0.01system 0:03.95elapsed 98%CPU (0avgtext+0avgdata 6212maxresident)k
224inputs+0outputs (2major+952minor)pagefaults 0swaps
3.88user 0.02system 0:03.95elapsed 98%CPU (0avgtext+0avgdata 5928maxresident)k
94inputs+0outputs (3major+955minor)pagefaults 0swaps
3.88user 0.01system 0:03.95elapsed 98%CPU (0avgtext+0avgdata 6164maxresident)k
192inputs+0outputs (3major+935minor)pagefaults 0swaps
3.88user 0.01system 0:03.95elapsed 98%CPU (0avgtext+0avgdata 6024maxresident)k
122inputs+0outputs (4major+958minor)pagefaults 0swaps
```

I would have included the timing in the body of the test using system_clock rather than relying on the time command.
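
For what it’s worth, here is a sketch of what that in-program timing might look like (names invented, nothing assumed about the actual benchmark structure); for the coarray versions one would compile with coarray support and report the time from image 1:

```fortran
! Sketch of in-program wall-clock timing with system_clock (names invented).
program timing_sketch
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: i8 = selected_int_kind(18)
    integer(i8) :: t_start, t_end, count_rate
    real(dp) :: elapsed

    call system_clock(count_rate=count_rate)
    call system_clock(t_start)

    ! ... the Monte Carlo computation would go here ...

    sync all                        ! so image 1's time covers all images
    call system_clock(t_end)
    elapsed = real(t_end - t_start, dp) / real(count_rate, dp)
    if (this_image() == 1) print '(a, f0.3, a)', 'Elapsed: ', elapsed, ' s'
end program timing_sketch
```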

2 Likes

Interesting discussion. I’ve only played a little with coarrays, so I’m still way behind on things like the collective communication features. A few questions, though.

  1. I’ve always assumed that most recent MPI implementations use shared-memory copies for all cores/processors on a shared-memory node and only use the interconnect/sockets etc. for remote processors (in distributed-memory programs). Is that really true? Do the MPI implementations underneath the coarray API for OpenCoarrays and Intel take advantage of this?

  2. Intel compilers appear to support (according to the man pages) both shared (-coarray=shared) and distributed (-coarray=distributed) modes. I’m not sure what difference, if any, you would see using the distributed mode on a multi-core processor. Does anyone have any info on the performance difference between the two modes?

  3. The primary applications I would use coarrays for are finite-volume solvers, where you need to exchange a halo of “ghost cells” between internal mesh partitions (aka communication surfaces). I’m not sure how you would do that with just collective communication, but my only experience with collectives is MPI’s reduce and allreduce functions. I think this would be one case where you would have to use coarrays. True?
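
On point 3, here is a hedged sketch of what a 1-D ghost-cell exchange looks like with coarrays (array size and names are invented; a real finite-volume solver would be 2-D or 3-D with a more careful partitioning). This is indeed point-to-point communication between neighbouring images, which the collective subroutines do not cover:

```fortran
! Hedged sketch of a 1-D ghost-cell (halo) exchange with coarrays.
program halo_exchange
    implicit none
    integer, parameter :: n = 100       ! interior cells per image
    real :: u(0:n+1)[*]                 ! cells 0 and n+1 are the ghosts
    integer :: me, left, right

    me    = this_image()
    left  = me - 1
    right = me + 1

    u = real(me)                        ! dummy interior data

    sync all                            ! neighbours' data must be ready
    if (left  >= 1)            u(0)   = u(n)[left]    ! pull from left neighbour
    if (right <= num_images()) u(n+1) = u(1)[right]   ! pull from right neighbour
    sync all                            ! all pulls finished before updating

    ! ... stencil update over u(1:n) using the ghost values would go here ...
end program halo_exchange
```

A production code would likely use sync images with just the two neighbours instead of sync all, to avoid making every image wait on every other one.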

2 Likes

I wonder what the ‘minimum’ requirements are for a Fortran program to create multiple images at startup, apart from using the appropriate compiler options and libraries? Defining a coarray surely is one. Would using collective subroutines (without coarrays) be enough? Or maybe even just the compiler options/libraries?
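
To make the question concrete, here is a tiny program that uses image inquiry functions and a collective subroutine but declares no coarray at all. Whether it actually starts several images still depends on how it is compiled and launched (e.g. ifort -coarray=shared, or gfortran with OpenCoarrays’ caf/cafrun):

```fortran
! A program using image inquiry and a collective subroutine but no coarray.
program no_coarray_collective
    implicit none
    integer :: n

    n = 1
    call co_sum(n)            ! afterwards n equals num_images() on every image
    if (this_image() == 1) then
        print *, 'num_images():         ', num_images()
        print *, 'images seen by co_sum:', n
    end if
end program no_coarray_collective
```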

Two more technical questions:

  1. Are there any firm plans for gfortran to support coarrays by itself, without the help of OpenCoarrays?

  2. Has anybody seen precompiled OpenCoarrays packages for RHEL 8.x (and/or its free clones)? Fedora used to have one, but I guess it was dropped at release 31.

1 Like

It is a bit surprising that the random number generator in ifort seems much slower than that in gfortran… I wonder what algorithms are used in those two cases (IIRC some of the following ones are used in gfortran?)

1 Like

Thanks for that link. I have found a Fortran version (public domain):

I could try to use it instead of call random_number(x).

The same author also has a Fortran RNG repository (GPL v3) using the same algorithm:

1 Like

Stdlib already has a xoshiro256 generator, but (currently) it can’t be used in a multithreaded setting: stdlib/stdlib_stats_distribution_PRNG.fypp at master · fortran-lang/stdlib · GitHub

You can follow the blog post of Jason Blevins (Parallel Computing in Fortran with OpenMP) to create a parallel version.
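
For reference, here is a rough sketch of the per-thread seeding idea behind that kind of approach (this is not Blevins’ code; the seed expression is ad hoc and purely illustrative). Note that whether the intrinsic generator keeps independent per-thread state is compiler-dependent, which is one reason an external generator such as xoshiro can be attractive:

```fortran
! Rough sketch (not Blevins' code): give each OpenMP thread its own seed for
! the intrinsic generator. The seed expression is ad hoc, for illustration only.
program omp_rng_sketch
    use omp_lib, only: omp_get_thread_num
    implicit none
    integer :: nseed, i
    integer, allocatable :: seed(:)
    real :: x

    !$omp parallel private(nseed, seed, i, x)
    call random_seed(size=nseed)
    allocate(seed(nseed))
    seed = [(37 * (omp_get_thread_num() + 1) + i, i = 1, nseed)]   ! ad hoc seeds
    call random_seed(put=seed)

    call random_number(x)    ! each thread now draws from its own stream
    print '(a,i0,a,f8.6)', 'thread ', omp_get_thread_num(), ' drew ', x
    !$omp end parallel
end program omp_rng_sketch
```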

3 Likes

I have updated the programs with system_clock(); here are the results (means over five runs):

| Version         | gfortran | ifort  | ifx    |
|-----------------|----------|--------|--------|
| Serial          | 20.1 s   | 35.9 s | 34.9 s |
| OpenMP          | 9.9 s    | 92.0 s | 97.1 s |
| Coarrays        | 14.4 s   | 13.9 s |        |
| Coarrays steady | 31.5 s   | 35.1 s |        |
| Co_sum          | 11.0 s   | 13.8 s |        |
| Co_sum steady   | 15.4 s   | 16.5 s |        |

With the coarray versions, system_clock() can yield slightly better results, especially for gfortran, which takes quite a while to launch the images (sometimes nearly 2 seconds before the first prints at the beginning of the program). With longer computations it would not matter.

I have added some results with ifx, but note that ifx does not yet support the -coarray option.

Is there an advantage to using system_clock() over cpu_time()?

@pcosta
I tried cpu_time(), but if I remember correctly, with OpenMP it returned the total CPU time (the sum of the CPU times of all threads), like the second line returned by the time command. I don’t remember if I tried it with coarrays.

Noted! I see the issue here:

There is also omp_get_wtime(), but I guess it is best to use system_clock(), which is part of the Fortran standard.
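
A small sketch of the difference, with an arbitrary OpenMP loop standing in for the real work: cpu_time reports CPU time summed over all threads, while system_clock measures wall-clock time:

```fortran
! Sketch: cpu_time sums CPU time over threads, system_clock measures wall time.
program wall_vs_cpu
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: i8 = selected_int_kind(18)
    integer(i8) :: t0, t1, rate
    real(dp) :: c0, c1, s
    integer :: i

    call system_clock(count_rate=rate)
    call cpu_time(c0)
    call system_clock(t0)

    s = 0.0_dp
    !$omp parallel do reduction(+:s)
    do i = 1, 100000000                    ! arbitrary work
        s = s + sqrt(real(i, dp))
    end do
    !$omp end parallel do

    call system_clock(t1)
    call cpu_time(c1)

    print '(a, f0.3, a)', 'wall time: ', real(t1 - t0, dp) / real(rate, dp), ' s'
    print '(a, f0.3, a)', 'CPU time:  ', c1 - c0, ' s  (~ wall time * threads)'
    print *, s                             ! prevent optimizing the loop away
end program wall_vs_cpu
```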

1 Like