Learning coarrays, collective subroutines and other parallel features of Modern Fortran

Performance analysis is a tricky art, and that goes tenfold for parallel performance analysis. There are too many subtleties to make generalizations without studying the code in detail. For example, there are shared-memory coarray implementations; the NAG compiler is one example. A student has been working on shared-memory coarray support for gfortran. He’s far enough along that I’d hoped it would appear this year in gfortran 11, but I don’t think his work has been accepted into the 11 branch yet, so I suspect it will appear next year in gfortran 12.

Regarding OpenCoarrays, the main goal is to define an application binary interface that makes no reference to the underlying parallel programming model, which can be MPI, OpenSHMEM, or GASNet. Nothing precludes the exploitation of shared-memory features by any of these, and there are MPI implementations that map each MPI rank to a thread. For example, this is the approach of MPC, and I assume that means it can exploit shared-memory hardware while also handling distributed-memory communication.

4 Likes

As additional input, here are results for the serial reference and the preferred co_sum coarray version, run on a system with Intel Broadwell processors using the Cray/HPE compiler:

ftn pi_monte_carlo_serial.f90
srun -n1 -CBW28 time ./a.out
srun: job 2786931 queued and waiting for resources
srun: job 2786931 has been allocated resources

4 * 785408025 / 1000000000
Pi ~ 3.141632100000000
10.34user 0.00system 0:10.36elapsed 99%CPU (0avgtext+0avgdata 3928maxresident)k
0inputs+0outputs (1major+612minor)pagefaults 0swaps

ftn pi_monte_carlo_co_sum.f90
srun -n4 -CBW28 time ./a.out
srun: job 2786933 queued and waiting for resources
srun: job 2786933 has been allocated resources
2/ 4 images
I will compute 250000000 points
3/ 4 images
I will compute 250000000 points
4/ 4 images
I will compute 250000000 points
1/ 4 images
I will compute 250000000 points

4 * 785401357 / 1000000000
Pi ~ 3.141605428000000
3.88user 0.01system 0:03.95elapsed 98%CPU (0avgtext+0avgdata 6212maxresident)k
224inputs+0outputs (2major+952minor)pagefaults 0swaps
3.88user 0.02system 0:03.95elapsed 98%CPU (0avgtext+0avgdata 5928maxresident)k
94inputs+0outputs (3major+955minor)pagefaults 0swaps
3.88user 0.01system 0:03.95elapsed 98%CPU (0avgtext+0avgdata 6164maxresident)k
192inputs+0outputs (3major+935minor)pagefaults 0swaps
3.88user 0.01system 0:03.95elapsed 98%CPU (0avgtext+0avgdata 6024maxresident)k
122inputs+0outputs (4major+958minor)pagefaults 0swaps

I would have included the timing in the body of the test using system_clock rather than relying on the time command.
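
For instance, a minimal sketch of in-program wall-clock timing with system_clock (my own illustration, not the actual benchmark code):

program timed_section
   implicit none
   integer(8) :: t0, t1, rate
   integer :: i
   real :: x

   call system_clock(t0, rate)
   do i = 1, 100000000                  ! the workload to be timed
      call random_number(x)
   end do
   call system_clock(t1)
   print '(a, f8.3, a)', 'Elapsed: ', real(t1 - t0) / real(rate), ' s'
end program timed_section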

2 Likes

Interesting discussion. I’ve only played a little with coarrays, so I’m still way behind on things like the collective communication features. A few questions, though.

  1. I’ve always assumed that most recent MPI implementations use shared-memory copies for all cores/processors on a shared-memory node and only use the interconnect/sockets (in distributed-memory programs) for remote processors. Is that really true? Do the MPI implementations underneath the coarray API for OpenCoarrays and Intel take advantage of this?

  2. Intel compilers appear to support (according to the man pages) both shared (-coarray=shared) and distributed (-coarray=distributed) modes. I’m not sure what difference, if any, you would see using the distributed mode on a multi-core processor. Does anyone have any info on the performance difference between the two modes?

  3. The primary applications I would use coarrays for are Finite Volume solvers, where you need to exchange a halo of “ghost cells” between internal mesh partitions (aka communication surfaces). I’m not sure how you would do that with just collective communications, but my only experience with collectives is MPI’s Reduce and Allreduce functions. I think this would be one case where you would have to use coarrays. True? (See the sketch after this list.)
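
For illustration, here is a minimal 1-D coarray halo exchange (my own sketch, not code from this thread), where each image owns n interior cells plus two ghost cells:

program halo_exchange
   implicit none
   integer, parameter :: n = 8
   real :: u(0:n+1)[*]          ! u(0) and u(n+1) are the ghost cells
   integer :: me, left, right

   me    = this_image()
   left  = me - 1
   right = me + 1

   u = 0.0
   u(1:n) = real(me)            ! fill the interior with a recognizable value

   sync all                     ! neighbors must have written their interiors
   if (left >= 1)             u(0)   = u(n)[left]    ! right edge of left neighbor
   if (right <= num_images()) u(n+1) = u(1)[right]   ! left edge of right neighbor
   sync all

   print '(a,i0,a,2f6.1)', 'Image ', me, ' ghosts: ', u(0), u(n+1)
end program halo_exchange

In a real Finite Volume code the same pattern applies per partition face; the collectives (co_sum etc.) are complementary, e.g. for global residuals, rather than a replacement for this kind of neighbor exchange.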

2 Likes

I wonder what the ‘minimum’ requirements are for a Fortran program to create multiple images on start, apart from using appropriate compiler options and libraries? Defining a coarray surely is one. Would using collective subroutines (without coarrays) be enough? Or maybe even just compiler options/libraries?
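
For instance, the minimal program in question could look like this (my own example): it declares no coarray at all and uses only a collective subroutine. Whether this alone is enough to trigger multi-image startup is exactly the question.

program images_only
   implicit none
   integer :: n
   n = 1
   call co_sum(n)               ! collective subroutine, but no coarray declared
   if (this_image() == 1) print *, 'Number of images:', n
end program images_only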

Two more technical questions:

  1. Are there any firm plans for gfortran to support coarrays by itself, w/o help of OpenCoarrays?

  2. Has anybody seen precompiled OpenCoarrays packages for RHEL 8.x (and/or its free clones)? Fedora used to have one, but I believe it was dropped after Fedora 31.

1 Like

It is a bit surprising that the random number generator in ifort seems much slower than that in gfortran… I wonder what algorithms are used in those two cases (IIRC some of the following ones are used in gfortran?)

1 Like

Thanks for that link. I have found a Fortran version (public domain):

I could try to use it instead of call random_number(x).

The same author also has an RNG Fortran repository (GPL v3) using the same algorithm:

1 Like

Stdlib already has a xoshiro256 generator, but (currently) it can’t be used in a multithread setting: stdlib/stdlib_stats_distribution_PRNG.fypp at master · fortran-lang/stdlib · GitHub

You can follow the blog post of Jason Blevins (Parallel Computing in Fortran with OpenMP) to create a parallel version.
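
The core idea there is to give each thread its own generator state. A rough sketch of that approach with the intrinsic RNG (my own seeding scheme, not Blevins’ exact code):

program parallel_rng
   use omp_lib
   implicit none
   integer :: nseed
   integer, allocatable :: seed(:)
   real :: x

   !$omp parallel private(nseed, seed, x)
   call random_seed(size=nseed)
   allocate(seed(nseed))
   seed = 42 + 37 * omp_get_thread_num()   ! a distinct seed on each thread
   call random_seed(put=seed)
   call random_number(x)
   print *, 'Thread', omp_get_thread_num(), 'drew', x
   deallocate(seed)
   !$omp end parallel
end program parallel_rng

Note this relies on the RNG state being per-thread, which is the case in gfortran but, as discussed below, not in ifort.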

3 Likes

I have updated the programs with system_clock(); here are the results (means over five runs):

Version          gfortran   ifort    ifx
Serial           20.1 s     35.9 s   34.9 s
OpenMP            9.9 s     92.0 s   97.1 s
Coarrays         14.4 s     13.9 s   n/a
Coarrays steady  31.5 s     35.1 s   n/a
Co_sum           11.0 s     13.8 s   n/a
Co_sum steady    15.4 s     16.5 s   n/a

With the coarray versions, system_clock() can yield slightly better results, especially for gfortran, which takes quite a while to launch the images (sometimes nearly 2 seconds before the first prints at the beginning of the program). With longer computations, this would not matter.

I have added some results with ifx, but note that ifx does not yet support the -coarray option.

Is there an advantage to using system_clock() over cpu_time()?

@pcosta
I tried cpu_time() but, if I remember correctly, with OpenMP it returned the total CPU time (the sum of the CPU times of all threads), like the second line returned by the time command. I don’t remember if I tried it with coarrays.

Noted! I see the issue here:

There is also omp_get_wtime(), but I guess it is best to use system_clock(), which is part of the Fortran standard.
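
A small demonstration of the difference (my own example with an arbitrary workload): cpu_time() accumulates CPU time over all OpenMP threads, while system_clock() measures wall-clock time.

program wall_vs_cpu
   implicit none
   integer(8) :: t0, t1, rate
   real :: c0, c1, s, x
   integer :: i

   call cpu_time(c0)
   call system_clock(t0, rate)
   s = 0.0
   !$omp parallel do reduction(+:s) private(x)
   do i = 1, 100000000
      x = real(i)
      s = s + sqrt(x)                   ! arbitrary workload
   end do
   !$omp end parallel do
   call system_clock(t1)
   call cpu_time(c1)

   print *, 'cpu_time:    ', c1 - c0, 's (summed over threads)'
   print *, 'system_clock:', real(t1 - t0) / real(rate), 's (wall clock)'
end program wall_vs_cpu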

1 Like

Note that @jerryd suggested the -static-libgfortran option: unlike -static, you can use it even with OpenMP and coarrays.

I have not included it in my tests, to keep things simple, but with this specific Pi problem it gains you about 0.3 s (roughly 3%) on the OpenMP and co_sum versions.

I tried it with a real physical model but gained nothing. So try it and see the effect on your own programs…

I don’t know if there is an equivalent with ifort.
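
For reference, the flag is used like any other gfortran option (file name assumed here for illustration):

gfortran -O3 -fopenmp -static-libgfortran pi_monte_carlo_openmp.f90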

ifort uses L’Ecuyer 1991. I doubt the algorithm itself is the problem.

2 Likes

Thanks very much for your info :slight_smile: I’ve just searched for “L’Ecuyer” and I guess the algorithm may be related to this article (judging by the author’s name):

Efficient and portable combined Tausworthe random number generators
https://dl.acm.org/doi/10.1145/116890.116892

Another question about ifort: the performance of the threaded (OpenMP) version seems not very good because of an “exclusive lock” (according to the following comment; is that something like an “atomic” operation?):

The ifort RANDOM_NUMBER is known to use an exclusive lock in threaded applications, reducing performance. It might not be the best choice if you’re comparing performance.

So I feel that ifort is not trying to maximize the performance of the builtin random_number(). Is this possibly because MKL provides a set of highly optimized random number generators instead (so users are advised to use those for heavier, more computationally intensive calculations)?

1 Like

Alan Miller has Fortran implementations of three of L’Ecuyer’s RNGs.

1 Like

No, though “serious” statistical applications will likely want to use an RNG whose distribution and quality are specified. A compiler’s implementation of RANDOM_NUMBER is “good enough” for a lot of purposes, but if you’re serious about RNGs you would probably want to look elsewhere. That MKL includes RNGs was not a consideration; we simply wanted a known-good algorithm, and this was one of the best at the time, with good uniformity and a long period, without being overly complex to implement.

I know gfortran has used at least two different algorithms, I think Mersenne Twister is current.

1 Like

My first tests with Xoroshiro128+ instead of random_number() show that with OpenMP the duration is now ~4.4 s with gfortran and ~8.4 s with ifort (instead of 9.9 s and 92.0 s). But still ~43 s with ifx.
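
For reference, here is a sketch of the xoroshiro128+ step (the published 2016 variant with rotation constants 55/14/36); it assumes wraparound int64 addition, as most Fortran ports do, and is my own illustration rather than the exact code used in these benchmarks:

module xoroshiro128plus
   use iso_fortran_env, only: int64, real64
   implicit none
   ! Global state for brevity; with OpenMP each thread needs its own s0, s1.
   integer(int64) :: s0 = 123456789_int64, s1 = 987654321_int64   ! must not both be zero

contains

   function next_real() result(x)
      real(real64) :: x
      integer(int64) :: r, t
      r  = s0 + s1                                 ! output word (wraps on overflow)
      t  = ieor(s1, s0)
      s0 = ieor(ieor(ishftc(s0, 55), t), ishft(t, 14))
      s1 = ishftc(t, 36)
      x  = real(ishft(r, -11), real64) * 2.0_real64**(-53)   ! top 53 bits -> [0, 1)
   end function next_real

end module xoroshiro128plus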

I will report full results in a few days.

2 Likes

@vmagnin and anyone wanting to see simple examples with coarrays, here’s one that can be used to alert some of the techie news writers such as Lee Phillips and Liam Tung, “Simple summation 6x faster with a 70+ year-old language than MIT’s recent effort in Julia”!!:
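
In that spirit, a minimal coarray summation of the kind meant here (my own sketch, not the code linked above): each image sums its chunk of the series, then co_sum reduces across images.

program coarray_sum
   implicit none
   integer, parameter :: n = 100000000
   integer :: i, me, nimg, first, last
   real(8) :: s

   me   = this_image()
   nimg = num_images()
   first = (me - 1) * (n / nimg) + 1
   last  = merge(n, me * (n / nimg), me == nimg)   ! last image takes the remainder

   s = 0.0d0
   do i = first, last
      s = s + 1.0d0 / real(i, 8)**2                ! partial Basel sum on this image
   end do

   call co_sum(s, result_image=1)
   if (me == 1) print *, 'Sum =', s, '  pi**2/6 =', acos(-1.0d0)**2 / 6.0d0
end program coarray_sum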

2 Likes

@sblionel Thanks again for the additional explanation; I think I now understand random_number() in ifort a bit better. I agree that external libraries are often better for “serious” calculations, but my concern is that many “benchmarks” on the net (by various people) use the builtin random_number() and regard it as a kind of measure of the “language’s” performance… This is of course misleading because of the different algorithms used internally, but on the other hand, I guess it has some meaning for users because it is the “default” facility available out of the box.

@vmagnin Thanks very much for testing Xoroshiro. Because I would like to update my (local) random number library, I will also try some codes later. By the way, it might be better to use only 2 threads (for OpenMP) or 2 images/processes (for coarrays or MPI) for comparison purposes if the CPU has 2 physical cores. The “ideal” result would then be a ~2x speedup, making comparison easier than with 4 threads or processes, whose results may be complicated for other reasons (empirically, with my machine and code, the speed gain is small once the number of threads exceeds the number of physical cores).

1 Like

I have updated https://github.com/vmagnin/exploring_coarrays with the xoroshiro128+ RNG (the previous version of the project is now in the “random_number” branch).

Overall, it has greatly improved performance and fixed the ifort problem with the OpenMP version (though not so much for ifx).

I have added a benchmark.sh script that automatically launches each version 10 times and computes the mean times.

Results

Intel(R) Core™ i7-5500U CPU @ 2.40GHz, under Ubuntu 20.10
Optimization flag: -O3

CPU time in seconds with 2 images/threads (except of course Serial):

Version          gfortran   ifort   ifx
Serial           10.77      18.77   14.66
OpenMP            5.75       9.32   60.30
Coarrays         13.21       9.79   n/a
Coarrays steady  21.80      27.83   n/a
Co_sum            5.58       9.98   n/a
Co_sum steady     9.18      12.71   n/a

With 4 images/threads (except of course Serial):

Version          gfortran   ifort   ifx
Serial           10.77      18.77   14.66
OpenMP            4.36       8.42   43.21
Coarrays          9.47       9.12   n/a
Coarrays steady  19.41      24.78   n/a
Co_sum            4.16       9.29   n/a
Co_sum steady     8.18      10.94   n/a

Further optimization

With gfortran, the -flto (standard link-time optimizer) compilation option has a strong effect on this algorithm: for example, with the co_sum version the CPU time with 4 images falls from 4.16 s to 2.38 s!
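
For example, a build along these lines (assuming the OpenCoarrays caf/cafrun wrappers; the exact flags used in the benchmark may differ):

caf -O3 -flto pi_monte_carlo_co_sum.f90 -o pi_co_sum
cafrun -n 4 ./pi_co_sum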

5 Likes