Learning coarrays, collective subroutines and other parallel features of Modern Fortran

vmagnin · May 6, 2021, 6:37pm

Note that @jerryd suggested the -static-libgfortran option: unlike -static, it can be used even with OpenMP and coarrays.

I have not included it in my tests, to keep things simple, but on that specific Pi problem it saves about 0.3 s (roughly 3%) on the OpenMP and co_sum versions.

I also tried it with a real physical model but gained nothing. So you have to try it and see the effect on your own programs.

I don’t know if there is an equivalent with ifort.

sblionel · May 6, 2021, 6:58pm

ifort uses L’Ecuyer 1991. I doubt the algorithm itself is the problem.

septc · May 7, 2021, 7:07pm

Thanks very much for the info! I’ve just searched for “L’Ecuyer”, and I guess the algorithm may be related to this article (judging by the author’s name):

Efficient and portable combined Tausworthe random number generators

My other question about ifort is that the performance of the threaded (OpenMP) version seems rather poor because of the “exclusive lock” mentioned in the following comment (is this something like an “atomic” operation?):

The ifort RANDOM_NUMBER is known to use an exclusive lock in threaded applications, reducing performance. It might not be the best choice if you’re comparing performance.

So I feel that ifort is not trying to maximize the performance of the built-in random_number(). Is this possibly because MKL provides a set of highly optimized random number generators instead (so that users are advised to use those for more “heavy” or computationally intensive calculations)?

https://software.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/statistical-functions/random-number-generators.html#random-number-generators

Beliavsky · May 7, 2021, 7:33pm

Alan Miller has three RNGs of L’Ecuyer.

sblionel · May 7, 2021, 9:58pm

No, though “serious” statistical applications will likely want to use an RNG whose distribution and quality are specified. A compiler’s implementation of RANDOM_NUMBER is “good enough” for a lot of purposes, but if you’re serious about RNGs you would probably want to look elsewhere. That MKL includes RNGs was not a consideration: we simply wanted a known-good algorithm, and this was one of the best at the time, with good uniformity and a long period, without being overly complex to implement.

I know gfortran has used at least two different algorithms, I think Mersenne Twister is current.

vmagnin · May 8, 2021, 7:55pm

My first tests with Xoroshiro128+ instead of random_number() show that with OpenMP the duration is now ~4.4 s with gfortran and ~8.4 s with ifort (instead of 9.9 s and 92.0 s). But still ~43 s with ifx.
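For readers who have not seen it, a minimal sketch of a xoroshiro128+ generator in Fortran might look like the following. It is an illustration only, not the code used in these tests: the module and procedure names (xoroshiro128plus_sketch, rng_state, rng_seed, rng_next_real) and the seed values are made up, and the shift constants (24, 16, 37) are those of the 2018 reference version, which may differ from the variant benchmarked here. Because the state lives in a variable owned by the caller, each thread or image can keep its own copy, so no lock is involved.

```fortran
module xoroshiro128plus_sketch
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none
    private
    public :: rng_state, rng_seed, rng_next_real

    ! One state per thread or image (hypothetical type name).
    type :: rng_state
        integer(int64) :: s0 = 1_int64
        integer(int64) :: s1 = 2_int64
    end type rng_state

    integer(int64), parameter :: mask32 = 4294967295_int64     ! low 32 bits set
    real(real64),   parameter :: scale  = 2.0_real64**(-53)

contains

    ! Any state is legal except all zeros.
    subroutine rng_seed(state, a, b)
        type(rng_state), intent(out) :: state
        integer(int64),  intent(in)  :: a, b
        state%s0 = a
        state%s1 = b
        if (a == 0_int64 .and. b == 0_int64) state%s0 = 1_int64
    end subroutine rng_seed

    ! Addition modulo 2**64, done in 32-bit halves so that no signed
    ! integer overflow occurs (Fortran has no unsigned integers).
    pure function add_wrap(x, y) result(z)
        integer(int64), intent(in) :: x, y
        integer(int64) :: z, lo, hi
        lo = iand(x, mask32) + iand(y, mask32)
        hi = shiftr(x, 32) + shiftr(y, 32) + shiftr(lo, 32)
        z  = ior(shiftl(iand(hi, mask32), 32), iand(lo, mask32))
    end function add_wrap

    ! Returns a uniform real in [0, 1) and advances the state.
    function rng_next_real(state) result(u)
        type(rng_state), intent(inout) :: state
        real(real64)   :: u
        integer(int64) :: s0, s1, r
        s0 = state%s0
        s1 = state%s1
        r  = add_wrap(s0, s1)
        s1 = ieor(s1, s0)
        state%s0 = ieor(ieor(ishftc(s0, 24), s1), shiftl(s1, 16))
        state%s1 = ishftc(s1, 37)
        ! Keep the upper 53 bits and scale them to [0, 1).
        u = real(shiftr(r, 11), real64) * scale
    end function rng_next_real

end module xoroshiro128plus_sketch

program pi_sketch
    use, intrinsic :: iso_fortran_env, only: int64, real64
    use xoroshiro128plus_sketch
    implicit none
    integer, parameter :: n = 1000000          ! hypothetical sample count
    type(rng_state) :: st
    real(real64)    :: x, y
    integer         :: i, inside

    call rng_seed(st, 123456789_int64, 987654321_int64)   ! arbitrary seeds
    inside = 0
    do i = 1, n
        x = rng_next_real(st)
        y = rng_next_real(st)
        if (x*x + y*y < 1.0_real64) inside = inside + 1
    end do
    print *, "Monte Carlo estimate of pi:", 4.0_real64 * real(inside, real64) / real(n, real64)
end program pi_sketch
```

The small demo program assumes the Pi problem is a Monte Carlo estimate. In a parallel run, each thread or image would need a distinct seed (or the generator's jump function, not shown here) so that the streams do not overlap.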

I will report full results in a few days.

FortranFan · May 8, 2021, 8:07pm

@vmagnin and anyone wanting to see simple examples with coarrays, here’s one that can be used to alert some of the techie news writers such as Lee Phillips and Liam Tung, “Simple summation 6x faster with a 70+ year-old language than MIT’s recent effort in Julia”!!:
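For readers new to collective subroutines, a minimal sketch of a summation distributed over images with co_sum might look like the following. It is not the example FortranFan refers to: the names co_sum_sketch and distributed_sum and the value of n are made up for illustration, and the quantity being summed (the integers 1 to n) is an assumption chosen because the result is easy to check.

```fortran
module co_sum_sketch
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
contains

    ! Sum of the integers 1..n, with the work split across images.
    ! Must be called by all images, since co_sum is a collective.
    function distributed_sum(n) result(total)
        integer(int64), intent(in) :: n
        integer(int64) :: total, i, first, step

        first = int(this_image(), int64)
        step  = int(num_images(), int64)

        ! Image k handles k, k+P, k+2P, ... where P is the number of images.
        total = 0_int64
        do i = first, n, step
            total = total + i
        end do

        ! Collective reduction: without result_image, every image
        ! ends up with the sum of all partial totals.
        call co_sum(total)
    end function distributed_sum

end module co_sum_sketch

program co_sum_demo
    use, intrinsic :: iso_fortran_env, only: int64
    use co_sum_sketch
    implicit none
    integer(int64), parameter :: n = 1000000_int64   ! hypothetical problem size
    integer(int64) :: s

    s = distributed_sum(n)
    if (this_image() == 1) then
        print *, "computed:", s, " closed form:", n*(n + 1_int64)/2_int64
    end if
end program co_sum_demo
```

With gfortran this needs a coarray option to compile, for example -fcoarray=single for a one-image test build, or the caf and cafrun wrappers from OpenCoarrays for several images; with ifort the corresponding option is -coarray.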

septc · May 9, 2021, 3:42am

@sblionel Thanks again for the additional explanation. I think I now understand a bit more about random_number() in ifort. I really agree that external libraries are often better for “serious” calculations, but my concern is that many “benchmarks” on the net (by various people) use the built-in random_number() and regard it as a kind of measure of the “language’s” performance. This is of course wrong, because different algorithms are used internally, but on the other hand I guess it has some meaning for users, because it is the “default” thing available out of the box.

@vmagnin Thanks much for testing Xoroshiro. Because I would like to update my (local) library for random numbers, I will also try some codes later. By the way, if the CPU has 2 physical cores, I guess it might be better to use only 2 threads (for OpenMP) or 2 processes (MPI or coarrays) for comparison purposes. The “ideal” result would then be a ~2x speedup, which makes the comparison easier than with 4 threads or processes, where the results may be complicated for various reasons. Empirically, the speed gain is small on my machine and code once N(threads) > N(physical cores).

vmagnin · May 11, 2021, 8:10am

I have updated https://github.com/vmagnin/exploring_coarrays with the xoroshiro128+ RNG (the previous version of the project is now in the “random_number” branch).

Overall, it has greatly improved performance and has fixed the ifort problem with the OpenMP version (but much less so for ifx).

I have added a benchmark.sh script that automatically runs every version 10 times and computes the mean times.

Results

Intel(R) Core™ i7-5500U CPU @ 2.40GHz, under Ubuntu 20.10
Optimization flag: -O3

CPU time in seconds with 2 images/threads (except of course Serial):

Version	gfortran	ifort	ifx
Serial	10.77	18.77	14.66
OpenMP	5.75	9.32	60.30
Coarrays	13.21	9.79
Coarrays steady	21.80	27.83
Co_sum	5.58	9.98
Co_sum steady	9.18	12.71

With 4 images/threads (except of course Serial):

Version	gfortran	ifort	ifx
Serial	10.77	18.77	14.66
OpenMP	4.36	8.42	43.21
Coarrays	9.47	9.12
Coarrays steady	19.41	24.78
Co_sum	4.16	9.29
Co_sum steady	8.18	10.94
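To relate these numbers to the "ideal ~2x" discussed earlier in the thread, the speedup over the serial version can be computed as the serial time divided by the parallel time. The sketch below does this for the gfortran OpenMP and Co_sum rows of the two tables above; the module and function names are made up, and it assumes the reported times can be compared like elapsed times.

```fortran
module speedup_sketch
    implicit none
contains

    ! Speedup relative to the serial run: values above 1 mean faster.
    pure function speedup(t_serial, t_parallel) result(s)
        real, intent(in) :: t_serial, t_parallel
        real :: s
        s = t_serial / t_parallel
    end function speedup

end module speedup_sketch

program speedup_demo
    use speedup_sketch
    implicit none
    ! gfortran timings in seconds, copied from the tables above.
    real, parameter :: serial = 10.77
    real, parameter :: t2(2)  = [5.75, 5.58]   ! OpenMP, Co_sum with 2 threads/images
    real, parameter :: t4(2)  = [4.36, 4.16]   ! OpenMP, Co_sum with 4 threads/images
    character(len=6), parameter :: names(2) = ["OpenMP", "Co_sum"]
    integer :: k

    do k = 1, 2
        print '(a, a, f5.2, a, f5.2)', names(k), &
            "  speedup with 2: ", speedup(serial, t2(k)), &
            "   with 4: ", speedup(serial, t4(k))
    end do
end program speedup_demo
```

On these figures the gfortran versions come out a little under 2x with two threads or images and roughly 2.5x with four, which is consistent with a CPU that has 2 physical cores and 4 hardware threads.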

Further optimization

With gfortran, the -flto (standard link-time optimizer) compilation option has a strong effect on this algorithm: for example, with the co_sum version the CPU time with 4 images falls from 4.16 s to 2.38 s!
