Note that @jerryd suggested the -static-libgfortran option: unlike -static, you can use it even with OpenMP and coarrays.
I have not included it in my tests, to keep things simple, but with this specific Pi problem you can gain 0.3 s (about 3%) on the OpenMP and co_sum versions.
I tried it with a true physical model but gained nothing, so you have to try it and see the effect on your own programs.
I don’t know if there is an equivalent with ifort.
Thanks very much for the info. I’ve just searched for “L’Ecuyer”, and I guess the algorithm may be related to this article (judging by the authors’ names):
Efficient and portable combined Tausworthe random number generators
My other question about ifort is that the performance of the threaded (OpenMP) version does not seem very good because of the “exclusive lock” mentioned in the following comment (is it something like an “atomic” operation?):
The ifort RANDOM_NUMBER is known to use an exclusive lock in threaded applications, reducing performance. It might not be the best choice if you’re comparing performance.
So I feel that ifort is not trying to maximize the performance of builtin random_number(). Is this possibly because MKL provides a set of highly optimized random number generators instead (and so the users are advised to use them for more “heavy” or computationally intensive calculations)?
No, though “serious” statistical applications will likely want to use an RNG whose distribution and quality are specified. A compiler’s implementation of RANDOM_NUMBER is “good enough” for a lot of purposes, but if you’re serious about RNGs you would probably want to look elsewhere. That MKL includes RNGs was not a consideration: we simply wanted a known-good algorithm, and this was one of the best at the time, with good uniformity and a long period, without being overly complex to implement.
I know gfortran has used at least two different algorithms; I think Mersenne Twister is current.
My first tests with Xoroshiro128+ instead of random_number() show that with OpenMP the duration is now ~4.4 s with gfortran and ~8.4 s with ifort (instead of 9.9 s and 92.0 s). But still ~43 s with ifx.
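To make the idea concrete, here is a minimal sketch of a Monte Carlo Pi loop where each OpenMP thread owns a private xoroshiro128+ state, so no lock is shared between threads. It is not the code used for the timings above: the module name `xoroshiro_sketch`, the type `rng_state`, the routines `rng_seed` and `rng_next`, and the simplistic per-thread seeding are all assumptions made for illustration. It also assumes that signed 64-bit addition wraps around on overflow, which the Fortran standard does not guarantee.

```fortran
! Sketch only: xoroshiro128+ (rotation/shift constants 24, 16, 37) with one
! private state per OpenMP thread. All names below are hypothetical.
module xoroshiro_sketch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  private
  public :: rng_state, rng_seed, rng_next

  type :: rng_state
    integer(int64) :: s0 = 1_int64
    integer(int64) :: s1 = 2_int64
  end type rng_state

contains

  ! Simplistic seeding (an assumption): distinct nonzero states per id,
  ! followed by a short warm-up. It does not guarantee non-overlapping
  ! streams; a jump function would be the rigorous way to get that.
  subroutine rng_seed(state, id)
    type(rng_state), intent(out) :: state
    integer, intent(in) :: id
    integer :: i
    real(real64) :: dummy
    state%s0 = 123456789_int64 + int(id, int64)
    state%s1 = 362436069_int64 * int(id + 1, int64)
    do i = 1, 64
      dummy = rng_next(state)
    end do
  end subroutine rng_seed

  ! Returns a uniform number in [0, 1) and advances the state.
  function rng_next(state) result(u)
    type(rng_state), intent(inout) :: state
    real(real64) :: u
    integer(int64) :: s0, s1, r
    s0 = state%s0
    s1 = state%s1
    r = s0 + s1                      ! assumed to wrap on overflow
    s1 = ieor(s1, s0)
    state%s0 = ieor(ieor(ishftc(s0, 24), s1), shiftl(s1, 16))
    state%s1 = ishftc(s1, 37)
    ! Keep the top 53 bits and scale by 2**(-53).
    u = real(shiftr(r, 11), real64) * 2.0_real64**(-53)
  end function rng_next

end module xoroshiro_sketch

program pi_openmp_sketch
  !$ use omp_lib, only: omp_get_thread_num
  use, intrinsic :: iso_fortran_env, only: int64, real64
  use xoroshiro_sketch
  implicit none
  integer(int64), parameter :: n = 10000000_int64
  integer(int64) :: i, hits
  integer :: id
  real(real64) :: x, y
  type(rng_state) :: state

  hits = 0
  !$omp parallel private(state, x, y, id)
  id = 0
  !$ id = omp_get_thread_num()
  call rng_seed(state, id)           ! each thread seeds its own state
  !$omp do reduction(+:hits)
  do i = 1, n
    x = rng_next(state)
    y = rng_next(state)
    if (x*x + y*y < 1.0_real64) hits = hits + 1
  end do
  !$omp end do
  !$omp end parallel

  print '(a, f10.6)', 'Pi estimate: ', 4.0_real64 * real(hits, real64) / real(n, real64)
end program pi_openmp_sketch
```

With gfortran this can be built with `gfortran -O3 -fopenmp`; without `-fopenmp` the `!$` lines are treated as comments and the program runs serially. Because the generator state is private to each thread, there is nothing for the threads to contend on, which is the point of replacing a locked random_number() in this kind of loop.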
@vmagnin and anyone wanting to see simple examples with coarrays, here’s one that can be used to alert some of the techie news writers such as Lee Phillips and Liam Tung, “Simple summation 6x faster with a 70+ year-old language than MIT’s recent effort in Julia”!!:
@sblionel Thanks again for the additional explanation; I think I now understand more about random_number() in ifort. I agree that external libraries are often better for “serious” calculations, but my concern is that many “benchmarks” on the net (by various people) use the built-in random_number() and regard it as a kind of measure of the “language’s” performance. This is of course wrong, because the algorithms used internally differ, but on the other hand I guess it has some meaning for users because it is the “default” thing available out of the box.
@vmagnin Thanks a lot for testing Xoroshiro. Because I would like to update my (local) library for random numbers, I will also try some codes later. By the way, I guess it might be better to use only 2 threads (for OpenMP) or 2 processes (MPI or coarrays) for comparison purposes if the CPU has 2 physical cores. The “ideal” result would then be a ~2x speedup, which is easier to compare against than results with 4 threads or processes, which may be complicated for various reasons: empirically, the speed gain is small on my machine and code once N(threads) > N(physical cores).
Overall, switching to Xoroshiro128+ has greatly improved performance and has fixed the ifort problem with the OpenMP version (but not so much for ifx).
I have added a benchmark.sh script that automatically runs every version 10 times and computes the mean times.
Results
Intel(R) Core™ i7-5500U CPU @ 2.40GHz, under Ubuntu 20.10
Optimization flag: -O3
CPU time in seconds with 2 images/threads (except of course Serial):
| Version | gfortran | ifort | ifx |
|---|---|---|---|
| Serial | 10.77 | 18.77 | 14.66 |
| OpenMP | 5.75 | 9.32 | 60.30 |
| Coarrays | 13.21 | 9.79 | |
| Coarrays steady | 21.80 | 27.83 | |
| Co_sum | 5.58 | 9.98 | |
| Co_sum steady | 9.18 | 12.71 | |
With 4 images/threads (except of course Serial):
| Version | gfortran | ifort | ifx |
|---|---|---|---|
| Serial | 10.77 | 18.77 | 14.66 |
| OpenMP | 4.36 | 8.42 | 43.21 |
| Coarrays | 9.47 | 9.12 | |
| Coarrays steady | 19.41 | 24.78 | |
| Co_sum | 4.16 | 9.29 | |
| Co_sum steady | 8.18 | 10.94 | |
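As a rough reading of these numbers, the speedup over the serial version can be taken as the ratio of the serial time to the parallel time. The symbol $S$ is introduced here for illustration, and this assumes the reported times are directly comparable. For the OpenMP rows:

$$
S = \frac{T_{\text{serial}}}{T_{\text{parallel}}}, \qquad
S_{\text{gfortran}}^{(2)} = \frac{10.77}{5.75} \approx 1.87, \quad
S_{\text{gfortran}}^{(4)} = \frac{10.77}{4.36} \approx 2.47, \quad
S_{\text{ifort}}^{(2)} = \frac{18.77}{9.32} \approx 2.01, \quad
S_{\text{ifort}}^{(4)} = \frac{18.77}{8.42} \approx 2.23
$$

So with 2 threads both compilers are close to the “ideal” 2x, while going to 4 threads on this CPU (2 physical cores, 4 hardware threads) brings only a modest extra gain.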
Further optimization
With gfortran, the -flto (standard link-time optimizer) compilation option has a strong effect on this algorithm: for example, with the co_sum version, the CPU time with 4 images falls from 4.16 s to 2.38 s!
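For readers who have not seen what a co_sum version looks like, here is a minimal sketch of the pattern: each image counts its own hits and a single collective call sums them on image 1. It is not the benchmark's actual source. The program and variable names are hypothetical, it uses the built-in random_number() for brevity, and it assumes a compiler that supports the Fortran 2018 `random_init` intrinsic to give each image a distinct random sequence.

```fortran
! Sketch only: Monte Carlo Pi with coarray images and the co_sum collective.
! Names are hypothetical; this is not the code that produced the timings.
program pi_cosum_sketch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer(int64), parameter :: n_total = 10000000_int64
  integer(int64) :: i, n_local, hits
  real(real64) :: x, y, pi_est

  ! Split the work evenly; any remainder is simply dropped in this sketch.
  n_local = n_total / num_images()

  ! Assumption: Fortran 2018 random_init is available, so that every image
  ! draws from a different sequence.
  call random_init(repeatable=.false., image_distinct=.true.)

  hits = 0
  do i = 1, n_local
    call random_number(x)
    call random_number(y)
    if (x*x + y*y < 1.0_real64) hits = hits + 1
  end do

  ! Sum the per-image counts; only image 1 receives the result.
  call co_sum(hits, result_image=1)

  if (this_image() == 1) then
    pi_est = 4.0_real64 * real(hits, real64) / real(n_local * num_images(), real64)
    print '(a, f10.6)', 'Pi estimate: ', pi_est
  end if
end program pi_cosum_sketch
```

With gfortran, a single-image build for a quick check would be something like `gfortran -O3 -flto -fcoarray=single`; running on several images needs a coarray runtime such as OpenCoarrays. The whole communication is one collective call at the end, so most of the run time is spent in the local loop, which may be why compiler optimizations such as -flto show up so clearly in this version.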