I am currently learning coarrays and would like to share my first experiments:
In the repository, you will find a very simple algorithm computing an approximation of $\pi$ using a Monte Carlo method (a very inefficient way to compute $\pi$, but a very efficient way to burn CPU!), in several versions:
- A serial version of the algorithm.
- A parallel version using OpenMP.
- A parallel version using coarrays.
- Another coarray version that steadily prints intermediate results.
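For readers who have not seen a coarray yet, here is a minimal sketch of what such a coarray version can look like. It is not the code from the repository: the program name, the sample count `n_per_image` and the variable names are all made up for illustration, and it assumes a compiler that supports the Fortran 2018 `random_init` intrinsic.

```fortran
program pi_coarray_sketch
    ! Illustrative sketch only, not the repository code.
    use, intrinsic :: iso_fortran_env, only: dp => real64, int64
    implicit none
    integer(int64), parameter :: n_per_image = 10000000_int64  ! hypothetical sample count
    integer(int64) :: inside[*]      ! coarray: one counter per image
    integer(int64) :: i, total
    integer :: img
    real(dp) :: x, y

    ! Fortran 2018: ask for a different random sequence on each image.
    call random_init(repeatable=.false., image_distinct=.true.)

    ! Count the points falling inside the quarter of the unit disk.
    inside = 0
    do i = 1, n_per_image
        call random_number(x)
        call random_number(y)
        if (x*x + y*y < 1.0_dp) inside = inside + 1
    end do

    ! Every image must have finished before image 1 reads the remote counters.
    sync all

    if (this_image() == 1) then
        total = 0
        do img = 1, num_images()
            total = total + inside[img]   ! remote read from image img
        end do
        print *, "images:", num_images(), " pi ~", &
            4.0_dp * real(total, dp) / real(n_per_image * num_images(), dp)
    end if
end program pi_coarray_sketch
```

The only coarray-specific parts are the `[*]` declaration, the `sync all` barrier and the `inside[img]` remote reads; everything else is the serial algorithm.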
My first benchmark on a 2-core / 4-thread CPU yields:
| Version | gfortran | ifort |
|---|---|---|
| Serial | 19.9 s | 34.8 s |
| OpenMP | 9.9 s | 93.0 s |
| Coarrays | 16.2 s | 14.4 s |
| Coarrays steady | 33.2 s | 35.9 s |
First, concerning the gfortran results:
I am surprised by the difference between OpenMP and coarrays (in both cases there were 4 a.out executables running).
And by the effect of steadily printing intermediate results (just 20 times).
Concerning ifort, which I am not familiar with:
I don’t understand why the results are so bad with the serial version while they are a little better than gfortran with coarrays.
And when I use ifort with -qopenmp, I see 4 a.out executables but using only 45% of the CPU. And the results are catastrophic.
Any help and comments welcome!
And I hope this post and that repository will help other people interested in learning coarrays.
With OpenMP, there should be only one process, with 4 threads.
By default, top only shows processes. However, it can display the individual threads with the -H option:
```
top -H
```

```
-H  :Threads-mode operation
    Instructs top to display individual threads. Without this
    command-line option a summation of all threads in each
    process is shown. Later this can be changed with the `H'
    interactive command.
```
I observe something similar, with the 4 threads (thanks @jeremie.vandenplas for the -H tip!) only at 45%. It seems related to using the RANDOM_NUMBER intrinsic inside a DO loop:
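A minimal sketch of the kind of loop in question is given here. It is not the code from the repository, just an illustration of `RANDOM_NUMBER` being called from every thread inside an OpenMP `DO` loop; the program name and the sample count `n` are hypothetical.

```fortran
program omp_random_loop
    ! Illustrative sketch only: RANDOM_NUMBER called from all threads in a parallel loop.
    use, intrinsic :: iso_fortran_env, only: dp => real64, int64
    implicit none
    integer(int64), parameter :: n = 10000000_int64   ! hypothetical sample count
    integer(int64) :: i, inside
    real(dp) :: x, y

    inside = 0
    !$omp parallel do private(x, y) reduction(+:inside)
    do i = 1, n
        ! Every thread goes through the intrinsic generator here. If the runtime
        ! serializes these calls, the threads spend their time waiting on each other.
        call random_number(x)
        call random_number(y)
        if (x*x + y*y < 1.0_dp) inside = inside + 1
    end do
    !$omp end parallel do

    print *, "pi ~", 4.0_dp * real(inside, dp) / real(n, dp)
end program omp_random_loop
```

Built with `gfortran -fopenmp` or `ifort -qopenmp`, a loop like this does almost nothing except call the generator, so any contention inside `RANDOM_NUMBER` dominates the timing.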
Which OS platform do you use? I could not find this info. I guess the reporting of threads (in ps, top, etc.) was (and may still be) OS-dependent, and in the past the threads could even get their own PIDs.
Also, I understand you are using Hyper-Threading. This may affect the results, as a hyper-thread is not a fully effective core.
I was able to get an almost 4x speedup with coarrays using Intel Fortran on Windows, using the example from their tutorial that calculates pi using Monte Carlo. Link
Those guys compute pi by integrating $(1-x^2)^{-1/2}$ over the interval $[-1, 1]$ and report a 5x advantage for ifort 18.1 over gfortran 7.2 (yes, quite an old version).
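For reference, that integral does evaluate to pi, since the antiderivative of the integrand is $\arcsin x$:

$$
\int_{-1}^{1} \frac{dx}{\sqrt{1-x^2}} = \Big[\arcsin x\Big]_{-1}^{1} = \frac{\pi}{2} - \left(-\frac{\pi}{2}\right) = \pi
$$

Note that the integrand diverges at both endpoints, so simple quadrature rules converge slowly on it.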
```fortran
program pi_bbp
    ! Bailey-Borwein-Plouffe formula for pi, from "The BBP Algorithm for Pi",
    ! by David Bailey https://www.davidhbailey.com/dhbpapers/bbp-alg.pdf
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer :: k
    real(kind=dp) :: xk, pi
    real(kind=dp), parameter :: x16 = 1/16.0_dp
    pi = 0.0_dp
    do k = 0, 10
        xk = real(k, kind=dp)
        pi = pi + x16**k * (4/(8*xk+1) - 2/(8*xk+4) - 1/(8*xk+5) - 1/(8*xk+6))
        write (*,*) k, pi
    end do
    write (*,*) -1, 4*atan(1.0_dp), "true"
end program pi_bbp
```
As the author of the Intel tutorial, I want to point out that this was only to introduce coarrays in an accessible manner, not as a recommendation for how to compute pi! The method shown is actually a horrible way to do it and is highly dependent on how good the random number generator is.
Note that you don’t actually need a coarray for this exercise. In fact, in my experience, you don’t need coarrays very often, as the collective subroutines can handle most of the communication you’ll need to do. I’ve submitted a PR to your repo @vmagnin to demonstrate. Of course, then you don’t get to actually play with a coarray.
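To give an idea of what the collective approach looks like, here is a minimal sketch using `co_sum`. It is not the content of the PR: the program name, the sample count `n_total` and the variable names are hypothetical, and it assumes a compiler with Fortran 2018 collectives and `random_init`.

```fortran
program pi_co_sum_sketch
    ! Illustrative sketch only: no coarray variable is declared anywhere.
    use, intrinsic :: iso_fortran_env, only: dp => real64, int64
    implicit none
    integer(int64), parameter :: n_total = 80000000_int64   ! hypothetical sample count
    integer(int64) :: n_local, inside, i
    real(dp) :: x, y

    ! Fortran 2018: ask for a different random sequence on each image.
    call random_init(repeatable=.false., image_distinct=.true.)

    ! Each image takes an equal share; any remainder is simply dropped in this sketch.
    n_local = n_total / num_images()

    inside = 0
    do i = 1, n_local
        call random_number(x)
        call random_number(y)
        if (x*x + y*y < 1.0_dp) inside = inside + 1
    end do

    ! Collective sum of the per-image counters; only image 1 receives the result.
    call co_sum(inside, result_image=1)

    if (this_image() == 1) then
        print *, "pi ~", 4.0_dp * real(inside, dp) / real(n_local * num_images(), dp)
    end if
end program pi_co_sum_sketch
```

The collective call replaces both the `sync all` and the loop over remote counters that an explicit coarray version needs, which is why the coarray itself disappears.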
With my changes on my Intel i5 machine with 8 threads I get the following results.
| Version | gfortran | ifort |
|---|---|---|
| Serial | 15.57s user 0.00s system 99% cpu 15.571 total | 25.63s user 0.00s system 99% cpu 25.639 total |
| OpenMP | 44.55s user 0.00s system 797% cpu 5.590 total | 51.65s user 2.10s system 161% cpu 33.380 total |
| Coarrays | 50.29s user 0.28s system 751% cpu 6.731 total | 53.01s user 0.32s system 752% cpu 7.084 total |
| Coarrays steady | cafrun -n 8 ./a.out 60.71s user 0.25s system 754% cpu 8.076 total | 64.57s user 0.44s system 760% cpu 8.553 total |
I don’t understand what I’m doing wrong with ifort and OpenMP either.
The ifort RANDOM_NUMBER is known to use an exclusive lock in threaded applications, reducing performance. It might not be the best choice if you’re comparing performance.
Thanks for that information. So I think I will try a classical linear congruential generator such as $X_{n+1} = (a \cdot X_n + c) \bmod m$, with $n \in \mathbb{N}$, $a = 16807$, $c = 0$, $m = 2^{31} - 1$ and $X_0 = 123$.
I don’t know the quality of that pseudo-random generator, but I don’t care since my objective is not to compute $\pi$ but to burn my CPU and play with coarrays and other parallel features.
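A minimal sketch of that generator, assuming 64-bit integers are used for the intermediate product so that it cannot overflow. The program and procedure names are hypothetical. Because the state is an ordinary local variable, each thread or image can keep its own copy (with its own seed) and no lock is involved.

```fortran
program lcg_sketch
    ! Illustrative sketch of X_{n+1} = (a*X_n + c) mod m with the constants from the post.
    use, intrinsic :: iso_fortran_env, only: int64, dp => real64
    implicit none
    integer(int64), parameter :: a = 16807_int64
    integer(int64), parameter :: c = 0_int64
    integer(int64), parameter :: m = 2147483647_int64   ! 2**31 - 1
    integer(int64) :: state
    real(dp) :: u
    integer :: i

    state = 123_int64                ! X_0 from the post
    do i = 1, 5
        call lcg_next(state, u)
        print *, state, u
    end do

    ! Sanity check against the test value published by Park and Miller for these
    ! constants: starting from X_0 = 1, the state after 10000 steps is 1043618065.
    state = 1_int64
    do i = 1, 10000
        call lcg_next(state, u)
    end do
    print *, "state after 10000 steps from X_0 = 1:", state

contains

    subroutine lcg_next(x, u01)
        ! Advance the state x by one step and return a uniform deviate in (0, 1).
        integer(int64), intent(inout) :: x
        real(dp), intent(out) :: u01
        x = mod(a*x + c, m)          ! a*x stays below 2**46, well inside int64
        u01 = real(x, dp) / real(m, dp)
    end subroutine lcg_next

end program lcg_sketch
```

With $c = 0$ and a prime modulus, the state never reaches zero as long as the seed lies between 1 and $m - 1$, so the returned value is always strictly between 0 and 1.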
Anyway, it is of course a very bad method to compute $\pi$ since the error scales like $\frac{1}{\sqrt{N}}$ (one more digit costs 100 times more points!).
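A quick way to see where that scaling comes from, assuming the $N$ points are independent and uniform in the unit square: each point falls inside the quarter disk with probability $p = \pi/4$, so the estimator $\hat{\pi}_N = \frac{4}{N}\sum_{i=1}^{N} B_i$ (with $B_i = 1$ for a hit and $0$ otherwise) has standard deviation

$$
\sigma_N = \sqrt{\frac{16\,p\,(1-p)}{N}} = \sqrt{\frac{\pi\,(4-\pi)}{N}} \approx \frac{1.64}{\sqrt{N}}
$$

Dividing $\sigma_N$ by 10, that is gaining one decimal digit, therefore requires multiplying $N$ by 100.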
I will put some warnings in the README of my repository.
When selecting an RNG for this algorithm, you want one that will have a period (distance between repeat values) as long as possible. Single precision gets you only 2^23 values at most, and there may be holes in the set.
If your goal is to learn coarrays and burn CPU, then by all means use RANDOM_NUMBER.