Learning coarrays, collective subroutines and other parallel features of Modern Fortran

I am currently learning coarrays and would like to share my first experiments:

In the repository, you will find a very simple algorithm computing an approximation of \pi using a Monte Carlo method (a very inefficient way to compute \pi, but a very efficient way to burn the CPU!), in several versions:

  • A serial version of the algorithm.
  • A parallel version using OpenMP.
  • A parallel version using coarrays.
  • Another coarray version that regularly prints intermediate results.
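For readers who have not seen the method, here is a minimal sketch of what a serial version could look like. It is not copied from the repository: the program name, the variable names and the number of points are assumptions chosen for illustration.

```fortran
program pi_monte_carlo_sketch
    ! Hypothetical minimal serial version; names and the sample count
    ! are illustrative and not taken from the repository.
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none
    integer(int64), parameter :: n = 10000000_int64   ! number of random points (assumed)
    integer(int64) :: i, inside
    real(real64)   :: x, y

    inside = 0
    do i = 1, n
        ! Draw a point uniformly in the unit square [0,1) x [0,1)
        call random_number(x)
        call random_number(y)
        ! Count it if it falls inside the quarter disk of radius 1
        if (x*x + y*y < 1.0_real64) inside = inside + 1
    end do

    ! The quarter disk has area pi/4, so pi is approximately 4*inside/n
    write (*,*) "pi ~", 4.0_real64 * real(inside, real64) / real(n, real64)
end program pi_monte_carlo_sketch
```

The parallel versions presumably split the loop over points between threads or images and add up the partial counts at the end.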

My first benchmark on a 2-core / 4-thread CPU yields:

| Version | gfortran | ifort |
|---|---|---|
| Serial | 19.9 s | 34.8 s |
| OpenMP | 9.9 s | 93.0 s |
| Coarrays | 16.2 s | 14.4 s |
| Coarrays steady | 33.2 s | 35.9 s |

First, concerning the gfortran results:

  • I am surprised by the difference between OpenMP and coarrays (in both cases there were 4 a.out executables running).
  • And by the effect of regularly printing intermediate results (just 20 times).

Concerning ifort, which I am not familiar with:

  • I don’t understand why the results are so bad with the serial version while they are a little better than gfortran with coarrays.
  • And when I use ifort with -qopenmp, I see 4 a.out executables but using only 45% of the CPU. And the results are catastrophic.

Any help and comments welcome!
And I hope this post and that repository will help other people interested in learning coarrays.

I would expect 1 task with 400% cpu usage. How are you running the OpenMP examples?

I would run it as follows:

$ export OMP_NUM_THREADS=4 && ./a.out

where that environment variable sets the number of OpenMP threads.
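To check that the variable is actually picked up, a tiny test program like the following can help. This is only an illustrative sketch (the program name is made up); it uses the standard `omp_lib` module and must be compiled with OpenMP enabled, e.g. with `-fopenmp` for gfortran or `-qopenmp` for ifort.

```fortran
program omp_threads_check
    ! Illustrative check, not part of the repository.
    use omp_lib, only: omp_get_max_threads, omp_get_thread_num
    implicit none

    ! Upper bound on the team size; reflects OMP_NUM_THREADS when it is set
    write (*,*) "max threads:", omp_get_max_threads()

    !$omp parallel
    ! Each thread of the team reports its own number (order is not deterministic)
    write (*,*) "hello from thread", omp_get_thread_num()
    !$omp end parallel
end program omp_threads_check
```

If the reported maximum is not 4 after the `export`, the environment variable is not reaching the executable.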

I used:

$ ifort -O3 -qopenmp pi_monte_carlo_openmp.f90
$ time ./a.out

I can see the four threads running with the htop command, but each runs at ~45%, unlike with gfortran where they are at 100%. Strange…

With the following command, it’s the same:

$ export OMP_NUM_THREADS=4 && ./a.out

With OpenMP, there should be only one process, with 4 threads.
By default, top only shows processes. However, it can display the individual threads with the -H option:

top -H
       -H  :Threads-mode operation
            Instructs top to display individual threads.  Without this
            command-line option a summation of all threads in each
            process is shown.  Later this can be changed with the `H'
            interactive command.

Maybe this option is the default in htop?

I observe something similar, with the 4 threads (thanks @jeremie.vandenplas for the -H tip!) only at 45%. It seems related to using the RANDOM_NUMBER intrinsic inside a DO loop:

Which OS platform do you use? I could not find this info. I guess the reporting of threads (in ps, top, etc.) was (and may still be) OS-dependent, and in the past threads could even get their own PIDs.

Also, I understand you are using HyperThreading. This may affect the results, as a hyperthread is not a fully effective core.

BTW, is RANDOM_NUMBER thread-safe?

I was able to get an almost 4x speedup with coarrays using Intel Fortran on Windows, using the example from their tutorial that calculates pi using Monte Carlo. Link

Nice catch! I missed it. It might indeed be part of the explanation. Is it also the configuration of @pcosta?

Is it also the configuration of @pcosta?

I actually have 4 physical cores.

Ubuntu 20.10 on a laptop with an Intel(R) Core™ i7-5500U CPU @ 2.40GHz with 2 cores / 4 threads.

I will try tomorrow on another machine.

Probably, with top I saw one process a.out.
And with htop I saw four a.out with different PIDs.

Those guys compute pi by integrating (1-x^2)^(-1/2) over the interval [-1, 1] and report a 5x advantage for ifort 18.1 over gfortran 7.2 (yes, quite an old version).

I personally prefer the BBP formula for digits of pi. I have OpenMP and coarray code based on the Fortran code in the reference.

Cool:

program pi_bbp
! Bailey-Borwein-Plouffe formula for pi, from "The BBP Algorithm for Pi", by David Bailey https://www.davidhbailey.com/dhbpapers/bbp-alg.pdf
implicit none
integer      , parameter :: dp=kind(1.0d0)
integer                  :: k
real(kind=dp)            :: xk,pi
real(kind=dp), parameter :: x16 = 1/16.0_dp
pi = 0.0d0
do k=0,10
   xk = real(k,kind=dp)
   pi = pi + x16**k * (4/(8*xk+1) - 2/(8*xk+4) - 1/(8*xk+5) - 1/(8*xk+6))
   write (*,*) k,pi
end do
write (*,*) -1,4*atan(1.0d0),"true"
end program pi_bbp

result:

   0   3.1333333333333333     
   1   3.1414224664224664     
   2   3.1415873903465816     
   3   3.1415924575674357     
   4   3.1415926454603365     
   5   3.1415926532280878     
   6   3.1415926535728809     
   7   3.1415926535889729     
   8   3.1415926535897523     
   9   3.1415926535897913     
  10   3.1415926535897931     
  -1   3.1415926535897931      true

As the author of the Intel tutorial, I want to point out that this was only to introduce coarrays in an accessible manner, not as a recommendation for how to compute pi! The method shown is actually a horrible way to do it and is highly dependent on how good the random number generator is.

Note that you don’t actually need a coarray for this exercise. In fact, (in my experience) you don’t need coarrays very often, as the collective subroutines can handle most of the communication you’ll need to do. I’ve submitted a PR to your repo @vmagnin to demonstrate. Of course, then you don’t get to actually play with a coarray :stuck_out_tongue_winking_eye: .
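To illustrate the idea for other readers, here is a rough sketch of a version that uses only the `co_sum` collective and no coarray variable. It is not the code from the PR: the names and the number of points per image are assumptions, and `random_init` is a Fortran 2018 intrinsic, so recent compiler support is assumed as well.

```fortran
program pi_co_sum_sketch
    ! Hypothetical sketch: every image counts its own hits in an ordinary
    ! (non-coarray) variable, then co_sum adds the counts together.
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none
    integer(int64), parameter :: n_per_image = 10000000_int64   ! assumed value
    integer(int64) :: i, inside
    real(real64)   :: x, y

    ! Ask for a different random sequence on each image (Fortran 2018)
    call random_init(repeatable=.false., image_distinct=.true.)

    inside = 0
    do i = 1, n_per_image
        call random_number(x)
        call random_number(y)
        if (x*x + y*y < 1.0_real64) inside = inside + 1
    end do

    ! Sum the per-image counts; only image 1 receives the total
    call co_sum(inside, result_image=1)

    if (this_image() == 1) then
        write (*,*) "pi ~", 4.0_real64 * real(inside, real64) / &
                            (real(n_per_image, real64) * real(num_images(), real64))
    end if
end program pi_co_sum_sketch
```

The collective performs the only communication needed here, so no explicit `sync all` or remote coarray access appears in the code.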

With my changes, on my Intel i5 machine with 8 threads, I get the following results.

| Version | gfortran | ifort |
|---|---|---|
| Serial | 15.57s user 0.00s system 99% cpu 15.571 total | 25.63s user 0.00s system 99% cpu 25.639 total |
| OpenMP | 44.55s user 0.00s system 797% cpu 5.590 total | 51.65s user 2.10s system 161% cpu 33.380 total |
| Coarrays | 50.29s user 0.28s system 751% cpu 6.731 total | 53.01s user 0.32s system 752% cpu 7.084 total |
| Coarrays steady | cafrun -n 8 ./a.out 60.71s user 0.25s system 754% cpu 8.076 total | 64.57s user 0.44s system 760% cpu 8.553 total |

I too don’t understand what I’m doing wrong with ifort and OpenMP.

My BBP coarray code: Compiler Explorer

The ifort RANDOM_NUMBER is known to use an exclusive lock in threaded applications, reducing performance. It might not be the best choice if you’re comparing performance.

Thanks for that information. So I think I will try to use a classical linear congruential generator like:
X_{n+1} = (a \cdot X_n + c) \bmod m, \text{ with } n \in \mathbb{N},
where a = 16807, c = 0, m = 2^{31} - 1 and X_0 = 123.
I don’t know the quality of that pseudo-random generator, but I don’t care since my objective is not to compute \pi but to burn my CPU and play with coarrays and other parallel features.
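A minimal sketch of such a generator, using the parameters quoted above, might look like this. The module and procedure names are made up for illustration, and 64-bit integers are used so that the product a*X_n cannot overflow before the modulo is taken.

```fortran
module lcg_sketch
    ! Hypothetical module; a, m and the seed 123 come from the post above,
    ! everything else is an illustrative assumption.
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none
    private
    public :: lcg_next

    integer(int64), parameter :: a = 16807_int64
    integer(int64), parameter :: m = 2147483647_int64   ! 2**31 - 1
contains
    ! Advance the state by one step (c = 0) and return a deviate in (0,1)
    subroutine lcg_next(state, u)
        integer(int64), intent(inout) :: state
        real(real64),   intent(out)   :: u
        state = mod(a * state, m)
        u = real(state, real64) / real(m, real64)
    end subroutine lcg_next
end module lcg_sketch

program lcg_demo
    use, intrinsic :: iso_fortran_env, only: int64, real64
    use lcg_sketch, only: lcg_next
    implicit none
    integer(int64) :: state
    integer        :: i
    real(real64)   :: u

    state = 123_int64   ! X_0
    do i = 1, 3
        call lcg_next(state, u)
        write (*,*) state, u
    end do
end program lcg_demo
```

Because the state is an ordinary local variable, each thread or image can keep its own copy and no lock is involved. How to choose distinct, well-separated seeds per thread or image is left open here.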

Anyway, it is of course a very bad method to compute \pi, since the error only decreases like \frac{1}{\sqrt{N}}, where N is the number of points (one more digit costs 100 times more points!).
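The factor of 100 can be made explicit with a short calculation. The symbols K (number of points falling inside the quarter disk) and p (probability of a hit) are notation introduced here, assuming points drawn uniformly in the unit square:

```latex
\begin{align}
  \hat{\pi} &= \frac{4K}{N}, \qquad K \sim \mathrm{Binomial}(N, p), \quad p = \frac{\pi}{4} \\
  \sigma_{\hat{\pi}} &= 4\sqrt{\frac{p(1-p)}{N}}
                      = \sqrt{\frac{\pi(4-\pi)}{N}}
                      \approx \frac{1.64}{\sqrt{N}}
\end{align}
```

Dividing the standard error by 10, i.e. gaining one decimal digit, therefore requires multiplying N by 100.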

I will put some warnings in the README of my repository.

When selecting an RNG for this algorithm, you want one that will have a period (distance between repeat values) as long as possible. Single precision gets you only 2^23 values at most, and there may be holes in the set.

If your goal is to learn coarrays and burn CPU, then by all means use RANDOM_NUMBER.