Learning coarrays, collective subroutines and other parallel features of Modern Fortran

vmagnin · May 3, 2021, 7:24pm

I am now learning coarrays and I share with you my first experiments:

In the repository, you will find a very simple algorithm that approximates $\pi$ with a Monte Carlo method (a very inefficient way to compute $\pi$, but a very efficient way to burn the CPU!), in several versions:

A serial version of the algorithm.
A parallel version using OpenMP.
A parallel version using coarrays.
Another coarray version that steadily prints intermediate results.
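
For readers who have not seen the method, here is a minimal serial sketch of the idea. It is not the code from the repository: the program name, the number of points n and the variable names are assumptions chosen only for illustration.

```fortran
program pi_monte_carlo_sketch
    ! Hypothetical minimal sketch, NOT the repository code.
    use, intrinsic :: iso_fortran_env, only: wp => real64, int64
    implicit none
    integer(int64), parameter :: n = 10000000_int64  ! number of points (arbitrary choice)
    integer(int64) :: i, k
    real(wp) :: x, y

    k = 0
    do i = 1, n
        ! Draw a point uniformly in the unit square [0,1) x [0,1)
        call random_number(x)
        call random_number(y)
        ! Count it if it falls inside the quarter disc of radius 1
        if (x*x + y*y < 1.0_wp) k = k + 1
    end do

    ! (area of quarter disc) / (area of square) = pi/4
    write (*,*) "pi is approximately", 4.0_wp * real(k, wp) / real(n, wp)
end program pi_monte_carlo_sketch
```

The parallel versions split the loop over threads or images and combine the per-worker counts at the end.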

My first benchmark on a 2 cores / 4 threads CPU yields:

Version	gfortran	ifort
Serial	19.9 s	34.8 s
OpenMP	9.9 s	93.0 s
Coarrays	16.2 s	14.4 s
Coarrays steady	33.2 s	35.9 s

First, concerning the gfortran results:

I am surprised by the difference between OpenMP and coarrays (in both cases there were 4 a.out executables running).
And by the effect of printing steadily intermediate results (just 20 times).

Concerning ifort, which I am not familiar with:

I don’t understand why the results are so bad with the serial version while they are a little better than gfortran with coarrays.
And when I use ifort with -qopenmp, I see 4 a.out executables but using only 45% of the CPU. And the results are catastrophic.

Any help and comments welcome!
And I hope this post and that repository will help other people interested in learning coarrays.

pcosta · May 3, 2021, 7:56pm

I would expect 1 task with 400% cpu usage. How are you running the OpenMP examples?

I would run it as follows:

$ export OMP_NUM_THREADS=4 && ./a.out

where that environment variable sets the number of OpenMP threads.
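
A quick way to check that the variable is picked up is a tiny program that reports the thread count. This is only a sketch: the program name is made up, and it assumes an OpenMP-enabled build (for example -qopenmp with ifort or -fopenmp with gfortran).

```fortran
program omp_threads_check
    ! Hypothetical helper: prints how many OpenMP threads are actually used.
    use omp_lib, only: omp_get_max_threads, omp_get_num_threads, omp_get_thread_num
    implicit none

    ! Upper bound on the team size, normally taken from OMP_NUM_THREADS
    write (*,*) "max threads:", omp_get_max_threads()

    !$omp parallel
    ! The critical section only keeps the output lines from interleaving
    !$omp critical
    write (*,*) "thread", omp_get_thread_num(), "of", omp_get_num_threads()
    !$omp end critical
    !$omp end parallel
end program omp_threads_check
```

With OMP_NUM_THREADS=4 one would expect four "thread ... of 4" lines, all coming from a single process.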

vmagnin · May 3, 2021, 8:09pm

I used:

$ ifort -O3 -qopenmp pi_monte_carlo_openmp.f90
$ time ./a.out

I can see the four threads running with the htop command, but each runs at ~45%, unlike with gfortran where they are at 100%. Strange…

With the following command, it’s the same:

$ export OMP_NUM_THREADS=4 && ./a.out

jeremie.vandenplas · May 3, 2021, 8:18pm

With OpenMP, there should be only one process, and 4 threads.
By default, top only shows processes. However, it can display the threads with the -H option:

top -H

       -H  :Threads-mode operation
            Instructs top to display individual threads.  Without this
            command-line option a summation of all threads in each
            process is shown.  Later this can be changed with the `H'
            interactive command.

Maybe this option is the default for htop?

pcosta · May 3, 2021, 8:27pm

I observe something similar, with the 4 threads (thanks @jeremie.vandenplas for the -H tip!) only at 45%. It seems related to using the RANDOM_NUMBER intrinsic inside a DO loop:

msz59 · May 3, 2021, 8:33pm

Which OS platform do you use? I could not find this info. I guess the reporting of threads (in ps, top, etc.) was (and may still be) OS-dependent, and in the past threads could even get their own PIDs.

Also, I understand you are using HyperThreading. This may affect the results, as a hyperthread is not a fully effective core.

BTW, is RANDOM_NUMBER thread-safe?

Beliavsky · May 3, 2021, 8:41pm

I was able to get an almost 4x speedup with coarrays using Intel Fortran on Windows, using the example from their tutorial that calculates pi using Monte Carlo. Link

jeremie.vandenplas · May 3, 2021, 8:42pm

Nice catch! I missed it. It might indeed be part of the explanation. Is it also the configuration of @pcosta?

pcosta · May 3, 2021, 8:51pm

Is it also the configuration of @pcosta?

I actually have 4 physical cores.

vmagnin · May 3, 2021, 9:03pm

Ubuntu 20.10 on a laptop with an Intel(R) Core™ i7-5500U CPU @ 2.40GHz with 2 cores / 4 threads.

I will try tomorrow on another machine.

vmagnin · May 3, 2021, 9:05pm

Probably, with top I saw one process a.out.
And with htop I saw four a.out with different PIDs.

msz59 · May 3, 2021, 9:28pm

Those guys compute pi by integrating (1-x^2)^(-1/2) over the interval [-1, 1] and report a 5x advantage of ifort 18.1 over gfortran 7.2 (yes, quite an old version).

themos · May 3, 2021, 9:55pm

I personally prefer the BBP formula for digits of pi. I have OpenMP and coarray code based on the Fortran code in the reference.

Beliavsky · May 3, 2021, 10:45pm

Cool:

program pi_bbp
! Bailey-Borwein-Plouffe formula for pi, from "The BBP Algorithm for Pi",
! by David Bailey: https://www.davidhbailey.com/dhbpapers/bbp-alg.pdf
implicit none
integer      , parameter :: dp = kind(1.0d0)
integer                  :: k
real(kind=dp)            :: xk, pi
real(kind=dp), parameter :: x16 = 1/16.0_dp
pi = 0.0_dp
do k = 0, 10
   xk = real(k, kind=dp)
   ! Add the k-th term of the BBP series
   pi = pi + x16**k * (4/(8*xk+1) - 2/(8*xk+4) - 1/(8*xk+5) - 1/(8*xk+6))
   write (*,*) k, pi
end do
! Reference value for comparison, labelled with k = -1
write (*,*) -1, 4*atan(1.0_dp), "true"
end program pi_bbp

result:

   0   3.1333333333333333     
   1   3.1414224664224664     
   2   3.1415873903465816     
   3   3.1415924575674357     
   4   3.1415926454603365     
   5   3.1415926532280878     
   6   3.1415926535728809     
   7   3.1415926535889729     
   8   3.1415926535897523     
   9   3.1415926535897913     
  10   3.1415926535897931     
  -1   3.1415926535897931      true

sblionel · May 4, 2021, 12:25am

As the author of the Intel tutorial, I want to point out that this was only to introduce coarrays in an accessible manner, not as a recommendation for how to compute pi! The method shown is actually a horrible way to do it and is highly dependent on how good the random number generator is.

everythingfunctional · May 4, 2021, 1:29am

Note that you don’t actually need a coarray for this exercise. In fact, (in my experience) you don’t need coarrays very often, as the collective subroutines can handle most of the communication you’ll need to do. I’ve submitted a PR to your repo @vmagnin to demonstrate. Of course, then you don’t get to actually play with a coarray.
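
To illustrate the idea (this is a rough sketch, not the content of the PR), each image can count its own hits in an ordinary variable and let co_sum do the reduction. The program name and n_per_image are assumptions, and random_init is a Fortran 2018 intrinsic that not every compiler version provides.

```fortran
program pi_co_sum_sketch
    ! Hypothetical sketch: Monte Carlo pi with a collective subroutine, no coarray.
    use, intrinsic :: iso_fortran_env, only: wp => real64, int64
    implicit none
    integer(int64), parameter :: n_per_image = 10000000_int64  ! arbitrary choice
    integer(int64) :: i, k
    real(wp) :: x, y

    ! Fortran 2018: ask for a different random sequence on every image
    call random_init(repeatable=.false., image_distinct=.true.)

    k = 0
    do i = 1, n_per_image
        call random_number(x)
        call random_number(y)
        if (x*x + y*y < 1.0_wp) k = k + 1
    end do

    ! k is a plain local variable; co_sum sums it over all images
    ! and leaves the total on image 1
    call co_sum(k, result_image=1)

    if (this_image() == 1) then
        write (*,*) "images:", num_images()
        write (*,*) "pi is approximately", &
            4.0_wp * real(k, wp) / real(n_per_image * num_images(), wp)
    end if
end program pi_co_sum_sketch
```

The only communication is the single co_sum call at the end, so no explicit synchronization statement is needed in this sketch.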

With my changes on my Intel i5 machine with 8 threads I get the following results.

Version	gfortran	ifort
Serial	15.57s user 0.00s system 99% cpu 15.571 total	25.63s user 0.00s system 99% cpu 25.639 total
OpenMP	44.55s user 0.00s system 797% cpu 5.590 total	51.65s user 2.10s system 161% cpu 33.380 total
Coarrays	50.29s user 0.28s system 751% cpu 6.731 total	53.01s user 0.32s system 752% cpu 7.084 total
Coarrays steady	cafrun -n 8 ./a.out 60.71s user 0.25s system 754% cpu 8.076 total	64.57s user 0.44s system 760% cpu 8.553 total

I too don’t understand what I’m doing wrong with ifort and OpenMP.

themos · May 4, 2021, 8:26am

My BBP coarray code: Compiler Explorer

sblionel · May 4, 2021, 2:16pm

The ifort RANDOM_NUMBER is known to use an exclusive lock in threaded applications, reducing performance. It might not be the best choice if you’re comparing performance.

vmagnin · May 4, 2021, 2:30pm

Thanks for that information. So I think I will try to use a classical linear congruential generator like:
$$X_{n+1} = (a \cdot X_n + c) \bmod m, \text{ with } n \in \mathbb{N}$$
with $a = 16807$, $c = 0$, $m = 2^{31} - 1$ and $X_0 = 123$.
I don’t know the quality of that pseudo-random generator, but I don’t care since my objective is not to compute $\pi$ but to burn my CPU and play with coarrays and other parallel features.
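
Such a generator could be sketched as follows. The module, subroutine and variable names are made up for the example. A 64-bit integer is used for the state because the product a * X_n can exceed the 32-bit range.

```fortran
module lcg_sketch
    ! Hypothetical sketch of X_{n+1} = (a*X_n) mod m with a = 16807, m = 2**31 - 1.
    use, intrinsic :: iso_fortran_env, only: int64, wp => real64
    implicit none
    private
    public :: lcg_next

    integer(int64), parameter :: a = 16807_int64
    integer(int64), parameter :: m = 2147483647_int64  ! 2**31 - 1
contains
    ! Advance the caller-owned state and return a uniform deviate in (0,1).
    subroutine lcg_next(state, u)
        integer(int64), intent(inout) :: state
        real(wp), intent(out) :: u
        state = mod(a * state, m)   ! a*state < 2**63, so no overflow in int64
        u = real(state, wp) / real(m, wp)
    end subroutine lcg_next
end module lcg_sketch

program lcg_demo
    use, intrinsic :: iso_fortran_env, only: int64, wp => real64
    use lcg_sketch, only: lcg_next
    implicit none
    integer(int64) :: state
    real(wp) :: u
    integer :: i

    state = 123_int64   ! X_0
    do i = 1, 5
        call lcg_next(state, u)
        write (*,*) state, u
    end do
end program lcg_demo
```

The first state printed is 16807 * 123 = 2067261. Because the state is an argument rather than hidden global data, each thread or image can keep its own copy, which avoids any lock. Each one would need a different seed, and how to pick well-separated seeds is left open here.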

Anyway, it is of course a very bad method to compute $\pi$ since the error only decreases like $\frac{1}{\sqrt{N}}$ (one more digit costs 100 times more points!).
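
The scaling can be checked with a short calculation, assuming the $N$ points are independent and uniform in the unit square, with $K$ of them falling inside the quarter disc:

```latex
\begin{align}
  \hat{\pi} &= \frac{4K}{N}, \qquad K \sim \mathrm{Binomial}(N, p), \quad p = \frac{\pi}{4} \\
  \operatorname{Var}(\hat{\pi}) &= \frac{16\, p\, (1 - p)}{N} = \frac{\pi\, (4 - \pi)}{N} \\
  \sigma_{\hat{\pi}} &= \sqrt{\frac{\pi\, (4 - \pi)}{N}} \approx \frac{1.64}{\sqrt{N}}
\end{align}
```

Dividing the standard error by 10 therefore requires multiplying $N$ by 100.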

I will put some warnings in the README of my repository.

sblionel · May 4, 2021, 2:53pm

When selecting an RNG for this algorithm, you want one that will have a period (distance between repeat values) as long as possible. Single precision gets you only 2^23 values at most, and there may be holes in the set.

If your goal is to learn coarrays and burn CPU, then by all means use RANDOM_NUMBER.
