Fortran benchmark

In this benchmark, Fortran is slower than C in every program. Is it possible to improve the Fortran code? Since this website is quite popular, if the Fortran code can be improved, that would be a great marketing point.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/fortran.html

2 Likes

We can add it here: GitHub - fortran-lang/benchmarks: Fortran benchmarks and work on it.

Last time I looked at the nbody benchmark, the bottleneck was 1/sqrt(x). We would have to look at the assembly that compilers generate and see.

Down the road I would like LFortran to generate optimal assembly for things like 1/sqrt(x), those are basic tricks and it’s actually used quite often in any kind of a particle simulation (MD, n-body, etc.).

1 Like

Whoever wants to improve that benchmark should take a look at this then. Fast inverse square root - Wikipedia

That version is actually slow on modern hardware. I assume it was fast on the hardware back then.

Rather, the solution is to emit the same assembly as the C code, which uses optimal assembly. It’s the line z = _mm256_rsqrt_pd(z);.

And this is why Knuth’s adage “premature optimization is root of all evil” always seems to hold. Optimizations like this are always premature, because they’ll be out of date when the newer hardware comes out :stuck_out_tongue_winking_eye: .

1 Like

Optimal is getting the same answer as they do to the same number of significant digits, in this case it seems 9 significant digits, as fast as you can:

PROGRAM OUTPUT:
-0.169075164
-0.169059907

So replacing 1/sqrt(x) by 1 will change the answer, thus one cannot do that.

The program must solve the same equations, i.e. you can change the number of steps and you must get the correct answer out.

We want to keep the Fortran program, it is actually quite readable: n-body Classic Fortran #6 program (Benchmarks Game), but we want the compiler to generate code that runs as fast as the non-portable C code.

And never uses the COMPLEX data type???

The Fortran mandelbrot program does not use COMPLEX, which I found odd.

1 Like

Here is the relevant part from https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-ifc-6.html:

       d2 = sum(r(:,m)**2)
       distance = 1/sqrt(real(d2))
       distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)

and it’s the energy printout at the end:

  e = energy(x, v, mass)
  write (*,'(f12.9)') e

That must be correct to 9 decimal places. So what we can do is keep everything else the same and play with the accuracy of sqrt to see what is the minimal accuracy that still gives the correct final answer.

Excellent point, I haven’t noticed it. It’s either a mistake, or it is on purpose. Given that it (I assume) returns the correct answer, it means the sqrt can be in single precision (or less) and still recover total energy to 9 digits.

1 Like

I do not really understand that part of the algorithm, but if real(d2) and d2 had the same type, the second line would seem useless:

       distance = 1/sqrt(real(d2))
       distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)

because it could be analytically simplified and we would obtain (at least mathematically):

       distance = 1/sqrt(real(d2))
       distance = distance

or am I missing something?

It’s a Newton iteration to improve accuracy.

3 Likes

It looks strange, because kind of d2 is dp(kind=8), real(d2) will change it to kind=4 ,so the precision is decrease, and then the iteration is useful maybe. So the Newton iteration looks like for fix the type casting loss of kind=8 to kind=4( :thinking:Bug maybe)

Actually,if I change it like this

       distance = 1/sqrt(d2)
       !distance = 1/sqrt(real(d2))
       !distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)

The result is same


Based on @mecej4 answer , I see, the optimize principle is the single precision sqrt , type casting and 4 multiplication may faster than double precision sqrt. :open_mouth:

2 Likes

It looks like a trick to fool a specific compiler into emitting faster code for a specific FPU. With other compilers/machines, this code may “shoot itself in the foot”.

Here is an interpretation, expanding on the short answer by Themos:

The first of the two lines

       distance = 1/sqrt(real(d2))
       distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)

calculates a single-precision approximation, presumably using a machine instruction that is faster than the corresponding double precision instruction. The next line uses Newton iteration, supposedly doubling the number of significant digits that are correct.

If we are using IEEE floats with 23+1 and 52+1 bits for the mantissa, the Newton step takes us from 24 to 48 bits, so the final result may be less precise than the result from a simple use of the double precision square root instruction.

3 Likes

Indeed, I think that is exactly right. The single precision 1/sqrt(x) will (presumably) generate the faster single precision instruction, and then the Newton iteration increases the precision, not quite to a full double, but enough to be accurate to 9 digits in the final energy. You can also notice the actual code has a second Newton iteration that is commented out:

       d2 = sum(r(:,m)**2)
       distance = 1/sqrt(real(d2))
       distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)
       !distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)
       mag(m) = tstep * distance**3

And the reason it is commented out is that we don’t need the extra accuracy. That is a common practice with these things.

I will note that this is such a commonly used technique that compilers like GFortran even have code to do this automatically (31723 – Use reciprocal and reciprocal square root with -ffast-math) — but the issue is how many iterations the compiler should do (to be safe it really should do enough iterations to recover full double precision accuracy), and since this problem requires lower accuracy, then presumably one has to do it by hand to control the accuracy vs speed better.

Note that the C version is simply using this routine to do the same:

// utilize vrsqrtps to compute an approximation of 1/sqrt(x) with float,
// cast back to double and refine using a variation of
// Goldschmidt’s algorithm to get < 1e-9 error
static inline __m256d _mm256_rsqrt_pd(__m256d s) {
    __m128 q = _mm256_cvtpd_ps(s);
    q = _mm_rsqrt_ps(q);
    __m256d x = _mm256_cvtps_pd(q);
    __m256d y = s * x * x;
    __m256d a = _mm256_mul_pd(y, _mm256_set1_pd(0.375));
    a = _mm256_mul_pd(a, y);
    __m256d b = _mm256_mul_pd(y, _mm256_set1_pd(1.25));
    b = _mm256_sub_pd(b, _mm256_set1_pd(1.875));
    y = _mm256_sub_pd(a, b);
    x = _mm256_mul_pd(x, y);
    return x;
}

It’s very hard to read, but if you translate the assembly into Fortran line by line, here is what it is doing (except that the assembly is operating on 4 doubles at once, but the Fortran compiler can vectorize in a similar manner):

! Computes x = 1/sqrt(s)
real(dp) function rsqrt(s) result(x)
real(dp), intent(in) :: s
real(dp) :: y, a, b
real(sp) :: q
q = real(s, sp)
q = 1/sqrt(q)
x = real(q, dp)
y = s * x * x
a = y * 0.375_dp
a = a * y
b = y * 1.25_dp
b = b - 1.875_dp
y = a - b
x = x * y
end function

I tested the last function and it works. It is actually using 2 Newton iterations, an equivalent of:

r = 1/sqrt(real(x,sp))
r = r * (1.5_dp - 0.5_dp * x * r**2)
r = r * (1.5_dp - 0.5_dp * x * r**2)

Plugging the second equation into the third:

r = 1/sqrt(real(x,sp))
r = r * (1.5_dp - 0.5_dp * x * r**2) * (1.5_dp - 0.5_dp * x * r**2 * (1.5_dp - 0.5_dp * x * r**2)**2)

and expanding we get:

r = 1/sqrt(real(x,sp))
r = r * (0.0625*x**4*r**8 - 0.5625*x**3*r**6 + 1.6875*x**2*r**4 - 2.4375*x*r**2 + 2.25)

And rearranging the rsqrt function it is doing:

r = 1/sqrt(real(x,sp))
r = r * (0.375*x**2*r**4 - 1.25*x*r**2 + 1.875)

Numerically it seems both are getting the same answer, so I am guessing they are somehow related using the r = 1/sqrt(x) relation. Here is the full code that I used to test it.

4 Likes

Perhaps making the procedure elemental and adding

!$omp declare simd(rsqrt) [simdlen(n)]

would aid in achieving vectorization for the Fortran version. I wonder if setting the optional clause simdlen(4) would help the compiler emit a version equivalent to the C version. A few other options are discussed in the article Explicit Vector Programming in Fortran from Intel.

In my experience GFortran correctly vectorizes such functions out of the box most of the time (with -O3 -march=native -ffast-math -funroll-loops), sometimes one needs to help it a bit such as by splitting the loop or rewriting the code in various ways. Here is an example where I apply the elemental rsqrt function on an array and you can see that GFortran correctly vectorized it: Compiler Explorer, from first look the assembly looks very good. I think for simple things like rsqrt, it should be possible to make GFortran generate the same assembly as the “C” code (really an assembly code).

1 Like

Compilers have been converting 1/sqrt(x) to rsqrt(x) for some time.

2 Likes

Odd indeed. Any Mandelbrot code I ever wrote always used COMPLEX variables.

2 Likes