Indeed, I think that is exactly right. The single precision 1/sqrt(x)
will (presumably) generate the faster single precision instruction, and then the Newton iteration increases the precision, not quite to a full double, but enough to be accurate to 9 digits in the final energy. You can also notice the actual code has a second Newton iteration that is commented out:
d2 = sum(r(:,m)**2)
distance = 1/sqrt(real(d2))
distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)
!distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)
mag(m) = tstep * distance**3
And the reason it is commented out is that we don’t need the extra accuracy. That is a common practice with these things.
I will note that this is such a commonly used technique that compilers like GFortran even have code to do this automatically (31723 – Use reciprocal and reciprocal square root with -ffast-math) — but the issue is how many iterations the compiler should do (to be safe it really should do enough iterations to recover full double precision accuracy), and since this problem requires lower accuracy, then presumably one has to do it by hand to control the accuracy vs speed better.
Note that the C version is simply using this routine to do the same:
// utilize vrsqrtps to compute an approximation of 1/sqrt(x) with float,
// cast back to double and refine using a variation of
// Goldschmidt’s algorithm to get < 1e-9 error
static inline __m256d _mm256_rsqrt_pd(__m256d s) {
__m128 q = _mm256_cvtpd_ps(s);
q = _mm_rsqrt_ps(q);
__m256d x = _mm256_cvtps_pd(q);
__m256d y = s * x * x;
__m256d a = _mm256_mul_pd(y, _mm256_set1_pd(0.375));
a = _mm256_mul_pd(a, y);
__m256d b = _mm256_mul_pd(y, _mm256_set1_pd(1.25));
b = _mm256_sub_pd(b, _mm256_set1_pd(1.875));
y = _mm256_sub_pd(a, b);
x = _mm256_mul_pd(x, y);
return x;
}
It’s very hard to read, but if you translate the assembly into Fortran line by line, here is what it is doing (except that the assembly is operating on 4 doubles at once, but the Fortran compiler can vectorize in a similar manner):
! Computes x = 1/sqrt(s)
real(dp) function rsqrt(s) result(x)
real(dp), intent(in) :: s
real(dp) :: y, a, b
real(sp) :: q
q = real(s, sp)
q = 1/sqrt(q)
x = real(q, dp)
y = s * x * x
a = y * 0.375_dp
a = a * y
b = y * 1.25_dp
b = b - 1.875_dp
y = a - b
x = x * y
end function
I tested the last function and it works. It is actually using 2 Newton iterations, an equivalent of:
r = 1/sqrt(real(x,sp))
r = r * (1.5_dp - 0.5_dp * x * r**2)
r = r * (1.5_dp - 0.5_dp * x * r**2)
Plugging the second equation into the third:
r = 1/sqrt(real(x,sp))
r = r * (1.5_dp - 0.5_dp * x * r**2) * (1.5_dp - 0.5_dp * x * r**2 * (1.5_dp - 0.5_dp * x * r**2)**2)
and expanding we get:
r = 1/sqrt(real(x,sp))
r = r * (0.0625*x**4*r**8 - 0.5625*x**3*r**6 + 1.6875*x**2*r**4 - 2.4375*x*r**2 + 2.25)
And rearranging the rsqrt
function it is doing:
r = 1/sqrt(real(x,sp))
r = r * (0.375*x**2*r**4 - 1.25*x*r**2 + 1.875)
Numerically it seems both are getting the same answer, so I am guessing they are somehow related using the r = 1/sqrt(x) relation. Here is the full code that I used to test it.