 I recently watched a nice video explaining the Quake III fast inverse square root algorithm. Since I don’t do much bit fiddlery in my real work, I had some fun spinning up a Fortran translation:

``````module quick_recip_sqrt
use, intrinsic :: iso_fortran_env, only: real32, int32
implicit none
contains
pure elemental function qrsqrt(x) result(y)
real(real32), intent(in) :: x
real(real32) :: y

integer(int32) :: i
real(real32) :: x2
real(real32), parameter :: threehalfs = 1.5_real32

x2 = x*0.5_real32
y = x
i = transfer(y, i)
i = int(z'5f3759df', int32) - ishft(i, -1)
y = transfer(i, y)
y = y * (threehalfs - x2*y*y)
end function
end module
``````

But is it actually fast?

``````use, intrinsic :: iso_fortran_env, only: real32
use quick_recip_sqrt, only: qrsqrt
implicit none

integer, parameter :: n = 100000, k = 100
integer :: i
real(real32), allocatable :: x(:), y(:)
real(real32) :: tick, tock, t

allocate (x(n))
allocate (y, mold=x)

print "(a,i0,a,i0,a)", "Benchmarking elapsed time in seconds for ", n, " evaluations, averaged over ", k, " repetitions"

t = 0.
do i = 1, k
call random_number(x)
y = 0.
call cpu_time(tick)
y = 1/sqrt(x)
call cpu_time(tock)
t = t + tock-tick
end do
t = t/k
print *, "1/sqrt: ", t

t = 0.
do i = 1, k
call random_number(x)
y = 0.
call cpu_time(tick)
y = qrsqrt(x)
call cpu_time(tock)
t = t + tock-tick
end do
t = t/k
print *, "qrsqrt: ", t

end
``````

Here’s what I find:

``````(base) [nshaffer:~/tmp/sqrt]\$ gfortran --version
GNU Fortran (Homebrew GCC 10.2.0) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

(base) [nshaffer:~/tmp/sqrt]\$ for i in {0..3}
> do
> gfortran -std=f2018 -O\$i quick_recip_sqrt.f90 test.f90 && ./a.out
> done
Benchmarking elapsed time in seconds for 100000 evaluations, averaged over 100 repetitions
1/sqrt:    1.48980020E-04
qrsqrt:    2.23302911E-03
Benchmarking elapsed time in seconds for 100000 evaluations, averaged over 100 repetitions
1/sqrt:    1.45219863E-04
qrsqrt:    1.42900055E-04
Benchmarking elapsed time in seconds for 100000 evaluations, averaged over 100 repetitions
1/sqrt:    4.25589446E-04
qrsqrt:    7.65660079E-04
Benchmarking elapsed time in seconds for 100000 evaluations, averaged over 100 repetitions
1/sqrt:    1.09429995E-04
qrsqrt:    7.45430880E-04
``````

Nope! If you dig into the disassembly, it turns out that gfortran on my machine makes use of the `sqrtss` SSE instruction for square-root. Evidently the additional division instruction to evaluate `1/sqrt(x)` still blows the hand-spun (and less accurate!) version out of the water at most optimization levels.

Remember, kids: just because it was fast in the 90’s doesn’t mean it still is! That said, it’s still a fun bit of code to pick apart and understand how it works.

4 Likes

You can use the `-mfpmath=387` option to force gfortran to use the old circuit, for fun. By default, gfortran uses `-mfpmath=sse`

The problem with benchmarks is that it is often times wrong to draw any general conclusion based on limited results. Your `qrsqrt` is twice as fast as `1/sqrt(x)` on my system at all optimizations -O1 through -O3. Looking at the generated assembly shows that `1/sqrt(x)` is converted to `fsqrt` followed by `fdivrs`.

A more important question to look at may be accuracy of the computation. A simple modification to your code shows that `qrsqrt` has a maximum relative error of 14697.6 in units of `epsilon(x)` for the random number samples I grabbed; while `1/sqrt` has a relative error of 0.49544 for the same set of random numbers.

2 Likes

Some interesting things here:

@vmagnin, your assertion is only partly correct. On at least i?-86--freebsd, gfortran will use the i387 by default over the sse instruction set. On x86_64--freebsd, gfortran selects sse instruction set in preference to i387.

1 Like

Reciprocal square roots are often used as an intermediate step in the calculation of regular square roots, so benchmarking `sqrt()` vs. `x*rsqrt(x)` would also be informative.

(Optional old guy anecdote: the Cray-1 didn’t have a floating-point division instruction; instead, it had a “reciprocal approximation” ROM table look-up instruction and a variant of floating-point multiplication that performed a Newton-Raphson refinement iteration. Both of these could be implemented in a pipelined fashion, so there were vector versions of these instructions, and a vectorized FP division was a three-instruction two-chime sequence. On the Cray-2/-3, these were augmented with variants that performed the same technique to compute a reciprocal square root via a table look-up for approximation and yet another variant of floating-point multiplication for N-R refinement, so SQRT was also three instructions. The numerical accuracy wasn’t perfect, though. The reciprocal look-up tables were loaded by the Cray-2 boot sequence, and you could actually change numeric results at your site by tweaking them.)