A (mis)adventure in optimization

I recently watched a nice video explaining the Quake III fast inverse square root algorithm. Since I don’t do much bit fiddlery in my real work, I had some fun spinning up a Fortran translation:

module quick_recip_sqrt
  use, intrinsic :: iso_fortran_env, only: real32, int32
  implicit none
contains
  pure elemental function qrsqrt(x) result(y)
    real(real32), intent(in) :: x
    real(real32) :: y

    integer(int32) :: i
    real(real32) :: x2
    real(real32), parameter :: threehalfs = 1.5_real32

    x2 = x*0.5_real32
    y = x
    i = transfer(y, i)
    i = int(z'5f3759df', int32) - ishft(i, -1)
    y = transfer(i, y)
    y = y * (threehalfs - x2*y*y)
  end function
end module

But is it actually fast?

use, intrinsic :: iso_fortran_env, only: real32
use quick_recip_sqrt, only: qrsqrt
implicit none

integer, parameter :: n = 100000, k = 100
integer :: i
real(real32), allocatable :: x(:), y(:)
real(real32) :: tick, tock, t

allocate (x(n))
allocate (y, mold=x)

print "(a,i0,a,i0,a)", "Benchmarking elapsed time in seconds for ", n, " evaluations, averaged over ", k, " repetitions"

t = 0.
do i = 1, k
    call random_number(x)
    y = 0.
    call cpu_time(tick)
    y = 1/sqrt(x)
    call cpu_time(tock)
    t = t + tock-tick
end do
t = t/k
print *, "1/sqrt: ", t

t = 0.
do i = 1, k
    call random_number(x)
    y = 0.
    call cpu_time(tick)
    y = qrsqrt(x)
    call cpu_time(tock)
    t = t + tock-tick
end do
t = t/k
print *, "qrsqrt: ", t

end

Here’s what I find:

(base) [nshaffer:~/tmp/sqrt]$ gfortran --version
GNU Fortran (Homebrew GCC 10.2.0) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

(base) [nshaffer:~/tmp/sqrt]$ for i in {0..3}
> do
> gfortran -std=f2018 -O$i quick_recip_sqrt.f90 test.f90 && ./a.out
> done
Benchmarking elapsed time in seconds for 100000 evaluations, averaged over 100 repetitions
 1/sqrt:    1.48980020E-04
 qrsqrt:    2.23302911E-03
Benchmarking elapsed time in seconds for 100000 evaluations, averaged over 100 repetitions
 1/sqrt:    1.45219863E-04
 qrsqrt:    1.42900055E-04
Benchmarking elapsed time in seconds for 100000 evaluations, averaged over 100 repetitions
 1/sqrt:    4.25589446E-04
 qrsqrt:    7.65660079E-04
Benchmarking elapsed time in seconds for 100000 evaluations, averaged over 100 repetitions
 1/sqrt:    1.09429995E-04
 qrsqrt:    7.45430880E-04

Nope! If you dig into the disassembly, it turns out that gfortran on my machine makes use of the sqrtss SSE instruction for square-root. Evidently the additional division instruction to evaluate 1/sqrt(x) still blows the hand-spun (and less accurate!) version out of the water at most optimization levels.

Remember, kids: just because it was fast in the 90’s doesn’t mean it still is! That said, it’s still a fun bit of code to pick apart and understand how it works.

4 Likes

You can use the -mfpmath=387 option to force gfortran to use the old circuit, for fun. By default, gfortran uses -mfpmath=sse

The problem with benchmarks is that it is often times wrong to draw any general conclusion based on limited results. Your qrsqrt is twice as fast as 1/sqrt(x) on my system at all optimizations -O1 through -O3. Looking at the generated assembly shows that 1/sqrt(x) is converted to fsqrt followed by fdivrs.

A more important question to look at may be accuracy of the computation. A simple modification to your code shows that qrsqrt has a maximum relative error of 14697.6 in units of epsilon(x) for the random number samples I grabbed; while 1/sqrt has a relative error of 0.49544 for the same set of random numbers.

2 Likes

Some interesting things here:

@vmagnin, your assertion is only partly correct. On at least i?-86--freebsd, gfortran will use the i387 by default over the sse instruction set. On x86_64--freebsd, gfortran selects sse instruction set in preference to i387.

1 Like

Reciprocal square roots are often used as an intermediate step in the calculation of regular square roots, so benchmarking sqrt() vs. x*rsqrt(x) would also be informative.

(Optional old guy anecdote: the Cray-1 didn’t have a floating-point division instruction; instead, it had a “reciprocal approximation” ROM table look-up instruction and a variant of floating-point multiplication that performed a Newton-Raphson refinement iteration. Both of these could be implemented in a pipelined fashion, so there were vector versions of these instructions, and a vectorized FP division was a three-instruction two-chime sequence. On the Cray-2/-3, these were augmented with variants that performed the same technique to compute a reciprocal square root via a table look-up for approximation and yet another variant of floating-point multiplication for N-R refinement, so SQRT was also three instructions. The numerical accuracy wasn’t perfect, though. The reciprocal look-up tables were loaded by the Cray-2 boot sequence, and you could actually change numeric results at your site by tweaking them.)