A (mis)adventure in optimization

I recently watched a nice video explaining the Quake III fast inverse square root algorithm. Since I don’t do much bit fiddlery in my real work, I had some fun spinning up a Fortran translation:

module quick_recip_sqrt
  use, intrinsic :: iso_fortran_env, only: real32, int32
  implicit none
contains
  pure elemental function qrsqrt(x) result(y)
    real(real32), intent(in) :: x
    real(real32) :: y

    integer(int32) :: i
    real(real32) :: x2
    real(real32), parameter :: threehalfs = 1.5_real32

    x2 = x*0.5_real32
    y = x
    i = transfer(y, i)
    i = int(z'5f3759df', int32) - ishft(i, -1)
    y = transfer(i, y)
    y = y * (threehalfs - x2*y*y)
  end function
end module

But is it actually fast?

use, intrinsic :: iso_fortran_env, only: real32
use quick_recip_sqrt, only: qrsqrt
implicit none

integer, parameter :: n = 100000, k = 100
integer :: i
real(real32), allocatable :: x(:), y(:)
real(real32) :: tick, tock, t

allocate (x(n))
allocate (y, mold=x)

print "(a,i0,a,i0,a)", "Benchmarking elapsed time in seconds for ", n, " evaluations, averaged over ", k, " repetitions"

t = 0.
do i = 1, k
    call random_number(x)
    y = 0.
    call cpu_time(tick)
    y = 1/sqrt(x)
    call cpu_time(tock)
    t = t + tock-tick
end do
t = t/k
print *, "1/sqrt: ", t

t = 0.
do i = 1, k
    call random_number(x)
    y = 0.
    call cpu_time(tick)
    y = qrsqrt(x)
    call cpu_time(tock)
    t = t + tock-tick
end do
t = t/k
print *, "qrsqrt: ", t

end

Here’s what I find:

(base) [nshaffer:~/tmp/sqrt]$ gfortran --version
GNU Fortran (Homebrew GCC 10.2.0) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

(base) [nshaffer:~/tmp/sqrt]$ for i in {0..3}
> do
> gfortran -std=f2018 -O$i quick_recip_sqrt.f90 test.f90 && ./a.out
> done
Benchmarking elapsed time in seconds for 100000 evaluations, averaged over 100 repetitions
 1/sqrt:    1.48980020E-04
 qrsqrt:    2.23302911E-03
Benchmarking elapsed time in seconds for 100000 evaluations, averaged over 100 repetitions
 1/sqrt:    1.45219863E-04
 qrsqrt:    1.42900055E-04
Benchmarking elapsed time in seconds for 100000 evaluations, averaged over 100 repetitions
 1/sqrt:    4.25589446E-04
 qrsqrt:    7.65660079E-04
Benchmarking elapsed time in seconds for 100000 evaluations, averaged over 100 repetitions
 1/sqrt:    1.09429995E-04
 qrsqrt:    7.45430880E-04

Nope! If you dig into the disassembly, it turns out that gfortran on my machine makes use of the sqrtss SSE instruction for square-root. Evidently the additional division instruction to evaluate 1/sqrt(x) still blows the hand-spun (and less accurate!) version out of the water at most optimization levels.

Remember, kids: just because it was fast in the 90’s doesn’t mean it still is! That said, it’s still a fun bit of code to pick apart and understand how it works.

6 Likes

You can use the -mfpmath=387 option to force gfortran to use the old circuit, for fun. By default, gfortran uses -mfpmath=sse

Some interesting things here: