Fast_math: A collection of functions for fast number crunching using Fortran

Dear all,

Following the steps of fastmath by @FedericoPerini, I would like to introduce

This gallery contains several functions, some are own work, others from discussions here in the forum and other forums in the net.

Parts of the original work from @FedericoPerini have been merged here, you will also find implementations for fast sum and fast dot_product which are not only faster but also more accurate than the intrinsic (at least with gfortran and ifort under my available hardware :slight_smile: ).

You will find a response file for you to test easily with

>fpm @testgfortran
>fpm @testifort
>fpm @testifx

A teaser:

================================================================
  SUM on single precision values with a mask
  sum_quad: <time> =    3.9091 ns/eval, speed-up= 0.99X, rel. error=      0.0000E+00
 intrinsic: <time> =    3.8577 ns/eval, speed-up= 1.00X, rel. error=      6.8401E-05
 fsum_pair: <time> =    4.1589 ns/eval, speed-up= 0.93X, rel. error=      1.3716E-05
      fsum: <time> =    0.4318 ns/eval, speed-up= 8.93X, rel. error=      2.1368E-05

================================================================
  dot product on single precision values
  dot_quad: <time> =    1.2195 ns/eval, speed-up= 0.94X, rel. error=      0.0000E+00
 intrinsic: <time> =    1.1446 ns/eval, speed-up= 1.00X, rel. error=      3.6655E-04
     fprod: <time> =    0.1533 ns/eval, speed-up= 7.47X, rel. error=      7.5047E-07

================================================================
  Fast hyperbolic
 ftanh r64: <time> =    2.7130 ns/eval, speed-up= 6.95X, rel. error=      2.5361E-07
 ftanh r64: <time> =    3.7820 ns/eval, speed-up= 5.26X, rel. error=      2.4636E-07
  ferf r32: <time> =    1.8050 ns/eval, speed-up= 8.74X, rel. error=      3.7337E-04
  ferf r64: <time> =    4.2750 ns/eval, speed-up= 3.77X, rel. error=      3.7337E-04

Looking forward to your comments and contributions!

13 Likes

I guess the question is why are the compiler vendors still using slower and less accurate methods for their intrinsic functions when its been demonstrated that faster and more accurate methods exist. I can only guess that the old “backward compatability” and “would break user codes (ie results)” mantra is the reason.

I guess so! for the trigonometric and other functions I’m ok with the compiler giving by default the most accurate version, and giving the user the responsibility of using a “fast and dirty” version if they want to… for the sum and dot_product, I’m indeed impressed that it was possible to be both faster and more accurate at the same time compared to the compilers, would have expected that only one was possible.

2 Likes

Not an expert, but I would bet that ensuring IEEE arithmetics compliance for all edge cases is also part of the answer.

2 Likes

For the intrinsic sum at least I don’t believe there are many guarantees. There’s been a lengthy discussion before here: Some Intrinsic SUMS

The standard appears to wash it’s hands by stating the result is processor-dependent:

The result of SUM (ARRAY) has a value equal to a processor-dependent approximation to the sum of all the elements of ARRAY or has the value zero if ARRAY has size zero

(Processor being jargon for the Fortran compiler in typical scenarios.)

Especially for global sums across multiple ranks (MPI) or images (coarrays), enhanced accurary is needed to get consistent results. The following paper discusses the issue: Redirecting

The floating point addition being not associative contrarily to the mathematical addition, who knows what is the best order to compute a given sum? It depends on the terms in the sum.

We may first sort the numbers then sum them starting from the smallest ones, but it would strongly impede speed.

Yes, I participated in that discussion and it motivated me to seek an approach that would give some kind of compromise between the accuracy and speed issues.

This, or having a fully random walk before adding up could minimize the error, yet indeed the price in computational time would be too high.

That’s why I came up with the idea of “chunking” the sum/dot_product within a small vector that would separate sequential values, yet keep crunching incoming batches of memory as they come and avoid needing to jump in memory or sorting. (The idea came in part from one of the papers cited in that thread )

This is extremely important! A key use-case is the dot product required in iterative solvers that has to be synchronized at each iteration. I’ll take a read to the paper!! Thanks :slight_smile: … I could imagine that a vectorized approach could also be extended to the reduction step for distributed sum/dot product to be as consistent as possible.

3 Likes

I personally don’t mind with some of the math operations being little bit slow, as long as they produce acceptable results.

Most, if not all, numerical codes are heavily memory bound, and the future trend in improving performance of numerical codes, would be heavily biased towards increasing throughput of the codes with parallelization, rather than reducing latency of codes running on each core.

Here a challenge if anyone has an idea:

The frsqrt function in the library based on the Quake III arena algorithm is actually not that fast compared to what the compiler can do (for single precision, I see performances between 0.9 and 1.2X :man_shrugging:) . Looking a bit through the net found this repo: GitHub - Xerbo/simd_fastinvsqrt: SIMD (SSE) implementation of the infamous Fast Inverse Square Root algorithm from Quake III Arena. that compares some versions and shows that there is an intrinsic SSE implementation

_mm_rsqrt_ps(x)

which is the fastest option.

Reading around I do not find ways to call such intrinsic directly in Fortran. Are you aware of any way to achieve that efficiently? Or by any chance are the compilers already detecting a call to 1/sqrt() and replacing for that?

You may be interested in my own little study of the inverse square root. The thread is A (mis)adventure in optimization. In that case the Quake III rsqrt was both far less accurate and usually slower than the SSE instruction.

3 Likes

Nice work @nshaffer - yes the advantage for some of those function is not so clear anymore on modern CPUs. fast 1/sqrt does compare pretty favorably(~2x speed-up) if AVX is turned on though. I would bet it is due to using fma in the last operation:

vfnmadd213ss xmm0,xmm1,DWORD PTR [rip+0x555]        # 40219c <options.19.2+0x6c>
1 Like

Nice work @nshaffer !! So you motivated me to push a bit more and reworked the test and found interesting results:

code + testing
module fast_rsqrt
    use, intrinsic :: iso_fortran_env, only: sp=>real32, dp=>real64
    implicit none
    private
    
    public :: frsqrt

    interface frsqrt
        !! Retranscript of the original Quake III Arena, see https://en.wikipedia.org/wiki/Fast_inverse_square_root
        !! for pure reference
        module procedure frsqrt_dp
        module procedure frsqrt_sp
    end interface
    
contains

    elemental function frsqrt_sp(x) result(y)
        integer, parameter :: wp = sp
        real(wp), intent(in) :: x
        real(wp) :: y
        !-- Internal Variables
        real(wp)    :: x2,y2
        integer(wp) :: i
        integer(wp), parameter :: magic = int(Z'5f3759df',kind=wp)
        !-------------------------------------------------
        x2 = 0.5_wp*x
        i  = transfer(x,i)
        i  = magic - shiftr(i,1)
        y2 = transfer(i,y)

        ! Perform one Newton-Raphson step
        y  = y2*(1.5_wp-x2*y2*y2)

    end function frsqrt_sp

    elemental function frsqrt_dp(x) result(y)
        integer, parameter :: wp = dp
        real(wp), intent(in) :: x
        real(wp) :: y
        !-- Internal Variables
        real(wp)    :: x2,y2
        integer(wp) :: i
        integer(wp), parameter :: magic = 6910469410427058089_wp
        !-------------------------------------------------
        x2 = 0.5_wp*x
        i  = transfer(x,i)
        i  = magic - shiftr(i,1)
        y2 = transfer(i,y)

        ! Perform one Newton-Raphson step
        y  = y2*(1.5_wp-x2*y2*y2)

    end function frsqrt_dp
    
end module fast_rsqrt

program main
    use, intrinsic :: iso_fortran_env, only: sp=>real32, dp=>real64
    use fast_rsqrt
    !> Internal parameters and variables
    integer, parameter :: n = 1e5, ncalc = 2, niter = 10000
    integer :: iter, i
    real(dp) :: time(0:ncalc), times_tot(ncalc), err
    1 format(a10,': <time> = ',f9.4,' ns/eval, speed-up=',f5.2,'X, rel. error=',es16.4)
    !====================================================================================
    print *,""
    print *,"================================================================"
    print *," Fast rsqrt = 1/sqrt(x)"
    block
        integer, parameter :: wp=sp
        real(wp), allocatable :: x(:) , y(:), yref(:)
        real(kind=wp) :: tolerance = 1e-2_wp
        !> define a log space
        allocate( x(n) , y(n), yref(n) )
        call random_number(x)
        x = 10._wp**(-10*(1._wp-x) + 10*x)
        times_tot(:) = 0
        do iter=1,niter
            call cpu_time(time(0))
            yref = 1._wp/sqrt(x); call cpu_time(time(1))
            y = frsqrt(x)  ; call cpu_time(time(2))
            times_tot(:) = times_tot(:) + time(1:ncalc) - time(0:ncalc-1)
        end do
        err = sqrt( sum(( y - yref )**2) ) / sqrt( sum(( yref )**2) )
        print *, "time default: ",times_tot(1),"time frsqrt: ",times_tot(2)
        print 1, "frsqrt r32" , 1e9*times_tot(2)/(niter*n), times_tot(1)/times_tot(2), err
    end block
    block
        integer, parameter :: wp=dp
        real(wp), allocatable :: x(:) , y(:), yref(:)
        real(kind=wp) :: tolerance = 1e-2_wp
        !> define a log space
        allocate( x(n) , y(n), yref(n) )
        call random_number(x)
        x = 10._wp**(-200*(1._wp-x) + 200*x)
        times_tot(:) = 0
        do iter=1,niter
            call cpu_time(time(0))
            yref = 1._wp/sqrt(x); call cpu_time(time(1))
            y = frsqrt(x)  ; call cpu_time(time(2))
            times_tot(:) = times_tot(:) + time(1:ncalc) - time(0:ncalc-1)
        end do
        err = sqrt( sum(( y - yref )**2) ) / sqrt( sum(( yref )**2) )
        print *, "time default: ",times_tot(1),"time frsqrt: ",times_tot(2)
        print 1, "frsqrt r64" , 1e9*times_tot(2)/(niter*n), times_tot(1)/times_tot(2), err
    end block

end program

Results from compiler explorer

ifort 19 -O3 -xHost

 time default:   0.112649999999986      time frsqrt:   0.111316999999994     
frsqrt r32: <time> =    0.1113 ns/eval, speed-up= 1.01X, rel. error=      1.1156E-03
 time default:    1.49099300000007      time frsqrt:   0.470805999999973     
frsqrt r64: <time> =    0.4708 ns/eval, speed-up= 3.17X, rel. error=      1.0753E-03

ifx2023 -O3 -xHost

 time default:   0.322242999999995      time frsqrt:   0.317305999999985     
frsqrt r32: <time> =    0.3173 ns/eval, speed-up= 1.02X, rel. error=      1.1156E-03
 time default:    2.53335500000001      time frsqrt:   0.726347999999955     
frsqrt r64: <time> =    0.7263 ns/eval, speed-up= 3.49X, rel. error=      1.0753E-03

gfortran 13.2 -Ofast -march=native

 time default:   0.26977699999999727      time frsqrt:   0.28963799999999806     
frsqrt r32: <time> =    0.2896 ns/eval, speed-up= 0.93X, rel. error=      1.1054E-03
 time default:    1.8799389999999838      time frsqrt:   0.69013799999999925     
frsqrt r64: <time> =    0.6901 ns/eval, speed-up= 2.72X, rel. error=      1.0654E-03

gfortran 13.2 -O3 -march=native

time default:   0.30656400000004769      time frsqrt:   0.13649199999999095     
frsqrt r32: <time> =    0.1365 ns/eval, speed-up= 2.25X, rel. error=      1.1029E-03
 time default:   0.94306099999994464      time frsqrt:   0.27456500000000195     
frsqrt r64: <time> =    0.2746 ns/eval, speed-up= 3.43X, rel. error=      1.0285E-03

same tendency with my local machine

So basically:

The double precision frsqrt version is consistently >2X faster

For the single precision the story is different, with ifort or ifx, they already provide the SSE instructions with -O3

ifort

        vrsqrtps  ymm5, ymm2 

ifx:

        vrsqrt14ps      ymm1, ymm0

for gfortran one has to compile with -Ofast to also get the fastest intrinsic, otherwise with -O3, frsqrt is still faster

gfortran with -Ofast

        vrsqrtps        ymm1, YMMWORD PTR [rdi+rax]

I guess it is instructive to have it there if only just as fun experiment and for learning :slight_smile:

Given the speed-up, I think it would make sense to run at least 2 Newton steps in double precision, there’s minimal penalty but a significant accuracy improvement

1 Like