Fastmath library

I was trying out the fastmath library (here). Thanks to @FedericoPerini for providing such a great public good! I ran the program test_fastmath and I have a question regarding the interpretation of the results. After compiling with ifx -O3 and running the executable, I got the following (I report only the test for flog):

*** fast log(x) test ***
   spline(5): average time =    1.5625 ns/eval, speed-up= 2.50X, relative error=   2.57799E-006
   spline(3): average time =    0.7812 ns/eval, speed-up= 5.00X, relative error=   3.76138E-005
   intrinsic: average time =    4.6875 ns/eval, speed-up= 0.83X, relative error=   0.00000E+000

I think speed-up is defined as eval time intrinsic / eval time for package, hence a value greater than one means that flog is faster than log?
I don’t understand why the speed-up of the intrinsic is 0.83 though? Shouldn’t it be always one by construction?

Thanks!

2 Likes

Interesting result @aledinola, thank you for posting it!

The only difference between the initial intrinsics loop and the evaluation loop is the presence of a select case construct. I would guess that ifx has a slight overhead in this case, or perhaps, that it is not vectorizing that loop when the select case is present, or that the CPU is experiencing a different thermal load (are you on a laptop?). May be worth investigating with more verbose compiler flags.

Also, please consider that the cpu_time intrinsic may not exactly represent what you think. For example on Apple hardware, I found that cpu_time does not advance when your program is just waiting for GPU instructions to complete. So in that case, if you want to measure the actual wall-time improvement, it may be worth switching to system_clock that will give you the exact time snapshot.

1 Like

Wow. Here are lots of things to comment on.

First, by testing it, by running the bundled tests, all but one of the “fast” algorithms proved to be slower than their intrinsic counterparts on my system. Only the 1/sqrt(x) Quake3 was faster than the intrinsic variant:

$ gfortran-12 -O3 -march=native src/fast_exp.f90 src/fast_log.f90 src/fast_rsqrt.f90 test/test.f90  -o fastmath_test
$ ./fastmath_test 
 *** fast exp(x) test ***
 quintic spline accuracy: average time =    5.1672 ns/eval, speed-up= 0.14X, relative error=   1.11590E-006
   cubic spline accuracy: average time =    5.1677 ns/eval, speed-up= 0.14X, relative error=   2.41297E-004
               intrinsic: average time =    0.5399 ns/eval, speed-up= 1.38X, relative error=   0.00000E+000
       linear   accuracy: average time =    5.1171 ns/eval, speed-up= 0.15X, relative error=   1.88894E-002
       degree 2 accuracy: average time =    5.3753 ns/eval, speed-up= 0.14X, relative error=   1.76672E-003
       degree 3 accuracy: average time =    5.3683 ns/eval, speed-up= 0.14X, relative error=   1.22627E-004
       degree 4 accuracy: average time =    5.4069 ns/eval, speed-up= 0.14X, relative error=   7.11472E-006
       degree 5 accuracy: average time =    5.6207 ns/eval, speed-up= 0.13X, relative error=   3.52911E-007
 *** fast log(x) test ***
               spline(5): average time =    0.9103 ns/eval, speed-up= 0.78X, relative error=   2.51242E-006
               spline(3): average time =    0.7982 ns/eval, speed-up= 0.89X, relative error=   3.68552E-005
               intrinsic: average time =    0.7085 ns/eval, speed-up= 1.00X, relative error=   0.00000E+000
 *** fast 1/sqrt(x) test ***
               intrinsic: average time =    0.9051 ns/eval, speed-up= 0.98X, relative error=   0.00000E+000
                  Quake3: average time =    0.5433 ns/eval, speed-up= 1.64X, relative error=   9.35789E-004
[fastmath] 3 test completed: 3 passed, 0 failed.
STOP 0

That was also the case with ifx. Why? I will make some guesses:

Many of these operations exist in optimized, vectorized libraries (Intel SVML - Short Vector Math Library, Glibc libmvec or ARM math libraries) that will be used by the compiler if you give the right compiler flags (read your manual). By calling a specific, optimized method explicitly you prevent these libraries from being used.

Also, for some operations there exist optimized hardware instructions. On x86 for instance there is the instructions rsqrtss and rsqrtps (plus many more) for the reciprocal of a square root - and these will usually always be faster than something you wrote yourself. Using the intrinsic SQRT in the code allows the compiler to choose these operations when possible (and allowed by the compiler flags - again: read the manual).

I know for instance that gfortran is quite conservative in the choice of optimizations by default. It will not use rsqrtps in place of 1.0/SQRT(x) unless you give the flag -ffast-math, since rsqrtps is an approximation with less precision than a square root and a divide. Intel, on the other hand, will happily use that instruction with just -O3, unless you also set -fp-model=precise. This also needs to be taken into account when doing benchmarks.

1 Like

Looking more into the algorithms and the results, I also observe that most of these “fast algorithms” give less accuracy than single precision, yet is benchmarked against double precision counterparts. This is highly misleadnig.

Take again the reciprocal of the square root as an example. The relative error of the Quake3 “fast” algorithm is nearly 0.001 - this is a huge number and far less accuracy than a single precision evaluation.

Adding a last test to the suite - just using single precision gives me the following results:

$ gfortran-12 -O3 -march=native src/fast_exp.f90 src/fast_log.f90 src/fast_rsqrt.f90 test_rsqrt.f90  -o fastmath_test
$ ./fastmath_test 
 *** fast 1/sqrt(x) test ***
               intrinsic: average time =    0.8935 ns/eval, speed-up= 1.02X, relative error=   0.00000E+000
                  Quake3: average time =    0.6110 ns/eval, speed-up= 1.49X, relative error=   9.34360E-004
           intrinsic, SP: average time =    0.2482 ns/eval, speed-up= 3.68X, relative error=   3.11356E-008
[fastmath] 1 test completed: 1 passed, 0 failed.
STOP 0

$ gfortran-12 -O3 -march=native -ffast-math src/fast_exp.f90 src/fast_log.f90 src/fast_rsqrt.f90 test_rsqrt.f90  -o fastmath_test
$ ./fastmath_test 
 *** fast 1/sqrt(x) test ***
               intrinsic: average time =    0.9172 ns/eval, speed-up= 0.97X, relative error=   0.00000E+000
                  Quake3: average time =    0.7124 ns/eval, speed-up= 1.25X, relative error=   9.32439E-004
           intrinsic, SP: average time =    0.1234 ns/eval, speed-up= 7.21X, relative error=   4.06439E-008
[fastmath] 1 test completed: 1 passed, 0 failed.
STOP 0

Wow!!! Without -ffast-math, just using single precision 1.0/SQRT(x) is 3.6 times faster than the double precision alternative, twice as fast as the Quake3 algo, and gives three to four orders of magnitude lower error! With -ffast-math enabled, the speedup is a whopping 7.2 times and the error still orders of magnitude lower.

I do not see any value here.

1 Like

The value is to have convenient benchmarks so you can make informed decisions. Whether any of these “fast” variants are worthwhile is a complicated function of your problem, compiler, and hardware.

2 Likes

Yes I am on a laptop. I will try again when I have access to a desktop PC

The reason why I am testing the fast-math library is because I would like to speed up a routine for computing the logsum in one of my projects (see here for some background info).

My original subroutine using the intrinsic exp and log

subroutine sub_logsum(LogSum,Prob, V,sigma) 
	! DESCRIPTION
	! Calculates the log-sum and choice-probabilities. The computation avoids
	! overflows.
    !See for more info:
	! https://gregorygundersen.com/blog/2020/02/09/log-sum-exp/
	! https://nhigham.com/2021/01/05/what-is-the-log-sum-exp-function/
	! INPUTS
	!   V:     Vector with values of the different choices
	!   sigma: Standard deviation of the taste shock
	! OUTPUTS
	!   LogSum: Log sum of exponentials
	!   Prob:   Choice probabilities

	real(8), intent(in)  :: V(:)
	real(8), intent(in)  :: sigma
	real(8), intent(out) :: LogSum
	real(8), intent(out) :: Prob(:)
	! Local variables:
	integer :: n, id
	real(8) :: mxm, sum_prob
	
	! Body of sub_logsum
	
	n = size(V)
	
	! Check inputs
	if (n==1) then
		call myerror("sub_logsum: Need at least two elements")
	endif
	if (n/=size(Prob)) then
		call myerror("sub_logsum: V and Prob are not compatible")
	endif
	if (sigma<=0d0) then
		call myerror("sub_logsum: sigma must be zero or positive")
	endif
	
	! Maximum over the discrete choice
	id  = maxloc(V,dim=1)
	mxm = V(id)
	
	! Logsum and probabilities
	! a. numerically robust log-sum
	LogSum = mxm + sigma*log(sum(exp((V-mxm)/sigma)))
		
	! b. numerically robust probability
	Prob = exp((V-LogSum)/sigma)
	
	!-----------------------------------------------------------------!
	! Check if outputs are correct. Comment out this section for speed
	if (LogSum/=LogSum) then
		call myerror("sub_logsum: LogSum is either NaN or Inf")
	endif
	if (any(Prob<0d0)) then
		call myerror("sub_logsum: Some elements of Prob are negative")
	endif
	sum_prob = sum(Prob)
	if (abs(sum_prob-1d0)>1d-7) then
        write(*,*) "sum_prob = ", sum_prob
		call myerror("sub_logsum: Prob does not sum to one")
	endif
	Prob = Prob/sum_prob
	!-----------------------------------------------------------------!
end subroutine sub_logsum

And here is the new routine which uses @FedericoPerini’s library:

subroutine sub_logsum_f(LogSum,Prob, V,sigma) 
	! NOTE: sub_logsum_f is copied from sub_logsum and the only difference 
    ! is that it uses the fast_exp and fast_log functions.
    ! DESCRIPTION: see sub_logsum for more information.
	real(8), intent(in)  :: V(:)
	real(8), intent(in)  :: sigma
	real(8), intent(out) :: LogSum
	real(8), intent(out) :: Prob(:)
	! Local variables:
	integer :: n, id
	real(8) :: mxm, sum_prob
	
	! Body of sub_logsum
	
	n = size(V)
	
	! Check inputs
	if (n==1) then
		call myerror("sub_logsum: Need at least two elements")
	endif
	if (n/=size(Prob)) then
		call myerror("sub_logsum: V and Prob are not compatible")
	endif
	if (sigma<0d0) then
		call myerror("sub_logsum: sigma must be zero or positive")
	endif
	
	! Maximum over the discrete choice
	id  = maxloc(V,dim=1)
	mxm = V(id)
	
	! Logsum and probabilities
	! a. numerically robust log-sum
	LogSum = mxm + sigma*flog(sum(fexp((V-mxm)/sigma)))
		
	! b. numerically robust probability
	Prob = fexp((V-LogSum)/sigma)
	
	!-----------------------------------------------------------------!
	! Check if outputs are correct. Comment out this section for speed
	if (LogSum/=LogSum) then
		call myerror("sub_logsum_f: LogSum is either NaN or Inf")
	endif
	if (any(Prob<0d0)) then
		call myerror("sub_logsum_f: Some elements of Prob are negative")
	endif
	sum_prob = sum(Prob)
	if (abs(sum_prob-1d0)>1d-5) then
        write(*,*) "sum_prob = ", sum_prob
		call myerror("sub_logsum_f: Prob does not sum to one")
	endif
	Prob = Prob/sum_prob
	!-----------------------------------------------------------------!
end subroutine sub_logsum_f

This routine (sub_logsum or sub_logsum_f) is the real bottleneck of my code. Unfortunately, it seems that both on ifort and ifx using fast-math actually slows down a bit the code with respect to the version that relies on the intrinsic log and exp.

I wrote a test program to compare the performance of sub_logsum and sub_logsum_f

module mymod
    use fast_exp
    use fast_log
    implicit none 
    
    contains 
    
    subroutine myerror(string)

    implicit none
    character(len=*), intent(in) :: string
    write(*,*) "ERROR: ", string
    write(*,*) "Program will terminate.."
    pause
    stop

    end subroutine myerror
    
    subroutine sub_logsum(LogSum,Prob, V,sigma) 
		! DESCRIPTION
		! Calculates the log-sum and choice-probabilities. The computation avoids
		! overflows.
        !See for more info:
		! https://gregorygundersen.com/blog/2020/02/09/log-sum-exp/
		! https://nhigham.com/2021/01/05/what-is-the-log-sum-exp-function/
		! INPUTS
		!   V:     Vector with values of the different choices
		!   sigma: Standard deviation of the taste shock
		! OUTPUTS
		!   LogSum: Log sum of exponentials
		!   Prob:   Choice probabilities

		real(8), intent(in)  :: V(:)
		real(8), intent(in)  :: sigma
		real(8), intent(out) :: LogSum
		real(8), intent(out) :: Prob(:)
		! Local variables:
		integer :: n, id
		real(8) :: mxm, sum_prob
		
		! Body of sub_logsum
		
		n = size(V)
		
		! Check inputs
		if (n==1) then
			call myerror("sub_logsum: Need at least two elements")
		endif
		if (n/=size(Prob)) then
			call myerror("sub_logsum: V and Prob are not compatible")
		endif
		if (sigma<=0d0) then
			call myerror("sub_logsum: sigma must be positive")
		endif
		
		! Maximum over the discrete choice
		id  = maxloc(V,dim=1)
		mxm = V(id)
		
		! Logsum and probabilities
		! a. numerically robust log-sum
		LogSum = mxm + sigma*log(sum(exp((V-mxm)/sigma)))
			
		! b. numerically robust probability
		Prob = exp((V-LogSum)/sigma)
		
		!-----------------------------------------------------------------!
		! Check if outputs are correct. Comment out this section for speed
		if (LogSum/=LogSum) then
			call myerror("sub_logsum: LogSum is either NaN or Inf")
		endif
		if (any(Prob<0d0)) then
			call myerror("sub_logsum: Some elements of Prob are negative")
		endif
		sum_prob = sum(Prob)
		if (abs(sum_prob-1d0)>1d-7) then
            write(*,*) "sum_prob = ", sum_prob
			call myerror("sub_logsum: Prob does not sum to one")
		endif
		Prob = Prob/sum_prob
		!-----------------------------------------------------------------!
    end subroutine sub_logsum
    
    subroutine sub_logsum_f(LogSum,Prob, V,sigma) 
		! NOTE: sub_logsum_f is copied from sub_logsum and the only difference 
        ! is that it uses the fast_exp and fast_log functions.
        ! DESCRIPTION: see sub_logsum for more information.
		real(8), intent(in)  :: V(:)
		real(8), intent(in)  :: sigma
		real(8), intent(out) :: LogSum
		real(8), intent(out) :: Prob(:)
		! Local variables:
		integer :: n, id
		real(8) :: mxm, sum_prob
		
		! Body of sub_logsum
		
		n = size(V)
		
		! Check inputs
		if (n==1) then
			call myerror("sub_logsum: Need at least two elements")
		endif
		if (n/=size(Prob)) then
			call myerror("sub_logsum: V and Prob are not compatible")
		endif
		if (sigma<=0d0) then
			call myerror("sub_logsum: sigma must be positive")
		endif
		
		! Maximum over the discrete choice
		id  = maxloc(V,dim=1)
		mxm = V(id)
		
		! Logsum and probabilities
		! a. numerically robust log-sum
		LogSum = mxm + sigma*flog(sum(fexp((V-mxm)/sigma)))
			
		! b. numerically robust probability
		Prob = fexp((V-LogSum)/sigma)
		
		!-----------------------------------------------------------------!
		! Check if outputs are correct. Comment out this section for speed
		if (LogSum/=LogSum) then
			call myerror("sub_logsum_f: LogSum is either NaN or Inf")
		endif
		if (any(Prob<0d0)) then
			call myerror("sub_logsum_f: Some elements of Prob are negative")
		endif
		sum_prob = sum(Prob)
		if (abs(sum_prob-1d0)>1d-5) then
            write(*,*) "sum_prob = ", sum_prob
			call myerror("sub_logsum_f: Prob does not sum to one")
		endif
		Prob = Prob/sum_prob
		!-----------------------------------------------------------------!
    end subroutine sub_logsum_f

end module mymod
!===============================================================================!
    
program test_logsum_f
    use mymod, only: sub_logsum, sub_logsum_f
    implicit none
	real(8), allocatable :: V(:), Prob(:)
    real(8) :: LogSum, sigma, time, time_f, c_start, c_end
    integer, parameter :: ntest = 20000000
    integer :: i,ii
    
    ! Assign values
    sigma = 0.1d0
    V = [1d0, 2d0, 3d0, 4d0]
    
    ! Call sub_logsum
    allocate(Prob(size(V)))
    
    write(*,*) "Call sub_logsum"
    
    call cpu_time(c_start)
	do i=1,ntest 
        call sub_logsum(LogSum,Prob, V,sigma)
    enddo
    call cpu_time(c_end)
    time = c_end-c_start
    
    write(*,*) "time = ", time
    
    write(*,'(X,A,F13.10)') "LogSum = ", LogSum
    write(*,*) "Prob = "
    do ii = 1,size(Prob)
        write(*,'(F13.10)') Prob(ii)
    enddo
    
    deallocate(Prob)
    
    write(*,*) "------------------------------------------"
    write(*,*) "Call sub_logsum_f"
    
    allocate(Prob(size(V)))
    
    call cpu_time(c_start)
    do i=1,ntest 
        call sub_logsum_f(LogSum,Prob, V,sigma)
    enddo
    call cpu_time(c_end)
    time_f = c_end-c_start
    
    write(*,*) "time_f = ", time_f
    
    write(*,'(X,A,F13.10)') "LogSum = ", LogSum
    write(*,*) "Prob = "
    do ii = 1,size(Prob)
        write(*,'(F13.10)') Prob(ii)
    enddo
    
    write(*,*) "------------------------------------------"
    write(*,'(X,A)') "logsum with fast math vs logsum:"
    write(*,'(X,A,F13.10)') "time_f/time = ", time_f/time
    
    pause 
    
end program test_logsum_f

After compiling with ifort and /O3 the results are

Call sub_logsum
 time =   0.718750000000000
 LogSum =  4.0000045401
 Prob =
 0.0000000000
 0.0000000021
 0.0000453979
 0.9999546001
 ------------------------------------------
 Call sub_logsum_f
 time_f =   0.953125000000000
 LogSum =  4.0000043780
 Prob =
 0.0000000000
 0.0000000021
 0.0000453979
 0.9999546000
 ------------------------------------------
 logsum with fast math vs logsum:
 time_f/time =  1.3260869565

The results with ifx and /O3 are:

Call sub_logsum
 time =   0.968750000000000
 LogSum =  4.0000045401
 Prob =
 0.0000000000
 0.0000000021
 0.0000453979
 0.9999546001
 ------------------------------------------
 Call sub_logsum_f
 time_f =    1.04687500000000
 LogSum =  4.0000043780
 Prob =
 0.0000000000
 0.0000000021
 0.0000453979
 0.9999546000
 ------------------------------------------
 logsum with fast math vs logsum:
 time_f/time =  1.0806451613

NOTE: To compile my test file, you have to include the fast-math modules fast_exp.f90 and fast_log.f90 available here.

If you instrument the code

you’ll see that the 4 important operations in your routine take approximately the same % of time:

intrinsic: times=   2.88051367E-02   2.99785733E-02   2.95358747E-02   2.89158672E-02

Even if you deleted the exponential sum altogether, you’d only obtain a ~25% speedup w.r.t. this.

When the pie is evenly sliced, it means there isn’t much more you can do to speed it up further, unless you change paradigm. Options to consider may be to use 32-bit arithmetic (you should check what’s the max sigma that does not overflow the exponential), OpenMP, MPI, or a GPU.

2 Likes

Thanks for profiling the code! I have several “checks” to make sure that the code is ok, but these checks slow down the code considerably (as your examples show)

Probably the best solution is to comment out the checks, once I am confident about the code