Compiling only part of code with -ffast-math

I’ve noticed that big speedups are possible with -ffast-math in some circumstances. For example, doing

x = 10.0_dp**x

can be ~4 times faster with -ffast-math. This seems unique to 10.0**x; the speedup is not significant for something like 7.6**x.

But I don’t want to compile my whole code with -ffast-math, because I worry about messing some parts up. I just want to write a special “fast” 10**x subroutine, compiled with -ffast-math, and then call this special subroutine from another piece of code compiled without -ffast-math.

However, this does not work. Everything compiles and links just fine, but I do not get the -ffast-math speedup.

I’m using an M1 MacBook Air. My compiler info:

GNU Fortran (Homebrew GCC 11.2.0_3) 11.2.0

Why isn’t it possible to compile only part of the code with -ffast-math? Or if it is possible, how do I do it?


This is possibly just an issue with gcc not yet being well optimized for M1 Macs. I’m finding that clang can compile identical C code that is as fast as the -ffast-math version.


Not sure how you would do this with Xcode etc., but with old-fashioned makefiles you
can always create a special rule for just the routines you want to compile with
-ffast-math. Here is an example of a rules file I include in my base Makefiles. Note
the special rule for fast_routine.f90:


SHELL = /bin/sh

.SUFFIXES:
.SUFFIXES: .o .F90 .f90 .f .F .c

.F90.o:
        $(FC) -c $(DEFINES) $(INCLUDES) $(FCFLAGS) $< -o $@

.f90.o:
        $(FC) -c $(INCLUDES) $(FCFLAGS) $< -o $@

.F.o:
        $(FC) -c $(DEFINES) $(INCLUDES) $(FCFLAGS) $< -o $@

.f.o:
        $(FC) -c $(INCLUDES) $(FCFLAGS) $< -o $@

.c.o:
        $(CC) -c $(INCLUDES) $(CCFLAGS) $< -o $@


fast_routine.o: fast_routine.f90
        $(FC) -c $(INCLUDES) -ffast-math fast_routine.f90


clean:
        rm -rf ./*.o ./*.mod

This will override the default

.f90.o:

rule.


OK, I use CMake. Something like this works well:

add_library(fastmath fastmath.f90)
target_compile_options(fastmath PRIVATE -ffast-math)

add_library(normalmath normalmath.f90)

add_executable(main main.f90)
target_link_libraries(main fastmath normalmath)
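For reference, fastmath.f90 here could be as small as the following sketch (the module and routine names are placeholders I made up; only this file gets -ffast-math):

module fastmath
  implicit none
  integer, parameter :: dp = selected_real_kind(p=15)
contains
  ! 10**x applied in place to an array; compile this file with -ffast-math
  subroutine pow10_fast(x)
    real(dp), intent(inout) :: x(:)
    x = 10.0_dp**x
  end subroutine pow10_fast
end module fastmath

main.f90 then simply does use fastmath and calls pow10_fast, while itself being compiled without -ffast-math.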

Very strange, as it is straightforward to check that r=7.6d0**x is translated identically to r=10.d0**x, i.e. by calling exp with the argument x multiplied by a compile-time-calculated log(base):

LFB0:
	movsd	lC0(%rip), %xmm0
	mulsd	(%rdi), %xmm0
	jmp	_exp
...
lC0:
	.long	-909346034
	.long	1073756581

Actually I am surprised it requires -ffast-math to make this optimization. I guess that pow(y,x) does just the same internally, i.e. exp(x*log(y)), so with a constant base the optimizer could do it anyway with the -O3 option.

It is strange. Overall, though, I have found that you can optimize a**x, when a is a single real and x is an array of reals, by doing

  subroutine pow_fast(a, x)
    real(dp), intent(in) :: a
    real(dp), intent(inout) :: x(:)
    real(dp) :: c
    c = log(a)
    x = exp(x*c)
  end subroutine

The simpler, slower version (see below) must compute log(a) many times, for no reason:

  subroutine pow_slow(a, x)
    real(dp), intent(in) :: a
    real(dp), intent(inout) :: x(:)
    x = a**x
  end subroutine

The speed difference is about 3x on my computer with gfortran 11.


I find a smaller but still substantial difference when creating a .pow. operator that computes only one log. For the code

module pow_mod
implicit none
integer, parameter :: dp = kind(1.0d0)
interface operator(.pow.)
   module procedure pow_fast
end interface
contains
pure function pow_fast(x,y) result(x_to_y)
real(kind=dp), intent(in) :: x, y(:)
real(kind=dp)             :: x_to_y(size(y))
real(kind=dp)             :: c
c = log(x)
x_to_y = exp(y*log(x))
end function pow_fast
end module pow_mod
!
program main
use pow_mod, only: dp, operator(.pow.)
use iso_fortran_env, only: compiler_version, compiler_options
implicit none
integer      , parameter   :: n = 3*10**8, ntimes = 3
real(kind=dp)              :: x,sums(2),t(3)
real(kind=dp), allocatable :: y(:)
logical      , parameter   :: compute_sums = .true.
allocate (y(n))
call random_number(x)
call random_number(y)
call cpu_time(t(1))
if (compute_sums) sums(1) = sum(x**y)
call cpu_time(t(2))
if (compute_sums) sums(2) = sum(x .pow. y)
call cpu_time(t(3))
print "(2a)","compiler version: ",trim(compiler_version())
print "(2a)","compiler options: ",trim(compiler_options())
print "(*(a12))","method","intrinsic",".pow."
print "(a12,*(f12.4))","time",t(2:ntimes) - t(:ntimes-1)
if (compute_sums) print "(/,a,*(1x,f0.8))","sums, ratio =",sums,sums(2)/sums(1)
end program main

I get this output on WSL2:

compiler version: GCC version 9.3.0
compiler options: -mtune=generic -march=x86-64 -O3 -fpre-include=/usr/include/finclude/math-vector-fortran.h
      method   intrinsic       .pow.
        time      3.9544      2.3701
sums, ratio = 163748866.22637784 163748866.22637784 1.00000000

compiler version: Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.2.0 Build 20210228_000000
compiler options: -O3 -heap-arrays
      method   intrinsic       .pow.
        time      1.7751      1.0257

sums, ratio = 20339963.10582615 20339963.10582284 1.00000000

should probably be

c = log(x)
x_to_y = exp(y*c)

Yes, as @kargl also noted. The function should be

pure function pow_fast(x,y) result(x_to_y)
real(kind=dp), intent(in) :: x, y(:)
real(kind=dp)             :: x_to_y(size(y))
x_to_y = exp(y*log(x))
end function pow_fast

In an earlier version c was used. Running the modified code, the speeds are about the same – the compilers were optimizing away the unused c variable.


@Beliavsky I get a similar relative speedup for the program you posted.

There is a sequence of two assignments to F in ULP; is that a typo?

I use -ffast-math for solving large sets of linear equations (f = K.x) with an OpenMP skyline solver (no intrinsics) and get about a 2x speed improvement for what is 99% of the run time.
There are many .f90 files, and I mix a variety of compile options, including:
-ffast-math,
-O2 vs -O3,
-fopenmp vs -funroll-loops --param max-unroll-times=2, and
-mno-fma.

Basically there are only a few routines that dictate performance, so most other routines can be compiled with less aggressive options (e.g. those for reading and managing data).

I have error checking on the equation solver solution, recalculating the residual “f - sum(Ke.x)”, where K = sum(Ke), for all equations, and I do not find significant errors using -ffast-math.
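As a rough illustration of the kind of residual check I mean, here is a minimal dense sketch (the real code works with a skyline-stored K and element matrices Ke; the names below are made up):

subroutine report_residual(K, x, f)
  integer, parameter :: dp = selected_real_kind(p=15)
  real(dp), intent(in)  :: K(:,:), x(:), f(:)
  real(dp), allocatable :: err(:)
  err = f - matmul(K, x)   ! residual for each equation
  print "(a,es12.4)", "max |f - K.x|      = ", maxval(abs(err))
  print "(a,es12.4)", "max relative error = ", maxval(abs(err)) / maxval(abs(f))
end subroutine report_residual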


I found an interesting calculation of pi using Archimedes’ method of dividing the sectors of a circle: start with 6 sectors formed by equilateral triangles, then repeatedly halve the sectors to calculate the circumference as an estimate of pi (using radius = 0.5).
I tried 4 kinds, with precision = 6, 15, 18 and 30, to estimate pi as the sum of the sector lengths and to estimate the error from the inscribed and circumscribed polygons.
pi_sector.f90 (1.8 KB)
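In case the attachment is not visible, here is a minimal sketch of the recurrence I am describing (my own reconstruction, not the attached pi_sector.f90): starting from an inscribed hexagon on a circle of radius 0.5, the side of the polygon with twice as many sides follows from the half-angle formula, and the perimeter converges to pi.

program pi_archimedes
  implicit none
  integer, parameter :: dp = selected_real_kind(p=15)
  real(dp) :: r, s, perimeter
  integer  :: n, k
  r = 0.5_dp        ! radius 0.5, so the circumference is pi
  n = 6             ! inscribed hexagon
  s = r             ! hexagon side equals the radius
  do k = 1, 25
    s = sqrt(2.0_dp*r*r - r*sqrt(4.0_dp*r*r - s*s))  ! side of the 2n-gon
    n = 2*n
    perimeter = real(n, dp)*s
    print "(i12,2es24.16)", n, perimeter, perimeter - 4.0_dp*atan(1.0_dp)
  end do
end program pi_archimedes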

I thought this would be a good example to test -ffast-math, using gfortran-64 Ver 11.1 on Windows 7.

The compile options I used are in the following .bat file:

now > %1.log

set options=-g -fimplicit-none -O1 -march=native
gfortran %1.f90 %options% -o %1.exe -v >> %1.log  2>&1

set options >> %1.log
%1.exe >> %1.log

set options=-g -fimplicit-none -O2 -march=native
gfortran %1.f90 %options% -o %1.exe -v >> %1.log  2>&1

set options >> %1.log
%1.exe >> %1.log

set options=-g -fimplicit-none -O2 -march=native -ffast-math
gfortran %1.f90 %options% -o %1.exe -v >> %1.log  2>&1

set options >> %1.log
%1.exe >> %1.log

set options=-g -fimplicit-none -O3 -march=native
gfortran %1.f90 %options% -o %1.exe -v >> %1.log  2>&1

set options >> %1.log
%1.exe >> %1.log

set options=-g -fimplicit-none -O3 -march=native -ffast-math
gfortran %1.f90 %options% -o %1.exe -v >> %1.log  2>&1

set options >> %1.log
%1.exe >> %1.log

notepad %1.log

The approach uses sqrt and calculates an error that approaches epsilon, which I thought might demonstrate -ffast-math “short-cuts”.
After running the test for all 4 kinds, I could not identify any change in the calculated results for any of the compile-option alternatives.

I continue to look for failings of -ffast-math in all the wrong places!
Any suggestions as to where I should look?

I was looking to see if there were problems with rounding for computations with significant variation in exponents, as “a” approaches “r”. I could not identify a problem.

Could you please attach a copy of h.f90 so I can see where the errors occur?

I did try a computation of x ** y, which did show a change with -ffast-math, but the difference appears to be less than epsilon() and occurs only in some cases.
The attached files below show the x ** y test options I tried for different gfortran compile options.

      real x,y,z
      integer i

      x = 0.1
      do i = 75,135,5
        y = real(i) / 100.
         if ( i==130) y = 0.7515
         if ( i==135) y = 1.2057
        z = x**y
        write (*,*) i,x,y,z
      end do
      end

ffast_math-log.f90 (5.3 KB)
ffast_math.f90 (241 Bytes)
tst_all-bat.f90 (1.0 KB)
tst_gf-bat.f90 (112 Bytes)

In function ulp, could you be checking (f+a) against nearest(a,1.0), i.e. how many times is f > nearest(a,1.0) - a?
Is your test cycling until this condition fails? Is 1 in 6 million cycles acceptable performance?

In my calculations, I can tolerate “a is near to b”, where the values differ only in the last bit or two. For real*8, 52 or 53 bits of accuracy is pretty good.
I would be interested to know what proportion of calculations using -ffast-math provide 51 bits or less.

If I am using “wp = selected_real_kind(p=15)”, am I saying I want 50 bits of accuracy, so -ffast-math is totally acceptable?
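To make the kind of count I am asking about concrete, here is a minimal sketch (my own, not @kargl’s ulp function) that counts results differing from a reference by more than one ULP, using the intrinsic spacing(). In a real test, the “fast” value would come from a routine compiled with -ffast-math rather than from the stand-in expression used here.

program count_ulp_diff
  implicit none
  integer, parameter :: dp = selected_real_kind(p=15)
  integer, parameter :: n = 1000000
  real(dp) :: x, y, ref, fast
  integer  :: i, nbad
  nbad = 0
  do i = 1, n
    call random_number(x)
    call random_number(y)
    x = 0.5_dp*(x + 1.0_dp)   ! keep x in [0.5,1) so log(x) is safe
    ref  = x**y               ! reference result
    fast = exp(y*log(x))      ! stand-in for a -ffast-math-compiled result
    if (abs(fast - ref) > spacing(ref)) nbad = nbad + 1
  end do
  print "(a,i0,a,i0)", "results off by more than 1 ulp: ", nbad, " of ", n
end program count_ulp_diff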

@kargl,

Thanks very much for providing this information.

Most of my calculations involve ddotp (dot_product) and daxpy. Both use AVX instructions. They do not involve x ** y.

I am trying to understand whether the error profile you have found (“2.9% with ulp > 3 and only 50.3% with ulp < 1”) is specific to exp/log function use, or whether it would also apply to AVX instructions.

I have calculations that can take days to complete, so any saving that -ffast-math provides is beneficial. These calculations have “redundancy” so round-off is not cumulative.

Both ddotp and daxpy involve combined multiply and addition, which is very prone to round-off, so it is difficult to identify whether the increased round-off that -ffast-math introduces in only some cases would be significant.

For example, adding a much smaller number with increased round-off error to a larger-valued accumulator will not be an issue, so I want to understand what conditions lead to increased round-off with -ffast-math.

They are also characterised by low arithmetic intensity, which means that you might be better off coding compensated summation versions of them that would be both faster and more accurate.

@themos
I am not familiar with what you are referring to in this comment. Could you please elaborate, as I would like to better understand what you are suggesting?

I do find that both ddotp and daxpy can achieve low AVX efficiency when using large vectors, which I attribute to memory bandwidth, especially with multi-threading. I would not have understood this to be “low arithmetic intensity”, but if there are “better” compensated-summation versions of these routines, I would certainly like to understand that approach.

I am always looking for ways to improve efficiency, which I would equate to achieving higher arithmetic intensity, so your comment is intriguing.

As I indicated previously, I do pay particular attention to the significance of round-off errors in my structural analysis modelling. I find that when using -ffast-math with ddotp and daxpy for structural FE models (solving f = K.x), there is no appreciable increase in round-off error, but perhaps a different test/measure of error may help.

The basic equation for structural FE is f = K.x (force = sum of stiffness . displacement), which means solving a large set of sparse/banded linear equations where x (the deflection) is the unknown.
Once x is found, the error can readily be estimated for each equation by recalculating error = f - K.x, or error = f - sum(Ke.x), where K = sum(Ke) and the Ke are many very small matrices. I do this error calculation and report error statistics for all problems, to be aware of the significance of round-off errors in the analysis. The main cause of round-off error is significant variation in the magnitude of the Ke values, which can often be managed in the meshing/modelling approach.

Low arithmetic intensity means that there just aren’t enough arithmetic operations requested per data item (maybe one addition and one multiplication), so performance is memory bound. That also means that you can do additional arithmetic for free. Useful additional arithmetic includes things such as
Kahan compensated summation or blocked summation, which can radically reduce the round-off error of large summations. A recent paper showing the improvements possible is http://eprints.maths.manchester.ac.uk/2704/1/paper.pdf
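For reference, here is a minimal sketch of Kahan compensated summation in Fortran (my own illustration of the technique, not code from the paper):

pure function kahan_sum(x) result(s)
  integer, parameter :: dp = selected_real_kind(p=15)
  real(dp), intent(in) :: x(:)
  real(dp) :: s
  real(dp) :: c, y, t
  integer  :: i
  s = 0.0_dp
  c = 0.0_dp                ! running compensation for lost low-order bits
  do i = 1, size(x)
    y = x(i) - c            ! apply the compensation to the next term
    t = s + y               ! big + small: low-order bits of y are lost here
    c = (t - s) - y         ! recover what was lost (algebraically zero)
    s = t
  end do
end function kahan_sum

One caveat: the reassociation allowed by -ffast-math can optimize the compensation term (t - s) - y away to zero, so a routine like this should itself be compiled without -ffast-math.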


@themos,

Most of my -ffast-math performance testing was with single-threaded computation.
Low arithmetic intensity also means I should recheck the use of -ffast-math, especially for multi-threaded performance.
Would you include AVX instructions as low arithmetic intensity, which is similar to the low AVX efficiency that can be experienced with large vectors (large relative to cache size)?

An alternative to Kahan compensated summation is to simply use a higher-precision accumulator, which used to be the 80-bit x87 register before SSE/AVX became so dominant.
I don’t know enough about AVX instructions to know whether higher-precision accumulators can be combined with AVX multiplies, but I suspect this is not the case.
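To make that idea concrete, here is a minimal sketch of a dot product with a quad-precision accumulator (my own illustration; as far as I know this kind is software-emulated by gfortran on x86-64, so it will not use the AVX FMA units and trades considerable speed for the extra accuracy):

pure function dot_qp(a, b) result(d)
  integer, parameter :: dp = selected_real_kind(p=15)
  integer, parameter :: qp = selected_real_kind(p=30)   ! quad precision, if available
  real(dp), intent(in) :: a(:), b(:)
  real(dp) :: d
  real(qp) :: acc
  integer  :: i
  acc = 0.0_qp
  do i = 1, size(a)
    acc = acc + real(a(i), qp)*real(b(i), qp)   ! accumulate in higher precision
  end do
  d = real(acc, dp)                             ! round once at the end
end function dot_qp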