I’ve noticed that big speedups are possible with -ffast-math in some circumstances. For example,
x = 10.0_dp**x
can be ~4 times faster with -ffast-math. The effect seems specific to a base of 10: speedups are not significant for something like 7.6**x.
But I don’t want to compile my whole code with -ffast-math, because I worry about it messing some parts up. I just want to make a “fast” special 10**x subroutine, compiled with -ffast-math, and then call this special subroutine from another piece of code compiled without -ffast-math.
However, this does not work. Everything compiles and links just fine, but I do not get the -ffast-math speedup.
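For reference, the split I am attempting looks roughly like this (a minimal sketch; the module, routine, and file names are just placeholders):

! fast_pow.f90 -- compiled with:  gfortran -O3 -ffast-math -c fast_pow.f90
module fast_pow
   implicit none
   integer, parameter :: dp = selected_real_kind(p=15)
contains
   subroutine pow10_fast(x, y)
      ! y = 10**x elementwise, hopefully compiled into the fast form
      real(dp), intent(in)  :: x(:)
      real(dp), intent(out) :: y(:)
      y = 10.0_dp**x
   end subroutine pow10_fast
end module fast_pow

! main.f90 -- compiled with:  gfortran -O3 -c main.f90, then linked with fast_pow.o
program main
   use fast_pow
   implicit none
   real(dp) :: x(1000000), y(1000000)
   call random_number(x)
   call pow10_fast(x, y)
   print *, sum(y)
end program main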
I’m using an M1 MacBook Air. My compiler info:
GNU Fortran (Homebrew GCC 11.2.0_3) 11.2.0
Why isn’t it possible to compile only part of the code with -ffast-math? Or if it is possible, how do I do it?
This is possibly just an issue with gcc not yet being well optimized for M1 Macs. I’m finding that clang can compile identical C code that is as fast as the -ffast-math version.
Not sure how you would do this with Xcode etc., but in old-fashioned makefiles you can always create a special rule for just the routines you want to compile with -ffast-math. Here is an example of a rules file I include in my base Makefiles; note the special rule for fast_routine.f90.
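In outline, a per-file rule of that kind looks something like this (a generic sketch rather than the actual rules file; the flags are just illustrative, and recipe lines must start with a hard tab):

FC     = gfortran
FFLAGS = -O2

# default rule for every Fortran source
%.o : %.f90
	$(FC) $(FFLAGS) -c $< -o $@

# special rule: only this file is built with -ffast-math
fast_routine.o : fast_routine.f90
	$(FC) $(FFLAGS) -ffast-math -c $< -o $@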
Very strange, as it is straightforward to check that r = 7.6d0**x is translated identically to r = 10.d0**x, i.e. into a call to exp with the argument x multiplied by a compile-time-computed log(base).
Actually, I am surprised it requires -ffast-math to make this optimization. I guess that pow(y,x) does just the same internally, i.e. exp(x*log(y)), so with a constant base the optimizer could do it anyway with the -O3 option.
pure function pow_fast(x, y) result(x_to_y)
   ! evaluates x**y(:) as exp(y*log(x)); valid for x > 0
   integer, parameter :: dp = kind(1.0d0)   ! added so the snippet is self-contained
   real(kind=dp), intent(in) :: x, y(:)
   real(kind=dp) :: x_to_y(size(y))
   x_to_y = exp(y*log(x))
end function pow_fast
In an earlier version a variable c was used. Running the modified code, the speeds are about the same: the compilers were optimizing away the unused c variable.
I use -ffast-math for solving large sets of linear equations (f = K.x) with an OpenMP skyline solver (no intrinsics) and get about a 2x speed improvement for what is 99% of the run time.
There are many .f90 files and I mix a variety of compile options, including:
-ffast-math,
-O2 vs -O3,
-fopenmp vs -funroll-loops --param max-unroll-times=2, and
-mno-fma.
Basically there are only a few routines that dictate performance, so most other routines can be compiled with less aggressive options (e.g. those for reading and managing data).
I have error checking on the equation-solver solution, recalculating the residual “f - sum(Ke.x)”, where K = sum(Ke) over all equations, and I do not find significant errors when using -ffast-math.
I found an interesting calculation of pi, using Archimedes’ method of dividing the sectors of a circle: initially 6 sectors with equilateral triangles, then repeatedly halving the sectors to calculate the circumference as an estimate of pi (using radius = 0.5).
I tried 4 kinds with precision = 6, 15, 18 and 30 to estimate pi as the sum of the sector lengths, and estimated the error based on the inside (inscribed) and outside (circumscribed) polygons. pi_sector.f90 (1.8 KB)
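The core of the recurrence is roughly the following (a simplified sketch using only the inscribed polygon, not the attached file):

program pi_sector_sketch
   ! Archimedes' method: start from an inscribed hexagon (side = radius),
   ! then repeatedly halve the sectors; with radius = 0.5 the perimeter
   ! itself approximates pi.
   integer, parameter :: wp = selected_real_kind(p=15)
   real(wp), parameter :: r = 0.5_wp
   real(wp) :: s, h, pi_est
   integer :: n, k
   n = 6
   s = r                                    ! side of the inscribed hexagon equals the radius
   do k = 1, 25
      pi_est = real(n, wp)*s                ! perimeter of the inscribed polygon
      write (*,'(i10,2es25.16)') n, pi_est, pi_est - 4*atan(1.0_wp)
      h = sqrt(r**2 - (s/2)**2)             ! apothem of the current polygon
      s = sqrt((s/2)**2 + (r - h)**2)       ! side length after halving the sectors
      n = 2*n
   end do
end program pi_sector_sketch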
I thought this would be a good example to test -ffast-math, using gfortran-64 Ver 11.1 on Windows 7.
The compile options I used are collected in a .bat file.
The approach uses sqrt and calculates an error that approaches epsilon, which I thought might demonstrate -ffast-math “short-cuts”.
After running the test for all 4 kinds, I could not identify any change in the calculated result for any of the alternative compile options.
I continue to look for failings of -ffast-math in all the wrong places!
Any suggestions where I should look?
I was looking to see if there were problems with rounding in computations with significant variation in exponents, as “a” approaches “r”. I could not identify a problem.
Could you please attach a copy of h.f90 so I can see where the errors occur.
I did try a computation of x ** y, which did show a change with -ffast-math, but the difference appears to be less than epsilon() and only for some cases.
The following attached files show the x ** y tests I tried with different gfortran compile options:
! x**y for a range of exponents y, to compare results across compile options
real :: x, y, z
integer :: i
x = 0.1
do i = 75, 135, 5
   y = real(i) / 100.
   if (i == 130) y = 0.7515
   if (i == 135) y = 1.2057
   z = x**y
   write (*,*) i, x, y, z
end do
end
In your function ulp, could you be checking (f+a) against nearest(a, 1.0)? That is, how many times is f > nearest(a,1.0) - a?
Is your test cycling until this condition fails? Is 1 in 6 million cycles an acceptable performance?
In my calculations, I can tolerate “a is near to b”, where the difference is basically only in the last bit or two. For real*8, 52 or 53 bits of accuracy is pretty good.
I would be interested to know what proportion of calculations using -ffast-math provide 51 bits or less.
If I am using “wp = selected_real_kind(p=15)”, am I saying I want only about 50 bits of accuracy, so that -ffast-math is totally acceptable?
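To be concrete, the kind of measure I have in mind is something like this (just a sketch, not necessarily how your ulp function works):

pure function ulp_diff(a, b) result(n_ulp)
   ! difference between a and b measured in units of the spacing at a;
   ! spacing(a) is the gap between a and the next representable value
   integer, parameter :: dp = selected_real_kind(p=15)
   real(dp), intent(in) :: a, b
   real(dp) :: n_ulp
   n_ulp = abs(a - b) / spacing(a)
end function ulp_diff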
Most of my calculations involve ddotp (dot_product) and daxpy. Both use AVX instructions. They do not involve x ** y.
I am trying to understand whether the error profile you have found (“2.9% with ulp > 3 and only 50.3% with ulp < 1”) is specific to exp/log function use or would also apply to AVX instructions.
I have calculations that can take days to complete, so any saving that -ffast-math provides is beneficial. These calculations have “redundancy” so round-off is not cumulative.
Both ddotp and daxpy involve combined multiply and addition, which is very prone to round-off, so it is difficult to identify if increased round-off associated with -ffast-math in only some cases would be significant.
For example, adding a much smaller number with increased round-off errors to a larger value accumulator will not be an issue, so I am wanting to understand what conditions lead to increased round-off with -ffast-math.
They are also characterised by low arithmetic intensity, which means that you might be better off coding compensated summation versions of them that would be both faster and more accurate.
@themos
I am not familiar with what you are referring to in this comment. Could you please elaborate, as I would like to better understand what you are suggesting.
I do find that both ddotp and daxpy can achieve low AVX efficiency when using large vectors, which I attribute to memory bandwidth, especially with multi-threading. I would not have understood this to be “low arithmetic intensity”, but if coding “compensated summation versions” would leave me better off, I would certainly like to understand that approach better.
I am always looking for ways to improve efficiency, which I would equate to achieving higher arithmetic intensity, so your comment is intriguing.
As I indicated previously, I do pay particular attention to the significance of round-off errors in my structural analysis modelling. I find that when using -ffast-math with ddotp and daxpy for structural FE models (solving f = K.x), there is no appreciable increase in round-off error, but perhaps a different test/measure of error would help.
The basic equation for structural FE is f = K . x ( force = sum of stiffness . displacement ), which is solving a large set of “sparse/banded” linear equations, where x/deflection is the unknown.
When x is found, the error estimate can be readily found for each equation, by recalculating error = f - K.x, or error = f - sum(Ke.x) where K = sum(Ke) and Ke are many very small matrices. I do this error calculation and report error statistics for all problems, to be aware of the significance of round-off errors in the analysis. The main cause of round-off error is significant variation in the magnitude of Ke values, which can often be managed in the meshing/modelling approach.
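In code, the check amounts to something like the following (a sketch with a dense K for simplicity; my actual K is a skyline assembly of the element matrices Ke):

subroutine report_residual(K, x, f)
   ! residual check for f = K.x: report the largest absolute and
   ! relative errors over all equations
   integer, parameter :: dp = selected_real_kind(p=15)
   real(dp), intent(in) :: K(:,:), x(:), f(:)
   real(dp) :: err(size(f))
   err = f - matmul(K, x)
   write (*,*) 'max |f - K.x|         =', maxval(abs(err))
   write (*,*) 'max relative residual =', maxval(abs(err)) / maxval(abs(f))
end subroutine report_residual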
Low arithmetic intensity means that there just aren’t enough arithmetic operations requested for each data item (maybe one addition and one multiplication), so performance is memory-bound. That also means that you can do additional arithmetic for free. Useful additional arithmetic includes things such as Kahan compensated summation or blocked summation, which can radically reduce the round-off error of large summations. A recent paper showing the improvements possible is http://eprints.maths.manchester.ac.uk/2704/1/paper.pdf
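For example, a Kahan-compensated dot product needs only a few extra operations per element (a sketch; note that -ffast-math itself can optimize the compensation away, so a routine like this has to be compiled without it):

pure function dot_kahan(x, y) result(s)
   ! Kahan compensated summation applied to a dot product: the
   ! round-off lost in each addition is carried in c and fed back
   ! into the next term
   integer, parameter :: dp = selected_real_kind(p=15)
   real(dp), intent(in) :: x(:), y(:)
   real(dp) :: s, c, t, term
   integer :: i
   s = 0.0_dp
   c = 0.0_dp
   do i = 1, size(x)
      term = x(i)*y(i) - c
      t = s + term
      c = (t - s) - term    ! algebraically zero, numerically the lost low bits
      s = t
   end do
end function dot_kahan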
Most of my -ffast-math performance testing was with single-thread computation.
Low arithmetic intensity also means I should recheck the use of -ffast-math, especially for multi-threaded performance.
Would you include AVX instructions as low arithmetic intensity? That sounds similar to the low AVX efficiency that can occur with large vectors (large relative to the cache size).
An alternative to Kahan compensated summation is simply to use a higher-precision accumulator, which used to be the 80-bit register before SSE/AVX became so dominant.
I don’t know enough about AVX instructions to know whether higher-precision accumulators can be combined with AVX multiplies, but I suspect this is not the case.
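In Fortran terms, that alternative is just accumulating in a wider kind, for example (a sketch; whether p=18 maps to the 80-bit x87 format or to software quad precision depends on the compiler and target, with very different speeds):

pure function dot_wide(x, y) result(s)
   ! accumulate the dot product in a wider kind, rounding back to
   ! double precision only once at the end
   integer, parameter :: dp = selected_real_kind(p=15)
   integer, parameter :: ep = selected_real_kind(p=18)   ! extended precision, if available
   real(dp), intent(in) :: x(:), y(:)
   real(dp) :: s
   real(ep) :: acc
   integer :: i
   acc = 0.0_ep
   do i = 1, size(x)
      acc = acc + real(x(i), ep)*real(y(i), ep)
   end do
   s = real(acc, dp)
end function dot_wide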