IEEE_FMA (https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2023-1/ieee-fma.html) is now available in gfortran. The fused multiply-add performs an operation of the form `a = x + y*z` in a single instruction, i.e. faster and more accurately than a naive combination of a multiply and an add (see the Wikipedia articles on the multiply–accumulate operation and the FMA instruction set).
Is it good practice to use it, or should one rather rely on the compiler figuring out where to use FMA?
I always see huge performance gains with gfortran when using -march flags that turn on AVX/AVX2 instructions, which to me suggests FMA is routinely used wherever possible. Forcing its usage may be beneficial in edge cases where the compiler cannot figure it out by itself, but of course it's a less "fortrannic" approach, if that means "you do the science; the compiler will do the rest".
See this example for `axpy = a*x + y`. With `-O3 -march=core-avx2` we get:

```
vfmadd132sd xmm0,xmm1,QWORD PTR [rsi]
```

With `-O3 -mtune=generic` we get:

```
movsd xmm0,QWORD PTR [rdi]
mulsd xmm0,QWORD PTR [rsi]
addsd xmm0,QWORD PTR [rdx]
```
Note that while automatic FMA generation is usually beneficial for accuracy, it can also introduce bugs. The classic example is that it can make `a*a - b*b` evaluate to nonzero even when `a == b`.
I agree with the "faster" part of that statement, but I think the "more accurate" part is historically incorrect. The FMA multiply might have different rounding conventions than the normal multiply, or the two instructions might treat denormals differently. This can cause an expression like `a*a - b*b` to evaluate to nonzero even when `a == b`: the compiler uses the normal multiply operation (with whatever rounding conventions are in effect, which the programmer might set at run time) for one of the multiplies, and then uses the FMA instruction for the second multiply and the addition. If the rounding conventions, or the treatment of denormal numbers, differ between those two multiplications, you get that surprising, but understandable, result. Another historical oddity arises when intermediates are evaluated in extended (e.g. 80-bit) precision by the normal multiply, while FMA uses a 64-bit multiply convention.