@themos
I am not familiar with what you are referring to in this comment. Could you please elaborate? I would like to better understand what you are suggesting.
I do find that both ddotp and daxpy show low AVX efficiency for large vectors, which I attribute to memory bandwidth limits, especially with multi-threading. I would not have described this as “low arithmetic intensity”, but if we would be “better off coding compensated summation versions” I would certainly like to better understand this approach.
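For reference, my understanding of a compensated (Kahan) summation version of a dot product is roughly the sketch below; please correct me if this is not what you are suggesting (the routine name is just illustrative, not an existing library routine):

```fortran
function ddot_kahan(n, x, y) result(s)
   implicit none
   integer, intent(in) :: n
   real(8), intent(in) :: x(n), y(n)
   real(8) :: s
   real(8) :: c, t, term
   integer :: i

   s = 0.0d0
   c = 0.0d0                  ! accumulated compensation for lost low-order bits
   do i = 1, n
      term = x(i)*y(i) - c    ! apply the previous compensation to the new term
      t    = s + term         ! big + small: low-order bits of term are lost here
      c    = (t - s) - term   ! recover the lost bits (algebraically zero)
      s    = t
   end do
end function ddot_kahan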
I am always looking for ways to improve efficiency, which I would equate to achieving higher arithmetic intensity, so your comment is intriguing.
As I indicated previously, I do pay particular attention to the significance of round-off errors in my structural analysis modelling. I find that when using -ffast-math with ddotp and daxpy for structural FE models (solving f = K.x), there is no appreciable increase in round-off error, though perhaps a different test/measure of error would help.
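One test I can think of (purely my own assumption about what a “different test/measure” might look like) is to accumulate the same dot product in quad precision as a reference and compare; this requires a compiler that offers a real(16) kind, e.g. gfortran:

```fortran
program dot_roundoff_test
   implicit none
   integer, parameter :: n = 1000000
   real(8), allocatable :: x(:), y(:)
   real(8)  :: s8
   real(16) :: s16
   integer  :: i

   allocate(x(n), y(n))
   call random_number(x)
   call random_number(y)

   s8  = 0.0d0
   s16 = 0.0_16
   do i = 1, n
      s8  = s8  + x(i)*y(i)                     ! double-precision accumulation
      s16 = s16 + real(x(i),16)*real(y(i),16)   ! quad-precision reference
   end do

   print '(a,es12.4)', 'relative error vs quad reference = ', &
         abs((s8 - real(s16,8)) / real(s16,8))
end program dot_roundoff_test
```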
The basic equation for structural FE is f = K.x (force = sum of stiffness . displacement), which means solving a large set of “sparse/banded” linear equations, where x (the deflection) is the unknown.
Once x is found, an error estimate is readily available for each equation by recalculating error = f - K.x, or error = f - sum(Ke.x), where K = sum(Ke) and the Ke are many very small element matrices. I do this error calculation and report error statistics for every problem, to stay aware of the significance of round-off errors in the analysis. The main cause of round-off error is large variation in the magnitude of the Ke values, which can often be managed in the meshing/modelling approach.
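In outline, the residual check I mean is essentially the sketch below (a rough illustration only; the element type, array names and DOF bookkeeping are simplified assumptions, not my production code, and the max and 2-norm shown are just examples of statistics one might report):

```fortran
module residual_check
   implicit none
   type :: element_t
      integer, allocatable :: dof(:)     ! global equation numbers for this element
      real(8), allocatable :: Ke(:,:)    ! small element stiffness matrix
   end type element_t
contains
   subroutine residual_stats(f, x, elems)
      real(8),         intent(in) :: f(:), x(:)
      type(element_t), intent(in) :: elems(:)
      real(8), allocatable :: r(:)
      integer :: e

      ! accumulate r = f - sum(Ke.x) element by element
      r = f
      do e = 1, size(elems)
         r(elems(e)%dof) = r(elems(e)%dof) - matmul(elems(e)%Ke, x(elems(e)%dof))
      end do

      print '(a,es12.4)', 'max |f - K.x| = ', maxval(abs(r))
      print '(a,es12.4)', '||f - K.x||_2 = ', norm2(r)
   end subroutine residual_stats
end module residual_check
```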