Floating point errors in a function

I’m hoping some of the smart people here can point me in the right direction. I have a function that evaluates an algorithm in single-precision floating point. I know that the first 4 decimal digits of the function result are reasonable and that the rest are random noise. I’d like to work backwards from that knowledge to estimate the error in the function’s inputs.

I also have a variation of this function that does some mixed-precision calculations in double precision, although the output is still single precision. I’d like to understand the error behavior of that version as well.

Does anyone know how I can answer these questions? Some resources? I’ve been through the Wikipedia articles on floating point, read “What Every Programmer Should Know About Floating-Point Arithmetic”, and various blogs, but I’m having a hard time putting that information together to answer these questions.

This blog post from Nick Higham - and the accompanying references - may be helpful. Chapters 1 and 2 of his book (see references) provide a lot of background.

They shouldn’t be random, unless you explicitly ask for RANDOM_NUMBER. I will assume that the outputs are repeatable (though not necessarily portable between compilers) but inaccurate.

You could give Herbie a chance.

A canonical reference would be Accuracy and Reliability in Scientific Computing. Anything written by the late, great Nick Higham is worth reading.

Thank you! This definitely gives me a good starting point.

Yes, you are correct. When I said ‘random’ I really meant ‘inaccurate’. I want to trace those inaccuracies to their source and see how inaccurate the inputs are to begin with.

The inputs cannot really be said to be “inaccurate”; they are floating-point data (rational numbers with a denominator that is a power of 2) specified at the precision you have selected. Numerical analysis is the technology that lets you calculate the closest representable floating-point number to the true result of a mathematical function. An IEEE-compliant FPU will do this automatically for individual operations (+, -, *, /, sqrt) but not for more complicated expressions. Getting the last bit of the result correct can be enormously laborious.
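
As a minimal sketch of that last point (a toy example of mine, not your function): each operation below is correctly rounded by the FPU, yet the composed expression loses most of its digits to cancellation.

```fortran
! Each individual operation is correctly rounded, but the
! expression (1 + x) - 1 in single precision loses most of x.
program cancellation
  implicit none
  integer, parameter :: sp = kind(1.0)
  real(sp) :: x, y
  x = 1.0e-7_sp
  y = (1.0_sp + x) - 1.0_sp   ! exact mathematical answer is x
  print *, 'computed       :', y
  print *, 'exact          :', x
  print *, 'relative error :', abs(y - x) / x
end program cancellation
```

Here the result is off by roughly 19% even though every single operation was rounded correctly, because 1.0e-7 sits near single-precision machine epsilon.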

Hmmm… would ‘noisy’ be a better term? The inputs of my function come from an iterative algorithm involving many, many floating-point operations. The algorithm is stable in single precision but ‘noisy’, fluctuating around the ‘true’ answer (for some definition of ‘true’). This noise affects post-processing, and I’d like to try to quantify it to some degree. Thanks for the help!
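
For example, one crude way I can imagine quantifying it is to run the same iteration in single and double precision and measure how far the single-precision result sits from the double-precision one. A minimal sketch, with a stand-in fixed-point iteration (x = cos(x)) in place of my real algorithm:

```fortran
! Compare a single-precision iteration against a double-precision
! reference to put a rough number on the accumulated 'noise'.
program noise_probe
  implicit none
  integer, parameter :: sp = kind(1.0), dp = kind(1.0d0)
  real(sp) :: xs
  real(dp) :: xd
  integer :: i
  xs = 0.5_sp
  xd = 0.5_dp
  do i = 1, 100
    xs = cos(xs)   ! the 'noisy' single-precision iteration
    xd = cos(xd)   ! the double-precision reference
  end do
  print *, 'single result  :', xs
  print *, 'double result  :', xd
  print *, 'relative noise :', abs(real(xs, dp) - xd) / abs(xd)
end program noise_probe
```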

Sometimes the designer can choose which inputs to use for a problem and sometimes they can’t. Selecting “good” inputs is another skill that numerical analysis gives you. You can make your life a lot harder, needlessly, by using inappropriate inputs. This sensitivity is usually captured by the term “condition number”.
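
For a scalar function f, the relative condition number is kappa = |x f'(x) / f(x)|: it bounds how much a relative perturbation of the input is magnified in the output. A quick sketch (a toy example, not tied to your problem):

```fortran
! f(x) = x - 1 near x = 1 is badly conditioned: kappa = |x/(x-1)|.
program condition
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: x, kappa
  x = 1.000001_dp
  kappa = abs(x / (x - 1.0_dp))   ! f'(x) = 1, so kappa = |x/f(x)|
  print *, 'condition number of x - 1 near x = 1:', kappa   ! ~1e6
end program condition
```

With kappa around 1e6, inputs correct to full single precision (about 7 digits) leave only a digit or so of accuracy in the output, no matter how carefully the function itself is evaluated.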

I remember exactly when this issue really clicked for me. I am an amateur stargazer and astrophotographer. For long-exposure astrophotography, one usually arranges for a mechanical axis of a telescope mount to be aligned with the exact point in the starfield that is the prolongation of the Earth’s axis of rotation. I extended a previous method for determining that point using wide-field (a few degrees) photography of the starfield in the region of the North (or South) Celestial Pole (NCP).

One takes two images of the NCP region, one before and one after rotating the telescope mount (carrying the camera) by a large angle (say 90 degrees). If you identify the same stars in the two images, simple Euclidean geometry can draw for you the implied axis of rotation. The geometric drawing involves two lines (the perpendicular bisectors of the segments connecting the same star in the two images). As luck would have it, the two brightest stars in the region (Alpha and Lambda Ursae Minoris) are positioned almost opposite each other relative to the NCP, so the lines one needs to draw end up nearly parallel. This results in a magnified error in determining the point of intersection.

The solution is to use fainter stars that are not so placed. This is an example where the designer of the method can turn an ill-conditioned problem into a well-conditioned one by choosing the inputs. Practitioners of the old technology of celestial navigation by sextant were well aware of this issue and would choose stars/planets so that the intersecting lines would have appreciable angles between them.
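
For the curious, a toy version of that geometry (my own construction, not the actual alignment code): two lines crossing at a small angle theta, where a tiny shift delta of one line moves the intersection point by roughly delta/theta.

```fortran
! Nearly parallel lines: a shift of one line by delta moves the
! intersection by about delta/theta, a magnification of ~1/theta.
program near_parallel
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: theta, delta, m, x_exact, x_perturbed
  theta = 1.0e-3_dp    ! crossing angle in radians
  delta = 1.0e-8_dp    ! small shift of the second line
  m = tan(theta)
  ! Line 1: y = 0.  Line 2: y = m*(x - 1).  They intersect at x = 1.
  x_exact = 1.0_dp
  ! Shift line 2 up by delta: y = m*(x - 1) + delta  =>  x = 1 - delta/m
  x_perturbed = 1.0_dp - delta / m
  print *, 'input shift        :', delta
  print *, 'intersection moved :', abs(x_perturbed - x_exact)   ! ~1e-5
end program near_parallel
```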

This also confirmed for me Nick Trefethen’s maxim: “If rounding errors vanished, 95% of numerical analysis would remain.”