# Working with reals near the machine precision

Hello,

With `real :: v1, v2` , assuming that

• `v2 > nearest(v1,1.0)`
• `abs(v1+v2) <= huge(1.0)`

can we be sure that `v1 < 0.5*(v1+v2) < v2` is always true?

A limit case would be something like `v2 = nearest(nearest(v1,1.0),1.0)`, where I expect `0.5*(v1+v2) == nearest(v1,1.0)` to be true.

1 Like

Your comparison assumes that 0.5*(v1+v2) is a representable number inbetween the two others. Since you ask for a number nearest to v1, I think the result will be:

``````v1 <= 0.5*(v1+v2) <= v2
``````

In other words: no strict inequality. At least, that is what I expect to be tbe the answer.

This is actually my question, not an assumption.

I don’t get what you mean…

Dividing by 2 being a binary right shift, it sounds reasonable. But considering the adverb “always”, I don’t know. It needs a deeper investigation and demonstration.

It is tempting to say yes, but in fortran one must assume that all arithmetic is inexact in the least significant bits, even when the mathematical result can be represented exactly. To give an example. suppose A=nearest(v1,1.0) and v2=nearest(A,1.0). If the v1, A, v2 exponents were all the same, then A==0.5*(v1+v2) would be true in exact arithmetic, but there is nothing in the fortran standard that would require the expression to be evaluated correct to the last bit. If you then consider the corner cases, where the v1, A, v2 exponents might differ, then it is easy to see how the least bits of information could be lost in the addition.

That is for integers. For binary floating point, dividing by 2 is a decrement of the exponent, leaving the mantissa bits unchanged. For hexadecimal floating point (like IBM mainframes), dividing by two could cause a shift of the mantissa bits (losing a 1 for half of the possible values).

1 Like

Yes of course, thanks for correction.

It is hard to imagine `nearest(X,1.0)` working differently than adding 1 to binary mantissa (surely allowing for increasing the exponent when mantissa “overflows”.
The following code clearly shows that `v1<(v1+v2)/2<v2` where `v2=nearest(nearest(v1,1.0),1.0)`, both with and without mantissa “overflow”.

``````program nearnear
integer, parameter :: fp=kind(1.0)
real(fp) :: v1, a, v2

v1 = 4.0
a = nearest(v1, 1.0)
v2 = nearest(a, 1.0)
print '(4EX15.6)', v1, a, v2, (v1+v2)/2
v1 = nearest(v1, -1.0)
a = nearest(v1, 1.0)
v2 = nearest(a, 1.0)
print '(4EX15.6)', v1, a, v2, (v1+v2)/2
v1 = nearest(v1, -1.0)
a = nearest(v1, 1.0)
v2 = nearest(a, 1.0)
print '(4EX15.6)', v1, a, v2, (v1+v2)/2
end program nearnear
``````
``````\$ ifort nearnear.f90 && ./a.out
0X8.000000P-1  0X8.000010P-1  0X8.000020P-1  0X8.000010P-1
0XF.FFFFF0P-2  0X8.000000P-1  0X8.000010P-1  0X8.000000P-1
0XF.FFFFE0P-2  0XF.FFFFF0P-2  0X8.000000P-1  0XF.FFFFF0P-2
``````

At first I agreed with you. Then I tried the following code that browses all possible 32 bits reals (except the ones who would overflow):

``````program bitest
implicit none
real :: v1, v2, a, v0

v1 = -huge(v1)/2
a = nearest(v1,1.0)
do
v2 = nearest(a,1.0)
if (v2 > huge(1.0)/2) exit
v0 = 0.5*(v1+v2)
if (v0 /= a) write(*,*) v1, a, v2, v0
v1 = a; a = v2
end do
end program
``````

It reports no discrepancy between `v0` and `a` with `gfortran -O3`

I think the explanation is that whatever the least significant bit of the mantissa of `v1`, `v2` has the same one. It turns that the least significant bit of the exact sum is necessarily `0`: it can be dropped without losing information.

It is obvious if v1 and v2 have the same exponent, as the sum is performed without shifting any mantissa. If they don’t, there are 2 cases (let’s take a mantissa of 4 bits):

the mantissa of `v1` is 1110 (exponent e)
the mantissa of `a ` is 1111 (exponent e)
the mantissa of `v2` is 1000 (exponent e+1)
Before summing `v1` is expressed as 0111 (exponent e+1), and the sum is 1111 (exponent e+1), that is 1111 (exponent e) after division by 2, which is `a`

or

the mantissa of `v1` is 1111 (exponent e)
the mantissa of `a ` is 1000 (exponent e+1)
the mantissa of `v2` is 1001 (exponent e+1)
Before summing `v1` is expressed as 0111 (exponent e+1), and the sum is 10000 (exponent e+1), that is 1000 (exponent e+2) after dropping the last bit, that is 1000 (exponent e+1) after division by 2.

This would not be necessarily the case for a real number model with a base 3 instead of 2 (imagine a ternary computer), even if v1 and v2 have the same exponent. For instance:

the mantissa of `v1` is 2000 (exponent e)
the mantissa of `a ` is 2001 (exponent e)
the mantissa of `v2` is 2002 (exponent e)
The sum is 11002 (exponent e), that is 1100 (exponent e+1) after dropping the last trit, that is 2000 (exponent e) after division by 2, which is `v1`, not `a`.
Or the sum could be 1101 (exponent e+1) if rounded to the nearest, that is either 2001 or 2002 (exponent e) after division by 2, depending on the rounding: it would be `a` or `v2`

1 Like

So, it looks like that with a binary model for the reals, `0.5*(v1+v2)` (or `(v1+v2)/2`) is safe in practice. But this is not a demonstration, it just relies on the most likely way the FPU perform the operations.

Maybe `v1 + (v2-v1)/2` would be preferable, as it would avoid any precision loss on` (v2-v1)` when they are close to each other.

And if we want a bullet-proof code without any assumption:

``````real recursive function safe_mean(v1,v2) result(v0)
! returns an approximation of (v1+v2)/2 without overflow/underflow and such that
! * v0 = max(v1,v2) if v2 == v1 or v2 == nearest(v1,v2-v1)
! * min(v1,v2) < v0 < max(v1,v2) otherwise
real, intent(in) :: v1, v2
real :: a
if (v2 > v1) then
a = nearest(v1,1.0)
if (v2 == a) then
v0 = v2
else
if (v1 >= 0.0) then
v0 = v1 + (v2-v1)/2
else if (v2 <= 0.0) then
v0 = v2 + (v1-v2)/2
else
v0 = (v1+v2)/2
end if
v0 = max( v0, a )
v0 = min( v0, nearest(v2,-1.0) )
end if
else if (v2 == v1)
v0 = v2
else
v0 = safe_mean(v2,v1)
end if
end function safe_mean
``````
1 Like

There is some ambiguity in this case, but I don’t think it has any consequences. Within the computation of the sum, `v1` could also be represented as 1000 with exponent e+1. This would be a round-to-even convention for the unnormalized mantissa. Then the sum (with the same round-to-even convention) and the division would again give `a` as the result, as before.

This all presumes that the floating point hardware does these operations correctly and consistently. As stated before, the fortran standard does not require that, particularly when there are multiple hardware units that could produce the results, the fortran standard does not require all of them to use the same rounding and truncation conventions, and the same expression in different parts of the code might end up on different hardware units, or they might be evaluated after the rounding mode has been modified, etc. In general, a fortran programmer should assume that the last bits of any floating point operation are inexact, and if the code or algorithm requires special care, it should be treated explicitly in the code. Fortran now makes that much easier with intrinsics such as `nearest()`, `spacing()`, etc.

1 Like

Expressions like `(v2-v1)` when `v1` and `v2` are close are what causes catastrophic loss of accuracy in the first place. However, as you point out, this does avoid overflow when the values are large, say above `0.5*huge()`.

Another way to avoid overflow is to scale the input values by a power of 2, do the safe `0.5*(v1+v2)` operation, and then scale the result back. Those power of 2 scaling operations, which leave the mantissa unchanged, can be done cheaply with intrinsics like `exponent()`, `scale()`, and `set_exponent()`.

It depends on the point of view. That’s true only if `v1` and `v2` are considered as approximations of actual values, not if they are considered as the actual values.

Then underflows can happen Here is the kind of code I had in mind:

``````program mean2
! compute the safe mean of two values that avoids overflow and unnecessary underflow.
implicit none
real :: x, y, a, t

x = 1.0; y = 2.0; t = 1.5
a = safe_mean2( x, y )
write(*,*) 'mean=', a, ' diff=', t-a

x = 0.5*huge(x); y = 0.75*huge(x); t = 0.625*huge(x)
a = safe_mean2( x, y )
write(*,*) 'mean=', a, ' diff=', t-a

x = tiny(x); y = 2.0*tiny(x); t = 1.5*tiny(x)
a = safe_mean2( x, y )
write(*,*) 'mean=', a, ' diff=', t-a

x = huge(x); y = tiny(x); t = 0.5*huge(x)
a = safe_mean2( x, y )
write(*,*) 'mean=', a, ' diff=', t-a
contains
function safe_mean2( x, y ) result(a)
implicit none
real :: a
real, intent(in) :: x, y
integer :: k
intrinsic :: max, exponent, scale
k = max( exponent(x), exponent(y) )
a = scale( (scale(x,-k) + scale(y,-k)), k-1 )
return
end function safe_mean2
end program mean2
``````

That requires two lines of code, after the declarations. As you can see, the only actual floating point operation here is the addition, which is always safe after the scaling operations. Everything else is really just bit-masking and integer shifts within those intrinsics. This code works correctly for gfortran and ifort, but I’m not certain about its standard conformance. First, there is the assumption of binary exponents. Then, when scale(y,-k) underflows in that last test case, this code expects the result to be set to 0.0. Gfortran and ifort both do that, but I don’t think it is required by the standard. Thus, some extra code might be necessary to catch those corner cases, and maybe to generalize to arbitrary radix.

This is indeed an elegant solution to manage the possible overflows/underflows…

In the case the radix is not 2, the code should be

``````a = scale( (scale(x,-k) + scale(y,-k)), k ) / 2
``````

I don’t think either… The standard says:

SCALE (X, I)
Result Value. The result has the value X × b^I, where b is defined in 16.4 for model numbers representing values of X, provided this result is representable; if not, the result is processor dependent."

The corner cases are always the ones that require the most code Btw this code does not address what was my initial concern, and I think it still needs special cases to ensure that the result is always different from x and y as long as `y /= x .and. y /= nearest(x,y-x)`

If you want your function to work for all floating point numbers then often there is a way, but it gets very complicated (and slow) even for a simple formula like (a+b)/2. The math libraries do something similar for trigonometric function like `sin(x)`, which are easy to implement if you are willing to restrict yourself to `abs(x) < 1e10`, but if yo want them to work close to `huge(x)`, then it’s VERY complicated, and it makes them slow even for small arguments.

So my recommendation is to just use the simple formula, but ensure your numbers are reasonable. If your numbers get anywhere close to `huge(x)` then typically it means something is wrong anyway.

1 Like

I agree… However, I started this topic because I have encountered this problem in practice, although nothing was wrong in my approach. I was doing an interative bisection `v0 = (v1+v2)/2`, and it was more efficient to let it continue until possibly an interval as small as the spacing between 2 representable reals. So I wanted to be sure about what was happening at this scale.

That’s a common case, which I even mention here: schools.rst · GitHub (linked in this thread: Can one design coding rules to follow so that `-ffast-math` is safe?).

You CAN do it your way, but then you have to go the extra length to ensure things behave correctly bit to bit, which I call the IEEE school of thought. This approach has pros and cons. If you want to use the other approach, then just adjust your epsilon to be larger than just nearest spacing, and things should be robust and fast (and accurate enough).

1 Like

This is one of those cases where I wish the standard were a little more specific. As written, if the `scale()` operation underflows, then the processor could initiate WWIII. I would like the consequences to be a little more confined than that. That clause could mean that the result is undefined in the same way that the equivalent multiplication operation `(X*B**I)` is undefined when underflow occurs. That is, it could be set to 0.0 (as I would want in the case we are discussing), or it could signal an error. I would not want this operation to result in some random bit pattern with no warning, or start WWIII, or anything like that.

Here is a program that tests the original problem with all possible floating point numbers.

``````program mean2
! compute the safe mean of two values that avoids overflow and unnecessary underflow.
implicit none
real :: x, y, t, a
character(*), parameter :: cfmt='(*(a,1x,es15.8,1x))'

x = -huge(x)
t = nearest(x,1.0)
y = nearest(t,1.0)
do
a = safe_mean2(x,y)
if ( a .ne. t ) then
write(*,cfmt) 'error for x=', x, 't=', t, 'a=', a, ' y=', y
error stop 'error detected'
endif
if ( y == huge(x) ) exit
x = t; t=y;
y = nearest(t,1.0)
enddo
write(*,cfmt) 'No errors detected'
contains
function safe_mean2( x, y ) result(a)
implicit none
real :: a
real, intent(in) :: x, y
integer :: k
intrinsic :: max, exponent, scale
k = max( exponent(x), exponent(y) )
a = scale( (scale(x,-k) + scale(y,-k)), k-1 )
return
end function safe_mean2
end program mean2
``````

I have verified that this works for a few combinations of compilers and with various compiler options. However, with ifort it fails.

``````error for x= -1.17549449E-38 t= -1.17549435E-38 a=  0.00000000E+00  y= -1.17549421E-38
error detected
``````

This is when `y=-tiny(x)`, and the problem does seem to trace back to corner cases of `scale()` where the exponent changes from -125 to -126.

gfortran does not fail in this case, but it does seem to step through all of the denormal values between `-tiny()` and 0.0, which I was not expecting. I was rather expecting `nearest(-tiny(x),1.0)` to step over the denormals and go right to 0.0.

At least the undefined behavior did not start WWIII when I ran the program.

 A little experimentation reveals the following.

``````\$ ifort -fp-model precise mean2.F90 && a.out
No errors detected
``````

The problem without the option seems to be that the `nearest()` function moves through the denormal numbers between -tiny(x) and +tiny(x), but the `scale()` function flushes the denormals to zero. With the option, `nearest()` does the same thing, but `scale()` changes its behavior and produces denormal numbers (the correct ones, apparently).

This behavior of nearest() is a little surprising. I would have expected `nearest(-tiny(x),1.0)` to be `0.0`, and then I would expect `nearest(0.0,1.0)` to be `tiny(x)`. That is, I would have expected `nearest()` to skip over the denormal numbers.

Are we supposed to test the exponents every time we use the intrinsic `scale()` in order to avoid WWIII? I know how to do that, but I don’t know if that is what the standard expects. Are we supposed to test the exponents for the equivalent multiplication operation `X*b**I` beforehand? I know how to do that too, but is that really what the standard expects programmers to do?
In this particular case, I’m not even certain that the result is undefined. I think of the term “undefined” to be more like `sqrt(-1.0)` or `acos(2.0)`. In this `scale()` case (or the equivalent multiplication), the result might be defined somewhere in the standard, but I don’t know where. That is, 0.0 might be the defined result of `scale()` or of the multiplication, and 0.0 is certainly a representable number.
For the code given above, it works as expected on several compilers (various versions of gfortran and nagfor, on intel and arm64 hardware). ifort gets tripped up for values near `tiny(x)`. Right now it isn’t clear to me why. The culprit might be `scale()` or it might be `nearest()` or it might be the addition operation. It does not dump core, it just returns an incorrect value. Maybe the expected value would be returned with a different rounding mode? Right now, I don’t know.