Working with reals near the machine precision

Hello,

With real :: v1, v2 , assuming that

  • v2 > nearest(v1,1.0)
  • abs(v1+v2) <= huge(1.0)

can we be sure that v1 < 0.5*(v1+v2) < v2 is always true?

A limit case would be something like v2 = nearest(nearest(v1,1.0),1.0), where I expect 0.5*(v1+v2) == nearest(v1,1.0) to be true.

1 Like

Your comparison assumes that 0.5*(v1+v2) is a representable number in between the two others. Since you ask for a number nearest to v1, I think the result will be:

v1 <= 0.5*(v1+v2) <= v2

In other words: no strict inequality. At least, that is what I expect the answer to be.

This is actually my question, not an assumption.

I don’t get what you mean…

Dividing by 2 being a binary right shift, it sounds reasonable. But considering the adverb “always”, I don’t know. It needs a deeper investigation and demonstration.

It is tempting to say yes, but in fortran one must assume that all arithmetic is inexact in the least significant bits, even when the mathematical result can be represented exactly. To give an example, suppose A=nearest(v1,1.0) and v2=nearest(A,1.0). If the v1, A, v2 exponents were all the same, then A==0.5*(v1+v2) would be true in exact arithmetic, but there is nothing in the fortran standard that would require the expression to be evaluated correct to the last bit. If you then consider the corner cases, where the v1, A, v2 exponents might differ, then it is easy to see how the least bits of information could be lost in the addition.

That is for integers. For binary floating point, dividing by 2 is a decrement of the exponent, leaving the mantissa bits unchanged. For hexadecimal floating point (like IBM mainframes), dividing by two could cause a shift of the mantissa bits (losing a 1 for half of the possible values).
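
A quick way to see this on a binary machine is to look at fraction() and exponent() (a minimal sketch, assuming the default real has radix 2):

program halve_demo
   ! Halving a binary float leaves the fraction unchanged and only decrements
   ! the exponent (assuming radix(x) == 2, i.e. IEEE binary hardware).
   implicit none
   real :: x
   x = 1.2345678
   print *, radix(x)                        ! 2 on IEEE hardware
   print *, fraction(x),   exponent(x)
   print *, fraction(x/2), exponent(x/2)    ! same fraction, exponent - 1
end program halve_demo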

1 Like

Yes of course, thanks for the correction.

It is hard to imagine nearest(X,1.0) working differently than adding 1 to the binary mantissa (surely allowing for increasing the exponent when the mantissa “overflows”).
The following code clearly shows that v1<(v1+v2)/2<v2 where v2=nearest(nearest(v1,1.0),1.0), both with and without mantissa “overflow”.

program nearnear
  integer, parameter :: fp=kind(1.0)
  real(fp) :: v1, a, v2

  v1 = 4.0
  a = nearest(v1, 1.0)
  v2 = nearest(a, 1.0)
  print '(4EX15.6)', v1, a, v2, (v1+v2)/2
  v1 = nearest(v1, -1.0)
  a = nearest(v1, 1.0)
  v2 = nearest(a, 1.0)
  print '(4EX15.6)', v1, a, v2, (v1+v2)/2
  v1 = nearest(v1, -1.0)
  a = nearest(v1, 1.0)
  v2 = nearest(a, 1.0)
  print '(4EX15.6)', v1, a, v2, (v1+v2)/2
end program nearnear
$ ifort nearnear.f90 && ./a.out
  0X8.000000P-1  0X8.000010P-1  0X8.000020P-1  0X8.000010P-1
  0XF.FFFFF0P-2  0X8.000000P-1  0X8.000010P-1  0X8.000000P-1
  0XF.FFFFE0P-2  0XF.FFFFF0P-2  0X8.000000P-1  0XF.FFFFF0P-2

At first I agreed with you. Then I tried the following code that browses all possible 32-bit reals (except the ones that would overflow):

program bitest
implicit none
real :: v1, v2, a, v0

v1 = -huge(v1)/2
a = nearest(v1,1.0)
do 
	v2 = nearest(a,1.0)
	if (v2 > huge(1.0)/2) exit
	v0 = 0.5*(v1+v2)
	if (v0 /= a) write(*,*) v1, a, v2, v0
	v1 = a; a = v2
end do
end program

It reports no discrepancy between v0 and a with gfortran -O3

I think the explanation is that whatever the least significant bit of the mantissa of v1 is, v2 has the same one. It turns out that the least significant bit of the exact sum is necessarily 0: it can be dropped without losing information.

It is obvious if v1 and v2 have the same exponent, as the sum is performed without shifting any mantissa. If they don’t, there are 2 cases (let’s take a mantissa of 4 bits; a numerical cross-check follows at the end of this post):

  • the mantissa of v1 is 1110 (exponent e)
  • the mantissa of a is 1111 (exponent e)
  • the mantissa of v2 is 1000 (exponent e+1)

Before summing, v1 is expressed as 0111 (exponent e+1), and the sum is 1111 (exponent e+1), that is 1111 (exponent e) after division by 2, which is a.

or

  • the mantissa of v1 is 1111 (exponent e)
  • the mantissa of a is 1000 (exponent e+1)
  • the mantissa of v2 is 1001 (exponent e+1)

Before summing, v1 is expressed as 0111 (exponent e+1), and the sum is 10000 (exponent e+1), that is 1000 (exponent e+2) after dropping the last bit, that is 1000 (exponent e+1) after division by 2.

This would not necessarily be the case for a real number model with a base 3 instead of 2 (imagine a ternary computer), even if v1 and v2 have the same exponent. For instance:

  • the mantissa of v1 is 2000 (exponent e)
  • the mantissa of a is 2001 (exponent e)
  • the mantissa of v2 is 2002 (exponent e)

The sum is 11002 (exponent e), that is 1100 (exponent e+1) after dropping the last trit, that is 2000 (exponent e) after division by 2, which is v1, not a.
Or the sum could be 1101 (exponent e+1) if rounded to the nearest, that is either 2001 or 2002 (exponent e) after division by 2, depending on the rounding: it would be a or v2…
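
For the binary case, here is a small cross-check of the argument (a sketch, assuming the default real is IEEE binary32 and double precision is binary64, so that the sum of two nearby reals and its halving are exact in a double):

program midpoint_check
   ! Cross-check: form the midpoint exactly in double precision, round it back
   ! to single, and compare with a and with the all-single-precision expression.
   ! The loop walks across the exponent boundary at 4.0.
   implicit none
   real :: v1, a, v2
   double precision :: exact_mid
   integer :: i
   v1 = nearest(nearest(4.0, -1.0), -1.0)   ! two representable steps below 4.0
   do i = 1, 3
      a  = nearest(v1, 1.0)
      v2 = nearest(a, 1.0)
      exact_mid = (dble(v1) + dble(v2)) * 0.5d0
      print *, real(exact_mid) == a, 0.5*(v1+v2) == a    ! expect T T
      v1 = a
   end do
end program midpoint_check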

1 Like

So it looks like, with a binary model for the reals, 0.5*(v1+v2) (or (v1+v2)/2) is safe in practice. But this is not a demonstration; it just relies on the most likely way the FPU performs the operations.

Maybe v1 + (v2-v1)/2 would be preferable, as it would avoid any precision loss on (v2-v1) when they are close to each other.

And if we want bullet-proof code without any assumptions:

real recursive function safe_mean(v1,v2) result(v0)
   ! returns an approximation of (v1+v2)/2 without overflow/underflow and such that 
   ! * v0 = max(v1,v2) if v2 == v1 or v2 == nearest(v1,v2-v1)
   ! * min(v1,v2) < v0 < max(v1,v2) otherwise
   real, intent(in) :: v1, v2
   real :: a
   if (v2 > v1) then
      a = nearest(v1,1.0)
      if (v2 == a) then
         v0 = v2
      else 
         if (v1 >= 0.0) then
            v0 = v1 + (v2-v1)/2
         else if (v2 <= 0.0) then
            v0 = v2 + (v1-v2)/2
         else
           v0 = (v1+v2)/2
         end if
         v0 = max( v0, a )
         v0 = min( v0, nearest(v2,-1.0) )
      end if
   else if (v2 == v1) then
      v0 = v2
   else
      v0 = safe_mean(v2,v1)
   end if
end function safe_mean
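
A minimal driver to exercise it (hypothetical: it assumes safe_mean is compiled alongside as an external function, hence the interface block):

program try_safe_mean
   ! Hypothetical driver for safe_mean above.
   implicit none
   interface
      real function safe_mean(v1, v2) result(v0)
         real, intent(in) :: v1, v2
      end function safe_mean
   end interface
   real :: v1, v2
   v1 = 4.0
   v2 = nearest(nearest(v1, 1.0), 1.0)
   print *, safe_mean(v1, v2) == nearest(v1, 1.0)   ! expect T
   print *, safe_mean(huge(1.0), -huge(1.0))        ! expect 0.0, no overflow
end program try_safe_mean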
1 Like

There is some ambiguity in this case, but I don’t think it has any consequences. Within the computation of the sum, v1 could also be represented as 1000 with exponent e+1. This would be a round-to-even convention for the unnormalized mantissa. Then the sum (with the same round-to-even convention) and the division would again give a as the result, as before.

This all presumes that the floating point hardware does these operations correctly and consistently. As stated before, the fortran standard does not require that: when there are multiple hardware units that could produce the result, the standard does not require all of them to use the same rounding and truncation conventions, the same expression in different parts of the code might end up on different hardware units, or it might be evaluated after the rounding mode has been modified, etc. In general, a fortran programmer should assume that the last bits of any floating point operation are inexact, and if the code or algorithm requires special care, it should be treated explicitly in the code. Fortran now makes that much easier with intrinsics such as nearest(), spacing(), etc.

1 Like

Expressions like (v2-v1) when v1 and v2 are close are what causes catastrophic loss of accuracy in the first place. However, as you point out, this does avoid overflow when the values are large, say above 0.5*huge().
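
To make the overflow point concrete (a sketch, assuming non-trapping IEEE semantics so the overflowed sum becomes an infinity rather than aborting):

program overflow_demo
   ! Near huge(), the naive midpoint overflows in the intermediate sum while
   ! v1 + (v2-v1)/2 stays finite (assuming non-trapping IEEE overflow).
   implicit none
   real :: v1, v2
   v1 = 0.75*huge(1.0)
   v2 = nearest(nearest(v1, 1.0), 1.0)
   print *, 0.5*(v1+v2)       ! v1+v2 exceeds huge(): prints an infinity
   print *, v1 + (v2-v1)/2    ! stays in range
end program overflow_demo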

Another way to avoid overflow is to scale the input values by a power of 2, do the safe 0.5*(v1+v2) operation, and then scale the result back. Those power of 2 scaling operations, which leave the mantissa unchanged, can be done cheaply with intrinsics like exponent(), scale(), and set_exponent().

It depends on the point of view. That’s true only if v1 and v2 are considered as approximations of actual values, not if they are considered as the actual values.

Then underflows can happen :slight_smile:

Here is the kind of code I had in mind:

program mean2
   ! compute the safe mean of two values that avoids overflow and unnecessary underflow.
   implicit none
   real :: x, y, a, t

   x = 1.0; y = 2.0; t = 1.5
   a = safe_mean2( x, y )
   write(*,*) 'mean=', a, ' diff=', t-a

   x = 0.5*huge(x); y = 0.75*huge(x); t = 0.625*huge(x)
   a = safe_mean2( x, y )
   write(*,*) 'mean=', a, ' diff=', t-a

   x = tiny(x); y = 2.0*tiny(x); t = 1.5*tiny(x)
   a = safe_mean2( x, y )
   write(*,*) 'mean=', a, ' diff=', t-a

   x = huge(x); y = tiny(x); t = 0.5*huge(x)
   a = safe_mean2( x, y )
   write(*,*) 'mean=', a, ' diff=', t-a
contains
   function safe_mean2( x, y ) result(a)
      implicit none
      real :: a
      real, intent(in) :: x, y
      integer :: k
      intrinsic :: max, exponent, scale
      k = max( exponent(x), exponent(y) )
      a = scale( (scale(x,-k) + scale(y,-k)), k-1 )
      return
   end function safe_mean2
end program mean2

That requires two lines of code, after the declarations. As you can see, the only actual floating point operation here is the addition, which is always safe after the scaling operations. Everything else is really just bit-masking and integer shifts within those intrinsics. This code works correctly for gfortran and ifort, but I’m not certain about its standard conformance. First, there is the assumption of binary exponents. Then, when scale(y,-k) underflows in that last test case, this code expects the result to be set to 0.0. Gfortran and ifort both do that, but I don’t think it is required by the standard. Thus, some extra code might be necessary to catch those corner cases, and maybe to generalize to arbitrary radix.

This is indeed an elegant solution to manage the possible overflows/underflows…

In the case the radix is not 2, the code should be

a = scale( (scale(x,-k) + scale(y,-k)), k ) / 2

I don’t think so either… The standard says:

SCALE (X, I)
Result Value. The result has the value X × b^I, where b is defined in 16.4 for model numbers representing values of X, provided this result is representable; if not, the result is processor dependent.

The corner cases are always the ones that require the most code :slight_smile: …

Btw this code does not address my initial concern, and I think it still needs special cases to ensure that the result is always different from x and y as long as y /= x .and. y /= nearest(x,y-x).

If you want your function to work for all floating point numbers then often there is a way, but it gets very complicated (and slow) even for a simple formula like (a+b)/2. The math libraries do something similar for trigonometric functions like sin(x), which are easy to implement if you are willing to restrict yourself to abs(x) < 1e10, but if you want them to work close to huge(x), then it’s VERY complicated, and it makes them slow even for small arguments.

So my recommendation is to just use the simple formula, but ensure your numbers are reasonable. If your numbers get anywhere close to huge(x) then typically it means something is wrong anyway.

1 Like

I agree… However, I started this topic because I have encountered this problem in practice, although nothing was wrong in my approach. I was doing an iterative bisection v0 = (v1+v2)/2, and it was more efficient to let it continue until the interval possibly became as small as the spacing between 2 representable reals. So I wanted to be sure about what was happening at this scale.
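
Roughly, the shape of the loop is something like this (a minimal sketch with a stand-in function f(x) = x*x - 2; the termination test relies on the midpoint behaviour discussed in this thread):

program bisect_to_ulp
   ! Bisection continued until no representable real is left strictly between
   ! the bounds; f(x) = x*x - 2 is just a stand-in for the actual function.
   implicit none
   real :: v1, v2, v0
   v1 = 1.0; v2 = 2.0                       ! f changes sign on [v1,v2]
   do while ( nearest(v1, 1.0) < v2 )
      v0 = 0.5*(v1 + v2)                    ! the midpoint discussed above
      if ( (f(v0) > 0.0) .eqv. (f(v2) > 0.0) ) then
         v2 = v0
      else
         v1 = v0
      end if
   end do
   print *, v1, v2, sqrt(2.0)               ! bracket collapsed to adjacent reals
contains
   real function f(x)
      real, intent(in) :: x
      f = x*x - 2.0
   end function f
end program bisect_to_ulp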

That’s a common case, which I even mention here: schools.rst · GitHub (linked in this thread: Can one design coding rules to follow so that `-ffast-math` is safe?).

You CAN do it your way, but then you have to go the extra length to ensure things behave correctly bit to bit, which I call the IEEE school of thought. This approach has pros and cons. If you want to use the other approach, then just adjust your epsilon to be larger than just nearest spacing, and things should be robust and fast (and accurate enough).

1 Like

This is one of those cases where I wish the standard were a little more specific. As written, if the scale() operation underflows, then the processor could initiate WWIII. I would like the consequences to be a little more confined than that. That clause could mean that the result is undefined in the same way that the equivalent multiplication operation (X*B**I) is undefined when underflow occurs. That is, it could be set to 0.0 (as I would want in the case we are discussing), or it could signal an error. I would not want this operation to result in some random bit pattern with no warning, or start WWIII, or anything like that.

Here is a program that tests the original problem with all possible floating point numbers.

program mean2
   ! compute the safe mean of two values that avoids overflow and unnecessary underflow.
   implicit none
   real :: x, y, t, a
   character(*), parameter :: cfmt='(*(a,1x,es15.8,1x))'

   x = -huge(x)
   t = nearest(x,1.0)
   y = nearest(t,1.0)
   do
      a = safe_mean2(x,y)
      if ( a .ne. t ) then
         write(*,cfmt) 'error for x=', x, 't=', t, 'a=', a, ' y=', y
         error stop 'error detected'
      endif
      if ( y == huge(x) ) exit
      x = t; t=y;
      y = nearest(t,1.0)
   enddo
   write(*,cfmt) 'No errors detected'
contains
   function safe_mean2( x, y ) result(a)
      implicit none
      real :: a
      real, intent(in) :: x, y
      integer :: k
      intrinsic :: max, exponent, scale
      k = max( exponent(x), exponent(y) )
      a = scale( (scale(x,-k) + scale(y,-k)), k-1 )
      return
   end function safe_mean2
end program mean2

I have verified that this works for a few combinations of compilers and with various compiler options. However, with ifort it fails.

error for x= -1.17549449E-38 t= -1.17549435E-38 a=  0.00000000E+00  y= -1.17549421E-38
error detected

This is when y=-tiny(x), and the problem does seem to trace back to corner cases of scale() where the exponent changes from -125 to -126.

gfortran does not fail in this case, but it does seem to step through all of the denormal values between -tiny() and 0.0, which I was not expecting. I was rather expecting nearest(-tiny(x),1.0) to step over the denormals and go right to 0.0.

At least the undefined behavior did not start WWIII when I ran the program.

[edit] A little experimentation reveals the following.

$ ifort -fp-model precise mean2.F90 && a.out
No errors detected

The problem without the option seems to be that the nearest() function moves through the denormal numbers between -tiny(x) and +tiny(x), but the scale() function flushes the denormals to zero. With the option, nearest() does the same thing, but scale() changes its behavior and produces denormal numbers (the correct ones, apparently).

This behavior of nearest() is a little surprising. I would have expected nearest(-tiny(x),1.0) to be 0.0, and then I would expect nearest(0.0,1.0) to be tiny(x). That is, I would have expected nearest() to skip over the denormal numbers.
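
A quick probe of that behaviour (the output is compiler and flag dependent; gfortran without flush-to-zero shows the adjacent denormal rather than 0.0):

program nearest_probe
   ! What does nearest() return just above -tiny() and just above 0.0?
   implicit none
   real :: x
   x = -tiny(x)
   print '(3EX20.8)', x, nearest(x, 1.0), nearest(0.0, 1.0)
end program nearest_probe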

As I said previously, I wish the standard were a little more specific about this case.

Are we supposed to test the exponents every time we use the intrinsic scale() in order to avoid WWIII? I know how to do that, but I don’t know if that is what the standard expects. Are we supposed to test the exponents for the equivalent multiplication operation X*b**I beforehand? I know how to do that too, but is that really what the standard expects programmers to do?
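
Such a test would look something like the following (a sketch; the flush-to-zero and clamp-to-huge policies are just one possible choice, not anything mandated by the standard):

real function guarded_scale(x, i) result(xs)
   ! Scale x by radix**i, but test the exponents first so the intrinsic is
   ! never asked to produce a value outside the normal range of the model.
   implicit none
   real, intent(in) :: x
   integer, intent(in) :: i
   if ( x == 0.0 .or. exponent(x) + i < minexponent(x) ) then
      xs = 0.0                  ! would leave the normal range: flush to zero
   else if ( exponent(x) + i > maxexponent(x) ) then
      xs = sign(huge(x), x)     ! would overflow: clamp (one possible policy)
   else
      xs = scale(x, i)          ! exponent stays within the model: representable
   end if
end function guarded_scale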

In this particular case, I’m not even certain that the result is undefined. I think of the term “undefined” as meaning something more like sqrt(-1.0) or acos(2.0). In this scale() case (or the equivalent multiplication), the result might be defined somewhere in the standard, but I don’t know where. That is, 0.0 might be the defined result of scale() or of the multiplication, and 0.0 is certainly a representable number.

For the code given above, it works as expected on several compilers (various versions of gfortran and nagfor, on intel and arm64 hardware). ifort gets tripped up for values near tiny(x). Right now it isn’t clear to me why. The culprit might be scale() or it might be nearest() or it might be the addition operation. It does not dump core, it just returns an incorrect value. Maybe the expected value would be returned with a different rounding mode? Right now, I don’t know.