Working with reals near the machine precision

There is a lot of discussion in this post about errors in v1-v2 (already 22 posts), but surely the question should be:
why do you continue to use this machine precision?

Back in the good ol’ days, there were real*10 registers, which probably made some of us lazy, but increasing the precision is a simple option!

SSE reduced the accuracy of many calculations, with little complaint?
The F90 standard compounded the problem.

This isn’t really a discussion about precision. No matter what KIND of real is used, there are always issues of overflow, underflow, and how denormal numbers are treated. All of this discussion could have been based on REAL64 arithmetic or even REAL128 arithmetic.

For one example, I think all of the IEEE floating point KINDS have the property that 1.0_wp/huge(1.0_wp) is a denormal number. Sometimes it is flushed to zero, sometimes it is treated as a denormal. I think that fact surprises most people the first time they see it happen.
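
For illustration, here is a quick check of that claim for REAL32 (my own snippet, not from the original posts; the same pattern works for REAL64 or REAL128):

program check_inv_huge
   use, intrinsic :: iso_fortran_env, only: wp => real32
   implicit none
   real(wp) :: x
   x = 1.0_wp / huge(1.0_wp)
   ! 1/huge is smaller than tiny, so it is either a denormal or flushed to
   ! zero, depending on the processor and the compiler options
   print *, 'tiny(1.0)    = ', tiny(1.0_wp)
   print *, '1.0/huge     = ', x
   print *, 'below tiny ? = ', (x > 0.0_wp .and. x < tiny(1.0_wp))
end program check_inv_huge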

There is an established history for how overflows and underflows are treated in practice. Most programmers expect overflows to cause a program to abort, or at least to signal an exception, while they expect underflows to be flushed to zero and to not abort. Gradual underflows have that gray area where they don’t flush but they have reduced precision, and then they eventually flush to zero when they are below the smallest denormal. Apart from the IEEE signals that are part of the fortran standard (but are not required to be supported), there is nothing that I know of in the standard that specifies this behavior. Please correct me if I’m wrong on this.
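
As a rough sketch of what the standard does provide when the processor supports it, the IEEE intrinsic modules let a program clear and query the underflow flag explicitly instead of relying on whatever the default behavior is. This is only an illustration, and whether the flag is actually signaled, or gradual underflow is supported, remains processor dependent:

program probe_underflow
   use, intrinsic :: ieee_arithmetic
   use, intrinsic :: ieee_exceptions
   implicit none
   real :: t, x
   logical :: flag
   call ieee_set_flag(ieee_underflow, .false.)
   t = tiny(1.0)
   x = t / 3.0            ! result is below tiny(1.0) and inexact
   call ieee_get_flag(ieee_underflow, flag)
   print *, 'result               = ', x
   print *, 'underflow signaled ? = ', flag
   print *, 'gradual underflow ?  = ', ieee_support_denormal(1.0)
end program probe_underflow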

Ron, I don’t agree. The examples that have been presented in this thread use default real. Issues of nearly equal values are more related to the analysis method, so they don’t really scale to different kinds.
This is one of the main issues when choosing the real precision for the analytical method being used. Ever since 1975 I have been using real(8) (actually “real*8”), except for a short time in the ’80s and ’90s when real(10) was a viable option for accumulators.
Although there has been a considerable increase in model size over all these years, due to increased memory and calculation rates, there has not been a significant change in reported modelling error. The suitability of 64-bit floating point precision has not varied significantly (which has been an interesting outcome). I use a direct solver for structural engineering finite element analysis, where I do an error re-calculation. This error is dominated by the modelling approach, not by precision failure.

Essentially, I think that the option of increased real precision is available for most analysis problems, although probably at a prohibitive computational cost. There may be a few problems where increased precision is not available, but these must be rare. Reading about what Voyager I and Voyager II continue to achieve suggests to me that REAL32 and REAL64 will have many more years of relevance.

The addition is alright in this expression (no danger of overflow or underflow), but then the subsequent scale() operation will overflow for large x and y, which is what we are trying to avoid. I’m not sure what the best answer is in this case (i.e. with a hexadecimal exponent radix). Maybe moving the multiplication inside would be the best approach?

a = scale( 0.5*(scale(x,-k) + scale(y,-k)), k )

I don’t have access to a hexadecimal floating point machine to test, but to my eye, this looks safe. Of course, scaling by 0.5 on a hexadecimal machine can lose one bit of accuracy, unlike in the binary case where the mantissa is unchanged, but I don’t think there is any workaround for that.

This issue might also arise for decimal encoded floating point. Machines that implement IEEE decimal floating point now exist, but I have no experience with them.
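
For reference, a self-contained version of the expression above might look like the sketch below. This is my own reconstruction, not necessarily the exact code posted earlier in the thread, and in particular the choice of k as the larger of the two exponents is an assumption on my part:

! Overflow-safe midpoint: scale both operands down by k, average, then scale
! back up.  With k taken as the larger of the two exponents, both scaled
! values have magnitude below 1, so the intermediate sum cannot overflow.
elemental function midpoint(x, y) result(a)
   real, intent(in) :: x, y
   real :: a
   integer :: k
   k = max(exponent(x), exponent(y))
   a = scale(0.5*(scale(x,-k) + scale(y,-k)), k)
end function midpoint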

Well, it is.

Note what OP writes, “I was doing an interative bisection v0 = (v1+v2)/2 , and it was more efficient to let it continue until possibly an interval as small as the spacing between 2 representable reals”

But then a statement such as “it was more efficient to let it continue until possibly an interval as small as the spacing between 2 representable reals”, which is very difficult to understand without full context, makes it appear to be all about precision. Because, as presented, it throws out everything else, such as numerical analysis, and reduces to working with floating-point types near the limits of machine precision. For it is very difficult, within the realm of numerical analysis, to fathom an “interative bisection” and a need to “let it continue until possibly an interval as small as the spacing between 2 representable reals.”

Yes, it involves precision, along with issues of overflow, and underflow, but my reply was that the points in this discussion would apply regardless of whether the posted code was REAL32, REAL64, or REAL128. It is not unusual at all to converge some equation solution to machine precision. For example, I generally expect all of the fortran intrinsics (e.g. sqrt(), sin(), log(), etc.) to be accurate to machine precision regardless of the KIND. I also expect intrinsics like hypot() to be computed without overflow or underflow, even when the straightforward way to compute that function might fail. Those same considerations also carry over to other user-written functions, particularly library functions where the eventual application might not even be known when the function is written. Bisection, which is the method being discussed in this thread, is just one approach to achieve that level of accuracy, but the approaches being discussed, involving exponent() and scale(), are broader and more useful generally than just for bisection.
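
As one illustration of that broader usefulness, here is a sketch (my own, and certainly not how the hypot() intrinsic is actually implemented) of using exponent() and scale() to avoid spurious overflow in a hypot-style calculation:

! sqrt(x**2 + y**2) would overflow when x or y is large even though the true
! result may be representable; scaling by the larger exponent avoids that.
elemental function safe_hypot(x, y) result(h)
   real, intent(in) :: x, y
   real :: h, sx, sy
   integer :: k
   if (x == 0.0 .and. y == 0.0) then
      h = 0.0
   else
      k  = exponent(max(abs(x), abs(y)))
      sx = scale(x, -k)            ! |sx| and |sy| are now below 1
      sy = scale(y, -k)
      h  = scale(sqrt(sx*sx + sy*sy), k)
   end if
end function safe_hypot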

Oops, of course you’re right!

Knuth also has an algorithm to compute the mean of a floating point array safely without overflow. It involves computing the running mean using weights such as real(i)/real(i+1) for each term. When applied to the 2-element vector case and simplified by hand, I think it ends up being

a = 0.5*x + 0.5*y

which of course is correct, but it involves two multiplications and an addition and it does not protect against unnecessary underflow.
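
For what it’s worth, a rough sketch of the running-mean idea applied to a whole array might look like this (my own reconstruction from memory, not Knuth’s exact formulation):

! Running mean with weights real(i-1)/real(i): the accumulator never exceeds
! maxval(abs(a)) in magnitude, so the accumulation cannot overflow.
function running_mean(a) result(m)
   real, intent(in) :: a(:)
   real :: m
   integer :: i
   m = 0.0
   do i = 1, size(a)
      m = m*(real(i-1)/real(i)) + a(i)/real(i)
   end do
end function running_mean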

The solutions being discussed here protect against both overflow and unnecessary underflow. They also use fortran intrinsics like exponent() and scale(), which many other languages do not have, so this discussion also displays some of the advantages of fortran for numerical calculations.

It was a kind of exercise: find the k-th value in a (very large) unsorted real array, without moving the elements and without duplicating the array (or a part of it). This excludes a “quickselect” approach. I ended up with a bisection algorithm (ok, “iterative” is implied). In the worst case scenario (*) it can end with an interval without any other representable real between the bounds.

(*) this worst case scenario arises when

  • v2 = nearest(v1,1.0)
  • count(a(:) <= v1) < k
  • count(a(:) <= v2) > k

(and in this case the k-th value is v2)
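
To make the setting concrete, a bare-bones version of this value-space bisection might look like the sketch below. It is my own illustration rather than the actual code behind these posts, and it uses the simple midpoint rather than the overflow-safe variants discussed above:

! Bare-bones k-th smallest value by bisection on the value range: maintain
! count(a <= v1) < k and count(a <= v2) >= k, and stop when no representable
! real lies strictly between the two bounds.
function kth_smallest(a, k) result(v)
   real, intent(in)    :: a(:)
   integer, intent(in) :: k
   real :: v, v1, v2, mid
   v1 = nearest(minval(a), -1.0)    ! count(a <= v1) = 0 < k
   v2 = maxval(a)                   ! count(a <= v2) = size(a) >= k
   do while (nearest(v1, 1.0) < v2)
      mid = v1 + 0.5*(v2 - v1)
      if (mid <= v1 .or. mid >= v2) mid = nearest(v1, 1.0)   ! guard the bounds
      if (count(a <= mid) >= k) then
         v2 = mid
      else
         v1 = mid
      end if
   end do
   v = v2
end function kth_smallest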

An indexed quicksort will identify the k-th value, as it would the median value. I don’t know an easier way, as the k-th value is dependent on all other values. (I have also wondered if there was an easier way to find the median)

“indexed” means creating a temporary array for the index, of the same size as the input array. The constraint of the exercise was to not do that. And the bisection turns out to be quite an efficient solution.

Quicksort is indeed simple, but it moves the elements of the input array. The complexity is O(N*log(N))

Quickselect is a very simple variant of Quicksort, designed specifically to extract the k-th value. As it only partially sorts the array, it is much faster than Quicksort. The complexity is O(N) on average, but can be O(N**2) in the worst case (there apparently exist variants that ensure O(N) complexity in all cases).

The bisection approach also has O(N) complexity, and it doesn’t move the elements of the input array. In my tests it is faster than Quickselect in practice, unless the statistical distribution of the elements of the array is extremely non-uniform.

This is probably the approach I would have taken, but instead of reordering the elements in-place, I would have reordered the elements of an index vector. It was not clear from your problem description that an index vector was not allowed.

I would have thought that a binary search would scale as N*log(N). Since you are not rearranging the elements of the array, or using an index vector, you must scan the entire array each pass, right?

Unnecessary underflows would occur for any x or y that has the same exponent as tiny(x), which is roughly the range tiny(x)<=x<r*tiny(x) for exponent radix r. So it’s not just a single number, it is a whole range of numbers that are problematic. And further, those underflows are unnecessary: for any pair of nonzero normal x and y (of the same sign, at least), the mean will also be normal. Something like 0.5*(0.0+tiny(x)) is still an exception. The fortran code posted previously handles these cases correctly.
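
A tiny demonstration of that range, assuming IEEE single precision (my own snippet; with flush-to-zero enabled the halved terms are lost entirely):

program underflow_demo
   implicit none
   real :: x, y
   x = 1.5 * tiny(1.0)   ! same exponent as tiny(1.0): the problematic range
   y = x
   print *, 'tiny(1.0)     = ', tiny(1.0)
   print *, '0.5*x         = ', 0.5*x          ! denormal: may flush to zero
   print *, '0.5*x + 0.5*y = ', 0.5*x + 0.5*y  ! exact here, but 0 under flush-to-zero
   print *, '0.5*(x + y)   = ', 0.5*(x + y)    ! every intermediate stays normal
end program underflow_demo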

The cases that ifort did not handle correctly involved denormal numbers, and with the appropriate compiler option, even those were handled correctly. I’m satisfied with that behavior because denormals take more effort than normals, and if you don’t need special treatment or gradual underflow, it is alright with me to flush them to zero.

The other issue was about how portable the code is. That issue is clouded because the treatment of denormals is left as processor dependent. If both nearest() and scale() treat denormals the same way, then everything works as expected. The ifort compiler doesn’t do that without an option. I did not know that before this discussion thread.

You’re right, my bad. My bisection approach is basically an O(N*log(N)) algorithm under the constraints I gave (no element displacement and no extra memory allocation).

Nonetheless, O(N) can be obtained by relaxing the initial constraints and allowing a small amount of extra allocation: for instance, once the bisection has narrowed the search down to less than 1/10th of the initial number of elements, I may allocate a new array for the remaining elements. So 10% more memory is traded for the O(N) complexity.

Also, the maximum recursion depth is limited anyway. For instance, if all the elements are between 1.0 and 2.0, since there are “only” 2**23 different single precision reals in this interval, there will be at most 23 recursion levels, even if the number of elements is much larger than 2**23 (so at this point the complexity turns to O(N)).


You might be interested in the following article:

How do you compute the midpoint of an interval?
Frédéric Goualard

https://hal.science/hal-00576641v2

Paul

Somebody made a heroic effort to ‘get this right’ in C++ and presented on it at CppCon: [youtube link]
