Use cases of single-precision real numbers

Could anyone please explain to me in what cases single-precision real numbers are more suitable than double-precision ones?

You may be interested in “mixed-precision computation”. In addition, low-precision training has become a very important idea in machine learning. The literature is vast.

Due to its importance and power, there are hardware and floating-point formats dedicated to such computation, e.g., Google's TPU and BFloat16.

In the above-mentioned topics, the main difficulty is indeed not the precision but the range, both theoretically and computationally: overflow, underflow, Inf, and NaN become frequent.
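
To make the range point concrete, here is a minimal sketch (assuming IEEE arithmetic and the standard iso_fortran_env kinds; half precision is left out because it has no standard kind name) that prints each kind's limits and shows how quickly real32 overflows:

program kind_limits
use, intrinsic :: iso_fortran_env, only : real32, real64
implicit none
real(real32) :: x32
real(real64) :: x64
   ! largest/smallest positive normal values and machine epsilon of each kind
   print *, 'real32: huge =', huge(x32), ' tiny =', tiny(x32), ' eps =', epsilon(x32)
   print *, 'real64: huge =', huge(x64), ' tiny =', tiny(x64), ' eps =', epsilon(x64)
   ! squaring a moderately large real32 value already exceeds huge(x32) ~ 3.4e38;
   ! with default IEEE arithmetic the result is +Inf rather than a usable number
   x32 = 1.0e20_real32
   print *, 'x32**2 =', x32**2
end program kind_limits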

2 Likes

As mentioned, 16-bit floats are coming back in a big way, if they have not already; but even with more conventional computation it is very dependent on the programming environment you are using. Smaller-precision values generally require less memory, generate smaller files (particularly when using binary data), and so are generally quicker to read and write and take less space; they also generate smaller messages when being passed as messages or moved in and out of cache (sometimes), for starters.

A simple program taking some square roots gives evidence that, on the platform I am on, the smaller-sized values are calculated significantly faster. For example:

$ gfortran xx.f90
$ ./a.out
 32:
                time read   :   0.1145  seconds
sum is 10736460772798864.
sum is 0.107380131E+17
 64:
                time read   :   0.2071  seconds
sum is 10736460772798864.
sum is 0.107380131E+17
$ ifx xx.f90
$ ./a.out
 32:
                time read   :   0.0640  seconds
sum is .1073837231757661E+17
sum is .1073831E+17
 64:
                time read   :   0.2030  seconds
sum is .1073837231757661E+17
sum is .1073831E+17
program check
! time the square root of n random values stored as real32 and as real64,
! to compare single- and double-precision throughput
use,intrinsic :: iso_fortran_env, only : sp => real32, dp => real64, int64
implicit none
character(len=*),parameter :: g='(*(g0))'
integer,parameter   :: n = 10000000 !! number of values
real(sp)            :: rval_out(n)=0 , rvals(n)
real(dp)            :: dval_out(n)=0,  dvals(n)
integer(kind=int64) :: start, finish, count_rate
real                :: run_time
integer,parameter   :: first=0, last=huge(0)
   ! fill both arrays with random values in [first,last)
   call random_number(dvals)
   dvals=(last-first)*dvals+first
   call random_number(rvals)
   rvals=(last-first)*rvals+first
   write(*,*)'32:'
   call system_clock(start, count_rate)
   rval_out=sqrt(rvals)   ! single-precision square roots
   call tally()
   write(*,*)'64:'
   call system_clock(start)
   dval_out=sqrt(dvals)   ! double-precision square roots
   call tally()
contains
subroutine tally()
   ! report elapsed time and checksums of the input arrays
   call system_clock(finish)
   run_time=(finish-start)/real(count_rate,dp)
   write(*,'(a30,1x,f7.4,1x,a)') 'time read   : ', run_time, ' seconds'
   write(*,g)'sum is ',sum(dvals)
   write(*,g)'sum is ',sum(rvals)
end subroutine tally
end program check

Given all that, I usually use 64-bit values unless it causes a major
performance limitation :>

2 Likes

The overhead in moving data between processors has a significant effect on the performance of multi-processor systems. We are looking at compressing data for inter-processor communication by removing unnecessary mantissa bits. It is possible to investigate the precision actually required to run codes by emulating the arithmetic - please see: http://simconglobal.com/Testing_the_Numerical_Precisions_Required_to_Execute_Real_World_Programs.pdf . All the emulation codes and machinery to run the tests can be found in the fpt distribution. Please see: fpt Reference: EMULATE REAL ARITHMETIC.
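
As a purely illustrative sketch of the "drop unneeded mantissa bits" idea (this is not the fpt emulation or compression machinery, just one way to picture it), the low bits of the IEEE bit pattern can be masked before a value is transmitted:

module mantissa_trim
! illustrative only: discard the low nbits mantissa bits of an IEEE binary64
! value by masking its 64-bit pattern, e.g. before sending it to another rank
use, intrinsic :: iso_fortran_env, only : real64, int64
implicit none
contains
   elemental function trim_mantissa(x, nbits) result(y)
      real(real64), intent(in) :: x
      integer,      intent(in) :: nbits   ! 0 <= nbits <= 52
      real(real64)             :: y
      integer(int64)           :: bits, mask
      bits = transfer(x, bits)
      mask = not(2_int64**nbits - 1_int64)   ! zero out the low nbits
      y    = transfer(iand(bits, mask), y)
   end function trim_mantissa
end module mantissa_trim

Whether this pays off in practice of course depends on whether the downstream compression or transport layer can exploit the resulting runs of zero bits.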

3 Likes

@Jcollins ,

Good, interesting effort.

But please note that your choice of SEQUENCE types is most questionable, and it will likely hinder adoption.

Given your interest in multi-processor systems and inter-processor communication within the context of Fortran, about the only semantics in the current Fortran standard you can hang your hat on with any semblance of reliability and portability is the facility for interoperability with C companion processor(s), particularly the BIND(C) clause. You mention in your paper the continued presence of COMMON blocks (de facto use of global variables) in existing FORTRAN programs. As I pointed out to you a while ago, the C interoperability facility in the Fortran standard permits, in the context of global variables, mapping C structs to named COMMON blocks via BIND(C). So that is a route to replace the COMMON blocks in legacy FORTRAN codebases with BIND(C) derived types, an effort that can then be extended to inter-processor communication as well.
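
A minimal sketch of what that route can look like (the module, type, component, and binding names below are purely illustrative):

module state_mod
   use, intrinsic :: iso_c_binding, only : c_double, c_int
   implicit none
   ! interoperable replacement for a legacy named COMMON block such as
   !     COMMON /state/ t, p, n
   type, bind(c) :: state_t
      real(c_double) :: t
      real(c_double) :: p
      integer(c_int) :: n
   end type state_t
   ! global instance; C sees a variable named "state" with the same layout
   type(state_t), bind(c, name="state") :: state
end module state_mod

The C side then declares a global struct variable with the matching layout and linkage name, and both languages (or both ends of a communication layer) address the same well-defined object.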

The use of interoperability facilities with C is thus what I suggest you move toward in your investigation while abandoning SEQUENCE types.

Of course the issue with EQUIVALENCE remains. Note that in most cases now the continued need for EQUIVALENCE is questionable, and it is possible to remove it via TARGET-POINTER semantics on MODULE entities. For the few remaining situations, I would advocate a further enhancement of C interoperability in a future standard revision to cover C unions. The requirements would include giving the legacy (and usually nonstandard) FORTRAN codebases an option to readily refactor most of the remaining EQUIVALENCE constructs. Surely almost all these existing codebases will see no change, as they will never be refactored away from COMMON and EQUIVALENCE (given la Résistance to modernization among the FORTRANners), but that is their problem and besides the point.

The studies and investigations such as yours, however, can and should employ the newer facilities and illustrate the potential and value of new ideas and code designs in new or improved applications, such as those involving reduced-overhead inter-processor communication, but without aiding and abetting the continued reliance on COMMON and EQUIVALENCE constructs.

The short answer is that single-precision numbers are useful any time the precision and range fit the problem. Using double precision (or other KINDs with even more bits) to store numbers that do not require that many mantissa bits or that large an exponent range is a waste of memory, disk space, and computer cycles. Also, some hardware, such as many common graphics processing units (GPUs), supports only single-precision floating point or runs double precision at a much lower rate.
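
A small sketch of how that decision is often written down in Fortran: request only the precision and exponent range the problem actually needs and let the compiler pick the kind (on IEEE machines the request below resolves to real32):

program pick_kind
   implicit none
   ! ask for ~6 significant digits and an exponent range of 10**37;
   ! switching the whole code to a wider kind is then a one-line change to wp
   integer, parameter :: wp = selected_real_kind(p=6, r=37)
   real(wp) :: field(1000000)   ! half the memory of the real64 equivalent
   field = 0.0_wp
   print *, 'kind =', wp, '  bytes per element =', storage_size(field(1))/8
end program pick_kind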

4 Likes

In certain chaotic systems, you might find that floating-point numbers lead to fundamental problems. Quoting from the work of Coveney & Highfield [0]:


 digital computers only use a very small subset of the rational numbers—so-called dyadic numbers, whose denominators are powers of 2 because of the binary system underlying all digital computers—and the way these numbers are distributed is highly nonuniform. Moreover, there are infinitely more irrational than rational numbers, which are ignored by all digital computers because to store any one of them, typically, one would require an infinite memory. Manifestly, the IEEE floating-point numbers are a poor representation even of the rational numbers. Recent work by one of us (PVC), in collaboration with Bruce Boghosian and Hongyan Wan at Tufts University, demonstrates that there are major errors in the computer-based prediction of the behaviour of arguably the simplest of chaotic dynamical systems, the generalised Bernoulli map, for single precision floating point numbers. For a subset of values of the model’s solitary parameter, very large errors accrue that cannot be mitigated by any increase in the precision of the numerical representation. For other parameter values, double precision reduces the sizeable errors substantially (Milan Kloewer, private communication with PVC). However, this leaves open the question as to whether double precision floating point numbers are themselves sufficient to handle the far more exquisite complexity of real world molecular dynamics and fluid turbulence, which originate in dynamical systems that are many orders of magnitude more complicated. The spectrum of the unstable periodic orbits of the map is badly damaged regardless of the precision of the floating point numbers [65].
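
A small sketch of the simplest case of this pathology (the plain doubling map with β = 2, rather than the generalised map studied in [65]): in binary floating point every step just shifts the significand, so the computed orbit collapses to exactly 0 within a few dozen iterations, whereas the true orbit of almost every starting point never does.

program bernoulli_collapse
   use, intrinsic :: iso_fortran_env, only : real32, real64
   implicit none
   real(real32) :: xs
   real(real64) :: xd
   integer :: k
   call random_number(xd)
   xd = 0.5_real64 + 0.5_real64*xd    ! start in [0.5,1) so the bit count is bounded
   xs = real(xd, real32)
   do k = 1, 60
      xs = 2.0_real32*xs; xs = xs - int(xs)   ! frac(2*x) in single: dead by k ~ 24
      xd = 2.0_real64*xd; xd = xd - int(xd)   ! frac(2*x) in double: dead by k ~ 53
      if (mod(k,10) == 0) print '(i4,2es16.7)', k, xs, xd
   end do
end program bernoulli_collapse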

I have had the pleasure of listening to Bruce Boghosian at a conference in the past. He gave a talk on inequality and wealth distributions (see “Is Inequality Inevitable?”, published in Scientific American).


[0] Coveney, P. V., & Highfield, R. R. (2021). When we can trust computers (and when we can’t). Philosophical Transactions of the Royal Society A, 379(2197), 20200067.

[65] Boghosian, B. M., Coveney, P. V., & Wang, H. (2019). A new pathology in the simulation of chaotic dynamical systems on digital computers. Advanced Theory and Simulations, 2(12), 1900125.

2 Likes

@FortranFan

Thank you for your comments.

Looking through the emulation code for experiments on precision I found that SEQUENCE derived types are only used in setting up the emulation types:

        TYPE em_real_k4
           SEQUENCE
           REAL(KIND=kr4) value
        END TYPE em_real_k4
!
        TYPE em_real_k8
           SEQUENCE
           REAL(KIND=kr8) value
        END TYPE em_real_k8
!
        TYPE em_complex_k4
           SEQUENCE
           COMPLEX(KIND=kr4) value
        END TYPE em_complex_k4
!
        TYPE em_complex_k8
           SEQUENCE
           COMPLEX(KIND=kr8) value
        END TYPE em_complex_k8
!

In these cases there is only one component in each type, so the keyword SEQUENCE doesn’t do anything. I used SEQUENCE derived types because the emulation technique can be used for other experiments, for example in tracking units and dimensions, and because some compilers will not allow non-sequence types in COMMON or EQUIVALENCE.
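
As a rough sketch of why the single-component wrapper is still useful (this is only an illustration, not the actual fpt emulation code): arithmetic on the wrapped value goes through overloaded operators, where the result can be rounded to whatever precision is being investigated.

        MODULE em_real_demo
        USE,INTRINSIC :: ISO_FORTRAN_ENV, ONLY : kr8 => REAL64
        IMPLICIT NONE
        TYPE em_real_k8
           SEQUENCE
           REAL(KIND=kr8) value
        END TYPE em_real_k8
        INTERFACE OPERATOR(+)
           MODULE PROCEDURE add_k8
        END INTERFACE
        CONTAINS
        ELEMENTAL FUNCTION add_k8(a,b) RESULT(c)
           TYPE(em_real_k8),INTENT(IN) :: a,b
           TYPE(em_real_k8)            :: c
           c%value = a%value + b%value
!          ...round or truncate c%value here to the precision under test...
        END FUNCTION add_k8
        END MODULE em_real_demo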

I agree with your comments about C interoperability. With Jon Power at Sector7 we are exploring mapping COMMON blocks to C structs, as part of a project to translate legacy Fortran to C++.

I believe that unions, using the VMS / HP-UX / ifort MAP and UNION construct, were implemented for Fortran derived types in Salford Software's FTN95. We would welcome something like this in the Standard. We are currently working on a code with data structures of over 10 GB which uses VMS STRUCTURE/MAP/UNION. Without the UNION construct the memory requirement would be much greater and would exceed the available memory on each processor.

Try n*10, i.e. n = 100,000,000 (1.e8), for sum(rvals):
integer,parameter :: n = 100000000 !! too many values for real32
While it can get a quicker answer, an incorrect answer is not very useful.
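
A sketch of what goes wrong at that size, and of the usual fix of accumulating in a wider kind (illustrative code, not the original benchmark): once the running real32 sum reaches roughly 1e16, adding a term of order 1e9 no longer changes it, because real32 carries only about 7 decimal digits.

program sum_accumulator
use, intrinsic :: iso_fortran_env, only : real32, real64
implicit none
integer, parameter :: n = 100000000
real(real32), allocatable :: r(:)
real(real32) :: s32
real(real64) :: s64
integer :: i
   allocate(r(n))
   call random_number(r)
   r = r*real(huge(0), real32)   ! values of order 1e9, as in the example above
   s32 = 0.0_real32
   s64 = 0.0_real64
   do i = 1, n
      s32 = s32 + r(i)                  ! stalls: small terms are rounded away
      s64 = s64 + real(r(i), real64)    ! double accumulator keeps ~16 digits
   end do
   print *, 's32 =', s32
   print *, 's64 =', s64
end program sum_accumulator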

Seriously, it all depends on the accuracy required for the type of calculation being performed.

For my structural finite element analysis, which is based on 3 displacements and 3 rotations for each point, real32 is not adequate, while some field problems with only 3 displacements per point can be adequately solved with real32.

The argument about unnecessary mantissa precision is again dependent on the type of problem. DEC provided an alternative 64-bit real with higher precision but the same exponent range as real32; it has not seen sustained usage.

The argument about storage size (32 vs 64 bits) and computation speed appears to run contrary to the rapid advances in storage capacity and in ever-wider vector instructions.

There are fields of study where real16 has a niche, but many where this is not the case, especially where reliability and accuracy of results are important.

When working with hundreds of TBs of data, 32 bits versus 64 bits makes a huge difference ;). For storage, and for computation speed when the algorithms are memory bound.

I have been pondering these problems. To me, the bottleneck appears to be obtaining the data rather than processing it. Perhaps your answer is newer hardware, DDR5 memory and PCIe 4.0 drives, rather than reduced-precision floating point.
For accessing data, I have been investigating Fortran stream access with 64-bit file addressing for large data sets, although the 6 GByte/sec access speeds are not easy to reproduce for practical problems. Restructuring the data can help when doing multi-pass analysis. (I have not been working with TBs of data!)
Then for processing, OpenMP multi-threading can be an option, although it is difficult to generalise an approach.
Given these hardware options, reduced-precision floating point looks like a very niche case, especially for only a 40% saving. Efficiencies in receiving the data must have more potential than this?

Can be both. When dealing with memory bound algorithms, processing with 32 bits can have a significant impact, both because the RAM-CPU transfers are faster, and because one can have more data in RAM. Of course, more/faster RAM and disks are always desirable and welcome, but at some point the cost has to be considered.

For storage we have to use network drives, so getting the data is limited by the network speed anyway. Actually we store the data either in 32 bits, or in 16 bits whenever possible. We process them in 32 bits most of the time, and sometimes in 64 bits if needed.

With some compilers single precision runs faster than double precision. It also occupies less space.

1 Like

This should happen when your code can sufficiently saturate the hardware: you can then fit twice the quantity of single-precision values across the same memory bandwidth as when using double precision.

Using real32 will generally accelerate your physical model if you don't need too many digits and the model is not too sensitive. Just try it and compare the results with real64.

Of course, with certain computations (even very basic ones) you can get into big trouble. For example, according to Wolfram Alpha:

\sum_{n=1}^{10^9} \frac{1}{n} = 21.3004815023479440166851018489083469661270727335988\ldots

In C, using double I obtain 21.30048150234615: the last three digits are false. If I am an engineer, most of the time I won't care.
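
For Fortran readers, the same experiment can be written as below (a sketch; exact digits depend on the compiler and on summation order), running both precisions at once:

program harmonic
! sum 1/n from n = 1e9 down to 1, i.e. smallest terms first, which is
! already the more accurate ordering; forward summation loses even more
use, intrinsic :: iso_fortran_env, only : real32, real64
implicit none
integer :: n
real(real64) :: sd
real(real32) :: ss
   sd = 0.0_real64
   ss = 0.0_real32
   do n = 1000000000, 1, -1
      sd = sd + 1.0_real64/n
      ss = ss + 1.0_real32/n
   end do
   print *, 'real64:', sd   ! close to the exact value above, last digits wrong
   print *, 'real32:', ss   ! only the order of magnitude survives
end program harmonic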

Using float (32-bit reals), the result is 18.807919. Well, all digits are false… I have just the order of magnitude. Of course, computing the sum only up to 1e7 would give far more acceptable results…


The choice to use single precision is usually driven by having some knowledge of the precision your underlying problem requires, which in turn is usually driven by the types of solution algorithms you use. If you are solving very large CFD problems using an implicit solver, you probably do need double precision if you want to drive the residuals to machine zero. However, if you only need to converge to a level of some fraction of a drag count (a drag count is a 0.0001 increment in drag coefficient), then you might only need single precision. There is an old saying in the aerospace community that something is “close enough for Government work”, implying you use just enough accuracy to meet Gov't contract requirements etc. In a lot of preliminary design scenarios, being “close enough for Government work” can be obtained with single (or maybe even half) precision floating-point math.

If you use PGPLOT for graphics then your real numbers must be single precision. So I do a lot of calculation in double and then convert to single for plotting. (I know that package is old, but if it ain’t broke, don’t fix it…)
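
A sketch of that compute-in-double, plot-in-single pattern, assuming the usual PGLINE(n, x, y) call, which takes default REAL arrays (device setup such as PGOPEN/PGENV is omitted here):

subroutine plot_curve(n, x, y)
   use, intrinsic :: iso_fortran_env, only : real32, real64
   implicit none
   integer,      intent(in) :: n
   real(real64), intent(in) :: x(n), y(n)   ! results computed in double precision
   real(real32) :: xs(n), ys(n)             ! single-precision copies for PGPLOT
   xs = real(x, real32)
   ys = real(y, real32)
   call pgline(n, xs, ys)
end subroutine plot_curve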

Just from my own experience: when I do MHD simulations and analyze the results afterwards, if I have a 256x256x256-point file (which cannot be considered a huge sampling nowadays; AMR can provide additional resolution only on top of that) and 9 variables (3 components of B, 3 components of v, rho, and p), then even with 32-bit floats the output file will be ~600 MB. I need to analyze different time steps and produce various visualizations (I use VisIt for that, which can do it in parallel).
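
For reference, that figure is just the product: 256^3 points x 9 variables x 4 bytes ≈ 0.6 GB per snapshot, so the same snapshot in double precision would be about 1.2 GB.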

To be honest, the whole process of inspection (not calculation) is not fast on my relatively fast PC. With double precision it would be even more problematic.

I am also doing CFD simulation, and I am wondering whether single precision can be used. Our group's current approach is to use double-precision floating-point numbers directly for sufficient accuracy, but doing so makes the calculation time very long. And when performing turbulence simulations, a large amount of data needs to be exported, which results in long export times and large file sizes. However, if single precision is used, non-physical phenomena can occur due to rounding errors, as in the following article: https://doi.org/10.1016/j.compfluid.2009.06.001. Whether single precision can be used is indeed a big problem, at least for the CFD direction I am engaged in, especially with the trend of porting algorithms to GPUs.

1 Like

You might find the following journal article by Witherden and Jameson informative.

https://arc.aiaa.org/doi/10.2514/1.J058434

Preprint versions of the article can be found here

https://aero-comlab.stanford.edu/Papers/witherden_jameson_aiaaj_2019.pdf

A conference presentation can be found here

Their conclusion is that single precision can be used for LES simulations.

1 Like