Single, Double or Mixed Precision?

We write a large quantum mechanics code in Fortran. It solves the Kohn-Sham equations of density functional theory and runs on hundreds or thousands of cores. Speed is very much a priority for us when coding.

In our ongoing quest to get it to run faster and for larger systems, we’ve started converting parts of it to single-precision (SP). This helps because SP data requires less memory bandwidth and more SP operations can be carried out per SIMD instruction.

Now here’s the question: given that our methods involve approximations which yield accuracies far lower than the precision of double-precision (DP) floating point numbers, why should we use DP at all? Why not convert the entire code to SP?

I think there are a few potential reasons. First and foremost is stability. The code relies heavily on fixed-point methods, and converting to SP may mean that the code never converges to the fixed point. The same goes for time evolution: we’ve noticed that long runs (~100,000 time steps) can give different final states depending on numerical noise in the initial state.
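
As a toy illustration of the convergence issue (not our actual solver, just the scalar fixed point of x = cos(x)):

program fixed_point_demo
  implicit none
  real(8) :: xd
  real(4) :: xs
  integer :: it
  xd = 1.0d0
  xs = 1.0
  do it = 1, 200                   ! enough iterations for both to stagnate
    xd = cos(xd)
    xs = cos(xs)
  end do
  print *, 'DP fixed point:', xd
  print *, 'SP fixed point:', xs
  print *, 'difference:    ', abs(xd - real(xs,8))
end program fixed_point_demo

The SP iterate agrees with the DP one only to roughly SP round-off, so any convergence tolerance tighter than that can never be satisfied once the code is in SP.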

Another reason is that, even though the methods are approximate, it can still be useful to consider small changes in a variable which may have a large absolute value. (For example, calculating the change in total energy of a system with respect to the orientation of a magnetic field gives changes of order 1.e-3 on top of values of ~10,000.) This can be done only if most of the variables are DP.
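
A minimal sketch of why this is out of reach in SP (the numbers here are invented, but the spacing argument is general):

program small_difference
  implicit none
  real(8) :: e1, e2
  real(4) :: e1_sp, e2_sp
  e1 = 10000.0d0
  e2 = 10000.0d0 + 1.0d-3          ! change of order 1.e-3 on top of ~10,000
  e1_sp = real(e1,4)
  e2_sp = real(e2,4)
  print *, 'DP difference:', e2 - e1
  print *, 'SP difference:', real(e2_sp,8) - real(e1_sp,8)
  print *, 'SP spacing near 1.e4:', spacing(e1_sp)   ! about 9.8e-4
end program small_difference

The spacing of SP numbers near 10,000 is already about 1.e-3, so the quantity of interest sits at (or below) one SP unit in the last place.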

So how about mixed precision, i.e. careful use of both SP and DP so that the overall stability is not affected? For example, I can see that using DP as a cumulative variable when evaluating sums with SP summands may be a good compromise, but only if the sum is an average. Thus

d=0.d0
do i=1,n
  d=d+s(i)
end do

where s is SP, would yield only SP accuracy in the DP variable d. However,

d=0.d0
do i=1,n
  d=d+s(i)
end do
d=d/n

should average out some of the error.
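
A small experiment along these lines (sizes and data are made up, and the exact numbers will depend on the compiler flags, since e.g. -Ofast may reorder the accumulation):

program accum_test
  implicit none
  integer, parameter :: n = 10000000
  real(8), allocatable :: s_dp(:)
  real(4), allocatable :: s(:)
  real(4) :: sum_sp
  real(8) :: sum_mix, sum_dp
  integer :: i
  allocate(s_dp(n), s(n))
  call random_number(s_dp)
  s = real(s_dp, 4)                  ! each element now carries ~1e-7 relative error
  sum_sp = 0.0
  sum_mix = 0.d0
  sum_dp = 0.d0
  do i = 1, n
    sum_sp  = sum_sp  + s(i)         ! SP accumulator: accumulation error grows with n
    sum_mix = sum_mix + s(i)         ! DP accumulator: exact accumulation of SP summands
    sum_dp  = sum_dp  + s_dp(i)      ! full DP reference
  end do
  print *, 'relative error, SP accumulator:', abs(sum_sp  - sum_dp)/sum_dp
  print *, 'relative error, DP accumulator:', abs(sum_mix - sum_dp)/sum_dp
end program accum_test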

There also seems to be a speed penalty in using mixed-precision arithmetic. Consider the following code:

program test
real(8) a
real(4) b,c
read(*,*) a,b
c=a*b
print *,c
end program

Compiling with Intel Fortran and -Ofast, the resulting assembly is

...
        lea       (%rsp), %rdi                                  #11.1
        cvtss2sd  96(%rdi), %xmm0                               #9.5
        mulsd     64(%rdi), %xmm0                               #9.4
        lea       88(%rsp), %r8                                 #11.1
        cvtsd2ss  %xmm0, %xmm0                                  #11.1
...

The SP variable b is converted to DP, multiplied by the DP variable a, and the result is converted back to SP to be stored in c. This is slower than if a, b and c were exclusively SP or DP. It seems that x86 hardware has no instructions for handling mixed precision without conversion.

Does anyone have any experience on the pitfalls of (partially or entirely) lowering precision in a large, complicated code?


I can’t recall ever seeing an instruction set that provided mixed-mode arithmetic. I suggest that you familiarize yourself with Fortran’s rules for mixed-mode expressions (generally, the lower-precision operand is extended to the higher precision, then the operation is performed).
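
A quick way to see that rule in action (a small sketch; the kind value printed is processor-dependent, 8 with most compilers):

program promotion_rule
  implicit none
  real(8) :: a
  real(4) :: b
  a = 1.0d0/3.0d0
  b = 1.0/3.0
  print *, a*b         ! b is promoted to DP and the multiply is done in DP
  print *, kind(a*b)   ! the kind of the mixed expression is the DP kind
end program promotion_rule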

I would also very strongly recommend against making use of mixed-mode arithmetic. Some compilers will issue warnings if they see you doing it, as the results may be surprising. If you want to do lowering, please explicitly use the REAL intrinsic with the target kind to down-convert before combining with a lower precision operand. Using your last example, instead of:

c=a*b

write

c = real(a,kind(b))*b

I’d also ask that you not use hardcoded kind values - see my rants on this at Doctor Fortran in “It Takes All KINDs” - Doctor Fortran (stevelionel.com)
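
For example, a minimal sketch combining both suggestions, using named kind constants from the intrinsic iso_fortran_env module instead of hardcoded 4 and 8 (the sp/dp aliases are just one common convention):

program kinds_demo
  use, intrinsic :: iso_fortran_env, only: sp => real32, dp => real64
  implicit none
  real(dp) :: a
  real(sp) :: b, c
  a = 3.0_dp
  b = 2.0_sp
  c = real(a, kind(b))*b     ! explicit down-conversion, then an all-SP multiply
  print *, c
end program kinds_demo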


Yes, it does help save some memory, but it also complicates development, testing, and validation of the solver.

If you were storing small integers, then I might’ve agreed, but floating-point inaccuracies can be pretty hard to track, mainly because they’ll show up in production, not in testing.

You’ll save some memory, but is that tradeoff worth it for your team?

RAM is extremely cheap, but developer time, and time required for validation of your code, is not.

It is significantly cheaper to spend $50,000 on a new compute cluster than to pay $90,000 for multiple developers to verify and validate your code.

The cost of complexity is recurring, because any change you make may introduce a regression bug that goes unnoticed until a user finds it in production.


RAM is extremely cheap, but developer time, and time required for validation of your code, is not.

The issue is not the amount of RAM (there’s generally way more than we need on the clusters we use); it’s bandwidth. Our code (like all similar electronic structure codes) is memory-bound. Switching to single precision gives a large speed-up (at least a factor of two) because, even though floating-point operations take only a few clock cycles, accessing off-CPU RAM can cost hundreds.


I think the best answer for your specific case would be to see whether other such open-source codes use single, double, or mixed precision.

If other similar open-source codes are not using mixed precision, they might have good reasons to avoid it.

The main reasons to avoid it would be complexity and potential inaccuracy.

But if you decide that mixed precision is okay for your case, you may benefit from reading the data as single precision, then converting to double precision, and computing in double precision.

I think that’s the default behavior of Fortran, but I’m not sure.

Doing this would speed up data transfer from RAM to cache, but you may no longer be able to use SIMD on the mixed-precision data, because mixed-precision SIMD instructions may not be available on the CPU.
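
A minimal sketch of that storage scheme, assuming the big array can live in SP and only cache-sized blocks are promoted to DP for the arithmetic (names and block size are invented):

program sp_storage_dp_compute
  implicit none
  integer, parameter :: n = 1000000, nb = 4096
  real(4), allocatable :: s(:)          ! large array stored in SP to halve the bandwidth
  real(8) :: work(nb), total
  integer :: i0, m
  allocate(s(n))
  call random_number(s)
  total = 0.d0
  do i0 = 1, n, nb
    m = min(nb, n - i0 + 1)
    work(1:m) = real(s(i0:i0+m-1), 8)   ! promote one block to DP
    total = total + sum(work(1:m)**2)   ! all arithmetic in DP
  end do
  print *, total
end program sp_storage_dp_compute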

CFD is also heavily memory-bound for us, but luckily, using roofline modeling, we’ve been able to do more work per byte read from memory, which improves our code’s performance.

For us, though, throughput and the distributed scalability/performance of the code matter more than single-node latency.

That’s why we only use double precision.

I tried to use mixed precision before, because StarCCM does mixed precision, but I realized that I couldn’t guarantee accuracy, so I decided not to.

There are several examples of models being considerably sped up by a partial or total transition to single precision in the field of numerical weather prediction and climate modeling. Have a look at these articles and the references therein:


Maybe you can find something useful in these articles.

I also found a few clues that mixed precision is supported in libxsmm, Extended BLAS (see Chapter 4 at the previous link or this presentation), and MAGMA.


Moritz Lehmann (now working at Intel) has looked at using mixed precision in lattice Boltzmann solvers:

While the reduction in bandwidth does improve performance, it indeed limits the accuracy you can achieve, as seen from this graph:

For this test case - laminar flow in a circular channel - only the FP32 and FP64 solvers were able to reach an error lower than 1% (which is arguably sufficient if you know there are other modelling errors that are larger). This is a relatively simple case; I imagine that in a multiphase flow or a combustion problem there would be parameters with wildly different magnitudes, and you may encounter convergence problems sooner rather than later.

At least for LBM, if you use distribution shifting (a change of variables that helps gain one extra digit of precision), single precision appears to be sufficient for many problems. I’d still remain careful if the simulation has to be performed over long time periods (the time convergence is only first order…).
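
A tiny sketch of what distribution shifting buys you (the perturbation size here is invented; the gain depends on how small it is compared to the lattice weight):

program shift_demo
  implicit none
  real(8), parameter :: w  = 4.0d0/9.0d0           ! a D2Q9 rest weight
  real(8), parameter :: df = 1.234567890123d-6     ! physically relevant perturbation
  real(8) :: f
  real(4) :: f_sp, fshift_sp
  f = w + df
  f_sp      = real(f, 4)        ! store the full population in SP
  fshift_sp = real(f - w, 4)    ! store only the shifted part in SP
  print *, 'perturbation from unshifted SP:', real(f_sp, 8) - w
  print *, 'perturbation from shifted SP:  ', real(fshift_sp, 8)
  print *, 'exact perturbation:            ', df
end program shift_demo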

I saw a similar work published recently on mixed precision in OpenFOAM, but haven’t had time to read it yet:

In a previous thread, I mentioned some works which find that for chaotic systems, even double precision may not be enough: Use case of single-precision real number - #7 by ivanpribec


I used mixed precision solvers in the early 1990s, when workstation memory was more expensive. We worked within the framework of Iterative Refinement, which has some theoretical support.

The solution vector needed around 10-12 decimal digits, so single precision wasn’t adequate but we didn’t require full double precision. The problems were non-linear, so we were iterating towards a solution and could calculate a residual in extended precision and solve for the correction in single precision. The material properties were not well known, so single precision was sufficient for the coefficient matrices.
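
A minimal sketch of the idea for a dense linear solve (LAPACK LU here purely for illustration, not the solvers we actually used; the matrix, size and tolerance are made up):

program mixed_ir
  implicit none
  integer, parameter :: n = 100, maxit = 10
  real(8), parameter :: tol = 1.0d-12
  real(8) :: a(n,n), b(n), x(n), r(n)
  real(4) :: a_sp(n,n), rhs_sp(n)
  integer :: ipiv(n), info, it, i

  call random_number(a)
  do i = 1, n
    a(i,i) = a(i,i) + n            ! make A diagonally dominant (well conditioned)
  end do
  call random_number(b)

  a_sp = real(a, 4)                ! factor the SP copy of A once
  call sgetrf(n, n, a_sp, n, ipiv, info)
  if (info /= 0) stop 'sgetrf failed'

  rhs_sp = real(b, 4)              ! initial SP solve
  call sgetrs('N', n, 1, a_sp, n, ipiv, rhs_sp, n, info)
  x = real(rhs_sp, 8)

  do it = 1, maxit                 ! DP residual, SP correction solve, DP update
    r = b
    call dgemv('N', n, n, -1.0d0, a, n, x, 1, 1.0d0, r, 1)   ! r = b - A*x in DP
    if (maxval(abs(r)) < tol*maxval(abs(b))) exit
    rhs_sp = real(r, 4)
    call sgetrs('N', n, 1, a_sp, n, ipiv, rhs_sp, n, info)   ! A*dx = r in SP
    x = x + real(rhs_sp, 8)
  end do

  print '(a,i0,a,es10.2)', 'iterations: ', it, '  max residual: ', maxval(abs(r))
end program mixed_ir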

We used a range of solvers, including full and sparse Cholesky decomposition and restarted GMRES. We never encountered any numerical issues, but we did check the condition number of the problem during development.

I have no idea if it is worth the effort with modern hardware, but people are still publishing in the area. For example, search “mixed precision gmres”.


In my field most of the computations are performed in single precision, as the data are generally not known beyond 3 digits of precision, and because of the huge data volumes. Only selected parts of the computations are performed in double precision, where needed.

A recent talk on the topic of mixed precision computing by Theo Mary (Sorbonne Univ.):

https://hpc.fau.de/2023/11/27/nhr-perflab-seminar-mixed-precision-computing-an-overview-december-12-online/

Interesting presentation and paper. The hardware is new but some of the concepts haven’t changed. Almost a shame that I don’t need to use it.

These same comments apply also to data compression methods in general, and to lossy compression methods for floating point values in particular. This applies to i/o operations, say to an external disk, and even more so to a slower networked file system, and also to parallel message passing of floating point data between compute nodes. Using single precision or newer half-precision formats is an example of a lossy floating point data compression from this perspective. The choice of compression algorithm depends on how your data is accessed, random by individual elements, or by blocks, and so on, and once uncompressed, how often is it reused.

This post has a nice historical survey of low precision.

Is Low Precision Arithmetic Safe?
by Wayne Joubert
6 February 2024

The popularity of low precision arithmetic for computing has exploded since the 2017 release of the Nvidia Volta GPU. The half precision tensor cores of Volta offered a massive 16X performance gain over double precision for key operations. The “race to the bottom” for lower precision computations continues: some have even solved significant problems using 1-bit precision arithmetic hardware ([1], [2]). And hardware performance is getting even better: the Nvidia H100 tensor core-enabled FP16 is a full 58X faster than standard FP64, and 1-bit precision is yet another 16X faster than this, for total speedup of over 900X for algorithms that can use it [3].


Here is an example of using mixed precision in a physics calculation (the transfer of radiation through Earth’s atmosphere). When I read it some months ago I was impressed by how careful the authors had been about matching precision to the character of the calculation.