We write a large quantum mechanics code in Fortran. It solves the Kohn-Sham equations of density functional theory and runs on hundreds or thousands of cores. Speed is very much a priority for us when coding.
In our ongoing quest to make it run faster and handle larger systems, we've started converting parts of it to single precision (SP). This helps because SP data requires half the memory bandwidth, and twice as many SP operations can be carried out per SIMD instruction.
Now here’s the question: given that our methods involve approximations which yield accuracies far lower than the precision of double-precision (DP) floating point numbers, why should we use DP at all? Why not convert the entire code to SP?
I think there are a few potential reasons. First and foremost is stability. The code relies heavily on fixed-point methods, and converting to SP may mean the iteration never converges: once the SP rounding noise exceeds the convergence tolerance, the residual stalls instead of decreasing. The same goes for time evolution: we've noticed that long runs (~100,000 time steps) can give different final states depending on numerical noise in the initial state.
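As a toy illustration of the stalling problem (the map x → cos(x) and the iteration count are invented for the demo; this is not our actual fixed-point cycle), here is the same iteration run in both precisions:

```fortran
program fixed_point_demo
  ! Toy fixed-point iteration x = cos(x) in SP and DP.
  ! The map and iteration count are illustrative, not from the real code.
  implicit none
  integer :: i
  real(4) :: xs
  real(8) :: xd

  xs = 1.0
  xd = 1.0d0
  do i = 1, 200
    xs = cos(xs)
    xd = cos(xd)
  end do

  ! DP reaches the root to near machine precision; SP cannot resolve it
  ! beyond ~7 significant digits, so a convergence test tighter than that
  ! would loop forever in SP.
  write(*,*) 'SP residual |x - cos(x)|:', abs(xs - cos(xs))
  write(*,*) 'DP residual |x - cos(x)|:', abs(xd - cos(xd))
  write(*,*) 'SP vs DP root difference:', abs(real(xs,8) - xd)
end program fixed_point_demo
```

The DP iterate lands on the root 0.7390851332… to roughly 1e-16, while the SP iterate carries an error set by SP roundoff, which is exactly the situation where an SP run with a DP-era tolerance never converges.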
Another reason is that, even though the methods are approximate, it can still be useful to resolve small changes in a variable with a large absolute value. (For example, calculating the change in total energy of a system with respect to the orientation of a magnetic field gives changes of order 1e-3 on top of values of ~10,000.) That is a change in the seventh significant digit, right at the limit of SP's ~7-digit resolution, so it can be resolved only with most of the variables in DP.
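To make that concrete, here is a minimal demonstration (the energy values are invented for illustration) of taking a 1e-3 difference on top of a ~10,000 baseline in both precisions:

```fortran
program energy_diff_demo
  ! Illustrative only: e1, e2 stand in for total energies at two
  ! field orientations, differing by 1e-3 on a baseline of 10^4.
  implicit none
  real(4) :: e1_sp, e2_sp
  real(8) :: e1_dp, e2_dp

  e1_sp = 10000.0
  e2_sp = 10000.001        ! rounds to the nearest SP value; ulp here is ~9.8e-4
  e1_dp = 10000.0d0
  e2_dp = 10000.001d0

  write(*,*) 'SP difference:', e2_sp - e1_sp   ! quantized to multiples of the ulp
  write(*,*) 'DP difference:', e2_dp - e1_dp   ! 1e-3 to many significant digits
end program energy_diff_demo
```

In SP the spacing between representable numbers near 10,000 is about 9.8e-4, comparable to the change being measured, so the computed difference is off by percent-level amounts and can only take a handful of discrete values; in DP the difference comes out essentially exact.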
So how about mixed precision, i.e. careful use of both SP and DP so that the overall stability is not affected? For example, I can see that accumulating SP summands in a DP variable may be a good compromise, but only if the sum is an average. Thus

```fortran
d = d + s
```

where s is SP would yield only SP accuracy in the DP variable d, since each summand already carries its own SP rounding error. However,

```fortran
d = d + s/n
```

(with the result an average over n terms) should average out some of the error.
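A quick experiment along these lines (the value 0.1 and n = 10^7 are arbitrary choices for the demo) shows how much of the damage comes from the accumulator versus the summands:

```fortran
program accumulator_demo
  ! Sum n copies of an SP value two ways. The exact answer is 1.0e6,
  ! up to the SP rounding of 0.1 itself, which no accumulator can undo.
  implicit none
  integer, parameter :: n = 10000000
  integer :: i
  real(4) :: s, sum_sp
  real(8) :: sum_dp

  s = 0.1                ! already carries an SP rounding error (~1.5e-9 relative)
  sum_sp = 0.0
  sum_dp = 0.0d0
  do i = 1, n
    sum_sp = sum_sp + s            ! SP accumulator: rounding error grows with n
    sum_dp = sum_dp + real(s, 8)   ! DP accumulator: only the SP error of s is left
  end do

  write(*,*) 'SP accumulator:', sum_sp
  write(*,*) 'DP accumulator:', sum_dp
end program accumulator_demo
```

The SP accumulator drifts badly once the running sum is so large that adding 0.1 is rounded to a multiple of the sum's ulp, while the DP accumulator stays within the SP rounding of the input data itself.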
There also seems to be a speed penalty in using mixed-precision arithmetic. Consider the following code:

```fortran
program mixed
  implicit none
  real(8) :: a
  real(4) :: b, c

  a = 3.0d0
  b = 2.0

  c = a*b

  write(*,*) c
end program mixed
```
Compiling with Intel Fortran and -Ofast, the resulting assembly is
```
lea       (%rsp), %rdi        #11.1
cvtss2sd  96(%rdi), %xmm0     #9.5
mulsd     64(%rdi), %xmm0     #9.4
lea       88(%rsp), %r8       #11.1
cvtsd2ss  %xmm0, %xmm0        #11.1
```
The SP variable b is converted to DP (cvtss2sd), multiplied with the DP variable a (mulsd), and the product converted back to SP (cvtsd2ss) to be stored in c. This is slower than if a, b and c were all SP or all DP: x86 has no instructions that operate on mixed SP/DP operands, so the compiler must insert explicit conversions.
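One way to keep such implicit conversions out of hot loops is to make each kernel uniform in precision, selecting the working precision through a single kind parameter. The module and names below are just a sketch of that pattern, not our actual code:

```fortran
module precision_mod
  ! Sketch: flip wp between sp and dp in one place, so whole kernels
  ! stay uniform in precision and no cvtss2sd/cvtsd2ss conversions
  ! are generated inside the arithmetic.
  implicit none
  integer, parameter :: sp = kind(1.0e0)
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: wp = sp   ! working precision for this kernel
end module precision_mod

program kind_demo
  use precision_mod
  implicit none
  real(wp) :: a, b, c

  a = 3.0_wp      ! all variables and literals share one kind,
  b = 2.0_wp      ! so no mixed-precision expressions appear
  c = a*b
  write(*,*) c
end program kind_demo
```

Changing wp to dp (or making it a parameter of a templated kernel via include files) then lets the same source run in either precision, which also makes SP-vs-DP stability experiments much cheaper to carry out.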
Does anyone have any experience on the pitfalls of (partially or entirely) lowering precision in a large, complicated code?