Using single-precision for faster calculation

septc · May 13, 2022, 8:24pm

Dear all,

I have recently been struggling to make my program faster because it has been spending a long time to get a result. So for speed-up, I have parallelized the most time-consuming part using OpenMP. This worked very well and gave a nice speedup. Now I am considering the use of single precision (32-bit) for the time-consuming part, because my program is written entirely in double precision (64-bit). So, I am wondering whether there are some remarks that I should keep in mind for this kind of calculations.

For example, I guess it is typical to change the precision parameter from “dp” to “sp” throughout the program, but are there some code patterns for which it is definitely better to keep double precision? (I imagine that this may be the case for result variables for summing data, but not very sure.) Alternatively, is it also common to use 32-bit only for the time-consuming part, while keeping the entire program as 64-bit?

More generally, although I have used only 64-bit up to now, is it rather common to perform an entire calculation in 32-bit? I think neural-net calculations use lower-precisions for speed-up, but wondering what kind of other applications typically use 32-bit (partly or entirely). I think this depends greatly on the nature of calculations (whether the calculation is deterministic and needs high precision, or it is stochastic and inherently subject to random noise).

So I would really appreciate it if you share any insight about such calculations or point to any related pages / explanations that I should read beforehand.
Thanks very much in advance!

Harper · May 13, 2022, 9:44pm

Depends on whether you need the 64-bit accuracy. Try your calculation both ways with the same data and see how much accuracy 32 bits loses. Of course 64-bit calculations seldom give 64-bit accuracy in the results. I once had an ill-conditioned algorithm that gave negative values of some outputs that had to be positive with 32 bits, a problem that 64 bits cured, results differing in the 4th significant figure from 64 bits with 80 bits, but much further down with 128. (Hurrah for gfortran offering all 4 kinds.)

rwmsu · May 13, 2022, 9:54pm

I think its an issue of knowing the algorithms and if PDE’s the equation sets you are solving. Basically, if you are solving systems of equations via matrices do they become ill conditioned and highly impacted by roundoff errors. An example is using pseudo-inverses and normal equations to solve a least squares problem. I have for 40 plus years used double precision exclusively in all the CFD and FEM codes I’ve written or worked with. A recent AIAA journal article has me rethinking the everything double precision approach though. See (if you have access)

Witherden and Jameson, “Impact of Number Representation on High-Order Implicit Large Eddy Simulation”, AIAA Journal, Vol 58, no. 1, January 2020 (DOI:10.2514/1.J058434)
https://arc.aiaa.org/doi/full/10.2514/1.J058434

The focus of this article is doing high order turbulence large-eddy simulations on GPU’s. This is a very demanding and time-consuming problem. The author’s show that for a standard test problem (Taylor-green vortex) the speed up gained from running single precision vs double on an NVIDIA K40c is around 2-3 with no major loss of accuracy.

Their final conclusions were:

"For all test cases, the use of single precision was found to have
little impact on the overall accuracy of the simulations. This suggests
that, for many real-world test cases, single precision arithmetic is
indeed sufficient. For bandwidth bound simulations, speedup factors
between 1.4 and 2.9 were observed, hence resulting in a greatly
reduced time to solution."

Since you can get usable single precision performance out of mid to high end commodity graphics cards, this opens up more options for speeding up simulations without spending 10K on NVIDIAs high end card just to get double precision.

mecej4 · May 13, 2022, 11:40pm

An example of a widely used software package that uses single precision throughout, which ran into problems after X86/x64 CPUs shifted from x87 (80 bit reals internally) to SSE2 instructions (32 or 64 bit reals):

Bugs in SWMS2D and SWMS3D.

Beliavsky · May 14, 2022, 12:18am

If you switch to single precision you will see if the results produced are sensible. You could also set IEEE flags for underflow and overflow, which I tweeted about:

Fortran 2003 introduced an ieee_arithmetic module with elemental functions ieee_is_nan, ieee_is_finite, ieee_is_negative, and ieee_is_normal, which test for those conditions. They have a real argument and logical RESULT.

program test_ieee_arithmetic
use, intrinsic :: ieee_arithmetic
implicit none
character (len=20) :: fmt = "(a16,*(l9))"
integer, parameter :: wp = kind(1.0) ! same output for wp = kind(1.0d0)
real(kind=wp)      :: z,vec(6),NaN,t
z = 0.0_wp
t = tiny(z)
vec = [z/z,z/1.0_wp,-z/1.0_wp,1.0_wp/z,-1.0_wp/z,t/10]
print*,ieee_value(0.0,ieee_positive_inf),1.0/z ! 2 ways of geting +Inf
NaN = ieee_value(0.0,ieee_quiet_nan) ! get NaN
print*,"NaN == NaN?",NaN == NaN ! demonstrate that NaN /= NaN
print*
print "(19x,*(a9))","0.0/0.0","0.0/1.0","-0.0/1.0","1.0/0.0","-1.0/0.0","tiny/10"
print fmt,"ieee_is_nan"     ,ieee_is_nan(vec)
print fmt,"ieee_is_negative",ieee_is_negative(vec)
print fmt,"ieee_is_finite"  ,ieee_is_finite(vec)
print fmt,"ieee_is_normal"  ,ieee_is_normal(vec) 
end program test_ieee_arithmetic
! gfortran output:
!         Infinity         Infinity
! NaN == NaN? F
!
!                      0.0/0.0  0.0/1.0 -0.0/1.0  1.0/0.0 -1.0/0.0  tiny/10
!      ieee_is_nan        T        F        F        F        F        F
! ieee_is_negative        F        F        T        F        T        F
!   ieee_is_finite        F        T        T        F        F        T
!   ieee_is_normal        F        T        T        F        F        F

Call ieee_set_halting_mode() of F2003 module ieee_exceptions to set the floating point conditions (NaN, overflow, underflow, divide-by-zero, inexact) that will halt a program. Similar compiler options include gfortran -ffpe-trap=invalid ifort -fpe0.

! A reference is https://www.nag.com/nagware/np/r70_doc/ieee_arithmetic.html
program test_ieee_exceptions
use :: ieee_exceptions, only: ieee_set_halting_mode, ieee_invalid, &
       ieee_overflow, ieee_underflow, ieee_divide_by_zero, ieee_inexact
implicit none
real :: a, b, c, d, e, f, g
logical, parameter :: halt_program = .true.
call ieee_set_halting_mode([ieee_divide_by_zero, ieee_underflow, &
     ieee_overflow, ieee_invalid],halting=halt_program)
! causes halt if divide_by_zero, underflow, overflow, or NaN conditions occur
! remove element from array constructor to have program continue with that condition
f = 1.0
print*,"computing 1.0/0.0"
print*,f/0.0
e = tiny(e)
print*,"tiny =",e,", computing tiny/10"
print*,e/10
d = huge(d)
print*,"huge =",d,", computing huge**2"
print*,d**2
a = 0.0
b = 0.0
print*,"a, b=",a,b
print*,"computing c = a/b"
c = a/b
print*,"c=",c
! since floating point roundoff is ubiquitous, probably
! do not want the following mode
call ieee_set_halting_mode(ieee_inexact, halting=halt_program)
print*,"computing 2.0/3.0"
g = 2.0/3.0
print*,"2.0/3.0 =",g
end program test_ieee_exceptions
! output on Windows with ifort -traceback ieee_exceptions.f90
!  computing 1.0/0.0
! forrtl: error (73): floating divide by zero
! Image              PC                Routine            Line        Source             
! ieee_exceptions.e  00007FF6EBBF112A  MAIN__                     14  ieee_exceptions.f90

oscardssmith · May 14, 2022, 1:32am

It’s also worth pointing out that single precision is generally only faster if you have vectorized computation.

mecej4 · May 14, 2022, 1:58am

If a computation is related to solving a discretized PDE, the overhead of transferring large arrays from and to memory can be very significant, even with SSE/AVX.

I remember a 2-D transient groundwater/thermal/solute simulation program for which we prepared single and double precision versions and made comparison runs. We concluded that, except in one or two subroutines where means and standard deviations were accumulated, single-precision was sufficient, and ran almost twice as fast as the double precision version.

If an eigenvalue is being calculated, on the other hand, double precision may be essential. I remember a program that calculated rotor vibrations, and double precision was essential.

As others have remarked, “it depends”, and some effort should be spent in investigating what precision is required when a simulation run takes several minutes or hours to run a single case.

Beliavsky · May 14, 2022, 2:12am

For a program slightly modified from Speed of array intrinsics - #3 by Beliavsky, printing the real kind and setting it to either single or double precision, compiling with ifort -Ofast on WSL2, I do find it to be twice as fast to compute the min, max, and mean of a large 1D array in single precision than double precision (except for an algorithm that is slow in both cases), and more than twice as fast to compute exponentials in single than double precision for a large array.

 wp =           8
 n =   100000000

    time                               task
  0.1932                           x=exp(x)
  0.1159               minval(x), maxval(x)
  0.0628                         min_max(x)
  0.0627          min_max_explicit_shape(x)
  0.2164                   min_max_pairs(x)
  0.2168                   min_max_local(x)
  0.1543       minval(x),maxval(x),sum(x)/n
  0.0652                    min_max_mean(x)
  0.0653     min_max_mean_explicit_shape(x)

real	0m1.876s
user	0m1.800s
sys	0m0.061s

wp =           4
 n =   100000000

    time                               task
  0.0859                           x=exp(x)
  0.0588               minval(x), maxval(x)
  0.0313                         min_max(x)
  0.0313          min_max_explicit_shape(x)
  0.2289                   min_max_pairs(x)
  0.2055                   min_max_local(x)
  0.0774       minval(x),maxval(x),sum(x)/n
  0.0316                    min_max_mean(x)
  0.0315     min_max_mean_explicit_shape(x)

real	0m1.199s
user	0m1.110s
sys	0m0.072s

code

module min_max_mod
implicit none
integer, parameter :: wp = kind(1.0)
contains
pure function min_max(x) result(y)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y(2)
integer                   :: i,n
n = size(x)
if (n < 1) return
y(1) = x(1)
y(2) = y(1)
do i=2,n
   if (x(i) < y(1)) y(1) = x(i)
   if (x(i) > y(2)) y(2) = x(i)
end do
end function min_max
!
pure function min_max_explicit_shape(n,x) result(y)
integer      , intent(in) :: n
real(kind=wp), intent(in) :: x(n)
real(kind=wp)             :: y(2)
integer                   :: i
if (n < 1) return
y(1) = x(1)
y(2) = y(1)
do i=2,n
   if (x(i) < y(1)) y(1) = x(i)
   if (x(i) > y(2)) y(2) = x(i)
end do
end function min_max_explicit_shape
!
pure function min_max_pairs(x) result(y)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y(2)
integer                   :: i,n
logical                   :: odd
n = size(x)
if (n < 1) return
if (n == 1) then
   y(1) = x(1)
   y(2) = x(2)
   return
end if
odd = mod(n,2) == 1
if (x(2) > x(1)) then
   y(1) = x(1)
   y(2) = x(2)
else
   y(1) = x(2)
   y(2) = x(1)
end if
do i=3,n-1,2
   if (x(i+1) > x(i)) then
      if (y(1) > x(i))   y(1) = x(i)
      if (y(2) < x(i+1)) y(2) = x(i+1)
   else
      if (y(1) > x(i+1)) y(1) = x(i+1)
      if (y(2) < x(i))   y(2) = x(i)
   end if
end do
if (odd) then
   y(1) = min(y(1),x(n))
   y(2) = max(y(1),x(n))
end if
end function min_max_pairs
!
pure function min_max_local(x) result(y)
! same as min_max_pairs but sets 
! xi to x(i) and xi1 to x(i+1)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y(2)
real(kind=wp)             :: xi,xi1
integer                   :: i,n
logical                   :: odd
n = size(x)
if (n < 1) return
if (n == 1) then
   y(1) = x(1)
   y(2) = x(2)
   return
end if
odd = mod(n,2) == 1
if (x(2) > x(1)) then
   y(1) = x(1)
   y(2) = x(2)
else
   y(1) = x(2)
   y(2) = x(1)
end if
do i=3,n-1,2
   xi1 = x(i+1)
   xi  = x(i)
   if (xi1 > xi) then
      if (y(1) > xi)   y(1) = xi
      if (y(2) < xi1) y(2) = xi1
   else
      if (y(1) > xi1) y(1) = xi1
      if (y(2) < xi)   y(2) = xi
   end if
end do
if (odd) then
   y(1) = min(y(1),x(n))
   y(2) = max(y(1),x(n))
end if
end function min_max_local
!
pure function min_max_mean(x) result(y)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y(3)
integer                   :: i,n
n = size(x)
if (n < 1) return
y(1)   = x(1)
y(2:3) = y(1)
do i=2,n
   if (x(i) < y(1)) y(1) = x(i)
   if (x(i) > y(2)) y(2) = x(i)
   y(3) = y(3) + x(i)
end do
if (n > 0) y(3) = y(3)/n
end function min_max_mean
!
pure function min_max_mean_explicit_shape(n,x) result(y)
integer      , intent(in) :: n
real(kind=wp), intent(in) :: x(n)
real(kind=wp)             :: y(3)
integer                   :: i
if (n < 1) return
y(1)   = x(1)
y(2:3) = y(1)
do i=2,n
   if (x(i) < y(1)) y(1) = x(i)
   if (x(i) > y(2)) y(2) = x(i)
   y(3) = y(3) + x(i)
end do
if (n > 0) y(3) = y(3)/n
end function min_max_mean_explicit_shape
end module min_max_mod
!
program xmin_max_exp
use min_max_mod
implicit none
integer :: i,n
integer, parameter :: nt = 10, nlen = 30, ncol_extremes = 5, ncol_mmm = 3
integer                    :: icol
real(kind=wp), allocatable :: x(:)
real(kind=wp)              :: extremes(2,ncol_extremes),t(nt),xmin_max_mean(3,ncol_mmm)
character (len=nlen) :: labels(nt-1) = [character(len=nlen) :: &
   "x=exp(x)","minval(x), maxval(x)","min_max(x)","min_max_explicit_shape(x)", &
   "min_max_pairs(x)", &
   "min_max_local(x)","minval(x),maxval(x),sum(x)/n","min_max_mean(x)", &
   "min_max_mean_explicit_shape(x)"]
character (len=*), parameter :: fmt_cr = "(a20,*(1x,f16.12))"
logical :: print_results
n = 10**8
allocate (x(n))
print*,"wp =",wp
print*,"n =",n
call random_number(x)
print_results = x(1) > 1.0_wp ! .false.
call cpu_time(t(1))
x = exp(x)
call cpu_time(t(2))
extremes(:,1) = [minval(x),maxval(x)]
call cpu_time(t(3))
extremes(:,2) = min_max(x)
call cpu_time(t(4))
extremes(:,3) = min_max_explicit_shape(n,x)
call cpu_time(t(5))
extremes(:,4) = min_max_pairs(x)
call cpu_time(t(6))
extremes(:,5) = min_max_local(x)
call cpu_time(t(7))
xmin_max_mean(:,1) = [minval(x),maxval(x),sum(x)/n]
call cpu_time(t(8))
xmin_max_mean(:,2) = min_max_mean(x)
call cpu_time(t(9))
xmin_max_mean(:,3) = min_max_mean_explicit_shape(n,x)
call cpu_time(t(10))
if (print_results) then
   do icol=1,ncol_extremes
      print fmt_cr,"min, max:",extremes(:,icol)
   end do
   do icol=1,ncol_mmm
      print fmt_cr,"min, max, mean:",xmin_max_mean(:,icol)
   end do
end if
print "(/,a8,5x,a30)","time","task"
print "(f8.4,5x,a30)",(t(i+1)-t(i),trim(labels(i)),i=1,nt-1)
end program xmin_max_exp

vmagnin · May 14, 2022, 8:22am

I had thought for a long time that there was no time difference between single and double precision on a 64 bits CPU, remembering having verified that with simple arithmetic computations.

It’s only discussing here on the Discourse that I realized it was generally false with math functions
If they are implemented with series or some converging algorithm, we easily understand that less terms or iterations will be needed to attain the maximal precision possible with real32.

The impact on the results of using real32 instead of real64 will depend a lot on the algorithms used in your problem, on the length of computations, etc. And on the precision you need, an engineer building a car does not need the name number of digits as a quantum scientist.

JohnCampbell · May 14, 2022, 9:41am

It depends on the type of computation.

In Structural analysis, we solve the large set of simultaneous linear equations:
Force = K . disp
where K is a large 2d matrix, and Force and disp are vectors.
K = sum of Ke (smaller) matrices for all the “elements” in the system. This requires higher precision, as the values accumulated, (ie sum over all elements of each Kij) is adding Kij’s of significantly different magnitudes, due both the significantly different stiffnesses of elements and also as there can be 6 degrees of freedom for each point/node.
Basically K is assembled by adding numbers where significant round-off can occur.
Mixing displacements and rotations/first derivitives typically significantly increases the round-off effects, requiring higher precision accumulators.
This is solved using a “direct” solver for the whole system, as apposed to an itterative solver for localised systems, where the numbers combined are (typically) not as affected by round-off.
Poor round-off can also be described as ill-conditioned.

Itterative solutions can be less affected by round-off where the solution is for many local problems, although boundary solutions can introduce round-off.
Itterative solutions can also correct for round-off as they re-solve/adjust each local problem.

It all depends on the type of problem and solution approach that can minimise round-off when values are combined/added.

Analysing a 3-dimensional homogeneous solid with only 3 displacement degrees of freedom per node could probably be solved with 4-byte reals, but shell models that mix plane stress and bending stiffness require 8-byte reals, with appropriate meshing to not introduce ill-conditioned equations, or “poor” results.
There are not many requests for analysing a 3-dimensional homogeneous solid !!

vmagnin · May 14, 2022, 5:49pm

So if an algorithm is for example based essentially on matrix arithmetical computations, single precision should not be faster than double precision.
Do you agree?

Beliavsky · May 14, 2022, 6:25pm

I am finding that matmul is about twice as fast for single vs. double precision. Toggling wp in the code

program xmatmul
implicit none
integer, parameter :: n = 1000, wp = kind(1.0), niter = 100
integer :: iter
real(kind=wp) :: a(n,n),b(n,n),c(n,n)
logical, parameter :: call_matmul = .true.
print*,"wp, n, niter, call_matmul = ",wp, n, niter, call_matmul
do iter=1,niter
   call random_number(a)
   call random_number(b)
   if (call_matmul) then
      c = matmul(a,b)
      if (c(n,n) > 1000.0_wp) print*,c(n,n) ! prevent matmul from being optimized away
   else
      if (max(a(n,n),b(n,n)) > 1000.0_wp) print*,a(n,n) ! prevent random_number calls from being optimized away
   end if
end do
end program xmatmul

and compiling with ifort -Ofast and running on WSL2 gives

 wp, n, niter, call_matmul =            8        1000         100 T

real	0m14.623s
user	0m14.597s
sys	0m0.010s

 wp, n, niter, call_matmul =            4        1000         100 T

real	0m7.452s
user	0m7.436s
sys	0m0.000s

The calls to random_number take a small fraction of the time.

rwmsu · May 14, 2022, 6:39pm

Does anyone know how to quantify potential speedup due to improved use of cache and stack (assuming that actually happens) using single vs double precision. My thinking (and I’ll gladly admit I might be wrong) is that with single precision more of the problem will fit into cache in any given computation cycle. Sort of like the speedup you see in parallel codes when you chop the problem up into small enough pieces that most of the problem will fit into cache on each processor

vmagnin · May 14, 2022, 8:23pm

Thanks @kargl and @Beliavsky
that’s interesting. Of course with big matrices, cache is important. And probably it takes also more time to copy a double precision array.

I have also a question concerning the math functions. In my assembler book, I see that sqrt(), cos(), and sin() can be implemented in the FPU. Are they implemented in two versions? And does the double and single computations take different time? Does libm uses them or have its own implementation? (in that case why?)

RonShepard · May 14, 2022, 9:46pm

I think you will find that this is a complicated and also an interesting subject area. Concerning the accuracy of the results, you will find that there are may sources of errors in numerical calculations. In linear algebra, for example, there are matrix condition numbers. In differential equations, there is positive and negative feedback. In recursions, there are conditions where errors increase with each term and there are conditoins where errors decrease with each term. I remember seeing a recursive algorithm once where the initial guess was set to zero, which means there were no significant digits at all in the initial guess, yet the recursion was stable and produced results to full precision at the end. That seemed like magic to me at the time. In quadrature methods, the number of data points required depends on the precision of the results. In polynomial approximations, there is truncation error that depends on the number of expansion terms. Generally one wants to tune the algorithm so that the truncation error is somewhere between the intrinsic data errors and the numerical roundoff error. There is much research now involving “half-precision” floating point. These algorithms use 16-bit floating point in order to speed up critical parts of algorithms where only two or three significant digits are required in the intermediate results. Of course, errors in these cases need to be kept “local” so that they do not propagate and contaminate other parts of the algorithm, so this is tricky to do. Over time, much of this is also driven by hardware capabilities. 10-15 years ago when GPU calculations where first being developed, they typically worked with either fixed-point or 32-bit floating point arithmetic. Thus research was done to identify which parts of algorithms could be sped up by using the low-precision GPU hardware, while still maintaining full-precision final results. Now, GPUs also do 64-bit floating point arithmetic. In the past, I worked on machines that had 36-bit single precision floating point. When codes were ported from those machines to 32-bit computers, it was sometimes found that they needed to be done in 64-bit double precision to maintain their previous accuracy. Apparently, the acceptance threshold was somewhere in between 32- and 36-bits of accuracy. I’m pretty sure that in my field of quantum chemistry, almost everything we do would be just fine with something like 48-bit floating point. There is little hardware or software support for that these days, so we settle for 64-bit floating point instead.

Finally, as a practical matter, you don’t want to go through your codes and change back and forth between D and E exponents, or between “sp” and “dp” kind parameters. You want to write your code with something neutral, like “wp”, and then change the whole shebang by just defining what the “wp” kind value is in one place. This is one of the big advantages of modern fortran over other languages, so don’t just ignore that capability. Then regarding things like polynomial truncation errors, you can monitor the magnitude of the intermediate terms with fortran intrinsics like nearest(), epsilon(), precision(), and so on.

FortranFan · May 14, 2022, 10:52pm

You might find this paper, a fairly recent one, to be of interest:
https://peerj.com/articles/cs-778.pdf

“ABSTRACT
It is well established that reduced precision arithmetic can be exploited to accelerate
the solution of dense linear systems. Typical examples are mixed precision algorithms
that reduce the execution time and the energy consumption of parallel solvers for
dense linear systems by factorizing a matrix at a precision lower than the working
precision. Much less is known about the efficiency of reduced precision in parallel
solvers for sparse linear systems, and existing work focuses on single core experiments.
We evaluate the benefits of using single precision arithmetic in solving a double
precision sparse linear system using multiple cores. We consider both direct methods
and iterative methods and we focus on using single precision for the key components
of LU factorization and matrix–vector products. Our results show that the anticipated
speedup of 2 over a double precision LU factorization is obtained only for the very
largest of our test problems. We point out two key factors underlying the poor speedup.
First, we find that single precision sparse LU factorization is prone to a severe loss of
performance due to the intrusion of subnormal numbers. We identify a mechanism that
allows cascading fill-ins to generate subnormal numbers and show that automatically
flushing subnormals to zero avoids the performance penalties. The second factor is
the lack of parallelism in the analysis and reordering phases of the solvers and the
absence of floating-point arithmetic in these phases. For iterative solvers, we find
that for the majority of the matrices computing or applying incomplete factorization
preconditioners in single precision provides at best modest performance benefits
compared with the use of double precision. We also find that using single precision
for the matrix–vector product kernels provides an average speedup of 1.5 over double
precision kernels. In both cases some form of refinement is needed to raise the single
precision results to double precision accuracy, which will reduce performance gains.”

CRquantum · May 15, 2022, 5:36am

For CPU, changing double precision to float precision may have some addition risks. Perhaps improving the algorithm may lead to more speedup.

For GPU such as Gefore RTX or Quadro, their hardware are designed to do float precision. So for those card using float precision is typical.
You can also use very high end card such as GP100 whose hardware is designed to do double precision.

septc · May 15, 2022, 6:55pm

Thanks very much for a lot of info! I really appreciate it!!

And by reading the comments, I have noticed that my understanding / assumption of single-precision may be basically wrong… Specifically, when I wrote the above question, I assumed that replacing double precision (DP) by single precision (SP) would give nearly x2 speed-up, regardless of whether the numerical result is okay or not. But now I suppose this may not be the case depending on cases (according to the comments, thanks!). More specifically,

For latest CPUs, is the speed of basic arithmetic operations (like addition and multiplication) possibly similar between SP and DP…?
If so, does the speed difference of SP and DP arise more from different efficiency of data transfer (between slow memory and CPU) and better use of cache, rather than arithmetic operations inside CPU?
Does this mean that SP codes may not give speedup if they involve indirect / random access of slow memory (e.g. using an index table) rather than more linear access (suitable for “vectorization”)?
Also, if the code involves sin etc, the speed difference may also depend significantly on the underlying math libraries?

To check the first point above, I have tried the following code (arith1.F90 and trig1.F90), which involves only arithmetic or trigonometric calculations (with no large arrays).

!! arith1.F90
    integer(int64) :: k
    real(rp) :: ans, x

    ans = 0
    do k = 1, 10 ** 9
        x = real( k, rp )
        ans = ans + 1 / ( x * x )
    end do
    print *, "ans = ", ans

!! trig1.F90
    integer(int64) :: k
    real(rp) :: ans, x

    ans = 0
    do k = 1, 10 ** 8
        x = ( 2 * mod( k, 2 ) - 1 ) / real( k, rp )
        ans = ans + sin( x ) * cos( x )
    end do
    print *, "ans = ", ans

main.F90 (driver code to include arith1.F90 etc for SP and DP)

module test_mod
use iso_fortran_env, only: int64, sp => real32, dp => real64
implicit none
contains

subroutine test_sp()
#define rp sp
#ifdef arith1
#include "arith1.F90"
#endif
#ifdef trig1
#include "trig1.F90"
#endif
#undef rp
end

subroutine test_dp()
#define rp dp
#ifdef arith1
#include "arith1.F90"
#endif
#ifdef trig1
#include "trig1.F90"
#endif
end
end module

program main
    use test_mod
    integer(int64) :: t0, t1, trate

    call system_clock( t0, trate )
    call test_sp()
    call system_clock( t1 )
    print *, "time(sp): ", real( t1 - t0, dp ) / trate

    call system_clock( t0, trate )
    call test_dp()
    call system_clock( t1 )
    print *, "time(dp): ", real( t1 - t0, dp ) / trate
end

Then the result is like this:

MacMini2012, Core-i7, 2.3 GHz, 1333 MHz DDR3 <-- very old

 $ gfortran-10 -O3 -Darith1 main.F90 
 ans =    1.64472532    
 time(sp):    2.4510630000000000     
 ans =    1.6449340578345750     
 time(dp):    4.9258379999999997  <-- DP x2 slower

 $ gfortran-10 -O3 -Dtrig1 main.F90 
 ans =   0.209808990    
 time(sp):    4.1077310000000002     
 ans =   0.20980945474697082     
 time(dp):    2.1485820000000002   <-- DP x2 faster (?)

Xeon E5-2650 v2 @ 2.60GHz, DDR3-1600 <-- a bit old

 $ gfortran-10 -O3 -Darith1 main.F90 
 ans =    1.64472532    
 time(sp):    2.1299402210000000     
 ans =    1.6449340578345750     
 time(dp):    4.1305806760000001   <-- DP x2 slower

$ gfortran-10 -O3 -Dtrig1 main.F90 
 ans =   0.209808990    
 time(sp):   0.51460836600000004     
 ans =   0.20980945474697082     
 time(dp):    1.7247241419999999    <-- DP x3 slower

M1 Mac 3.22 GHz, LPDDR5 (2021) <-- newer

$ gfortran-10 -O3 -Darith1 main.F90 
 ans =    1.64472532    
 time(sp):   0.95877999999999997     
 ans =    1.6449340578345750     
 time(dp):   0.93008999999999997     <-- DP similar to SP

$ gfortran-10 -O3 -Dtrig1 main.F90
 ans =   0.209808990    
 time(sp):    3.5496300000000001     
 ans =   0.20980945474697082     
 time(dp):   0.82566200000000001    <-- DP x4 faster (???)

So, the results vary very much depending on machines, but for newer CPUs SP may not have speed advantage if memory access is not involved, possibly? I hope the above tests are not doing something strange…
(I will add more tests using array accesses later.)

rwmsu · May 15, 2022, 8:04pm

To me the issue is not can you use single instead of double on a CPU but on current GPUs. There are $600 desktop graphics cards that can do 10 + TFLOPS in FP32 and 800 + Gflops in FP64 so if you have a code that can map efficiently to a GPU going with single precision can have a hugh payoff. For anyone interested, the following site lists the capabilities of current GPUs and includes FP32 and FP64 performance.

For example, a AMD RX-6700 XT has a theoretical FP32 performance of 13.21 TFLOPS and 825.9 Gflops for FP64. Street price (depending on vendor) ranges in the $500-$650 range. I would never pay that to do just graphics but I might if I can get anywhere near 500Gflop FP64 performance

septc · May 17, 2022, 10:21am

Today I came across these articles about the use of single and mixed precisions (for molecular dynamics and electronic structure calculations):

Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs
https://pubs.acs.org/doi/10.1021/ct200909j
Double Precision Is Not Needed for Many-Body Calculations: Emergent Conventional Wisdom
https://pubs.acs.org/doi/10.1021/acs.jctc.8b00321
Accelerating seminumerical Fock-exchange calculations using mixed single- and double-precision arithmethic
https://aip.scitation.org/doi/full/10.1063/5.0045084

This article seems to be using “half-precision”:

Mixed Precision Fermi-Operator Expansion on Tensor Cores from a Machine Learning Perspective
https://pubs.acs.org/doi/10.1021/acs.jctc.1c00057

Up to now I used only double precision in my codes, so various parts uses d0 for floating-point literals… So first of all, I need to fix these literals to _rp for experiment (precisely as you mentioned above )

Yes, GPU is very attractive and I think the motivation for the above-linked papers is probably GPU. Though I cannot use GPUs right now, I would like to try them once the program and calculations become more “stable” (possibly with OpenACC?)

Topic		Replies	Views
Use case of single-precision real number Help	20	1202	March 17, 2024
Single, Double or Mixed Precision? Help	14	1224	February 7, 2024
Question about double precision on ARM processors	52	1943	November 19, 2023
Array intrinsics performances/accuracy	20	890	May 20, 2023
Does the speed (bandwidth) of memory matter?	11	1641	April 11, 2022

Using single-precision for faster calculation

Related topics