# Using single-precision for faster calculation

Dear all,

I have recently been struggling to make my program faster because it has been spending a long time to get a result. So for speed-up, I have parallelized the most time-consuming part using OpenMP. This worked very well and gave a nice speedup. Now I am considering the use of single precision (32-bit) for the time-consuming part, because my program is written entirely in double precision (64-bit). So, I am wondering whether there are some remarks that I should keep in mind for this kind of calculations.

For example, I guess it is typical to change the precision parameter from “dp” to “sp” throughout the program, but are there some code patterns for which it is definitely better to keep double precision? (I imagine that this may be the case for result variables for summing data, but not very sure.) Alternatively, is it also common to use 32-bit only for the time-consuming part, while keeping the entire program as 64-bit?

More generally, although I have used only 64-bit up to now, is it rather common to perform an entire calculation in 32-bit? I think neural-net calculations use lower-precisions for speed-up, but wondering what kind of other applications typically use 32-bit (partly or entirely). I think this depends greatly on the nature of calculations (whether the calculation is deterministic and needs high precision, or it is stochastic and inherently subject to random noise).

So I would really appreciate it if you share any insight about such calculations or point to any related pages / explanations that I should read beforehand.

3 Likes

Depends on whether you need the 64-bit accuracy. Try your calculation both ways with the same data and see how much accuracy 32 bits loses. Of course 64-bit calculations seldom give 64-bit accuracy in the results. I once had an ill-conditioned algorithm that gave negative values of some outputs that had to be positive with 32 bits, a problem that 64 bits cured, results differing in the 4th significant figure from 64 bits with 80 bits, but much further down with 128. (Hurrah for gfortran offering all 4 kinds.)

1 Like

I think its an issue of knowing the algorithms and if PDE’s the equation sets you are solving. Basically, if you are solving systems of equations via matrices do they become ill conditioned and highly impacted by roundoff errors. An example is using pseudo-inverses and normal equations to solve a least squares problem. I have for 40 plus years used double precision exclusively in all the CFD and FEM codes I’ve written or worked with. A recent AIAA journal article has me rethinking the everything double precision approach though. See (if you have access)

Witherden and Jameson, “Impact of Number Representation on High-Order Implicit Large Eddy Simulation”, AIAA Journal, Vol 58, no. 1, January 2020 (DOI:10.2514/1.J058434)
https://arc.aiaa.org/doi/full/10.2514/1.J058434

The focus of this article is doing high order turbulence large-eddy simulations on GPU’s. This is a very demanding and time-consuming problem. The author’s show that for a standard test problem (Taylor-green vortex) the speed up gained from running single precision vs double on an NVIDIA K40c is around 2-3 with no major loss of accuracy.

Their final conclusions were:

``````"For all test cases, the use of single precision was found to have
little impact on the overall accuracy of the simulations. This suggests
that, for many real-world test cases, single precision arithmetic is
indeed sufficient. For bandwidth bound simulations, speedup factors
between 1.4 and 2.9 were observed, hence resulting in a greatly
reduced time to solution."
``````

Since you can get usable single precision performance out of mid to high end commodity graphics cards, this opens up more options for speeding up simulations without spending 10K on NVIDIAs high end card just to get double precision.

2 Likes

An example of a widely used software package that uses single precision throughout, which ran into problems after X86/x64 CPUs shifted from x87 (80 bit reals internally) to SSE2 instructions (32 or 64 bit reals):

1 Like

If you switch to single precision you will see if the results produced are sensible. You could also set IEEE flags for underflow and overflow, which I tweeted about:

Fortran 2003 introduced an ieee_arithmetic module with elemental functions ieee_is_nan, ieee_is_finite, ieee_is_negative, and ieee_is_normal, which test for those conditions. They have a real argument and logical RESULT.

``````program test_ieee_arithmetic
use, intrinsic :: ieee_arithmetic
implicit none
character (len=20) :: fmt = "(a16,*(l9))"
integer, parameter :: wp = kind(1.0) ! same output for wp = kind(1.0d0)
real(kind=wp)      :: z,vec(6),NaN,t
z = 0.0_wp
t = tiny(z)
vec = [z/z,z/1.0_wp,-z/1.0_wp,1.0_wp/z,-1.0_wp/z,t/10]
print*,ieee_value(0.0,ieee_positive_inf),1.0/z ! 2 ways of geting +Inf
NaN = ieee_value(0.0,ieee_quiet_nan) ! get NaN
print*,"NaN == NaN?",NaN == NaN ! demonstrate that NaN /= NaN
print*
print "(19x,*(a9))","0.0/0.0","0.0/1.0","-0.0/1.0","1.0/0.0","-1.0/0.0","tiny/10"
print fmt,"ieee_is_nan"     ,ieee_is_nan(vec)
print fmt,"ieee_is_negative",ieee_is_negative(vec)
print fmt,"ieee_is_finite"  ,ieee_is_finite(vec)
print fmt,"ieee_is_normal"  ,ieee_is_normal(vec)
end program test_ieee_arithmetic
! gfortran output:
!         Infinity         Infinity
! NaN == NaN? F
!
!                      0.0/0.0  0.0/1.0 -0.0/1.0  1.0/0.0 -1.0/0.0  tiny/10
!      ieee_is_nan        T        F        F        F        F        F
! ieee_is_negative        F        F        T        F        T        F
!   ieee_is_finite        F        T        T        F        F        T
!   ieee_is_normal        F        T        T        F        F        F
``````

Call ieee_set_halting_mode() of F2003 module ieee_exceptions to set the floating point conditions (NaN, overflow, underflow, divide-by-zero, inexact) that will halt a program. Similar compiler options include gfortran -ffpe-trap=invalid ifort -fpe0.

``````! A reference is https://www.nag.com/nagware/np/r70_doc/ieee_arithmetic.html
program test_ieee_exceptions
use :: ieee_exceptions, only: ieee_set_halting_mode, ieee_invalid, &
ieee_overflow, ieee_underflow, ieee_divide_by_zero, ieee_inexact
implicit none
real :: a, b, c, d, e, f, g
logical, parameter :: halt_program = .true.
call ieee_set_halting_mode([ieee_divide_by_zero, ieee_underflow, &
ieee_overflow, ieee_invalid],halting=halt_program)
! causes halt if divide_by_zero, underflow, overflow, or NaN conditions occur
! remove element from array constructor to have program continue with that condition
f = 1.0
print*,"computing 1.0/0.0"
print*,f/0.0
e = tiny(e)
print*,"tiny =",e,", computing tiny/10"
print*,e/10
d = huge(d)
print*,"huge =",d,", computing huge**2"
print*,d**2
a = 0.0
b = 0.0
print*,"a, b=",a,b
print*,"computing c = a/b"
c = a/b
print*,"c=",c
! since floating point roundoff is ubiquitous, probably
! do not want the following mode
call ieee_set_halting_mode(ieee_inexact, halting=halt_program)
print*,"computing 2.0/3.0"
g = 2.0/3.0
print*,"2.0/3.0 =",g
end program test_ieee_exceptions
! output on Windows with ifort -traceback ieee_exceptions.f90
!  computing 1.0/0.0
! forrtl: error (73): floating divide by zero
! Image              PC                Routine            Line        Source
! ieee_exceptions.e  00007FF6EBBF112A  MAIN__                     14  ieee_exceptions.f90
``````
1 Like

It’s also worth pointing out that single precision is generally only faster if you have vectorized computation.

1 Like

If a computation is related to solving a discretized PDE, the overhead of transferring large arrays from and to memory can be very significant, even with SSE/AVX.

I remember a 2-D transient groundwater/thermal/solute simulation program for which we prepared single and double precision versions and made comparison runs. We concluded that, except in one or two subroutines where means and standard deviations were accumulated, single-precision was sufficient, and ran almost twice as fast as the double precision version.

If an eigenvalue is being calculated, on the other hand, double precision may be essential. I remember a program that calculated rotor vibrations, and double precision was essential.

As others have remarked, “it depends”, and some effort should be spent in investigating what precision is required when a simulation run takes several minutes or hours to run a single case.

1 Like

For a program slightly modified from Speed of array intrinsics - #3 by Beliavsky, printing the real kind and setting it to either single or double precision, compiling with `ifort -Ofast` on WSL2, I do find it to be twice as fast to compute the min, max, and mean of a large 1D array in single precision than double precision (except for an algorithm that is slow in both cases), and more than twice as fast to compute exponentials in single than double precision for a large array.

`````` wp =           8
n =   100000000

0.1932                           x=exp(x)
0.1159               minval(x), maxval(x)
0.0628                         min_max(x)
0.0627          min_max_explicit_shape(x)
0.2164                   min_max_pairs(x)
0.2168                   min_max_local(x)
0.1543       minval(x),maxval(x),sum(x)/n
0.0652                    min_max_mean(x)
0.0653     min_max_mean_explicit_shape(x)

real	0m1.876s
user	0m1.800s
sys	0m0.061s
``````
``````wp =           4
n =   100000000

0.0859                           x=exp(x)
0.0588               minval(x), maxval(x)
0.0313                         min_max(x)
0.0313          min_max_explicit_shape(x)
0.2289                   min_max_pairs(x)
0.2055                   min_max_local(x)
0.0774       minval(x),maxval(x),sum(x)/n
0.0316                    min_max_mean(x)
0.0315     min_max_mean_explicit_shape(x)

real	0m1.199s
user	0m1.110s
sys	0m0.072s
``````
code
``````module min_max_mod
implicit none
integer, parameter :: wp = kind(1.0)
contains
pure function min_max(x) result(y)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y(2)
integer                   :: i,n
n = size(x)
if (n < 1) return
y(1) = x(1)
y(2) = y(1)
do i=2,n
if (x(i) < y(1)) y(1) = x(i)
if (x(i) > y(2)) y(2) = x(i)
end do
end function min_max
!
pure function min_max_explicit_shape(n,x) result(y)
integer      , intent(in) :: n
real(kind=wp), intent(in) :: x(n)
real(kind=wp)             :: y(2)
integer                   :: i
if (n < 1) return
y(1) = x(1)
y(2) = y(1)
do i=2,n
if (x(i) < y(1)) y(1) = x(i)
if (x(i) > y(2)) y(2) = x(i)
end do
end function min_max_explicit_shape
!
pure function min_max_pairs(x) result(y)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y(2)
integer                   :: i,n
logical                   :: odd
n = size(x)
if (n < 1) return
if (n == 1) then
y(1) = x(1)
y(2) = x(2)
return
end if
odd = mod(n,2) == 1
if (x(2) > x(1)) then
y(1) = x(1)
y(2) = x(2)
else
y(1) = x(2)
y(2) = x(1)
end if
do i=3,n-1,2
if (x(i+1) > x(i)) then
if (y(1) > x(i))   y(1) = x(i)
if (y(2) < x(i+1)) y(2) = x(i+1)
else
if (y(1) > x(i+1)) y(1) = x(i+1)
if (y(2) < x(i))   y(2) = x(i)
end if
end do
if (odd) then
y(1) = min(y(1),x(n))
y(2) = max(y(1),x(n))
end if
end function min_max_pairs
!
pure function min_max_local(x) result(y)
! same as min_max_pairs but sets
! xi to x(i) and xi1 to x(i+1)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y(2)
real(kind=wp)             :: xi,xi1
integer                   :: i,n
logical                   :: odd
n = size(x)
if (n < 1) return
if (n == 1) then
y(1) = x(1)
y(2) = x(2)
return
end if
odd = mod(n,2) == 1
if (x(2) > x(1)) then
y(1) = x(1)
y(2) = x(2)
else
y(1) = x(2)
y(2) = x(1)
end if
do i=3,n-1,2
xi1 = x(i+1)
xi  = x(i)
if (xi1 > xi) then
if (y(1) > xi)   y(1) = xi
if (y(2) < xi1) y(2) = xi1
else
if (y(1) > xi1) y(1) = xi1
if (y(2) < xi)   y(2) = xi
end if
end do
if (odd) then
y(1) = min(y(1),x(n))
y(2) = max(y(1),x(n))
end if
end function min_max_local
!
pure function min_max_mean(x) result(y)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y(3)
integer                   :: i,n
n = size(x)
if (n < 1) return
y(1)   = x(1)
y(2:3) = y(1)
do i=2,n
if (x(i) < y(1)) y(1) = x(i)
if (x(i) > y(2)) y(2) = x(i)
y(3) = y(3) + x(i)
end do
if (n > 0) y(3) = y(3)/n
end function min_max_mean
!
pure function min_max_mean_explicit_shape(n,x) result(y)
integer      , intent(in) :: n
real(kind=wp), intent(in) :: x(n)
real(kind=wp)             :: y(3)
integer                   :: i
if (n < 1) return
y(1)   = x(1)
y(2:3) = y(1)
do i=2,n
if (x(i) < y(1)) y(1) = x(i)
if (x(i) > y(2)) y(2) = x(i)
y(3) = y(3) + x(i)
end do
if (n > 0) y(3) = y(3)/n
end function min_max_mean_explicit_shape
end module min_max_mod
!
program xmin_max_exp
use min_max_mod
implicit none
integer :: i,n
integer, parameter :: nt = 10, nlen = 30, ncol_extremes = 5, ncol_mmm = 3
integer                    :: icol
real(kind=wp), allocatable :: x(:)
real(kind=wp)              :: extremes(2,ncol_extremes),t(nt),xmin_max_mean(3,ncol_mmm)
character (len=nlen) :: labels(nt-1) = [character(len=nlen) :: &
"x=exp(x)","minval(x), maxval(x)","min_max(x)","min_max_explicit_shape(x)", &
"min_max_pairs(x)", &
"min_max_local(x)","minval(x),maxval(x),sum(x)/n","min_max_mean(x)", &
"min_max_mean_explicit_shape(x)"]
character (len=*), parameter :: fmt_cr = "(a20,*(1x,f16.12))"
logical :: print_results
n = 10**8
allocate (x(n))
print*,"wp =",wp
print*,"n =",n
call random_number(x)
print_results = x(1) > 1.0_wp ! .false.
call cpu_time(t(1))
x = exp(x)
call cpu_time(t(2))
extremes(:,1) = [minval(x),maxval(x)]
call cpu_time(t(3))
extremes(:,2) = min_max(x)
call cpu_time(t(4))
extremes(:,3) = min_max_explicit_shape(n,x)
call cpu_time(t(5))
extremes(:,4) = min_max_pairs(x)
call cpu_time(t(6))
extremes(:,5) = min_max_local(x)
call cpu_time(t(7))
xmin_max_mean(:,1) = [minval(x),maxval(x),sum(x)/n]
call cpu_time(t(8))
xmin_max_mean(:,2) = min_max_mean(x)
call cpu_time(t(9))
xmin_max_mean(:,3) = min_max_mean_explicit_shape(n,x)
call cpu_time(t(10))
if (print_results) then
do icol=1,ncol_extremes
print fmt_cr,"min, max:",extremes(:,icol)
end do
do icol=1,ncol_mmm
print fmt_cr,"min, max, mean:",xmin_max_mean(:,icol)
end do
end if
print "(f8.4,5x,a30)",(t(i+1)-t(i),trim(labels(i)),i=1,nt-1)
end program xmin_max_exp
``````
1 Like

Generally, speaking, generalities tend to be not too general.

It is quite easy to show that single precision evaluation of functions from libm is typically faster than double precision evaluation. Note, there is no vectorization involved here.

For 1 million values, in the indicated intervals, I find timings in microseconds per call to be

``````       Interval            float     double
acos [0.5,1]            0.03133   0.06128
acosh [1,2]              0.05261   0.06759
asin [0.5,1]            0.03731   0.05902
asinh [1,2]              0.06118   0.08488
atan [0.5,10]           0.02205   0.03837
atanh [0.5,0.95]         0.05724   0.06823
cbrt [1,10]             0.03413   0.05424
cos [1,3.14159]        0.01464   0.02683
cosh [1,10]             0.03602   0.05270
cospi [1,10]             0.02016   0.03793
erf [0.5,2]            0.04316   0.07337
erfc [0.5,2]            0.04325   0.07434
exp [1,2]              0.02390   0.03626
exp2 [1,8]              0.01666   0.01809
expm1 [0.0625,1]         0.02435   0.03611
j0 [1,10]             0.11363   0.18887
j1 [1,10]             0.11567   0.19327
lgamma [1,3]              0.02338   0.03559
log [1,2]              0.02396   0.03596
log2 [1,8]              0.02993   0.04086
log1p [0.5,1.5]          0.03795   0.04348
log10 [1,8]              0.03147   0.04221
sin [1,3.14159]        0.01408   0.02705
sinh [1,10]             0.04759   0.07892
sinpi [1,10]             0.02055   0.03812
sqrt [1,10]             0.01406   0.02861
tan [0.0625,0.785398]  0.01483   0.03418
tanh [1,10]             0.04301   0.07502
tanpi [1,10]             0.02547   0.06863
tgamma [1,3]              0.06590   0.05871
y0 [1,10]             0.11862   0.19924
y1 [1,10]             0.11931   0.20070

``````
2 Likes

I had thought for a long time that there was no time difference between single and double precision on a 64 bits CPU, remembering having verified that with simple arithmetic computations. It’s only discussing here on the Discourse that I realized it was generally false with math functions If they are implemented with series or some converging algorithm, we easily understand that less terms or iterations will be needed to attain the maximal precision possible with real32.

The impact on the results of using real32 instead of real64 will depend a lot on the algorithms used in your problem, on the length of computations, etc. And on the precision you need, an engineer building a car does not need the name number of digits as a quantum scientist.

3 Likes

It depends on the type of computation.

In Structural analysis, we solve the large set of simultaneous linear equations:
Force = K . disp
where K is a large 2d matrix, and Force and disp are vectors.
K = sum of Ke (smaller) matrices for all the “elements” in the system. This requires higher precision, as the values accumulated, (ie sum over all elements of each Kij) is adding Kij’s of significantly different magnitudes, due both the significantly different stiffnesses of elements and also as there can be 6 degrees of freedom for each point/node.
Basically K is assembled by adding numbers where significant round-off can occur.
Mixing displacements and rotations/first derivitives typically significantly increases the round-off effects, requiring higher precision accumulators.
This is solved using a “direct” solver for the whole system, as apposed to an itterative solver for localised systems, where the numbers combined are (typically) not as affected by round-off.
Poor round-off can also be described as ill-conditioned.

Itterative solutions can be less affected by round-off where the solution is for many local problems, although boundary solutions can introduce round-off.
Itterative solutions can also correct for round-off as they re-solve/adjust each local problem.

It all depends on the type of problem and solution approach that can minimise round-off when values are combined/added.

Analysing a 3-dimensional homogeneous solid with only 3 displacement degrees of freedom per node could probably be solved with 4-byte reals, but shell models that mix plane stress and bending stiffness require 8-byte reals, with appropriate meshing to not introduce ill-conditioned equations, or “poor” results.
There are not many requests for analysing a 3-dimensional homogeneous solid !!

1 Like

Yes, it comes down to what one is computing. Most, if not all, of the libm functions use at least one polynomial approximation in some interval (possibly after argument reduction). For these polynomials, double precision simply requires more terms than single precision. For example, the power series for `j0(x)` in the interval [iota, 2) converges with 7, 12, 13, and 19 terms for single, double, extended double, and quad precision, respectively. Here, iota is a small threshold value to prevent an underflow exception from the higher order terms.

3 Likes

So if an algorithm is for example based essentially on matrix arithmetical computations, single precision should not be faster than double precision.
Do you agree?

1 Like

I am finding that `matmul` is about twice as fast for single vs. double precision. Toggling `wp` in the code

``````program xmatmul
implicit none
integer, parameter :: n = 1000, wp = kind(1.0), niter = 100
integer :: iter
real(kind=wp) :: a(n,n),b(n,n),c(n,n)
logical, parameter :: call_matmul = .true.
print*,"wp, n, niter, call_matmul = ",wp, n, niter, call_matmul
do iter=1,niter
call random_number(a)
call random_number(b)
if (call_matmul) then
c = matmul(a,b)
if (c(n,n) > 1000.0_wp) print*,c(n,n) ! prevent matmul from being optimized away
else
if (max(a(n,n),b(n,n)) > 1000.0_wp) print*,a(n,n) ! prevent random_number calls from being optimized away
end if
end do
end program xmatmul
``````

and compiling with `ifort -Ofast` and running on WSL2 gives

`````` wp, n, niter, call_matmul =            8        1000         100 T

real	0m14.623s
user	0m14.597s
sys	0m0.010s
``````
`````` wp, n, niter, call_matmul =            4        1000         100 T

real	0m7.452s
user	0m7.436s
sys	0m0.000s
``````

The calls to `random_number` take a small fraction of the time.

2 Likes

Does anyone know how to quantify potential speedup due to improved use of cache and stack (assuming that actually happens) using single vs double precision. My thinking (and I’ll gladly admit I might be wrong) is that with single precision more of the problem will fit into cache in any given computation cycle. Sort of like the speedup you see in parallel codes when you chop the problem up into small enough pieces that most of the problem will fit into cache on each processor

1 Like

No. I do not agree. I think the answer is that it depends on more than algorithmic detail. I can certainly image a matrix computation which fits entirely in a CPU’s cache in SP but would not fit in DP. The computation in DP would incur memory fetch. Depending on load on the OS, that fetch might take a bit of time.

Note, LAPACK from Netlib has a number of timing routines for both SP and DP. One can do some timing tests.

3 Likes

You’re looking for ATLAS (http://math-atlas.sourceforge.net/). Download and build it. You can watch as ATLAS iterates to find the best block algorithm to use.

3 Likes

Thanks @kargl and @Beliavsky
that’s interesting. Of course with big matrices, cache is important. And probably it takes also more time to copy a double precision array.

I have also a question concerning the math functions. In my assembler book, I see that sqrt(), cos(), and sin() can be implemented in the FPU. Are they implemented in two versions? And does the double and single computations take different time? Does libm uses them or have its own implementation? (in that case why?)

1 Like

I think you will find that this is a complicated and also an interesting subject area. Concerning the accuracy of the results, you will find that there are may sources of errors in numerical calculations. In linear algebra, for example, there are matrix condition numbers. In differential equations, there is positive and negative feedback. In recursions, there are conditions where errors increase with each term and there are conditoins where errors decrease with each term. I remember seeing a recursive algorithm once where the initial guess was set to zero, which means there were no significant digits at all in the initial guess, yet the recursion was stable and produced results to full precision at the end. That seemed like magic to me at the time. In quadrature methods, the number of data points required depends on the precision of the results. In polynomial approximations, there is truncation error that depends on the number of expansion terms. Generally one wants to tune the algorithm so that the truncation error is somewhere between the intrinsic data errors and the numerical roundoff error. There is much research now involving “half-precision” floating point. These algorithms use 16-bit floating point in order to speed up critical parts of algorithms where only two or three significant digits are required in the intermediate results. Of course, errors in these cases need to be kept “local” so that they do not propagate and contaminate other parts of the algorithm, so this is tricky to do. Over time, much of this is also driven by hardware capabilities. 10-15 years ago when GPU calculations where first being developed, they typically worked with either fixed-point or 32-bit floating point arithmetic. Thus research was done to identify which parts of algorithms could be sped up by using the low-precision GPU hardware, while still maintaining full-precision final results. Now, GPUs also do 64-bit floating point arithmetic. In the past, I worked on machines that had 36-bit single precision floating point. When codes were ported from those machines to 32-bit computers, it was sometimes found that they needed to be done in 64-bit double precision to maintain their previous accuracy. Apparently, the acceptance threshold was somewhere in between 32- and 36-bits of accuracy. I’m pretty sure that in my field of quantum chemistry, almost everything we do would be just fine with something like 48-bit floating point. There is little hardware or software support for that these days, so we settle for 64-bit floating point instead.

Finally, as a practical matter, you don’t want to go through your codes and change back and forth between D and E exponents, or between “sp” and “dp” kind parameters. You want to write your code with something neutral, like “wp”, and then change the whole shebang by just defining what the “wp” kind value is in one place. This is one of the big advantages of modern fortran over other languages, so don’t just ignore that capability. Then regarding things like polynomial truncation errors, you can monitor the magnitude of the intermediate terms with fortran intrinsics like nearest(), epsilon(), precision(), and so on.

1 Like

You might find this paper, a fairly recent one, to be of interest:
https://peerj.com/articles/cs-778.pdf

“ABSTRACT
It is well established that reduced precision arithmetic can be exploited to accelerate
the solution of dense linear systems. Typical examples are mixed precision algorithms
that reduce the execution time and the energy consumption of parallel solvers for
dense linear systems by factorizing a matrix at a precision lower than the working
precision. Much less is known about the efficiency of reduced precision in parallel
solvers for sparse linear systems, and existing work focuses on single core experiments.
We evaluate the benefits of using single precision arithmetic in solving a double
precision sparse linear system using multiple cores. We consider both direct methods
and iterative methods and we focus on using single precision for the key components
of LU factorization and matrix–vector products. Our results show that the anticipated
speedup of 2 over a double precision LU factorization is obtained only for the very
largest of our test problems. We point out two key factors underlying the poor speedup.
First, we find that single precision sparse LU factorization is prone to a severe loss of
performance due to the intrusion of subnormal numbers. We identify a mechanism that
allows cascading fill-ins to generate subnormal numbers and show that automatically
flushing subnormals to zero avoids the performance penalties. The second factor is
the lack of parallelism in the analysis and reordering phases of the solvers and the
absence of floating-point arithmetic in these phases. For iterative solvers, we find
that for the majority of the matrices computing or applying incomplete factorization
preconditioners in single precision provides at best modest performance benefits
compared with the use of double precision. We also find that using single precision
for the matrix–vector product kernels provides an average speedup of 1.5 over double
precision kernels. In both cases some form of refinement is needed to raise the single
precision results to double precision accuracy, which will reduce performance gains.”

1 Like