I cannot make any statement for other libm implementations. FreeBSD libm will use a hardware FPU for sqrt if the processor has support. For the trig functions, it will use software implementations. The choice of a software implementation for sin, cos, and tan has two reasons. First, historically, the 387 FPU was inadequate at argument reduction: the 387 used a 66-bit (or so) approximation of pi, whereas for double precision somewhere around 1024 bits are needed. Second, the software implementations are actually faster than the hardware implementations.
For CPU, changing double precision to float precision may carry some additional risks. Perhaps improving the algorithm may lead to more speedup.
For GPUs such as GeForce RTX or Quadro, the hardware is designed for float precision, so using float precision on those cards is typical.
You can also use a very high-end card such as the GP100, whose hardware is designed for double precision.
Thanks very much for a lot of info! I really appreciate it!!
And by reading the comments, I have noticed that my understanding / assumption of single precision may be basically wrong… Specifically, when I wrote the above question, I assumed that replacing double precision (DP) by single precision (SP) would give nearly x2 speedup, regardless of whether the numerical result is okay or not. But now I suppose this may not always be the case (according to the comments, thanks!). More specifically,
- For the latest CPUs, is the speed of basic arithmetic operations (like addition and multiplication) possibly similar between SP and DP?
- If so, does the speed difference of SP and DP arise more from different efficiency of data transfer (between slow memory and CPU) and better use of cache, rather than arithmetic operations inside CPU?
- Does this mean that SP codes may not give speedup if they involve indirect / random access of slow memory (e.g. using an index table) rather than more linear access (suitable for “vectorization”)?
- Also, if the code involves `sin` etc., the speed difference may also depend significantly on the underlying math libraries?
To check the first point above, I have tried the following code (arith1.F90 and trig1.F90), which involves only arithmetic or trigonometric calculations (with no large arrays).
```fortran
!! arith1.F90
integer(int64) :: k
real(rp) :: ans, x

ans = 0
do k = 1, 10 ** 9
   x = real( k, rp )
   ans = ans + 1 / ( x * x )
end do
print *, "ans = ", ans
```
```fortran
!! trig1.F90
integer(int64) :: k
real(rp) :: ans, x

ans = 0
do k = 1, 10 ** 8
   x = ( 2 * mod( k, 2 ) - 1 ) / real( k, rp )
   ans = ans + sin( x ) * cos( x )
end do
print *, "ans = ", ans
```
main.F90 (driver code to include arith1.F90 etc for SP and DP)
```fortran
module test_mod
    use iso_fortran_env, only: int64, sp => real32, dp => real64
    implicit none
contains

subroutine test_sp()
#define rp sp
#ifdef arith1
#include "arith1.F90"
#endif
#ifdef trig1
#include "trig1.F90"
#endif
#undef rp
end

subroutine test_dp()
#define rp dp
#ifdef arith1
#include "arith1.F90"
#endif
#ifdef trig1
#include "trig1.F90"
#endif
end

end module

program main
    use test_mod
    integer(int64) :: t0, t1, trate

    call system_clock( t0, trate )
    call test_sp()
    call system_clock( t1 )
    print *, "time(sp): ", real( t1 - t0, dp ) / trate

    call system_clock( t0, trate )
    call test_dp()
    call system_clock( t1 )
    print *, "time(dp): ", real( t1 - t0, dp ) / trate
end
```
Then the result is like this:
MacMini2012, Core-i7, 2.3 GHz, 1333 MHz DDR3 <-- very old
```
$ gfortran-10 -O3 -Darith1 main.F90
 ans =    1.64472532
 time(sp):    2.4510630000000000
 ans =    1.6449340578345750
 time(dp):    4.9258379999999997    <-- DP x2 slower

$ gfortran-10 -O3 -Dtrig1 main.F90
 ans =   0.209808990
 time(sp):    4.1077310000000002
 ans =   0.20980945474697082
 time(dp):    2.1485820000000002    <-- DP x2 faster (?)
```
Xeon E5-2650 v2 @ 2.60GHz, DDR3-1600 <-- a bit old
```
$ gfortran-10 -O3 -Darith1 main.F90
 ans =    1.64472532
 time(sp):    2.1299402210000000
 ans =    1.6449340578345750
 time(dp):    4.1305806760000001    <-- DP x2 slower

$ gfortran-10 -O3 -Dtrig1 main.F90
 ans =   0.209808990
 time(sp):    0.51460836600000004
 ans =   0.20980945474697082
 time(dp):    1.7247241419999999    <-- DP x3 slower
```
M1 Mac 3.22 GHz, LPDDR5 (2021) <-- newer
```
$ gfortran-10 -O3 -Darith1 main.F90
 ans =    1.64472532
 time(sp):    0.95877999999999997
 ans =    1.6449340578345750
 time(dp):    0.93008999999999997    <-- DP similar to SP

$ gfortran-10 -O3 -Dtrig1 main.F90
 ans =   0.209808990
 time(sp):    3.5496300000000001
 ans =   0.20980945474697082
 time(dp):    0.82566200000000001    <-- DP x4 faster (???)
```
So, the results vary very much depending on the machine, but for newer CPUs SP may not have a speed advantage when memory access is not involved, possibly? I hope the above tests are not doing something strange…
(I will add more tests using array accesses later.)
To me the issue is not whether you can use single instead of double on a CPU but on current GPUs. There are $600 desktop graphics cards that can do 10+ TFLOPS in FP32 and 800+ GFLOPS in FP64, so if you have a code that can map efficiently to a GPU, going with single precision can have a huge payoff. For anyone interested, the following site lists the capabilities of current GPUs and includes FP32 and FP64 performance.
For example, an AMD RX-6700 XT has a theoretical FP32 performance of 13.21 TFLOPS and 825.9 GFLOPS for FP64. Street price (depending on vendor) is in the $500-$650 range. I would never pay that to do just graphics, but I might if I can get anywhere near 500 GFLOPS of FP64 performance.
Today I came across these articles about the use of single and mixed precisions (for molecular dynamics and electronic structure calculations):
Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs
Double Precision Is Not Needed for Many-Body Calculations: Emergent Conventional Wisdom
Accelerating seminumerical Fock-exchange calculations using mixed single- and double-precision arithmetic
This article seems to be using “half-precision”:
- Mixed Precision Fermi-Operator Expansion on Tensor Cores from a Machine Learning Perspective
Up to now I used only double precision in my codes, so various parts use `d0` for floating-point literals… So first of all, I need to change these literals to `_rp` for the experiment (precisely as you mentioned above).
Yes, GPU is very attractive and I think the motivation for the above-linked papers is probably GPU. Though I cannot use GPUs right now, I would like to try them once the program and calculations become more “stable” (possibly with OpenACC?)
What is the flexibility when using GPUs with Fortran?
Can a general compiler, like gfortran, provide this functionality, or are manufacturer-specific compilers/languages required?
@JohnCampbell most compilers support the OpenACC compiler directives. The NVIDIA HPC toolkit compilers (old flang, basically) will now offload DO CONCURRENT (in some cases) automatically to a GPU, but probably only for an NVIDIA GPU. OpenMP 5 (maybe also 4) has the capability to offload directly to a GPU. I don't use OpenMP a lot, so I'm not sure which compilers support a version that allows GPU offload. NVIDIA's compilers support CUDA Fortran and have access to the CUDA libraries. There are also several Fortran-specific libraries that are aimed at providing GPU support. See the Parallel Programming section in
I have been doing LES (large eddy simulation) in single precision for many years. Most often, the differences are negligible on simple grids. I do see differences when averaging some higher moments where you sum powers of quantities, but one could use higher precision just there. When doing steady state simulations, one cannot converge to low residuals with single precision, but that is not something I usually need.
Even if the speed of one multiplication is the same, you can fit more in a vectorized operation. And the memory bandwidth is utilized better when you move around half the amount of data.
There are very few situations in which you need double precision. I have written statistical programs and have never used double precision. The only time you have loss of accuracy is when you subtract numbers that are equal in the first several significant digits. You should know when this is likely to happen.
The single vs double decision has to be driven by a knowledge of the sensitivity of the underlying algorithms you are using to floating-point round-off etc. and the levels of accuracy you need. I'll share an experience I had many years ago (late 1980s) while in grad school. I was taking a viscous flow class and we had a homework assignment to solve the boundary layer equations (a subset of the Navier-Stokes equations) using finite differences. About that time one of the local computer stores was having a fire sale on IBM PCjrs (they were selling complete systems for around $350). I bought one just to have a DOS machine to play with and added an 8087 math coprocessor. I also wanted to try programming in a different language than Fortran, so I got a copy of Turbo Pascal. So I had a total of around $600-650 in the machine. I did the assignment in Pascal and was surprised at 1) how fast Turbo Pascal was and 2) that I actually got converged solutions. A couple of my classmates did the same problem on a VAX 11/780 system that was pushing $1 million in cost. They used single precision and Fortran and were having trouble getting converged solutions. I was puzzled by this until, after digging more into how Turbo Pascal did floating-point math, I found out that the Real type I was using was a 6-byte (48-bit) value with a precision of around 16 to 18 decimal places. Plus the 8087, if I remember correctly, did something like 80-bit math. I mentioned this to my classmates and they switched to double precision and were finally able to get converged solutions. I have always found it very amusing that my $600 or so system would outperform a $1 million system. This was my first real lesson in the value of knowing when you need to use extended precision and when you don't.
The Turbo Pascal 48-bit real had a 39-bit mantissa, which can represent 11-12 decimal digits. This was adequate for the first two versions, which did not support the 8087. Do you remember how much you paid for the 8087 add-on?
@mecej4. Thanks for correcting me on the precision. It was so very long ago that I was basing my number on something I read on-line. I think I paid something like $100 to $150 for the 8087 but (again) that was a long time ago and my memory is not what it used to be. I still regret not keeping the keyboard that came with the system. IBM for all its faults knew how to make a keyboard (at least back then).
List prices for coprocessors (1991)
The availability of 80-bit “hardware” accumulators was very useful in the 80’s, especially for dot_product accumulators when round-off was an issue.
I find it confusing as to what "8087" (80-bit / 10-byte) instructions are available on a 64-bit Windows OS.
The Salford 64-bit compiler I use does not support real*10 (only available for 32-bit).
Gfortran uses a 16 byte format, so I don’t know if this is hardware or software emulation.
SSE and AVX instructions also do not support better than 8-byte reals, which means there is a considerable performance penalty when selecting better than 8-byte accuracy.
I think this poor support for improved precision is a retrograde step, unless someone can identify errors in my summary.
Ifort allows 3 real precisions: 4,8,16 bytes. Gfortran has those and also 10 bytes.
So ifort does not support “8087” hardware, while gfortran supports 8087 instructions in some way.
(My question assumes that 8087 hardware support of 80-bit reals would be better than 80-bit or 128-bit software emulation, although either is much slower than 32/64-bit SSE/AVX vector instructions.)
I do not know whether 80-bit registers are still available in modern Intel/AMD CPUs, or how many AVX registers are available to each CPU, such as when hyper-threading/multi-threading is used?
The best way I have found to get extended precision on modern CPUs is double-double math (which can be vectorized trivially). Double-double gives you about 106 bits of precision, and is generally a bunch faster than 128-bit floating-point emulation.
I think that there are 4 issues here:
Is single-precision arithmetic faster? It may not be. Some processors may map the single-precision values into the (typically 10-byte) registers before carrying out the operations, and the re-mapping may slow the process down. I would be interested to know whether anyone has tested this.
Single precision numbers occupy less memory and for array operations this will improve cache coherence. We have seen this.
The use of single precision numbers reduces inter-processor traffic in multiprocessor tasks. I suspect that this is the most important advantage.
Is it accurate enough? We have set up a system to measure this, please see
We emulate the arithmetic and adjust the precision to a specified number of mantissa bits. All of the libraries and the mechanism to do this are in the fpt distribution at http://simconglobal.com . To test whether single precision is good enough, you can run with single precision, i.e. 23 bits in IEEE numbers, and then re-run with 22 or 21 bits of precision. If the results change significantly, you are at or beyond the acceptable precision. If not, it should be OK.
We found that we could reduce the precision appreciably in our small test programs. This was not a complete surprise. In the early 1980s, the fastest real-time simulation machine was the Applied Dynamics AD10. This computed all derivative values to a precision of only 16 bits, and used 48 bits for the state variables (the results of integrations). This was good enough for simulation of space shuttle main engine, space shuttle launch and the aerodynamics of most of the military aircraft and missiles in US inventory.
In large-scale simulations based on explicit discretization, so-called double precision is mandatory if the simulated time exceeds, say, 3 milliseconds. Since the typical time increment in the model is 1e-8 for a finite-element size of 1 millimetre, there are millions of increments, which accumulate errors.
Interestingly, GPUs are AFAIK not supported in such calculations.