I cannot speak for other libm implementations. FreeBSD libm will use the hardware FPU for sqrt if the processor supports it. For the trig functions, it uses software implementations. The choice of a software implementation for sin, cos, and tan has two reasons. First, historically, the 387 FPU was inadequate at argument reduction: it used a 66-bit (or so) approximation of π, whereas for double precision somewhere around 1024 bits are needed. Second, the software implementations are actually faster than the hardware ones.
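The argument-reduction problem can be quantified directly. Here is a Python sketch (the 40-digit value of 2π is an assumed reference constant, correct to that many digits) showing how the representation error of double-precision 2π corrupts the reduced argument of a large trig input:

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 50
# 2*pi to ~40 digits, used as the "exact" reference value
TWO_PI = Decimal("6.283185307179586476925286766559005768394")

# absolute representation error of the double-precision 2*pi (~2.4e-16)
err = abs(Decimal(2 * math.pi) - TWO_PI)

# reducing x = 1e10 into [0, 2*pi) removes ~1.6e9 multiples of 2*pi;
# each removed multiple contributes one copy of the representation error
n = int(Decimal(10) ** 10 / TWO_PI)
angle_error = n * err
# angle_error is ~4e-7: roughly 9 of the 16 significant digits of the
# reduced argument are already gone before sin() is even evaluated
```

This is why a sin that is accurate for arbitrary double arguments needs π to far more bits than the 387's 66-bit constant provides.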
For CPUs, changing double precision to single precision may carry some additional risks. Improving the algorithm may well yield more speedup.
For GPUs such as the GeForce RTX or Quadro series, the hardware is designed for single precision, so using single precision on those cards is typical. You can also use a very high-end card such as the GP100, whose hardware is designed for double precision.
Thanks very much for a lot of info! I really appreciate it!!
And by reading the comments, I have noticed that my understanding / assumption about single precision may be basically wrong… Specifically, when I wrote the above question, I assumed that replacing double precision (DP) by single precision (SP) would give nearly a 2x speedup, regardless of whether the numerical result is okay or not. But now I suppose this may not always be the case (according to the comments, thanks!). More specifically,
- For the latest CPUs, is the speed of basic arithmetic operations (like addition and multiplication) possibly similar between SP and DP?
- If so, does the speed difference between SP and DP arise more from the efficiency of data transfer (between slow memory and the CPU) and better use of cache, rather than from arithmetic operations inside the CPU?
- Does this mean that SP code may not give a speedup if it involves indirect / random access to slow memory (e.g. via an index table) rather than more linear access (suitable for "vectorization")?
- Also, if the code involves sin etc., may the speed difference also depend significantly on the underlying math library?
To check the first point above, I have tried the following code (arith1.F90 and trig1.F90), which involves only arithmetic or trigonometric calculations (with no large arrays).
!! arith1.F90
integer(int64) :: k
real(rp) :: ans, x
ans = 0
do k = 1, 10 ** 9
x = real( k, rp )
ans = ans + 1 / ( x * x )
end do
print *, "ans = ", ans
!! trig1.F90
integer(int64) :: k
real(rp) :: ans, x
ans = 0
do k = 1, 10 ** 8
x = ( 2 * mod( k, 2 ) - 1 ) / real( k, rp )
ans = ans + sin( x ) * cos( x )
end do
print *, "ans = ", ans
main.F90 (driver code to include arith1.F90 etc for SP and DP)
module test_mod
use iso_fortran_env, only: int64, sp => real32, dp => real64
implicit none
contains
subroutine test_sp()
#define rp sp
#ifdef arith1
#include "arith1.F90"
#endif
#ifdef trig1
#include "trig1.F90"
#endif
#undef rp
end
subroutine test_dp()
#define rp dp
#ifdef arith1
#include "arith1.F90"
#endif
#ifdef trig1
#include "trig1.F90"
#endif
end
end module
program main
use test_mod
integer(int64) :: t0, t1, trate
call system_clock( t0, trate )
call test_sp()
call system_clock( t1 )
print *, "time(sp): ", real( t1 - t0, dp ) / trate
call system_clock( t0, trate )
call test_dp()
call system_clock( t1 )
print *, "time(dp): ", real( t1 - t0, dp ) / trate
end
Then the result is like this:
Mac Mini 2012, Core i7, 2.3 GHz, 1333 MHz DDR3 <-- very old
$ gfortran-10 -O3 -Darith1 main.F90
ans = 1.64472532
time(sp): 2.4510630000000000
ans = 1.6449340578345750
time(dp): 4.9258379999999997 <-- DP x2 slower
$ gfortran-10 -O3 -Dtrig1 main.F90
ans = 0.209808990
time(sp): 4.1077310000000002
ans = 0.20980945474697082
time(dp): 2.1485820000000002 <-- DP x2 faster (?)
Xeon E5-2650 v2 @ 2.60GHz, DDR3-1600 <-- a bit old
$ gfortran-10 -O3 -Darith1 main.F90
ans = 1.64472532
time(sp): 2.1299402210000000
ans = 1.6449340578345750
time(dp): 4.1305806760000001 <-- DP x2 slower
$ gfortran-10 -O3 -Dtrig1 main.F90
ans = 0.209808990
time(sp): 0.51460836600000004
ans = 0.20980945474697082
time(dp): 1.7247241419999999 <-- DP x3 slower
M1 Mac 3.22 GHz, LPDDR5 (2021) <-- newer
$ gfortran-10 -O3 -Darith1 main.F90
ans = 1.64472532
time(sp): 0.95877999999999997
ans = 1.6449340578345750
time(dp): 0.93008999999999997 <-- DP similar to SP
$ gfortran-10 -O3 -Dtrig1 main.F90
ans = 0.209808990
time(sp): 3.5496300000000001
ans = 0.20980945474697082
time(dp): 0.82566200000000001 <-- DP x4 faster (???)
So the results vary a great deal between machines, but on newer CPUs SP may not have a speed advantage when memory access is not involved, possibly? I hope the above tests are not doing something strange…
(I will add more tests using array accesses later.)
To me the issue is not whether you can use single instead of double on a CPU but on current GPUs. There are $600 desktop graphics cards that can do 10+ TFLOPS in FP32 and 800+ GFLOPS in FP64, so if you have a code that can map efficiently to a GPU, going with single precision can have a huge payoff. For anyone interested, the following site lists the capabilities of current GPUs and includes FP32 and FP64 performance.
For example, an AMD RX 6700 XT has a theoretical FP32 performance of 13.21 TFLOPS and 825.9 GFLOPS for FP64. Street price (depending on vendor) ranges from $500 to $650. I would never pay that just to do graphics, but I might if I can get anywhere near 500 GFLOPS of FP64 performance.
Today I came across these articles about the use of single and mixed precision (for molecular dynamics and electronic structure calculations):

- Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs
  https://pubs.acs.org/doi/10.1021/ct200909j
- Double Precision Is Not Needed for Many-Body Calculations: Emergent Conventional Wisdom
  https://pubs.acs.org/doi/10.1021/acs.jctc.8b00321
- Accelerating semi-numerical Fock-exchange calculations using mixed single- and double-precision arithmetic
  https://aip.scitation.org/doi/full/10.1063/5.0045084
This article seems to be using "half precision":

- Mixed Precision Fermi-Operator Expansion on Tensor Cores from a Machine Learning Perspective
  https://pubs.acs.org/doi/10.1021/acs.jctc.1c00057
Up to now I have used only double precision in my codes, so various parts use d0 for floating-point literals… So first of all, I need to change these literals to _rp to experiment (precisely as you mentioned above).
Yes, GPUs are very attractive, and I think the motivation for the above-linked papers is probably GPUs. Though I cannot use GPUs right now, I would like to try them once the program and calculations become more "stable" (possibly with OpenACC?).
What is the flexibility when using GPUs with Fortran?
Can a general compiler, like gfortran, provide this functionality, or are manufacturer-specific compilers/languages required?
@JohnCampbell most compilers support the OpenACC compiler directives. The NVIDIA HPC toolkit compilers (old flang, basically) will now offload DO CONCURRENT (in some cases) automatically to a GPU, but probably only for an NVIDIA GPU. OpenMP 5 (maybe also 4) has the capability to offload directly to a GPU. I don't use OpenMP a lot, so I'm not sure which compilers support a version that allows GPU offload. NVIDIA's compilers support CUDA Fortran and have access to the CUDA libraries. There are also several Fortran-specific libraries aimed at providing GPU support. See the Parallel Programming section in
I have been doing LES (large eddy simulation) in single precision for many years. Most often, the differences are negligible on simple grids. I do see differences when averaging some higher moments where you sum powers of quantities, but one could use higher precision just there. When doing steady state simulations, one cannot converge to low residuals with single precision, but that is not something I usually need.
Even if the speed of one multiplication is the same, you can fit more values into one vectorized operation. And memory bandwidth is utilized better when you move around half the amount of data.
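The data-movement half of that argument is visible from storage sizes alone; a small Python sketch with the standard array module (the array length is arbitrary):

```python
from array import array

n = 100_000
dp = array('d', [0.0]) * n   # 64-bit doubles (8 bytes each)
sp = array('f', dp)          # the same values stored as 32-bit singles

print(dp.itemsize, sp.itemsize)                  # 8 vs 4 bytes per element
print(len(dp.tobytes()), len(sp.tobytes()))      # the SP buffer is half the size
```

Every sweep over the SP array moves half as many bytes through the memory hierarchy, and a 256-bit AVX register likewise holds 8 singles but only 4 doubles per vector operation.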
There are very few situations in which you need double precision. I have written statistical programs and have never used double precision. The only time you get a loss of accuracy is when you subtract numbers that are equal in the first several significant digits. You should know when this is likely to happen.
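A classic example of that subtraction hazard is the small root of a quadratic; a Python sketch (coefficients chosen deliberately to force the cancellation):

```python
import math

# x^2 + 1e8*x + 1 = 0; the small root is approximately -1e-8
a, b, c = 1.0, 1e8, 1.0
d = math.sqrt(b * b - 4.0 * a * c)

r_naive = (-b + d) / (2.0 * a)    # -b and +d agree to ~15 digits: cancellation
r_stable = (2.0 * c) / (-b - d)   # algebraically identical, no cancellation

# r_stable matches -1e-8 to full double accuracy;
# r_naive has essentially no correct significant digits left
print(r_naive, r_stable)
```

The stable form works for any precision, which is exactly the kind of algorithm knowledge that lets you stay in single precision safely.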
The single vs. double decision has to be driven by knowledge of the sensitivity of the underlying algorithms to floating-point round-off etc. and the levels of accuracy you need. I'll share an experience I had many years ago (late 1980s) while in grad school. I was taking a viscous flow class and we had a homework assignment to solve the boundary layer equations (a subset of the Navier-Stokes equations) using finite differences. About that time one of the local computer stores was having a fire sale on IBM PCjrs (they were selling complete systems for around $350). I bought one just to have a DOS machine to play with and added an 8087 math coprocessor. I also wanted to try programming in a different language than Fortran, so I got a copy of Turbo Pascal. So I had a total of around $600-$650 in the machine.

I did the assignment in Pascal and was surprised at 1) how fast Turbo Pascal was and 2) that I actually got converged solutions. A couple of my classmates did the same problem on a VAX 11/780 system that was pushing $1 million in cost. They used single precision and Fortran and were having trouble getting converged solutions. I was puzzled by this until, after digging more into how Turbo Pascal did floating-point math, I found out that the Real type I was using was a 6-byte (48-bit) value with a precision of around 16 to 18 decimal places. Plus the 8087, if I remember correctly, did something like 80-bit math. I mentioned this to my classmates and they switched to double precision and were finally able to get converged solutions.

I have always found it very amusing that my $600 or so system would outperform a $1 million system. This was my first real lesson in the value of knowing when you need extended precision and when you don't.
The Turbo Pascal 48-bit real had a 39-bit mantissa, which can represent 11-12 decimal digits. This was adequate for the first two versions, which did not support the 8087. Do you remember how much you paid for the 8087 add-on?
@mecej4. Thanks for correcting me on the precision. It was so very long ago that I was basing my number on something I read online. I think I paid something like $100 to $150 for the 8087 but (again) that was a long time ago and my memory is not what it used to be. I still regret not keeping the keyboard that came with the system. IBM for all its faults knew how to make a keyboard (at least back then).
The availability of 80-bit "hardware" accumulators was very useful in the '80s, especially for dot_product accumulators when round-off was an issue.
I find it confusing as to what "8087" (80-bit / 10-byte) instructions are available on a 64-bit Windows OS.
The Salford 64-bit compiler I use does not support real*10 (it is only available for 32-bit).
Gfortran uses a 16-byte format, so I don't know if this is hardware or software emulation.
SSE and AVX instructions also do not support better than 8-byte operands, which means there is a considerable performance penalty when selecting better than 8-byte accuracy.
I think this poor support for improved precision is a retrograde step, unless someone can identify errors in my summary.
Ifort allows 3 real precisions: 4, 8, and 16 bytes. Gfortran has those and also 10 bytes.
So ifort does not support the "8087" hardware, while gfortran supports 8087 instructions in some way.
(My question assumes that 8087 hardware support of 80-bit reals would be faster than 80-bit or 128-bit software emulation, although both are much slower than 32/64-bit SSE/AVX vector instructions.)
I do not know if 80-bit registers are still available in modern Intel/AMD CPUs, or how many AVX registers are available to each CPU, such as when hyper-threading/multi-threading is used.
The best way I've found to get extended precision on modern CPUs is double-double math (which can be vectorized trivially). Double-double gives you about 106 bits of precision, and is generally a good deal faster than 128-bit floating-point emulation.
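The double-double trick can be sketched in Python (whose floats are IEEE doubles); the helper names are mine, built on Knuth's error-free two-sum transformation:

```python
def two_sum(a, b):
    """Error-free transformation: returns (s, e) with s = fl(a+b) and a+b == s+e exactly."""
    s = a + b
    bv = s - a
    e = (a - (s - bv)) + (b - bv)
    return s, e

def dd_add(x, y):
    """Add two double-double numbers (hi, lo); keeps ~106 bits of the sum."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)  # renormalize so the low part is below ulp(hi)

# plain doubles: 1.0 + 1e-20 == 1.0, the tiny addend is lost entirely
hi, lo = dd_add((1.0, 0.0), (1e-20, 0.0))
print(hi, lo)  # the tiny part survives in the low word
```

Since every operation is a handful of ordinary adds, the same recipe maps directly onto SSE/AVX lanes, which is why it vectorizes so well compared to a true 128-bit emulation.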
I think that there are 4 issues here:

1. Is single-precision arithmetic faster? It may not be. Some processors may map the single-precision values into the (typically 10-byte) registers before carrying out the operations. The remapping may slow the process down. I would be interested to know whether anyone has tested this.
2. Single-precision numbers occupy less memory, and for array operations this will improve cache coherence. We have seen this.
3. The use of single-precision numbers reduces inter-processor traffic in multiprocessor tasks. I suspect that this is the most important advantage.
4. Is it accurate enough? We have set up a system to measure this, please see
   http://simconglobal.com/Testing_the_Numerical_Precisions_Required_to_Execute_Real_World_Programs.pdf
We emulate the arithmetic and adjust the precision to a specified number of mantissa bits. All of the libraries and the mechanism to do this are in the fpt distribution at http://simconglobal.com . To test whether single precision is good enough, you can run with single precision, i.e. 23 mantissa bits in IEEE numbers, and then rerun with 22 or 21 bits of precision. If the results change significantly, you are at or beyond the acceptable precision. If not, it should be OK.
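The bit-trimming idea can be mimicked in a toy way without fpt; a Python sketch (the helper names trim and series are my own, not fpt's API) that reruns a 1/k² sum at 23 vs 21 mantissa bits:

```python
import struct

def trim(x, bits):
    """Keep only `bits` of the 52 explicit mantissa bits of a double (truncate toward zero)."""
    (u,) = struct.unpack('<Q', struct.pack('<d', x))
    u &= ~((1 << (52 - bits)) - 1)       # zero out the discarded low mantissa bits
    return struct.unpack('<d', struct.pack('<Q', u))[0]

def series(bits, n=100_000):
    """Sum 1/k**2 with every intermediate result trimmed to `bits` mantissa bits."""
    ans = 0.0
    for k in range(1, n + 1):
        ans = trim(ans + trim(1.0 / (k * k), bits), bits)
    return ans

# compare "single precision" (23 bits) against 21 bits, as suggested above;
# the gap between the two shows how much accuracy the last bits are carrying
s23, s21 = series(23), series(21)
print(s23, s21)
```

Because this truncates rather than rounds, it is a pessimistic emulation, but the rerun-with-fewer-bits comparison works the same way as described for fpt.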
We found that we could reduce the precision appreciably in our small test programs. This was not a complete surprise. In the early 1980s, the fastest real-time simulation machine was the Applied Dynamics AD10. This computed all derivative values to a precision of only 16 bits, and used 48 bits for the state variables (the results of integrations). This was good enough for simulating the Space Shuttle main engine, the Space Shuttle launch, and the aerodynamics of most of the military aircraft and missiles in the US inventory.
In large-scale simulations based on explicit discretization, so-called double precision is mandatory if the simulated time exceeds, say, 3 milliseconds. Since the typical time increment in such a model is 1e-8 s for a finite-element size of 1 millimeter, there are millions of increments, which accumulate errors.
Interestingly, GPUs are AFAIK not supported in such calculations.
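That error accumulation over millions of increments is easy to reproduce; a Python sketch (the step size and count are illustrative, not taken from any particular FE model), emulating single-precision storage by rounding every result through a 32-bit float:

```python
import struct

def to_sp(x):
    """Round a double to the nearest IEEE single and back (emulates float32 storage)."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

dt = 1.0e-3        # hypothetical explicit time step
n = 2_000_000      # two million increments; the exact total time is 2000.0
t_dp, t_sp = 0.0, 0.0
for _ in range(n):
    t_dp += dt                       # double-precision accumulation
    t_sp = to_sp(t_sp + to_sp(dt))   # single-precision accumulation

print(t_dp)  # ~2000.0: the double error stays negligible
print(t_sp)  # drifts tens of units below 2000 as increments are rounded away
```

Once the running sum grows large relative to the increment, each single-precision add systematically loses part of the step, which is exactly why explicit codes with many small increments need the wider format.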