I cannot make any statement for other libm implementations. FreeBSD libm will use a hardware FPU for sqrt if the processor has support. For the trig functions, it will use software implementations. The choice of a software implementation for sin, cos, and tan has two reasons. First, historically, the 387 FPU was inadequate at argument reduction: the 387 used a 66-bit (or so) approximation of pi, whereas for double precision somewhere around 1024 bits are needed. Second, the software implementations are actually faster than the hardware implementations.
For CPU, changing double precision to float precision may carry some additional risks. Perhaps improving the algorithm may lead to more speedup.
For GPUs such as GeForce RTX or Quadro, the hardware is designed for float precision, so using float precision on those cards is typical.
You can also use a very high-end card such as the GP100, whose hardware is designed for double precision.
Thanks very much for a lot of info! I really appreciate it!!
And by reading the comments, I have noticed that my understanding / assumption of single precision may be basically wrong… Specifically, when I wrote the above question, I assumed that replacing double precision (DP) by single precision (SP) would give nearly x2 speedup, regardless of whether the numerical result is okay or not. But now I suppose this may not always be the case (according to the comments, thanks!). More specifically,
- For the latest CPUs, is the speed of basic arithmetic operations (like addition and multiplication) possibly similar between SP and DP?
- If so, does the speed difference of SP and DP arise more from different efficiency of data transfer (between slow memory and CPU) and better use of cache, rather than arithmetic operations inside CPU?
- Does this mean that SP codes may not give speedup if they involve indirect / random access of slow memory (e.g. using an index table) rather than more linear access (suitable for “vectorization”)?
- Also, if the code involves `sin` etc., the speed difference may also depend significantly on the underlying math libraries?
To check the first point above, I have tried the following code (arith1.F90 and trig1.F90), which involves only arithmetic or trigonometric calculations (with no large arrays).
```fortran
!! arith1.F90
integer(int64) :: k
real(rp) :: ans, x

ans = 0
do k = 1, 10 ** 9
   x = real( k, rp )
   ans = ans + 1 / ( x * x )
end do
print *, "ans = ", ans
```
```fortran
!! trig1.F90
integer(int64) :: k
real(rp) :: ans, x

ans = 0
do k = 1, 10 ** 8
   x = ( 2 * mod( k, 2 ) - 1 ) / real( k, rp )
   ans = ans + sin( x ) * cos( x )
end do
print *, "ans = ", ans
```
main.F90 (driver code to include arith1.F90 etc for SP and DP)
```fortran
module test_mod
    use iso_fortran_env, only: int64, sp => real32, dp => real64
    implicit none
contains

subroutine test_sp()
#define rp sp
#ifdef arith1
#include "arith1.F90"
#endif
#ifdef trig1
#include "trig1.F90"
#endif
#undef rp
end

subroutine test_dp()
#define rp dp
#ifdef arith1
#include "arith1.F90"
#endif
#ifdef trig1
#include "trig1.F90"
#endif
end

end module

program main
    use test_mod
    integer(int64) :: t0, t1, trate

    call system_clock( t0, trate )
    call test_sp()
    call system_clock( t1 )
    print *, "time(sp): ", real( t1 - t0, dp ) / trate

    call system_clock( t0, trate )
    call test_dp()
    call system_clock( t1 )
    print *, "time(dp): ", real( t1 - t0, dp ) / trate
end
```
Then the result is like this:
MacMini2012, Core-i7, 2.3 GHz, 1333 MHz DDR3 <-- very old
```
$ gfortran-10 -O3 -Darith1 main.F90
 ans =    1.64472532
 time(sp):    2.4510630000000000
 ans =    1.6449340578345750
 time(dp):    4.9258379999999997    <-- DP x2 slower

$ gfortran-10 -O3 -Dtrig1 main.F90
 ans =   0.209808990
 time(sp):    4.1077310000000002
 ans =   0.20980945474697082
 time(dp):    2.1485820000000002    <-- DP x2 faster (?)
```
Xeon E5-2650 v2 @ 2.60GHz, DDR3-1600 <-- a bit old
```
$ gfortran-10 -O3 -Darith1 main.F90
 ans =    1.64472532
 time(sp):    2.1299402210000000
 ans =    1.6449340578345750
 time(dp):    4.1305806760000001    <-- DP x2 slower

$ gfortran-10 -O3 -Dtrig1 main.F90
 ans =   0.209808990
 time(sp):    0.51460836600000004
 ans =   0.20980945474697082
 time(dp):    1.7247241419999999    <-- DP x3 slower
```
M1 Mac 3.22 GHz, LPDDR5 (2021) <-- newer
```
$ gfortran-10 -O3 -Darith1 main.F90
 ans =    1.64472532
 time(sp):    0.95877999999999997
 ans =    1.6449340578345750
 time(dp):    0.93008999999999997    <-- DP similar to SP

$ gfortran-10 -O3 -Dtrig1 main.F90
 ans =   0.209808990
 time(sp):    3.5496300000000001
 ans =   0.20980945474697082
 time(dp):    0.82566200000000001    <-- DP x4 faster (???)
```
So, the results vary very much depending on the machine, but for newer CPUs SP may not have a speed advantage when memory access is not involved, possibly? I hope the above tests are not doing something strange…
(I will add more tests using array accesses later.)
To me the issue is not whether you can use single instead of double on a CPU but on current GPUs. There are $600 desktop graphics cards that can do 10+ TFLOPS in FP32 and 800+ GFLOPS in FP64, so if you have a code that can map efficiently to a GPU, going with single precision can have a huge payoff. For anyone interested, the following site lists the capabilities of current GPUs and includes FP32 and FP64 performance.
For example, an AMD RX-6700 XT has a theoretical FP32 performance of 13.21 TFLOPS and 825.9 GFLOPS for FP64. Street price (depending on vendor) is in the $500-$650 range. I would never pay that to do just graphics, but I might if I can get anywhere near 500 GFLOPS of FP64 performance.
Today I came across these articles about the use of single and mixed precisions (for molecular dynamics and electronic structure calculations):
Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs
Double Precision Is Not Needed for Many-Body Calculations: Emergent Conventional Wisdom
Accelerating seminumerical Fock-exchange calculations using mixed single- and double-precision arithmetic
This article seems to be using “half-precision”:
- Mixed Precision Fermi-Operator Expansion on Tensor Cores from a Machine Learning Perspective
Up to now I used only double precision in my codes, so various parts use `d0` for floating-point literals… So first of all, I need to change these literals to `_rp` for the experiment (precisely as you mentioned above).
Yes, GPU is very attractive and I think the motivation for the above-linked papers is probably GPU. Though I cannot use GPUs right now, I would like to try them once the program and calculations become more “stable” (possibly with OpenACC?)
What is the flexibility when using GPUs with Fortran?
Can a general compiler, like gfortran, provide this functionality, or are manufacturer-specific compilers/languages required?
@JohnCampbell most compilers support the OpenACC compiler directives. The NVIDIA HPC toolkit compilers (old flang, basically) will now offload DO CONCURRENT (in some cases) automatically to a GPU, but probably only for an NVIDIA GPU. OpenMP 5 (maybe also 4) has the capability to offload directly to a GPU. I don't use OpenMP a lot, so I'm not sure which compilers support a version that allows GPU offload. NVIDIA's compilers support CUDA Fortran and have access to the CUDA libraries. There are also several Fortran-specific libraries that are aimed at providing GPU support. See the Parallel Programming section in
I have been doing LES (large eddy simulation) in single precision for many years. Most often, the differences are negligible on simple grids. I do see differences when averaging some higher moments where you sum powers of quantities, but one could use higher precision just there. When doing steady state simulations, one cannot converge to low residuals with single precision, but that is not something I usually need.
Even if the speed of one multiplication is the same, you can fit more in a vectorized operation. And the memory bandwidth is utilized better when you move around half the amount of data.
There are very few situations in which you need double precision. I have written statistical programs and have never used double precision. The only time you have loss of accuracy is when you subtract numbers that are equal in the first several significant digits. You should know when this is likely to happen.
The single vs double decision has to be driven by a knowledge of the sensitivity of the underlying algorithms you are using to floating-point round-off etc. and the levels of accuracy you need. I'll share an experience I had many years ago (late 1980s) while in grad school. I was taking a viscous flow class and we had a homework assignment to solve the boundary layer equations (a subset of the Navier-Stokes equations) using finite differences. About that time one of the local computer stores was having a fire sale on IBM PCjrs (they were selling complete systems for around $350). I bought one just to have a DOS machine to play with and added an 8087 math coprocessor. I also wanted to try programming in a different language than Fortran, so I got a copy of Turbo Pascal. So I had a total of around $600-650 in the machine. I did the assignment in Pascal and was surprised at 1) how fast Turbo Pascal was and 2) that I actually got converged solutions. A couple of my classmates did the same problem on a VAX 11/780 system that was pushing $1 million in cost. They used single precision and Fortran and were having trouble getting converged solutions. I was puzzled by this until, after digging more into how Turbo Pascal did floating-point math, I found out that the Real type I was using was a 6-byte (48-bit) value with a precision of around 16 to 18 decimal places. Plus the 8087, if I remember correctly, did something like 80-bit math. I mentioned this to my classmates and they switched to double precision and were finally able to get converged solutions. I have always found it very amusing that my $600 or so system would outperform a $1 million system. This was my first real lesson in the value of knowing when you need to use extended precision and when you don't.
The Turbo Pascal 48-bit real had a 39-bit mantissa, which can represent 11-12 decimal digits. This was adequate for the first two versions, which did not support the 8087. Do you remember how much you paid for the 8087 add-on?
@mecej4. Thanks for correcting me on the precision. It was so very long ago that I was basing my number on something I read on-line. I think I paid something like $100 to $150 for the 8087 but (again) that was a long time ago and my memory is not what it used to be. I still regret not keeping the keyboard that came with the system. IBM for all its faults knew how to make a keyboard (at least back then).
List prices for coprocessors (1991)
The availability of 80-bit “hardware” accumulators was very useful in the 80’s, especially for dot_product accumulators when round-off was an issue.
I find it confusing as to what "8087" (80-bit / 10-byte) instructions are available on a 64-bit Windows OS.
The Salford 64-bit compiler I use does not support real*10 (only available for 32-bit).
Gfortran uses a 16 byte format, so I don’t know if this is hardware or software emulation.
SSE and AVX instructions also do not support better than 8-byte reals, which means there is a considerable performance penalty when selecting better than 8-byte accuracy.
I think this poor support for improved precision is a retrograde step, unless someone can identify errors in my summary.
Ifort allows 3 real precisions: 4,8,16 bytes. Gfortran has those and also 10 bytes.
So ifort does not support “8087” hardware, while gfortran supports 8087 instructions in some way.
(My question assumes that 8087 hardware support of 80-bit reals would be better than 80-bit or 128-bit software emulation, although either is much slower than 32/64-bit SSE/AVX vector instructions.)
I do not know whether 80-bit registers are still available in modern Intel/AMD CPUs, or how many AVX registers are available to each CPU, such as when hyper-threading/multi-threading is used?
The best way I have found to get extended precision on modern CPUs is double-double math (which can be vectorized trivially). Double-double gives you about 106 bits of precision, and is generally a bunch faster than 128-bit floating-point emulation.
I think that there are 4 issues here:
Is single-precision arithmetic faster? It may not be. Some processors may map the single-precision values into the (typically 10-byte) registers before carrying out the operations, and the re-mapping may slow the process down. I would be interested to know whether anyone has tested this.
Single precision numbers occupy less memory and for array operations this will improve cache coherence. We have seen this.
The use of single precision numbers reduces inter-processor traffic in multiprocessor tasks. I suspect that this is the most important advantage.
Is it accurate enough? We have set up a system to measure this, please see
We emulate the arithmetic and adjust the precision to a specified number of mantissa bits. All of the libraries and the mechanism to do this are in the fpt distribution at http://simconglobal.com . To test whether single precision is good enough, you can run with single precision, i.e. 23 bits in IEEE numbers, and then re-run with 22 or 21 bits of precision. If the results change significantly, you are at or beyond the acceptable precision. If not, it should be OK.
We found that we could reduce the precision appreciably in our small test programs. This was not a complete surprise. In the early 1980s, the fastest real-time simulation machine was the Applied Dynamics AD10. This computed all derivative values to a precision of only 16 bits, and used 48 bits for the state variables (the results of integrations). This was good enough for simulation of space shuttle main engine, space shuttle launch and the aerodynamics of most of the military aircraft and missiles in US inventory.
In large-scale simulations based on explicit discretization, so-called double precision is mandatory if the simulated time exceeds, say, 3 milliseconds. Since the typical time increment in the model is 1e-8 for a finite-element size of 1 millimetre, there are millions of increments, which accumulate errors.
Interestingly, GPUs are AFAIK not supported in such calculations.