To define multiple real precisions in a program, which are relative to each other so that some calculations can be done in a higher precision than others, I came up with

program kinds
implicit none
integer, parameter :: qp_extra_p = 10 ! set to 1 to get qp = 10 for gfortran
integer, parameter :: sp = kind(1.0) ! could be sp = kind(1.0d0)
integer, parameter :: dp = selected_real_kind(p=1+precision(1.0_sp),r=1+exponent(1.0_sp))
integer, parameter :: qp = selected_real_kind(p=qp_extra_p+precision(1.0_dp),r=1+exponent(1.0_dp))
real(kind=sp) :: xsingle
real(kind=dp) :: xdouble
real(kind=qp) :: xquad ! will not compile if sp = kind(1.0d0)
print*," kind",sp,dp,qp
print*,"precision",precision(xsingle),precision(xdouble),precision(xquad)
print*," range",range(xsingle),range(xdouble),range(xquad)
end program kinds

Using the method above of setting kinds, I have written a version of norm2 that appears to handle overflow better than Intel Fortran does.

module norm_mod
implicit none
private
public :: wp, my_norm2, my_norm2_extra
integer, parameter :: wp = kind(1.0)
integer, parameter :: wp2 = selected_real_kind(p=10+precision(1.0_wp),r=1+exponent(1.0_wp))
contains
pure function my_norm2(x) result(y)
real(kind=wp), intent(in) :: x(:)
real(kind=wp) :: y
y = sqrt(sum(x**2))
end function my_norm2
!
pure function my_norm2_extra(x) result(y)
! use extra precision to calculate L2 norm
real(kind=wp), intent(in) :: x(:)
real(kind=wp) :: y
y = sqrt(sum(real(x,kind=wp2)**2))
end function my_norm2_extra
end module norm_mod
!
program test_norm2
use norm_mod, only: wp, my_norm2, my_norm2_extra
implicit none
integer, parameter :: n = 1000, nhuge = 100
real(kind=wp) :: huge_vec(n)
huge_vec = 0.0_wp
huge_vec(:nhuge) = huge(0.0_wp)/nhuge
print*," true",huge(0.0_wp)/sqrt(real(nhuge,kind=wp))
print*,"intrinsic",norm2(huge_vec)
print*," naive",my_norm2(huge_vec)
print*," extra",my_norm2_extra(huge_vec)
end program test_norm2

Intel Fortran output:

true 3.4028235E+37
intrinsic Infinity
naive Infinity
extra 3.4028235E+37

gfortran output:

true 3.40282347E+37
intrinsic 3.40282347E+37
naive Infinity
extra 3.40282347E+37

Also: you increase both the precision and the exponent range. I can imagine that it is not always desirable - you might want to increase precision and leave the range as is. Though probably there are only few systems where these two parameters are disconnected.

A few years ago we looked at a number format where the precision is traded off against the exponent . So, for example, if the exponent fits in 4 bits we have more bits for precision. The advantages are:
i The representation is more even in log space;
ii. The data size can be kept down for interprocessor communication.
But
SELECTED_REAL_KIND gets more interesting.

I have never seen anything of the sort ;). Something that is of considerable interest in the data science community is half-precision reals, because they have little to gain with six decimals but everything with less memory per number. That would, however, still fit in a more or less ordered sequence of number formats - the precision is reduced along with the range. Your example would upset that.

I have seen a Convex machine in years long gone by where there were two number formats. The main difference being a different interpretation of the exponent encoding and a slightly faster processing with the native number format (that was not IEEE-compliant). But that was before Fortran 90 was a serious option. You had to select the format via a compiler option.

Ads it happens there is a thread on the gfortran mailing list that concerns the handling of signalling NaNs. A large number of platforms are mentioned there each with its own ideas about what a number should be (*). Perhaps one of them has something akin to the number format you mention.

(*) Is it my too poor mastering of the English language that I have to use the word â€śnumberâ€ť in at least two different meanings?

I honestly would not know. I guess most of the ones I knew about or heard about have crumbled to dust, but the GCC compiler suite, hence gfortran, is tested for a large range of devices.

I think Itanium VMS Fortran can run in D-format. The real and double precision numbers have the same exponent format (and therefore range). Double precision just has an additional 32 bits of precision at the low precision end. A side effect is that you can read a single precision value from the address of a double precision variable. This was common practice and makes migration to IEEE exciting.

Gould-SEL numbers have the same characteristic but the exponent is HEX. But the Fortran is extended FORTRAN 77.

With respect to variable mantissa and exponent ranges, Iâ€™ll point out the unum number format promoted in recent years by John Gustafson. The wikipedia page is a decent place to start.

I am surprised that you are investigating a 16-bit real to look for an improved performance.

I have no idea of the hardware you have available, but improved efficiency of vector instructions must be preferable to software emulation of smaller memory real calculations, given the ratios are only 50% change of memory.

I would have thought that utilising avx-256 or avx-512 could be more effective, combined with targeting the L1 cache, which is so important for efficient AVX. Surely a software real(2) would not achieve the vector real(4) performance.
The alternative is to utilise a processor with larger cache and increased memory bandwidth. I have been trying to understand the â€śblack artâ€ť AVX inefficiency for a few years now.
My latest attempt with Ryzen 5900X: more cache and faster memory was only moderately successful, but I am hoping that newer hardware and DDR5 memory might show an improvement.
For me, multi thread computation has extra cores, but they share the same limited memory addressing capacity/bandwidth. Am I now arguing for smaller reals ?

I am not involved directly, just following whatâ€™s going on in the field.

The reasons why they are looking at 16-bit reals are several:

the application is extremely memory intensive; for each cell of a 3D spatial grid, you need to store 19 or even 27 DDFâ€™s (discrete distribution functions), plus any additional scalar fields (temperature, pressure, etc.)

given how memory intensive it is, the available GPU memory quickly becomes a restriction on the domain size (or grid size); this also puts a cap on the Reynolds number you can simulate (without resorting to turbulence models)

the algorithm is bandwith-limited; a large share of time is spent just shifting the DDFâ€™s in memory (think particles moving along a grid)

It turns out with some tricks, and assuming a stable simulation, the DDFâ€™s will fall between -2 and 2, mostly aggregating around 0. Parts of the mantissa and exponent are therefore completely unused.

As far I can understand, the application was written in OpenCL, for hardware that had support for 16-bit floats (cl_khr_fp16), and not in Fortran.

As Ivan points out, any way to exploit half precision can give potential benefits on modern GPUs which now seem to prioritise FP16 performance for machine learning applications. In fact any way to exploit lower precision on GPUs is beneficial due to the financial cost now required for full double precision hardware support.

I did some simple investigations into mixed-precision solvers for CFD on GPUs and got a 2x speedup using FP32 by alleviating the effective memory bottleneck.

One place where 16 bit can be really useful is if you are doing iterative algorithms. Running the first few iterations in a low precision type can get you a good approximation that you can then refine in higher precision.