Defining multiple real precisions in a program

I am surprised that you are investigating a 16-bit real in search of improved performance.

I have no idea what hardware you have available, but the improved efficiency of vector instructions must be preferable to software emulation of a smaller real, given that going from real(4) to a 16-bit real only halves the memory used.

I would have thought that utilising 256-bit AVX or AVX-512 could be more effective, combined with targeting the L1 cache, which is so important for efficient AVX. Surely a software-emulated real(2) would not match the performance of vectorised real(4).
The alternative is to utilise a processor with a larger cache and increased memory bandwidth. I have been trying to understand the “black art” of AVX inefficiency for a few years now.
My latest attempt, with a Ryzen 5900X (more cache and faster memory), was only moderately successful, but I am hoping that newer hardware and DDR5 memory might show an improvement.
For me, multi-threaded computation adds extra cores, but they all share the same limited memory capacity/bandwidth. Am I now arguing for smaller reals?
