No, you are correct here, and the people who tell you not to use newer, more expressive language features that should be easier to optimize have Stockholm syndrome (where compiler developers are the kidnappers).
I’ll note that some compilers do a great job with array notation and intrinsics. The NVIDIA Fortran compiler maps the `=` operator to CUDA memcpy and offloads `TRANSPOSE`, `MATMUL`, and many others to GPUs by mapping to a CUTENSOR back-end.
Cray’s compiler also does a good job with such things, although they don’t have GPU support for them at the moment.
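For concreteness, here is a minimal sketch of the kind of array notation and intrinsics I mean. It is plain standard Fortran; the program name, the size `n`, and the fill values are made up for illustration, and whether any of it is offloaded depends on the compiler and the flags you use, which this sketch does not try to show.

```fortran
program array_intrinsics_sketch
  implicit none
  integer, parameter :: n = 4          ! hypothetical size, illustration only
  real :: a(n,n), b(n,n), t(n,n), c(n,n), ref(n,n)
  integer :: i, j, k

  ! Fill A with small distinct integers (exactly representable in REAL)
  do j = 1, n
    do i = 1, n
      a(i,j) = real(i + n*(j-1))
    end do
  end do

  b = a                  ! whole-array assignment with the = operator
  t = transpose(a)       ! TRANSPOSE intrinsic
  c = matmul(a, b)       ! MATMUL intrinsic

  ! Loop-based reference for the matrix product, for comparison
  ref = 0.0
  do j = 1, n
    do k = 1, n
      do i = 1, n
        ref(i,j) = ref(i,j) + a(i,k) * b(k,j)
      end do
    end do
  end do

  ! Self-checks: the intrinsics must agree with the explicit definitions
  do j = 1, n
    do i = 1, n
      if (t(j,i) /= a(i,j)) error stop "transpose mismatch"
    end do
  end do
  if (any(c /= ref)) error stop "matmul mismatch"

  print *, "array assignment, TRANSPOSE and MATMUL agree with the loops"
end program array_intrinsics_sketch
```

The point is that the three one-line statements tell the compiler exactly what operation is wanted, while the loop nest leaves it to pattern-match the same thing.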
BabelStream Fortran has array-based implementations (with OpenACC kernels and OpenMP workshare, too) so that folks can measure the difference versus loop-based versions. It’s certainly not the most complicated benchmark out there, but it’s very easy to reason about.
- Fortran ports by jeffhammond · Pull Request #135 · UoB-HPC/BabelStream · GitHub
- Benchmarking Fortran DO CONCURRENT on CPUs and GPUs Using BabelStream — University of Bristol (open access PDF)
- Benchmarking Fortran DO CONCURRENT on CPUs and GPUs Using BabelStream | IEEE Conference Publication | IEEE Xplore
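
To give a flavor of what is being compared, here is a rough sketch of a triad-style kernel written three ways: an explicit loop, array notation, and `DO CONCURRENT`. This is not the BabelStream source; the program name, array length, and initial values are assumptions chosen for illustration, and there is no timing code.

```fortran
program triad_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 1024                   ! hypothetical array length
  real(real64), parameter :: scalar = 0.4_real64   ! illustrative value
  real(real64), parameter :: tol = 1.0e-12_real64
  real(real64), allocatable :: a_loop(:), a_array(:), a_dc(:), b(:), c(:)
  integer :: i

  allocate(a_loop(n), a_array(n), a_dc(n), b(n), c(n))
  b = 0.2_real64     ! illustrative initial values
  c = 0.1_real64

  ! Loop-based version
  do i = 1, n
    a_loop(i) = b(i) + scalar * c(i)
  end do

  ! Array-notation version
  a_array = b + scalar * c

  ! DO CONCURRENT version
  do concurrent (i = 1:n)
    a_dc(i) = b(i) + scalar * c(i)
  end do

  ! All three variants should produce the same result (up to rounding)
  if (maxval(abs(a_loop - a_array)) > tol) error stop "array version differs"
  if (maxval(abs(a_loop - a_dc)) > tol) error stop "do concurrent version differs"

  print *, "all three triad variants agree"
end program triad_sketch
```

Since all three forms compute the same thing, any performance difference between them comes down to what the compiler does with each.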