“GPU” appears nowhere in the Fortran standard and I’m pretty sure that’s intentional. The late Dan Nagle, who chaired the US arm of the Fortran committee, said to me, “The philosophy of Fortran is to give the programmer the ability to communicate properties of their code rather than to mandate what the compiler do to exploit those properties.” I wasn’t on the committee when do concurrent was developed, but my understanding is that it was designed with GPUs in mind – though I doubt the point was “baking GPU stuff directly into Fortran.”
A do loop is inherently a sequential construct: it explicitly tells the compiler “do these iterations in this order.” That ordering is essential for things like time advancement, where the calculations must respect causality to be correct. But because parallel programming was necessary long before parallel programming languages went mainstream, and developers understandably couldn’t wait, we developed a pattern of first telling the compiler explicitly to do something sequentially and then undoing that sequential ordering with directives. One of the worst outcomes of this pattern is that we sometimes end up with more directives than program statements – all in the name of undoing what we did! It seems much clearer to me to just tell the compiler what we mean: these iterations can be done in any order you choose. That’s the purpose of do concurrent, and fortunately there are at least four compilers that can now parallelize do concurrent on CPUs or GPUs: compilers from NVIDIA, Intel, HPE (Cray), and LLVM, listed in approximately chronological order of how long each compiler has had this capability. For an example of do concurrent achieving essentially the same performance as OpenMP when compiling with LLVM Flang and running on a CPU, see the slides from my “Just Write Fortran” talk at the 2024 Parallel Applications Workshop – Alternatives to MPI+X. That work is based on AMD’s ROCm fork of LLVM Flang, where I believe there is also already a branch that offers experimental support for offloading do concurrent to a GPU.
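To make the contrast concrete, here is a minimal sketch of a saxpy-style kernel written both ways. The directive version (shown in comments) first asserts a sequential ordering with a do loop and then undoes it with an OpenMP annotation; the do concurrent version states the order-independence directly in the language. The program itself is my own illustrative example, not taken from the talk mentioned above.

```fortran
program saxpy_demo
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real :: a, x(n), y(n)

  a = 2.0
  x = 1.0
  y = 3.0

  ! Directive style: a sequential loop, plus a directive undoing the ordering.
  ! !$omp parallel do
  ! do i = 1, n
  !   y(i) = y(i) + a*x(i)
  ! end do

  ! do concurrent style: the iterations are declared order-independent,
  ! so the compiler is free to run them in parallel on a CPU or GPU.
  do concurrent (i = 1:n)
    y(i) = y(i) + a*x(i)
  end do

  ! Sanity check: each element should now be 3 + 2*1 = 5.
  if (abs(y(1) - 5.0) > 1.0e-6) error stop "unexpected result"
  print *, "y(1) =", y(1)
end program saxpy_demo
```

Note that do concurrent is standard Fortran (since Fortran 2008), so the same source compiles unchanged whether or not the compiler chooses to parallelize it.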
I’m old enough to remember floating-point co-processors in the 1990s. These days, when I mention floating-point co-processors to anyone under 40, they usually haven’t even heard of them because those devices eventually got absorbed into the CPU. I suspect we’re already seeing the early stages of a similar trend with GPUs, which may be why the committee never intended to explicitly address GPUs in the language. I often wonder whether young developers in future decades will even know the term GPU and will be debating whether to bake some new form of accelerator into the language.