Thanks @shahmoradi for finding this one. A colleague of mine, Salvatore Cielo, has done the painstaking work of porting ECHO to SYCL:
Here is a performance graph of the SYCL version:
(Image Source: https://doi.org/10.1145/3585341.3585382)
The cool thing is that the same SYCL code runs on GPUs from all three vendors. SYCL support on NVIDIA and AMD cards is provided via the open-source oneAPI DPC++ compiler (the open-source counterpart of the Intel oneAPI DPC++ compiler). AFAIK, the NVIDIA and AMD support was contributed by Codeplay (recently acquired by Intel).
Anyway, the authors of the newly published Fortran article state:
The highest peak performance using the four GPUs of a single LEONARDO node is 2.2 × 10^8 cells updated each iteration per second, reached at both 512^3 and 640^3 resolutions for the second-order version of the code.
The 2.2E8 updates per second would sit in the top-right corner of the plot, so it's in the same ballpark as the SYCL port. However, a LEONARDO node has four A100 GPUs, and I think the measurements for the SYCL version are from a single GPU, which would mean a factor-of-4 difference, but I will have to verify this. I'm also not familiar enough to say whether this was the same test case and precisely the same method.
The nice thing about the new work is that it was achieved rather easily, with only small modifications to the original ECHO code:
… we have demonstrated how a state-of-the-art numerical code for relativistic magnetohydrodynamics, ECHO has been very easily ported to GPU-based systems (in particular, on NVIDIA Volta and Ampere GPUs).
[…]
The version presented here is thus basically the original one, and now the very same code can run indifferently on a laptop, on multiple CPU-based cores, or on GPU accelerated devices. [emphasis added]
Since the Intel `ifx` compiler also supports off-loading `do concurrent`, it would be interesting to see how SYCL and `do concurrent` compare. In the new article the authors used OpenACC directives for data movement. At least from the Intel article, The Case for OpenMP* Target Offloading, I wasn't able to tell whether one could combine OpenMP data-movement directives with `do concurrent`. I later found a discussion on LinkedIn about this, where Henry Gabb from Intel stated:
On Intel platforms, I still need the OpenMP directives to control host-device data transfers.
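For what it's worth, here is a minimal sketch of what that combination might look like. This is my own illustration, not code from either paper, and the offload behavior (and the required compiler flags) depends on the `ifx` version, so check the compiler documentation:

```fortran
! Hypothetical sketch: OpenMP target data directives manage the
! host-device transfers, while the kernel itself is a plain
! `do concurrent` loop that the compiler may offload.
program dc_offload_sketch
  implicit none
  integer, parameter :: n = 1024
  real :: a(n), b(n)
  integer :: i

  a = 1.0

  ! Map `a` to the device once and `b` back once, instead of
  ! transferring around every kernel launch.
  !$omp target data map(to: a) map(from: b)
  do concurrent (i = 1:n)
     b(i) = 2.0 * a(i)
  end do
  !$omp end target data

  print *, b(1), b(n)
end program dc_offload_sketch
```

Whether the `do concurrent` loop inside the `target data` region actually uses the device copies of `a` and `b`, rather than triggering its own implicit transfers, is exactly the question I couldn't answer from the article; the new ECHO paper sidesteps it by using OpenACC data directives instead.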
@sumseq also left a comment about his experiences: Early Results Using Fortran's Do Concurrent Standard Parallelism on Intel GPUs with the IFX Compiler