@rouson will visit me next week at the University of Miami and he’ll give a seminar that may be of interest here. It’s open to the public by Zoom or in person.
Speaker: Damian Rouson, Computer Languages and Systems Software Group, Lawrence Berkeley National Laboratory, CA.
Title: Language-Based Parallel Programming for Earth System Modeling
Abstract: Language-based parallel programming models offer the promise of improved portability, programmability, and performance along with reduced maintenance costs relative to some alternative approaches. Fortran 2018 delivers on this promise by incorporating and expanding upon Coarray Fortran, a feature set originally defined as a syntactically small extension to Fortran 95. Coarray Fortran supports single-program, multiple data (SPMD) programming compatible with other SPMD models such as the widely employed Message Passing Interface (MPI). Fortran also provides array statements and concurrent loop iterations, “do concurrent”, that a compiler can exploit for multithreading, vectorization, or offloading computation to accelerators such as graphics processing units (GPUs). Four widely available compilers now support the aforementioned language features, making the time ripe for exploring language-based parallelism in Fortran. This talk will demonstrate the use of coarrays for advection prediction in the Intermediate Complexity Atmospheric Research (ICAR) model and the use of array statements and “do concurrent” in Inference-Engine. Developed at the National Center for Atmospheric Research (NCAR) and initially motivated by the precipitation input requirements of hydrological models, ICAR facilitates downscaling to predict the regional impacts of global climate change. Developed at Berkeley Lab, the new deep learning framework Inference-Engine targets the large-batch inference needs of applications such as ICAR, which will need to infer multiple variables at each grid point at each time step if we succeed in training a neural network to serve as the cloud microphysics module. The talk will highlight the challenges and benefits of the language-based approach in the context of ICAR and Inference-Engine from the standpoint of portability, programmability, performance, and maintenance.
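For anyone less familiar with the language features the abstract mentions, here is a minimal sketch (not taken from the talk, ICAR, or Inference-Engine) of an array statement and a “do concurrent” loop of the kind a compiler may multithread, vectorize, or offload to a GPU:

```fortran
! Minimal illustrative sketch: array statements plus "do concurrent".
! The advection stencil and variable names are invented for illustration.
program do_concurrent_sketch
  implicit none
  integer, parameter :: n = 1024
  real :: q(n), dqdt(n), u
  integer :: i

  u = 1.0
  q = 0.0            ! array statements: whole-array assignments
  dqdt = 0.0
  q(n/2) = 1.0

  ! Upwind advection tendency; the iterations are independent, so the
  ! compiler is free to multithread, vectorize, or offload this loop.
  do concurrent (i = 2:n)
    dqdt(i) = -u * (q(i) - q(i-1))
  end do

  print *, 'max tendency:', maxval(abs(dqdt))
end program do_concurrent_sketch
```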
I was hoping to ask a question about the suitability of, and difficulties in using, GPU hardware for SPMD, if it fits the scope of the presentation.
As background, I am more familiar with using OpenMP on cheaper Intel i7 and AMD Ryzen processors. OpenMP is a shared-memory approach to multithreading, which is better suited to my direct-solver approaches.
These processors have been developed for a diverse client base, and marketing has influenced their design: core counts have increased but memory bandwidth has not kept pace with what my type of computation requires. The recent addition of different core types (P-cores and E-cores) is a further complication for OpenMP load balancing, although this may be better addressed in MPI approaches.
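To be concrete about the kind of shared-memory OpenMP multithreading I mean, here is a minimal, bandwidth-bound sketch (not my actual solver), assuming an OpenMP-enabled build such as `gfortran -fopenmp`:

```fortran
! Minimal sketch of shared-memory OpenMP multithreading in Fortran.
! The streaming loop is illustrative only; a direct solver sweep is
! similarly limited by memory bandwidth rather than core count.
program openmp_sketch
  use omp_lib, only: omp_get_max_threads
  implicit none
  integer, parameter :: n = 100000
  real(8) :: a(n), b(n)
  integer :: i

  b = 1.0d0
  print *, 'threads available:', omp_get_max_threads()

  !$omp parallel do
  do i = 1, n
    a(i) = 2.0d0 * b(i)   ! each thread streams data from shared memory
  end do
  !$omp end parallel do

  print *, 'checksum:', sum(a)
end program openmp_sketch
```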
GPU hardware is also being developed for a diverse client base, including gaming graphics, bitcoin mining, and SPMD computation.
Can you comment on the suitability of GPU hardware for SPMD, given the diverse influences on its hardware development?
The SPMD model is a foundation for kernel programming, as it systematically and massively replicates kernels across devices.
CUDA, DirectCompute, OpenCL, SYCL, TornadoVM, Chapel (if the programmer chooses), Coarray Fortran, and many more: all of these are SPMD (with extensions), aren’t they?
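As a concrete illustration of what I mean by SPMD (a minimal sketch, not tied to any of the systems above): in Coarray Fortran the same program is replicated across images, and each replica selects its own share of the work:

```fortran
! Minimal sketch of the SPMD pattern in Coarray Fortran: one program,
! replicated across images, each image computing its own piece.
program spmd_sketch
  implicit none
  integer :: me, np, i
  real :: partial[*]          ! coarray: one copy per image

  me = this_image()
  np = num_images()
  partial = real(me)          ! each replica computes its own contribution

  sync all                    ! make every image's contribution visible
  if (me == 1) print *, 'sum over images:', sum([(partial[i], i = 1, np)])
end program spmd_sketch
```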
SPMD execution has a reputation for being inefficient at divergent control flow (see page 48 there):
I am already using Coarray Fortran to implement asynchronous coroutines (groups of kernels) on a CPU, which naturally allows for divergent control flow with SPMD (at the level of a single coarray team) by combining SPMD with parallel loops (to allow for divergent task and, hopefully, pipeline parallelism).
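A crude, hypothetical sketch of the kind of divergence I mean (much simpler than my actual coroutine setup): images of the same SPMD program branch into different roles, and each role still contains loops a compiler can parallelize:

```fortran
! Hypothetical sketch: divergent control flow in an SPMD coarray program.
! Images take different code paths (roles), while each role still uses
! "do concurrent" loops that a compiler may parallelize.
program divergent_roles
  implicit none
  integer :: i
  real :: work(1000)

  if (this_image() == 1) then
    ! "producer" role
    do concurrent (i = 1:size(work))
      work(i) = real(i)
    end do
    print *, 'image 1 produced', size(work), 'values'
  else
    ! other roles: a different code path within the same program
    print *, 'image', this_image(), 'doing a different task'
  end if

  sync all
end program divergent_roles
```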
My current focus is on FPGA kernel programming (though still only on CPU). If I were to focus on GPU kernel programming using Coarray Fortran (again, only on CPU for now), I would probably start by replacing the coarrays in my code (which I would not expect to be supported on GPUs) with Fortran 2018 collective subroutines (still Coarray Fortran), as I would hope implementers will recognize them for use on GPUs: for example, the NVIDIA Collective Communication Library (NCCL) does appear to offer similar operations (broadcast, reduce).
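A minimal sketch of the collective-subroutine style I have in mind (illustrative only; whether a given implementation maps these to GPUs or to NCCL-like libraries is my assumption, not something I can confirm):

```fortran
! Minimal sketch of Fortran 2018 collective subroutines (no declared
! coarrays needed): co_broadcast and co_sum resemble the broadcast and
! reduce operations offered by libraries such as NCCL.
program collectives_sketch
  implicit none
  real :: weights(4)
  real :: local_loss

  if (this_image() == 1) weights = [0.1, 0.2, 0.3, 0.4]
  call co_broadcast(weights, source_image=1)   ! broadcast from image 1 to all

  local_loss = sum(weights) * this_image()     ! each image computes its share
  call co_sum(local_loss, result_image=1)      ! reduce (sum) onto image 1

  if (this_image() == 1) print *, 'total loss:', local_loss
end program collectives_sketch
```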
Nice! @milancurcic, by any chance can the slides be accessed? (Some details are difficult to follow in the video because of the camera angle.)
@hkvzjal I don’t see a way to add an attachment to a post here, but I gave a PDF of the slides to Milan so he might be able to point us to where they are online if they’ve been posted. Otherwise, email me at @lbl.gov and I’ll reply with the slides attached. Please reference the University of Miami seminar or this Discourse thread in the Subject line.