Torsten Hoefler’s random access (spontaneous) talk, given at the 42nd anniversary Salishan Conference on High-Speed Computing in 2023, discusses how to lift Fortran code to a data-centric representation in order to optimize it for accelerator devices. The work was led by Alexandru Calotoiu at SPCL.
For the loop nest at 8:05, Torsten says that not a single compiler can parallelize that loop automatically, but I remain unsure whether he is talking about vectorizing the inner loop or performing the reduction over the outer loop in parallel.
The loop appears to be taken from the dwarf-p-cloud miniapp, specifically lines 2655-2658 of the file cloudsc.F90 (which contains a single subroutine roughly 2800 lines long):
! Backsubstitution
!  step 1
DO JN=2,NCLV
  DO JM = 1,JN-1
    ZQXN(KIDIA:KFDIA,JN)=ZQXN(KIDIA:KFDIA,JN)-ZQLHS(KIDIA:KFDIA,JN,JM) &
     &  *ZQXN(KIDIA:KFDIA,JM)
  ENDDO
ENDDO
Rewriting the array expression as an explicit loop over JL (with factor standing in for ZQLHS(JL,JN,JM), and the outer JN loop omitted),
DO JM = 1,JN-1
  DO JL = KIDIA, KFDIA
    ZQXN(JL,JN) = ZQXN(JL,JN) - factor*ZQXN(JL,JM)
  ENDDO
ENDDO
gives a loop nest resembling what is shown in the slide.
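To check that the rewrite really is equivalent, here is a minimal self-contained sketch that runs the full three-level nest both ways and compares the results. The array sizes (KIDIA, KFDIA, NCLV), the fill values, and the names ZQXN_ARR and ZQXN_LOOP are made up for illustration and are not taken from cloudsc.F90. The comments note where the dependencies sit, as far as I can tell from the code: the JL iterations are independent of each other, the JM loop accumulates into the same ZQXN(JL,JN), and each JN reads columns finalized by earlier JN iterations.

```fortran
program backsub_demo
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  ! Hypothetical small sizes, chosen only for this illustration
  integer, parameter :: KIDIA = 1, KFDIA = 8, NCLV = 5
  real(real64) :: ZQLHS(KIDIA:KFDIA, NCLV, NCLV)
  real(real64) :: ZQXN_ARR(KIDIA:KFDIA, NCLV), ZQXN_LOOP(KIDIA:KFDIA, NCLV)
  integer :: JL, JM, JN

  ! Fill the inputs with arbitrary deterministic values
  do JN = 1, NCLV
    do JM = 1, NCLV
      do JL = KIDIA, KFDIA
        ZQLHS(JL, JN, JM) = 0.01_real64 * real(JL + 2*JN + 3*JM, kind=real64)
      end do
    end do
    do JL = KIDIA, KFDIA
      ZQXN_ARR(JL, JN) = real(JL + JN, kind=real64)
    end do
  end do
  ZQXN_LOOP = ZQXN_ARR

  ! Form 1: array-section syntax, as in the excerpt above
  do JN = 2, NCLV
    do JM = 1, JN-1
      ZQXN_ARR(KIDIA:KFDIA,JN) = ZQXN_ARR(KIDIA:KFDIA,JN) &
        & - ZQLHS(KIDIA:KFDIA,JN,JM) * ZQXN_ARR(KIDIA:KFDIA,JM)
    end do
  end do

  ! Form 2: explicit loops
  do JN = 2, NCLV            ! reads columns 1..JN-1 updated by earlier JN iterations
    do JM = 1, JN-1          ! accumulates into the same ZQXN_LOOP(JL,JN)
      do JL = KIDIA, KFDIA   ! iterations touch distinct JL, so they are independent
        ZQXN_LOOP(JL,JN) = ZQXN_LOOP(JL,JN) - ZQLHS(JL,JN,JM) * ZQXN_LOOP(JL,JM)
      end do
    end do
  end do

  ! Compare with a small tolerance in case the compiler contracts
  ! multiply-adds differently in the two forms
  if (maxval(abs(ZQXN_ARR - ZQXN_LOOP)) > 1.0e-10_real64) error stop "forms disagree"
  print *, "array-section and explicit-loop forms agree"
end program backsub_demo
```

This should build with any Fortran 2008 compiler, for example `gfortran backsub_demo.f90 && ./a.out`. It only checks equivalence of the two forms; it does not settle which of the two parallelization questions Torsten had in mind.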