Asynchronous GPU programming with Fortran

TL;DR: when using hybrid do concurrent + OpenMP/OpenACC paralellization, the transformations appear to depend heavily on the implementation (compiler flags).

I think the best option here would be the OpenMP 6.0 amendments, that allow using the target [teams] loop construct on do concurrent blocks. Quoting the Intel Fortran documentation:

OpenMP 6.0 specifications added support for specifying “loop” constructs on DO CONCURRENT. This enables you to offload DO CONCURRENT loops using OpenMP target construct on DO CONCURRENT loops. It also enables you to control the data transfers using OpenMP mapping clauses.

In this case you could also use the num_teams clause on the omp teams loop directive for program control. I’m not sure if the Nvidia compilers support the use of loop on do concurrent loops.


Just to have a MWE, I took your loops and put them into a function:

subroutine foo(a,b)
real, contiguous :: a(:,:,:), b(:,:,:)

integer :: i, j, k

!$omp target nowait                         ! L6 
do concurrent (i=1:10,j=1:15,k=1:27)        ! L7
  a(i,j,k) = 0.0
end do 
!$omp end target 

!$omp target nowait                         ! L12
do concurrent (i=1:10, j = 4:24, k=1:2)     ! L13
 b(i,j,k) = 77.0
end do 
!$omp end target 

!$omp taskwait                              ! L18
end subroutine

When compiled with the command: ifx -c -O2 -fiopenmp -fopenmp-targets=spir64 -fopenmp-target-do-concurrent -qopt-report=3 -qopt-report-phase=openmp -qopt-report-stdout foo.f90, I get the following report:

Begin optimization report for: foo_

OMP TASK BEGIN at foo.f90 (6, 7)
    remark #30011: task construct ignored
OMP TASK END

OMP TASK BEGIN at foo.f90 (12, 7)
    remark #30011: task construct ignored
OMP TASK END

OMP TASKWAIT BEGIN at foo.f90 (18, 7)
    remark #30008: taskwait construct transformed
OMP TASKWAIT END

OMP DIR.OMP.TARGET BEGIN at foo.f90 (6, 7)
    remark #30012: "A" has an implicit clause "map(map(tofrom: A))" because it is a non-scalar variable referenced at line:[8:3]
    remark #30020: OpenMP: PRIVATE clause item foo_$I$_1 is made work-item-local
    remark #30020: OpenMP: PRIVATE clause item foo_$J$_1 is made work-item-local
    remark #30020: OpenMP: PRIVATE clause item foo_$K$_1 is made work-item-local
    remark #30008: target construct transformed

    OMP LOOP BEGIN at foo.f90 (7, 1)
        remark #30015: OpenMP: loop construct was transformed into parallel do
        remark #30014: OpenMP: 3 loops collapsed
        remark #30008: parallel do construct transformed
    OMP LOOP END
OMP DIR.OMP.TARGET END

OMP DIR.OMP.TARGET BEGIN at foo.f90 (12, 7)
    remark #30012: "B" has an implicit clause "map(map(tofrom: B))" because it is a non-scalar variable referenced at line:[14:2]
    remark #30020: OpenMP: PRIVATE clause item foo_$I$_2 is made work-item-local
    remark #30020: OpenMP: PRIVATE clause item foo_$J$_2 is made work-item-local
    remark #30020: OpenMP: PRIVATE clause item foo_$K$_2 is made work-item-local
    remark #30008: target construct transformed

    OMP LOOP BEGIN at foo.f90 (13, 1)
        remark #30015: OpenMP: loop construct was transformed into parallel do
        remark #30014: OpenMP: 3 loops collapsed
        remark #30008: parallel do construct transformed
    OMP LOOP END
OMP DIR.OMP.TARGET END

With the command nvfortran -O2 -mp=gpu -stdpar=gpu -Minfo=mp,stdpar I get the following report:

      6, !$omp target
          6, Generating "nvkernel_foo__F1L6_2" GPU kernel
      7, Generating NVIDIA GPU code
          7, Loop run sequentially 
             Loop parallelized across CUDA thread blocks, CUDA threads(32) blockidx%x threadidx%x
      7, Generating implicit copyout(a(:10,:15,:27)) [if not already present]
     12, !$omp target
         12, Generating "nvkernel_foo__F1L12_4" GPU kernel
     18, Taskwait

For some reason it looks the first loop is only partially parallelized, while the second loop might not be parallelized at all (?).

If I remove -mp=gpu flag, the report changes to:

      7, Generating NVIDIA GPU code
          7,   ! blockidx%x threadidx%x auto-collapsed
             Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost
      7, Generating implicit copyout(a(:10,:15,:27)) [if not already present]
     13, Generating NVIDIA GPU code
         13,   ! blockidx%x threadidx%x auto-collapsed
             Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost
     13, Generating implicit copyout(b(:10,4:24,:2)) [if not already present]

which shows that now the code is parallel, but it probably isn’t asynchronous because the OpenMP directives are turned off.

Here is a link to Compiler Explorer with the snippet above: Compiler Explorer


If I replace with OpenACC directives (ChatGPT did this):

  !$acc parallel loop collapse(3) async(1) present(a)
  do concurrent (i=1:10, j=1:15, k=1:27)
    a(i,j,k) = 0.0
  end do

  !$acc parallel loop collapse(3) async(2) present(b)
  do concurrent (i=1:10, j=4:24, k=1:2)
    b(i,j,k) = 77.0
  end do

  !$acc wait

and use the flags -O2 -acc=gpu -stdpar=gpu -Minfo=acc,stdpar the report is

      6, Generating present(a(:,:,:))
         Generating NVIDIA GPU code
          7, !$acc loop seq
             !$acc loop vector(32) ! threadidx%x
      7, Loop is parallelizable
     11, Generating present(b(:,:,:))
         Generating NVIDIA GPU code
         12, !$acc loop seq
             !$acc loop vector(32) ! threadidx%x
     12, Loop is parallelizable

If I remove -stdpar=gpu the report is,

      6, Generating present(a(:,:,:))
         Generating NVIDIA GPU code
          7,   ! blockidx%x threadidx%x collapsed
             !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
     11, Generating present(b(:,:,:))
         Generating NVIDIA GPU code
         12,   ! blockidx%x threadidx%x collapsed
             !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x

If I change the code to use the !$acc kernels directive,

  !$acc data present(a,b)
    !$acc kernels async(1)
    do concurrent (i=1:10, j=1:15, k=1:27)
      a(i,j,k) = 0.0
    end do
    !$acc end kernels

    !$acc kernels async(2)
    do concurrent (i=1:10, j=4:24, k=1:2)
      b(i,j,k) = 77.0
    end do
    !$acc end kernels

    !$acc wait
  !$acc end data

the same report is generated, with or without -stdpar=gpu:

      5, Generating present(b(:,:,:),a(:,:,:))
      7, Loop is parallelizable
         Generating NVIDIA GPU code
          7,   ! blockidx%x threadidx%x auto-collapsed
             !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost
     13, Loop is parallelizable
         Generating NVIDIA GPU code
         13,   ! blockidx%x threadidx%x auto-collapsed
             !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost

Edit 2: the combination of kernels and do concurrent is governed by the following paragraph from the OpenACC 3.3 spec (page 91, section 2.17.2),

This section refers to the Fortran do concurrent construct that is a form of do construct. When do concurrent appears without a loop construct in a kernels construct it is treated as if it is annotated with loop auto. If it appears in a parallel construct or an accelerator routine then it is treated as if it is annotated with loop independent.


On a tangent, I’m a bit confused by the right-form of OpenACC clause here. With the code:

  !$acc parallel loop async present(a)
  do concurrent (i=1:10, j=1:15, k=1:27)
    a(i,j,k) = 0.0
  end do

I get the report:

      5, Generating present(a(:,:,:))
         Generating NVIDIA GPU code
          6, !$acc loop gang ! blockidx%x
             !$acc loop vector(32) ! threadidx%x
             !$acc loop seq

Using the collapse(3) clause, in

  !$acc parallel loop collapse(3) async present(a)
  do concurrent (i=1:10, j=1:15, k=1:27)
    a(i,j,k) = 0.0
  end do

gives the report:

      5, Generating present(a(:,:,:))
         Generating NVIDIA GPU code
          6,   ! blockidx%x threadidx%x collapsed
             !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x

Edit: I believe the use of collapse here is technically forbidden by the OpenACC 3.3 spec (page 58, line 2023) which states: “A tile and collapse clause may not appear on loop that is associated with do concurrent.” Seems like the nvfortran compiler accepts it anyway. :man_shrugging: :lady_beetle:

This case of using async is documented in the Nvidia HPC SDK manual (NVIDIA HPC Compilers User's Guide — NVIDIA HPC Compilers User's Guide 25.7 documentation),

Use of OpenACC async clause on the compute constructs can be utilised to perform computations in DC-loop asynchronously.

But there is no “Interoperability [of stdpar] with OpenMP” section in their manual, so I’m not convinced that use case is allowed/tested.

Anyways, I believe OpenMP 6.0 supports do concurrent, but compiler support is still lacking. In the last discussion about this topic on Discourse, we didn’t reach a conclusion if OpenMP 5.2 specified the interaction well enough: OpenMP and `do concurrent` loop = crash at runtime - #11 by PierU. It seemed to fall under the cases where you are at the mercy of your compiler vendor (meaning likely to be unportable).

2 Likes