Asynchronous GPU programming with Fortran

I am playing around with writing some code that uses OpenMP for data movement and do concurrent for GPU offloading. All the data is allocated on the GPU using: !$omp target enter data map(alloc:a,b). As an example, let’s look at some simple initialization loops.

do concurrent (i=1:10,j=1:15,k=1:27)
 a(i,j,k) = 0.0
end do 
do concurrent (i=1:10, j = 4:24, k=1:2)
 b(i,j,k) = 77.0
end do
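
(For completeness, here is a minimal sketch of the surrounding setup I have in mind; the array shapes and the exit data/deallocation at the end are assumptions to make the snippet self-contained, not the exact code from the repo.)

real, allocatable :: a(:,:,:), b(:,:,:)
integer :: i, j, k

allocate(a(10,15,27), b(10,24,2))          ! host allocation, bounds matching the loops
!$omp target enter data map(alloc: a, b)   ! device-side allocation only, no copy

! ... the initialization loops above run here ...

!$omp target exit data map(delete: a, b)   ! release the device copies
deallocate(a, b)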

These loops are totally independent of each other, so I could/should launch them concurrently on the GPU instead of running them serially.

With OpenMP one can do:

!$omp target nowait 
!$omp loop collapse(3)
do i = 1,10 ; do j = 1,15 ; do k = 1,27 
  a(i,j,k) = 0.0
end do ; end do ; end do 
!$omp end loop 
!$omp end target 

!$omp target nowait 
!$omp loop collapse(3)
do i = 1,10 ; do j = 4,24 ; do k = 1,2
 b(i,j,k) = 77.0
end do ; end do ; end do 
!$omp end loop 
!$omp end target 

!$omp taskwait 

OpenMP can be quite greedy with the number of teams it assigns by default, and an excessive number of teams leads the runtime to not launch the kernels concurrently. Setting OMP_NUM_TEAMS=128, for example, overrides the default “max_teams”:

launch CUDA kernel file=~/no_depend_omp.f90 function=main line=27 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L27_2_ grid=<<<12,1,1>>> block=<<<128,1,1>>> shmem=0b

versus

launch CUDA kernel file=~/no_depend_omp.f90 function=main line=27 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L27_2_ grid=<<<12,1,1>>> block=<<<128,1,1>>> shmem=0b

We can compare the profiler timelines from before and after:

[profiler timeline screenshots: before and after]

This is great: execution time went down by a lot and all is happy.

The multi-stream OpenMP code is as fast as the do concurrent code without overlapping computation, probably because of better resource allocation; I haven’t explored why yet.
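
As an alternative to the environment variable (which I haven’t benchmarked), the team count can presumably also be capped per construct with the num_teams clause; that means spelling out the teams level explicitly, something like:

!$omp target teams loop collapse(3) num_teams(128) nowait
do i = 1,10 ; do j = 1,15 ; do k = 1,27
  a(i,j,k) = 0.0
end do ; end do ; end do
!$omp end target teams loop

!$omp target teams loop collapse(3) num_teams(128) nowait
do i = 1,10 ; do j = 4,24 ; do k = 1,2
  b(i,j,k) = 77.0
end do ; end do ; end do
!$omp end target teams loop

!$omp taskwait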

This is, however, a very small toy example. I was wondering if I could use !$omp target nowait regions to overlap do concurrent kernels:

!$omp target nowait 
do concurrent (i=1:10,j=1:15,k=1:27)
  a(i,j,k) = 0.0
end do 
!$omp end target 

!$omp target nowait 
do concurrent (i=1:10, j = 4:24, k=1:2)
 b(i,j,k) = 77.0
end do 
!$omp end target 

!$omp taskwait 

This compiles, runs, and produces the correct results, but it is very slow and the computations are not overlapped. At first I thought the slowness came from the do concurrent not actually being launched on the GPU, but it seems to stem from the number of teams/gangs launched:

DC: num_gangs=56244 num_workers=1 vector_length=128 grid=12x4687 block=128

OpenMP: kernelname=nvkernel_MAIN__F1L37_4_ grid=<<<439454,1,1>>> block=<<<128,1,1>>> shmem=0b

So, I wonder if anyone has experience launching do concurrent loops asynchronously on the GPU so that independent computations overlap? If you’re curious, my working dummy code is at: learning_tools/fortran/asynch at main · JorgeG94/learning_tools · GitHub

You’ll just need a GPU. I’ve only tested the NVIDIA compilers, versions 24.9 and 25.5.

To profile: nsys profile --stats=true
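
Spelled out in full (the binary name here is just a placeholder for whatever you build from the repo):

nsys profile --stats=true ./a.out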


I do not have (and am not aware of) a Fortran compiler available that will apply multiple threads to do concurrent. Fortran users certainly need a database that lists the status of all Fortran compilers, and their development timelines, for multi-threaded do concurrent.
I expect that some Fortran compilers (gfortran, possibly?) will emulate do concurrent as !$OMP PARALLEL DO.
However, even given the availability, I question the viability of many do concurrent workloads for multi-threading, as all the examples listed would be better handled with SIMD instructions.

With the !$OMP implementations that I have used (ifort and gfortran), the thread start-up overhead is prohibitive for a practical improvement. Indeed, the do concurrent restriction to pure procedures removes most significant workloads as an option.
I wonder if those who have proposed do concurrent (or Forall) as a Fortran pathway to multi-threading have experienced any usage where the thread start-up overhead has been overcome and improved performance achieved?
The adage for OpenMP/SIMD usage is: multi-thread the outer loop and vectorise the inner loop. With existing compilers, the outer loop must do a lot of work to be effective, and setting an array element to a value is not sufficient.

(Note: in the example, collapse is only effective if the number of threads is much greater than 10, the outer loop size; a 10-fold improvement without collapse would already be a significant potential gain, and the collapse overhead would show little further improvement. Perhaps the following essential change may be effective:
!$omp loop collapse(2)
do k = 1,2 ; do j = 4,24 ; do i = 1,10

Even in this case, with the correct memory access order, if parallelising do k = 1,2 achieved a 50% performance improvement, that would be significant in comparison to results that have been reported in recent usage studies.)
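
To make the suggested change concrete, here is a sketch of the host-threaded form I have in mind, using parallel do on the reordered outer loops and an explicit simd on the innermost loop per the adage above (untested, and the work per iteration is of course still tiny):

!$omp parallel do collapse(2)
do k = 1, 2
  do j = 4, 24
    !$omp simd
    do i = 1, 10
      b(i,j,k) = 77.0   ! innermost loop runs over the contiguous first dimension
    end do
  end do
end do
!$omp end parallel do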

Can there be more discussion of what practical improvements can be achieved with multi-threaded do concurrent implementations?
Are they only very niche calculation cases?


TL;DR: when using hybrid do concurrent + OpenMP/OpenACC parallelization, the transformations appear to depend heavily on the implementation (compiler flags).

I think the best option here would be the OpenMP 6.0 amendments, which allow using the target [teams] loop construct on do concurrent blocks. Quoting the Intel Fortran documentation:

OpenMP 6.0 specifications added support for specifying “loop” constructs on DO CONCURRENT. This enables you to offload DO CONCURRENT loops using OpenMP target construct on DO CONCURRENT loops. It also enables you to control the data transfers using OpenMP mapping clauses.

In this case you could also use the num_teams clause on the omp teams loop directive for program control. I’m not sure if the Nvidia compilers support the use of loop on do concurrent loops.
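
For reference, I imagine the original example would then look roughly like the following (a hypothetical sketch of the OpenMP 6.0 form described above, untested, and clause spellings may differ between vendors):

!$omp target teams loop num_teams(128) nowait
do concurrent (i=1:10, j=1:15, k=1:27)
  a(i,j,k) = 0.0
end do

!$omp target teams loop num_teams(128) nowait
do concurrent (i=1:10, j=4:24, k=1:2)
  b(i,j,k) = 77.0
end do

!$omp taskwait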


Just to have an MWE, I took your loops and put them into a subroutine:

subroutine foo(a,b)
real, contiguous :: a(:,:,:), b(:,:,:)

integer :: i, j, k

!$omp target nowait                         ! L6 
do concurrent (i=1:10,j=1:15,k=1:27)        ! L7
  a(i,j,k) = 0.0
end do 
!$omp end target 

!$omp target nowait                         ! L12
do concurrent (i=1:10, j = 4:24, k=1:2)     ! L13
 b(i,j,k) = 77.0
end do 
!$omp end target 

!$omp taskwait                              ! L18
end subroutine

When compiled with the command: ifx -c -O2 -fiopenmp -fopenmp-targets=spir64 -fopenmp-target-do-concurrent -qopt-report=3 -qopt-report-phase=openmp -qopt-report-stdout foo.f90, I get the following report:

Begin optimization report for: foo_

OMP TASK BEGIN at foo.f90 (6, 7)
    remark #30011: task construct ignored
OMP TASK END

OMP TASK BEGIN at foo.f90 (12, 7)
    remark #30011: task construct ignored
OMP TASK END

OMP TASKWAIT BEGIN at foo.f90 (18, 7)
    remark #30008: taskwait construct transformed
OMP TASKWAIT END

OMP DIR.OMP.TARGET BEGIN at foo.f90 (6, 7)
    remark #30012: "A" has an implicit clause "map(map(tofrom: A))" because it is a non-scalar variable referenced at line:[8:3]
    remark #30020: OpenMP: PRIVATE clause item foo_$I$_1 is made work-item-local
    remark #30020: OpenMP: PRIVATE clause item foo_$J$_1 is made work-item-local
    remark #30020: OpenMP: PRIVATE clause item foo_$K$_1 is made work-item-local
    remark #30008: target construct transformed

    OMP LOOP BEGIN at foo.f90 (7, 1)
        remark #30015: OpenMP: loop construct was transformed into parallel do
        remark #30014: OpenMP: 3 loops collapsed
        remark #30008: parallel do construct transformed
    OMP LOOP END
OMP DIR.OMP.TARGET END

OMP DIR.OMP.TARGET BEGIN at foo.f90 (12, 7)
    remark #30012: "B" has an implicit clause "map(map(tofrom: B))" because it is a non-scalar variable referenced at line:[14:2]
    remark #30020: OpenMP: PRIVATE clause item foo_$I$_2 is made work-item-local
    remark #30020: OpenMP: PRIVATE clause item foo_$J$_2 is made work-item-local
    remark #30020: OpenMP: PRIVATE clause item foo_$K$_2 is made work-item-local
    remark #30008: target construct transformed

    OMP LOOP BEGIN at foo.f90 (13, 1)
        remark #30015: OpenMP: loop construct was transformed into parallel do
        remark #30014: OpenMP: 3 loops collapsed
        remark #30008: parallel do construct transformed
    OMP LOOP END
OMP DIR.OMP.TARGET END

With the command nvfortran -O2 -mp=gpu -stdpar=gpu -Minfo=mp,stdpar I get the following report:

      6, !$omp target
          6, Generating "nvkernel_foo__F1L6_2" GPU kernel
      7, Generating NVIDIA GPU code
          7, Loop run sequentially 
             Loop parallelized across CUDA thread blocks, CUDA threads(32) blockidx%x threadidx%x
      7, Generating implicit copyout(a(:10,:15,:27)) [if not already present]
     12, !$omp target
         12, Generating "nvkernel_foo__F1L12_4" GPU kernel
     18, Taskwait

For some reason it looks like the first loop is only partially parallelized, while the second loop might not be parallelized at all (?).

If I remove the -mp=gpu flag, the report changes to:

      7, Generating NVIDIA GPU code
          7,   ! blockidx%x threadidx%x auto-collapsed
             Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost
      7, Generating implicit copyout(a(:10,:15,:27)) [if not already present]
     13, Generating NVIDIA GPU code
         13,   ! blockidx%x threadidx%x auto-collapsed
             Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost
     13, Generating implicit copyout(b(:10,4:24,:2)) [if not already present]

which shows that now the code is parallel, but it probably isn’t asynchronous because the OpenMP directives are turned off.

Here is a link to Compiler Explorer with the snippet above: Compiler Explorer


If I replace the OpenMP directives with OpenACC directives (ChatGPT did this):

  !$acc parallel loop collapse(3) async(1) present(a)
  do concurrent (i=1:10, j=1:15, k=1:27)
    a(i,j,k) = 0.0
  end do

  !$acc parallel loop collapse(3) async(2) present(b)
  do concurrent (i=1:10, j=4:24, k=1:2)
    b(i,j,k) = 77.0
  end do

  !$acc wait

and use the flags -O2 -acc=gpu -stdpar=gpu -Minfo=acc,stdpar the report is

      6, Generating present(a(:,:,:))
         Generating NVIDIA GPU code
          7, !$acc loop seq
             !$acc loop vector(32) ! threadidx%x
      7, Loop is parallelizable
     11, Generating present(b(:,:,:))
         Generating NVIDIA GPU code
         12, !$acc loop seq
             !$acc loop vector(32) ! threadidx%x
     12, Loop is parallelizable

If I remove -stdpar=gpu the report is,

      6, Generating present(a(:,:,:))
         Generating NVIDIA GPU code
          7,   ! blockidx%x threadidx%x collapsed
             !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
     11, Generating present(b(:,:,:))
         Generating NVIDIA GPU code
         12,   ! blockidx%x threadidx%x collapsed
             !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x

If I change the code to use the !$acc kernels directive,

  !$acc data present(a,b)
    !$acc kernels async(1)
    do concurrent (i=1:10, j=1:15, k=1:27)
      a(i,j,k) = 0.0
    end do
    !$acc end kernels

    !$acc kernels async(2)
    do concurrent (i=1:10, j=4:24, k=1:2)
      b(i,j,k) = 77.0
    end do
    !$acc end kernels

    !$acc wait
  !$acc end data

the same report is generated, with or without -stdpar=gpu:

      5, Generating present(b(:,:,:),a(:,:,:))
      7, Loop is parallelizable
         Generating NVIDIA GPU code
          7,   ! blockidx%x threadidx%x auto-collapsed
             !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost
     13, Loop is parallelizable
         Generating NVIDIA GPU code
         13,   ! blockidx%x threadidx%x auto-collapsed
             !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost

Edit 2: the combination of kernels and do concurrent is governed by the following paragraph from the OpenACC 3.3 spec (page 91, section 2.17.2),

This section refers to the Fortran do concurrent construct that is a form of do construct. When do concurrent appears without a loop construct in a kernels construct it is treated as if it is annotated with loop auto. If it appears in a parallel construct or an accelerator routine then it is treated as if it is annotated with loop independent.
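
My reading of that paragraph (not verified against a compiler) is that the two placements are roughly equivalent to writing the explicit annotations on an ordinary loop nest:

! inside kernels: do concurrent behaves like "loop auto",
! i.e. the compiler decides whether and how to parallelize
!$acc kernels async(1)
!$acc loop auto collapse(3)
do k = 1, 27
  do j = 1, 15
    do i = 1, 10
      a(i,j,k) = 0.0
    end do
  end do
end do
!$acc end kernels

! inside parallel: do concurrent behaves like "loop independent",
! i.e. the programmer asserts that the iterations are independent
!$acc parallel loop independent collapse(3) async(2)
do k = 1, 2
  do j = 4, 24
    do i = 1, 10
      b(i,j,k) = 77.0
    end do
  end do
end do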


On a tangent, I’m a bit confused about the right form of the OpenACC clauses here. With the code:

  !$acc parallel loop async present(a)
  do concurrent (i=1:10, j=1:15, k=1:27)
    a(i,j,k) = 0.0
  end do

I get the report:

      5, Generating present(a(:,:,:))
         Generating NVIDIA GPU code
          6, !$acc loop gang ! blockidx%x
             !$acc loop vector(32) ! threadidx%x
             !$acc loop seq

Using the collapse(3) clause, in

  !$acc parallel loop collapse(3) async present(a)
  do concurrent (i=1:10, j=1:15, k=1:27)
    a(i,j,k) = 0.0
  end do

gives the report:

      5, Generating present(a(:,:,:))
         Generating NVIDIA GPU code
          6,   ! blockidx%x threadidx%x collapsed
             !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x

Edit: I believe the use of collapse here is technically forbidden by the OpenACC 3.3 spec (page 58, line 2023), which states: “A tile and collapse clause may not appear on loop that is associated with do concurrent.” Seems like the nvfortran compiler accepts it anyway.

This case of using async is documented in the Nvidia HPC SDK manual (NVIDIA HPC Compilers User’s Guide, 25.7 documentation):

Use of OpenACC async clause on the compute constructs can be utilised to perform computations in DC-loop asynchronously.

But there is no “Interoperability [of stdpar] with OpenMP” section in their manual, so I’m not convinced that use case is allowed/tested.

Anyway, I believe OpenMP 6.0 supports do concurrent, but compiler support is still lacking. In the last discussion about this topic on Discourse, we didn’t reach a conclusion on whether OpenMP 5.2 specified the interaction well enough: OpenMP and `do concurrent` loop = crash at runtime - #11 by PierU. It seemed to fall under the cases where you are at the mercy of your compiler vendor (meaning the behavior is likely to be unportable).
