Asynchronous GPU programming with Fortran

I am playing around with writing some code that uses OpenMP for data movement and do concurrent for GPU offloading. All the data is allocated on the GPU using !$omp target enter data map(alloc:a,b).
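
For context, here is a minimal sketch of that setup (the array shapes are my assumption, taken from the loop bounds below):

program async_init
  implicit none
  real :: a(10,15,27), b(10,24,2)

  ! allocate device copies of a and b; no host-to-device transfer happens here
  !$omp target enter data map(alloc:a,b)

  ! ... kernels go here ...

  ! release the device allocations at the end
  !$omp target exit data map(delete:a,b)
end program async_init

Let’s look at some simple initialization loops, say: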

do concurrent (i=1:10, j=1:15, k=1:27)
  a(i,j,k) = 0.0
end do

do concurrent (i=1:10, j=4:24, k=1:2)
  b(i,j,k) = 77.0
end do

These loops are totally independent of each other, so instead of running them serially on the GPU I could (and should) launch them concurrently.

With OpenMP one can do:

!$omp target nowait 
!$omp loop collapse(3)
do i = 1,10 ; do j = 1,15 ; do k = 1,27 
  a(i,j,k) = 0.0
end do ; end do ; end do 
!$omp end loop 
!$omp end target 

!$omp target nowait 
!$omp loop collapse(3)
do i = 1,10 ; do j = 4,24 ; do k = 1,2
 b(i,j,k) = 77.0
end do ; end do ; end do 
!$omp end loop 
!$omp end target 

!$omp taskwait  ! wait for both asynchronous target regions to finish

OpenMP can be quite greedy with the number of teams it assigns by default, and an excessive number of teams leads the runtime to not launch the kernels concurrently. Setting OMP_NUM_TEAMS=128, for example, overrides the default “max_teams”; the kernel launch then looks like:

launch CUDA kernel file=~/no_depend_omp.f90 function=main line=27 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L27_2_ grid=<<<12,1,1>>> block=<<<128,1,1>>> shmem=0b

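As a side note, if you'd rather not rely on an environment variable, the same cap should be expressible per kernel with the num_teams clause on a teams construct. A sketch I haven't benchmarked (the clause placement is my addition, not from the original code):

!$omp target teams loop collapse(3) num_teams(128) nowait
do i = 1,10 ; do j = 1,15 ; do k = 1,27
  a(i,j,k) = 0.0  ! same initialization kernel as above, capped at 128 teams
end do ; end do ; end do
!$omp end target teams loop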

We can look at the profiler timeline from before and after:

[profiler screenshot: before]

[profiler screenshot: after]

This is great: execution time went down by a lot and all is happy.

The multi-stream OpenMP code is as fast as the do concurrent code that doesn't overlap computation, probably because of better resource allocation; I haven't explored why yet.

This is, however, a very small toy example. I was wondering if I could use !$omp target nowait regions to overlap do concurrent kernels:

!$omp target nowait 
do concurrent (i=1:10,j=1:15,k=1:27)
  a(i,j,k) = 0.0
end do 
!$omp end target 

!$omp target nowait 
do concurrent (i=1:10, j=4:24, k=1:2)
  b(i,j,k) = 77.0
end do
!$omp end target 

!$omp taskwait 

This compiles, runs, and produces the correct results, but it is very slow and the computations are not overlapped. The do concurrent kernels do land on the GPU; the slowness seems to stem from the number of teams/gangs launched:

DC: num_gangs=56244 num_workers=1 vector_length=128 grid=12x4687 block=128

OpenMP: kernelname=nvkernel_MAIN__F1L37_4_ grid=<<<439454,1,1>>> block=<<<128,1,1>>> shmem=0b
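
(For anyone reproducing this: these launch lines come, as far as I know, from the NVIDIA runtime's kernel-launch tracing, which you can enable by setting NVCOMPILER_ACC_NOTIFY=1 before running.)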

So, I wonder if anyone has experience launching independent do concurrent loops on the GPU so that they overlap? If you're curious, my working dummy code is at: learning_tools/fortran/asynch at main · JorgeG94/learning_tools · GitHub

You'll just need a GPU. I've only tested with the NVIDIA compilers, versions 24.9 and 25.5.
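
For reference, a compile line along the lines of nvfortran -stdpar=gpu -mp=gpu -Minfo=accel no_depend_omp.f90 should be what's needed here (my assumption: do concurrent offload via -stdpar=gpu plus OpenMP offload via -mp=gpu; check the repo for the exact invocation).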

To profile: nsys profile --stats=true
