Gfortran GPU offload with OpenMP


Do I understand correctly that OpenMP allows to offload computations to GPU and gfortran supports this?
I’m looking for a minimal example of offloading a simple !$OMP PARALLEL DO construct to GPU.

Or, what can be used to generate GPU FORTRAN programs?
I have some experience with OpenACC and pgfortran (no luck on gfortran).

Yes, gfortran supports off-loading. No. I do not know the requirements (i.e., required hardware and command line switches). Instead of asking here, why not ask on the fortran@gcc mailing list?

I didn’t know OpenMP could also be used on GPU.
I have found that presentation:

Use compiler flag -fopenmp to turn on this feature. Here is an example from the gcc/gfortran testsuite:

subroutine bar (i, j, k, s, a)
integer :: i, j, k, s, a(100), l
!$omp parallel do schedule (guided, s * 2)
do 100, l = j, k
100 a(l) = i
!$omp parallel do schedule (guided, s * 2)
do 101, l = j, k, 3
101 a(l) = i + 1
end subroutine bar

This is not the best example. Show us the simple loop(s) you want to parrallelize and we might be able to help you further.

A simple OpenMP loop is not a problem, but I’m failing to find a good reference/example of OpenMP and GPU interaction. I’ve seen the presentation linked by @vmagnin, but didn’t manage to build working simple example.

Looks like pgfortran is now morphed into nvfortran and the code below can be executed on GPU with OpenACC, but it requires adding !$acc routine info to all procedures.

! OpenMP (cpu)
! gfortran -o test -fopenmp -O0 test.f90
! OpenACC (gpu)
! nvfortran -o test -acc -O0 -ta=nvidia test.f90
module data
  use, intrinsic :: iso_fortran_env, only : rk => real64
  implicit none
  integer, parameter :: loop = 10**6
  real(rk) function task(var)
    !$acc routine seq
    real(rk), intent(in) :: var
    integer :: i
    do i = 1, loop, 1
      task = exp(real(i,rk))
    end do
    task = var
  end function task
end module
program test
  use, intrinsic :: iso_fortran_env, only : rk => real64
  use data
  implicit none
  integer, parameter :: length = 2**10
  real(rk), dimension(length) :: input
  real(rk), dimension(length) :: output
  integer :: i
  input = real([(i, i=1,length,1)], rk)
  !$omp parallel do
  !$acc parallel loop copyin(input) copyout(output)
  do i = 1, length
    output(i) = task(input(i))
  end do
  !$acc end parallel loop
  !$omp end parallel do
  do i = 1, length
    write(*,*) output(i)
  end do
end program test

Check out AMDs AOMP repository. The AOMP package provides clang and flang compilers that will handle GPU offloading. The repository also provides examples for both C and Fortran. In short, you’ll need to wrap your do loop with Target directives. Further, to distribute many GPU threads in parallel, you’ll want to use

!$OMP teams distribute parallel for

For example,

!$OMP target
!$OMP teams distribute parallel for num_threads(256)
DO i = 1, N
a(i) = b(i) + c(i)

!$OMP end target

If you prefer to use gfortran, you’ll need to build the compiler from source. For Nvidia platforms, make sure you have CUDA toolkit installed before building. For AMD, have rocm-dev installed. I’ve haven’t built gfortran for OpenMP offloading since v8, but it looks like GNU has some guidance on this :