Do I understand correctly that OpenMP allows to offload computations to GPU and gfortran supports this?
I’m looking for a minimal example of offloading a simple !$OMP PARALLEL DO construct to GPU.
Or, what can be used to generate GPUFORTRAN programs?
I have some experience with OpenACC and pgfortran (no luck on gfortran).
Use compiler flag -fopenmp to turn on this feature. Here is an example from the gcc/gfortran testsuite:
subroutine bar (i, j, k, s, a)
integer :: i, j, k, s, a(100), l
!$omp parallel do schedule (guided, s * 2)
do 100, l = j, k
100 a(l) = i
!$omp parallel do schedule (guided, s * 2)
do 101, l = j, k, 3
101 a(l) = i + 1
end subroutine bar
This is not the best example. Show us the simple loop(s) you want to parrallelize and we might be able to help you further.
A simple OpenMP loop is not a problem, but I’m failing to find a good reference/example of OpenMP and GPU interaction. I’ve seen the presentation linked by @vmagnin, but didn’t manage to build working simple example.
Looks like pgfortran is now morphed into nvfortran and the code below can be executed on GPU with OpenACC, but it requires adding !$acc routine info to all procedures.
! OpenMP (cpu)
! gfortran -o test -fopenmp -O0 test.f90
! OpenACC (gpu)
! nvfortran -o test -acc -O0 -ta=nvidia test.f90
module data
use, intrinsic :: iso_fortran_env, only : rk => real64
implicit none
integer, parameter :: loop = 10**6
contains
real(rk) function task(var)
!$acc routine seq
real(rk), intent(in) :: var
integer :: i
do i = 1, loop, 1
task = exp(real(i,rk))
end do
task = var
end function task
end module
program test
use, intrinsic :: iso_fortran_env, only : rk => real64
use data
implicit none
integer, parameter :: length = 2**10
real(rk), dimension(length) :: input
real(rk), dimension(length) :: output
integer :: i
input = real([(i, i=1,length,1)], rk)
!$omp parallel do
!$acc parallel loop copyin(input) copyout(output)
do i = 1, length
output(i) = task(input(i))
end do
!$acc end parallel loop
!$omp end parallel do
do i = 1, length
write(*,*) output(i)
end do
end program test
Check out AMDs AOMP repository. The AOMP package provides clang and flang compilers that will handle GPU offloading. The repository also provides examples for both C and Fortran. In short, you’ll need to wrap your do loop with Target directives. Further, to distribute many GPU threads in parallel, you’ll want to use
!$OMP teams distribute parallel for
For example,
!$OMP target
!$OMP teams distribute parallel for num_threads(256)
DO i = 1, N
a(i) = b(i) + c(i)
ENDDO
!$OMP end target
If you prefer to use gfortran, you’ll need to build the compiler from source. For Nvidia platforms, make sure you have CUDA toolkit installed before building. For AMD, have rocm-dev installed. I’ve haven’t built gfortran for OpenMP offloading since v8, but it looks like GNU has some guidance on this :
Better late than never, I suppose… With GNU compilers, you will need to build them with offloading support : Offloading - GCC Wiki
I haven’t tried it yet, but you may be able to install GNU compilers with OpenMP (and OpenACC offloading) to Nvidia GPUs using the +nvptx variant, e.g.