Parallelization on GPU with Intel compiler

AFAIK, Intel’s OpenMP runtime library doesn’t support running on an NVIDIA GPU (as discussed in the thread that @hkvzjal linked). Perhaps you can increase the debug level with LIBOMPTARGET_DEBUG=2 to find out why it didn’t offload? Intel relies on the SPIR64 intermediate format, which gets passed to the OpenCL plugin. I don’t know whether Nvidia’s OpenCL library supports ingesting SPIR; I guess not.

Yes, I think using straight OpenMP is just not doable now (hopefully it will be in the future). There was a post on the Codeplay forum about trying to use Intel’s MKL with Nvidia GPUs, and someone suggested trying the open-source oneMKL, which lets you use cuBLAS etc. as a backend. That might give you a way to do linear algebra, but it doesn’t address the wider problem of offloading loops etc. For now I might just write wrappers around C++ SYCL code to have something to play around with.

You can use OpenMP offloading with the Nvidia Fortran compiler, but in practice you’ll need to “restrict” yourself to a commonly supported subset. Indeed, using even minor F2008+ features can be problematic (I’ve reported three bugs so far against F2008+ features in the nvfortran compiler). You could even use both nvfortran and Intel Fortran together, with procedures exposed using C linkage. In a sense it’s not much different from using Fortran and C++ from the same compiler toolchain (which requires glue code); instead you use Fortran from two different toolchains speaking via a “fake” C interface. The idea is due to @JeffH, see blog/Dealing_with_imperfect_Fortran_compilers.md at main · jeffhammond/blog · GitHub
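
To make that concrete, here is a minimal sketch of the two-toolchain idea (my own illustration, not taken from the linked blog post; the name offload_saxpy and the compile lines in the comments are hypothetical): the offloaded kernel is built with nvfortran, the caller with ifx, and the only thing they share is a bind(C) interface. The remaining glue work is mostly in linking, since both toolchains’ runtime libraries have to be pulled in.

! --- kernel.f90, compiled e.g. with: nvfortran -mp=gpu -c kernel.f90 ---
subroutine offload_saxpy(n, a, x, y) bind(C, name="offload_saxpy")
   use iso_c_binding, only: c_int, c_float
   implicit none
   integer(c_int), value :: n
   real(c_float),  value :: a
   real(c_float), intent(in)    :: x(n)
   real(c_float), intent(inout) :: y(n)
   integer :: i
   ! the kernel itself is offloaded with OpenMP target directives
   !$omp target teams distribute parallel do map(to: x) map(tofrom: y)
   do i = 1, n
      y(i) = a*x(i) + y(i)
   end do
end subroutine offload_saxpy

! --- main.f90, compiled with ifx; only the C interface is shared ---
program main
   use iso_c_binding, only: c_int, c_float
   implicit none
   interface
      subroutine offload_saxpy(n, a, x, y) bind(C, name="offload_saxpy")
         import :: c_int, c_float
         integer(c_int), value :: n
         real(c_float),  value :: a
         real(c_float), intent(in)    :: x(n)
         real(c_float), intent(inout) :: y(n)
      end subroutine
   end interface
   real(c_float) :: x(1000), y(1000)
   x = 1.0; y = 2.0
   call offload_saxpy(size(x, kind=c_int), 3.0_c_float, x, y)
   print *, y(1)   ! expect 5.0
end program main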

FWIW, flang is making a lot of progress. You can judge for yourself from the Flang Liaison Report presented to J3: https://j3-fortran.org/doc/year/24/24-112r1.txt. Multiple pull requests are made to flang every day: Pull requests · llvm/llvm-project · GitHub

1 Like

Maybe. They have been working on this for about 5 years. IMHO, they should have a working version I could install from a deb package or via apt by now. Instead they appear to be stuck in alpha mode. Also, I can see nothing on the LLVM flang sites indicating that anything like a real schedule with a target date for delivering a working compiler exists. To me that is an indication of a lack of adult supervision and that they are just making it up as they go along.

There seem to be at least two compilers called flang: I know of AMD’s (AOCC) and LLVM’s. Could they be given different names, please? There is a good precedent for that: g95 and gfortran.

I attach here the files that I downloaded from GitHub some time ago. I wasn’t able to find the source again after a quick search online. If someone finds it, please let me know :slight_smile: (I was stupid enough not to write down the address).

01_mm_CPU_sequential.f90 (923 Bytes)
02_mm_CPU_parallel.f90 (1.1 KB)
03_mm_GPU.f90 (1.7 KB)

There was also a makefile and a README but I cannot upload them (I get a warning from this page). This is the start of the README file.

Matrix Multiply Sample

The Matrix Multiply sample provides a guided approach to modify a simple Fortran program to offload computations to an Intel® GPU using OpenMP* with the Intel® Fortran Compiler. The sample program is a simple matrix multiply algorithm.

  • What you will learn: How to use OpenMP directives with Fortran to offload computations to an Intel GPU
  • Time to complete: 10-15 minutes

Purpose

The Matrix Multiply sample demonstrates the steps to modify a simple Fortran program to use OpenMP* directives to offload the compute kernel from the host to an Intel® GPU using the Intel® Fortran Compiler.

Three working versions of the matrix multiply program are provided with the sample:

  • 01_mm_CPU_sequential.f90: the working sequential version of the matrix multiply program.
  • 02_mm_CPU_parallel.f90: the working parallel version of the matrix multiply program, modified to use OpenMP directives to parallelize the program for CPU.
  • 03_mm_GPU.f90: the working parallel version of the matrix multiply program, modified to use OpenMP directives for GPU.

The instructions in this sample show the steps to modify the original sequential version of the program (01_mm_CPU_sequential.f90) into the OpenMP parallel processing versions for CPU and GPU. In this sample, you will create your own version of the CPU parallel and GPU programs.

You can compare your modified programs against the provided working versions of the programs.

Prerequisites

  • OS: Linux, Windows
  • Hardware: Intel® GPU
  • Software: Intel® Fortran Compiler, Intel GPU driver

Regarding the OpenMP directives:

  • Your code has the advantage of parallelizing two loops, which is fine because they are independent. The original code parallelized only the loop over j; I’m not sure why.
  • What about private variables? I noticed that you removed the clause private(j, i, k). Is it because loop variables are considered private by default? If I had a more complicated example, could I still use the private clause?

Nvidia HPC SDK’s nvfortran is also the original or “classic” flang, as is the Arm Fortran compiler. This version of flang was based on the Portland Group (PGI) compiler, which Nvidia obtained when it bought PGI. The U.S. Dept. of Energy and other agencies paid Nvidia to produce an open-source compiler based on the Portland Group front end (which was written in C, not C++) with an LLVM back end, the idea being an open-source compiler that could support a variety of hardware. Classic flang still has some PGI code for, I think, the intrinsic math functions etc.

Somewhere along the way, it was decided that continuing with the original PGI C-based front end was going to be difficult to sustain in the long term, and a new, from-the-ground-up C++ front end was needed. What is called “new flang” (now LLVM flang) is basically that front end plus a new runtime (at least that’s how I see it, and I’ll readily admit I may be, and probably am, wrong about this). Nvidia and the other companies probably needed something in the interim until LLVM flang reached production level, so they, along with AMD and Arm, kept working on classic flang. However, I think all these companies have delayed implementing much beyond a few key F2008 features until LLVM flang reaches maturity.

Just based on comments I’ve read on the web, I think one of the issues with both classic flang and new (LLVM) flang is that the LLVM back end was/is very C/C++ centric and not very friendly to Fortran, and a lot of new work had to be done on the intermediate representations to make it play nice with the rest of LLVM. I guess the retention of the name “flang” was to give it some status within the LLVM project consistent with the LLVM C/C++ compiler (clang). I personally would have called it something like LFC (LLVM Fortran Compiler) just to avoid conflict and confusion with classic flang. This is all from my memory of how LLVM flang and classic flang came to be, and I’ll admit I might have some of the narrative wrong, so others should feel free to correct me if I am.

Can you point me to the public engineering schedules for other compiler projects like GCC and Clang? For example, in which year will gfortran support Fortran 2018 locality specifiers?

For context, I am the Fortran lead in the NVIDIA HPC software product group and part of my job includes working with the engineering team building LLVM Fortran. They do not need “adult supervision”. Your impatience has no bearing on their competence.

The decision was made by Chris Lattner.

1 Like

The short answer is that the work gets scheduled to a queue (or multiple independent queues, called streams in the case of CUDA). Here’s an example which might shed some light on this: cuda-samples/Samples/0_Introduction/cudaOpenMP/cudaOpenMP.cu at master · NVIDIA/cuda-samples · GitHub

Yes, otherwise it wouldn’t make sense: the threads would all have access to the same variable. But for other variables within the loop body the private clause may be needed, for instance,

         !$omp target teams distribute parallel do collapse(2) private(tmp)
         do j=1,n
            do i=1,n
               !
               ! each pair (i,j) executes independently
               !
               tmp = c(i,j)
               do k=1,n
                  tmp = tmp + a(i,k) * b(k,j)
               end do
               c(i,j) = tmp
            end do
         end do
1 Like

IMO k should be declared as private in your example.

1 Like

First, I made no comment questioning the competence of the people developing flang. I do have a right to question the leadership of a project that’s been going for about 6 years now and has yet to deliver anything like a working product. I’m an old aerospace engineer, and a lot of airplanes have gone from back-of-the-envelope drawings to first flight in less time than that.

It wouldn’t be so bad, and I would be willing to cut you guys some slack, if you would just take a couple of hours to update your user-facing information, be it on the flang web site or wherever, to keep your potential customers up to date on what your real current status is and when you expect to actually have something that I and other users can use without having to build the entire compiler from scratch, and with some hope that it will build working executables for our code. As it stands now, I don’t see that information anywhere except the Standards Committee liaison reports, which really don’t tell the user community at large what the real status of the project is. While I would like to see a firm schedule, if you don’t want to share one, fine. In the aerospace industry, if I told a customer I would build him the best airplane in the world if he gave me a boatload of money, but I couldn’t tell him when I would deliver it, I would be laughed out of the room.

Finally, I’m not “impatient” regarding flang. There are other options available. Frankly, if both compilers were in full production mode today, I would probably choose LFortran over LLVM flang.

1 Like

The relevant rule from the OpenMP 5.2 standard is in Section 5.1.1:

Certain variables and objects have predetermined data-sharing attributes for the construct in which they are referenced. […]

  • Loop iteration variables inside parallel, teams, or task generating constructs are private in the innermost such construct that encloses the loop.

I’m having trouble interpreting what exactly this means. It’s obvious for the iteration variables associated with a task-generating construct. But the first part of the sentence implies that it’s private also in cases like this one:

use omp_lib
implicit none
integer :: k

!$omp parallel num_threads(2)
do k = 1, 3
   print *, omp_get_thread_num()
end do
!$omp end parallel
end

This StackOverflow response sheds some light on this: Is it necessary to set loop index as a private variable in OpenMP? - Stack Overflow

Bear in mind that threads will be allocated, at loop initialisation, sets of values of i which distribute the individual loop iterations according to the schedule specified (either by the programmer or by some implementation-defined default). Allowing threads to update a ‘local’ value for i would lead to mad code
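
A minimal sketch of what that quote describes (my own, assuming a static schedule): each thread is handed its own block of i values, so letting a thread overwrite a shared i would break that distribution.

! illustration only: with schedule(static), each thread owns a fixed block of iterations
program schedule_demo
   use omp_lib
   implicit none
   integer :: i
   !$omp parallel do num_threads(2) schedule(static)
   do i = 1, 8
      print *, 'thread', omp_get_thread_num(), 'executes i =', i
   end do
   !$omp end parallel do
end program schedule_demo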

1 Like

Thanks for the example, I see what it’s doing.

For my case, I imagine that most of the CPU threads will just be idle while they wait for the GPU resources to become available.

It looks like hard work for a Fortran coder to efficiently micro-manage this resource allocation and also make their code portable between compilers and GPU vendors.

One suggestion would be for a CPU thread to automatically default to a CPU calculation if the requested GPU is busy. This way one thread would finish more quickly than the others and (assuming dynamic scheduling) could then get on with the next loop/section/task. In such a scheme all CPUs and GPUs would be working simultaneously.
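
Something in that spirit can at least be sketched with the if clause on the target construct: when the condition is false, the same region executes on the host instead of the device. A rough sketch (my own, not from this thread; the use_gpu test is just a placeholder for whatever “GPU is busy” policy one would actually implement):

program gpu_or_cpu_fallback
   use omp_lib
   implicit none
   integer, parameter :: n = 100000, nchunk = 8
   integer :: ichunk, i
   real :: x(n), partial(nchunk), s
   logical :: use_gpu

   x = 1.0

   ! Host threads share the chunks; each chunk is offloaded only when the
   ! placeholder policy says so, otherwise the target region runs on the host.
   !$omp parallel do private(i, s, use_gpu) schedule(dynamic)
   do ichunk = 1, nchunk
      use_gpu = (omp_get_thread_num() == 0)   ! placeholder policy, not a real "GPU busy" test

      s = 0.0
      !$omp target teams distribute parallel do reduction(+:s) map(to: x) if(target: use_gpu)
      do i = 1, n
         s = s + x(i)
      end do
      partial(ichunk) = s
   end do
   !$omp end parallel do

   print *, 'sum =', sum(partial)   ! expect real(n*nchunk) = 800000.0
end program gpu_or_cpu_fallback

Whether this actually overlaps CPU and GPU work well depends on the runtime; the point is only that the if clause gives each host thread a per-chunk host fallback.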

Wow… I have been heavily using OpenMP for 25 years, and I am only discovering today that all loop indices are privatized by default, even those of loops that are not parallelized!

use omp_lib
use iso_fortran_env
implicit none
integer :: i, k
integer(int64) :: iloc, kloc

iloc = loc(i)
kloc = loc(k)
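
! loc() is a common extension (gfortran/ifx) returning a variable's address.
! Below, the sequential-loop index k is predetermined private, so each thread
! reports a different offset for k, while i keeps the same (shared) address.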

!$omp parallel num_threads(2)
print*, omp_get_thread_num(), "i:", loc(i)-iloc
print*, omp_get_thread_num(), "k:", loc(k)-kloc
do k = 1, 10
   continue
end do
!$omp end parallel

end

Output with gfortran (similar with ifx):

           0 i:                    0
           0 k:                  -92
           1 i:                    0
           1 k:        -327285035964
1 Like

Thanks! I have updated the code as follows

include "mkl_omp_offload.f90"
    
program matrix_multiply
   use omp_lib
   implicit none
   integer :: i, j, k, myid, m, n, istat
   real :: sup_norm, tmp
   real, allocatable, dimension(:,:) :: a, b, c, c_serial
! 
! Different Intel GPUs have varying amounts of memory. If the program
! fails at runtime, try decreasing the value of "n".
!
   n = 50

   myid = OMP_GET_THREAD_NUM()
   if (myid .eq. 0) then
      print *, 'matrix size ', n
      print *, 'Number of CPU procs is ', OMP_GET_NUM_THREADS()
      print *, 'Number of OpenMP Device Available:', omp_get_num_devices()
!$omp target 
      if (OMP_IS_INITIAL_DEVICE()) then
         print *, ' Running on CPU'
      else
         print *, ' Running on GPU'
      endif
!$omp end target 
   endif

allocate( a(n,n), b(n,n), c(n,n), c_serial(n,n), stat=istat)
if (istat/=0) error stop "Allocation of matrices FAILED!"

! Initialize matrices
do j=1,n
    do i=1,n
        a(i,j) = real(i + j - 1)/n
        b(i,j) = real(i - j + 1)/n
    enddo
enddo
c = 0.0
c_serial = 0.0
   
!$omp target data map(to: a, b) map(tofrom: c)  
!$omp target teams distribute parallel do collapse(2) private(j,i,k,tmp)
do j=1,n
    do i=1,n
        tmp = 0.0
        do k=1,n
            tmp = tmp + a(i,k) * b(k,j)
        enddo
        c(i,j) = tmp
    enddo
enddo
!$omp end target data   

! serial compute matrix multiplication
do j=1,n
    do i=1,n
        tmp = 0.0
        do k=1,n
            tmp = tmp + a(i,k) * b(k,j)
        enddo
        c_serial(i,j) = tmp
    enddo
enddo

! verify result
do j=1,n
    do i=1,n
        if (.not. isclose(c(i,j),c_serial(i,j),atol=1.0e-2) ) then
        print *,'FAILED, i, j, c_serial(i,j), c(i,j) ', i, j, c_serial(i,j), c(i,j)
        stop
        endif
    enddo
enddo
   
sup_norm = maxval(abs(c-c_serial))

print *,'PASSED'
   
write(*,*) "||c-c_serial|| = ", sup_norm
   
   
contains

   ! See https://numpy.org/doc/stable/reference/generated/numpy.isclose.html
   elemental function isclose(a,b,atol,rtol)
      real, intent(in) :: a, b
      real, intent(in), optional :: atol, rtol
      logical :: isclose

      real :: atol_, rtol_

      atol_ = 1.0e-5
      rtol_ = 1.0e-9

      if (present(atol)) atol_ = atol
      if (present(rtol)) rtol_ = rtol_

      isclose = abs(a - b) <= (atol_ + rtol_*abs(b))
   end function

end program matrix_multiply

and compiled with the following line:

 ifx -fpp /Qopenmp /Qopenmp-targets:spir64 src\03_mm_GPU.f90 -o exe\run_win.exe

To my surprise, now the code compiles ok but fails at runtime with the following error message:

 matrix size           50
 Number of CPU procs is            1
 Number of OpenMP Device Available:           1

internal compiler error, abnormal program termination

I even reduced the matrix size further, down to n=50, but to no avail.

A general question: can someone please recommend a good textbook on OpenMP in Fortran that also covers offloading to the GPU? I know there are many online resources, but I usually prefer textbooks :slight_smile:
In any case, even a good online resource would be welcome :slight_smile:

1 Like

Not a book, but you might find this training from the French IDRIS center interesting: http://www.idris.fr/media/formations/openacc/openmp_gpu_idris_fortran.pdf

I’ve also found these slides from @JeffH https://www.openmp.org/wp-content/uploads/20210924-OpenMP-update-for-DOE.pdf very useful

2 Likes

Unrelated to the discussion about GPU parallelization, I think there is a small bug in the function isclose in the matrix_multiply.f90 file that you kindly shared. The line

if (present(rtol)) rtol_ = rtol_

should be

if (present(rtol)) rtol_ = rtol
1 Like

Thanks for spotting that. I was using the default value, so hopefully it did not come into play.

I don’t know what could be the cause of the internal compiler error (ICE). It’s best to report those over at the Intel Developer Forum (Developer Software Forums - Intel Community).

I was using ifx (IFX) 2024.0.2 20231213 in my experiments. The last version you posted works for me:

$ ifx -fiopenmp -fopenmp-targets=spir64 -qmkl test.f90 
$ OMP_DEFAULT_DEVICE=1 ./a.out
 matrix size           50
 Number of CPU procs is            1
 Number of OpenMP Device Available:           2
 Running on GPU 
 PASSED
 ||c-c_serial|| =   9.5367432E-06
1 Like