Parallelization on GPU with Intel compiler

I am trying to run a basic example of parallel computing on the GPU with OpenMP. I have a Windows OS and I compile using ifx/ifort.
The code is the following:

program matrix_multiply
   use omp_lib
   implicit none
   integer :: i, j, k, myid, m, n
   real, allocatable, dimension(:,:) :: a, b, c, c_serial
! 
! Different Intel GPUs have varying amounts of memory. If the program
! fails at runtime, try decreasing the value of "n".
!
   n = 2600

   myid = OMP_GET_THREAD_NUM()
   if (myid .eq. 0) then
      print *, 'matrix size ', n
      print *, 'Number of CPU procs is ', OMP_GET_NUM_THREADS()
      print *, 'Number of OpenMP Device Available:', omp_get_num_devices()
!$omp target 
      if (OMP_IS_INITIAL_DEVICE()) then
         print *, ' Running on CPU'
      else
         print *, ' Running on GPU'
      endif
!$omp end target 
   endif

   allocate( a(n,n), b(n,n), c(n,n), c_serial(n,n))

! Initialize matrices
   do j=1,n
      do i=1,n
         a(i,j) = i + j - 1
         b(i,j) = i - j + 1
      enddo
   enddo
   c = 0.0
   c_serial = 0.0

!$omp target teams map(to: a, b) map(tofrom: c)
!$omp distribute parallel do SIMD private(j, i, k)
! parallel compute matrix multiplication.
   do j=1,n
      do i=1,n
         do k=1,n
            c(i,j) = c(i,j) + a(i,k) * b(k,j)
         enddo
      enddo
   enddo
!$omp end target teams

! serial compute matrix multiplication
   do j=1,n
      do i=1,n
         do k=1,n
            c_serial(i,j) = c_serial(i,j) + a(i,k) * b(k,j)
         enddo
      enddo
   enddo

! verify result
   do j=1,n
      do i=1,n
         if (c_serial(i,j) .ne. c(i,j)) then
            print *,'FAILED, i, j, c_serial(i,j), c(i,j) ', i, j, c_serial(i,j), c(i,j)
            stop 1   ! a bare "exit" leaves only the inner loop, so PASSED would still be printed
         endif
      enddo
   enddo

   print *,'PASSED'


end program matrix_multiply

I compile using the following makefile

# Select Compiler
COMPILER =  ifx 

SWITCH = -QxHost /Qopenmp -fopenmp-targets=spir64
#SWITCH = /Qmkl /Qopenmp /warn:all /check:all /traceback /heap-arrays0 
#GARBAGE = /fast /Qparallel /Qipo /Qprec-div- /QxHost  /heap-arrays0

SRCS = src\03_mm_GPU.f90
	
EXEC = exe\run_win.exe

ifort:
	$(COMPILER) -fpp $(SWITCH) $(SRCS) -o $(EXEC) 
	
# Cleaning everything

clean:
	del *.mod
	del *.obj
	del *.pdb
	del *.ilk
	del $(EXEC)

#To compile in Mac, type:
# $ make -f makefile_mac

#To compile in Windows, type:
# $ nmake /f makefile_win
# option flag /heap-arrays0
# to store all arrays on the heap
# see https://community.intel.com/t5/Intel-Fortran-Compiler/allocatable-automatic-stack-heap/m-p/1229091#M152713

The source for this example is from “guided_matrix_mul_OpenMP” which was freely available on the web some time ago (I wasn’t able to find it again online). I adopted it with minor modifications.

When I compile the code (with ifx or with ifort, it doesn’t matter), I get the following warning:

 ifx -fpp -QxHost /Qopenmp -fopenmp-targets=spir64 src\03_mm_GPU.f90 -o exe\run_win.exe
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2022.2.0 Build 20220730
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.

ifx: command line warning #10006: ignoring unknown option '/fopenmp-targets=spir64'

Then the code runs with the following output on the screen:

 matrix size         2600
 Number of CPU procs is            1
 Number of OpenMP Device Available:           0
 Running on CPU
 PASSED

but it does not do any parallelization and it ignores the !$omp target directives. Since I’m new to this type of parallelization (I am familiar with OpenMP on the cpu only), any help would be greatly appreciated!

Which GPU type are you trying to use? As far as I know, Intel only supports direct offload (and do concurrent offload) with their own GPUs. I don’t think they support Nvidia or AMD, but I could be wrong. Maybe they do with OpenACC etc. I think the latest ifx might support SYCL (again, don’t quote me on that), which might be a way to access a non-Intel GPU.

I have two GPUs: one is Intel and the other is Nvidia. In more detail, the Intel one is

GPU 0

	Intel(R) UHD Graphics

	Driver version:	31.0.101.4502
	Driver date:	15/06/2023
	DirectX version:	12 (FL 12.1)
	Physical location:	PCI bus 0, device 2, function 0

	Utilization	1%
	Dedicated GPU memory	
	Shared GPU memory	0,4/15,8 GB
	GPU Memory	0,4/15,8 GB

and the Nvidia is

GPU 1

	NVIDIA T600 Laptop GPU

	Driver version:	31.0.15.3779
	Driver date:	02/11/2023
	DirectX version:	12 (FL 12.1)
	Physical location:	PCI bus 1, device 0, function 0

	Utilization	0%
	Dedicated GPU memory	0,0/4,0 GB
	Shared GPU memory	0,0/15,8 GB
	GPU Memory	0,0/19,8 GB

Just based on the ifx man page info, I think the GPU has to support the SPIR64 intermediate representation (whatever that is). I haven’t a clue if either the Intel or the Nvidia GPU do. I would hope the Intel GPU would at least be supported.

You might find this thread on the Intel Fortran Compiler Forum of interest. I would suggest you post a question there as well as here. I’m interested in playing around with GPUs too, but I only have Nvidia cards, so unless there is a way for ifx to support them via OpenMP/OpenCL/SYCL I’ll probably have to use the Nvidia compiler or gfortran.

That might work, but you’ll need the CodePlay oneAPI plugins: oneAPI - Codeplay Software Ltd. This would require writing C++ in the end, and interfacing via C, so it’s not the most desirable way of doing things.

You can use Intel’s OpenMP offloading with Intel® Arc™ Graphics series of products, but Float64 is not available (only available via software emulation if I’m not mistaken). Still the GPU can pull off some serious calculations, for example: Intel Arc A750 does all these fluid simulations in real-time - a showcase of FluidX3D v2.14 update. The Arc A750 and A770 both have > 0.5 TB/s memory bandwidth, so you can move a lot of data.

The other GPUs supported are the more powerful Intel Data Center GPU (Flex and Max product lines) mainly targeted at enterprises.

Assuming you have an Intel CPU, it’s possible to use ifx for offloading onto the integrated graphics unit (IGU). Personally, I have an Intel® Core™ i7-11700K Processor with an Intel(R) UHD Graphics 750 device (based on the Intel Xe architecture (Xe-LP)).

Here’s how it works (I slightly modified your program to be able to accept command line arguments):

matrix_multiply.f90 (3.5 KB)

~/lrz/mm_ifx$ export OMP_DEFAULT_DEVICE=1
~/lrz/mm_ifx$ ifx -O2 -xHOST -fiopenmp -fopenmp-targets=spir64 -qmkl matrix_multiply.f90 
~/lrz/mm_ifx$ OMP_TARGET_OFFLOAD=mandatory ./a.out 2400 F
 matrix size         2400
 Number of CPU procs is            1
 Number of OpenMP Device Available:           2
 Running on GPU 
Device time (s)  1.387, GLOPS:    19.9
Host time (s)  0.577, GLOPS:    47.9
 PASSED
~mm_ifx$ OMP_TARGET_OFFLOAD=disabled ./a.out 2400 F
 matrix size         2400
 Number of CPU procs is            1
 Number of OpenMP Device Available:           0
 Running on CPU 
Device time (s)  5.855, GLOPS:     4.7
Host time (s)  0.587, GLOPS:    47.1
 PASSED

As you can see, the host CPU is 2-3 times faster than the IGU. In the OpenMP target host “fallback” mode (OMP_TARGET_OFFLOAD=disabled), the target directive makes the computation slower still.

More things worth noting:

  • I had to modify the equality check, because it was too strict
  • On Windows the target flags look like this: ifx /Qopenmp /Qopenmp-targets:spir64 offload.f90
  • I had to install the Intel Compute Runtime from here: Releases · intel/compute-runtime · GitHub
  • The actual offloading is performed via either Level Zero or the OpenCL plugin meaning (at least) one of them has to be available on your system. For example on my Ubuntu machine:
    $ clinfo --list
    Platform #0: Intel(R) FPGA Emulation Platform for OpenCL(TM)
     `-- Device #0: Intel(R) FPGA Emulation Device
    Platform #1: Intel(R) OpenCL
     `-- Device #0: 11th Gen Intel(R) Core(TM) i7-11700K @ 3.60GHz
    Platform #2: NVIDIA CUDA
     `-- Device #0: NVIDIA GeForce RTX 2060
    Platform #3: Intel(R) OpenCL Graphics
     `-- Device #0: Intel(R) UHD Graphics 750
    
  • The OpenMP device numbering doesn’t match the clinfo platform numbering, which is why I needed to set the OMP_DEFAULT_DEVICE variable to pick the right one (I determined the order by trial and error, and using the LIBOMPTARGET_DEBUG=1 variable to inspect what’s going on under the hood).
  • To pass options to the OpenCL compiler, the LIBOMPTARGET_OPENCL_COMPILATION_OPTIONS variable is available. Optimizations are enabled by default. Valid options are described here: The OpenCL™ Specification.
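
Instead of pure trial and error, the device numbering can be probed from within a program. The following is a minimal sketch, not taken from the attached file: it loops over the device numbers reported by the OpenMP runtime and runs a tiny target region on each one, reporting whether the region actually executed on the host. The program name list_devices is made up for illustration. It only tells you whether offloading happened for a given number; identifying which physical device that number refers to still needs something like LIBOMPTARGET_DEBUG=1.

```fortran
program list_devices
   use omp_lib
   implicit none
   integer :: d, ndev
   logical :: on_host

   ndev = omp_get_num_devices()
   print '(a,i0)', 'Number of OpenMP devices: ', ndev
   print '(a,i0)', 'Default device number:    ', omp_get_default_device()

   do d = 0, ndev - 1
      ! Stays .true. if the region falls back to the host.
      on_host = .true.
      !$omp target device(d) map(tofrom: on_host)
      on_host = omp_is_initial_device()
      !$omp end target
      print '(a,i0,a,l1)', 'device ', d, ': region ran on host = ', on_host
   end do
end program list_devices
```

Compiled with the offload flags shown above (ifx -fiopenmp -fopenmp-targets=spir64 on Linux), each line of output corresponds to one value you could pass to OMP_DEFAULT_DEVICE. Compiled without offload support, the loop simply does not execute because no devices are reported.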

Some basic profiling information can be obtained using the LIBOMPTARGET_PLUGIN_PROFILE=T flag:

~/mm_ifx$ OMP_TARGET_OFFLOAD=mandatory LIBOMPTARGET_PLUGIN_PROFILE=T ./a.out 2400 F
 matrix size         2400
 Number of CPU procs is            1
 Number of OpenMP Device Available:           2
 Running on GPU 
Device time (s)  1.378, GLOPS:    20.1
Host time (s)  0.562, GLOPS:    49.2
 PASSED
================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(OPENCL) for OMP DEVICE(1) Intel(R) UHD Graphics 750, Thread 0
--------------------------------------------------------------------------------
-- Kernel 0                  : __omp_offloading_10302_260227e_MAIN___l38
-- Kernel 1                  : __omp_offloading_10302_260227e_MAIN___l76
--------------------------------------------------------------------------------
-- Name                      :     Host Time (msec)   Device Time (msec)
-- Compiling                 :                0.195                0.000
-- DataAlloc                 :                0.144                0.000
-- DataRead (Device to Host) :                0.903                0.867
-- DataWrite (Host to Device):               11.092                2.969
-- Kernel 0                  :                0.563                0.006
-- Kernel 1                  :             1363.674             1363.636
-- Linking                   :              193.445                0.000
-- OffloadEntriesInit        :                0.719                0.000
-- Total                     :             1570.736             1367.478
================================================================================

Lines 38 and 76 (the l38 and l76 suffixes in the kernel names) are where the !$omp target regions begin; the compiler generates a GPU kernel for each.

In the modified version I’ve included an option to call BLAS (following the information in the Intel Guide on Offloading oneMKL Computations onto the GPU):

~/mm_ifx$ OMP_TARGET_OFFLOAD=mandatory ./a.out 2400 T
 matrix size         2400
 Number of CPU procs is            1
 Number of OpenMP Device Available:           2
 Running on GPU 
Device time (s)  0.107, GLOPS:   259.4
Host time (s)  0.041, GLOPS:   677.6
 PASSED

Compared to the earlier results, the computations are now ten times faster. But how can we be sure it’s running on the GPU? Luckily MKL supports a Verbose Mode:

$ OMP_TARGET_OFFLOAD=mandatory MKL_VERBOSE=1 ./a.out 2400 T
 matrix size         2400
 Number of CPU procs is            1
 Number of OpenMP Device Available:           2
 Running on GPU 
MKL_VERBOSE oneMKL 2024.0 Product build 20231011 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), EVEX-encoded AES and Carry-Less Multiplication Quadword instructions, Lnx 3.60GHz lp64 intel_thread
MKL_VERBOSE Detected GPU0 Intel(R)_Xe_LP Backend:OpenCL VE:32 Stack:1 maxWGsize:512

MKL_VERBOSE oneapi::mkl::blas::column_major::gemm[float](0x3e3de50,NonTranspose,NonTranspose,2400,2400,2400,1,0xffffaaad5c220000,2400,0xffffaaad5d820000,2400,1,0xffffaaad5ac20000,2400,unset,Vector<sycl::event>OfSize:0) mode:standard host:nan device:nan GPU0
Device time (s)  0.108, GLOPS:   256.1
MKL_VERBOSE SGEMM(N,N,2400,2400,2400,0x47a004,0x7ff7629ff200,2400,0x7ff7613fe240,2400,0x47a004,0x7ff73f2a22c0,2400) 55.77ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:8
Host time (s)  1.026, GLOPS:    27.0
 PASSED

For some reason, when I turn on the MKL messages, the measured host time increases. But if I calculate the GFLOPS from the MKL info line, i.e. 2 * 2400^3 / 0.05577 * 1.0e-9 = 495.75 GFLOPS, it seems to be correct. Weird.
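
For reference, the rate calculation used throughout this post can be written as a few lines of Fortran. This is only a sketch of the arithmetic (2*n^3 floating-point operations divided by the elapsed time); the function name gflops and the program around it are made up here and are not part of the attached file. The timings are the ones quoted in the runs above.

```fortran
program gflops_check
   use, intrinsic :: iso_fortran_env, only: dp => real64
   implicit none
   integer, parameter :: n = 2400

   ! Elapsed times in seconds, copied from the outputs shown earlier.
   print '(a,f8.1)', 'device, hand-written loops: ', gflops(n, 1.387_dp)
   print '(a,f8.1)', 'host,   hand-written loops: ', gflops(n, 0.577_dp)
   print '(a,f8.1)', 'host,   MKL_VERBOSE SGEMM : ', gflops(n, 0.05577_dp)

contains

   ! One multiply and one add per innermost iteration gives 2*n**3
   ! operations; dividing by the time and scaling by 1e-9 gives GFLOPS.
   pure function gflops(n, seconds) result(rate)
      integer, intent(in) :: n
      real(dp), intent(in) :: seconds
      real(dp) :: rate
      rate = 2.0_dp * real(n, dp)**3 / seconds * 1.0e-9_dp
   end function gflops

end program gflops_check
```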


For more information concerning Intel’s implementation of OpenMP offloading, you can dig into the following links:

For OpenMP target offloading in general, the best resource available right now is probably the new book by Tom Deakin: Programming Your GPU with OpenMP: Performance Portability for GPUs | MIT Press. You can also find several tutorial videos on the OpenMP YouTube channel, including from Tom Deakin.


Here’s a die photo of the Rocket Lake processor (Source: Intel "Rocket Lake-S" Die Annotated | TechPowerUp). FWIU, the offload “device” consists of the 2x16 execution units in the top-right corner of the die:

Thanks Ivan. The man page for the latest ifx lists several compiler options related to SYCL, so I was assuming it had at least some level of SYCL support. SYCL with DPC++ will offload to Nvidia GPUs. My main system is an 8-core Zen 3 processor with an Nvidia RTX A4500 GPU with 20 GB of memory and a memory bandwidth of 640 GB/s. I got the GPU when I thought I would be doing a lot of machine learning work, but I decided to retire instead. Still, I would like to learn how to offload with ifx if possible, mainly because nvfortran is still effectively stuck at F2003. (I guess Nvidia is waiting for LLVM Flang, but at the rate it’s being developed I don’t see a stable production version for another five years or so.) I’ll do some more research on the Codeplay plugin and see if I can figure out a way to get it working. I did find a couple of videos by a Codeplay developer from the 2021 oneAPI developer conference about using oneAPI (DPC++) with Nvidia GPUs that look interesting. There might be something there that can be applied with the latest ifx.

https://www.intel.com/content/www/us/en/developer/videos/developing-nvidia-gpus-using-sycl-oneapi-part-1.html

https://www.intel.com/content/www/us/en/developer/videos/developing-nvidia-gpus-using-sycl-oneapi-part-2.html

Trivial question: the nvfortran compiler is available for linux but not Windows, correct?

Yes, the Nvidia HPC SDK download site says it will be available for Windows 64 “at a later date”.

FWIW, I managed to put together a hacky version that also works with the Nvidia HPC SDK and gfortran (but without OpenMP offloading):

matrix_multiply.F90 (5.9 KB)

~/mm_ifx$ nvfortran -fast -O3 -mp=gpu -cuda matrix_multiply.F90 -lopenblas -cudalib
~/mm_ifx$ OMP_TARGET_OFFLOAD=mandatory ./a.out 2400 F     # No BLAS 
 matrix size          2400
 Number of CPU procs is             1
 Number of OpenMP Device Available:            1
  Running on GPU
Device time (s)  0.231, GLOPS:   119.7
Host time (s) 31.633, GLOPS:     0.9
 PASSED

~mm_ifx$ OMP_TARGET_OFFLOAD=mandatory ./a.out 2400 T     # With BLAS 
 matrix size          2400
 Number of CPU procs is             1
 Number of OpenMP Device Available:            1
  Running on GPU
Device time (s)  0.016, GLOPS:  1741.4
Host time (s)  0.119, GLOPS:   232.7
 PASSED

The hacks are:

  • writing my own BLAS wrapper for the two GPU-capable BLAS libraries (cuBLAS and oneMKL)
  • relying on the fact that nvfortran’s OpenMP implementation uses CUDA behind the scenes, and target data mapping “just works”

The BLAS wrapper looks like this:

   subroutine wrapper_sgemm(transa,transb,m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
      use, intrinsic :: iso_fortran_env, only: error_unit
      use, intrinsic :: iso_c_binding, only: c_char, c_int, c_float

      character(kind=c_char), intent(in) :: transa, transb
      integer(c_int), intent(in) :: m, n, k, lda, ldb, ldc
      real(c_float), intent(in) :: alpha, beta
      real(c_float), intent(in) :: a(lda,*), b(ldb,*)
      real(c_float), intent(inout) :: c(ldc,*)
!@cuf attributes(device) :: a, b, c

   !$omp target data map(to: a(1:lda,1:k), b(1:ldb,1:n)) &
   !$omp             map(tofrom: c(1:ldc,1:n))
   
#if defined(__NVCOMPILER)
      call cublas_sgemm(transa,transb,m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
#elif defined(__INTEL_LLVM_COMPILER)
      !$omp dispatch
      call onemkl_sgemm(transa,transb,m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)   
#else
      write(error_unit,'(A)') "wrapper_sgemm: fatal error (only ifx and nvfortran are supported!)"
      error stop 1
#endif

   !$omp end target data
   end subroutine

Installing the plugin could hardly be made easier. Assuming you have working Intel oneAPI and NVIDIA HPC SDK installations:

  1. Pick the right version of your libraries here: Downloads - oneAPI for NVIDIA® GPUs - Products - Codeplay Developer, and download the shell script installer using curl.
  2. Run the install script
  3. Source the oneAPI environment settings
  4. Voilà! SYCL is available for your Nvidia GPU.
$ curl -LOJ "https://developer.codeplay.com/api/v1/products/download?product=oneapi&variant=nvidia&version=2024.0.2&filters[]=12.0&filters[]=linux" 
$ sh oneapi-for-nvidia-gpus-2024.0.2-cuda-12.0-linux.sh 
$ source /opt/intel/oneapi/setvars.sh 
$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-11700K @ 3.60GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 750 OpenCL 3.0 NEO  [24.05.28454.6]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 2060 7.5 [CUDA 12.4]

Edit:

Example of running SYCL on the NVIDIA GPU

Using the minimal example problem taken from here, I checked that it indeed runs on the NVIDIA GPU:

~/dpcpp$ icpx -fsycl -fsycl-targets=spir64,nvptx64-nvidia-cuda --cuda-path=$CUDA_HOME sycl-mwe.cpp 
~/dpcpp$ SYCL_PI_TRACE=1 ONEAPI_DEVICE_SELECTOR=ext_oneapi_cuda:gpu ./a.out
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_cuda.so [ PluginVersion: 14.38.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_unified_runtime.so [ PluginVersion: 14.37.1 ]
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Selected device: -> final score = 1500
SYCL_PI_TRACE[all]:   platform: NVIDIA CUDA BACKEND
SYCL_PI_TRACE[all]:   device: NVIDIA GeForce RTX 2060
Running on NVIDIA GeForce RTX 2060
A[ 0 ] = 0
...
A[ 99 ] = 9.9

The same program on the UHD Graphics:

~/dpcpp$ SYCL_PI_TRACE=1 ONEAPI_DEVICE_SELECTOR=opencl:gpu ./a.out
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so [ PluginVersion: 14.37.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_unified_runtime.so [ PluginVersion: 14.37.1 ]
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Selected device: -> final score = 1500
SYCL_PI_TRACE[all]:   platform: Intel(R) OpenCL Graphics
SYCL_PI_TRACE[all]:   device: Intel(R) UHD Graphics 750
Running on Intel(R) UHD Graphics 750
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Required aspect fp64 is not supported on the device
Aborted (core dumped)

@rwmsu you might want to look at my question and the answer I got here: Solving Heterogeneous Programming Challenges with Fortran and OpenMP - Intel Community. It seems it is not going to happen any time soon.

I have a question for anyone knowledgeable:

Say you have, for example, 64 CPU cores engaged in OpenMP parallelism (some complicated task, like setting up a matrix for diagonalization). Suppose the system has only one GPU device. When the cores arrive at a point which can be done on a GPU (like a BLAS operation), what happens to the GPU resources? Are they divided up among all 64 threads or is it first-come-first-serve and one thread gets the entire GPU?

Yes I saw your Intel Fortran post yesterday when I was looking around the web for info on ifx and Nvidia GPUs. Again, since the latest version of ifx (2024.1) has several compiler flags related to linking with SYCL objects, I’m hoping ifx will now supply some level of support for offloading via SYCL. I know that Intel being Intel I shouldn’t get my hopes up too high but it might be an interesting exercise to pass what looks to be another rainy day in my corner of the world.

I managed to get a little “hello world” example of a Fortran main program calling some C++ SYCL code to compile and run on Perlmutter not long ago. It absolutely should be doable (if perhaps not straightforward) to use Fortran and GPU programming together with the Intel toolchain right now.

Thank you so much! I updated my code following your instructions and changed the makefile to the following:

# Select Compiler
COMPILER =  ifx 

SWITCH = /Qopenmp /Qopenmp-targets:spir64 
#SWITCH = /Qmkl /Qopenmp /warn:all /check:all /traceback /heap-arrays0 
#GARBAGE = /fast /Qparallel /Qipo /Qprec-div- /QxHost  /heap-arrays0

SRCS = src\03_mm_GPU.f90
	
EXEC = exe\run_win.exe

ifort:
	$(COMPILER) -fpp $(SWITCH) $(SRCS) -o $(EXEC) 
	
# Cleaning everything

clean:
	del *.mod
	del *.obj
	del *.pdb
	del *.ilk
	del $(EXEC)

#To compile in Mac, type:
# $ make -f makefile_mac

#To compile in Windows, type:
# $ nmake /f makefile_win

Now the code compiles fine, but when I run it I get the message:

matrix size         2600
 Number of CPU procs is            1
 Number of OpenMP Device Available:           1
 Running on GPU
Libomptarget error: Executing target region abort target.
Libomptarget error: Run with
Libomptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
Libomptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
Libomptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory

Now it finds one OpenMP device (which is good), which I guess is my Intel graphics card. It tries to run on the GPU, but then it fails.
Can you please suggest how to update my makefile so that the code runs successfully?
Thanks!

Can you rerun your application with the debug variable and show the output?

> set LIBOMPTARGET_DEBUG=1
> run_win.exe

I actually managed to run it. But I had to replace this code (which I had copied from the Intel example):

!$omp target teams map(to: a, b) map(tofrom: c)
!$omp distribute parallel do SIMD private(j, i, k)
! parallel compute matrix multiplication.
   do j=1,n
      do i=1,n
         do k=1,n
            c(i,j) = c(i,j) + a(i,k) * b(k,j)
         enddo
      enddo
   enddo
!$omp end target teams

with your code

!$omp target data map(to: a, b) map(tofrom: c)  
!$omp target teams distribute parallel do collapse(2)
do j=1,n
    do i=1,n
    do k=1,n
        c(i,j) = c(i,j) + a(i,k) * b(k,j)
    enddo
    enddo
end do
!$omp end target data  

The code runs and passes the test. Thanks for providing the function isclose to check for (almost) equality between real numbers. I had to change the default value for the optional input atol to something larger, however. I computed the maximum absolute deviation between c and c_serial with the line

sup_norm = maxval(abs(c-c_serial))

and I got

matrix size          100
 Number of CPU procs is            1
 Number of OpenMP Device Available:           1
 Running on GPU
 PASSED
 ||c-c_serial|| =   3.0517578E-05
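
For readers without the attachment, a tolerance-based comparison of this kind can be sketched as follows. This is not the isclose from the attached file: the signature, the default tolerances, and the NumPy-style criterion |a - b| <= atol + rtol*|b| are all assumptions made for illustration. The small demo perturbs one element by the deviation reported above and compares exact equality with the tolerance-based check.

```fortran
module isclose_sketch
   implicit none
   private
   public :: isclose
contains
   ! Hypothetical helper: .true. where |a - b| <= atol + rtol*|b|.
   ! Default tolerances are assumed values, not those of the attachment.
   elemental logical function isclose(a, b, rtol, atol)
      real, intent(in) :: a, b
      real, intent(in), optional :: rtol, atol
      real :: rtol_, atol_
      rtol_ = 1.0e-5
      atol_ = 1.0e-8
      if (present(rtol)) rtol_ = rtol
      if (present(atol)) atol_ = atol
      isclose = abs(a - b) <= atol_ + rtol_ * abs(b)
   end function isclose
end module isclose_sketch

program isclose_demo
   use isclose_sketch, only: isclose
   implicit none
   real :: c(2,2), c_serial(2,2)

   c_serial = reshape([256.0, 512.0, 1024.0, 2048.0], [2, 2])
   c = c_serial
   ! Perturb one element by the deviation reported in the run above.
   c(1,1) = c(1,1) + 3.0517578e-05

   print *, 'exact equality     : ', all(c == c_serial)
   print *, 'isclose, defaults  : ', all(isclose(c, c_serial))
   print *, 'isclose, atol=1e-4 : ', all(isclose(c, c_serial, atol=1.0e-4))
   print *, 'sup norm           : ', maxval(abs(c - c_serial))
end program isclose_demo
```

Because the function is elemental, it applies to whole arrays and can be combined with all(), which avoids the explicit double loop of the original check.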

I also had to decrease the size of the matrix because it was taking forever on the GPU. I will try later with BLAS (as it is, the performance is really bad). I also have an Nvidia graphics card in my computer, but it seems ifx does not recognize it. I use it with Matlab and it works well.

The specific matrices used appear to accumulate a lot of error in single precision, and the multiplied values are very big. Double precision however isn’t available on the UHD 750 device.

Concerning the OMP directives, I think the outer parallelization only on the j loop is the culprit, i.e. the work on the matrix elements is not being distributed correctly. Do you have a link to the OpenMP page with the faulty example?
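
To make the difference concrete, here is a self-contained sketch of the combined construct with collapse(2), so that the (i, j) iteration space as a whole is distributed rather than only the j loop. The small problem size, the random input data, and the private accumulator tmp are choices made for this illustration and are not in the original program. Whether the work distribution is really what caused the runtime failure reported above is not established here.

```fortran
program mm_collapse
   implicit none
   integer, parameter :: n = 64
   integer :: i, j, k
   real :: a(n,n), b(n,n), c(n,n), tmp

   call random_number(a)
   call random_number(b)

   ! collapse(2) merges the j and i loops into one iteration space that is
   ! distributed over teams and threads; the k loop runs sequentially
   ! inside each (i, j) iteration.
   !$omp target teams distribute parallel do collapse(2) private(k, tmp) &
   !$omp    map(to: a, b) map(from: c)
   do j = 1, n
      do i = 1, n
         tmp = 0.0
         do k = 1, n
            tmp = tmp + a(i,k) * b(k,j)
         end do
         c(i,j) = tmp
      end do
   end do
   !$omp end target teams distribute parallel do

   ! Compare against the intrinsic as a host-side reference.
   print *, 'max |c - matmul(a,b)| = ', maxval(abs(c - matmul(a, b)))
end program mm_collapse
```

Accumulating into a private scalar and storing c(i,j) once also means c only needs map(from: ...), since it is never read on the device.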

I downloaded the Codeplay plugin Ivan suggested and was able to compile and run a C++/SYCL matrix multiply program from

(Look at the matrix multiply programs under the DirectProgramming subdirectory)

on my RTX A4500. I had to set LD_LIBRARY_PATH to the Nvidia HPC SDK 24.3 cuda libraries though.

However, when I tried to run the equivalent OpenMP program, it acts like it sees the GPU but says offloads are disabled, then tries to use the CPU and ends up with incorrect results. (I had to set OMP_DEFAULT_DEVICE=1 before it even tried to use the GPU; it was defaulting to 0.) I think it’s trying to fall back to using OpenCL.

Here are the last lines of output with LIBOMPTARGET_DEBUG=1

The default device id: 1
Libomptarget → Call to omp_get_num_devices returning 1
Libomptarget → Entering target region for device 1 with entry point 0x0000000000403483
Libomptarget → Call to omp_get_num_devices returning 1
Libomptarget → Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget → Call to omp_get_num_devices returning 1
Libomptarget → Call to omp_get_num_devices returning 1
Libomptarget → Call to omp_get_initial_device returning 1
Libomptarget → Device is host (1), returning as if offload is disabled
Libomptarget → Not offloading to device 1
Result of matrix multiplication using GPU offloading: Fail - The result is incorrect for element: [0, 0], expected: 45150, but found: 140534
Fail - The result is incorrect for element: [0, 1], expected: 45150, but found: 153417
Fail - The result is incorrect for element: [0, 2], expected: 45150, but found: 159191
Fail - The result is incorrect for element: [0, 3], expected: 45150, but found: 251518
Fail - The result is incorrect for element: [0, 4], expected: 45150, but found: 253624
Fail - The results mismatch!
Libomptarget → Unloading target library!
Libomptarget → Clearing Interop Table
Libomptarget → Unregistered image 0x00000000004038a0 from RTL 0x0000000000fc3350!
Libomptarget → Done unregistering images!
Libomptarget → Removing translation table for descriptor 0x0000000000403880
Libomptarget → Done unregistering library!
Target OPENCL RTL → Deinit OpenCL plugin!
Target OPENCL RTL → Closed RTL successfully
Libomptarget → Deinit target library

Still no luck with Fortran. The Codeplay software is a plugin for the C++ compiler and is probably not visible in any form to Fortran (other than by wrapping C++ code and calling it from Fortran). So I think the current status for NVIDIA GPUs is that SYCL can be made to work with a lot of effort, while OpenMP offloading might not be doable.