Parallelization on GPU with Intel compiler

The books use C syntax in the examples, but they usually show the directive syntax in both languages. The concepts of OpenMP are general, however. In some cases Fortran even has a slight advantage due to its abstract arrays.

The book by Tom Deakin will have a Fortran supplement (it’s not available yet):
https://twitter.com/tjdeakin/status/1722499524847894665

Tom also has a tutorial with slides in Fortran:

1 Like

We’d have to check earlier issues of the standard to see whether it has always been like that. There are some additional rules in the 5.2 standard:

  • for loops that are associated with a loop construct (i.e. parallelized), the indexes may be listed in a private or lastprivate clause
  • loop iteration variables that are not associated with a directive may (emphasis mine) be listed in data-sharing attribute clauses on the surrounding teams, parallel, or task-generating constructs, and on enclosed constructs, subject to other restrictions

So it’s not wrong to list them in a private clause, but private is already their predetermined behavior.

My personal preference is to omit them; it’s less noisy. I never use default(none) because it leads to long variable lists. I usually pick whichever of default(shared) or default(private) leads to the shorter list, and then privatize or share whatever is needed for the code to work correctly.

Maybe this is a little surprising and subtle, but default(shared) does not influence the loop indexes, because default() only applies to variables whose data-sharing attribute is implicitly determined. The implicitly determined variables are the ones that are neither predetermined nor explicitly determined. Explicitly determined means they appear in a shared() or private() clause (subject to other rules).
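A minimal sketch of the distinction (made-up variable names; assumes any conforming OpenMP compiler):

program ds_demo
    implicit none
    integer :: i, tmp, total
    total = 0
    !$omp parallel do default(shared) private(tmp) reduction(+:total)
    do i = 1, 100      ! i is predetermined private: default(shared) never applies to it
        tmp = i*i      ! tmp is implicitly determined: shared unless privatized
        total = total + tmp
    end do
    !$omp end parallel do
    print *, total     ! 338350
end program ds_demo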

1 Like

I managed to fix the problem and I have run successfully a slightly more complicated example on the GPU. The relevant part of the code is here:

!$omp target data map(to: a_grid,z_grid,ap_grid) map(tofrom: payoff_gpu)
!$omp target teams distribute parallel do collapse(2) private(z_c,a_c,ap_c,cons,util)
do z_c=1,n_z
    do a_c=1,n_a
        do ap_c=1,n_a
            cons = R*a_grid(a_c) + z_grid(z_c) - ap_grid(ap_c)
            if (cons>0.0) then
                util = -1.0/cons
            else
                util = penalty
            endif 
            payoff_gpu(ap_c,a_c,z_c) = util
        enddo
    enddo
enddo
!$omp end target data

This runs on my GPU, but it gives different results compared to the serial version. I am trying to understand whether it is because

  1. I made a mistake in the OpenMP directives (declaring variables in 'to', 'tofrom', 'from' and 'private', etc.)
  2. Computations on the GPU are “less precise” compared to the CPU (?)

I report here below the complete code to run this example:

! Author: Alessandro Di Nola
! Date: 2024-04-09
! Test openMP parallelization CPU vs GPU 
!===============================================================================!

include "mkl_omp_offload.f90"
    
module mod_numerical
    implicit none
    
    private
    public :: linspace, isclose
    
    contains
    
    function linspace(my_start, my_stop, n)
        ! Purpose: replicate Matlab function <linspace>
        implicit none
        ! Inputs:
        integer :: n
        real :: my_start, my_stop
        ! Function result:
        real :: linspace(n)
        ! Locals:
        integer :: i
        real :: step, grid(n)

        if (n == 1) then
            grid(n) = (my_start+my_stop)/2.d0
        elseif (n>=2) then
            grid(1) = my_start
            if (n>2) then
                step = (my_stop-my_start)/(real(n-1,8))
                do i = 2, n-1
                    grid(i) = grid(i-1) + step
                end do
            end if
            grid(n) = my_stop
        endif
        linspace = grid
    
    end function linspace
    
    ! See https://numpy.org/doc/stable/reference/generated/numpy.isclose.html
    elemental function isclose(a,b,atol,rtol)
        real, intent(in) :: a, b
        real, intent(in), optional :: atol, rtol
        logical :: isclose

        real :: atol_, rtol_

        atol_ = 1.0e-5
        rtol_ = 1.0e-9

        if (present(atol)) atol_ = atol
        if (present(rtol)) rtol_ = rtol

        isclose = abs(a - b) <= (atol_ + rtol_*abs(b))
    end function

end module mod_numerical
!===============================================================================!

program main
    use omp_lib
    use mod_numerical, only: linspace, isclose
    implicit none
    ! Declare variables here
    character(len=*), parameter :: savedir = "output"
    logical :: good
    real :: t1, t2, R, beta, cons, util,penalty
    real :: t_serial,t_par_cpu,t_par_gpu
    integer :: n_z,n_a,istat,myid,a_c,ap_c,z_c,ind(3)
    real, allocatable :: payoff_serial(:,:,:), payoff_cpu(:,:,:), payoff_gpu(:,:,:)
    real, allocatable :: a_grid(:),ap_grid(:),z_grid(:),z_tran(:,:)

    write(*,*) '*********************************************** '

    ! ---------------------- Initialize variables -------------------------------!
    !write(*,*) "Enter the size of the grid, n_a ="
    !read(*,*) n_a
    n_a   = 500           ! Number of grid points for asset, a
    n_z   = 5            ! Number of grid points for income shock, z
    R     = 1.03d0       ! Gross interest rate, R = 1+r
    beta  = 0.96d0       ! Discount factor
    penalty = -huge(0.0) ! Penalty for negative consumption

    ! ---------------------- Define grids ---------------------------------------!
    allocate(a_grid(n_a),ap_grid(n_a),z_grid(n_z),z_tran(n_z,n_z),stat=istat)
    if (istat /= 0) then
        error stop "Error allocating memory for arrays z_grid and z_tran"
    endif
    a_grid = linspace(0.01, 4.0, n_a)
    ap_grid = a_grid
    z_grid = [real :: 0.9792, 0.9896, 1.0000, 1.0106, 1.0212]
    z_tran = reshape( [real :: 0.9727, 0.0273, 0., 0., 0., &
                        0.0041, 0.9806, 0.0153, 0.0, 0.0, &
                        0.0, 0.0082, 0.9837, 0.0082, 0.0, &
                        0.0, 0.0, 0.0153, 0.9806, 0.0041, &
                        0.0, 0.0, 0.0, 0.0273, 0.9727 ], [5,5])

    !if (do_disp) call disp("Matrix z_tran", z_tran)

    ! Main program statements here
    write(*,*) "Testing parallelization with OpenMP"
    
    !$omp parallel
    myid = OMP_GET_THREAD_NUM()
    if (myid == 0) then
        print *, 'Grid size ', n_a
        print *, 'Number of procs is ', OMP_GET_NUM_THREADS()
    endif
    !$omp end parallel

    ! Initialize big array
    allocate(payoff_cpu(n_a,n_a,n_z),payoff_serial(n_a,n_a,n_z),payoff_gpu(n_a,n_a,n_z), stat=istat)
    if (istat /= 0) then
        error stop "Error allocating memory for payoff_cpu,payoff_gpu,payoff_serial"
    end if
    
    write(*,*) "1. Serial" 
    t1 = omp_get_wtime()
    do z_c=1,n_z
        do a_c=1,n_a
            do ap_c=1,n_a
                cons = R*a_grid(a_c) + z_grid(z_c) - ap_grid(ap_c)
                if (cons>0.0) then
                    util = -1.0/cons
                else
                    util = penalty
                endif 
                payoff_serial(ap_c,a_c,z_c) = util
            enddo !end ap_c
        enddo !end a_c
    enddo !end z_c
    t2 = omp_get_wtime()
    t_serial = t2-t1
    
    write(*,*) "2. Parallel with OpenMP on CPU" 
    t1 = omp_get_wtime()
    !$omp parallel default(shared) & 
    !$ private(z_c,a_c,ap_c,cons,util)
    !$omp do collapse(2) 
    do z_c=1,n_z
        do a_c=1,n_a
            do ap_c=1,n_a
                cons = R*a_grid(a_c) + z_grid(z_c) - ap_grid(ap_c)
                if (cons>0.0) then
                    util = -1.0/cons
                else
                    util = penalty
                endif 
                payoff_cpu(ap_c,a_c,z_c) = util
            enddo
        enddo
    enddo
    !$omp end do
    !$omp end parallel
    t2 = omp_get_wtime()
    t_par_cpu = t2-t1

    !======================================= GPU OPENMP ======================================!
    
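    ! Note: the next calls are outside any parallel region, so omp_get_thread_num()
    ! returns 0 and omp_get_num_threads() returns 1 (hence "procs is 1" in the output)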
    myid = OMP_GET_THREAD_NUM()
    if (myid .eq. 0) then
        print *, 'Grid size ', n_a
        print *, 'Number of CPU procs is ', OMP_GET_NUM_THREADS()
        print *, 'Number of OpenMP Device Available:', omp_get_num_devices()
    !$omp target
    if (OMP_IS_INITIAL_DEVICE()) then
        print *, ' Running on CPU'
    else
        print *, ' Running on GPU'
    endif
    !$omp end target
    endif
    
    write(*,*) "3. Parallel with GPU" 
    t1 = omp_get_wtime()
    !!!!$ private(z_c,a_c,ap_c,cons,util)
    !$omp target data map(to: a_grid,z_grid,ap_grid) map(tofrom: payoff_gpu)
    !$omp target teams distribute parallel do collapse(2) private(z_c,a_c,ap_c,cons,util)
    do z_c=1,n_z
        do a_c=1,n_a
            do ap_c=1,n_a
                cons = R*a_grid(a_c) + z_grid(z_c) - ap_grid(ap_c)
                if (cons>0.0) then
                    util = -1.0/cons
                else
                    util = penalty
                endif 
                payoff_gpu(ap_c,a_c,z_c) = util
            enddo
        enddo
    enddo
    !$omp end target data
    t2 = omp_get_wtime()
    t_par_gpu = t2-t1
    !====================================================================================================!
    
    ! verify result
    write(*,*) "Verify results, CPU parallel" 
    good = .true.
    outer: do z_c=1,n_z
        do a_c=1,n_a
            do ap_c=1,n_a
                if ( .not. isclose(payoff_serial(ap_c,a_c,z_c),payoff_cpu(ap_c,a_c,z_c),1.0e-3) ) then
                    write(*,*) 'ap_c = ', ap_c, ' a_c = ', a_c, ' z_c = ', z_c
                    write(*,*) 'payoff_serial = ', payoff_serial(ap_c,a_c,z_c),' payoff_cpu = ', payoff_cpu(ap_c,a_c,z_c)
                    print *,'FAILED'
                    good = .false.
                    exit outer 
                endif
            enddo
        enddo
    enddo outer
    
    if (good) print *,'CPU PASSED'
    
    write(*,*) "Verify results, GPU parallel" 
    good = .true.
    outer2: do z_c=1,n_z
        do a_c=1,n_a
            do ap_c=1,n_a
                if ( .not. isclose(payoff_serial(ap_c,a_c,z_c),payoff_gpu(ap_c,a_c,z_c),1.0e-3) ) then
                    write(*,*) 'ap_c = ', ap_c, ' a_c = ', a_c, ' z_c = ', z_c
                    write(*,*) 'payoff_serial = ', payoff_serial(ap_c,a_c,z_c),' payoff_cpu = ', payoff_gpu(ap_c,a_c,z_c)
                    print *,'FAILED'
                    good = .false.
                    exit outer2 
                endif
            enddo
        enddo
    enddo outer2
    
    if (good) print *,'GPU PASSED'
    
    write(*,'(A,F5.3,A)') "Time for serial =       ", t_serial,  ' seconds'
    write(*,'(A,F5.3,A)') "Time for parallel CPU = ", t_par_cpu, ' seconds'
    write(*,'(A,F5.3,A)') "Time for parallel GPU = ", t_par_gpu, ' seconds'
    write(*,'(A,F5.3)') "Parallel_cpu/serial = ", t_par_cpu/t_serial
    write(*,'(A,F5.3)') "Parallel_gpu/serial = ", t_par_gpu/t_serial
    
    write(*,'(A,F15.6)') "||payoff_cpu-payoff_serial|| = ", maxval(abs(payoff_cpu-payoff_serial))
    write(*,'(A,F15.6)') "||payoff_gpu-payoff_serial|| = ", maxval(abs(payoff_gpu-payoff_serial))
    
    if (maxval(abs(payoff_gpu-payoff_serial))>1e-2) then
        write(*,*) "Results on GPU seem wrong..."
        ind = maxloc(abs(payoff_gpu-payoff_serial))
        write(*,'(A,F15.6)') "payoff_serial(ind) = ", payoff_serial(ind(1),ind(2),ind(3))
        write(*,'(A,F15.6)') "payoff_gpu(ind)    = ", payoff_gpu(ind(1),ind(2),ind(3))
    endif
    
    
    !call sub_write()
    
    contains
    
    subroutine sub_write()
    
    ! Comment out this part if you don't want to write to txt files
        call writescalar_i(n_a,trim(savedir)//'n_a.txt')
        call writescalar_i(n_z,trim(savedir)//'n_z.txt')
        call write_arr3(payoff_serial,trim(savedir)//'payoff_serial.txt')
        call write_arr3(payoff_cpu,trim(savedir)//'payoff_cpu.txt')
        call write_arr3(payoff_gpu,trim(savedir)//'payoff_gpu.txt')
    
    end subroutine sub_write
    
    subroutine writescalar_i(x,file_name)
        implicit none
        !Declare inputs:
        integer, intent(in) :: x
        character(len=*), intent(in) :: file_name
        integer :: unitno,ierr
    
        !Write scalar x into a txt file
        open(newunit=unitno, file=file_name, status='replace',  iostat=ierr)
        if (ierr/=0) then
            error stop "Error: writescalar: cannot open file"
        endif
    
        write(unitno,*) x

        close(unitno)
    end subroutine writescalar_i
    
    subroutine write_arr1(x,file_name)
        real, intent(in) :: x(:)
        character(len=*), intent(in) :: file_name
        integer :: unitno, ierr, i1
    
        !Write 1 dim array x into a txt file
        open(newunit=unitno, file=file_name, status='replace',  iostat=ierr)
        if (ierr/=0) then
            error stop "Error: write_arr1: cannot open file"
        endif
        
        do i1 = 1,size(x,dim=1)
            write(unitno,*) x(i1)
        enddo
        
        close(unitno)
    
    end subroutine write_arr1
    
    subroutine write_arr2(x,file_name)
        real, intent(in) :: x(:,:)
        character(len=*), intent(in) :: file_name
        integer :: unitno, ierr,i1,i2
    
!Write 2-dim array x into a txt file
        open(newunit=unitno, file=file_name, status='replace',  iostat=ierr)
        if (ierr/=0) then
            error stop "Error: write_arr2: cannot open file"
        endif
        
        do i2 = 1,size(x,dim=2)
            do i1 = 1,size(x,dim=1)
                write(unitno,*) x(i1,i2)
            enddo
        enddo
        
        close(unitno)
    
    end subroutine write_arr2
    
    subroutine write_arr3(x,file_name)
        real, intent(in) :: x(:,:,:)
        character(len=*), intent(in) :: file_name
        integer :: unitno, ierr,i1,i2,i3
    
!Write 3-dim array x into a txt file
        open(newunit=unitno, file=file_name, status='replace',  iostat=ierr)
        if (ierr/=0) then
            error stop "Error: write_arr3: cannot open file"
        endif
        
        do i3 = 1,size(x,dim=3)
            do i2 = 1,size(x,dim=2)
                do i1 = 1,size(x,dim=1)
                    write(unitno,*) x(i1,i2,i3)
                enddo
            enddo
        enddo
        
        close(unitno)
    
    end subroutine write_arr3

end program main

I compiled the above code with

 ifx /Qopenmp /Qopenmp-targets:spir64 src\main1.f90 -o exe\run_win.exe

and obtained the following results:

  ***********************************************
 Testing parallelization with OpenMP
 Grid size          500
 Number of procs is           20
 1. Serial
 2. Parallel with OpenMP on CPU
 Grid size          500
 Number of CPU procs is            1
 Number of OpenMP Device Available:           1
 Running on GPU
 3. Parallel with GPU
 Verify results, CPU parallel
 CPU PASSED
 Verify results, GPU parallel
 ap_c =          181  a_c =           58  z_c =            1
 payoff_serial =   -103.4175      payoff_cpu =   -103.4188
 FAILED
Time for serial =       0.000 seconds
Time for parallel CPU = 0.000 seconds
Time for parallel GPU = 0.000 seconds
Parallel_cpu/serial =   NaN
Parallel_gpu/serial =   NaN
||payoff_cpu-payoff_serial|| =        0.000000
||payoff_gpu-payoff_serial|| =    17476.250000
 Results on GPU seem wrong...
payoff_serial(ind) =  -279620.250000
payoff_gpu(ind)    =  -262144.000000

Don’t you need to collapse three loops? All elements are independent.

1 Like

Yes, but I don’t have to… with OpenMP on the CPU it is usually best to collapse only the two outer loops. In any case the results should be correct, or am I wrong?

1 Like

I tried collapse(3) but the results are the same. The odd thing is that almost all of the elements of the array computed on the GPU coincide with the CPU array; fewer than 1% of them are different.

write(*,*) 'Fraction of elements that vary less than 0.00001 in abs:'
write(*,*) real(count(abs(payoff_gpu-payoff_serial)<0.00001))/real(size(payoff_gpu))

and I get

Fraction of elements that vary less than 0.00001 in abs:
  0.9942720

You are right, sorry for the oversight. As a general rule, GPU programming guidelines state that maximizing the amount of parallel work is beneficial, which here means collapsing all three loops, as in the sketch below.
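The only change is the directive; the loop body stays the same (a sketch with the body elided, and per the data-sharing discussion earlier the loop indexes need no explicit clause):

!$omp target teams distribute parallel do collapse(3) private(cons,util)
do z_c=1,n_z
    do a_c=1,n_a
        do ap_c=1,n_a
            ! ... same loop body as in the earlier snippet ...
        enddo
    enddo
enddo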

Concerning the fraction of wrong values, could it be due to the hardware lacking correctly-rounded division? The clinfo program for the UHD Graphics 750 shows:

  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No

But I don’t know if that’s the real culprit.

Just for the record, nvfortran returns the following with your program:

(base) ivan@maxwell:~/mm_ifx$ nvfortran -mp=gpu -Minfo=all -o payoff payoff.f90
isclose:
     56, FMA (fused multiply-add) instruction(s) generated
main:
    105, !$omp parallel
    124, FMA (fused multiply-add) instruction(s) generated
    139, !$omp parallel
    166, !$omp target
        166, Generating "nvkernel_MAIN__F1L166_4" GPU kernel
    166, Generating implicit map(tofrom:payoff_gpu(:,:,:),z_grid(:),a_grid(:),ap_grid(:)) 
    179, !$omp target teams distribute parallel do
        178, Generating implicit map(to:z_grid(:)) 
             Generating implicit map(tofrom:payoff_gpu(:,:,:)) 
             Generating implicit map(to:a_grid(:),ap_grid(:)) 
        179, Generating "nvkernel_MAIN__F1L179_6" GPU kernel
    179, Generating implicit map(tofrom:z_grid(:),payoff_gpu(:,:,:),a_grid(:),ap_grid(:)) 
    241, maxval reduction inlined
    242, maxval reduction inlined
    244, maxval reduction inlined
(base) ivan@maxwell:~/mm_ifx$ ./payoff
 *********************************************** 
 Testing parallelization with OpenMP
 Grid size           500
 Number of procs is            16
 1. Serial
 2. Parallel with OpenMP on CPU
 Grid size           500
 Number of CPU procs is             1
 Number of OpenMP Device Available:            1
  Running on GPU
 3. Parallel with GPU
 Verify results, CPU parallel
 CPU PASSED
 Verify results, GPU parallel
 GPU PASSED
Time for serial =       0.000 seconds
Time for parallel CPU = 0.000 seconds
Time for parallel GPU = 0.000 seconds
Parallel_cpu/serial =   NaN
Parallel_gpu/serial =   NaN
||payoff_cpu-payoff_serial|| =    0.000000E+00
||payoff_gpu-payoff_serial|| =    0.000000E+00

I also had to modify line 140 to compile the program; the !$omp sentinel needs to be repeated on the continuation line:

    !$omp parallel default(shared) & 
    !$omp private(z_c,a_c,ap_c,cons,util)

On the other hand, the fact that all the timing values are zero strikes me as a bit suspicious.

Edit: for timing it helps to use G0 as the format specifier, and increase the precision:

    real(kind(1.0d0)) :: t1, t2
    real(kind(1.0d0)) :: t_serial,t_par_cpu,t_par_gpu
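With those declarations, the timing prints could look like this (a sketch of the changed lines only):

    write(*,'(A,G0,A)') "Time for serial =       ", t_serial,  ' seconds'
    write(*,'(A,G0,A)') "Time for parallel CPU = ", t_par_cpu, ' seconds'
    write(*,'(A,G0,A)') "Time for parallel GPU = ", t_par_gpu, ' seconds'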

If you store the inverse payoff,

                if (cons>0.0) then
                    util = cons
                else
                    util = 1.0/penalty
                endif 

the result becomes,

 *********************************************** 
 Testing parallelization with OpenMP
 Grid size         5000
 Number of procs is           16
 1. Serial
 2. Parallel with OpenMP on CPU
 Grid size         5000
 Number of CPU procs is            1
 Number of OpenMP Device Available:           2
 Running on GPU 
 3. Parallel with GPU
 Verify results, CPU parallel
 CPU PASSED
 Verify results, GPU parallel
 GPU PASSED
Time for serial =       2.645219087600708 seconds
Time for parallel CPU = .2093110084533691 seconds
Time for parallel GPU = .7673230171203613 seconds
Parallel_cpu/serial = .7912804252566483E-01
Parallel_gpu/serial = .2900791925769544
||payoff_cpu-payoff_serial|| = .000000
||payoff_gpu-payoff_serial|| = .9536743E-06

which corroborates my hypothesis.

1 Like

Thanks! nvfortran is available only for Linux, right? I have access to Windows computers only 🙁

Edit: I changed the payoff to the square root of consumption and the discrepancy between serial and GPU is greatly reduced:

if (cons>0.0) then
    util = sqrt(cons)
else
    util = penalty
endif 

and now I get

||payoff_cpu-payoff_serial|| =        0.000000
||payoff_gpu-payoff_serial|| =        0.000062

Still, the computations with OpenMP on the CPU are more precise.
From these examples, should I conclude that my integrated graphics card is really bad and can scarcely be used for numerical computations?
Note: I also have a good NVIDIA card in my PC; it’s just that I can’t use it with ifx.

You could use the Windows Subsystem for Linux: Windows 10 WSL Ubuntu 20.04: Fortran MPI+OpenACC+DC GPU code not running - nvc, nvc++ and nvfortran - NVIDIA Developer Forums (maybe @sumseq can comment on this)

2 Likes

It depends. The OpenCL math library (used by ifx behind the scenes) has more relaxed precision requirements: https://registry.khronos.org/OpenCL/specs/opencl-1.2.pdf#page=319

For division x/y it requires a precision of only ≤ 2.5 ulp, and for sqrt ≤ 3 ulp. If you can live with the errors, or can avoid divisions and other mathematical functions, it may still be okay.

Edit: the values here seem to be worse than 2.5 ulp however,

payoff_serial(ind) = -279620.2
payoff_gpu(ind)    = -262144.0
payoff_serial(ind) (EX) = -0X8.88888P+15
payoff_gpu(ind)    (EX) = -0X8.P+15
cons = .3576279E-05
1/cons = -279620.3
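To put that difference in terms of ulps, one can divide by the spacing intrinsic, which gives the gap between adjacent reals near the reference value (a quick sketch using the numbers above):

program ulp_check
    implicit none
    real :: ref, got
    ref = -279620.25   ! serial (CPU) value
    got = -262144.0    ! GPU value
    print *, abs(got - ref)/spacing(ref)   ! ~5.6E+05 ulp, far beyond 2.5
end program ulp_check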

Edit 2: at the worst value, there appears to be catastrophic cancellation when calculating cons: R*a_grid(a_c) + z_grid(z_c) is almost equal to ap_grid(ap_c), so the subtraction cancels nearly all the leading digits, and any tiny difference in the intermediates (e.g. from FMA contraction) is hugely magnified in -1.0/cons:

$ ifx -O0 -fopenmp-targets=spir64 -fiopenmp test2.f90 -warn all -o test2
$ OMP_DEFAULT_DEVICE=1 ./test2
 ap   2.080958    
 a    1.049477    
 z    1.000000    
CPU cons .3576279E-05
GPU cons .3814697E-05
CPU cons 0XF.P-22
GPU cons 0X8.P-21
CPU payoff -279620.3
GPU payoff -262144.0
CPU payoff -0X8.88889P+15
GPU payoff -0X8.P+15
! test2.f90
program opencl_accuracy
implicit none
integer :: n_a, n_z, i1, i2, i3
real :: R
real :: cons_cpu, cons_gpu, payoff_cpu, payoff_gpu
real, allocatable :: a_grid(:), ap_grid(:), z_grid(:)

 n_a   = 500           ! Number of grid points for asset, a
 n_z   = 5            ! Number of grid points for income shock, z
 R     = 1.03d0       ! Gross interest rate, R = 1+r

 ! ---------------------- Define grids ---------------------------------------!
 allocate(a_grid(n_a),ap_grid(n_a),z_grid(n_z))

a_grid = linspace(0.01, 4.0, n_a)
ap_grid = a_grid

z_grid = [real :: 0.9792, 0.9896, 1.0000, 1.0106, 1.0212]

i1 = 260
i2 = 131
i3 = 3

print *, "ap", ap_grid(i1)
print *, "a ", a_grid(i2)
print *, "z ", z_grid(i3)

cons_cpu = R*a_grid(i2) + z_grid(i3) - ap_grid(i1)
payoff_cpu = -1.0/cons_cpu

!$omp target map(to:a_grid, z_grid, ap_grid,R) map(from: cons_gpu, payoff_gpu)
cons_gpu = R*a_grid(i2) + z_grid(i3) - ap_grid(i1)
payoff_gpu = -1.0/cons_gpu
!$omp end target

write(*,'(A,G0)') "CPU cons ", cons_cpu
write(*,'(A,G0)') "GPU cons ", cons_gpu
write(*,'(A,EX0.0)') "CPU cons ", cons_cpu
write(*,'(A,EX0.0)') "GPU cons ", cons_gpu

write(*,'(A,G0)') "CPU payoff ", payoff_cpu
write(*,'(A,G0)') "GPU payoff ", payoff_gpu
write(*,'(A,EX0.0)') "CPU payoff ", payoff_cpu
write(*,'(A,EX0.0)') "GPU payoff ", payoff_gpu

contains

function linspace(my_start, my_stop, n)
   ! Purpose: replicate Matlab function <linspace>
   implicit none
   ! Inputs:
   integer :: n
   real :: my_start, my_stop
   ! Function result:
   real :: linspace(n)
   ! Locals:
   integer :: i
   real :: step, grid(n)

   if (n == 1) then
      grid(n) = (my_start+my_stop)/2.d0
   elseif (n>=2) then
      grid(1) = my_start
      if (n>2) then
          step = (my_stop-my_start)/(real(n-1,8))
          do i = 2, n-1
              grid(i) = grid(i-1) + step
          end do
      end if
      grid(n) = my_stop
   endif
   linspace = grid
end function linspace

end program
1 Like

What compiler version are you using? Run ifx -what -V along with the other options. I ran this code on Linux without an error. I’ll try it on my Windows PC.

1 Like

Hi,

Yes, WSL on Windows does work with nvfortran for GPU offload.
Recently I have found that the performance is not as good as in native Linux, but it’s not too bad.

Some tips:

  1. Make sure you are logged into an admin account.
  2. Make sure you are connected to the internet.

Open command prompt as admin (or powershell).

wsl --install

REBOOT!

wsl --install -d Ubuntu-20.04

[Should be prompted to make account username and passwd]

sudo apt update
sudo apt upgrade
sudo apt update
sudo apt upgrade

sudo apt install bash make build-essential gfortran tar openmpi-bin libopenmpi-dev git libhdf5-dev python3-pip

pip3 install numpy matplotlib

GPU:

Download CUDA toolkit into WSL:

GOTO: Linux->x86_64->WSL-Ubuntu->2.0->deb (network)

Copy-paste and run commands one at a time

Then, for the NVIDIA HPC SDK, select Linux x86_64 Ubuntu (apt)

Copy-paste and run commands one at a time

Add the following to your .bashrc file:

#!/bin/bash

version=24.3

NVARCH=`uname -s`_`uname -m`; export NVARCH
NVCOMPILERS=/opt/nvidia/hpc_sdk; export NVCOMPILERS
MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/$version/compilers/man; export MANPATH
PATH=$NVCOMPILERS/$NVARCH/$version/compilers/bin:$PATH; export PATH

export PATH=$NVCOMPILERS/$NVARCH/$version/comm_libs/openmpi/openmpi-3.1.5/bin:$PATH
export MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/$version/comm_libs/openmpi/openmpi-3.1.5/man
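
Once everything is installed, a minimal Fortran program (a sketch mirroring the device check earlier in this thread) confirms that offload works:

! check_offload.f90 -- e.g. nvfortran -mp=gpu check_offload.f90 -o check_offload
program check_offload
    use omp_lib
    implicit none
    print *, 'Number of OpenMP devices available:', omp_get_num_devices()
    !$omp target
    if (omp_is_initial_device()) then
        print *, 'Running on CPU'
    else
        print *, 'Running on GPU'
    endif
    !$omp end target
end program check_offload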

2 Likes

Thanks! I tried it again on another computer (also on Windows, with ifx) and it runs OK with the following output:

 matrix size           50
 Number of CPU procs is            1
 Number of OpenMP Device Available:           1
 Running on GPU
 PASSED
 ||c-c_serial|| =   9.5367432E-06

Compilation with
ifx -fpp /Qopenmp /Qopenmp-targets:spir64 matrix_multiply.f90 -o run_win.exe

I got the error on my office PC. I will try it again on Monday to see if the problem persists.

I ran the same code again following your suggestions, i.e.

  • Replacing the payoff with

if (cons>0.0) then
    util = cons
else
    util = 1.0/penalty
endif

  • Using G0 as a format descriptor for the timings, and increasing the precision.

The results are:

 ***********************************************
 Testing parallelization with OpenMP
 Grid size         5000
 Number of procs is           20
 1. Serial
 2. Parallel with OpenMP on CPU
 Grid size         5000
 Number of CPU procs is            1
 Number of OpenMP Device Available:           1
 Running on GPU
 3. Parallel with GPU
 Verify results, CPU parallel
 CPU PASSED
 Verify results, GPU parallel
 GPU PASSED
Time for serial =       .8310039999196306E-01 seconds
Time for parallel CPU = .3115119997528382E-01 seconds
Time for parallel GPU = .3089370000234339 seconds
Parallel_cpu/serial = .3748622146018137
Parallel_gpu/serial = 3.717635535488546
||payoff_cpu-payoff_serial|| = .000000
||payoff_gpu-payoff_serial|| = .9536743E-06

so it seems that the GPU is more than three times slower than the serial code, while in your case GPU/serial is 0.29.
I guess this is because we use different computers and/or operating systems. Still, I’m a bit disappointed that offloading to the GPU makes the code so much slower.

Try moving the timing inside the target data region, as in the sketch below. In a more realistic scenario, the slow transfers to and from the GPU would be amortized by more computation (i.e. more iterations). That said, expecting integrated graphics to be faster than the CPU cores on general tasks is not realistic. The purpose of the integrated GPU (iGPU) is to lighten the load of the CPU cores, for instance when running an application or playing a game.
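Something like this (a sketch based on the kernel above) times only the kernel, with the transfers outside the timed section:

!$omp target data map(to: a_grid,z_grid,ap_grid) map(tofrom: payoff_gpu)
t1 = omp_get_wtime()   ! host-to-device transfers are already done at this point
!$omp target teams distribute parallel do collapse(3) private(cons,util)
do z_c=1,n_z
    do a_c=1,n_a
        do ap_c=1,n_a
            cons = R*a_grid(a_c) + z_grid(z_c) - ap_grid(ap_c)
            if (cons>0.0) then
                util = -1.0/cons
            else
                util = penalty
            endif
            payoff_gpu(ap_c,a_c,z_c) = util
        enddo
    enddo
enddo
t2 = omp_get_wtime()   ! the target construct is synchronous, so the kernel has finished
!$omp end target data  ! the device-to-host copy of payoff_gpu happens after the timing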

1 Like

Concerning the accuracy issues, I noticed that Intel published some guidance on how to control the floating-point precision when using the GPU:

oneAPI GPU Optimization Guide - Accuracy Versus Performance Tradeoffs in Floating-Point Computations | Intel

1 Like