Bug or programmer error?

I am playing around with using do concurrent for GPU offloading and multicore architectures, I wrote a very simple daxpy which I’ve uploaded here. I am using the 25.5 version of nvfortran.

In short, for the sizes:

integer(int64), parameter :: n1 = 1024_int64*1024_int64*1024_int64*1_int64
integer(int64), parameter :: n2 = 1024_int64*1024_int64*1024_int64*2_int64
integer(int64), parameter :: n3 = 1024_int64*1024_int64*1024_int64*3_int64
integer(int64), parameter :: n4 = 1024_int64*1024_int64*1024_int64*4_int64

I perform a daxpy between vectors B = 3.0_dp *A + B, where B is initialized to 0.0_dp and A to 1.0_dp

I compile with nvfortran -O3 fail.f90 and alternate between using -stdpar=multicore. In the github you can use make and as long as nvfortran is on the path it should work.

Without multicore enabled my 4 tests pass, doing the daxpy from n1 to n4. With multicore enabled, once I hit n3 my arrays are filled with zeroes.

The daxpy in question is just:

subroutine do_daxpy_of_size(n)
use, intrinsic :: iso_fortran_env, only: int64
real(dp), parameter :: tol = 1.0e-8_dp
real(dp), allocatable :: A(:), B(:)
real(dp) :: alpha
integer(int64), intent(in) :: n

print *, " my size is ", n
allocate(A(n), B(n))
do concurrent (i=1:n)
A(i) = 1.0_dp
B(i) = 0.0_dp
end do
alpha = 3.0_dp
do concurrent (i=1:n)
  B(i) = alpha * A(i) + B(i)
end do
call check_array(B, 3.0_dp, tol, n)
deallocate(A,B)

end subroutine do_daxpy_of_size

Is this a bug or am I overlooking something obvious?

1 Like

You don’t give a complete code to test, so this is just a guess. Try declaring the loop index i to be int64. I think the range 1:n will be alright, despite the fact that the two integers are different kinds, but you might try 1_int64:n to see if that makes a difference.

it’s in the github link GitHub - JorgeG94/nvfortran-dc-bug

It’s possible that nvfortran is creating temporary arrays and running out of stack space for the larger values of n.

On a Linux machine you can try setting

ulimit -s unlimited

or increase the OpenMP stack size per thread (assuming that’s what do concurrent uses):

export OMP_STACKSIZE=512M

You can also remove the option -Mstack_arrays (if it was already set).

It seems there’s a problem when the loop index exceeds the int32 limit. I ran a test here: GitHub - gha3mi/nvfortran-dc-bug

fpm @nv-multicore --verbose

Here are the outputs:

n           =           1073741824
chunk       =           2147483647
huge(int32) =           2147483647
huge(int64) =  9223372036854775807
Initialize  :     1.064724000 [s]
Elapsed time:     0.947305000 [s]
Finalize    :     1.265888000 [s]

n           =           2147483648
chunk       =           2147483647
huge(int32) =           2147483647
huge(int64) =  9223372036854775807
Initialize  :     2.124329000 [s]
Elapsed time:     2.014653000 [s]
Finalize    :     2.347173000 [s]

n           =           3221225472
chunk       =           2147483647
huge(int32) =           2147483647
huge(int64) =  9223372036854775807
Initialize  :     3.274863000 [s]
Elapsed time:     2.804645000 [s]
Finalize    :     3.312508000 [s]

n           =           4294967296
chunk       =           2147483647
huge(int32) =           2147483647
huge(int64) =  9223372036854775807
Initialize  :     7.620918000 [s]
Elapsed time:    25.498055000 [s]
Finalize    :     6.859171000 [s]

I see that you’re also using i=1_int64,n, this is interesting

Is this the correct meaning, or is that OMP_STACKSIZE value the total size for all of the OpenMP stacks? The documentation is a little ambiguous.

I’ve tried a simpler program that does not use arrays, and it also gives an unexpected result with -stdpar=multicore and n = 3 * 10^9 (here, no output from inside do concurrent). The code works for n = 2 * 10^9 and the order of the output becomes irregular (maybe because of threading). Also, the code gives the expected result with ifort & ifx for n = 3 * 10^9 with -fopenmp. (I could not compile it with other compilers.) The assembly code corresponding to the do concurrent line (highlighted in different colors) has various i32 with -stdpar=multicore, but I don’t know whether they are related here … Anyway, it seems that the result becomes strange even without arrays, so possibly a compiler bug…? (if the program is no problem)

program main
    call test()
end program

subroutine test()
    use iso_fortran_env, only: ip => int64
    implicit none
    integer(ip) :: n, i
    i = 777

    !! n = 2_ip * (10_ip)**9   !! works for nvfort -O3 -stdpar=multicore
    n = 3_ip * (10_ip)**9   !! works for nvfort -O3

    print *, "n = ", n
    print *, "before: i = ", i

    do concurrent (i = 1_ip : n)
      if (mod(i, n / 5_ip) == 0) &
          print *, "loop:   i = ", i
    end do

    print *, "after:  i = ", i
end

Results (all with x86 nvfortran-25.5)

!! -O3 & n = 2 * 10^9
n =                2000000000
 before: i =                       777
 loop:   i =                 400000000
 loop:   i =                 800000000
 loop:   i =                1200000000
 loop:   i =                1600000000
 loop:   i =                2000000000
 after:  i =                       777

!! -O3 -stdpar=multicore & n = 2 * 10^9
n =                2000000000
 before: i =                       777
 loop:   i =                1200000000
 loop:   i =                 400000000
 loop:   i =                1600000000
 loop:   i =                 800000000
 loop:   i =                2000000000
 after:  i =                       777

!! -O3 & n = 3 * 10^9
 n =                3000000000
 before: i =                       777
 loop:   i =                 600000000
 loop:   i =                1200000000
 loop:   i =                1800000000
 loop:   i =                2400000000
 loop:   i =                3000000000
 after:  i =                       777

!! -O3 -stdpar=multicore & n = 3 * 10^9
 n =                3000000000
 before: i =                       777
 after:  i =                       777

This simple test also confirms that if a do concurrent loop’s range exceeds the int32 limit (n2 - s2 > huge(int32)), the loop fails silently

program test2
   use, intrinsic :: iso_fortran_env, only: int32, int64
   implicit none

   integer(int64), parameter :: s1 = 10_int64
   integer(int64), parameter :: s2 = 10_int64
   integer(int64), parameter :: n1 = int(huge(int32), kind=int64) + 10_int64
   integer(int64), parameter :: n2 = int(huge(int32), kind=int64) + 11_int64
   integer(int64) :: i

   print'(a)', "Test for loop range within int32"
   print'(a,g0)', "n1          =  ", n1
   print'(a,g0)', "s1          =  ", s1
   print'(a,g0)', "n1 - s1     =  ", n1 - s1
   print'(a,g0)', "huge(int32) =  ", huge(int32)
   do concurrent (i = s1:n1)
      if (i == n1) print*, "ok"
   end do
   print'(a)',    "if n1 - s1  <= huge(int32): ok"

   print*, ""

   print'(a)', "Test for loop range exceeding int32"
   print'(a,g0)', "n2          =  ", n2
   print'(a,g0)', "s2          =  ", s2
   print'(a,g0)', "n2 - s2     =  ", n2 - s2
   print'(a,g0)', "huge(int32) =  ", huge(int32)
   do concurrent (i = s2:n2)
      if (i == n2) print*, "ok"
   end do
   print'(a)',    "if n2 - s2  >  huge(int32): not ok"
end program

results:

Test for loop range within int32
n1          =  2147483657
s1          =  10
n1 - s1     =  2147483647
huge(int32) =  2147483647
 ok
if n1 - s1  <= huge(int32): ok
 
Test for loop range exceeding int32
n2          =  2147483658
s2          =  10
n2 - s2     =  2147483648
huge(int32) =  2147483647
if n2 - s2  >  huge(int32): not ok