I am playing around with do concurrent for GPU offloading and for multicore architectures, and I wrote a very simple daxpy, which I have uploaded here. I am using nvfortran 25.5.
In short, for the sizes:
integer(int64), parameter :: n1 = 1024_int64*1024_int64*1024_int64*1_int64
integer(int64), parameter :: n2 = 1024_int64*1024_int64*1024_int64*2_int64
integer(int64), parameter :: n3 = 1024_int64*1024_int64*1024_int64*3_int64
integer(int64), parameter :: n4 = 1024_int64*1024_int64*1024_int64*4_int64
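For scale (my own arithmetic, not something in the repo): each array holds n real(dp) values at 8 bytes apiece, so n1 is 8 GiB per array and n4 is 32 GiB per array, i.e. 64 GiB for A and B together at the largest size. A quick standalone way to print these footprints:

program sizes
   use, intrinsic :: iso_fortran_env, only: int64, real64
   implicit none
   integer(int64) :: k, n
   do k = 1_int64, 4_int64
      n = 1024_int64*1024_int64*1024_int64*k
      ! footprint of one real64 array of n elements, in GiB
      print '(a,i0,a,f0.1,a)', 'n', k, ': ', real(8_int64*n, real64)/2.0_real64**30, ' GiB per array'
   end do
end program sizes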
I perform a daxpy between the vectors, B = 3.0_dp * A + B, where B is initialized to 0.0_dp and A to 1.0_dp, so every element of B should end up equal to 3.0_dp.
I compile with nvfortran -O3 fail.f90 and alternate between building with and without -stdpar=multicore. In the GitHub repo you can just run make, and as long as nvfortran is on the PATH it should work.
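Concretely, the two builds I compare are the following (the output names here are just mine; adding -Minfo also shows what the compiler parallelizes, if you want to check that):

nvfortran -O3 -o daxpy_serial fail.f90
nvfortran -O3 -stdpar=multicore -o daxpy_multicore fail.f90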
Without -stdpar=multicore all four tests pass, running the daxpy for each of n1 through n4. With multicore enabled, once I hit n3 the arrays come back filled with zeroes.
The daxpy in question is just:
subroutine do_daxpy_of_size(n)
   use, intrinsic :: iso_fortran_env, only: int64
   ! dp and check_array are host-associated from the enclosing module in the full source
   integer(int64), intent(in) :: n
   real(dp), parameter :: tol = 1.0e-8_dp
   real(dp), allocatable :: A(:), B(:)
   real(dp) :: alpha

   print *, " my size is ", n
   allocate(A(n), B(n))

   ! initialize A to 1 and B to 0
   do concurrent (i=1:n)
      A(i) = 1.0_dp
      B(i) = 0.0_dp
   end do

   ! daxpy: B = alpha*A + B, so every element of B should be 3
   alpha = 3.0_dp
   do concurrent (i=1:n)
      B(i) = alpha * A(i) + B(i)
   end do

   call check_array(B, 3.0_dp, tol, n)
   deallocate(A, B)
end subroutine do_daxpy_of_size
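check_array is not shown here; in the repo it is just a tolerance check of every element against the expected constant. A minimal sketch of what I mean (my reconstruction, not the exact code from the repo; dp is again host-associated):

subroutine check_array(X, expected, tol, n)
   use, intrinsic :: iso_fortran_env, only: int64
   integer(int64), intent(in) :: n
   real(dp), intent(in) :: X(n), expected, tol
   ! flag any element that deviates from the expected value by more than tol
   if (any(abs(X - expected) > tol)) then
      print *, "  FAIL: array does not match ", expected
   else
      print *, "  PASS"
   end if
end subroutine check_array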
Is this a bug or am I overlooking something obvious?