Looking at Section 4 of the HPC Compilers User's Guide, Version 23.11 for ARM, OpenPower, x86, it seems the NVIDIA team has put a lot of effort in that direction.
The NVIDIA HPC compilers provide two categories of inlining:
- Automatic function inlining – In C++ and C, you can inline static functions with the inline keyword by using the -Mautoinline option, which is included with -fast.
- Function inlining – You can inline functions which were extracted to the inline libraries in Fortran, C++ and C. There are two ways of enabling function inlining: with and without the lib suboption. For the latter, you create inline libraries, for example using the nvfortran compiler driver and the -o and -Mextract options.
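To make the first category concrete, here is a minimal sketch of the kind of code it targets: a small static function marked inline, called from a loop. The names axpy_elem and saxpy_helper are made up for illustration and are not part of the NVIDIA example. Whether the compiler actually inlines the call is up to its heuristics.

```cpp
#include <cstddef>
#include <vector>

// A small static helper marked inline: the kind of function the guide says
// automatic inlining applies to (-Mautoinline, which is included with -fast).
static inline float axpy_elem(float a, float x, float y)
{
    return a * x + y;
}

// The call in the loop body is what an inliner would replace with the
// expression a*x[i] + y[i], removing the call overhead per element.
void saxpy_helper(const std::vector<float>& x, std::vector<float>& y, float a)
{
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = axpy_elem(a, x[i], y[i]);
}
```

Calling saxpy_helper(x, y, 2.0f) on two vectors of equal length updates y in place.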
Out of curiosity, I played with their saxpy example, which compares do concurrent against a plain do loop, and added a call to a C++ implementation:
saxpy.cpp
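extern "C" {
    // SAXPY, y = a*x + y. n and a are taken by reference to match
    // Fortran's default pass-by-reference argument passing.
    void saxpy_cpp(float x[], float y[], int & n, float & a)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
}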
m_saxpy.f90
module m_saxpy
  use iso_c_binding
  interface
    ! C++ routine from saxpy.cpp. The arrays are assumed-size so that the
    ! C side receives plain float pointers rather than array descriptors.
    subroutine saxpy_cpp(x,y,n,a) bind(c)
      import c_int, c_float
      real(kind=c_float) :: x(*), y(*)
      real(kind=c_float) :: a
      integer(kind=c_int) :: n
    end subroutine
  end interface
contains
  ! SAXPY written with do concurrent
  subroutine saxpy_concurrent(x,y,n,a)
    real,dimension(:) :: x, y
    real :: a
    integer :: n, i
    do concurrent (i = 1: n)
      y(i) = a*x(i)+y(i)
    enddo
  end subroutine
  ! SAXPY written with a plain sequential do loop
  subroutine saxpy_do(x,y,n,a)
    real,dimension(:) :: x, y
    real :: a
    integer :: n, i
    do i = 1, n
      y(i) = a*x(i)+y(i)
    enddo
  end subroutine
end module
main.f90
program main
  use m_saxpy
  real,allocatable :: x(:), x2(:), x3(:), y(:)
  real :: a = 2.0
  integer :: n, err, i
  integer :: c0, c1, c2, c3, c4, c5, cpar, cseq, ccpp
  n = 5e7
  allocate(x(n) , source=1.0)
  allocate(x2(n), source=1.0)
  allocate(x3(n), source=1.0)
  allocate(y(n) , source=[(real(i),i=1,n)])
  ! Make the data present on the device before timing
  !$acc enter data copyin(x,x2,x3, y, n, a)
  ! Note: y accumulates across the three calls; only the timings are compared
  call system_clock( count=c0 )
  call saxpy_do(x, y, n, a)
  call system_clock( count=c1 )
  call system_clock( count=c2 )
  call saxpy_cpp(x2, y, n, a)
  call system_clock( count=c3 )
  call system_clock( count=c4 )
  call saxpy_concurrent(x3, y, n, a)
  call system_clock( count=c5 )
  !$acc exit data delete(x,x2,x3, y, n, a)
  ! Counts are clock ticks; they are microseconds only if count_rate is 1e6
  cseq = c1 - c0
  ccpp = c3 - c2
  cpar = c5 - c4
  ! x, x2 and x3 are inputs only, so this checks that no call modified them
  err = 0
  if( any(x.ne.x2) .or. any(x.ne.x3) ) err = 1
  print *, cseq, ' microseconds do'
  print *, ccpp, ' microseconds do cpp'
  print *, cpar, ' microseconds do concurrent'
  if(err .eq. 0) then
    print *, "SAXPY: Test PASSED"
  else
    print *, "SAXPY: Test FAILED"
  endif
end program
With that in an fpm project, I used the following flags:
build.sh
export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
export FPM_CXX=/opt/nvidia/hpc_sdk/Linux_x86_64/2023/compilers/bin/nvcc
export FPM_FC=/opt/nvidia/hpc_sdk/Linux_x86_64/2023/compilers/bin/nvfortran
fpm run --flag "-fast -stdpar=gpu -acc=gpu -gpu=cc80,cuda12.0,nomanaged -Minline -Minfo=accel"
And got the following results on an RTX A6000:
29633 microseconds do
122100 microseconds do cpp
1745 microseconds do concurrent
SAXPY: Test PASSED
I am not sure whether the poor performance of the C++ implementation is due to a lack of inlining or to something else.
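One way to narrow this down would be to time the same loop from a pure C++ driver, built with the same C++ compiler and flags the project uses, so that the Fortran call boundary is out of the picture. The following is only a sketch under that assumption: saxpy_ref and time_saxpy_us are names I made up, and the loop body is copied from saxpy.cpp so the snippet stands alone.

```cpp
#include <chrono>
#include <vector>

// Same loop and signature as saxpy_cpp, kept local so this file stands alone.
extern "C" void saxpy_ref(float x[], float y[], int& n, float& a)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Hypothetical helper: fill x with 1 and y with 1..n (as main.f90 does),
// run the loop once, and return the elapsed wall-clock time in microseconds.
long long time_saxpy_us(int n, float a, std::vector<float>& y_out)
{
    std::vector<float> x(n, 1.0f);
    y_out.assign(n, 0.0f);
    for (int i = 0; i < n; ++i)
        y_out[i] = static_cast<float>(i + 1);

    auto t0 = std::chrono::steady_clock::now();
    saxpy_ref(x.data(), y_out.data(), n, a);
    auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

Calling time_saxpy_us(50000000, 2.0f, y) from a small main would give a number to set against the Fortran-driven timing above. If the two are close, the call boundary is probably not the problem and the build flags of the C++ file would be the next thing to check.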
One can read this in their manual:
4.2. Invoking Procedure Inlining
To invoke the procedure inliner, use the -Minline option. If you do not specify an inline library, the compiler performs a special prepass on all source files named on the compiler command line before it compiles any of them. This pass extracts procedures that meet the requirements for inlining and puts them in a temporary inline library for use by the compilation pass.
And regarding the restrictions on inlining:
The following types of C and C++ functions cannot be inlined:
- Functions which accept a variable number of arguments
Certain C/C++ functions can only be inlined into the file that contains their definition:
- Static functions
- Functions which call a static function
- Functions which reference a static variable
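To see what these categories look like in practice, here is a small sketch with one function per case. All names are hypothetical and only meant to illustrate the list above. Note that a function like saxpy_cpp, which is not variadic and touches no static function or variable, corresponds to the last case.

```cpp
#include <cstdarg>

// Variable number of arguments: listed as not inlinable at all.
int sum_ints(int count, ...)
{
    va_list args;
    va_start(args, count);
    int total = 0;
    for (int i = 0; i < count; ++i)
        total += va_arg(args, int);
    va_end(args);
    return total;
}

// A static variable and a static function: both have internal linkage,
// so they are only visible inside this translation unit.
static int call_count = 0;

static float scale(float a, float x)
{
    return a * x;
}

// Calls a static function: inlinable only into the file that defines it.
float scaled_sum(float a, float x, float y)
{
    return scale(a, x) + y;
}

// References a static variable: same restriction.
int next_call_id()
{
    return ++call_count;
}

// No varargs, no static function, no static variable: none of the
// restrictions listed above applies to this one.
float plain_axpy(float a, float x, float y)
{
    return a * x + y;
}
```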
Reading through it, I don't see any explicit limitation that would apply here, but I might be missing something. I would have to look at the generated assembly to get a better idea.