LANL Report – Looming Fortran Talent Scarcity is Threatening

Looking at Section 4 of the HPC Compilers User's Guide, Version 23.11 (for ARM, OpenPower, x86), it seems the NVIDIA team has put a big effort in that direction.

The NVIDIA HPC compilers provide two categories of inlining:

  • Automatic function inlining – In C++ and C, you can inline static functions with the inline keyword by using the -Mautoinline option, which is included with -fast.
  • Function inlining – You can inline functions which were extracted to the inline libraries in Fortran, C++ and C. There are two ways of enabling function inlining: with and without the lib suboption. For the latter, you create inline libraries, for example using the nvfortran compiler driver and the -o and -Mextract options.
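To make the first category concrete, here is a minimal sketch of the kind of function the quoted text describes: a small helper marked static inline and called from a loop. The names axpy_elem and saxpy_inl are hypothetical and do not come from the manual, and whether the compiler really inlines the call is something to check (for example with -Minfo=inline), not something this snippet guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Small helper marked `static inline`: the kind of function the quoted
// text says -Mautoinline targets. Whether it is actually inlined is
// up to the compiler and should be verified, not assumed.
static inline float axpy_elem(float a, float x, float y)
{
    return a * x + y;
}

// Hypothetical caller: the loop body is a call the compiler may inline.
void saxpy_inl(const std::vector<float>& x, std::vector<float>& y, float a)
{
    assert(x.size() == y.size());  // both arrays must have the same length
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] = axpy_elem(a, x[i], y[i]);
}
```

Since the helper and its caller live in the same file, this case does not depend on inline libraries or on the extraction prepass.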

Out of curiosity, I played with their saxpy example, which compares a plain do loop against do concurrent, and added a call to a C++ implementation:

saxpy.cpp
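extern "C" {
    // n and a are taken by reference because Fortran passes arguments
    // by reference unless they are declared with the value attribute.
    void saxpy_cpp(float x[], float y[], int & n, float & a)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
}

m_saxpy.f90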
module m_saxpy
    use iso_c_binding

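    interface
      ! Assumed-size dummies (x(*), y(*)) so that plain pointers are passed,
      ! matching float x[] / float y[] on the C++ side; assumed-shape (x(:))
      ! dummies in a bind(c) interface are passed as C descriptors instead.
      subroutine saxpy_cpp(x,y,n,a) bind(c)
        import c_int, c_float
        real(kind=c_float) :: x(*), y(*)
        real(kind=c_float) :: a
        integer(kind=c_int) :: n
      end subroutine
    end interface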
    contains
    subroutine saxpy_concurrent(x,y,n,a)
        real,dimension(:) :: x, y
        real :: a
        integer :: n, i  
        do concurrent (i = 1: n)
          y(i) = a*x(i)+y(i)
        enddo  
    end subroutine 

    subroutine saxpy_do(x,y,n,a)
        real,dimension(:) :: x, y
        real :: a
        integer :: n, i  
        do i = 1, n
          y(i) = a*x(i)+y(i)
        enddo  
    end subroutine 
end module

main.f90
program main
    use m_saxpy
    real,allocatable :: x(:), x2(:), x3(:), y(:)
    real :: a = 2.0
    integer :: n, err
    integer :: c0, c1, c2, c3, c4, c5, cpar, cseq, ccpp
    n = 5e7

    allocate(x(n) , source=1.0)
    allocate(x2(n), source=1.0)
    allocate(x3(n), source=1.0)
    allocate(y(n) , source=[(real(i),i=1,n)])

    !$acc enter data copyin(x,x2,x3, y, n, a)
    call system_clock( count=c0 )
    call saxpy_do(x, y, n, a)
    call system_clock( count=c1 )
    
    call system_clock( count=c2 )
    call saxpy_cpp(x2, y, n, a)
    call system_clock( count=c3 )

    call system_clock( count=c4 )
    call saxpy_concurrent(x3, y, n, a)
    call system_clock( count=c5 )
    !$acc exit data delete(x,x2,x3, y, n, a)

    cseq = c1 - c0
    ccpp = c3 - c2
    cpar = c5 - c4

    err = 0
    if( any(x.ne.x2) .or. any(x.ne.x3) ) err = 1

    print *, cseq, ' microseconds do'
    print *, ccpp, ' microseconds do cpp'
    print *, cpar, ' microseconds do concurrent'
    if(err .eq. 0) then
      print *, "SAXPY: Test PASSED"
    else
      print *, "SAXPY: Test FAILED"
    endif

end program

With that in an fpm project, I used the following flags:

build.sh
export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
export FPM_CXX=/opt/nvidia/hpc_sdk/Linux_x86_64/2023/compilers/bin/nvcc
export FPM_FC=/opt/nvidia/hpc_sdk/Linux_x86_64/2023/compilers/bin/nvfortran
fpm run --flag "-fast -stdpar=gpu -acc=gpu -gpu=cc80,cuda12.0,nomanaged -Minline -Minfo=accel"

And got the following results on an RTX A6000:

        29633  microseconds do
       122100  microseconds do cpp
         1745  microseconds do concurrent
 SAXPY: Test PASSED

Not sure whether the poor performance of the C++ implementation is due to a lack of inlining or something else…
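One way to narrow this down would be to time the kernel in pure C++, with no Fortran caller involved, and compare builds with and without optimization flags. The sketch below is only that: saxpy_ref mirrors the saxpy_cpp signature from above, and time_saxpy_us is a hypothetical helper. If an unoptimized build of this alone is already several times slower than the Fortran loop, the gap probably comes from how the C++ file is compiled and not from the missing inlining. As far as I understand fpm, --flag only reaches the Fortran compiler and C++ flags go through --cxx-flag or FPM_CXXFLAGS, but that is worth double-checking.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Same kernel and calling convention as saxpy_cpp in the post
// (hypothetical name to avoid clashing with it).
extern "C" void saxpy_ref(float x[], float y[], int& n, float& a)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Hypothetical helper: runs the kernel once on the host and returns the
// elapsed wall-clock time in microseconds (steady_clock is monotonic).
std::int64_t time_saxpy_us(std::vector<float>& x, std::vector<float>& y, float a)
{
    int n = static_cast<int>(y.size());
    const auto t0 = std::chrono::steady_clock::now();
    saxpy_ref(x.data(), y.data(), n, a);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

Calling time_saxpy_us on two vectors of 5e7 elements would give a number directly comparable to the "do cpp" line above, since the problem size and the kernel are the same.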

Their manual says:

4.2. Invoking Procedure Inlining
To invoke the procedure inliner, use the -Minline option. If you do not specify an inline library, the compiler performs a special prepass on all source files named on the compiler command line before it compiles any of them. This pass extracts procedures that meet the requirements for inlining and puts them in a temporary inline library for use by the compilation pass.

And regarding the restrictions on inlining:

The following types of C and C++ functions cannot be inlined:

  • Functions which accept a variable number of arguments

Certain C/C++ functions can only be inlined into the file that contains their definition:

  • Static functions
  • Functions which call a static function
  • Functions which reference a static variable
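To check my reading of these lists, here is a small sketch with one function per category, plus a plain saxpy that falls into none of them. All names are hypothetical, and the comments only restate the quoted restrictions; they say nothing about what the inliner does in practice.

```cpp
#include <cstdarg>

// Variable number of arguments: per the quoted list, cannot be inlined.
int sum_ints(int count, ...)
{
    va_list args;
    va_start(args, count);
    int total = 0;
    for (int i = 0; i < count; ++i)
        total += va_arg(args, int);
    va_end(args);
    return total;
}

// Static function: can only be inlined into the file that defines it.
static float scale(float a, float x) { return a * x; }

// Calls a static function: same file-local restriction.
float scale_add(float a, float x, float y) { return scale(a, x) + y; }

// References a static variable: same file-local restriction.
static int call_count = 0;
int next_call_id() { return ++call_count; }

// Fixed argument list, no static function or variable involved:
// none of the listed restrictions applies to this one.
extern "C" void saxpy_plain(const float* x, float* y, int n, float a)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The saxpy_cpp routine from the post looks like the last case, which is why the quoted restrictions alone do not seem to rule it out.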

Reading through it, I don’t see any explicit limitation that applies here, but I might be missing something… I would have to look at the generated assembly to get a better idea.
