Accessing the stride multiplier of an assumed-shape array

Consider the following group of ?axpy subroutines from BLAS (vector-scalar product and add):

call saxpy(n, a, x, incx, y, incy)   ! F77-style
call daxpy(n, a, x, incx, y, incy)   ! F77-style

call axpy( x, y [, a])               ! BLAS F95 convention

According to the Intel MKL documentation,

The Fortran 95 interface to BLAS and Sparse BLAS Level 1 routines is implemented through wrappers that call respective FORTRAN 77 routines.

I am wondering how is this possible without making array copies (and without using Fortran array descriptors from F2018)?

I’ve found one answer on Stack Overflow, Obtaining the stride of an Fortran Array, that suggests the non-standard extensions loc and sizeof can be used to recover the stride multiplier. Here’s a sketch of the solution using this language hack:

   subroutine blas95_saxpy( x, y, a )
      real(sp), intent(in) :: x(:)
      real(sp), intent(inout) :: y(:)
      real(sp), intent(in), optional :: a

      integer :: n, incx, incy
      real(sp) :: a_

      external :: saxpy ! F77 MKL routine

      a_ = 1.0_sp
      if (present(a)) a_ = a

      incx = 1
      incy = 1

      n = size(y)
   
      if (n > 1) then
         incx = (loc(x(2)) - loc(x(1))) / sizeof(x(1))
         incy = (loc(y(2)) - loc(y(1))) / sizeof(y(1))
      end if

      call saxpy(n, a_, x(1), incx, y(1), incy)
   
   end subroutine

Are there any other possibilities?

Here’s my attempt at a solution using C-Fortran array descriptors:

// f18_saxpy.c

#include <ISO_Fortran_binding.h>
#include "mkl.h" // Intel MKL 

void f18_saxpy(CFI_cdesc_t *x, CFI_cdesc_t *y, float *a) {

	int n, incx, incy;
	float a_;

	a_ = 1.0;
	if (a) 
		a_ = *a;

	n = y->dim[0].extent;

	incx = 1; 
	incy = 1;

	// Ideally, cast to CFI_index_t instead of int,
	// but the standard BLAS interface uses int's for indexing
	if (n > 1) {
		incx = (int) ( x->dim[0].sm / x->elem_len );
		incy = (int) ( y->dim[0].sm / y->elem_len );
	}

	cblas_saxpy(n, a_, (float *) x->base_addr, incx, (float *) y->base_addr, incy);

}

And here’s a driver program to test the routines:

! test_axpy.F90

module blas95

   implicit none
   private

   public :: axpy

   integer, parameter :: sp = kind(1.0e0)

   interface axpy
#ifdef CFI 
      subroutine f18_saxpy( x, y, a ) bind(c,name='f18_saxpy')
         import sp
         real(sp), intent(in) :: x(:) 
         real(sp), intent(inout) :: y(:)
         real(sp), intent(in), optional :: a
      end subroutine
#else
      module procedure blas95_saxpy
#endif
   end interface 

contains

   subroutine blas95_saxpy( x, y, alpha )
      real(sp), intent(in) :: x(:)
      real(sp), intent(inout) :: y(:)
      real(sp), intent(in), optional :: alpha

      integer :: n, incx, incy
      real(sp) :: alpha_
      
      external :: saxpy

      alpha_ = 1.0
      if (present(alpha)) alpha_ = alpha

      incx = 1
      incy = 1

      n = size(y)
   
      if (n > 1) then
         incx = (loc(x(2)) - loc(x(1))) / sizeof(x(1))
         incy = (loc(y(2)) - loc(y(1))) / sizeof(y(1))
      end if

      call saxpy(n, alpha_, x(1), incx, y(1), incy)
   
   end subroutine

end module

program test_axpy

   use blas95, only: axpy
   implicit none (type, external)

   integer, parameter :: sp = kind(1.0e0)

   real(sp) :: x(6), y(6), alpha
   integer :: i

   y = 0.0
   x = [1.0,2.0,3.0,4.0,5.0,6.0]

   call axpy( x, y )
   write(*,'(6(F11.6,4X))') y
   
   y = 0
   alpha = 3.0
   call axpy( x(1::2), y(::2), alpha )
   write(*,'(6(F11.6,4X))') y

   y = 0
   alpha = 2.0
   call axpy( x(2::2), y(2::2), alpha )
   write(*,'(6(F11.6,4X))') y

   y = 0
   call axpy( x(2::3), y(2::3), 1.0 )
   write(*,'(6(F11.6,4X))') y

   y = 0
   call axpy_second_element( x, y, 4.0 )
   write(*,'(6(F11.6,4X))') y

   y = 0
   call axpy_second_element( x(::2), y(::2), 4.0 )
   write(*,'(6(F11.6,4X))') y


contains

   ! Subroutine used to test sub-slicing
   subroutine axpy_second_element( x, y, alpha )
      real(sp), intent(in) :: x(:)
      real(sp), intent(inout) :: y(:)
      real(sp), intent(in), optional :: alpha

      call axpy( x(::2), y(::2), alpha )

   end subroutine

end program

I’ve only tested this with the Intel Fortran compiler (due to the convenient -qmkl flag), and following command:

$ icc -qmkl -c -Wall f18_saxpy.c 
$ ifort -DCFI test_axpy.F90 f18_saxpy.o -qmkl -o test_axpy 

Omit the -DCFI flag to test the Fortran implementation.


Here are instructions for the GNU compilers:

  1. Download cblas.tgz from the BLAS page at Netlib
  2. Extract the files to a folder, and locate cblas.h and cblas_f77.h from the include/ folder, and cblas_saxpy.c in the src/ folder. Copy the files into the folder containing the test_axpy.F90 driver.
  3. Replace mkl.h with cblas.h in the include section of f18_saxpy.c
  4. Add interoperable type declarations in the Fortran interface to silence gfortran warnings:
       subroutine f18_saxpy( x, y, a ) bind(c,name='f18_saxpy')
          use, intrinsic :: iso_c_binding, only: c_float
          real(c_float), intent(in) :: x(:) 
          real(c_float), intent(inout) :: y(:)
          real(c_float), intent(in), optional :: a
       end subroutine
    
  5. Compile the CBLAS wrapper of the F77 routine, the f18_saxpy.c wrapper of the CBLAS wrapper, and finally the F95+ wrapper of the wrapper wrapper. Don’t forget to link with the system BLAS library.
    $ gcc -c -Wall cblas_saxpy.c
    $ gcc -c -Wall f18_saxpy.c 
    $ gfortran -DCFI -Wall test_axpy.F90 f18_saxpy.o -o test_axpy -lblas
    

Amusingly, we have gone through 3 layers of indirection to achieve something which can be done easily in Fortran:

   subroutine mf_saxpy( x, y, a )
      real(sp), intent(in) :: x(:)
      real(sp), intent(inout) :: y(:)
      real(sp), intent(in), optional :: a

      real(sp) :: a_
      
      a_ = 1.0
      if (present(a)) a_ = a

      y = y + a_ * x   
   end subroutine

@ivanpribec , what you show with your f18_saxpy wrapper is the cleanest Fortran standard-based option to work with the commonly recommended assumed shape dummy argument approach.

“any other possibilities” will be non-standard, you’ve already shown one though the same approach is presumably viable with c_loc and c_sizeof.

It is unfortunate the abstraction at this level is inadequate to some e.g., it doesn’t guarantee AVX instructions but the user wants to enforce that in certain situations.

Thank you @FortranFan for the confirmation. One concern I have is in the statement

incx = (int) ( x->dim[0].sm / x->elem_len );

The elem_len is the storage size in bytes of an array element, while sm is the spacing between elements (the difference in bytes between successive elements). I presume that for intrinsic types (integer, real, complex), the available processors don’t have any padding between array elements. However I have been warned that for derived types this might not be that case. Is this interpretation correct?

It feels a bit odd, the standard would allow you to do something in C, but have no way to do it in Fortran. It feels almost if a bar was set, so that only users with sufficient knowledge of C (and hence a good understanding of pointers) would be able to tinker with strided access. The average Fortran programmer on the other hand, has to rely on the compiler making a contiguouos copy to get the job done.

Good question, one would have to grok the compiler “manuals” on storage semantics and conformance vis-a-vis the standard, etc. to be absolutely sure but you’re generally right - padding shouldn’t be a factor with intrinsic types.

Moreover with interoperable derived types also where each type component is by itself interoperable with the companion C processor, what you show with the relationship between the memory stride and the storage size in bytes of each element of the object should also hold i.e., any padding should get factored appropriately into the storage size provided by the C descriptor.