Pure wrappers for impure procedures

This is a fairly general question, but I’ll try to motivate it with a concrete example. I want to write a nice wrapper around LAPACK’s dgesv for solving a general linear systems of equations (Ax=b). This subroutine is impure because the matrix A gets overwritten with the LU factorization and the vector b gets overwritten with the solution. I typically do not want this behavior, so in my wrapper, I copy A and b into temporary arrays. In code:

module lapack
  implicit none
  private
  public :: solve

  integer, parameter :: dp = kind(1d0)

  interface
    subroutine dgesv(n, nrhs, a, lda, ipiv, b, ldb, info)
      integer, intent(in) :: n, nrhs, lda, ldb
      double precision, intent(in out) :: a(lda,*), b(ldb,*)
      integer, intent(out) :: ipiv(*), info
    end subroutine dgesv
  end interface

contains

  subroutine solve(a, b, x)
    real(dp), intent(in) :: a(:,:), b(:)
    real(dp), intent(out) :: x(:)

    integer :: n, info
    real(dp), allocatable :: atmp(:,:), btmp(:,:)
    integer, allocatable :: ipiv(:)

    n = size(x)
    if (any(shape(a) /= n)) error stop 'shape mismatch between A and x'
    if (size(b) /= n) error stop 'shape mismatch between b and x'

    atmp = a
    btmp = reshape(b, [n, 1])
    allocate (ipiv(n))
    call dgesv(n, 1, atmp, n, ipiv, btmp, n, info)
    if (info /= 0) error stop 'dgesv error'
    x = btmp(:,1)
  end subroutine solve

end module lapack

Now, from the caller’s perspective solve is a pure subroutine. However, it cannot be declared as such because of the internal call to dgesv. As a result, any procedures calling solve can’t be pure either.

There was a bit of related discussion in the thread Attribute for “pure” procedures that do I/O, with @rwmsu noting that an “impure block” construct within a pure procedure would be a useful language enhancement for this case. Lacking that, does the current language give us a viable workaround?

A pure function must have intent(in) arguments, but a pure subroutine need not. Thus a pure subroutine can overwrite its arguments. The following code compiles, for example.

module m
implicit none
contains
pure subroutine sub(x)
real, intent(in out) :: x
x = 2*x
end subroutine sub
end module m
1 Like

You can circumvent the purity requirements by manually specifying an interface to the impure procedure. I got an example on how to do this here.

Whether that’s a good idea is an interesting discussion for sure. As long as the public API has no side effects I’d say it’s OK, but it would be interesting to hear opinions from others as well.

1 Like

@nshaffer ,

Does DGESV (and its callees) do IO? And is that why you are asking about purity? Because, IIRC, LAPACK is mostly a collection of F77+few extensions-based subroutines that are technically pure otherwise even though they won’t be marked as such given their vintage. By “technically pure”, I mean in 2 ways: a) LAPACK does not use global or static data, let alone write to them and b) their purity would be consistent with the Fortran standard view where PURE SUBROUTINEs are allowed INTENT(*OUT) attributes.

So then have you considered working with Fortran stdlib volunteers to author an stdlib_lapack which takes the BSD license based LAPACK on netlib (or wherever it’s now) and transforms it suitably into a modern MIT-license based stdlib component where each and every Fortran subroutine in stdlib_lapack is indeed PURE? One can start “small” with first PURE attribute as widely as possible and later do other modernization aspects.

Cheers, @Beliavsky, I had not realized that intent(in out) is allowed in a pure subroutine. I am biased by functional languages’ definition of “purity”, which I think would not ordinarily include intent(in out) dummy arguments of subroutines (although, the concept of a “subroutine” is already outside what true functional languages deal with, I think).

In private messages, I was pointed to section 15.7 of the 2018 standard, which has two helpful notes at the end (emphasis mine)

NOTE 15.49

The above constraints are designed to guarantee that a pure procedure is free from side effects (modifications of data visible outside the procedure), which means that it is safe to reference it in constructs such as DO CONCURRENT and FORALL, where there is no explicit order of evaluation.

The constraints on pure subprograms appear to be complicated, but it is not necessary for a programmer to be intimately familiar with them. From the programmer’s point of view, these constraints can be summarized as follows: a pure subprogram cannot contain any operation that could conceivably result in an assignment or pointer assignment to a common variable, a variable accessed by use or host association, or an INTENT (IN) dummy argument; nor can a pure subprogram contain any operation that could conceivably perform any external file input/output or execute an image control statement (including a STOP statement). Note the use of the word conceivably; it is not sufficient for a pure subprogram merely to be side-effect free in practice. For example, a function that contains an assignment to a global variable but in a block that is not executed in any invocation of the function is nevertheless not a pure function. The exclusion of functions of this nature is required if strict compile-time checking is to be used.

It is expected that most library procedures will conform to the constraints required of pure procedures, and so can be declared pure and referenced in DO CONCURRENT constructs, FORALL statements and constructs, and within user-defined pure procedures.

NOTE 15.50

Pure subroutines are included to allow subroutine calls from pure procedures in a safe way, and to allow forall-assignment-stmts to be defined assignments. The constraints for pure subroutines are based on the same principles as for pure functions, except that side effects to INTENT (OUT), INTENT (INOUT), and pointer dummy arguments are permitted.

No, because I generally just wrap the few procedures I happen to need in the code I am working on. Writing a comprehensive set of wrappers for the LAPACK driver routines is a big, boring task that I don’t have the time or interest to undertake.

@nshaffer , it will likely take you significantly less time to simply take the LAPACK routines from netlib (or the current online source) and simply place them in MODULEs that then gives you ready interfaces, add the PURE attributes, fpm the package, and run some tests than you would with trying to wrap them and wonder with posts like these where the wrappers with PURE are inconsistent with the implementations.

2 Likes

Although, I agree with the statement, personal experience of having implemented explicit interfaces for BLAS/LAPACK, actually resulted in a measurable +5-7% increase in runtime (ceteris paribus). I have no good explanation on the matter other than, maybe more temporary array copies are happening behind the scenes with an explicit BLAS/LAPACK interface.

P.S. I would like to share a MWE for others to have a look but unfortunately it’s our matrix-free solver for radiation transport, which is export controlled so it’s unlikely I will be able to provide anything other than the BLAS/LAPACK interface (which is nothing special)

1 Like

That’s really strange, I find this really hard to believe and it’s contrary to our experience on Windows OS with IFORT.

On Windows using IFORT and its O2/O3 optimization, the teams I’ve worked with have noticed considerable improvement on the order of 20% or more with refactoring of FORTRAN 77 style code to just MODULEs hosting all the subroutines. With LAPACK style, I mean codebases that are FORTRAN 77 style one-.f file-per-external subroutine approach with no use of global or static data, there can be further performance gains. Then moving such code to MODULEs with hosted subroutines and perhaps due to the interprocedural optimization facility (Linux: -ipo) with `IFORT, performance sees a good improvement.

And I just tried a simple 1000x1000 Ax=B DGESV case and “ceteris paribus” on Windows: using IFORT with /O3 and /Qipo optimization, it’s showing >40% performance improvement.

Click to see the driver code
module runs_m

   use iso_fortran_env, only : I8 => int64
   use ieee_arithmetic, only : ieee_selected_real_kind
   use lapack_m, only : DGESV  !<-- This module hosts all LAPACK routines having `PURE` attribute

   integer, parameter :: WP = ieee_selected_real_kind( p=15, radix=2 )

   integer, parameter :: N = 1000
   integer, parameter :: NRHS = 1000

   real(WP), allocatable :: A_save(:,:)
   real(WP), allocatable :: X(:,:)
   real(WP), allocatable :: A(:,:)
   real(WP), allocatable :: BX(:,:)
   integer, allocatable :: ipvt(:)

contains

   subroutine setup( astat )
   
      integer, intent(out) :: astat

      allocate( A_save(N,N), stat=astat )
      if ( astat /= 0 ) return
      allocate( A(N,N), stat=astat )
      if ( astat /= 0 ) return
      allocate( X(N,NRHS), stat=astat )
      if ( astat /= 0 ) return
      allocate( BX(N,NRHS), stat=astat )
      if ( astat /= 0 ) return
      allocate( ipvt(N), stat=astat )
      if ( astat /= 0 ) return

      call random_number( A_save )
      call random_number( X )
       
   end subroutine 

   impure elemental subroutine run( t )

      ! Argument list      
      real(WP), intent(inout) :: t

      ! Local variables
      real(WP) :: t1, t2
      integer :: info
      integer :: lda, ldb

      A = A_save
      BX = matmul( A, X )
      lda = N ; ldb = N

      call my_cpu_time( t1 )
      call DGESV( n , nrhs,  a , lda , ipvt, bx, ldb , info )
      call my_cpu_time( t2 )

      t = t2 - t1
      ! assert the result
      if ( (abs(BX(1,1)-X(1,1))>0.01_wp).or.(abs(BX(N,NRHS)-X(N,NRHS))>0.01_wp) ) then
         t = huge(t)
      end if
      return 

   end subroutine
    
   impure elemental subroutine my_cpu_time( time )

      !.. Argument list
      real(WP), intent(inout) :: time

      !.. Local variables
      integer(I8) :: tick
      integer(I8) :: rate

      call system_clock (tick, rate)

      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )

      return

   end subroutine my_cpu_time

end module
! calling executive
    use runs_m
    integer, parameter :: ntrial = 12
    real(WP) :: times(ntrial)
    integer :: astat
    call setup( astat )
    if ( astat /= 0 ) stop "Setup failed."
    call run( times )
    print *, "CPU time (sec): ", (sum(times) - maxval(times) - minval(times))/(ntrial-2)  
end
! LAPACK module: only a snippet is shown below
      module lapack_m
      contains
..
      PURE SUBROUTINE dgesv( N, NRHS, A, LDA, IPIV, B, LDB, INFO )
*
*  -- LAPACK driver routine --
*  -- LAPACK is a software package provided by Univ. of Tennessee,    --
*  -- Univ. of California Berkeley, Univ. of Colorado Denver and NAG Ltd..--
*
*     Argument list
      INTEGER, intent(in) :: N
      INTEGER, intent(in) :: NRHS
      INTEGER, intent(in) :: LDA
      DOUBLE PRECISION, intent(inout) :: A( LDA, * )
      INTEGER, intent(out) :: IPIV(*)
      INTEGER, intent(in) :: LDB
      DOUBLE PRECISION, intent(inout) :: B( LDB, * )
      INTEGER, intent(out) :: INFO
      ...

Say with above, the CPU time is set as 1 sec, then the time with the case where DGESV is marked EXTERNAL and each of LAPACK subroutines are external procedures the time is ~1.41 sec. For reference, on one Windows 10 personal laptop with `Intel(R) Core™ i7-6820HQ @ 2.70GHz’ CPU, the measured CPU time of above shown code was about 0.6 seconds whereas it was 0.851 sec with the case where DGESV and its callees are external procedures.

I would thus encourage those working on stdlib to attempt stdlib_lapack, the biggest hurdle here might be what to with BSD license with LAPACK compared to MIT license with stdlib.

2 Likes

It’s also contrary to my experience with gfortran, albeit on a different code. Hosting the subroutines in modules and adding intent, led to a 15-20 % improvement in one case.

1 Like

Thanks @FortranFan and @ivanpribec for the MWE and comments. Out of curiosity how do you implement the LAPACK interface?

My approach is just declaring an interface in the module. Adding INTENT slightly improved performance but nothing like the stated 5-40% improvements.

module lapack_interfaces

   !< @brief
   !! Module containing explicit interfaces to LAPACK methods for the compiler
   !! to run diagnostics
   !! Please see the LAPACK documentation for more information

   implicit none
   interface
      pure subroutine dgesv(n, nrhs, a, lda, ipiv, b, ldb, info)
         !< @brief
         !! DGESV computes the solution to a real system of linear equations
         !! A * X = B,
         !! where A is an N-by-N matrix and X and B are N-by-NRHS matrices.
         !!
         !! The LU decomposition with partial pivoting and row interchanges is
         !! used to factor A as
         !! A = P * L * U,
         !! where P is a permutation matrix, L is unit lower triangular, and U is
         !! upper triangular.  The factored form of A is then used to solve the
         !! system of equations A * X = B.
         INTEGER, intent(in) :: N
         INTEGER, intent(in) :: NRHS
         INTEGER, intent(in) :: LDA
         DOUBLE PRECISION, intent(inout) :: A( LDA, * )
         INTEGER, intent(out) :: IPIV(*)
         INTEGER, intent(in) :: LDB
         DOUBLE PRECISION, intent(inout) :: B( LDB, * )
         INTEGER, intent(out) :: INFO
      end subroutine dgesv
   end interface
end module lapack_interfaces

The other weird thing that we sometimes do is that we pass arrays, similar to C; start pointer + how much you need to read. So say we only want to operate in a section of A, we do A(start1, start2). I am not sure if doing A(start1:end1, start2:end2) leads to array copies (that previously were not present), ultimately leading to performance loss.

At this point the performance difference is just academic curiosity from my part.

PS Compilation is made with the appropriate flags e.g. -O3 -march=native or -O3 -ipo

In my case, and from what I understand also @FortranFan’s, the benefits are noticeable if you put the full procedures in a module and not just the interface. This way the compiler might perform more aggressive inlining and dead code elimination than if the subroutines were in separate compilation units. I am guessing that @FortranFan’s results pertain primarily to the reference (or “Netlib”) LAPACK.

However, in case of LAPACK we should be cautious about making generalizations, since optimized versions like OpenBLAS, ReLAPACK, and Intel MKL LAPACK also make certain algorithmic changes compared to the reference Fortran LAPACK. If you take a look at the OpenBLAS repository, it has lapack-netlib/ and lapack/ subfolders. In the latter, a subset of the most commonly used LAPACK routines is actually re-implemented (in C) for better performance.

1 Like

You can use the Intel Fortran Classic compiler flag -check arg_temp_created or the GNU Fortran -fcheck=array-temps to verify there are no temporary copies being made. I remember having similar issues once, but have forgotten the details.

1 Like

Thanks for the suggestion. I recall back when I looked into this, 1.5y ago, I used -fcheck=array-temps to check for temps but nothing popped up.