Linear algebra support in stdlib

Hello Fortran Discourse readers,

I’m initiating this thread to gather your thoughts and insights on the upcoming task of integrating linear algebra operations into stdlib. Here are the key developments we’d like to introduce:

  1. Accessible Interfaces for Common Linear Algebra Operations:

    • We want stdlib to handle linear algebra tasks efficiently by leveraging libraries like BLAS, LAPACK, and SCALAPACK. The goal is to create a user-friendly interface, inspired by the syntax of libraries like scipy, while also providing an expert interface with fine-tuning options (e.g. stored in derived types).
    • We’re debating whether to develop our own Modernized BLAS/LAPACK at fortran-lang or automate the download process from netlib. I believe maintaining our own modernized version would avoid us to stumble on the same problems scipy is having with old F77 code down the road, and give the world a reference Modern Fortran implementation for the next 50 years; however, it is also a pretty intimidating amount of work. Also, hardware-tuned custom BLAS/LAPACK implementations are blooming, so, I’m not sure it would be all time well spent. Your input on this decision would be valuable.
  2. Support for Common IO Formats for Matrices and Tensors:

    • The aim is to support serialization/deserialization and conversion among formats like NPZ and matrixmarket for multidimensional arrays. We also plan to provide an easily extendable API for plugins, accommodating other formats outside the stdlib scope.
    • We’re considering the development of derived types to store temporary information beyond matrix data, optimizing performance during repeated algebraic operations.

We’re keen to receive your feedback, ideas, and criticisms to ensure we will give stdlib the best, most practical and effective linear algebra API that will make people move away from other languages in favor of Fortran! Please share your thoughts on this page, and let’s work together on this task.

Thank you,
Federico

15 Likes

From personal experience, I would say this step would be a big one in making the Fortran ecosystem as self-sufficient as possible. Specially for point 1.
Regarding the modernisation of BLAS/LAPACK, it is indeed something that needs to be done if we don’t want it to be replaced by some other (non Fortran) implementations. However, I do think it’s something that should happen in parallel to this delivery. I’d start by firstly using the current implementation as backend. Otherwise, this shipping would risk to be lost in the time needed to actually get to a very stable implementation of the modernised version. On the other hand, I do also ackonwledge that this might indeed affect the acual API or requiring some porting, but IMO, the added value of having soon a working version using the old code base vs. waiting is indeed on the positive side.

Lastly, just thinking out loud.
Adding a 3rd point about some basic Visualisation Framework*, then I think you covered at least half of the applications for which other well-known languages/platforms are used. Which could really launch Fortran to the stars. But yes, this is another no-joke task (maybe too much?).

*maybe starting gathering the scattered Fortran repo trying to cover some aspects of this matter?

3 Likes

Regarding this point I think that it would be necessary to have a simultaneous discussion on the OOP API for linear algebra. Why? It would enable hiding the actual kernel implementations in backstage, so at the beginning use the standard well-known ones, enable replacing by vendor specific ones, or implement new ones respecting the user-interface that stdlib should decide.

In this regard, I’ll put more focus right now on the user API, and later on thinking about new kernel implementations, as there are already many from where to choose from.

This is a very good idea, it would facilitate enormously the job of testing new ideas!! I had that idea also for FSPARSE, at least for MatrixMarket and Harwell-Boeing, if they are added in stdlib that would be an excellent step forward :slight_smile:

I would not neglect the importance of having an API that can be compatible with SPMD execution (distributed memory under MPI or Coarray). Or at least, that can be extended easily and not block this use-case. This would be decided by the data-containers API.

2 Likes

Thank you @mEm @hkvzjal,

Yes, I think you’re right: modernization and user-friendly API should grow in parallel, to avoid getting stuck in the modernization effort if that turns out to take longer than expected.

Regarding the matrix API, I’m a heavy OOP user, but the way I see it here is two-fold: for the more common functions, a more Fortranic/functional API would work best imho, for two reasons:

  1. We can overload the language intrinsics like matmul, dot_product with the sparse types, and provide additional ones like solve, eig, etc.
  2. Generic interfaces are resolved at compile time and hence faster

I think OOP would work better for the expert interface like provide settings and customize matrix data. E.g., one may want to incrementally edit a sparse matrix, change bandwith, etc. in which case some form of (i,j) access is easier from the user standpoint, although of course slow.

2 Likes

Here is where I slightly disagree. I think that a decently designed OOP interface would actually empower the non-expert to use the API “out-of-the-shelf” without needing to know how things are handle backstage. And only when one wants to tweak or go beyond, then one would look into the implementation details. That’s kind of the magic with SciPy sparse. You can do a lot of stuff without ever looking at the actual structure of the data, so a good way of getting acquainted with methods in a top-down approach.

These are indeed kind of limit cases from the efficiency point of view, but honestly, I don’t think those are a must for the API. If the data-handler structure are based on allocatables, that can evolve later on, or the user can build on top easily.

2 Likes

An issue with with BLAS/LAPACK and actually with many legacy libraries is that all dummy arrays are contiguous by design (either assumed size or explicit shape). Working on non-contiguous matrices/vectors is handled by the LD* and INC* arguments, without the need of temporary copies. If you want to build wrappers with assumed shape dummy arrays and without the LD*/INC* arguments, then you are facing the problem of potential copy-in/copy-out to cope with the non-contiguity.

Intrisics like MATMULT have access to the internal array descriptors, and can use the strides to derive the appropriate LD*/INC* arguments in the BLAS/LAPACK calls, but in standard Fortran it’s not possible.

What are your thoughts about that ?

1 Like

I think we don’t want to lose that capability at least for the expert interface.
I have never thoroughly explored a solution, but I believe that with the new standard, the stride may be read back into Fortran from a C routine that calls a C-Fortran descriptor on the input array:

1 Like

Indeed, this should do.

My request would be for a simple subroutine/function interface rather than OOP based.

Here is a C example from libflame for performing a Cholesky factorization:

#include "FLAME.h"
int main( void )
{
double* buffer;
int m, rs, cs;
FLA_Obj A;
// Initialize libflame.
FLA_Init();
// Get the matrix buffer address, size, and row and column strides.
get_matrix_info( &buffer, &m, &rs, &cs );
// Create an m x m double-precision libflame object without a buffer,
// and then attach the matrix buffer to the object.
FLA_Obj_create_without_buffer( FLA_DOUBLE, m, m, &A );
FLA_Obj_attach_buffer( buffer, rs, cs, &A );
// Compute the Cholesky factorization, storing to the lower triangle.
FLA_Chol( FLA_LOWER_TRIANGULAR, A );
// Free the object without freeing the matrix buffer.
FLA_Obj_free_without_buffer( &A );
// Free the matrix buffer.
free_matrix( buffer );
// Finalize libflame.
FLA_Finalize();
return 0;
}

This seems to be unnecessarily complicated compared the to original LAPACK interface:

program main
implicit none
integer n,info
real(8), allocatable :: a(:,:)
...
allocate(a(n,n))
...
call dpotrf('L',n,a,n,info)
...
deallocate(a)
end program

What would be a welcome improvement over the standard BLAS/LAPACK interface is the use of optional arguments, at which Fortran excels. For example:

call stdlib_dpotrf(uplo='L',a=a)

The stdlib interface would then infer the value of the missing arguments (n,lda and info in this case). Similarly

call stdlib_zheevd(jobz='V',uplo='U',a=a,w=w,info=info)

would fill in the missing arguments and create all the required work arrays. This would make for a simpler interface for new users, but advanced users could pass their own work arrays, viz.:

call stdlib_zheevd(jobz='V',uplo='U',a=a,w=w,work=work,rwork=rwork,iwork=iwork)

which would allow for repeated calls without allocating new arrays.

5 Likes

Since subroutines that overwrite input arguments are error-prone, I think there should also be a pure function to compute the Cholesky decomposition, perhaps patterned after chol in Matlab. At least if the user writes

A = chol(A)

it is clear that A is being overwritten.

3 Likes

Thank you @jkd2022 @Beliavsky, I like both approaches to the interface. the aim is exactly to make it as easy as possible for the user. So I think that the high-level API should look like what @Beliavsky is showing: easy, simple syntax that asks nothing to the user, like Matlab/NumPy/SciPy, then extending from it for the expert interface.

@jkd2022 I like optional keywords but I think someone pointed out that they may be slow, because they’re evaluated at runtime? So I’m thinking whether generic interfaces, or a settings structure, may work better? They would be evaluated at compile time

! Structure returned from the expert interface
type :: chol_factorization
   logical :: upper_triangle = .true.
   integer :: status = 0 ! info
   real(RK), allocatable :: A(:,:) ! the factored matrix
end type chol_factorization

so that we can pass the LAPACK routine a non-factorized matrix, assuming it will have to be factorized on-the-fly:

eigenvalues = eig(A) ! eigenvalues only, factorize internally
call eig(A,values,vectors) ! eigenvalues + eigenvectors

where either

real(RK) :: A(:,:) ! use non-factorized matrix

or

type(chol_factorization) :: A ! use factorized matrix data

The second case would avoid re-allocations and re-factorizations, what do you think?

This looks a bit complicated. I think OOP in this case can be useful for the sparse matrices, as they need special containers… everything else should be kept “functional”.

So, whether we talk about matmult, dot_product, eig, chol, or any other functionality I think the best thing would be simple input/output strategy. In the case of dense matrices the OOP aspect can be fully avoided, but with the proper generic approach or interfaces, the procedures can distinguish at compile time whether the input is a

real(real64), allocatable :: A(:,:) !> dense matrix ... shall it be wrapped within a type??

or something like

type(COO) :: A !> where A contains all necessary info an data

Then, in both cases I should be able to do something like

B = chol( A )
!> OR
call chol( A , B ) !> output in the last position

having defined the right container for receiving the result, which shall be the user’s responsibility.
EDIT

!> maybe better the output at the beginning
call chol( B , A , ... options ... ) 
!> in the case of using optional arguments they have to be the last inputs 

I also like optionals, it is true that they have a small overhead to verify the presence of the optional argument, but is it that slow? not sure, I guess the answer is “it depends” (on the ratio of processed data vs number of calls). Anyhow, better to go for interfaces to handle the optional arguments if this looks like a true bottleneck.

3 Likes

Yes, I agree with basically everything. I’d say now that we should start seeing what the first implementation comes up like and see if we do/don’t like it.

For example, the first problem will be to consistently define names for the function vs. in-place subroutine version, as Fortran does not allow mixed function/subroutine interfaces:

B = chol(A, uplo=.true., etc.)
call chol_inplace(A, B, uplo=.true., etc.) ! find a good naming convention
1 Like

I think having an explicit interface plays to Fortran’s strengths. stdlib could also check that each passed array has the expected size and shape. Thus if a work array is too small the library could stop and report an error. Currently, BLAS/LAPACK simply assumes the arrays are large enough and have the correct dimensions. This is can be a real problem for inexperienced users.

Also, one more request I have is to be able to locally define the number of threads a BLAS or LAPACK operation can use. This can be done with MKL, for example from our code:

nts=mkl_set_num_threads_local(nthd)
call zhegvx(...)
nts=mkl_set_num_threads_local(0)

This actually has pretty decent scaling with thread number.

stdlib could do the same, for example:

call chol_inplace(num_threads=10, A, B, uplo=.true., etc.)

or

call chol_inplace(use_gpu=.true., A, B, uplo=.true., etc.)

if a GPU is available.

2 Likes

While I have no dogma here, I think that the subroutine version would be more robust in terms of data management. Even more as the l.h.s can be an allocatable or a derived type. This should not block having a dynamic management of the memory, say:

Case 1: the user did not allocate B, the subroutine could use the information from A to allocate B.
Case 2: the user did allocate B, the subroutine shall assume that the user allocated it properly, check dimensions and throw an error otherwise.

This question was discussed in the thread Testing whether an argument is PRESENT is very fast .

4 Likes

Nice @Beliavsky, so up-vote for the optional args approach :wink:

1 Like

I am posting the related GitHub Issue below, but it would be more convenient if we kept the discussion here (like we have done until now).

3 Likes

You can look at IMSL for ideas about the interface. They use optional arguments heavily. For example, for lin_sol_gen, which

Solves a general system of linear equations Ax = b. Using optional arguments, any of several related computations can be performed. These extra tasks include computing the LU factorization of A using partial pivoting, representing the determinant of A, computing the inverse matrix A-1, and solving ATx = b or Ax = b given the LU factorization of A.

some sample calls are

! Compute the solution matrix of Ax=b. 
      call lin_sol_gen(A, b, x)
! Compute the matrix inverse and its determinant. 
      call lin_sol_gen(A, b, x, nrhs=0, & 
                ainv=inv, det=determinant)
1 Like

Nice discussion! Did you consider the BLAS/LAPACK Fortran 95 API conventions. Although I don’t like to use them (often issues with large matrices) the interfaces are quite straighfoward. Building on top of them for other procedures (Cholesky decomposition, inverse, …) could be useful.

1 Like