Writing wrappers for LAPACK and BLAS routines

OK, this makes sense. I am using ifort version 2021.9.0.

ifort has the -heap-arrays option, which forces temporary arrays to be allocated on the heap and should solve the issue. But it would be better to write the code so that no copy is ever needed; adding the contiguous attribute throughout the call chain should do the trick.

I agree with the other responses in this thread about trying the contiguous array attribute.

I just wanted to add that the reason compilers use stack memory for temporary arrays rather than heap memory is that stack memory is very fast to grab. It is almost free, in fact, compared to heap allocation. The problem is that the stack is often limited in size by the compiler and/or by the underlying operating system. Two common options are to increase the stack size or to force heap allocation of temporary arrays. If you use compiler options to force heap allocation, you may see a performance hit. Of course, if your code runs with heap but not with stack (even after increasing the stack size), then it doesn’t matter what the hypothetical performance difference would have been.

I think there should be an option that directs the compiler to first attempt stack allocation and, if that fails, to fall back to heap allocation for temporary arrays. That should be at least a compile-time option, and even better if it could also be enabled at run time. The Fortran standard does not specify how memory is allocated, or the features of stacks and heaps, so it is unlikely to ever include this capability; it would probably only be a compiler-specific feature.

That’s my biggest complaint against Fortran, compared to C/C++ where you at least have a way of making sure whether something goes to the heap or the stack. But explicitly using the flag -heap-arrays seems to lower the overall performance.

I am actually running this on a larger machine (a node with 2 CPUs) with about 500 GB of memory and intend to run it with much larger matrices (>50k). However, there is another limitation even with contiguous arrays, since the 500 GB is divided into 250 GB per CPU with NUMA. In that case, the only option to scale this up would be to store the matrices on disk.

You have this option in Fortran too: don’t pass an array expression to a subroutine. Instead of handing the expressions directly to zmul_mv as actual arguments, assign them to an allocatable temporary first, i.e.

            kt = yi + (dt*0.5d0)*k1
            call zmul_mv(func, kt, k2)
            kt = yi + (dt*0.5d0)*k2
            call zmul_mv(func, kt, k3)
            kt = yi + (dt*1.0d0)*k3
            call zmul_mv(func, kt, k4)

and now you can be certain it is using heap memory obtained with an allocate statement, and not left to the judgement of the compiler.

How about using Parallel BLAS?

! Standard BLAS
call  zgemv(trans, m, n, alpha, a, lda, &
                                x, incx, &
                          beta, y, incy)

! Parallel BLAS
call pzgemv(trans, m, n, alpha, a, ia, ja, desca, & 
                                x, ix, jx, descx, incx, &
                          beta, y, iy, jy, descy, incy)

You will need to establish the distributed arrays and descriptors at first, but otherwise your code can remain mostly the same.
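
For orientation, a rough sketch of that setup might look like the following; the 2x2 process grid and the 64x64 block sizes are placeholder choices of mine, and m, n are assumed to be the global matrix dimensions already declared elsewhere:

! Hedged sketch of the PBLAS prerequisites: a BLACS process grid plus array
! descriptors for A (m x n), x (length n) and y (length m).
integer :: ictxt, nprow, npcol, myrow, mycol, info
integer :: desca(9), descx(9), descy(9)
integer :: mloc, nloc
integer, parameter :: mb = 64, nb = 64    ! block sizes (tunable)
integer, external :: numroc

call blacs_get(-1, 0, ictxt)              ! get the default system context
nprow = 2; npcol = 2                      ! e.g. a 2x2 process grid
call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)

mloc = numroc(m, mb, myrow, 0, nprow)     ! local rows of A on this process
nloc = numroc(n, nb, mycol, 0, npcol)     ! local columns of A on this process

call descinit(desca, m, n, mb, nb, 0, 0, ictxt, max(1, mloc), info)
call descinit(descx, n, 1, nb, 1, 0, 0, ictxt, max(1, numroc(n, nb, myrow, 0, nprow)), info)
call descinit(descy, m, 1, mb, 1, 0, 0, ictxt, max(1, mloc), info)

The local pieces of a, x and y are then allocated with those local sizes and filled before calling pzgemv.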

More info:

If the allocation overhead is significant, then this is a good way to avoid it. If the maximum size of the kt array is known ahead of time, it can be allocated once and then reused for the entire run. When array expressions are used as the actual arguments, the compiler might be allocating a new temporary array for each expression evaluation and subroutine call. If the effective size changes during the run, then be sure to use slice notation on the left-hand side of the assignments to avoid the reallocation-on-assignment that would otherwise occur.
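
A sketch of that idea, reusing the names from the thread (kt, yi, k1, k2, dt, func, zmul_mv) and assuming complex(8) arrays and a known upper bound nmax:

complex(8), allocatable :: kt(:)

allocate(kt(nmax))                        ! one allocation for the whole run

! later, inside the integration loop, with the current effective size n <= nmax:
kt(1:n) = yi(1:n) + (dt*0.5d0)*k1(1:n)    ! slicing the lhs writes into the existing
call zmul_mv(func, kt(1:n), k2)           ! storage instead of reallocating kt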

It appears to be constant at least for the duration of the subroutine, based on the code @fracton posted above in #19.

I noticed this issue a few weeks ago while I was taking a closer look at inline expressions in rklib via Compiler Explorer,

        call me%f(t,       x,f1)
        call me%f(t+a2*h,  x+h*(b21*f1),f2)
        call me%f(t+a3*h,  x+h*(b31*f1 + b32*f2),f3)
        call me%f(t+a4*h,  x+h*(b41*f1 + b42*f2 + b43*f3),f4)
        call me%f(t+h,     x+h*(b51*f1 + b52*f2 + b53*f3 + b54*f4),f5)
        call me%f(t+a6*h,  x+h*(b61*f1 + b62*f2 + b63*f3 + b64*f4 + b65*f5),f6)

and noticed that gfortran would create a bunch of calls to malloc and free (i.e. heap), whereas ifort had no mallocs (stack).


Thanks for the suggestion, I’ll check that out.

Yup, I found something similar while I was checking for some memory leaks with valgrind. The number of allocation calls per iteration of the code block was very different between gfortran and ifort. While my initial plan was to use kt as a temporary, for some reason I missed using it in the code I posted earlier.

In #26 you mentioned PBLAS and ScaLAPACK. However, is there a simpler way to parallelize the matrix-vector multiplication step?

While it would significantly add to the computational overhead, I am thinking that calculating the matrix elements on the fly could help scale this problem. It would then effectively reduce to row-vector multiplication steps parallelized with OpenMP, as sketched below. This seems especially attractive if the matrix size (n x n) is of the order of millions (n ~ 100 million).
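
Something like the following is what I have in mind (func_elem is a hypothetical function returning the (i,j) matrix element; x, y and s are assumed complex(8)):

! On-the-fly matrix-vector product: each row is generated and contracted with x
! immediately, so the full n x n matrix is never stored.
!$omp parallel do private(j, s)
do i = 1, n
   s = (0.0d0, 0.0d0)
   do j = 1, n
      s = s + func_elem(i, j)*x(j)
   end do
   y(i) = s
end do
!$omp end parallel do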

Also, the func matrices in my code are mostly sparse, I am not sure how to take advantage of that.

But C/C++ doesn’t have array syntax (which includes a very flexible way to pass arrays, with sections, etc.): array syntax (generally speaking) potentially generates temporary arrays, and since they are hidden from the user there can hardly be a way in the language to choose where they are allocated.

I think that gfortran allocates them on the stack up to a certain size, and on the heap otherwise. But it’s not that clear.

Note that even in Fortran you have some choice: usually, allocatable arrays are allocated on the heap, while automatic arrays (the ones that are dimensioned upon entering a routine) are allocated on the stack.
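
A quick illustration of the typical behaviour (the standard itself says nothing about stacks or heaps, so this is only what compilers usually do):

subroutine work(n)
   integer, intent(in) :: n
   real :: tmp_auto(n)            ! automatic array: typically placed on the stack
   real, allocatable :: tmp_heap(:)
   allocate(tmp_heap(n))          ! allocatable array: typically placed on the heap
   tmp_auto = 0.0
   tmp_heap = 1.0
end subroutine work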

NUMA is mostly an orthogonal issue to the issue we are talking about here…


I mean, you’re right; the problem is that making sure the arrays are stored on the heap only solves part of the problem. I am not sure about this, but does forcing arrays to be contiguous put a constraint on the matrix size?

I am still trying to scale this up to use it with matrices that are much larger. Another way would be to store the array on disk. For example, in numpy one can use numpy.memmap() to store large arrays on disk and access them on the fly.

The contiguous attribute does not play a role here. It is a “contract” with the compiler that there will be no “gaps” in an array (be that some locally declared variable or the argument to a subroutine or function). The language introduced this notion because Fortran arrays can have non-unit stride:

! contiguity.f90
real, target :: a(20)
real, pointer :: b(:)
real, pointer, contiguous :: c(:)

b => a(1::2)  ! the odd elements
c => a(10:)   ! elements 10 to 20 (contiguous)
print *, is_contiguous(b), is_contiguous(c)
end
$ ifort contiguity.f90 
$ ./a.out
 F T

If you try to associate the pointer c with a non-contiguous target, some compilers will raise an error. For example if we try to set c => a(2::2) (c points to every even element), you would get an error with ifort:

$ ifort contiguousness.f90 
contiguousness.f90(8): error #8375: Array section is not contiguous if it has stride other than unity.
c => a(2::2)
-------^
contiguousness.f90(8): error #8371: If pointer is declared CONTIGUOUS, target must be contiguous as well.   [A]
c => a(2::2)
-----^
compilation aborted for contiguousness.f90 (code 1)

By adding contiguous, you are making a promise to the compiler that the elements of the array will be laid out tightly in memory, separated by exactly the width of the declared type. When you pass an array to a subroutine that expects a contiguous entity, the compiler will automatically take care that a contiguous copy is made if needed.
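
A small, self-contained illustration of that copy-in/copy-out behaviour (the names here are mine, not from the thread):

program contiguous_dummy
   implicit none
   real :: a(10)
   a = 1.0
   call scale_inplace(a)          ! whole array is contiguous: no copy needed
   call scale_inplace(a(1::2))    ! strided section: the compiler passes a temporary
                                  ! contiguous copy and copies it back afterwards
   print *, a
contains
   subroutine scale_inplace(v)
      real, contiguous, intent(inout) :: v(:)
      v = 2.0*v
   end subroutine scale_inplace
end program contiguous_dummy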

What numpy can do, Fortran ought to be able to do too. Check the documentation of your OS’s mmap function (mmap(2) - Linux manual page).
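
To give an idea of what that could look like, here is a rough sketch of my own (all of it is an assumption, not from the thread): the flag values are the Linux ones, the file matrix.bin and its dimension n are made up, and the non-variadic binding to open(2) is a simplification; a tiny C shim would be the portable way to obtain the file descriptor.

program mmap_sketch
   use, intrinsic :: iso_c_binding
   implicit none

   interface
      function c_mmap(addr, length, prot, flags, fd, offset) bind(c, name="mmap")
         import :: c_ptr, c_size_t, c_int, c_long
         type(c_ptr), value :: addr
         integer(c_size_t), value :: length
         integer(c_int), value :: prot, flags, fd
         integer(c_long), value :: offset
         type(c_ptr) :: c_mmap
      end function
      function c_open(path, oflag) bind(c, name="open")
         import :: c_char, c_int
         character(kind=c_char), intent(in) :: path(*)
         integer(c_int), value :: oflag
         integer(c_int) :: c_open
      end function
   end interface

   integer(c_int), parameter :: O_RDONLY = 0, PROT_READ = 1, MAP_SHARED = 1
   integer, parameter :: n = 1000             ! assumed matrix dimension
   integer(c_size_t) :: nbytes
   integer(c_int) :: fd
   type(c_ptr) :: p
   real(c_double), pointer :: a(:,:)

   nbytes = int(n, c_size_t)**2 * c_sizeof(1.0_c_double)

   fd = c_open("matrix.bin"//c_null_char, O_RDONLY)   ! file must already exist
   if (fd < 0) error stop "open failed"

   p = c_mmap(c_null_ptr, nbytes, PROT_READ, MAP_SHARED, fd, 0_c_long)
   call c_f_pointer(p, a, [n, n])              ! view the mapped file as an n x n array

   print *, a(1,1), a(n,n)                     ! pages are fetched from disk on demand
end program mmap_sketch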

That said, I’m still confused by the architecture of the machine you’d like to run this on. Do the CPUs share the same address space? If yes, you don’t need to use a distributed programming model but you’ll probably need to pay attention to the first-touch memory placement to get good performance from your GEMV and GEMM, i.e. making sure that memory is allocated on both NUMA nodes (typically done by proper initialization). Here is a slide from an OpenMP presentation by Oracle:

Now that you mention this, it’s a totally different story. You don’t need to multiply and store what are mostly zeros. Don’t use BLAS or PBLAS in that case. You have Sparse BLAS for this purpose:

librsb is used in some Lattice QCD software. It performs best for symmetric matrices and/or multiple right-hand sides.

Edit: depending on how expensive it is to compute the entries of your matrix func, it could also be an option to compute them on the fly. This is known as the matrix-free approach: instead of storing M as a 2-D array of elements and forming Mx explicitly, you work with M as a linear operator. For example, multiplying a vector by a diagonal matrix M = \mathrm{diag}(d_1,...,d_n) can be done in O(n) rather than O(n^2).
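
A tiny illustration of the diagonal case (my own generic example, not from the thread):

! Apply M = diag(d) as an operator: O(n) work and storage, and the n x n matrix
! is never formed.
subroutine apply_diag(d, x, y)
   real(8), intent(in)  :: d(:), x(:)
   real(8), intent(out) :: y(:)
   y = d*x
end subroutine apply_diag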

Comment: it’s interesting how this genuinely nice thread turned into an XY problem. This is why it’s good to be more descriptive when posting a question than just saying “numerical calculations”.


Here’s an example using MKL Sparse BLAS to multiply a sparse symmetric matrix specified in triplet format (also known as COOrdinate format), that you can perhaps adapt to your needs:

! This has to be placed in one program unit only
include 'mkl_spblas.f90'

! To compile the program it suffices to do:
!
! $ source /opt/intel/oneapi/setvars.h
! $ ifort -warn all -O2 -xHost -qmkl spmv_mkl.f90
!
program spmv_mkl

use, intrinsic :: iso_c_binding
use mkl_spblas

implicit none

integer, parameter :: dp = c_double

real(dp), allocatable :: va(:), x(:), y(:), d(:)
integer(c_int), allocatable :: ia(:), ja(:)
integer(c_int) :: n, nnz

type(sparse_matrix_t) :: A
type(matrix_descr) :: Adescr
real(dp) :: alpha
integer(c_int) :: stat

n = 4
nnz = 8

allocate(x(n),y(n),d(n))
allocate(va(nnz),ja(nnz),ia(nnz))

! sparse matrix coefficients
va = [1.0_dp, 2.0_dp, 3.0_dp, 4.0_dp, &
              5.0_dp,         6.0_dp, &
                      7.0_dp,         &
                              8.0_dp]

! storage pattern
ia(:) = [1,1,1,1,2,2,3,4]
ja(:) = [1,2,3,4,2,4,3,4]

! Create matrix from COO entries
stat = mkl_sparse_d_create_coo(A,SPARSE_INDEX_BASE_ONE,n,n,nnz,ia,ja,va)
if (stat /= SPARSE_STATUS_SUCCESS) error stop "mkl_sparse_d_create_coo failed."

Adescr = matrix_descr(type=SPARSE_MATRIX_TYPE_SYMMETRIC, &
                      mode=SPARSE_FILL_MODE_UPPER, &
                      diag=SPARSE_DIAG_NON_UNIT)

! multiplicand vector
x = 1

alpha = 1.0_dp

! y = alpha A * x + beta * y
stat = mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE, &
        alpha,A,Adescr,x,beta=0.0_dp,y=y)
if (stat /= SPARSE_STATUS_SUCCESS) error stop "mkl_sparse_d_mv failed."

write(*,'(A,/,1(F10.3,2X))') "Result vector = ", y

stat = mkl_sparse_destroy(A)
if (stat /= SPARSE_STATUS_SUCCESS) error stop "mkl_sparse_destroy failed."

end program

Thanks for sharing this example.

On a side note, just a quick question: wouldn’t parallelizing those array additions using daxpy or zaxpy be more efficient?

Because just passing array pointers to wrapper functions reduces the memory overhead. Then all that would be stored on the stack is the pointers to the original arrays.

What type of parallelization are you implying? Over the array index space 1:N, or the fact you could compute partial sums in parallel?

In the case of the first one, absolutely: if your arrays are big enough, distributing the work across multiple threads/machines could give a speed-up. This is why the SUNDIALS package has multiple NVector classes, depending on the type of parallel hardware or software library you are trying to exploit. Using a library where daxpy is vectorized or multi-threaded would be an example of this. But generally, axpy is a simple kernel, and Fortran compilers vectorize it well automatically.
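
To make that concrete, the same stage update can be written in a few equivalent ways (a sketch only; kt, yi, k1, dt and n are the names from the thread, assumed complex(8), and the loop index i must be declared):

kt = yi + (dt*0.5d0)*k1                  ! 1) array syntax: usually vectorized already

kt = yi                                  ! 2) BLAS: zaxpy computes kt := alpha*k1 + kt
call zaxpy(n, cmplx(dt*0.5d0, 0.0d0, kind=8), k1, 1, kt, 1)

!$omp parallel do                        ! 3) explicit OpenMP loop over the index space
do i = 1, n
   kt(i) = yi(i) + (dt*0.5d0)*k1(i)
end do
!$omp end parallel do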

In the case of the second type - doing parts of the sum in parallel - there are some examples of doing this in Fortran. For example, Damian @rouson has worked on an “Abstract Calculus Pattern”, where you build a mini “operator” language, such as:

class(tensor) :: ut, u
real :: nu = 1.0
ut = nu * (.laplacian.u) - (u.dot.(.grad.u))

Although simple in appearance, the operators hide an asynchronous parallel execution, as shown by this figure copied from his article:

[figure from the article]

That would be desirable, IMO, and the workaround above using the kt as temporary storage achieves this. I assume the author of rklib is using it on problems of modest size, where memory or performance may not be such a concern.

In itself, it doesn’t. The point here is that BLAS routines take contiguous arrays as arguments, and your initial arrays are contiguous by design, so what you want is to avoid any temporary copy in between. In principle, a smart enough compiler should be able to avoid any copy as long as you pass the entire arrays (i.e. no array section and no expression), but you have to rely on the compiler.

What I don’t understand in your case is why introducing the contiguous attribute seems to trigger temporary copies that were not occurring before (?)…

Memory-mapped files would be desirable in Fortran IMO, but 1) they have a performance penalty, and 2) I can’t see how they would help here.

Ultimately, we shouldn’t forget the bigger picture. What is the goal of the computation? Where is the bottleneck?

Even if your rungekutta routine is written optimally, maybe the bulk of the runtime is in the place which forms the (sparse) matrix func. Maybe you could relax the accuracy a bit and pick a larger dt value. Maybe you could ask a friend to run the program on a machine with a higher clock speed and more suitable instruction set. Perhaps you could use a different algorithm entirely, with spectral properties that are better suited to your problem. Maybe even a short-cut calculation exists. If it’s a one-time calculation, perhaps you can just wait a little bit longer, and do something else meanwhile.

On a forum like this one, I’d usually assume all of these options have been exhausted already. Still, it would be a good idea to do some profiling of the program and determine whether this is the spot where you can make the maximum impact. From what you’ve described so far, if you’d like to scale the problem to bigger N (so-called weak scaling), then using sparse matrices will in principle scale as O(cN), whereas your current dense solution is O(N^2), both in terms of work and storage. Depending on the pre-factor c, there is probably a break-even point beyond which sparsity provides an advantage.


Does contiguous force the arrays to be contiguous in memory? Otherwise, can the array be distributed across the entire available memory?

My problem is twofold:
(1) How do I use BLAS routines to perform matrix-vector operations efficiently?
(2) How can I scale such code to handle matrix sizes of the order of \approx 10^5 - 10^6?

Yes indeed! In orbital mechanics, typically the state is only 6 elements (position and velocity) and sometimes a few more. I would be interested in any suggestions for this library if any changes would make it better for integrating much larger systems.