Complaints about NumPy

A post, “I don’t like NumPy”, has been much discussed on Hacker News. With its interfaces to Fortran linear algebra libraries, and possibly operator overloading, can Fortran do better? The author says the need to avoid loops in NumPy is a problem; in Fortran it’s fine to use loops.
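For a flavour of the difference, here is a minimal sketch (mine, not from the post) of the kind of explicit loop NumPy users are told to avoid, written naturally in Fortran:

program loops_are_fine
  implicit none
  real :: x(1000), y(1000)
  integer :: i

  call random_number(x)
  ! The obvious loop is also the fast one: Fortran compilers typically
  ! auto-vectorize it, so nothing needs to be rewritten with
  ! whole-array tricks for performance.
  do i = 1, size(x)
    y(i) = x(i)**2 + 1.0
  end do
  print *, maxval(y)
end program loops_are_fine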

2 Likes

For a batch of linear systems you could use the Batched BLAS and LAPACK extensions, introduced in:

Abdelfattah, Ahmad, et al. “A set of batched basic linear algebra subprograms and LAPACK routines.” ACM Transactions on Mathematical Software (TOMS) 47.3 (2021): 1-23. https://doi.org/10.1145/3431921

The Intel MKL versions are documented in oneMKL, and an example is shown here: Solving Linear Systems Using oneMKL and OpenMP* Target Offloading | Intel

Depending on what you are trying to achieve, there are some caveats: it matters how many times you need to reuse the matrix factorization, and whether you are targeting the CPU or a GPU (see the sketch below).
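For instance, when the same matrices are solved against several successive right-hand sides, the LU factorization can be computed once and only the triangular solves repeated. A rough sketch with oneMKL’s strided batch routines (the loop and variable names are illustrative; check the argument lists against the oneMKL documentation):

! Factor the batch once, then reuse the LU factors; dgetrf_batch_strided
! overwrites a with the factors and fills ipiv.
call dgetrf_batch_strided(n, n, a, lda, stride_a, ipiv, stride_ipiv, &
                          batch_size, info)
do step = 1, n_steps
    ! ... fill b with the right-hand sides of this step ...
    call dgetrs_batch_strided('N', n, nrhs, a, lda, stride_a, ipiv, &
                              stride_ipiv, b, ldb, stride_b, batch_size, info)
end do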

IMO, the MKL batched LAPACK interface is a bit weird in the way it dimensions the matrices (probably due to its origin in the C function prototypes specified by the BLAS working group):

subroutine dgetrs_batch_strided(trans,n,nrhs,a,lda,stride_a,ipiv,stride_ipiv, &
                                b,ldb,stride_b,batch_size,info)
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    character(len=1), intent(in) :: trans
    integer, intent(in) :: n, nrhs, lda, stride_a, stride_ipiv, ldb, stride_b
    integer, intent(in) :: batch_size
    real(dp), intent(in) :: a(lda,*)
    real(dp), intent(inout) :: b(ldb,*)
    integer, intent(in) :: ipiv(*)
    integer, intent(out) :: info(batch_size)
end subroutine

To solve a batch of linear problems only once (on the CPU), it would be better to use a wrapper like the following:

! N.b. the strides are linearized!
subroutine dgesv_batch_strided(n,nrhs,a,lda,stride_a,ipiv,stride_ipiv, &
                               b,ldb,stride_b,batch_size,info)
    implicit none
    integer, parameter :: dp = kind(1.0d0)
 
    integer, intent(in) :: n, nrhs, lda, stride_a, stride_ipiv, ldb, stride_b
    integer, intent(in) :: batch_size

    ! dgesv overwrites a with its LU factors and returns the pivots in
    ! ipiv, hence intent(inout) and intent(out) here
    real(dp), intent(inout) :: a(lda,*)
    real(dp), intent(inout) :: b(ldb,*)
    integer, intent(out) :: ipiv(*)
    integer, intent(out) :: info(batch_size)

    external :: dgesv

    integer :: stride_a_col, stride_b_col, ja, jb, jipiv
    integer :: k

    ! assumes stride_a and stride_b are whole multiples of lda and ldb
    stride_a_col = stride_a / lda
    stride_b_col = stride_b / ldb

    !$omp parallel do schedule(static) private(ja,jb,jipiv)
    do k = 1, batch_size
        ja = (k-1)*stride_a_col + 1
        jb = (k-1)*stride_b_col + 1
        jipiv = (k-1)*stride_ipiv + 1

        ! ATTENTION: sequence association used here!
        call dgesv(n,nrhs,a(1,ja),lda,ipiv(jipiv),b(1,jb),ldb,info(k))
    end do

end subroutine
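A hypothetical driver for this wrapper, storing each matrix contiguously so that the strides are simply n*n and n (names and sizes are made up for illustration):

! A(:,:,k) holds the k-th matrix and b(:,k) the k-th right-hand side;
! the 3D arrays are passed to the wrapper via sequence association.
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: n = 8, nrhs = 1, nbatch = 1000
real(dp) :: A(n,n,nbatch), b(n,nbatch)
integer :: ipiv(n,nbatch), info(nbatch)

call random_number(A)
call random_number(b)
call dgesv_batch_strided(n, nrhs, A, n, n*n, ipiv, n, &
                         b, n, n, nbatch, info)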

When applying multithreading to the outer loop, it might be necessary to disable (or control) the internal multithreading of the dgesv implementation. Whether that pays off will also depend on the size of the matrices.
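With Intel MKL, for instance, a one-line sketch (mkl_set_num_threads is MKL’s service routine; OpenBLAS has openblas_set_num_threads, and note that MKL’s OpenMP threading layer often falls back to sequential execution on its own when called from inside an active parallel region):

! Sketch, assuming Intel MKL: make the dgesv calls sequential so that
! all cores go to the outer OpenMP loop. Setting MKL_NUM_THREADS=1 in
! the environment has the same effect.
call mkl_set_num_threads(1)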

Batched LU factorization routines (for the GPU) can also be found in cuBLAS and MAGMA.

I don’t get what’s wrong with

import numpy as np

y = np.empty_like(x)
for i in range(100):
    y[i,:] = np.linalg.solve(A[i,:,:], x[i,:])

?
Loops in Python are slow, but I expect the loop overhead to be small compared to the runtime of np.linalg.solve (unless of course the dimensions of the system are small).

I haven’t read the article in detail, but maybe the author wants to solve tons of small problems at once, using a GPU etc., for best performance? From some internet searching, I have come across these pages as well (both of which seem to be concerned with a lot of small problems).

1 Like

Because why not, I asked ChatGPT to help me translate the attention and multi-head attention examples from the blog to Fortran, using softmax – Fortran-lang/stdlib:

module mod_attention
  use stdlib_specialfunctions, only: softmax
  implicit none

contains

  ! Single-head scaled dot-product attention.
  function attention(X, W_q, W_k, W_v) result(output)
    real, intent(in) :: X(:, :)
    real, intent(in) :: W_q(:, :), W_k(:, :), W_v(:, :)
    real :: Q(size(X,1), size(W_q,2))
    real :: K(size(X,1), size(W_k,2))
    real :: V(size(X,1), size(W_v,2))
    real :: scores(size(X,1), size(X,1))
    real :: attention_weights(size(X,1), size(X,1))
    real :: output(size(X,1), size(W_v,2))
    real :: d_k

    d_k = real(size(W_k,2))

    ! Project the input into queries, keys, and values
    Q = matmul(X, W_q)
    K = matmul(X, W_k)
    V = matmul(X, W_v)

    ! Scaled dot-product scores, normalized row-wise by softmax
    scores = matmul(Q, transpose(K)) / sqrt(d_k)
    attention_weights = softmax(scores, dim=2)

    output = matmul(attention_weights, V)
  end function attention

  function multi_head_attention(X, W_q, W_k, W_v, W_o, n_head) result(output)
    real, intent(in) :: X(:, :)
    real, intent(in) :: W_q(:, :, :), W_k(:, :, :), W_v(:, :, :)
    real, intent(in) :: W_o(:, :, :)
    integer, intent(in) :: n_head

    integer :: seq_len, d_head, i, n
    real, allocatable :: projected(:, :, :)
    real, allocatable :: head_output(:, :)
    real, allocatable :: output(:, :)

    seq_len = size(X, 1)
    d_head = size(W_o, 3)   ! output dimension of each projected head

    allocate(projected(n_head, seq_len, d_head))
    allocate(output(seq_len, n_head * d_head))

    do n = 1, n_head
      head_output = attention(X, W_q(n,:,:), W_k(n,:,:), W_v(n,:,:))
      projected(n, :, :) = matmul(head_output, W_o(n,:,:))
    end do

    ! Concatenate the projected heads for each sequence element; the
    ! transpose keeps each head's features contiguous in the output
    ! (a plain reshape of projected(:, i, :) would interleave the heads)
    do i = 1, seq_len
      output(i, :) = reshape(transpose(projected(:, i, :)), [n_head * d_head])
    end do

  end function multi_head_attention

end module

program main
  use mod_attention
  implicit none

  integer, parameter :: input_dim = 4, seq_len = 4, d_k = 5, d_v = input_dim
  integer, parameter :: n_head = 2
  real :: X(seq_len, input_dim)
  real :: W_q(n_head, input_dim, d_k)
  real :: W_k(n_head, input_dim, d_k)
  real :: W_v(n_head, input_dim, d_v)
  real :: W_o(n_head, d_v, input_dim / n_head)
  real :: out(seq_len, input_dim)

  call random_number(X)
  call random_number(W_q)
  call random_number(W_k)
  call random_number(W_v)
  call random_number(W_o)

  out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_head)

  print *, "Multi-head attention output:"
  print *, out
end program main

Haven’t checked the correctness of the code snippet but it looks promising and simple enough.

1 Like

Honestly, I don’t understand the blog post’s point about the self-attention example. Both implementations, whether calling the single-head attention or writing it from scratch, look fine to me.

Honestly (and this might be an unpopular opinion here), I don’t think Fortran will ever attract someone making those criticisms of NumPy. Fortran is wonderful if you’re dealing with complicated calculations involving fancy arrays that you’ll code once and rarely return to. It’s not good for one-off code, hacked-together solutions, or quickly prototyping algorithms. And I personally think this can be a strength that Fortran has over Python/NumPy. If you want something quick and easy, something that doesn’t require you to look at documentation or sketch things out on paper to be sure the logic is right, then the long argument lists and strong, static typing of Fortran will simply not be attractive, and that’s okay.

Hi @Ellehan and welcome.

AFAICT, Fortran is not mentioned in the blog post. The post is mostly about how avoiding loops led to the introduction of overcomplicated solutions (compared to good-ol’ MATLAB-like syntax).

But in the real world, the MVP (i.e., the prototype) often becomes the product, so Fortran might not be considered at all even if it’s the best tool for the job.

Before Fortran 90 there were no assumed-shape array arguments or optional arguments, so array dimensions needed to be passed, as did control parameters specifying details such as an error tolerance, maximum number of iterations, how many intermediate results to print, etc. But in modern Fortran you can use assumed-shape arguments and make the control parameters optional (with reasonable defaults when they are not set) or collect them in a single derived type, so even a complicated procedure need not have many required arguments.
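Here is a minimal sketch of that style (all names are made up; the Newton iteration is just a placeholder body): the assumed-shape argument carries its own dimensions, and the control parameters are optional with defaults.

module demo_defaults
  implicit none
contains
  subroutine elementwise_sqrt(x, tol, maxiter)
    real, intent(inout) :: x(:)        ! assumed shape: no size argument
    real, intent(in), optional :: tol
    integer, intent(in), optional :: maxiter
    real :: tol_, y(size(x))
    integer :: maxiter_, iter

    tol_ = 1.0e-6                      ! defaults used when absent
    if (present(tol)) tol_ = tol
    maxiter_ = 50
    if (present(maxiter)) maxiter_ = maxiter

    y = max(x, 1.0)                    ! initial guess
    do iter = 1, maxiter_
      y = 0.5 * (y + x / y)            ! Newton update for sqrt(x)
      if (maxval(abs(y*y - x)) < tol_) exit
    end do
    x = y
  end subroutine elementwise_sqrt
end module demo_defaults

A caller writes call elementwise_sqrt(a) to take the defaults, or names just the controls it cares about, e.g. call elementwise_sqrt(a, tol=1.0e-4).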

2 Likes