A post I don't like NumPy has been much discussed on Hacker News. With interfaces to Fortran linear algebra libraries and possibly operator overloading, can Fortran do better? The author says the need to avoid loops in NumPy is a problem, but in Fortran it’s fine to use loops.
For a batch of linear systems you could use the Batched BLAS and LAPACK extensions, introduced in:
Abdelfattah, Ahmad, et al. “A set of batched basic linear algebra subprograms and LAPACK routines.” ACM Transactions on Mathematical Software (TOMS) 47.3 (2021): 1-23. https://doi.org/10.1145/3431921
The Intel MKL versions can be found here:
An example is shown here: Solving Linear Systems Using oneMKL and OpenMP* Target Offloading | Intel
Depending on what you are trying to achieve, there are some caveats. It depends on how many times you need to reuse the matrix factorization and if you are targeting the CPU or a GPU.
IMO, the MKL batched LAPACK interface is a bit weird in the way it dimensions the matrices (probably due to its origin in the C function prototypes specified by the BLAS working group):
subroutine dgetrs_batch_strided(trans,n,nrhs,a,lda,stride_a,ipiv,stride_ipiv, &
b,ldb,stride_b,batch_size,info)
integer, parameter :: dp = kind(1.0d0)
character(len=1), intent(in) :: trans
integer, intent(in) :: n, nrhs, lda, stride_a, stride_ipiv, ldb, stride_b
integer, intent(in) :: batch_size
real(dp), intent(in) :: a(lda,*)
real(dp), intent(inout) :: b(ldb,*)
integer, intent(in) :: ipiv(*)
integer, intent(out) :: info(batch_size)
To solve a batch of linear problems once only (on the CPU), it would be better to use,
! N.b. the strides are linearized!
subroutine dgesv_batch_strided(n,nrhs,a,lda,stride_a,ipiv,stride_ipiv, &
b,ldb,stride_b,batch_size,info)
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, intent(in) :: n, nrhs, lda, stride_a, stride_ipiv, ldb, stride_b
integer, intent(in) :: batch_size
real(dp), intent(in) :: a(lda,*)
real(dp), intent(inout) :: b(ldb,*)
integer, intent(in) :: ipiv(*)
integer, intent(out) :: info(batch_size)
external :: dgesv
integer :: stride_a_col, stride_b_col, ja, jb, jipiv
integer :: k
stride_a_col = stride_a / lda
stride_b_col = stride_b / ldb
!$omp parallel do schedule(static)
do k = 1, batch_size
ja = (k-1)*stride_a_col + 1
jb = (k-1)*stride_b_col + 1
jipiv = (k-1)*stride_ipiv + 1
! ATTENTION: sequence association used here!
call dgesv(n,nrhs,A(1,ja),lda,ipiv(jipiv),b(1,jb),ldb,info(k))
end do
end subroutine
When applying multithreading on the outer loop, it might be necessary to disable (or control) the internal multithreading of the dgesv
implementation. This will also depend on the size of the matrices.
You can find batched LU factorization routines (for the GPU) also in cuBLAS and MAGMA.
I don’t get what’s wrong with
y = np.empty_like(x)
for i in range(100):
y[i,:] = np.linalg.solve(A[i,:,:], x[i,:])
?
Loops in Python are slow, but I expect the loop overhead to be small compared to the runtime of np.linalg.solve
(unless of course the dimensions of the system are small).
I haven’t read the article in detail, but maybe the author wants to solve tons of small problems at once using GPU etc for best performance? From some internet search, I have come across these pages also (both of which seem to be interested in a lot of small problems).
Because why not, I asked ChatGPT to help me translate the attention and multi-head attention examples from the blog to Fortran using softmax – Fortran-lang/stdlib
module mod_attention
use stdlib_specialfunctions, only: softmax
contains
function attention(X, W_q, W_k, W_v) result(output)
real, intent(in) :: X(:, :)
real, intent(in) :: W_q(:, :), W_k(:, :), W_v(:, :)
real :: Q(size(X,1), size(W_q,2))
real :: K(size(X,1), size(W_k,2))
real :: V(size(X,1), size(W_v,2))
real :: scores(size(X,1), size(X,1))
real :: attention_weights(size(X,1), size(X,1))
real :: output(size(X,1), size(W_v,2))
real :: d_k
d_k = real(size(W_k,2))
Q = matmul(X, W_q)
K = matmul(X, W_k)
V = matmul(X, W_v)
scores = matmul(Q, transpose(K)) / sqrt(d_k)
attention_weights = softmax(scores, dim=2)
output = matmul(attention_weights, V)
end function attention
function multi_head_attention(X, W_q, W_k, W_v, W_o, n_head) result(output)
real, intent(in) :: X(:, :)
real, intent(in) :: W_q(:, :, :), W_k(:, :, :), W_v(:, :, :)
real, intent(in) :: W_o(:, :, :)
integer, intent(in) :: n_head
integer :: seq_len, input_dim, d_k, d_v, i, n
real, allocatable :: projected(:, :, :)
real, allocatable :: head_output(:, :)
real, allocatable :: my_proj(:, :)
real, allocatable :: output(:, :)
seq_len = size(X, 1)
input_dim = size(X, 2)
d_k = size(W_q, 3)
d_v = size(W_v, 3)
allocate(projected(n_head, seq_len, input_dim / n_head))
allocate(output(seq_len, n_head * (input_dim / n_head)))
do n = 1, n_head
head_output = attention(X, W_q(n,:,:), W_k(n,:,:), W_v(n,:,:))
my_proj = matmul(head_output, W_o(n,:,:))
projected(n, :, :) = my_proj
end do
! Concatenate projected heads for each sequence element
do i = 1, seq_len
output(i, :) = reshape(projected(:, i, :), [n_head * (input_dim / n_head)])
end do
end function multi_head_attention
end module
program main
use mod_attention
implicit none
integer, parameter :: input_dim = 4, seq_len = 4, d_k = 5, d_v = input_dim
integer, parameter :: n_head = 2
real :: X(seq_len, input_dim)
real :: W_q(n_head, input_dim, d_k)
real :: W_k(n_head, input_dim, d_k)
real :: W_v(n_head, input_dim, d_v)
real :: W_o(n_head, d_v, input_dim / n_head)
real :: out(seq_len, input_dim)
call random_number(X)
call random_number(W_q)
call random_number(W_k)
call random_number(W_v)
call random_number(W_o)
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_head)
print *, "Multi-head attention output:"
print *, out
end program main
Haven’t checked the correctness of the code snippet but it looks promising and simple enough.
Honestly, I don’t understand the blog post’s point about the self-attention example. Both implementation, either calling to the single-head attention or writing it from scratch looks fine to me.
Honestly (and this might be an unpopular opinion here), I think don’t think Fortran will ever attract someone making those criticisms of numpy. Fortran is wonderful if you’re dealing with complicated calculations involving fancy arrays that you’ll code once and rarely return to. It’s not good for one-off code, hacked together solutions, or quickly prototyping algorithms. And I personally think this can be a strength that Fortran has over Python/numpy. If you want something that’s quick and easy, something that doesn’t require you to look at documentation or maybe sketch things out on paper to be sure you write the logic correctly, then the long argument lists and strong, static typing of Fortran will simply not be attractive and that’s okay.
Hi @Ellehan and welcome.
AFAICT, Fortran is not mentioned in the blog post. The post is mostly about how avoiding loops led to the introduction of overcomplicated solutions (compared to good-ol’ MATLAB-like syntax).
But in the real world, the MVP (i.e., the prototype) often becomes the product, so Fortran might not be considered at all even if it’s the best tool for the job.
Before Fortran 90 there were no assumed-shape array arguments or optional arguments. so array dimensions needed to be passed, as did control parameters specifying details such as an error tolerance, maximum number of iterations, how many intermediate results to print, etc. But in modern Fortran you can use assumed-shape arguments and use optional arguments for the control parameters (with reasonable defaults if they are not set) or set them in a single derived type, so even a complicated procedure need not have many required arguments.