Parallelize matmul

Is there any way to parallelize matmul?

A naive test. :grin:

 subroutine Parallel_matmul(A, B, C, mm, nn, ll)
    integer, parameter :: DP = kind(1.0d0)   ! double-precision working kind
    real(kind=DP), intent(in) :: A(:,:), B(:,:)
    real(kind=DP), intent(inout) :: C(:,:)
    integer, intent(in), optional :: mm, nn, ll   ! referenced unconditionally, so they must always be passed
    integer :: k, i, j
    real(kind=DP) :: tmp_mat

    !$omp parallel do private(tmp_mat,i,j,k)
    do i = 1, mm   
       do j = 1, nn
          tmp_mat = 0.0_DP
          do k = 1, ll
             tmp_mat = tmp_mat + (A(i,k) * B(k,j))
          end do
          C(i,j) = tmp_mat
       end do
    end do
    !$omp end parallel do

  end subroutine Parallel_matmul

Example: A(3,n); B(n,3); C(3,3); mm=3; nn=3; ll=n
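
For reference, a minimal check driver (my own sketch: the size n = 200 and the random fill are arbitrary, and it assumes the subroutine is pasted into the CONTAINS section). Compile with something like gfortran -O2 -fopenmp:

  program test_parallel_matmul
     implicit none
     integer, parameter :: DP = kind(1.0d0)
     integer, parameter :: n = 200
     real(kind=DP), allocatable :: A(:,:), B(:,:), C(:,:)

     allocate(A(3,n), B(n,3), C(3,3))
     call random_number(A)
     call random_number(B)

     ! mm = rows of C, nn = columns of C, ll = the summed (inner) dimension
     call Parallel_matmul(A, B, C, 3, 3, n)

     ! sanity check against the intrinsic
     print *, 'max |C - matmul(A,B)| =', maxval(abs(C - matmul(A, B)))

  contains

     ! ... paste the Parallel_matmul subroutine from above here ...

  end program test_parallel_matmul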

2 Likes

You can reorder the loops to mitigate the issues of

  • memory conflicts between threads, and
  • poor cache locality for larger matrices A, B & C

  subroutine Parallel_matmul(A, B, C, mm, nn, ll)
    integer, parameter :: DP = kind(1.0d0)
    real(kind=DP), intent(in) :: A(:,:), B(:,:)
    real(kind=DP), intent(inout) :: C(:,:)
    integer, intent(in), optional :: mm, nn, ll
    integer :: k, i, j, i1, i2, di
    real(kind=DP) :: tmp_mat

!  this approach is poorly suited to OpenMP
  !$omp parallel do private(tmp_mat,i,j,k)
    do i = 1, mm   
       do j = 1, nn
          tmp_mat = 0.0_DP
          do k = 1, ll
             tmp_mat = tmp_mat + (A(i,k) * B(k,j))
          end do
          C(i,j) = tmp_mat
       end do
    end do
    !$omp end parallel do

!  The general form of MATMUL is
  !$omp parallel do private(i,j,k) shared(A,B,C,mm,nn,ll)
    do i = 1, mm   
       do j = 1, nn
          C(i,j) = 0.0_DP
          do k = 1, ll
             C(i,j) = C(i,j) + (A(i,k) * B(k,j))
          end do
       end do
    end do
  !$omp end parallel do

!  this is a poor DO order for OpenMP: the i iterations are spread across threads,
!   so different threads write neighbouring elements of the same column C(:,j),
!   causing false sharing on the same cache line / memory page
!
!  a better DO order for the C memory conflict is to use j as the thread loop
!   and choose a static schedule
  !$omp parallel do private(i,j,k) shared(A,B,C,mm,nn,ll) schedule(static)
    do j = 1, nn
       do i = 1, mm   
          C(i,j) = 0.0_DP
          do k = 1, ll
             C(i,j) = C(i,j) + A(i,k) * B(k,j)
          end do
       end do
    end do
  !$omp end parallel do

! an improvement in localising memory use would be to make DO j the outer loop
  !$omp parallel do private(i,j,k) shared(A,B,C,mm,nn,ll) schedule(static)
    do j = 1, nn
       C(:,j) = 0.0_DP
       do k = 1, ll
          do i = 1, mm
             C(i,j) = C(i,j) + A(i,k) * B(k,j)
          end do
       end do
    end do
  !$omp end parallel do

!  This is equivalent to
  !$omp parallel do private(j,k) shared(A,B,C,mm,nn,ll) schedule(static)
    do j = 1, nn
       C(1:mm,j) = 0.0_DP
       do k = 1, ll
          C(1:mm,j) = C(1:mm,j) + A(1:mm,k) * B(k,j)
       end do
    end do
  !$omp end parallel do

!  i.e. effectively a DAXPY, which the compiler can vectorise with AVX instructions:
!   C(:,j) = C(:,j) + A(:,k) * Const, where Const = B(k,j)

!   A partitioned approach for i could be like the following for appropriate di
!   Note : j is already partitioned by OpenMP
    di = 32
  !$omp parallel do private(i1,i2,j,k) shared(A,B,C,mm,nn,ll,di) schedule(static)
    do j = 1, nn
       do i1 = 1,mm,di
          i2 = min(i1+di-1,mm)
          C(i1:i2,j) = 0.0_DP
          do k = 1, ll
             C(i1:i2,j) = C(i1:i2,j) + A(i1:i2,k) * B(k,j)
          end do
       end do
    end do
  !$omp end parallel do

  end subroutine Parallel_matmul

The above approach improves the use of the L3 cache and of AVX instructions, giving better performance for larger arrays.

Further optimisation can be achieved for larger arrays by partitioning the matrices so that each thread's working set targets its L1 cache.
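
A sketch of one way to do that (my own illustration, not tuned: the block sizes di, jb and kb below are guesses and would need adjusting so that each thread's A, B and C tiles actually fit in its L1 data cache):

  subroutine blocked_matmul(A, B, C, mm, nn, ll)
     integer, parameter :: DP = kind(1.0d0)
     real(kind=DP), intent(in)    :: A(:,:), B(:,:)
     real(kind=DP), intent(inout) :: C(:,:)
     integer, intent(in) :: mm, nn, ll
     ! illustrative block sizes: aim for di*kb + kb*jb + di*jb reals (8 bytes each)
     ! to sit comfortably inside one core's L1 data cache
     integer, parameter :: di = 32, jb = 8, kb = 64
     integer :: i1, i2, j, j1, j2, k, k1, k2

   !$omp parallel do private(i1,i2,j,j2,k,k1,k2) shared(A,B,C,mm,nn,ll) schedule(static)
     do j1 = 1, nn, jb                      ! column blocks of C, partitioned across threads
        j2 = min(j1+jb-1, nn)
        C(1:mm, j1:j2) = 0.0_DP
        do k1 = 1, ll, kb                   ! blocks of the summed dimension
           k2 = min(k1+kb-1, ll)
           do i1 = 1, mm, di                ! row strips of C and A
              i2 = min(i1+di-1, mm)
              do j = j1, j2
                 do k = k1, k2
                    ! AXPY-style update on a short contiguous strip of C
                    C(i1:i2, j) = C(i1:i2, j) + A(i1:i2, k) * B(k, j)
                 end do
              end do
           end do
        end do
     end do
   !$omp end parallel do
  end subroutine blocked_matmul

The j1 loop is still the one OpenMP distributes, so nn needs to be reasonably large relative to jb times the thread count for the load balance to stay good.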

7 Likes

I’d go looking into the GPGPU literature on the topic; by necessity, any decently performant GPGPU implementation will be massively parallel.
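
For a flavour of what that looks like from Fortran, here is a minimal sketch using OpenMP target offload (my own illustration; the directives are standard OpenMP 4.5+, and compilers such as nvfortran, gfortran and flang support them with varying maturity when built with offload enabled):

  subroutine gpu_matmul(A, B, C, mm, nn, ll)
     integer, parameter :: DP = kind(1.0d0)
     real(kind=DP), intent(in)    :: A(:,:), B(:,:)
     real(kind=DP), intent(inout) :: C(:,:)
     integer, intent(in) :: mm, nn, ll
     integer :: i, j, k
     real(kind=DP) :: tmp

     ! copy A and B to the device and C both ways; the (j,i) iteration space is
     ! spread across the GPU's teams and threads via collapse(2)
   !$omp target teams distribute parallel do collapse(2) private(tmp) &
   !$omp    map(to: A, B) map(tofrom: C)
     do j = 1, nn
        do i = 1, mm
           tmp = 0.0_DP
           do k = 1, ll
              tmp = tmp + A(i,k) * B(k,j)
           end do
           C(i,j) = tmp
        end do
     end do
   !$omp end target teams distribute parallel do

  end subroutine gpu_matmul

Another route some compilers offer is writing the loops as do concurrent and compiling with nvfortran -stdpar=gpu; either way, for production work the vendor dgemm routines (cuBLAS, rocBLAS) will be hard to beat.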

1 Like

Thanks for this suggestion; GPGPU is often put forward as the new frontier for multi-threaded calculation.
Do you have any suggestion on how scalable this GPGPU approach might be across a range of matrix sizes?
The two parameters I have “indicated” for partitioning are num_threads for loop j and di = 32 for loop i. These are strategies to minimise the memory-to-L3 and then L3-to-L1 cache transfers.
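
In code terms, those two knobs sit roughly here (a trivial sketch; the values shown are machine-dependent and only illustrative):

  program tuning_knobs
     use omp_lib, only : omp_get_max_threads
     implicit none
     integer :: nthreads, di

     ! j-loop partitioning: the thread count, normally set via OMP_NUM_THREADS
     nthreads = omp_get_max_threads()

     ! i-loop partitioning: strip length in reals; 32 * 8 bytes = 256 bytes,
     ! i.e. four 64-byte cache lines per strip of C (and of each A column slice)
     di = 32

     print *, 'threads for the j loop =', nthreads, ',  di for the i loop =', di
  end program tuning_knobs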

Can you suggest whether there are equivalent approaches when optimising for graphics memory, or might GDDR graphics-memory transfer rates and the GPUs' 64-bit vector units overcome these issues/bottlenecks?
Also, which Fortran compilers support this approach?

A working example would be very helpful to many of us who are not proficient in transferring 64-bit real calculations to this new GPU opportunity.

1 Like