Parallelize matmul

Is there any way to parallelize matmul?

A naive test. :grin:

 subroutine Parallel_matmul(A, B, C, mm, nn, ll)
    integer, parameter :: DP = kind(1.0d0)   ! double-precision working kind
    real(kind=DP), intent(in) :: A(:,:), B(:,:)
    real(kind=DP), intent(inout) :: C(:,:)
    integer, intent(in), optional :: mm, nn, ll   ! referenced unconditionally, so they must always be passed
    integer :: k, i, j
    real(kind=DP) :: tmp_mat

    !$omp parallel do private(tmp_mat,i,j,k)
    do i = 1, mm   
       do j = 1, nn
          tmp_mat = 0.0_DP
          do k = 1, ll
             tmp_mat = tmp_mat + (A(i,k) * B(k,j))
          end do
          C(i,j) = tmp_mat
       end do
    end do
    !$omp end parallel do

  end subroutine Parallel_matmul

Example: A(3,n); B(n,3); C(3,3); mm=3; nn=3; ll=n
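
For reference, a minimal check driver (my own sketch: the size n = 200 and the random fill are arbitrary, and it assumes the subroutine is pasted into the CONTAINS section). Compile with something like gfortran -O2 -fopenmp:

  program test_parallel_matmul
     implicit none
     integer, parameter :: DP = kind(1.0d0)
     integer, parameter :: n = 200
     real(kind=DP), allocatable :: A(:,:), B(:,:), C(:,:)

     allocate(A(3,n), B(n,3), C(3,3))
     call random_number(A)
     call random_number(B)

     ! mm = rows of C, nn = columns of C, ll = the summed (inner) dimension
     call Parallel_matmul(A, B, C, 3, 3, n)

     ! sanity check against the intrinsic
     print *, 'max |C - matmul(A,B)| =', maxval(abs(C - matmul(A, B)))

  contains

     ! ... paste the Parallel_matmul subroutine from above here ...

  end program test_parallel_matmul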

2 Likes

You can reorder the loops to mitigate the issues of

  • memory conflicts between threads, and
  • poor cache locality for larger matrices A, B & C

  subroutine Parallel_matmul(A, B, C, mm, nn, ll)
    integer, parameter :: DP = kind(1.0d0)
    real(kind=DP), intent(in) :: A(:,:), B(:,:)
    real(kind=DP), intent(inout) :: C(:,:)
    integer, intent(in), optional :: mm, nn, ll
    integer :: k, i, j, i1, i2, di
    real(kind=DP) :: tmp_mat

!  this approach is poorly suited to OpenMP
  !$omp parallel do private(tmp_mat,i,j,k)
    do i = 1, mm   
       do j = 1, nn
          tmp_mat = 0.0_DP
          do k = 1, ll
             tmp_mat = tmp_mat + (A(i,k) * B(k,j))
          end do
          C(i,j) = tmp_mat
       end do
    end do
    !$omp end parallel do

!  The general form of MATMUL is
  !$omp parallel do private(i,j,k) shared(A,B,C,mm,nn,ll)
    do i = 1, mm   
       do j = 1, nn
          C(i,j) = 0.0_DP
          do k = 1, ll
             C(i,j) = C(i,j) + (A(i,k) * B(k,j))
          end do
       end do
    end do
  !$omp end parallel do

!  this is a poor DO order for OpenMP: the i iterations are spread across threads,
!   so different threads write neighbouring elements of the same column C(:,j),
!   causing false sharing on the same cache line / memory page
!
!  a better DO order for the C memory conflict is to use j as the thread loop
!   and choose a static schedule
  !$omp parallel do private(i,j,k) shared(A,B,C,mm,nn,ll) schedule(static)
    do j = 1, nn
       do i = 1, mm   
          C(i,j) = 0.0_DP
          do k = 1, ll
             C(i,j) = C(i,j) + A(i,k) * B(k,j)
          end do
       end do
    end do
  !$omp end parallel do

! an improvement in localising memory use would be to make DO j the outer loop
  !$omp parallel do private(i,j,k) shared(A,B,C,mm,nn,ll) schedule(static)
    do j = 1, nn
       C(:,j) = 0.0_DP
       do k = 1, ll
          do i = 1, mm
             C(i,j) = C(i,j) + A(i,k) * B(k,j)
          end do
       end do
    end do
  !$omp end parallel do

!  This is equivalent to
  !$omp parallel do private(j,k) shared(A,B,C,mm,nn,ll) schedule(static)
    do j = 1, nn
       C(1:mm,j) = 0.0_DP
       do k = 1, ll
          C(1:mm,j) = C(1:mm,j) + A(1:mm,k) * B(k,j)
       end do
    end do
  !$omp end parallel do

!  i.e. effectively a DAXPY, which the compiler can vectorise with AVX instructions:
!   C(:,j) = C(:,j) + A(:,k) * Const, where Const = B(k,j)

!   A partitioned approach for i could be like the following for appropriate di
!   Note : j is already partitioned by OpenMP
    di = 32
  !$omp parallel do private(i1,i2,j,k) shared(A,B,C,mm,nn,ll,di) schedule(static)
    do j = 1, nn
       do i1 = 1,mm,di
          i2 = min(i1+di-1,mm)
          C(i1:i2,j) = 0.0_DP
          do k = 1, ll
             C(i1:i2,j) = C(i1:i2,j) + A(i1:i2,k) * B(k,j)
          end do
       end do
    end do
  !$omp end parallel do

  end subroutine Parallel_matmul

The above approach improves the use of the L3 cache and of AVX instructions, giving better performance for larger arrays.

Further optimisation can be achieved for larger arrays by partitioning the matrices so that each thread's working set targets its L1 cache.
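
A sketch of one way to do that (my own illustration, not tuned: the block sizes di, jb and kb below are guesses and would need adjusting so that each thread's A, B and C tiles actually fit in its L1 data cache):

  subroutine blocked_matmul(A, B, C, mm, nn, ll)
     integer, parameter :: DP = kind(1.0d0)
     real(kind=DP), intent(in)    :: A(:,:), B(:,:)
     real(kind=DP), intent(inout) :: C(:,:)
     integer, intent(in) :: mm, nn, ll
     ! illustrative block sizes: aim for di*kb + kb*jb + di*jb reals (8 bytes each)
     ! to sit comfortably inside one core's L1 data cache
     integer, parameter :: di = 32, jb = 8, kb = 64
     integer :: i1, i2, j, j1, j2, k, k1, k2

   !$omp parallel do private(i1,i2,j,j2,k,k1,k2) shared(A,B,C,mm,nn,ll) schedule(static)
     do j1 = 1, nn, jb                      ! column blocks of C, partitioned across threads
        j2 = min(j1+jb-1, nn)
        C(1:mm, j1:j2) = 0.0_DP
        do k1 = 1, ll, kb                   ! blocks of the summed dimension
           k2 = min(k1+kb-1, ll)
           do i1 = 1, mm, di                ! row strips of C and A
              i2 = min(i1+di-1, mm)
              do j = j1, j2
                 do k = k1, k2
                    ! AXPY-style update on a short contiguous strip of C
                    C(i1:i2, j) = C(i1:i2, j) + A(i1:i2, k) * B(k, j)
                 end do
              end do
           end do
        end do
     end do
   !$omp end parallel do
  end subroutine blocked_matmul

The j1 loop is still the one OpenMP distributes, so nn needs to be reasonably large relative to jb times the thread count for the load balance to stay good.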

7 Likes

I’d go looking into the GPGPU literature on the topic; by necessity, any decently performant GPGPU implementation will be massively parallel.
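
For a flavour of what that looks like from Fortran, here is a minimal sketch using OpenMP target offload (my own illustration; the directives are standard OpenMP 4.5+, and compilers such as nvfortran, gfortran and flang support them with varying maturity when built with offload enabled):

  subroutine gpu_matmul(A, B, C, mm, nn, ll)
     integer, parameter :: DP = kind(1.0d0)
     real(kind=DP), intent(in)    :: A(:,:), B(:,:)
     real(kind=DP), intent(inout) :: C(:,:)
     integer, intent(in) :: mm, nn, ll
     integer :: i, j, k
     real(kind=DP) :: tmp

     ! copy A and B to the device and C both ways; the (j,i) iteration space is
     ! spread across the GPU's teams and threads via collapse(2)
   !$omp target teams distribute parallel do collapse(2) private(tmp) &
   !$omp    map(to: A, B) map(tofrom: C)
     do j = 1, nn
        do i = 1, mm
           tmp = 0.0_DP
           do k = 1, ll
              tmp = tmp + A(i,k) * B(k,j)
           end do
           C(i,j) = tmp
        end do
     end do
   !$omp end target teams distribute parallel do

  end subroutine gpu_matmul

Another route some compilers offer is writing the loops as do concurrent and compiling with nvfortran -stdpar=gpu; either way, for production work the vendor dgemm routines (cuBLAS, rocBLAS) will be hard to beat.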

1 Like

Thanks for this suggestion; GPGPU is often put forward as the new frontier for multi-threaded calculation.
Do you have any suggestion on how scalable this GPGPU approach might be across a range of matrix sizes?
The two parameters I have “indicated” for partitioning are num_threads for loop j and di = 32 for loop i. These are strategies to minimise the memory-to-L3 and then L3-to-L1 cache transfers.
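
In code terms, those two knobs sit roughly here (a trivial sketch; the values shown are machine-dependent and only illustrative):

  program tuning_knobs
     use omp_lib, only : omp_get_max_threads
     implicit none
     integer :: nthreads, di

     ! j-loop partitioning: the thread count, normally set via OMP_NUM_THREADS
     nthreads = omp_get_max_threads()

     ! i-loop partitioning: strip length in reals; 32 * 8 bytes = 256 bytes,
     ! i.e. four 64-byte cache lines per strip of C (and of each A column slice)
     di = 32

     print *, 'threads for the j loop =', nthreads, ',  di for the i loop =', di
  end program tuning_knobs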

Can you suggest whether there are equivalent approaches when optimising for graphics memory, or might GDDR graphics-memory transfer rates and the GPUs' 64-bit vector units overcome these issues/bottlenecks?
Also, which Fortran compilers support this approach?

A working example would be very helpful to many of us who are not proficient in transferring 64-bit real calculations to this new GPU opportunity.

1 Like