Performance of Intel MKL routines for convolutions/correlations

Hello,

I have a process where I need to do repeated 3D convolutions and correlations between “large” blocks of data (say 500x50x50) and small operators (typically 51x5x5, centered along all dimensions).

I have some routines that I wrote for that, just a “naive” implementation with 5 nested loops, plus a call to BLAS saxpy or sdot in the innermost loop (that is, 6 nested loops in total, as expected).
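In index notation (matching the header comments of the routine below, with all indices restricted to their valid ranges), the two operations are:

    y(i1,i2,i3) = y(i1,i2,i3) + sum over (j1,j2,j3) of f(j1,j2,j3) * x(i1-j1,i2-j2,i3-j3)   ! convolution
    f(j1,j2,j3) = f(j1,j2,j3) + sum over (i1,i2,i3) of x(i1-j1,i2-j2,i3-j3) * y(i1,i2,i3)   ! correlation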

Recently I have tested the Intel MKL routines that perform multidimensional convolutions/correlations, to see if I could get some speed-up compared to my naive implementation. And the results somehow puzzle me…

For simplicity I took blocks and operators of equal sizes in all dimensions: the data blocks are 100x100x100, and I am testing various operator sizes lop*lop*lop, with lop varying from 1 to 25. The MKL routines have 3 different modes, “DIRECT”, “FFT”, and “AUTO”, and I am testing all of them. I am using ifort 18, with the -O3 -mkl options, and I set MKL_NUM_THREADS=1 to prevent MKL from using multithreading.

The timings are shown below:

[Plot: runtimes of the naive loops and of the MKL DIRECT/FFT/AUTO modes, for convolutions and correlations, as a function of the operator size lop]

As expected, the FFT mode performance does not depend much on the operator size (the FFTs are presumably sized according to the larger of the data block and the operator), and consequently this mode is faster for large enough operators.

But beyond that there are some surprises:

  1. The DIRECT mode is always much slower than my naive loops. I was expecting this mode to implement something very similar to what I’m doing, but possibly better optimized (with blocking or whatever). At worst I was expecting similar runtimes, not longer ones!
  2. In the correlations, the DIRECT mode is faster than the FFT mode for small operator sizes, yet the AUTO mode still internally selects the FFT mode! It seems the AUTO mode always falls back to FFT, which looks like a bug to me.
  3. In the MKL FFT mode, correlations are about 8x slower than convolutions (and I can’t see why).

I would be interested in your comments and possible explanations.

The test code is below… BTW the way Intel distributes the Fortran interfaces is weird: instead of a .mod file, it is a source include file that contains Fortran modules. This is not convenient when you have constraints on the module names in a project (unless you copy the file into the project and modify the names).

module convcorr_m
  implicit none

  public

contains
  !*******************************************************************************
  ! 3D convolution y = y + f*x                                                   *
  ! or                                                                           *
  ! 3D correlation f = f + x'*y                                                  *
  !*******************************************************************************
  subroutine cc_sconv3d(x,y,n1,n2,n3,f,l1n,l1p,l2n,l2p,l3n,l3p,cc,reset)

    implicit none

    ! Arguments
    integer         , intent(in   ) :: n1, n2, n3, l1n, l1p, l2n, l2p, l3n, l3p
    real            , intent(in   ) :: x(n1,n2,n3)
    real            , intent(inout) :: y(n1,n2,n3)
    real            , intent(inout) :: f(l1n:l1p,l2n:l2p,l3n:l3p)
    character(len=*), intent(in   ), optional :: cc
    logical         , intent(in   ), optional :: reset

    ! Local variables
    integer :: mode, i2, i3, j1, j2, j3, i1i, i1f, i2i, i2f, i3i, i3f
    real, external :: sdot   ! BLAS single-precision dot product

    !*******************************************************************************

    mode = 0
    if (present(cc)) then
       if (cc == 'corr' .or. cc == 'CORR') mode = 1
    end if

    if (mode == 0) then

       if (present(reset)) then
          if (reset) y(:,:,:) = 0.0
       end if

       do j3 = l3n, l3p
          i3i = max(1,1+j3) ; i3f = min(n3,n3+j3)

          do i3 = i3i, i3f
             do j2 = l2n, l2p
                i2i = max(1,1+j2) ; i2f = min(n2,n2+j2)
                do i2 = i2i, i2f
                   do j1 = l1n, l1p
                      i1i = max(1,1+j1) ; i1f = min(n1,n1+j1)
                      call saxpy(i1f-i1i+1,f(j1,j2,j3),x(i1i-j1,i2-j2,i3-j3),1,y(i1i,i2,i3),1)
                   end do
                end do
             end do
          end do

       end do

    else

       if (present(reset)) then
          if (reset) f(:,:,:) = 0.0
       end if


       do j3 = l3n, l3p
          i3i = max(1,1+j3) ; i3f = min(n3,n3+j3)
          do j2 = l2n, l2p
             i2i = max(1,1+j2) ; i2f = min(n2,n2+j2)
             do i3 = i3i, i3f
                do i2 = i2i, i2f
                   do j1 = l1n, l1p
                      i1i = max(1,1+j1) ; i1f = min(n1,n1+j1)
                      f(j1,j2,j3) = f(j1,j2,j3) + sdot(i1f-i1i+1,y(i1i,i2,i3),1,x(i1i-j1,i2-j2,i3-j3),1)
                   end do
                end do
             end do
          end do
       end do


    end if
    

  end subroutine cc_sconv3d

end module convcorr_m


include "mkl_vsl.fi"


program convcorr
use iso_fortran_env
use convcorr_m
use mkl_vsl
implicit none

	integer, parameter :: n = 100
	real, allocatable :: x(:,:,:), y(:,:,:), f(:,:,:)

	integer :: lop, convmode, corrmode, stat, imode, npass
	type(vsl_conv_task) :: taskonv
	type(vsl_corr_task) :: taskorr

	integer(int64) :: tic, toc
	real(real64) :: rate


	allocate( x(n,n,n), y(n,n,n) )

	write(*,"(4X,4A24)") "---- LOOPS -----", "-- VSL_DIRECT --", "--- VSL_FFT ----", "--- VSL_AUTO ---"
	write(*,"(A4,8A12)") "LOP", "CONV", "CORR", "CONV", "CORR", "CONV", "CORR", "CONV", "CORR"

	do lop = 1, 25, 1

		write(*,"(4I4)",advance='no') lop

		allocate( f(lop,lop,lop) )
		call random_number(x); x = x - 0.5
		call random_number(f); f = f - 0.5

		call system_clock(tic,rate); npass = 0
		do
			npass = npass+1
			call cc_sconv3d(x,y,n,n,n,f,-lop/2,(lop-1)/2,-lop/2,(lop-1)/2,-lop/2,(lop-1)/2,reset=.true.)
			call system_clock(toc,rate)
			if ((toc-tic)/rate >= 1.0) exit
		end do

		write(*,"(F12.6)",advance='no') ((toc-tic)/rate)/npass

		call system_clock(tic,rate); npass = 0
		do
			npass = npass+1
			call cc_sconv3d(x,y,n,n,n,f,-lop/2,(lop-1)/2,-lop/2,(lop-1)/2,-lop/2,(lop-1)/2,cc='CORR',reset=.true.)
			call system_clock(toc,rate)
			if ((toc-tic)/rate >= 1.0) exit
		end do
		write(*,"(F12.6)",advance='no') ((toc-tic)/rate)/npass

        VSLMODE: do imode = 1, 3

			if (imode == 1) then
				convmode = VSL_CONV_MODE_DIRECT
				corrmode = VSL_CORR_MODE_DIRECT
			else if (imode == 2) then
				convmode = VSL_CONV_MODE_FFT
				corrmode = VSL_CORR_MODE_FFT
			else if (imode == 3) then
				convmode = VSL_CONV_MODE_AUTO
				corrmode = VSL_CORR_MODE_AUTO
			end if

			call random_number(x); x = x - 0.5
			call random_number(f); f = f - 0.5

			stat = vslsconvnewtask(taskonv, convmode, 3, [n,n,n], [lop,lop,lop], [n,n,n])
			stat = vslconvsetstart(taskonv,[lop/2,lop/2,lop/2])
	 		stat = vslconvsetinternalprecision(taskonv, VSL_CONV_PRECISION_SINGLE)

			stat = vslscorrnewtask(taskorr, corrmode, 3, [n,n,n], [n,n,n], [lop,lop,lop])
			stat = vslcorrsetstart(taskorr,[-lop/2,-lop/2,-lop/2])
	 		stat = vslcorrsetinternalprecision(taskorr, VSL_CORR_PRECISION_SINGLE)

			call system_clock(tic,rate); npass = 0
			do
				npass = npass+1
				stat = vslsconvexec(taskonv,x,[1,n,n*n],f,[1,lop,lop*lop],y,[1,n,n*n])
				call system_clock(toc,rate)
				if ((toc-tic)/rate >= 1.0) exit
			end do
			write(*,"(F12.6)",advance='no') ((toc-tic)/rate)/npass

			call system_clock(tic,rate); npass = 0
			do
				npass = npass+1
				stat = vslscorrexec(taskorr,x,[1,n,n*n],y,[1,n,n*n],f,[1,lop,lop*lop])
				call system_clock(toc,rate)
				if ((toc-tic)/rate >= 1.0) exit
			end do
			write(*,"(F12.6)",advance='no') ((toc-tic)/rate)/npass

			stat = vslconvdeletetask(taskonv)
			stat = vslcorrdeletetask(taskorr)

		end do VSLMODE

		deallocate( f )
		write(*,*)

	end do

end program

For convolutions in lower dimensions there is a project

fastconv: simple library for 1D and 2D convolutions, by Dominik Gronkiewicz @gronki

Maybe the algorithms can be extended to 3D.

@Beliavsky What makes it fast? Is it the use of explicit vector operations in C? I am using BLAS saxpy/sdot (obviously from MKL as well) in the inner loops, and I am assuming these are vectorized, and consequently that there’s not much to gain here.

As the author of that project: it was more my own toy benchmark to see how fast I could make a convolution in pure Fortran (without using explicit SIMD intrinsics) and to experiment with different architectures. What I have found is that it is extremely beneficial to make your kernel size a multiple of 4 or 8 in the first dimension. I just pad it with zeros (which keeps the result of the convolution the same). The other trickery did not make much of a difference in my case (although I did try quite a few options, as can be seen in the repo). A minimal sketch of the zero-padding idea is below.
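(The names and sizes here are made up for illustration; this is not code from fastconv.)

program pad_kernel
   implicit none
   integer, parameter :: l1 = 5, l2 = 5, l3 = 5   ! original kernel size (example values)
   integer :: l1pad
   real :: f(l1,l2,l3)
   real, allocatable :: fpad(:,:,:)

   call random_number(f)
   l1pad = 8*((l1 + 7)/8)          ! round the first dimension up to a multiple of 8
   allocate( fpad(l1pad,l2,l3) )
   fpad = 0.0                      ! the extra taps are zero...
   fpad(1:l1,:,:) = f              ! ...so the convolution result is unchanged
   print *, "padded first dimension:", l1, "->", l1pad
end program pad_kernel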

As you can see from the benchmarks, the pure Fortran version actually beats the explicit SIMD version, while being 100% portable.

Perhaps when I am done with my thesis I can take some time and make it an actually usable project with fpm and all the good stuff :) I did not actually expect anyone to find it!


Maybe it depends on how the convolution algorithm is coded. The way I have coded it is not supposed to be sensitive to the alignment of the kernel (“operator” in my original post), as the innermost loop (hidden in saxpy) uses just a single element of it (the f array is the kernel):

                      call saxpy(i1f-i1i+1,f(j1    ,j2   ,j3   ),   &
                                           x(i1i-j1,i2-j2,i3-j3),1, &
                                           y(i1i   ,i2   ,i3   ),1)

I have remade the above graph with even values of lop as well (instead of the odd values only), and I can see no effect at multiples of 4, 8, …

However, the alignment of the first elements of the x and y sections that are passed to saxpy probably matters, and that one I can’t ensure: the start index shifts with the operator index j1 from one call to the next, so even a leading dimension that is a multiple of 4/8/… wouldn’t help.
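For what it’s worth, here is a quick way to inspect the alignment that saxpy actually sees (a standalone sketch; the array needs the target attribute for c_loc):

program check_alignment
   use iso_c_binding, only: c_loc, c_intptr_t
   implicit none
   real, target :: y(100,100,100)
   integer(c_intptr_t) :: addr
   integer :: i1i

   y = 0.0
   do i1i = 1, 4
      ! address of the first element handed to saxpy, modulo a cache line
      addr = transfer(c_loc(y(i1i,1,1)), addr)
      print *, "start index", i1i, ": address mod 64 =", mod(addr, 64_c_intptr_t)
   end do
end program check_alignment

The printout shows the start address moving by 4 bytes as the start index shifts, which is why no allocation strategy can keep every call aligned.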

Regarding the MKL routine in the FFT mode, clearly some kernel sizes should be avoided, but it’s not easy to understand why… All the more so as it also depends on the data sizes. I plotted the results for n=100, but I also tried 96, 101, 102, and 104, which together cover alignments of the data columns on 4-, 8-, 16-, 32-, and 64-byte boundaries: it doesn’t change much for the convolutions, but it has a significant impact on the correlations:

  n   Convolutions   Correlations
101   ~ 0.05 s       ~ 1.00 s
102   ~ 0.05 s       ~ 0.95 s
100   ~ 0.05 s       ~ 0.35 s
104   ~ 0.05 s       ~ 0.35 s
 96   ~ 0.05 s       ~ 0.27 s

I don’t have an answer, but here is what I could learn in one hour using Compiler Explorer (follow the link to see the analysis). I was using the compilation command ifx -O3 -march=skylake -qmkl, but I expect similar results with other compilers.

The innermost loop compiles to the following:

.LBB1_15:
        test    rbx, rbx
        mov     eax, r12d
        cmovle  eax, esi
        mov     rdx, qword ptr [rsp + 104]
        cmp     edx, ebp
        mov     ecx, ebp
        cmovl   ecx, edx
        sub     ecx, eax
        inc     ecx
        mov     dword ptr [rsp + 68], ecx
        movsxd  rcx, eax
        add     rcx, r15
        mov     rdx, qword ptr [rsp + 304]
        lea     rdx, [rdx + 4*rcx]
        mov     rcx, qword ptr [rsp + 296]
        lea     r8, [rcx + 4*rax]
        add     r8, -4
        mov     ecx, offset __unnamed_1
        mov     r9d, offset __unnamed_1
        mov     rsi, r13
        xor     eax, eax
        mov     r14, rdi
        call    saxpy_@PLT                ; Note the saxpy call here
        mov     rdi, r14
        mov     esi, 1
        mov     r8, qword ptr [rsp + 288]
        add     r13, 4
        inc     rbx
        inc     ebp
        dec     r15
        inc     r12d
        cmp     r8, rbx
        jne     .LBB1_15

I couldn’t find a way to force Compiler Explorer to inline the MKL saxpy (if it’s even possible) so I decided to replace it with an array expression:

              associate(inn => i1f-i1i+1)
                 y(i1i:i1f,i2,i3) = y(i1i:i1f,i2,i3) + f(j1,j2,j3)*x(i1i-j1:i1i-j1+inn-1,i2-j2,i3-j3)
              end associate

The resulting assembly and flow control is hard to follow, but deep inside are the following two hot loops that execute fused multiply-adds:

  • a core loop (presumably) with packed vector instructions
    .LBB1_27:
            vmovups ymm2, ymmword ptr [r13 + 4*rbx]
            vfmadd213ps     ymm2, ymm1, ymmword ptr [rbp + 4*rbx]
            vmovups ymmword ptr [rbp + 4*rbx], ymm2
            add     rbx, 8
            cmp     rbx, r11
            jl      .LBB1_27
    
  • a scalar peel/remainder loop (presumably), with scalar instructions
    .LBB1_30:
            vmovss  xmm1, dword ptr [rsi + 4*r11]
            vfmadd213ss     xmm1, xmm0, dword ptr [rcx + 4*r11]
            vmovss  dword ptr [rcx + 4*r11], xmm1
            inc     r11
            cmp     rax, r11
            jne     .LBB1_30
    

Compiler Explorer recently added a nice feature named “Analysis” (I saw it in this video). You can open a new editor window, paste a snippet of assembly, and analyze it using a tool named llvm-mca which stands for LLVM Machine Code Analyzer.

For the packed instruction block, the analysis returns the following instruction count estimated from a simulated processor scheduling model:

Iterations:        100
Instructions:      600
Total Cycles:      147
Total uOps:        800

Dispatch Width:    6
uOps Per Cycle:    5.44
IPC:               4.08
Block RThroughput: 1.3

The estimate is that 100 loop iterations, i.e. 600 instructions (100 * 6), will be executed in (merely) 147 CPU cycles, for an instructions-per-cycle (IPC) count of approx. 4. In the bulk loop the fused multiply-add uses ymm registers, which are 256 bits wide and hold 8 single-precision reals each, meaning that in c. 150 cycles the CPU would process 800 array elements. This doesn’t seem all that bad; however, in reality IPC doesn’t give the full picture of whether the CPU is saturated.
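To put that in perspective (a rough estimate, assuming a Skylake core with two 256-bit FMA ports): 800 elements / 147 cycles ≈ 5.4 elements per cycle, against a theoretical arithmetic peak of 2 × 8 = 16 single-precision FMA lanes per cycle, which suggests the loop is limited by the loads and stores rather than by the arithmetic.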

Using either ifort 19.0.0 with the same compiler flags, or gfortran 13.2, I see the same hot loop with the same instructions (only the registers differ slightly):

..B2.44:                        # Preds ..B2.42 ..B2.40 ..B2.64
        vbroadcastss ymm1, xmm0                                 #46.64
..B2.45:                        # Preds ..B2.45 ..B2.44
        vmovups   ymm2, YMMWORD PTR [-4+r15+rdx*4]              #46.64
        vfmadd213ps ymm2, ymm1, YMMWORD PTR [-4+rbx+rdx*4]      #46.26
        vmovups   YMMWORD PTR [-4+rbx+rdx*4], ymm2              #46.26
        add       rdx, 8                                        #46.26
        cmp       rdx, r10                                      #46.26
        jb        ..B2.45       # Prob 82%                      #46.26

Roughly speaking, in Fortran terms the CPU does the following:

     a(1:8) = alpha                     ! scalar broadcast
  45 continue
     ytmp(1:8) = y(i:i+7)               ! load (unaligned)
     ytmp(1:8) = ytmp(1:8) + a*x(i:i+7) ! fused multiply-add
     y(i:i+7) = ytmp(1:8)               ! store (unaligned)
     i = i + 8                          ! increment loop counter
     if (i < n) goto 45                 ! conditional jump

Leaving the vectorization and layout issues aside, ultimately it’s the lop**3 scaling of the work that makes the naive loop approach unfeasible for large operators. This of course doesn’t explain any of the issues you observe with Intel MKL: I’d expect that in a log-log plot your loop version and the MKL “DIRECT” method would have the same slope.
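For a back-of-the-envelope comparison, here is a sketch of the expected flop counts (the constants are rough guesses, not measurements; only the scaling with lop matters):

program cost_sketch
   implicit none
   integer, parameter :: n = 100
   integer :: lop
   real :: direct, fft
   write(*,"(A4,2A14)") "LOP", "DIRECT", "FFT"
   do lop = 1, 25, 4
      direct = 2.0*real(n)**3*real(lop)**3            ! one multiply and one add per tap and output point
      fft    = 45.0*real(n)**3*log(real(n))/log(2.0)  ! ~3 FFTs of ~5*N*log2(N) each, with N = n**3
      write(*,"(I4,2ES14.3)") lop, direct, fft
   end do
end program cost_sketch

With these guessed constants the direct count overtakes the flat FFT estimate around lop ≈ 5, which is at least qualitatively consistent with the crossover in your plot.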

One minor issue worth investigating is whether linking against the sequential implementation of MKL (libmkl_sequential.a) instead of the threaded one (libmkl_intel_thread.a) brings any changes; it might be that the latter has a noticeable overhead even on one thread. You can select the sequential library with the flag -qmkl=sequential (the default is parallel).

But for a definite answer I’d suggest you ask your question on the Intel MKL Forum.


There is also an option to plot a control flow graph (it works with gfortran, but not with ifort): Compiler Explorer. In this visualization it’s slightly easier to identify the vectorized inner loop, the blocks handling the remainder, and the 5 outer loops.

Here is a PDF for easier navigation: https://drive.google.com/uc?export=download&id=1v0__2LKemD7bAdlrhi0gvsFpDPJgag3x (PDF, 39 KB download)



I have tried that (linking against the sequential MKL), and the timings are very similar.