What is the fastest way to do cumulative sum in Fortran to mimic Matlab cumsum?

ivanpribec · October 5, 2021, 10:39pm

For the record, the block construct doesn’t seem to play any role. Also the step2 version does not appear to utilize any advanced vector extensions (AVX) due to data dependencies. This can be inferred from the Intel Optimization report:

CUMSUM_DO_LOOP

Source code:

  !> Canonical cumsum
  subroutine cumsum_do_loop(a,b)
    real(dp), intent(in) :: a(:)
    real(dp), intent(out) :: b(:)

    integer :: i
    real(dp) :: sm

    b(1) = a(1)
    do i = 2, size(a)
      b(i) = b(i-1) + a(i)
    end do
  end subroutine

Compile instructions:

ifort -warn all -O3 -xHost -qopt-report-phase=loop,vec -qopt-report=2 cumsum_benchmark.f90 -o f_cumsum

Optimization report:

Begin optimization report for: CUMSUM_DO_LOOP

    Report from: Loop nest & Vector optimizations [loop, vec]


LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Multiversioned v1>
   remark #25233: Loop multiversioned for stride tests on Assumed shape arrays
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed FLOW dependence between B(I) (20:7) and B(I-1) (20:7)
   remark #25439: unrolled with remainder by 2  
LOOP END

LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Remainder, Multiversioned v1>
LOOP END

LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Multiversioned v2>
   remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
   remark #25439: unrolled with remainder by 2  
LOOP END

LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Remainder, Multiversioned v2>
LOOP END

CUMSUM_STEP2

Source code:

  !> Two-step cumsum
  !> Source: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2012/07/23
  subroutine cumsum_step2(a, b)
    real(dp), intent(in) :: a(:)
    real(dp), intent(out) :: b(:)
    real(dp) :: s0, s1
    integer :: i, n

    s0 = 0.0_dp
    s1 = 0.0_dp

    n = size(a)

    do i = 1, n, 2
      block
        real(dp) :: x0, x1
        x0 = a(i)
        x1 = a(i+1)
        s0 = s1 + x0
        s1 = s1 + (x1 + x0)
        b(i) = s0
        b(i+1) = s1
      end block
    end do

    if (mod(n,2) /= 0) then
      b(n) = b(n-1) + a(n)
    end if
  end subroutine

Compile instructions:

ifort -warn all -O3 -xHost -qopt-report-phase=loop,vec -qopt-report=2 cumsum_benchmark.f90 -o f_cumsum

Optimization report:

Begin optimization report for: CUMSUM_STEP2

    Report from: Loop nest & Vector optimizations [loop, vec]


LOOP BEGIN at cumsum_benchmark.f90(76,5)
   remark #25084: Preprocess Loopnests: Moving Out Store    [ cumsum_benchmark.f90(81,9) ]
   remark #25084: Preprocess Loopnests: Moving Out Store    [ cumsum_benchmark.f90(82,9) ]
   remark #25084: Preprocess Loopnests: Moving Out Store    [ cumsum_benchmark.f90(86,5) ]
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed ANTI dependence between cumsum_step2 (81:9) and cumsum_step2 (82:9)
   remark #25439: unrolled with remainder by 2  
LOOP END

LOOP BEGIN at cumsum_benchmark.f90(76,5)
<Remainder>
LOOP END

I’ve tried to take a look at the differences with Godbolt (Compiler Explorer), but the reasons why the second executes faster for small workloads is way beyond my knowledge.

Here are the sections inside the loop body (for single precision):

CUMSUM_DO_LOOP

..B2.6:                         # Preds ..B2.4 ..B2.6
        movss     xmm1, DWORD PTR [r8+rdx*8]                    #19.14
        movss     xmm0, DWORD PTR [8+rax+rdx*8]                 #19.23
        addss     xmm1, DWORD PTR [4+rax+rdx*8]                 #19.7
        movss     DWORD PTR [4+r8+rdx*8], xmm1                  #19.7
        addss     xmm1, xmm0                                    #19.7
        movss     DWORD PTR [8+r8+rdx*8], xmm1                  #19.7
        inc       rdx                                           #18.5
        cmp       rdx, rcx                                      #18.5
        jb        ..B2.6        # Prob 63%                      #18.5
        lea       ebx, DWORD PTR [1+rdx+rdx]

As you can see the loop has been unrolled, and two additions are performed before jumping back to the start.

CUMSUM_STEP2

..B3.4:                         # Preds ..B3.4 ..B3.3
        movss     xmm2, DWORD PTR [r12+r8*4]                    #39.9
        movaps    xmm3, xmm0                                    #41.9
        movss     xmm1, DWORD PTR [r14+r8*4]                    #40.9
        inc       r9                                            #36.5
        movss     xmm5, DWORD PTR [rdx+r8*4]                    #39.9
        addss     xmm3, xmm2                                    #41.9
        addss     xmm2, xmm1                                    #42.23
        movss     xmm4, DWORD PTR [rsi+r8*4]                    #40.9
        movaps    xmm6, xmm5                                    #41.9
        movss     DWORD PTR [r10+rbx*4], xmm3                   #43.9
        add       r8, r11                                       #36.5
        addss     xmm0, xmm2                                    #42.9
        addss     xmm5, xmm4                                    #42.23
        addss     xmm6, xmm0                                    #41.9
        movss     DWORD PTR [r13+rbx*4], xmm0                   #44.9
        addss     xmm0, xmm5                                    #42.9
        movss     DWORD PTR [rax+rbx*4], xmm6                   #43.9
        movss     DWORD PTR [rcx+rbx*4], xmm0                   #44.9
        add       rbx, rbp                                      #36.5
        cmp       r9, r15                                       #36.5
        jb        ..B3.4        # Prob 63%                      #36.5
        mov       r15d, DWORD PTR [-48+rsp]                     #[spill]

Topic		Replies	Views
What is the most convenient way to create a matrix with given elements?	4	569	September 29, 2021
Matlab sub2ind in Fortran	10	308	June 27, 2024
Suitesparse equivalent in Fortran Help	14	2807	January 18, 2022
Array comprehensions in Fortran Help	23	2870	January 12, 2022
Simple summation 8x slower than in Julia	89	14983	April 2, 2022

What is the fastest way to do cumulative sum in Fortran to mimic Matlab cumsum?

Related topics