What is the fastest way to do cumulative sum in Fortran to mimic Matlab cumsum?

For the record, the block construct doesn’t seem to play any role. Also the step2 version does not appear to utilize any advanced vector extensions (AVX) due to data dependencies. This can be inferred from the Intel Optimization report:

CUMSUM_DO_LOOP

Source code:

  !> Canonical cumsum
  subroutine cumsum_do_loop(a,b)
    real(dp), intent(in) :: a(:)
    real(dp), intent(out) :: b(:)

    integer :: i
    real(dp) :: sm

    b(1) = a(1)
    do i = 2, size(a)
      b(i) = b(i-1) + a(i)
    end do
  end subroutine

Compile instructions:

ifort -warn all -O3 -xHost -qopt-report-phase=loop,vec -qopt-report=2 cumsum_benchmark.f90 -o f_cumsum

Optimization report:

Begin optimization report for: CUMSUM_DO_LOOP

    Report from: Loop nest & Vector optimizations [loop, vec]


LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Multiversioned v1>
   remark #25233: Loop multiversioned for stride tests on Assumed shape arrays
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed FLOW dependence between B(I) (20:7) and B(I-1) (20:7)
   remark #25439: unrolled with remainder by 2  
LOOP END

LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Remainder, Multiversioned v1>
LOOP END

LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Multiversioned v2>
   remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
   remark #25439: unrolled with remainder by 2  
LOOP END

LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Remainder, Multiversioned v2>
LOOP END
CUMSUM_STEP2

Source code:

  !> Two-step cumsum
  !> Source: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2012/07/23
  subroutine cumsum_step2(a, b)
    real(dp), intent(in) :: a(:)
    real(dp), intent(out) :: b(:)
    real(dp) :: s0, s1
    integer :: i, n

    s0 = 0.0_dp
    s1 = 0.0_dp

    n = size(a)

    do i = 1, n, 2
      block
        real(dp) :: x0, x1
        x0 = a(i)
        x1 = a(i+1)
        s0 = s1 + x0
        s1 = s1 + (x1 + x0)
        b(i) = s0
        b(i+1) = s1
      end block
    end do

    if (mod(n,2) /= 0) then
      b(n) = b(n-1) + a(n)
    end if
  end subroutine

Compile instructions:

ifort -warn all -O3 -xHost -qopt-report-phase=loop,vec -qopt-report=2 cumsum_benchmark.f90 -o f_cumsum

Optimization report:

Begin optimization report for: CUMSUM_STEP2

    Report from: Loop nest & Vector optimizations [loop, vec]


LOOP BEGIN at cumsum_benchmark.f90(76,5)
   remark #25084: Preprocess Loopnests: Moving Out Store    [ cumsum_benchmark.f90(81,9) ]
   remark #25084: Preprocess Loopnests: Moving Out Store    [ cumsum_benchmark.f90(82,9) ]
   remark #25084: Preprocess Loopnests: Moving Out Store    [ cumsum_benchmark.f90(86,5) ]
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed ANTI dependence between cumsum_step2 (81:9) and cumsum_step2 (82:9)
   remark #25439: unrolled with remainder by 2  
LOOP END

LOOP BEGIN at cumsum_benchmark.f90(76,5)
<Remainder>
LOOP END

I’ve tried to take a look at the differences with Godbolt (Compiler Explorer), but the reasons why the second executes faster for small workloads is way beyond my knowledge.

Here are the sections inside the loop body (for single precision):

CUMSUM_DO_LOOP
..B2.6:                         # Preds ..B2.4 ..B2.6
        movss     xmm1, DWORD PTR [r8+rdx*8]                    #19.14
        movss     xmm0, DWORD PTR [8+rax+rdx*8]                 #19.23
        addss     xmm1, DWORD PTR [4+rax+rdx*8]                 #19.7
        movss     DWORD PTR [4+r8+rdx*8], xmm1                  #19.7
        addss     xmm1, xmm0                                    #19.7
        movss     DWORD PTR [8+r8+rdx*8], xmm1                  #19.7
        inc       rdx                                           #18.5
        cmp       rdx, rcx                                      #18.5
        jb        ..B2.6        # Prob 63%                      #18.5
        lea       ebx, DWORD PTR [1+rdx+rdx]   

As you can see the loop has been unrolled, and two additions are performed before jumping back to the start.

CUMSUM_STEP2
..B3.4:                         # Preds ..B3.4 ..B3.3
        movss     xmm2, DWORD PTR [r12+r8*4]                    #39.9
        movaps    xmm3, xmm0                                    #41.9
        movss     xmm1, DWORD PTR [r14+r8*4]                    #40.9
        inc       r9                                            #36.5
        movss     xmm5, DWORD PTR [rdx+r8*4]                    #39.9
        addss     xmm3, xmm2                                    #41.9
        addss     xmm2, xmm1                                    #42.23
        movss     xmm4, DWORD PTR [rsi+r8*4]                    #40.9
        movaps    xmm6, xmm5                                    #41.9
        movss     DWORD PTR [r10+rbx*4], xmm3                   #43.9
        add       r8, r11                                       #36.5
        addss     xmm0, xmm2                                    #42.9
        addss     xmm5, xmm4                                    #42.23
        addss     xmm6, xmm0                                    #41.9
        movss     DWORD PTR [r13+rbx*4], xmm0                   #44.9
        addss     xmm0, xmm5                                    #42.9
        movss     DWORD PTR [rax+rbx*4], xmm6                   #43.9
        movss     DWORD PTR [rcx+rbx*4], xmm0                   #44.9
        add       rbx, rbp                                      #36.5
        cmp       r9, r15                                       #36.5
        jb        ..B3.4        # Prob 63%                      #36.5
        mov       r15d, DWORD PTR [-48+rsp]                     #[spill]