For the record, the block construct doesn’t seem to play any role. Also the step2
version does not appear to utilize any advanced vector extensions (AVX) due to data dependencies. This can be inferred from the Intel Optimization report:
CUMSUM_DO_LOOP
Source code:
!> Canonical cumsum
subroutine cumsum_do_loop(a,b)
real(dp), intent(in) :: a(:)
real(dp), intent(out) :: b(:)
integer :: i
real(dp) :: sm
b(1) = a(1)
do i = 2, size(a)
b(i) = b(i-1) + a(i)
end do
end subroutine
Compile instructions:
ifort -warn all -O3 -xHost -qopt-report-phase=loop,vec -qopt-report=2 cumsum_benchmark.f90 -o f_cumsum
Optimization report:
Begin optimization report for: CUMSUM_DO_LOOP
Report from: Loop nest & Vector optimizations [loop, vec]
LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Multiversioned v1>
remark #25233: Loop multiversioned for stride tests on Assumed shape arrays
remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
remark #15346: vector dependence: assumed FLOW dependence between B(I) (20:7) and B(I-1) (20:7)
remark #25439: unrolled with remainder by 2
LOOP END
LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Remainder, Multiversioned v1>
LOOP END
LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Multiversioned v2>
remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
remark #25439: unrolled with remainder by 2
LOOP END
LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Remainder, Multiversioned v2>
LOOP END
CUMSUM_STEP2
Source code:
!> Two-step cumsum
!> Source: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2012/07/23
subroutine cumsum_step2(a, b)
real(dp), intent(in) :: a(:)
real(dp), intent(out) :: b(:)
real(dp) :: s0, s1
integer :: i, n
s0 = 0.0_dp
s1 = 0.0_dp
n = size(a)
do i = 1, n, 2
block
real(dp) :: x0, x1
x0 = a(i)
x1 = a(i+1)
s0 = s1 + x0
s1 = s1 + (x1 + x0)
b(i) = s0
b(i+1) = s1
end block
end do
if (mod(n,2) /= 0) then
b(n) = b(n-1) + a(n)
end if
end subroutine
Compile instructions:
ifort -warn all -O3 -xHost -qopt-report-phase=loop,vec -qopt-report=2 cumsum_benchmark.f90 -o f_cumsum
Optimization report:
Begin optimization report for: CUMSUM_STEP2
Report from: Loop nest & Vector optimizations [loop, vec]
LOOP BEGIN at cumsum_benchmark.f90(76,5)
remark #25084: Preprocess Loopnests: Moving Out Store [ cumsum_benchmark.f90(81,9) ]
remark #25084: Preprocess Loopnests: Moving Out Store [ cumsum_benchmark.f90(82,9) ]
remark #25084: Preprocess Loopnests: Moving Out Store [ cumsum_benchmark.f90(86,5) ]
remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
remark #15346: vector dependence: assumed ANTI dependence between cumsum_step2 (81:9) and cumsum_step2 (82:9)
remark #25439: unrolled with remainder by 2
LOOP END
LOOP BEGIN at cumsum_benchmark.f90(76,5)
<Remainder>
LOOP END
I’ve tried to take a look at the differences with Godbolt (Compiler Explorer), but the reasons why the second executes faster for small workloads is way beyond my knowledge.
Here are the sections inside the loop body (for single precision):
CUMSUM_DO_LOOP
..B2.6: # Preds ..B2.4 ..B2.6
movss xmm1, DWORD PTR [r8+rdx*8] #19.14
movss xmm0, DWORD PTR [8+rax+rdx*8] #19.23
addss xmm1, DWORD PTR [4+rax+rdx*8] #19.7
movss DWORD PTR [4+r8+rdx*8], xmm1 #19.7
addss xmm1, xmm0 #19.7
movss DWORD PTR [8+r8+rdx*8], xmm1 #19.7
inc rdx #18.5
cmp rdx, rcx #18.5
jb ..B2.6 # Prob 63% #18.5
lea ebx, DWORD PTR [1+rdx+rdx]
As you can see the loop has been unrolled, and two additions are performed before jumping back to the start.
CUMSUM_STEP2
..B3.4: # Preds ..B3.4 ..B3.3
movss xmm2, DWORD PTR [r12+r8*4] #39.9
movaps xmm3, xmm0 #41.9
movss xmm1, DWORD PTR [r14+r8*4] #40.9
inc r9 #36.5
movss xmm5, DWORD PTR [rdx+r8*4] #39.9
addss xmm3, xmm2 #41.9
addss xmm2, xmm1 #42.23
movss xmm4, DWORD PTR [rsi+r8*4] #40.9
movaps xmm6, xmm5 #41.9
movss DWORD PTR [r10+rbx*4], xmm3 #43.9
add r8, r11 #36.5
addss xmm0, xmm2 #42.9
addss xmm5, xmm4 #42.23
addss xmm6, xmm0 #41.9
movss DWORD PTR [r13+rbx*4], xmm0 #44.9
addss xmm0, xmm5 #42.9
movss DWORD PTR [rax+rbx*4], xmm6 #43.9
movss DWORD PTR [rcx+rbx*4], xmm0 #44.9
add rbx, rbp #36.5
cmp r9, r15 #36.5
jb ..B3.4 # Prob 63% #36.5
mov r15d, DWORD PTR [-48+rsp] #[spill]