What is the fastest way to do cumulative sum in Fortran to mimic Matlab cumsum?

Your code is faulty: in the DO loop, when i = 1 the statement in the loop references qout(0), which is a bounds violation. You can fix this bug easily.

Please define what you mean by “best way” and what you mean by “mimic Matlab”.

1 Like

Thank you. I updated my description.
Thank you for pointing out the potential issue, my code is more for illustration, I am sorry it is a little sloppy. I would require A and B in the input start from index 1.
I would actually wish A can be any 1D array starting from any index, does not have to start from 1. In the i loop, i should start from the lbound(A), end at lbound(A) + size(A) -1. But that is not the most relevant for now.

I wish to find the ‘fastest’ way to do cumsum.

cumsum = [(sum(a(1:i)), i = 1, size(a))]

The compiler should be allowed to vectorize, parallelize, do whatever optimizations it likes with that.

1 Like

Correction inside loop:

do i = 2, size(A)
    B(i) = A(i-1) + A(i)
enddo

Also, the most elegant way is how @everythingfunctional suggested, but with cumsum corrected as an array not scalar.

pure function cumsum(a) result (r)
real(dp), intent(in) :: a(:)
real(dp) :: r(size(a))
integer :: i
r(:) = [(sum(a(1:i)),i=1,size(a))]
end function cumsum

then you simply call it as you initially intended: b=cumsum(a)

here is a working example link

1 Like

Thank you so much @stavros Yes, this is truly an elegant one line solution.

Indeed, let me say that I do not mean to offend anyone, nothing personal.

Just from business point of view, this elegant code may not be efficient enough. The cost is O( (size(a))^2 ).
The code I show is O( size(a) ) which is a factor of size(a) faster (however I wish to further improve it and make it faster.)

subroutine cumsum(a,b)
real(kind=r8) :: a(:),b(size(a))
integer(kind=i8) :: i
b(1)=a(1)
do i = 2, size(a)
    b(i) = b(i-1) + a(i)
enddo
return
end subroutine cumsum

You could check the speed difference with a array whose size(a) = 1000000.
Note that I did not set the size = 1000000 just to show that the elegant code is slow. I set it as 1000000 is because I need this number of particles in my particle filter (sequential Monte Carlo).

I can use

a + random array

to propose the move of all the particles randomly and fully using vectorization. This may be faster than do loop which loop over 1000000 particles.

Slightly more detailed, in the below code, np = 1000000,

noise = matmul(reshape(gaussian(gbl%nx*np),[np,gbl%nx]),sq)
x0tmp = spread(gbl%x0,1,np)
call mf(x0tmp,gbl%kel)
x = x0tmp + noise ! this is the a + random array


subroutine mf(x,kel)
real(kind=r8) :: x(:,:),kel
x(:,1) = exp(-0.2_r8*kel)*x(:,1)    
return
end subroutine mf

subroutine sysresample(q,i)
use random
integer(kind=i8) :: m,j,k,l
real(kind=r8) :: q(:),qc(size(q)),rn(1),u(size(q))
integer(kind=i8) :: i(size(q))
call cumsum(q,qc)
!qc=cumsum(q)
m = size(q)
rn=randn(1)
u = ( [(l, l = 0, m-1)] + rn(1))/m
i=0; k=1
do j=1,m
    do while (qc(k)<u(j))
        k = k + 1 
    enddo
    i(j) = k
enddo
return
end subroutine sysresample

subroutine cumsum(a,b)
real(kind=r8) :: a(:),b(size(a))
integer(kind=i8) :: i
b(1)=a(1)
do i = 2, size(a)
    b(i) = b(i-1) + a(i)
enddo
return
end subroutine cumsum

Note that subroutine sysresample may not be the most efficient. It was a translation from Schon’s matlab code,
http://user.it.uu.se/~thosc112/research/rao-blackwellized-particle.html

For those who are interested, can check a paper

See these parts in the paper


image

Oh, no offense at all!!! I am actually agreeing with you. My comment was only about elegant :grinning: In my actual code, which by the way is in a similar field to yours (Monte Carlo and particles) I use nothing functional within the main loops, perhaps only during the preprocessor part where speed doesn’t really matter, and even then not for large arrays.
Of course, there is nothing inherently wrong with the functional style, the problem is on the compiler’s side. Such expressions should be automatically vectorized/unrolled. What really happens is that every time such expression is called a temporary array is created. The difference in speed is in orders of magnitudes.
By the way, to further optimize use only subroutines, for *some reason code generated from subroutines is better optimized resulting in minor speed improvement, it is noticeable in large sizes.
*Extra code is generated to reallocate the allocatable array.

2 Likes
3 Likes

CRquantum, I am afraid that this thread illustrates why we should heed Knuth’s warning, “Premature optimization is the root of all evil”.

You focused on a small part (i.e., the formation of cumsum) of the overall algorithm, and until you posted #7, we did not know what you planned to do with the computed cumsum array. But now, it turns out that even if you could compute that array in zero time, your overall calculation is going to suffer from the slowness of linear search. The cumsum array is already in increasing order (strictly so, if none of the elements of q are zero), so you could do a binary search for u(j) in the array qc. Further study of the values that you may expect for the arrays u and qc could enable the binary search to be conducted with a segment of the array qc rather than with the entire array, …

4 Likes

@mecej4 Thanks. I got you. The piece I show is a small piece in my code.
True, I totally agree, one should not spend 90% of the time optimizing a part of the code which at most give like 1% speedup.
Uhm, but if there is a chance to speedup each piece and learn something from you guys, why not? :slight_smile:

The thing I mainly focused, actually is something else. But that may be a little off topic now (I may put these issues into smaller specific question and ask you guys one by one, as @VladimirF always reminds me).

If I may,
The true issue, is that perhaps you may ask, why do you need 10^6 particles to represent a distribution?
In the code, bringing down the number of particle from 10^6 to 10^3 or so, is where true speedup occur. Because in my code that reduces the number of stochastic differential equations (SDE) to be solved after all.

The subroutine sysresample is subroutine for resampling.

 sysresample(q,i)

q is an array size 10^6, each of its value means the normalized weight of a particle. I need to resample the new 10^6 particles according to weights. The i is the index to help me choose particles to form the new resampled array x from the old x. like below,

    q = q/sum(q)
    call sysresample(q,idx)
    x = x(idx,:) ! the x on the LHS is the resampled x, using the index from the x on the RHS.

I believe the sysresample subroutine I took from Thomas Schon, Software and data sets | Thomas Schön
is at most N*log(N) algorithm (which is not too bad) where N mean the number of particles which is the size of the array.
The Carpenter etal paper I showed in thread 7 is a perhaps O(N) algorithm.

Resampling, or sometimes people call bootstrap sampling or importance-sampling-resampling (SIR). Eg, see a paper,

The binary search you mentioned is like perhaps bisection method or something like that, is to find some particular values in an array. I could be wrong, but perhaps it is not exactly what I needed here now for resampling. But I do thank you very much indeed for your thoughts which is indeed very helpful, not only for me, but also for everyone here.

I think you missed the point here. The semantics of the one-line expression using implied do and the sum intrinsic contain sufficient information, such that an optimizing compiler could recognize this expression as the cumulative sum operation, and choose the fastest set of machine instructions for the target architecture.

At least for the hardware (x86-64) and compilers (gfortran and ifort) available to me, this does not appear to be the case today. A blog post from Daniel Lemire on “Is C++ worth it” provides an anecdote about cumulative sum performance in C++ and Java with a somewhat-related takeaway message.

Now to receive the answer you are looking for, you probably need to ask the more accurate question

How to obtain good performance for the cumulative sum for double precision reals on hardware A with compiler B?

@ivanpribec One line method is elegant that is obvious. I learned from this line a lot.

However, please try the one line method, and the regular do loop cumsum, for size=1000000 array.

On the one hand, on my laptop one-line method for size=1000000 array, after I hit enter it does not seem to give me answer.

Modern laptop’s CPU is Ghz level, so roughly 10^9 operations per second.
For the one line method, 10^6*10^6=10^12 operations will likely take 10^3 seconds. Considering some vectorization, it may took 10 to 10^2 seconds or so. That is my rough estimation.

I am afraid the Intel OneAPI compiler I use may not be that smart to magically reduce a N^2 method to N.

On the other hand, the do loop one stupid version gives me answer in less than 0.1 second.

I think my question and description is not too unclear for most people, and I am afraid I am not very capable of understanding your accurate question.

Perhaps ‘King’ at S.O gives me answer already,

Again, I can be absolutely wrong and I may indeed miss the point, but if you find the one-line method is indeed faster than the stupid do loop subroutine, please let me know with your concrete timing. I feel that may help more.

I accept all of your criticism, and I admit most people here have much higher ability in Fortran than me. That is why I come here and learn. But I also think that, if we can just focus on this small topic, how to write cumsum as fast as possible from technical point of view, it may be more positive and useful. But anyway, since there may not be too much room to improve for this cumsum, we may just move on to new topics in other threads.

I think @ivanpribec said that the one-liner is indeed slower (note especially the last sentence below) when he wrote

The semantics of the one-line expression using implied do and the sum intrinsic contain sufficient information, such that an optimizing compiler could recognize this expression as the cumulative sum operation, and choose the fastest set of machine instructions for the target architecture.

At least for the hardware (x86-64) and compilers ( gfortran and ifort ) available to me, this does not appear to be the case today

1 Like

That statement opens up a teachable moment. The answer is, “Because Worse is Better”.

In other words, if I have to spend considerable amounts of time describing and learning good techniques for each sub-task, by the time I learn all the techniques and put together a program to solve the complete problem, I may be too old to care or someone else may have solved it much sooner.

2 Likes

Thank you @Beliavsky .
If @ivanpribec could simply add However, and delete At least, in the beginning of the 2nd paragraph, such that

"

However, for the hardware (x86-64), …


"

Then the logic seems more clear and I can understand where my point is missing more easily.
So he is saying that he believe the one-liner gives the compiler enough information such that the compiler ‘intelligently’ know this is ‘cumulative sum’. However, in reality compilers are not that smart.

Actually I am not missing the point entirely. But feel free to say anything @ivanpribec , I am not really bothered by any criticism, as long as the logic is perhaps a little bit more clear. (9th floor the comment by @mecej4 is more clear to me.)

I know one-liner concise code is elegant and usually gives better performance, particularly when deal with array operations which can be vectorized.

See here,

This part by veryreverie

  do i=1,4
    ts(i) = t + sum(as(:i-1,i)) * h
    xs(i) = x + dot_product(as(:i-1,i), ks(:i-1))
    ks(i) = h * (fi(xs(i)) + gi(xs(i))*warray(i))
  enddo

I understand dot_product(as(:i-1,i), ks(:i-1)) is efficient. In that code, nothing is repeatedly calcuated more than twice.

But in the cumsum case, because you know there is some dependence there, like in order to go to step k+1, need to calculate step k first. Therefore although I wish, I do not know how the compiler is intelligent enough to realize it is a cumsum.
Unless the engineer perhaps particularly hard-code something in the compiler.

But that can be another topic for compiler engineer how to more intelligently optimized user’s code, particularly realize the logic behind the code. However, just for cumsum problem, even if the compiler realize it is cumsum, it will still calculate the cumsum use the stupid loop method perhaps. So it may come back the question again, how to write the fastest cumsum.

I have spent two evenings with the cumulative sum in hopes of providing some closure to this thread. To this effect I’ve ignored the warning from Knuth and the many earlier threads on Discourse which have discussed how micro-benchmarking rarely provides good evidence.

The way the question from the OP is framed,

is there faster way to do cumsum what perhaps fully utilize vectorization or something?

tells us that he’s interested in architecture-specific solutions. The answer will obviously depend upon the hardware, compiler, compiler settings, array size, memory layout, and other factors making it difficult to give any kind of general answer. Best I can do is try with experimentation.

In post #10, @mecej4 has brought up Knuth’s warning, which also appears as the epigraph at the beginning of Chapter 1 in the book “Scientific Software Design” by Rouson, Xia & Xu. Reading the chapter further, Rouson & coauthors introduce the central thesis of the book:

Your Time is Worth More Than Your Computer’s Time

Total solution time and cost can be reduced in greater proportion by reducing > development time and costs than by reducing computing time and costs.

If you are developing an application where performance is critical, I can only suggest you first profile your code using a tool like gprof or the Intel® VTune™ Profiler. This will help you to identify the sections of your code which occupy most of the runtime, and where you should focus your efforts to gain speed-up.


Before you reject the canonical implemention with the do loop, it’s fair to do a comparison with other programming languages. I’ve chosen the following three:

I’ve placed all the benchmark codes I used here:

GitHub - ivan-pi/cumsum_benchmark: Some benchmarks for the cumulative sum of an array

Results I got for C++, Python, and MATLAB are the following:

As you can see the C++ std::partial_sum implementation leads with just over 1000 million doubles per second. MATLAB is not trailing very far behind.

For Fortran cumsum, I’ve prepared some variations following the C++ codes of Prof. Daniel Lemire. Hopefully, there are no mistakes left.

With gfortran I get the following results (columns correspond to array sizes N = 1000, 10000, 100000, 1000000, rows are distinct trials):

 # Compiler version: GCC version 10.3.0
 # Compiler options: -mtune=generic -march=x86-64 -O3 -Wall
 # do_loop
  391.389   828.912  1024.947   749.708
  994.036  1066.553  1028.987   748.018
  979.432  1006.239   967.090   753.746
 1037.344  1070.320  1039.696   757.796
 1018.330  1073.768  1041.558   753.090
 1057.082  1074.691  1025.178   759.203
 1009.082  1071.582  1033.506   747.828
 # step 2
 1592.357  1639.882  1334.971   746.350
 1515.152  1682.369  1351.461   748.210
 1605.136  1554.243  1352.174   665.873
 1420.455  1618.123  1245.051   750.325
 1517.451  1632.120  1347.073   744.820
 1305.483  1500.600  1364.647   751.274
 1515.152  1670.844  1322.576   735.441
 # step 2 unrolled
 1027.749  1040.908  1029.569   639.030
 1001.001  1024.695  1014.641   701.242
 1025.641  1073.537  1016.085   627.270
 1018.330   996.711   933.158   623.454
 1003.009  1032.418   864.655   686.328
 1061.571  1066.780  1027.359   749.817
 1060.445  1041.775  1036.506   753.150

before experiencing a crash with the 3-step method. As you can see the simple do loop provides equal performance as the C++ standard library. The step2 version is able to achieve a 50 % performance increase for small array sizes, but with large array sizes, the advantage seems to vanish.

The second two step version step2_unrolled provides no benefit compared to the canonical do loop. Moral of the story is - don’t try to be smarter than the compiler, unless you are a low-level performance tuning engineer with intimate knowledge of the cache organization and CPU instructions sets.

I also made a run with the Intel Fortran compiler:

 # Compiler version: Intel(R) Fortran Intel(R) 64 Compiler Classic for applicati
 ons running on Intel(R) 64, Version 2021.3.0 Build 20210609_000000
 # Compiler options: -warn all -O2 -o f_cumsum
 # do_loop
  280.269   443.027   524.172   458.147
  532.481   532.992   514.099   490.898
  513.611   517.813   510.485   496.566
  534.474   522.166   524.929   475.125
  518.941   517.518   506.827   497.843
  533.333   519.292   516.054   484.909
  533.333   522.384   516.316   482.826
 # step 2
 1216.545  1280.082  1071.432   719.369
 1485.884  1548.947  1321.528   707.293
 1479.290  1604.879  1374.627   652.008
 1519.757  1534.684  1351.936   683.041
 1519.757  1546.551  1353.858   742.416
 1524.390  1540.357  1306.626   708.752
 1483.680  1586.798  1366.139   651.351
 # step 2 unrolled
  326.584   333.400   344.710   334.904
  350.877   336.655   345.354   334.240
  340.252   351.136   352.399   338.091
  356.379   344.187   345.642   331.063
  344.353   344.982   344.578   328.011
  339.789   335.435   328.517   325.161
  344.234   333.901   332.197   323.531
 # step 3
  993.049   987.069   900.163   623.206
 1199.041  1139.601   935.121   632.438
 1447.178  1176.471   972.328   697.442
 1552.795  1265.342  1035.540   691.362
 1468.429  1104.850  1039.685   653.648
 1108.647  1159.420   940.230   632.437
 1335.113  1144.427  1054.863   708.983
 # step four
  585.823   659.979   614.881   286.593
  683.527   708.617   635.300   294.837
  796.813   726.111   624.836   288.141
  772.798   738.007   595.745   275.712
  816.327   745.268   649.199   255.521
  653.595   667.824   539.110   264.546
  788.644   745.935   642.422   282.847

Here the step2_unrolled version gives performance roughly equal to numpy. The basic do_loop version is 50 % slower than gfortran. The step2 and step3 versions appear to perform reasonably well.

The step_four version is taken from the Stack Exchange thread linked in the original post, but it also doesn’t provide much benefit against the simple do loop.

In conclusion based upon my limited and imperfect evidence, I agree with @mecej4’s wisdom, “Because Worse is Better”. Using the simple do loop is a good return on the invested coding effort. Given the 6 hours I just invested in preparing the code for this post, I don’t think the 50 % increase in two-step unrolled version is really worth it. I would probably prefer to wait a few years, and buy a new CPU.

@CRquantum, please feel free to expand the code I put on GitHub with your own cumsum variations or borrow it for other performance-related questions.


Concerning the implied do loop suggested by Brad (@everythingfunctional), the performance is in fact terribly slow, that I could not afford to wait for the results of the N > 10^4 cases. As indicated by @CRquantum, the work appears to scale as O(N^2) introducing a severe performance penalty. For N = 1000 the implied do version is roughly 500 times slower for the smallest array size, and drops another order of magnitude for N = 10000:

 # implied_do
    2.250     0.211
    2.245     0.208
    2.166     0.206
    2.148     0.208
    2.086     0.210
    2.204     0.212
    2.238     0.210

Edit: the results were measured on a ThinkPad T530 laptop with an Intel(R) Core™ i5-3320M (first released 2012). It would be interesting to learn what speed new processors achieve, and whether the gains already outperform the tweaks like loop unrolling.

6 Likes

Thank you @ivanpribec that is truly remarkable work! :+1:

My motivation is just that I saw matlab have cumsum, but cumsum in matlab is just like an API, I do not know its source code. So I wonder in Fortran can we have the fastest cunsum algorithm. I just feel that a fastest algorithm overall, should not depend on hardware too much.

I have downloaded your code, and I will read your code carefully.
Right now, you may use

procedure() :: 

so that there is no need to write the interface blocks.

Eg, in your fortran file line 198,

subroutine run_test(ntrials,nreps,n,cumsum)
integer, intent(in) :: ntrials, nreps
integer, intent(in) :: n(:)
interface
  subroutine cumsum(a,b)
    import dp
    real(dp), intent(in) :: a(:)
    real(dp), intent(out) :: b(:)
  end subroutine
end interface

I believe you may do

subroutine run_test(ntrials,nreps,n,cumsum)
use cumsum_mod
integer, intent(in) :: ntrials, nreps
integer, intent(in) :: n(:)
procedure(cumsum_do_loop) :: cumsum

The one line

procedure(cumsum_do_loop) :: cumsum

will automatically take care of the interface for the cumsum.
Actually I think you may pick any name of your cumsum subroutines inside the () in procedure(), because they have the same interface.

I just ran your code, here is the results. I am using Thinkpad P72 with Xeon-2186M CPU, perhaps it may of some help for your to see how the performance of CPU increased over the years (I do not really think single core performance increased much since Intel’s
Haswell architecture)

Your subroutine cumsum_step2(a, b) looks like the best cumsum. The block trick is new to me, and it looks very interesting. Perhaps it fully takes advantage of AVX2, perhaps for CPU with AVX512, step2 may be even better.

subroutine cumsum_step2(a, b)
real(dp), intent(in) :: a(:)
real(dp), intent(out) :: b(:)

real(dp) :: s0, s1

integer :: i, n

s0 = 0.0_dp
s1 = 0.0_dp

n = size(a)

do i = 1, n, 2
  block
    real(dp) :: x0, x1
    x0 = a(i)
    x1 = a(i+1)
    s0 = s1 + x0
    s1 = s1 + (x1 + x0)
    b(i) = s0
    b(i+1) = s1
  end block
end do

if (mod(n,2) /= 0) then
  b(n) = b(n-1) + a(n)
end if
end subroutine

Here is a modified version of the main loop in CRQuantum’s sysresample() subroutine in his post #7,

k=1
qcum = q(k)
do j=1,m
   do while (qcum < u(j))
      k = k + 1 
      qcum = qcum+q(k)
   enddo
   i(j) = k
enddo

I have not tested the code, so it may need correction.

Note that it makes the qc array and, therefore, the function (or subroutine) cumsum superfluous. If this version works well, we have in this thread discussed some great answers to the wrong question as far as the task of doing CRQuantum’s Monte Carlo simulation is concerned.

On the other hand, if there is sufficient demand for cumsum, it could be added to the list of Fortran intrinsic functions.

2 Likes

@mecej4 Thank you! Your code is correct.
Yes we indeed prevented the use of cumsum function/subroutine explicitly. Uhm, however the calculation of cumsum is still there in the loop, and the time cost is the same if not more. But nonetheless your code/suggestion is insightful!

The reason I am curious about cumsum is from Thomas Schon’s Matlab code,
http://user.it.uu.se/~thosc112/research/rao-blackwellized-particle.html
In the pf.m, he uses his sysresample function which uses cumsum, Matlab code is below,

function i=sysresample(q)
qc=cumsum(q);    M=length(q);
u=([0:M-1]+rand(1))/M;
i=zeros(1,M);    k=1;
for j=1:M
  while (qc(k)<u(j))
    k=k+1;
  end
  i(j)=k;
end
end 

I support putting cumsum into Fortran’s intrinsic function like the Matlab one, Cumulative sum - MATLAB cumsum

I found a link, leonfoks seems already did something about this,

https://leonfoks.github.io/coretran/interface/cumsum.html


PS.
Just to be clear, I do not have any intention to waste your guy’s time to solve my problem for me.
In this post I was simply just asking about cumsum function because it is simple enough.
I know cumsum not something really slow down my code because it is O(N) which is not too bad after all.

Just in case someone want to see a minimal working example, I put my not even baked MWE in the link, Files · master · Chen Rong / SDE-PF · GitLab
If you have visual studio + intel OneAPI installed, just click the .sln file and it will open the project, just build and run, it should take 0.15 sec or less.
main.f90 is main program. The cumsum is in the file pf.f90, that is the module pf (particle filter), in the sysresample subroutine. around line 153. Above it, which is commented now is my translated version with cumsum, below line 153 is the one @mecej4 suggested without explicit cumsum.

The very naive SDE solver (which is not used in this example in this thread) I slightly wrote is based on Kasdin’s paper and John Burkardt’s code

Note that John’s stochastic_RK is a scalar version which only solve 1 equation. While Kasdin’s paper provides a general solution which can solve a system of SDEs, so it is actually a vector version.
I plan to learn more from FLINT ODE code by @prince_mahajan

and make my SDE solver much better than it is now.

About SDE and Kasdin’s method, a relevant S.O post is,

The comments there by Lutz Lehmann about SDE is very useful.
I also asked Kasdin’s method at

Chris Rack told me that Kasdin’s Runge-Kutta method is like order 0.5 than order 4. But nonetheless here I will begin with Kasdin’s method.

Thing is, in Fortran, we does not seem to have some good SDE solver now it seems.

I don’t see why it would cost more.

By using a scalar variable qcum to store the partial sum, the code from mecej4 got rid of a reference to the qc array. Instead of needing to fetch both qc and u values from the cache, only u will be fetched and qcum will likely remain in one the registers. I would expect this to provide a small but perhaps significant speed-up to the sysresample routine. I am happy to be proven wrong.

Of course we can only be certain after performing an experiment and/or analyzing the output assembly code. This reminds me of some projects I came across recently:

2 Likes

For the record, the block construct doesn’t seem to play any role. Also the step2 version does not appear to utilize any advanced vector extensions (AVX) due to data dependencies. This can be inferred from the Intel Optimization report:

CUMSUM_DO_LOOP

Source code:

  !> Canonical cumsum
  subroutine cumsum_do_loop(a,b)
    real(dp), intent(in) :: a(:)
    real(dp), intent(out) :: b(:)

    integer :: i
    real(dp) :: sm

    b(1) = a(1)
    do i = 2, size(a)
      b(i) = b(i-1) + a(i)
    end do
  end subroutine

Compile instructions:

ifort -warn all -O3 -xHost -qopt-report-phase=loop,vec -qopt-report=2 cumsum_benchmark.f90 -o f_cumsum

Optimization report:

Begin optimization report for: CUMSUM_DO_LOOP

    Report from: Loop nest & Vector optimizations [loop, vec]


LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Multiversioned v1>
   remark #25233: Loop multiversioned for stride tests on Assumed shape arrays
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed FLOW dependence between B(I) (20:7) and B(I-1) (20:7)
   remark #25439: unrolled with remainder by 2  
LOOP END

LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Remainder, Multiversioned v1>
LOOP END

LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Multiversioned v2>
   remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
   remark #25439: unrolled with remainder by 2  
LOOP END

LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Remainder, Multiversioned v2>
LOOP END
CUMSUM_STEP2

Source code:

  !> Two-step cumsum
  !> Source: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2012/07/23
  subroutine cumsum_step2(a, b)
    real(dp), intent(in) :: a(:)
    real(dp), intent(out) :: b(:)
    real(dp) :: s0, s1
    integer :: i, n

    s0 = 0.0_dp
    s1 = 0.0_dp

    n = size(a)

    do i = 1, n, 2
      block
        real(dp) :: x0, x1
        x0 = a(i)
        x1 = a(i+1)
        s0 = s1 + x0
        s1 = s1 + (x1 + x0)
        b(i) = s0
        b(i+1) = s1
      end block
    end do

    if (mod(n,2) /= 0) then
      b(n) = b(n-1) + a(n)
    end if
  end subroutine

Compile instructions:

ifort -warn all -O3 -xHost -qopt-report-phase=loop,vec -qopt-report=2 cumsum_benchmark.f90 -o f_cumsum

Optimization report:

Begin optimization report for: CUMSUM_STEP2

    Report from: Loop nest & Vector optimizations [loop, vec]


LOOP BEGIN at cumsum_benchmark.f90(76,5)
   remark #25084: Preprocess Loopnests: Moving Out Store    [ cumsum_benchmark.f90(81,9) ]
   remark #25084: Preprocess Loopnests: Moving Out Store    [ cumsum_benchmark.f90(82,9) ]
   remark #25084: Preprocess Loopnests: Moving Out Store    [ cumsum_benchmark.f90(86,5) ]
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed ANTI dependence between cumsum_step2 (81:9) and cumsum_step2 (82:9)
   remark #25439: unrolled with remainder by 2  
LOOP END

LOOP BEGIN at cumsum_benchmark.f90(76,5)
<Remainder>
LOOP END

I’ve tried to take a look at the differences with Godbolt (Compiler Explorer), but the reasons why the second executes faster for small workloads is way beyond my knowledge.

Here are the sections inside the loop body (for single precision):

CUMSUM_DO_LOOP
..B2.6:                         # Preds ..B2.4 ..B2.6
        movss     xmm1, DWORD PTR [r8+rdx*8]                    #19.14
        movss     xmm0, DWORD PTR [8+rax+rdx*8]                 #19.23
        addss     xmm1, DWORD PTR [4+rax+rdx*8]                 #19.7
        movss     DWORD PTR [4+r8+rdx*8], xmm1                  #19.7
        addss     xmm1, xmm0                                    #19.7
        movss     DWORD PTR [8+r8+rdx*8], xmm1                  #19.7
        inc       rdx                                           #18.5
        cmp       rdx, rcx                                      #18.5
        jb        ..B2.6        # Prob 63%                      #18.5
        lea       ebx, DWORD PTR [1+rdx+rdx]   

As you can see the loop has been unrolled, and two additions are performed before jumping back to the start.

CUMSUM_STEP2
..B3.4:                         # Preds ..B3.4 ..B3.3
        movss     xmm2, DWORD PTR [r12+r8*4]                    #39.9
        movaps    xmm3, xmm0                                    #41.9
        movss     xmm1, DWORD PTR [r14+r8*4]                    #40.9
        inc       r9                                            #36.5
        movss     xmm5, DWORD PTR [rdx+r8*4]                    #39.9
        addss     xmm3, xmm2                                    #41.9
        addss     xmm2, xmm1                                    #42.23
        movss     xmm4, DWORD PTR [rsi+r8*4]                    #40.9
        movaps    xmm6, xmm5                                    #41.9
        movss     DWORD PTR [r10+rbx*4], xmm3                   #43.9
        add       r8, r11                                       #36.5
        addss     xmm0, xmm2                                    #42.9
        addss     xmm5, xmm4                                    #42.23
        addss     xmm6, xmm0                                    #41.9
        movss     DWORD PTR [r13+rbx*4], xmm0                   #44.9
        addss     xmm0, xmm5                                    #42.9
        movss     DWORD PTR [rax+rbx*4], xmm6                   #43.9
        movss     DWORD PTR [rcx+rbx*4], xmm0                   #44.9
        add       rbx, rbp                                      #36.5
        cmp       r9, r15                                       #36.5
        jb        ..B3.4        # Prob 63%                      #36.5
        mov       r15d, DWORD PTR [-48+rsp]                     #[spill]