I have spent two evenings with the cumulative sum in hopes of providing some closure to this thread. To that end I've ignored the warning from Knuth and the many earlier threads on Discourse which have discussed how micro-benchmarking rarely provides good evidence.
The way the question from the OP is framed,

> is there faster way to do cumsum what perhaps fully utilize vectorization or something?

tells us that he's interested in architecture-specific solutions. The answer will obviously depend upon the hardware, compiler, compiler settings, array size, memory layout, and other factors, making it difficult to give any kind of general answer. The best I can do is experiment.
In post #10, @mecej4 has brought up Knuth's warning, which also appears as the epigraph at the beginning of Chapter 1 of the book "Scientific Software Design" by Rouson, Xia & Xu. Reading further into the chapter, Rouson and coauthors introduce the central thesis of the book:
> Your Time is Worth More Than Your Computer's Time
>
> Total solution time and cost can be reduced in greater proportion by reducing development time and costs than by reducing computing time and costs.
If you are developing an application where performance is critical, I can only suggest you first profile your code using a tool like gprof or the Intel® VTune™ Profiler. This will help you identify the sections of your code which occupy most of the runtime, and where you should focus your efforts to gain speed-up.
Before you reject the canonical implementation with the do loop, it's fair to do a comparison with other programming languages. I've chosen the following three: C++, Python, and MATLAB.
I've placed all the benchmark codes I used here:
GitHub - ivan-pi/cumsum_benchmark: Some benchmarks for the cumulative sum of an array
Results I got for C++, Python, and MATLAB are the following:
As you can see, the C++ `std::partial_sum` implementation leads with just over 1000 million doubles per second. MATLAB is not trailing far behind.
For the Fortran `cumsum`, I've prepared some variations following the C++ codes of Prof. Daniel Lemire. Hopefully, there are no mistakes left.
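For reference, here is a minimal, self-contained sketch of the canonical do-loop version together with the kind of throughput measurement reported below (millions of doubles per second). The program name, array size, and measurement details are my own illustration and not necessarily identical to the code in the repository:

```fortran
! Sketch of the canonical do-loop cumulative sum plus a crude throughput
! measurement; illustrative only, not the exact code from the repository.
program cumsum_do_loop_demo
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 100000
  real(dp) :: x(n), r(n)
  integer(int64) :: t0, t1, rate
  real(dp) :: seconds

  call random_number(x)

  call system_clock(t0, rate)
  call cumsum(x, r)
  call system_clock(t1)

  seconds = real(t1 - t0, dp) / real(rate, dp)
  print '(a, f10.3)', 'throughput [million doubles/s]: ', real(n, dp) / seconds / 1.0e6_dp

contains

  ! Canonical cumulative sum: a single running sum carried through the loop.
  subroutine cumsum(a, b)
    real(dp), intent(in)  :: a(:)
    real(dp), intent(out) :: b(:)
    integer :: i
    real(dp) :: s
    s = 0.0_dp
    do i = 1, size(a)
      s = s + a(i)
      b(i) = s
    end do
  end subroutine cumsum

end program cumsum_do_loop_demo
```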
With gfortran I get the following results, in millions of doubles per second (columns correspond to array sizes N = 1000, 10000, 100000, 1000000; rows are distinct trials):
# Compiler version: GCC version 10.3.0
# Compiler options: -mtune=generic -march=x86-64 -O3 -Wall
# do_loop
391.389 828.912 1024.947 749.708
994.036 1066.553 1028.987 748.018
979.432 1006.239 967.090 753.746
1037.344 1070.320 1039.696 757.796
1018.330 1073.768 1041.558 753.090
1057.082 1074.691 1025.178 759.203
1009.082 1071.582 1033.506 747.828
# step 2
1592.357 1639.882 1334.971 746.350
1515.152 1682.369 1351.461 748.210
1605.136 1554.243 1352.174 665.873
1420.455 1618.123 1245.051 750.325
1517.451 1632.120 1347.073 744.820
1305.483 1500.600 1364.647 751.274
1515.152 1670.844 1322.576 735.441
# step 2 unrolled
1027.749 1040.908 1029.569 639.030
1001.001 1024.695 1014.641 701.242
1025.641 1073.537 1016.085 627.270
1018.330 996.711 933.158 623.454
1003.009 1032.418 864.655 686.328
1061.571 1066.780 1027.359 749.817
1060.445 1041.775 1036.506 753.150
The gfortran run crashed in the 3-step method, so no results are shown for it. As you can see, the simple do loop matches the performance of the C++ standard library. The `step2` version achieves roughly a 50 % performance increase for small array sizes, but for large arrays the advantage seems to vanish, presumably because the computation becomes memory-bound once the data no longer fits in cache. The second two-step version, `step2_unrolled`, provides no benefit over the canonical do loop. The moral of the story is: don't try to be smarter than the compiler, unless you are a low-level performance-tuning engineer with intimate knowledge of the cache organization and CPU instruction sets.
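For readers who do not want to dig through the repository, the two-elements-per-iteration idea behind `step2` is roughly the following (my reconstruction in the spirit of Lemire's C++ variants; the actual routine may differ in details). Both outputs of a pair are formed from the same running sum, which gives the compiler and CPU a little more independent work per iteration:

```fortran
! Hedged reconstruction of a two-elements-per-iteration cumulative sum;
! place it in a module or a contains block so the assumed-shape arguments
! get an explicit interface.
subroutine cumsum_step2(a, b)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), intent(in)  :: a(:)
  real(dp), intent(out) :: b(:)
  integer :: i, n
  real(dp) :: s

  n = size(a)
  s = 0.0_dp

  ! Two results per iteration, both derived from the same running sum.
  do i = 1, n - mod(n, 2), 2
    b(i)   = s + a(i)
    b(i+1) = s + a(i) + a(i+1)
    s      = b(i+1)
  end do

  ! Handle the last element when n is odd.
  if (mod(n, 2) == 1) b(n) = s + a(n)
end subroutine cumsum_step2
```

Whether this actually pays off clearly depends on the compiler and target, as the gfortran and Intel numbers in this post demonstrate.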
I also made a run with the Intel Fortran compiler:
# Compiler version: Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.3.0 Build 20210609_000000
# Compiler options: -warn all -O2 -o f_cumsum
# do_loop
280.269 443.027 524.172 458.147
532.481 532.992 514.099 490.898
513.611 517.813 510.485 496.566
534.474 522.166 524.929 475.125
518.941 517.518 506.827 497.843
533.333 519.292 516.054 484.909
533.333 522.384 516.316 482.826
# step 2
1216.545 1280.082 1071.432 719.369
1485.884 1548.947 1321.528 707.293
1479.290 1604.879 1374.627 652.008
1519.757 1534.684 1351.936 683.041
1519.757 1546.551 1353.858 742.416
1524.390 1540.357 1306.626 708.752
1483.680 1586.798 1366.139 651.351
# step 2 unrolled
326.584 333.400 344.710 334.904
350.877 336.655 345.354 334.240
340.252 351.136 352.399 338.091
356.379 344.187 345.642 331.063
344.353 344.982 344.578 328.011
339.789 335.435 328.517 325.161
344.234 333.901 332.197 323.531
# step 3
993.049 987.069 900.163 623.206
1199.041 1139.601 935.121 632.438
1447.178 1176.471 972.328 697.442
1552.795 1265.342 1035.540 691.362
1468.429 1104.850 1039.685 653.648
1108.647 1159.420 940.230 632.437
1335.113 1144.427 1054.863 708.983
# step four
585.823 659.979 614.881 286.593
683.527 708.617 635.300 294.837
796.813 726.111 624.836 288.141
772.798 738.007 595.745 275.712
816.327 745.268 649.199 255.521
653.595 667.824 539.110 264.546
788.644 745.935 642.422 282.847
Here the `step2_unrolled` version gives performance roughly equal to numpy. The basic `do_loop` version is about 50 % slower than with gfortran. The `step2` and `step3` versions appear to perform reasonably well. The `step_four` version is taken from the Stack Exchange thread linked in the original post, but it also doesn't provide much benefit over the simple do loop.
In conclusion, based upon my limited and imperfect evidence, I agree with @mecej4's wisdom, "Because Worse is Better". Using the simple do loop is a good return on the invested coding effort. Given the 6 hours I just invested in preparing the code for this post, I don't think the ~50 % increase from the two-step version is really worth it. I would probably prefer to wait a few years and buy a new CPU.
@CRquantum, please feel free to expand the code I put on GitHub with your own `cumsum` variations, or borrow it for other performance-related questions.
Concerning the implied do loop suggested by Brad (@everythingfunctional), the performance is in fact so terribly slow that I could not afford to wait for the results of the N > 10^4 cases. As indicated by @CRquantum, the work appears to scale as O(N^2), introducing a severe performance penalty. The implied-do version is roughly 500 times slower than the canonical do loop for the smallest array size (N = 1000), and drops another order of magnitude at N = 10000 (the two columns below correspond to these two sizes; a sketch of what such a version presumably looks like follows the timings):
# implied_do
2.250 0.211
2.245 0.208
2.166 0.206
2.148 0.208
2.086 0.210
2.204 0.212
2.238 0.210
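For context, I believe the implied-do formulation in question has roughly the following shape (my reconstruction, not necessarily Brad's exact code); each element recomputes its partial sum from scratch, hence the O(N^2) work:

```fortran
! Presumed shape of the implied-do cumulative sum (illustrative only).
! sum(x(1:i)) is re-evaluated for every i, so the work grows as O(N^2).
program implied_do_cumsum
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000
  integer :: i
  real(dp) :: x(n), r(n)

  call random_number(x)
  r = [(sum(x(1:i)), i = 1, n)]
  print *, r(n)
end program implied_do_cumsum
```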
Edit: the results were measured on a ThinkPad T530 laptop with an Intel(R) Core™ i5-3320M (first released in 2012). It would be interesting to learn what speeds newer processors achieve, and whether the hardware gains alone already outstrip tweaks like loop unrolling.