Will using Vectorization speed up the program?

ivanpribec · November 25, 2023, 7:25pm

Just reposting the graph here from the other thread.

(Full set of slides on the topic can be found here: NHR PerfLab Seminar: A Review of Processor Advances Over the Past Thirty Years)

RonShepard · November 26, 2023, 1:22am

These are peak bandwidth numbers. Should we assume that these are level-1 cache results?

ivanpribec · November 26, 2023, 12:08pm

Unfortunately, the author doesn’t state the conditions precisely in the video, nor which test was used. I’m assuming the numbers were measured with likwid-bench, since the author is one of its creators. The graph above shows the peak Flops for sequential mode (no SIMD instructions).

Running the stream benchmark myself on my 8-core Intel(R) Core™ i7-11700K @ 3.6 GHz processor (generation Rocket Lake, launched 2021), I get the following numbers:

$ likwid-bench -t stream -w S0:10MB
...
MByte/s:		265655.11
...
$ likwid-bench -t stream_avx -w S0:10MB
...
MByte/s:		272272.28
...

(most output has been omitted)

The 10 MB workload fits into the (shared) L3 cache, and the measured bandwidth roughly matches the Ice Lake figures above.
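As a rough check on what the two 10 MB runs say about the thread's question, the ratio of the stream_avx figure to the stream figure can be computed directly. The sketch below uses only the two MByte/s values quoted above; the variable names are made up for illustration, and a single run of each kernel says nothing about run-to-run variation.

```python
# Bandwidth figures copied from the two likwid-bench runs above (MByte/s).
scalar_mbytes_per_s = 265655.11  # likwid-bench -t stream     -w S0:10MB
avx_mbytes_per_s = 272272.28     # likwid-bench -t stream_avx -w S0:10MB

# Ratio of the AVX kernel to the scalar kernel, and the same as a percentage gain.
speedup = avx_mbytes_per_s / scalar_mbytes_per_s
gain_percent = (speedup - 1.0) * 100.0

print(f"stream_avx / stream = {speedup:.3f} ({gain_percent:.1f} % higher)")
```

For this working set the AVX kernel comes out only about 2.5 % ahead of the scalar one, which hints that the run is limited by cache bandwidth rather than by the instruction mix.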

With smaller workloads that fit into the L2 and L1 caches, the bandwidth is higher:

$ likwid-bench -t stream_avx -w S0:1MB   # 4 MB L2 cache (total)
...
MByte/s:		1063662.52
...
$ likwid-bench -t stream_avx -w S0:320kB   # 40 kB per core, each core has 48 kB L1d cache
...
MByte/s:		1969639.55
...

So my guess would be that the peak bandwidth in the graph was measured for the largest data cache (L3).
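To make the reasoning behind the three workload sizes explicit, here is a minimal sketch that works out which cache level each one should fit into. It assumes the working set is split evenly over the 8 cores (as the "40 kB per core" comment implies) and that kB and MB are decimal units. The L1d and L2 sizes are the ones quoted above; the 16 MB shared L3 size is my assumption and is not stated in the post. The function name is hypothetical.

```python
CORES = 8                         # from the post: 8-core i7-11700K
L1D_PER_CORE = 48_000             # from the post: 48 kB L1d per core
L2_PER_CORE = 4_000_000 // CORES  # from the post: 4 MB L2 in total
L3_SHARED = 16_000_000            # ASSUMPTION: 16 MB shared L3 (not in the post)


def cache_level(total_bytes: int) -> str:
    """Return the smallest cache level the working set should fit into."""
    per_core = total_bytes / CORES  # assumes an even split across cores
    if per_core <= L1D_PER_CORE:
        return "L1"
    if per_core <= L2_PER_CORE:
        return "L2"
    if total_bytes <= L3_SHARED:    # L3 is shared, so compare the total size
        return "L3"
    return "memory"


# The three -w sizes used in the likwid-bench runs above.
for label, size in [("320kB", 320_000), ("1MB", 1_000_000), ("10MB", 10_000_000)]:
    print(f"{label:>6}: {size / CORES / 1000:7.1f} kB per core -> {cache_level(size)}")
```

Under these assumptions the 320 kB, 1 MB and 10 MB workloads land in L1, L2 and L3 respectively, which is consistent with the bandwidth dropping at each step.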
