Does the speed (bandwidth) of memory matter?

I noticed recently that there are at least two threads that talked about big matrix and/or array operations.

I mean, for those operations, you can either use vectorization (whole-array operations) to do the matrix operations, or write your own code and do them with big do loops. But in either case, the bottleneck is actually the speed of the memory, isn't it?

In short, my question is: in such memory-operation-heavy code, does the speed (bandwidth) of memory matter?
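For concreteness, here is a minimal sketch (sizes made up) of the kind of operation I mean; whether it is written as a whole-array expression or as an explicit do loop, each element is touched only once, so memory has to keep feeding the CPU:

    program array_add
      implicit none
      integer, parameter :: n = 100000000   ! ~800 MB per array, far beyond cache
      real(8), allocatable :: a(:), b(:), c(:)
      integer :: i
      allocate(a(n), b(n), c(n))
      a = 1.0d0; b = 2.0d0
      c = a + b            ! vectorized whole-array form
      do i = 1, n          ! equivalent explicit do-loop form
         c(i) = a(i) + b(i)
      end do
      print *, c(n)
    end program array_add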

PS.

I mean, suppose the CPU can process double-precision numbers at 200 GB/s but the memory can only deliver 20 GB/s. If the code depends heavily on memory operations, then it seems that increasing the speed (bandwidth) of the memory would be very profitable.

That seems to be what Apple's M1 and its memory are doing,

See here,

It seems the M1 family's memory bandwidth (speed) is much higher than that of some PCs (such as mine, a Xeon 2186M with DDR4-2666 ECC, which only has 35 GB/s or so of bandwidth; even the M1 in the MacBook Air is 2X faster than mine in terms of memory speed).

It also reminds me that High-Bandwidth Memory (HBM) is an important topic in HPC. I remember AMD and Intel both implemented HBM techniques in some way. AMD now puts a big cache on its chips, and Intel had the Xeon Phi Knights Landing (KNL) with HBM (Intel called it MCDRAM or something like that; I had the luck of using 5120 Xeon Phi cores on a cluster I have access to; Intel has since given up on Phi and put more effort into GPU computing, I guess).


Interesting question. I suspect it does, although I have no evidence to support my claim.
In my experience the cache hit/miss ratio tends to be more important when thinking about performance.

For example, in finite element methods, using an element-local cache that stores copies of some values from the global arrays would probably result in better performance than accessing the global arrays directly.
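Something like the following is what I have in mind (a minimal sketch; the connectivity and sizes are made up purely for illustration):

    program fem_local_cache
      implicit none
      integer, parameter :: nelem = 1000, nen = 4
      integer, parameter :: nnode = nelem + nen - 1
      integer :: conn(nen, nelem), e, a
      real(8) :: u_global(nnode), r_global(nnode), u_local(nen), r_local(nen)

      do e = 1, nelem                       ! fake 1D connectivity, illustration only
         do a = 1, nen
            conn(a, e) = e + a - 1
         end do
      end do
      call random_number(u_global)
      r_global = 0.0d0

      do e = 1, nelem
         u_local = u_global(conn(:, e))     ! gather into a small element-local copy
         r_local = 2.0d0 * u_local          ! all element work uses the cached copy
         r_global(conn(:, e)) = r_global(conn(:, e)) + r_local   ! scatter back
      end do
      print *, 'checksum =', sum(r_global)
    end program fem_local_cache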


The Roofline model has helped me understand issues of performance.
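In its simplest form, the model caps a kernel's attainable performance at

    P_attainable = min(P_peak, I * B_mem)

where I is the kernel's arithmetic intensity (flops per byte of memory traffic) and B_mem is the memory bandwidth; low-intensity kernels sit on the bandwidth "roof" no matter how fast the cores are.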


Yeah, I am not very sure.
Uhm, I think that if most of the data can be kept in the cache during the computations, the speed will mostly depend on the CPU (so the CPU is the bottleneck).
If, on the other hand, a computation needs a lot of data that simply has to be fetched from memory frequently, causing a lot of data transfer between cache and memory, then the memory will be the bottleneck.
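To put rough numbers on that (a back-of-envelope example of my own): an update like y(i) = y(i) + a*x(i) performs 2 flops per element while moving 24 bytes (read x(i) and y(i), write y(i)), so

    I = 2 flops / 24 bytes = 1/12 flop per byte
    P <= I * B = (1/12) * 35 GB/s ~ 2.9 GFLOP/s

on a 35 GB/s machine, regardless of the core count; it is memory-bound. A matmul that keeps blocks of the matrices in cache has far higher intensity and can be CPU-bound instead.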

I remember watching a course on Coursera, taught by Intel, called “Fundamentals of Parallelism on Intel Architecture”. It is based on C/C++, which I am not very familiar with. But anyway, the instructor illustrated the code below (which is perhaps typical in areas like fluid dynamics, heat transfer, image processing (convolution kernels), cellular automata, etc.) on a Xeon Phi cluster,

and by adding vectorization, then OpenMP (multi-threading), then using HBM on the Xeon Phi, and finally MPI, he achieved a performance gain from 2.3 → 17 → 90 → 500 → 1100 → 3100 GFLOP/s, as below,
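Since I cannot reproduce Intel's slides here, here is my own minimal sketch of the kind of stencil kernel the course used (my reconstruction, not Intel's actual code):

    program stencil_sweep
      implicit none
      integer, parameter :: n = 1024, nsteps = 100
      real(8), allocatable :: u(:,:), unew(:,:)
      integer :: i, j, step
      allocate(u(n,n), unew(n,n))
      call random_number(u)
      do step = 1, nsteps
         do j = 2, n - 1
            do i = 2, n - 1     ! contiguous inner index: vectorizes well
               unew(i,j) = 0.25d0 * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))
            end do
         end do
         u(2:n-1, 2:n-1) = unew(2:n-1, 2:n-1)
      end do
      print *, 'checksum =', sum(u)
    end program stencil_sweep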

But it seems the part about using HBM is not trivial (so I skipped that part, LOL); some specialized directives need to be added to the code to use HBM, and I was lost there, LOL.
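If I remember right, Intel Fortran exposed MCDRAM through the memkind library with a FASTMEM attribute on allocatable arrays; something along these lines (I am writing this from memory, so please check Intel's documentation):

    program use_mcdram
      implicit none
      real(8), allocatable :: field(:)
      !DIR$ ATTRIBUTES FASTMEM :: field   ! request allocation in KNL MCDRAM (needs libmemkind)
      allocate(field(10000000))
      field = 0.0d0
      print *, size(field)
    end program use_mcdram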

Thank you @themos for pointing out the Roofline model; it is also mentioned in that Intel course, and I will take a closer look at it. Thank you again, this seems to be an important concept in HPC.

PS.
For anyone who is sensitive about copyright issues, let me make it clear that the screenshots are from Intel's slides. I hope there is no copyright problem; I display them here purely for discussion purposes.


I have been using Intel i5/i7 and AMD Ryzen processors for OpenMP. All of these are “dual-channel memory” systems.

My calculations use 8-byte real arrays, typically larger than the cache size.
My bottleneck for performance is memory access speed, not processor MHz.
Typically there are more threads available than can be efficiently utilised.
I need better memory access speed (memory bandwidth), not more MHz or more cores.

I have too little experience with other (higher channel count) memory architectures to comment on their effectiveness. I wonder whether multiple channels can access the same/nearby memory addresses; that would be required to utilise the wider bandwidths claimed when performing multi-threaded calculations on shared arrays.

Extra cores are only marketing when memory access is the bottleneck.
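A simple way to see this is a STREAM-style triad kernel (a minimal sketch; the array size is chosen to be much larger than the cache). Once the memory channels saturate, adding threads via OMP_NUM_THREADS barely changes the measured rate:

    ! Compile with OpenMP, e.g. gfortran -O2 -fopenmp triad.f90
    program triad
      use omp_lib
      implicit none
      integer, parameter :: n = 100000000    ! ~800 MB per array
      real(8), allocatable :: a(:), b(:), c(:)
      real(8) :: t0, t1
      integer :: i
      allocate(a(n), b(n), c(n))
      b = 1.0d0; c = 2.0d0
      t0 = omp_get_wtime()
      !$omp parallel do
      do i = 1, n
         a(i) = b(i) + 3.0d0 * c(i)    ! 2 flops per 24 bytes moved
      end do
      !$omp end parallel do
      t1 = omp_get_wtime()
      print *, 'effective GB/s ~', 24.0d0 * n / (t1 - t0) / 1.0d9
    end program triad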


Thanks @JohnCampbell .

Yeah, I mentioned this in another thread,

For a Fortran code I wrote that depends on heavy memory operations, I did find that even the entry-level M1, with 68.5 GB/s of memory bandwidth, runs the code 2X faster than my Xeon-2186M laptop with DDR4-2666, which has about 35 GB/s of bandwidth.
I do not think the M1 is really faster than Intel's CPU, but the memory bandwidth does seem to make a difference. Perhaps that is why the M1 can be good at video editing and other tasks that depend heavily on memory speed. High-bandwidth memory may be a game changer.

I figure that to match the M1 Max's 400 GB/s bandwidth on a regular PC motherboard that only supports dual-channel memory, you would need roughly DDR5-25000 (400 GB/s ÷ (2 channels × 8 bytes per transfer) = 25 GT/s), lol. The current world record is DDR5-8888.

I know that workstation motherboards may support quad-channel memory. Perhaps it is time for consumer PCs to support quad-channel memory as well.

Most importantly, all PC memory channels seem to be 64 bits wide; if they were 256 bits wide, even DDR4-2666 on a dual-channel motherboard could achieve 35 × 4 = 140 GB/s. I guess the M1 chips are most likely equipped with 128/256/512-bit-wide memory, just like some GPUs' memory.
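The peak figures are consistent with the simple formula (the M1 Max numbers below are what has been reported, not something I have measured):

    peak bandwidth = bus width (bytes) x transfer rate
    dual-channel DDR4-2666:        16 bytes x 2.666 GT/s ~  42.7 GB/s peak (~35 GB/s sustained)
    M1 Max (512-bit LPDDR5-6400):  64 bytes x 6.4 GT/s   ~ 409.6 GB/s peak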

Fortunately, I’ve recently got access to an M1 Max (not Ultra, though), so if you have any test program to run, I will try to run it and measure the timing. Hopefully, the test programs will be as “minimal and simple” as possible, so that the bottleneck is clear (e.g., CPU load vs data transfer), but other types of codes may also be interesting.

By the way, there are a lot of sites that compare “CPU performance” of different processors, e.g.

Here is also a comparison between the M1 and the Xeon E-2186M (the latter being the one used in the other thread):

I am wondering to what extent I can use these comparisons for specific types of calculations… (e.g., CPU-intensive or memory-intensive). Looking at some other sites, the M1 and M1 Max are just similar to, e.g., the Ryzen 9 series for some calculations, which is of course good but not “super” fast (as the bandwidth might suggest). I've also run my own codes, and the experience was similar. So the comparison may depend very much on the type of calculation.


Thank you @septc , I have posted a code here,

You may test it when you have time. Thanks again!


Let me start out with an H. L. Mencken quote:

 "For every complex problem there is an answer that is clear, simple, and wrong."

That witticism seems to be applicable to this thread, whose title and content seem to portray memory bandwidth as all-important. I think more attention should be paid to the quality of the RNG used and of the results obtained, the quality of the algorithms, the quality of the coding, and the optimizing capabilities of the compilers used.

I do not have an M1 or a Xeon workstation, just a lowly Intel NUC (mini PC, 5 in X 5 in X 2 in) with a 1.1 GHz i7-10710U and DDR4-2666 SODIMM RAM, running W11-Home on the “balanced” power plan. Such a PC should know better than to compete against the likes of M1 and Xeon. I am listing the timings that I obtained with @CRquantum’s test code mainly to prove the points that I listed above. The program crashed when I used /Qopenmp, so I turned that off and built the program using

ifort /MP /Qxhost /fast /Qopt-matmul /heap-arrays:8 /Qparallel constants.f90 fg.f90 pf.f90 ran.f90 stats.f90 stochastic_rk.f90 tests.f90 main.f90 /Fecrq
...
random number generating took n seconds, n =    6.532000
Time cost =           3.061000     sec
Time cost =           3.085000     sec
Time cost =           3.048000     sec
Time cost =           3.083000     sec
Time cost =           3.102000     sec
total time cost =    22.20000      seconds

The M1 timings given in @CRquantum’s linked post are 24 s for the RN generation and about 1.6 s for each of the remaining tests. For the Xeon 2186M, his reported timings are ~10 s for RN generation and 3 to 4 s for each of the remaining tests.

The time for generating the random numbers is the dominant time cost. The RNGs used by other compilers, such as Gfortran's, will probably exhibit substantially different timings, and one should remember that speed is not everything. Let us recall this deprecatory quote about an old, discredited RNG: “We guarantee that each number is random individually, but we don’t guarantee that more than one of them is random.”

There you are; I have contributed one answer that is, I hope, “clear, simple and wrong”. On the other hand, perhaps we should measure how many Joules of electric energy were consumed in obtaining these results … !


Thanks @mecej4 !

Your i7-10710U seems faster than my Xeon 2186M, LOL.

Note that this is just a sample code (again, for those who do not know, the sample code is here:
Memory bandwidth test code for M1 and PC) from a more complete sequential Monte Carlo code.
In the real code, once the Gaussian random numbers are generated, I reuse them in each of the SDE calls, and the SDE part is repeated more than 20 times instead of the 5 times here. So in the real code the cost of random number generation is not the dominant part. However, I checked: if I use the random number generator in Intel's MKL, MKL's RNG is three times faster than the one in the sample code.

The reason I generate the big random number array first (the drawback is that it consumes a lot of memory) is to reuse it in each of the subsequent SDE calls, so that we do not need to regenerate the random numbers each time (which saves time). After all, reading already-generated Gaussian random numbers from memory is faster than generating them on the fly.

But anyway, the sample tests two parts (a sketch of the reuse pattern follows below):
the first is the speed of generating a large number of Gaussian random numbers and storing them in memory;
the second is solving the vectorized SDE, which seems to be memory-operation intensive.
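The reuse pattern is roughly the following (a hedged sketch: I am quoting the MKL VSL calls from memory, so check the MKL documentation; names and sizes are made up):

    include 'mkl_vsl.f90'
    program rng_reuse
      use mkl_vsl_type
      use mkl_vsl
      implicit none
      integer, parameter :: nstep = 100, npath = 1000000
      real(8), allocatable :: z(:,:)
      type(vsl_stream_state) :: stream
      integer :: errcode, pass, istep

      allocate(z(npath, nstep))
      errcode = vslnewstream(stream, VSL_BRNG_MT19937, 777)
      do istep = 1, nstep                 ! generate all Gaussians once, up front
         errcode = vdrnggaussian(VSL_RNG_METHOD_GAUSSIAN_ICDF, stream, &
                                 npath, z(:, istep), 0.0d0, 1.0d0)
      end do
      errcode = vsldeletestream(stream)

      do pass = 1, 20                     ! every SDE pass reuses the stored numbers
         do istep = 1, nstep
            ! ... advance all paths one step using z(:, istep) ...
         end do
      end do
    end program rng_reuse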

About the crash: Intel confirmed that it seems to be a compiler bug. That is, if I have

do concurrent

in the code, then when /Qopenmp and /heap-arrays are enabled at the same time, the do concurrent part just crashes. See below,

Re: Why /heap-arrays0 cause /qopenmp show access violation for do concurrent? - Intel Communities

About algorithms: overall, I agree with what you said.
However, I need to point out that in this code, no matter how good the algorithm is (even if you can bring the number of particles down from 10^5 to 10^3), the most time-consuming part is still solving the SDEs. That is the point of posting only the SDE part of the code.

Yes, in the SDE part, as you point out, RNG speed matters, and so does the SDE solver's speed.
I think the bottleneck for RNG speed is the CPU. It is perhaps not too hard to see: say the memory bandwidth is 200 GB/s; can the CPU generate 200 GB of high-quality Gaussian random numbers per second? If the answer is no, then the bottleneck is the CPU.

For the SDE part (which requires frequent memory operations), on the other hand, the bottleneck is memory speed. Again, not too hard to see: can the CPU handle more than 200 GB of matrix operations per second? If the answer is yes, then the bottleneck is the speed of the memory.

Note that there are definitely reasons why Apple uses high-bandwidth memory. If HBM had no benefit, I doubt Apple would devote so much effort to it. There are much better explanations in the link below,

Ideally, the memory speed should be the same as the CPU's L1 cache speed (though the cache may have lower latency). Right now, it seems only Apple's M1 has somewhat achieved this. Of course, the drawback is that the memory is no longer upgradable. Intel's Xeon Phi also had HBM, which they called MCDRAM; unfortunately they discontinued the Xeon Phi, but I guess the idea lives on in their GPUs.

Of course memory speed (bandwidth and latency) matters critically. That is what made Crays so exciting (and spawned array processors, GPUs, etc.). Just coding a simple set of DO loops is unlikely to get optimal results.

Short loops, and logical branches in innermost loops, do best with low latency to memory (or lucky/careful use of caches). For long vectors, high-bandwidth direct memory is easiest to deal with (sparse vectors being the outliers ;>).

The tradeoffs are complicated and workload- and implementation-dependent … and the target systems often evolve faster than we can rewrite all the codes in the world. While libraries may be well tuned, to maximize the reuse of values (fetch data into cache, do a lot of computation on it, then get the next chunk, rather than streaming over the whole data set again and again), “whole-program analysis” and “inter-procedural analysis” combined with cache and streaming optimizations can do better (not that either is easy or, practically speaking, always effective even when available).
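A small illustration of that reuse idea, in the form of classic cache blocking (a sketch; the block size is a guess and would need tuning): process the arrays in tiles that fit in cache, so that each value fetched from memory is used many times before it is evicted.

    program blocked_matmul
      implicit none
      integer, parameter :: n = 512, nb = 64    ! nb sized so three tiles fit in cache
      real(8), allocatable :: a(:,:), b(:,:), c(:,:)
      integer :: i, j, k, ii, jj, kk
      allocate(a(n,n), b(n,n), c(n,n))
      call random_number(a); call random_number(b); c = 0.0d0
      do jj = 1, n, nb
         do kk = 1, n, nb
            do ii = 1, n, nb
               do j = jj, min(jj+nb-1, n)
                  do k = kk, min(kk+nb-1, n)
                     do i = ii, min(ii+nb-1, n)   ! each a/b tile element is reused nb times
                        c(i,j) = c(i,j) + a(i,k) * b(k,j)
                     end do
                  end do
               end do
            end do
         end do
      end do
      print *, 'c(1,1) =', c(1,1)
    end program blocked_matmul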
