Question about hyper-threading

Hello,

I am now building and installing several machines / workstations and checking various parameters in the BIOS setup. I remember having read that “it is better to turn off hyper-threading”, but I wonder whether that is still true for recent CPUs (e.g., models released in the last 5 years). For typical OpenMP and MPI applications, for example, are there both pros and cons to turning hyper-threading off, or is it almost always better to turn it off (for better use of cores / computational efficiency)?

FYI, I am installing machines with Ryzen (1 socket) or Xeon (2 sockets), if it matters. I do not run any other “desktop” applications simultaneously; I use these machines for computation only.

I would appreciate any comments / hints / advice about this. Thanks very much!

1 Like

For supercomputer clusters, I am not sure whether they enable hyper-threading; my impression is that they usually don’t.

For a personal computer, my experience is that you can enable hyper-threading without any problem. Hyper-threading is useful for daily usage, so that you can run many different programs, open many webpages at the same time, etc.
However, if your computer has 6 cores and 12 threads and you run MPI with

mpiexec -n XXX

you get the best performance at XXX = 6. Beyond 6 there is no performance gain, and the program may even be slower. The speed depends on the number of real cores (not the number of hardware threads) you have.
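
The same applies to OpenMP: omp_get_num_procs() counts logical processors, so with hyper-threading on it reports twice the physical cores. A minimal sketch (the halving is an assumption that 2-way SMT is enabled on your machine):

program count_cores
   use omp_lib, only : omp_get_num_procs, omp_set_num_threads
   implicit none
   integer :: nlogical
   nlogical = omp_get_num_procs()                 ! logical processors seen by the runtime
   print *, 'logical processors:', nlogical
   print *, 'assumed physical  :', nlogical/2     ! assumption: 2-way SMT is enabled
   call omp_set_num_threads(max(1, nlogical/2))   ! one thread per (assumed) physical core
end program count_cores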

2 Likes

In general, hyper-threading tends to benefit programs that are bottlenecked by RAM latency rather than pure CPU performance (or memory bandwidth). As such, hyper-threading often gives nearly a 2x speedup for poorly optimized applications, but for well-optimized programs it is generally performance-neutral for programs that scale perfectly, and harmful for programs that have scaling overhead.

5 Likes

I agree that it depends on the code. In the end, one usually does not know before testing.

2 Likes

I am similarly confused about the effectiveness of Hyper-Threading, which is Intel’s name for running 2 threads on one CPU core. This is very much a black box to me!

Similar to Intel’s “Hyper-Threading Technology”, AMD implemented 2-way simultaneous multithreading (SMT).

My computing involves large 64-bit real arrays (multiple gigabytes) where I hope to use AVX2 instructions; however, if I limit the number of threads to the number of physical cores, the computation loss is insignificant as measured in GFLOP/s.
I am using a 6-core i7-8700K and a 12-core Ryzen 5900X, both of which have dual-channel DDR4 memory architectures.
My bottleneck appears to be the memory-to-L3-cache bandwidth: reducing the memory demand rate (by modifying the numerical approach) appears to improve GFLOP performance.

I have tested different processors with faster DDR4 memory speeds, which demonstrate improved performance.
I have tested different array sizes, which appear to demonstrate improved AVX performance if the arrays can be stored in L1 cache.
I have not tested Intel Xeon, AMD Threadripper or Apple M1 processors, which claim more memory channels, to understand whether more memory channels can improve bandwidth for selected memory addresses and hence performance. (i.e., when using 4 GByte arrays with 64 GByte installed, can more memory channels better access the same memory pages?)
I am puzzled whether each CPU core can support multiple sets of AVX vector registers for hyper-threading. (I cannot identify whether the performance limit is AVX registers, memory bandwidth or hyper-threading.)

I have carried out a number of tests for cores, threads, memory speed, and L3 and L1 cache utilisation, but with limited success. The problem in managing these in Fortran code is that the available controls are very indirect.
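
A simple way to expose these cache regimes is to time one kernel at increasing array sizes and watch the GFLOP/s step down as the arrays fall out of L1 and L3 into DRAM. A minimal OpenMP sketch (the sizes and the flop accounting are illustrative, not tuned):

program bw_probe
   use omp_lib, only : omp_get_wtime
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer :: k, n, i, rep, nrep
   real(dp), allocatable :: x(:), y(:)
   real(dp) :: t0, t1, flops, chk
   chk = 0.0_dp
   do k = 10, 26, 4                     ! array sizes from ~8 KB to ~512 MB per array
      n = 2**k
      nrep = max(1, 2**26/n)            ! keep total work roughly constant
      allocate(x(n), y(n))
      x = 1.0_dp; y = 2.0_dp
      t0 = omp_get_wtime()
      do rep = 1, nrep
         !$omp parallel do
         do i = 1, n
            y(i) = y(i) + 3.0_dp*x(i)   ! daxpy-like: 2 flops per 16 bytes loaded
         end do
         !$omp end parallel do
      end do
      t1 = omp_get_wtime()
      flops = 2.0_dp*real(n,dp)*real(nrep,dp)
      print '(a,i10,a,f10.2)', ' n =', n, '   GFLOP/s =', flops/(t1-t0)/1.0e9_dp
      chk = chk + y(1)                  ! keep y live so the loop is not optimised away
      deallocate(x, y)
   end do
   print *, 'check:', chk
end program bw_probe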

(I have lately turned off AMD 2-way simultaneous multithreading, because with more threads and dynamic clock rates for each core, some threads perform very slowly in OpenMP computation. Win10 does not appear to choose the best CPU sharing.)

My supplementary question for this thread is:

For the dual-channel memory processors I have identified, are many cores, hyper-threading, AVX registers and cache sizes more marketing than a useful measure of performance?
Have we been misled by the marketing, where there are too many cores/threads for the memory bandwidth currently provided?

My computation is structural finite element analysis, using a direct solver of a large “skyline array”. This single shared array is well suited to shared-memory OpenMP, rather than to an array distributed over networked computers (supercomputer architecture?).

3 Likes

For my astrophysics code using raycasting and a chemical solver, the execution speed increased when the number of threads on my second-generation Ryzen system was increased above the number of physical cores (gfortran, OpenMP).
But I do not think this result can be generalized.

2 Likes

Thanks very much for all the info!

Just a bit more background: because this is the first time that I have built machines “from scratch”, i.e., by buying the parts separately and assembling them manually, I needed to take a much closer look at the BIOS setup than before (partly because some combinations of CPU + motherboard did not work properly… due to default overclocking? I am still not sure…). Anyway, I will play with the settings to see how they affect the performance of the programs at hand.

I do not pretend to be a firmware/hardware developer, but I’m not sure why you would want to disable hyper-threading at the BIOS level when you could just force OMP_NUM_THREADS=1 (or the equivalent for whatever other multithreading model you’re using).

Multithreading is one of those things where you really need to try it out and see for yourself; there are far too many independent variables involved. The only “safe” advice is that if you’re running any decently optimized scientific computation package, unless your package was specifically designed to use multithreading, you’re probably not going to see any appreciable performance gain by turning it on.
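
As a concrete starting point, a quick scan over thread counts on a toy kernel shows where the gains stop on a given machine; a minimal sketch (the kernel is illustrative and says nothing about any particular package):

program thread_scan
   use omp_lib, only : omp_set_num_threads, omp_get_num_procs, omp_get_wtime
   implicit none
   integer, parameter :: dp = kind(1.0d0), n = 2**24
   real(dp), allocatable :: x(:), y(:)
   real(dp) :: t0, t1
   integer :: nt, i
   allocate(x(n), y(n))
   x = 1.0_dp; y = 0.0_dp
   do nt = 1, omp_get_num_procs()       ! scan 1 .. all logical processors
      call omp_set_num_threads(nt)
      t0 = omp_get_wtime()
      !$omp parallel do
      do i = 1, n
         y(i) = y(i) + 2.0_dp*x(i)*x(i)
      end do
      !$omp end parallel do
      t1 = omp_get_wtime()
      print '(a,i3,a,f8.4,a)', ' threads =', nt, '   time =', t1 - t0, ' s'
   end do
   print *, 'check:', y(1)
end program thread_scan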

1 Like

I did some performance measurements with hyper-threading a while (approx. 5 years) ago, and at that time I decided to disable it.

The reason was that a CPU with N cores pretends to have 2xN cores, so setting OMP_NUM_THREADS=N could mean that only half of the “real” cores are doing the calculation.
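
For what it’s worth, with an OpenMP 4 runtime you can keep hyper-threading enabled in the BIOS and still place one thread per physical core from the environment; a hedged example, where my_solver is a placeholder and the 6 assumes a 6-core CPU:

OMP_NUM_THREADS=6 OMP_PLACES=cores OMP_PROC_BIND=close ./my_solver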

But this experiment was a while ago, and testing it yourself is IMHO the only sensible answer to any multi-threading/performance question.

2 Likes

Register pressure in a hyper-threading context comes to mind as a reason. It is still not suggested to have it ON for structural FEM simulations.

1 Like

@JohnCampbell what you’re describing makes sense. With current hardware, I generally find that most scientific software (including FEM-type workloads) is limited by memory bandwidth more than by core count, processor clock, SIMD throughput, or other such factors.

I have not tested Intel Xeon, AMD Threadripper or Apple M1 processors, which claim more memory channels, to understand whether more memory channels can improve bandwidth for selected memory addresses and hence performance. (i.e., when using 4 GByte arrays with 64 GByte installed, can more memory channels better access the same memory pages?)

This depends on BIOS settings and implementation details, but in general the hardware will interleave the different memory channels, so you can indeed see an improvement with more memory channels, even when your application’s memory demand is smaller than the machine’s memory capacity. I use a workstation with 2x 16-core AMD Epyc CPUs, each of which has 8 memory channels, and I can confirm that I get an enormous speedup in FVM CFD codes relative to a consumer chip with 2 memory channels, even when the case I’m solving needs only a handful of GBs out of the 256 GB total.

Granted, having to buy 256 GB of RAM when you only need 32, just to get better performance, is kind of annoying, but that’s a separate discussion…

Have we been misled by the marketing, where there are too many cores/threads for the memory bandwidth currently provided?

My opinion: yes, especially for FEM. This story can change depending on the software, though. Some problems do need to perform a very large number of FLOPs per memory load or store… In my experience, making lots of trig or exponential function calls is a common way to get into this situation, since such functions are generally emulated in software.
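
To make the contrast concrete, here is a rough sketch of a bandwidth-bound streaming update next to a transcendental-heavy one; the kernels and sizes are illustrative, and the second loop is the kind that can keep scaling with core count well past the memory-bandwidth ceiling:

program intensity_demo
   implicit none
   integer, parameter :: dp = kind(1.0d0), n = 2**24
   real(dp), allocatable :: x(:), y(:)
   integer :: i
   allocate(x(n), y(n))
   x = 0.5_dp; y = 0.0_dp
   ! streaming update: ~2 flops per 16 bytes moved -> memory-bandwidth bound
   !$omp parallel do
   do i = 1, n
      y(i) = y(i) + 3.0_dp*x(i)
   end do
   !$omp end parallel do
   ! transcendental-heavy update: many flops per element -> core-throughput bound
   !$omp parallel do
   do i = 1, n
      y(i) = y(i) + sin(x(i)) + exp(-x(i)**2)
   end do
   !$omp end parallel do
   print *, 'check:', y(1)
end program intensity_demo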

3 Likes

Remember that the older Intel CPUs may be prone to some HT-related bugs. See, e.g. here or there

2 Likes

The answer to this is very interesting (in my opinion).

Using a Ryzen 5900X (12 cores, 24 threads) with Windows 10 (21H1), if I choose to run an OpenMP program that uses 10 threads, the allocation of threads to cores on my actual processor will target only 9 cores. My understanding is that this results from the variable clock rates chosen for each core by the Ryzen optimising approach, so that (in this case) the 3 cores with slower clock rates are biased out of the selection. (My Ryzen is not a perfect bit of silicon!)
The particular calculation I am performing requires each thread to have a very similar performance rate, as all threads are reading and sharing a 25 GByte array. (I use !$OMP BARRIER to keep them synced at each pass.) If the 10 threads get out of sync by more than the L3 cache size when processing this array, then the memory <> L3 cache bottleneck degrades the performance of all threads.
An alternative may be to turn off the AMD dynamic clock rate mechanism, although I have not tested this approach yet.
My problem is that there are too many CPU and cache tuning features that I am not able to control (mainly due to a lack of expertise).
I am fairly sure that an environment variable such as OMP_NUM_THREADS=x would not address this issue, as I use “call omp_set_num_threads (num_threads)” during execution.
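
For reference, a minimal self-contained sketch of the barrier-per-pass pattern I described (the array size, slicing and update are purely illustrative):

program sync_passes
   use omp_lib, only : omp_get_thread_num, omp_get_num_threads
   implicit none
   integer, parameter :: dp = kind(1.0d0), n = 2**22, npass = 8
   real(dp), allocatable :: a(:)
   integer :: pass, i, tid, nth, i0, i1
   allocate(a(n)); a = 1.0_dp
   !$omp parallel private(pass, i, tid, nth, i0, i1) shared(a)
   tid = omp_get_thread_num()
   nth = omp_get_num_threads()
   i0 = tid*n/nth + 1                  ! this thread's slice of the shared array
   i1 = (tid + 1)*n/nth
   do pass = 1, npass
      do i = i0, i1
         a(i) = 0.5_dp*(a(i) + 1.0_dp) ! each thread updates only its own slice
      end do
      !$omp barrier                    ! keep all threads within the same pass
   end do
   !$omp end parallel
   print *, 'check:', a(1), a(n)
end program sync_passes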

Disabling hyper-threading will ensure each thread uses a different CPU, although some threads will use 2 CPUs, which makes me wonder what frequent switching between CPUs does to L2 and L1 cache efficiency. (This is seen from Task Manager.)
There is also a question as to whether environment variable settings such as “OMP_PROC_BIND”, “OMP_PLACES” or “GOMP_CPU_AFFINITY” work as claimed on Windows 10, and whether intervening with these controls is a good thing anyway.
I do know that “OMP_STACKSIZE” does not work as documented on Windows 10. As documented it is pretty useless, since for !$OMP PARALLEL DO computation the thread-0 stack needs to be the largest stack, while OMP_STACKSIZE only increases the size for threads 1+.
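
For reference, OMP_STACKSIZE sizes only the worker threads; the main thread’s stack is fixed when the executable is linked. With MinGW-based gfortran on Windows this can, as far as I know, be set via the linker’s --stack option, e.g. (the 512 MB figure is illustrative):

gfortran -fopenmp -Wl,--stack,536870912 main.f90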

This may be a detailed answer that is possibly incorrect in some areas, but it does highlight the problem we Fortran programmers have when trying to use multi-threading when we are NOT firmware/hardware developers. All we have to go on is the marketing hype, which may skip some important details.

1 Like

This is a good point I hadn’t considered; the HPC clusters I’ve worked on in the past usually had environment and/or MPI-wrapper flags to control the distribution of threads/MPI tasks over logical cores for this reason. Not sure how you’d roll your own, though.

1 Like

As far as I know, each hardware thread has its own context on a core, in particular its own architectural register set (renamed onto a physical register file much larger than the architectural register count). So register pressure should not be an issue.

1 Like

Sorting out thread affinity is important and usually rather simple in small environments (single/dual processor). There is KMP_AFFINITY for ifort; I am not sure how gfortran handles this. Standard behaviour in Linux was always: use physical cores first.
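
If it helps, the ifort setting and its portable OpenMP 4 equivalent are roughly interchangeable:

KMP_AFFINITY=granularity=core,compact          (Intel runtime)
OMP_PLACES=cores OMP_PROC_BIND=close           (any OpenMP 4 runtime, including gfortran/libgomp)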

Enabling hyper-threading but using only physical cores for solvers did slightly improve performance, even with memory-bandwidth-bound solvers. My guess is that system tasks with very different loads can easily be merged with numerical code. On the other hand, even with abundant memory bandwidth, using (not merely enabling) hyper-threading for numerical codes might be a bad idea, as numerical tasks tend to be rather monocultural, stressing only one or a few execution units. Then the cost of context switches is larger than any gains hyper-threading might yield.

1 Like

It is not guaranteed if not set explicitly when execution starts. Thread affinity on Windows is by default set broadly, to allow the scheduler to move threads among the available cores.

1 Like

Out of curiosity, in your calculations with the Ryzen 5900X, are you using the “default” BIOS settings of your motherboard (e.g. X570 or B550), possibly enabling the “Precision Boost Overdrive (PBO)” + “Core Performance Boost (CPB)” features that dynamically boost the clock beyond the base value (3.7 GHz for the Ryzen 5900X) to higher values (e.g. up to 4.8 GHz)? In that case, for long calculations, have you ever encountered any machine instability or other trouble? If the cooling of the CPU and motherboard is sufficient, is it usually okay to run such calculations for a long time (as long as CPU / MB temperatures remain reasonable)?

Because the chipset heatsink becomes very hot on the motherboard I’m now using (a high-end ASUS gaming board with the X570 chipset), I am a bit worried about the MB temperature during long calculations…

I’ve tried turning off Core Performance Boost (CPB); then all the cores ran at the base clock. But naturally, the computational time became longer than otherwise (by e.g. 30%).

Yeah… the AMD BIOS has a lot of boosting features (mainly for gamers?), which are too complicated for me, so I am just relying on the defaults (or turning features off when necessary) at the moment.

Apart from the BIOS settings, it really seems (empirically) that the data transfer rate (from memory) is very important in many calculations. I guess this is one reason why high-end CPUs are more expensive than consumer ones (they support more memory channels)…

Re: the very high temperature of the fan-less heatsink (on the chipset), I tried attaching a small CPU or chassis cooler on top of it, and the temperature became very stable. So this problem is now solved… (and now I am going back to the other settings). I am sorry for the unrelated noise :sweat: