I am similarly confused about the effectiveness of Hyper-Threading, which is Intel’s name for running two hardware threads on a single physical core. This is very much a black box to me!
Similar to Intel’s “Hyper-Threading Technology”, AMD implemented 2-way simultaneous multithreading (SMT).
My computing involves large real(64) arrays (multiple gigabytes), where I hope to use AVX2 instructions. However, if I limit the number of threads to the number of physical cores, there is an insignificant loss of computation rate as measured in GFLOP/s.
I am using a 6-core i7-8700K and a 12-core Ryzen 9 5900X, both of which are dual-channel DDR4 memory architectures.
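To make the comparison concrete, here is a minimal sketch of the kind of test I mean (a simplified triad-style kernel, not my production code), with the thread count capped at the physical core count so Hyper-Threading is not used:

```fortran
program flops_test
   use omp_lib
   use iso_fortran_env, only : real64, int64
   implicit none
   integer(int64), parameter :: n = 200000000_int64    ! ~1.6 GB per array
   integer,        parameter :: physical_cores = 6     ! i7-8700K
   real(real64), allocatable :: a(:), b(:), c(:)
   real(real64)   :: t0, t1
   integer(int64) :: i

   allocate (a(n), b(n), c(n))
   a = 1.0_real64 ; b = 2.0_real64 ; c = 0.0_real64

   call omp_set_num_threads(physical_cores)   ! one thread per physical core

   t0 = omp_get_wtime()
   !$omp parallel do simd schedule(static)
   do i = 1, n
      c(i) = a(i) + 0.5_real64*b(i)            ! 2 flops per ~24 bytes of traffic
   end do
   !$omp end parallel do simd
   t1 = omp_get_wtime()

   print '(a,f8.2,a)', ' rate =', 2.0_real64*real(n,real64)/(t1-t0)/1.0e9_real64, ' GFLOP/s'
end program flops_test
```

Whether the compiler actually emits AVX2 for the loop depends on the options used (e.g. -O3 -march=native with gfortran, or /O3 /QxHost with ifort on Windows).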
My bottleneck appears to be the memory-to-L3-cache bandwidth: reducing the memory demand rate (by modifying the numerical approach) appears to improve GFLOP/s performance.
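A rough back-of-the-envelope example of why I suspect this (assuming DDR4-3200 and a streaming kernel with 2 flops per 24 bytes of traffic, as in the sketch above): dual-channel DDR4-3200 has a theoretical peak of 2 channels × 8 bytes × 3200 MT/s ≈ 51.2 GB/s, which caps such a kernel at about (51.2 / 24) × 2 ≈ 4.3 GFLOP/s no matter how many cores, threads or AVX registers are available. The ceiling only rises if the flops-per-byte ratio rises, i.e. if data is reused while it is still in cache.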
I have tested processors with faster DDR4 memory speeds, which demonstrate improved performance.
I have tested different array sizes, which appear to show improved AVX performance when the arrays fit in the L1 cache.
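As an illustration of what I mean by keeping the working set in cache (a sketch only, not my actual numerical approach): when two passes are made over the same data, fusing them block-by-block lets the second pass find each block still in L1/L2 instead of streaming the whole multi-gigabyte array from memory twice.

```fortran
subroutine two_pass_blocked(n, x, y)
   use iso_fortran_env, only : real64
   implicit none
   integer,      intent(in)    :: n
   real(real64), intent(inout) :: x(n)
   real(real64), intent(out)   :: y(n)
   integer, parameter :: blk = 4096      ! 4096 x 8 bytes = 32 KB, the L1D size on both CPUs above
   integer :: jb, je, i

   do jb = 1, n, blk
      je = min(jb + blk - 1, n)
      do i = jb, je                      ! pass 1 over the block
         x(i) = 2.0_real64*x(i)
      end do
      do i = jb, je                      ! pass 2 reuses the block while it is still in cache
         y(i) = x(i)*x(i) + 1.0_real64
      end do
   end do
end subroutine two_pass_blocked
```

The unblocked version (whole-array pass 1, then whole-array pass 2) would stream the arrays through the memory channels twice instead of once.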
I have not tested Intel Xeon, AMD Threadripper or Apple M1 processors, which claim more memory channels, to see whether more channels improve the bandwidth to a selected range of memory addresses and so improve performance. (i.e. when using 4 GB arrays with 64 GB installed, can more memory channels give better access to the same memory pages?)
I am also puzzled as to whether each core provides a separate set of AVX vector registers for each hyper-thread. (I cannot identify whether the performance limit is the AVX registers, the memory bandwidth or Hyper-Threading.)
I have carried out a number of tests varying cores, threads, memory speed, and L3/L1 cache utilisation, but with limited success. The problem with managing these from Fortran code is that the available controls are very indirect.
(I have lately turned off AMD’s 2-way SMT, as with more threads and dynamic clock rates for each core, some threads perform very slowly in OpenMP computation; Win10 does not appear to choose the best CPU sharing.)
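For what it is worth, the most direct controls I have found are not in the Fortran source at all but in the standard OpenMP environment variables (a hedged example; how well Win10 and a given OpenMP runtime honour them is part of my question):

```
set OMP_NUM_THREADS=6
set OMP_PLACES=cores
set OMP_PROC_BIND=spread
```

With OMP_PLACES=cores each place is one physical core (both of its hardware threads), so two OpenMP threads should not end up packed onto the same hyper-threaded core.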
My supplementary question for this thread is:
For the dual-channel memory processors that I have identified, are core counts, Hyper-Threading, AVX registers and cache sizes more marketing than a useful measure of performance?
Have we been misled by the marketing, where there are too many cores/threads for the memory bandwidth currently provided?
My computation is structural finite element analysis, using a direct solver on a large “skyline” array. This single shared array is well suited to shared-memory OpenMP, rather than to an array distributed across a network of computers (supercomputer architecture?).
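To give a flavour of why this matters, here is a minimal sketch of the kind of inner kernel such a solver typically spends its time in (my illustration, not the actual solver code): dot products between packed column segments, which map nicely onto AVX2 registers but still have to stream the columns through L3 from memory.

```fortran
function col_dot(n, a, b) result(s)
   use iso_fortran_env, only : real64
   implicit none
   integer,      intent(in) :: n
   real(real64), intent(in) :: a(n), b(n)   ! packed segments of two skyline columns
   real(real64)             :: s
   integer :: k
   s = 0.0_real64
   !$omp simd reduction(+:s)                ! reduction vectorises onto AVX2 registers
   do k = 1, n
      s = s + a(k)*b(k)
   end do
end function col_dot
```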