I have quite a specific problem which I’m not super equipped to debug myself, and was wondering if someone had some insight.
I’m running my Fortran program (gfortran 14.2.0 with OpenMP) on Windows 10, and my workstation has an Intel Xeon w9-3475X processor (36 cores, 72 threads). As I understand it, Windows automatically partitions the logical processors into groups of at most 64. This leads to an asymmetric grouping on my machine: one group with 64 threads (Node 0) and another with just 8 (Node 1). When I run my program, there seems to be a 50/50 chance of which NUMA node it lands on, and it only uses the resources of that node. If it runs on Node 0, I get 64 threads and can happily parallelize with OpenMP. If it runs on Node 1, I only get 8 threads, and the result is my code runs ~8-fold slower.
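A quick way to confirm what your process actually sees is to query the OpenMP runtime at startup. A minimal sketch (the program and file names are mine):

```fortran
! check_procs.f90 -- print how many logical processors this process can use.
! Build with: gfortran -fopenmp check_procs.f90 -o check_procs
program check_procs
   use omp_lib, only: omp_get_num_procs, omp_get_max_threads
   implicit none
   print '(a,i0)', 'Logical processors visible to this process: ', omp_get_num_procs()
   print '(a,i0)', 'Default OpenMP thread count:                ', omp_get_max_threads()
end program check_procs
```

On a machine with the asymmetric grouping described above, the first number should come back as either 64 or 8 depending on which group the process was assigned to.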
However, if you run Linux within Windows, I am not sure whether this removes the limitations of the underlying Windows system (I have not, however, done any research on this specific problem).
Yes, this works. It took me a while to figure out, though, since I normally use a PowerShell terminal and I wasn’t able to figure out how to do the same in PowerShell. However, if I run my program through fpm (i.e., start /NODE 0 fpm run), it doesn’t work, since the child process launched by fpm isn’t bound to node 0. I could still write a script which builds my program with fpm and then launches the executable itself with start /NODE 0.
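That two-step script could look something like the following Windows batch sketch (the path to the built executable is a placeholder; fpm’s build directory layout depends on your compiler and profile):

```bat
:: run_node0.bat -- build with fpm, then launch the binary bound to NUMA node 0.
:: Adjust the executable path below for your project; it is a placeholder here.
fpm build
start /NODE 0 /WAIT /B build\path\to\myapp.exe
```

The /WAIT and /B flags keep the program attached to the current console rather than opening a new window.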
Regardless, I think the WSL solution is what I’m going to go with.
Unfortunately, I did some testing, and in my case hyperthreading helps quite a lot: 64 threads performs about 30% faster than 36 threads without hyperthreading.
There seems to be a limit of 64 CPUs that can be “trivially” utilized by a single process in Windows 10. This limitation seems to have been lifted in Windows 11.
You may search for “Windows” and “Processor Group”. Again, I have no personal experience.
I found that trying to oversubscribe (--oversubscribe) the physical cores on my computer just doubles the amount of time a thread takes. Hyperthreading is a marketing way of saying that another thread is ready to start executing when the first thread incurs a cache miss, i.e., branch prediction loaded the wrong memory and the core must wait on L2/L3/system RAM to continue. If thread 0 of core 0 never fails to read a branch in time, thread 1 of core 0 will never get task time. That doesn’t help get an answer any faster!
I usually program with OpenCoarrays on Linux under Windows because it’s nice and easy!
@Knarfnarf It’s a little bit more complicated than that, and no, it is not just marketing. Hyperthreading (actually Simultaneous Multithreading = SMT) is primarily a hardware capability to genuinely execute instructions from 2 different threads simultaneously in the same core: for instance, an integer addition for thread 0 and a floating-point multiplication for thread 1, as these 2 operations do not use the same physical circuits.
Very roughly speaking, SMT can help when the two threads execute very different code, as that increases the chances they don’t need the same resources at the very same time. In contrast, the parallelization of simple loops is generally not more efficient with SMT, as all the threads perform the same kind of operations at the very same time.
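To measure how much SMT helps a given workload, one can time the same parallel loop at both thread counts. A minimal sketch (the array size and loop body are arbitrary choices of mine):

```fortran
! smt_test.f90 -- time a simple parallel loop; run with OMP_NUM_THREADS=36 and 72.
! Build with: gfortran -O2 -fopenmp smt_test.f90 -o smt_test
program smt_test
   use omp_lib, only: omp_get_wtime, omp_get_max_threads
   implicit none
   integer, parameter :: n = 20000000
   real, allocatable :: x(:), y(:)
   real :: a
   double precision :: t0, t1
   integer :: i
   allocate(x(n), y(n))
   x = 1.0; y = 2.0; a = 1.5
   t0 = omp_get_wtime()
   !$omp parallel do
   do i = 1, n
      y(i) = y(i) + a*x(i)
   end do
   !$omp end parallel do
   t1 = omp_get_wtime()
   print '(a,i0,a,f8.3,a)', 'threads = ', omp_get_max_threads(), &
                            ', time = ', t1 - t0, ' s'
end program smt_test
```

A memory-bound loop like this one would be expected to show little or no SMT benefit; code with more mixed instruction streams may behave differently.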
I went with WSL and it fixed my issues. Some notes for anyone who is curious.
Right away, there was no issue with NUMA nodes: querying numactl --hardware showed 72 CPUs on a single node. However, when running my program, Task Manager still showed it using at most 64 threads (still an improvement, because it could always use 64 threads, rather than getting 8 half of the time).
To get 72 threads, I had to run (from an elevated Windows command prompt, not the Linux environment) bcdedit /set hypervisorschedulertype Core and then reboot.
And now I can use up to 72 threads within WSL (running natively on Windows is, however, still limited to the 64 threads of node 0).
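For anyone checking their own WSL setup, the relevant queries look something like this (numactl may need to be installed first, e.g. sudo apt install numactl; myapp is a placeholder for your executable):

```shell
# How many logical CPUs the Linux environment exposes:
nproc

# NUMA layout as seen from inside WSL (skipped if numactl is not installed):
command -v numactl >/dev/null && numactl --hardware

# Then run your OpenMP program with an explicit thread count, e.g.:
#   OMP_NUM_THREADS=72 ./myapp
```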