I have quite a specific problem which I’m not super equipped to debug myself, and was wondering if someone had some insight.
I’m running my Fortran program (gfortran 14.2.0 with OpenMP) on Windows 10, and my workstation has an Intel Xeon w9-3475X processor (36 cores, 72 threads). As I understand it, Windows automatically partitions the logical processors into processor groups of at most 64 threads each, exposed as NUMA nodes. This leads to an asymmetric grouping on my machine: one group with 64 threads (Node 0) and another with just 8 (Node 1). When I run my program, there seems to be a 50/50 chance of which NUMA node it will run on, and it only uses the resources of the node it runs on. That means if it runs on Node 0, I get 64 threads and can happily parallelize with OpenMP. If it runs on Node 1, I only get 8 threads, and the result is my code runs ~8-fold slower.
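For anyone checking the same thing: the grouping can be inspected, and a process can be pinned to a node, from a cmd prompt. This is a sketch of Windows command fragments, not something I can run here; Coreinfo is a separate Sysinternals download, and `myprog.exe` is a placeholder for your own executable.

```shell
:: Show processor groups (-g) and NUMA nodes (-n); Coreinfo is a
:: Sysinternals tool, not part of a stock Windows install
coreinfo -g -n

:: Launch a program pinned to NUMA node 0 (the 64-thread group);
:: myprog.exe is a placeholder
start /NODE 0 /B myprog.exe
```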
However, if you run Linux within Windows, I am not sure whether this removes the limitations of the underlying Windows system (I did not, however, do any research on this specific problem).
Yes, this works. It took me a while to figure out, though, since I normally use a PowerShell terminal and I wasn’t able to figure out how to do the same in PowerShell. However, if I run my program with fpm (i.e., start /NODE 0 fpm run), it doesn’t work, since the child process launched by fpm isn’t bound to node 0. I could still write a script that builds my program and then runs the executable itself with start /NODE 0.
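For reference, a batch sketch of that idea: build first, then launch the binary pinned to node 0. The build path and executable name are placeholders (fpm puts binaries under a compiler-specific build directory); alternatively, fpm’s --runner option might let you pass the start command directly, though I haven’t verified the quoting.

```shell
:: build_and_run.bat -- sketch; adjust the build path and program name
:: to match your project
fpm build

:: /WAIT keeps the console attached; /NODE 0 binds the process (and its
:: OpenMP threads) to the 64-thread processor group
start /WAIT /NODE 0 build\gfortran_ABC123\app\myprog.exe
```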
Regardless, I think the WSL solution is what I’m going to go with.
Unfortunately, I did some testing, and in my case hyperthreading helps quite a lot: 64 threads perform about 30% faster than 36 threads without hyperthreading.
There seems to be a limit of 64 CPUs that can be “trivially” utilized by a single process in Windows 10. This limitation seems to have been lifted in Windows 11.
You may search for “Windows” and “Processor Group”. Again, I have no personal experience.
I found that trying to oversubscribe (--oversubscribe) the physical cores on my computer just doubles the amount of time a thread takes. Hyperthreading is a marketing way of saying that another thread is ready to start executing when the first thread stalls, e.g. on a cache miss: branch prediction loaded the wrong memory, and the core must wait on L2/L3/system RAM before it can continue. If thread 0 of core 0 never stalls like this, thread 1 of core 0 will never get task time. That doesn’t help get an answer any faster!
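One way to compare without oversubscribing is to cap OpenMP at one thread per physical core and spread the threads out. These are standard OpenMP environment variables (honored by gfortran’s libgomp); this is a config fragment for a Windows cmd session, and `myprog.exe` is again a placeholder:

```shell
:: One OpenMP thread per physical core (36 on the Xeon w9-3475X),
:: spread across cores rather than packed onto sibling hyperthreads
set OMP_NUM_THREADS=36
set OMP_PROC_BIND=spread
set OMP_PLACES=cores
myprog.exe
```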
I usually program with OpenCoarrays under Linux on Windows (WSL) because it’s nice and easy!