Parallelization on Windows

Hi all,

I have quite a specific problem which I’m not super equipped to debug myself, and was wondering if someone had some insight.

I’m running my Fortran program (gfortran 14.2.0 with OpenMP) on Windows 10, and my workstation has an Intel Xeon w9-3475X processor (36 cores, 72 threads). As I understand it, Windows automatically groups threads into NUMA nodes with a maximum of 64 threads each. This leads to an asymmetric grouping on my machine, where I have one group with 64 threads (Node 0) and another with just 8 (Node 1). When I run my program, there seems to be a 50/50 chance which NUMA node it will run on, and it only uses the resources of that node. That means if it runs on Node 0, I get 64 threads and can happily parallelize with OpenMP. If it runs on Node 1, I only get 8 threads and my code runs ~8-fold slower.
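To make the arithmetic above concrete, here is a minimal sketch of where the 64 + 8 split and the ~8-fold slowdown come from. It assumes a simplified model in which groups are filled greedily up to 64 logical processors, which matches what is observed on this machine; the real assignment depends on the NUMA topology and the Windows version. The function name `split_into_groups` is made up for illustration.

```python
def split_into_groups(logical_processors, max_per_group=64):
    """Split logical processors into groups of at most max_per_group.

    Simplified assumption: groups are filled greedily, one after another.
    """
    groups = []
    remaining = logical_processors
    while remaining > 0:
        size = min(remaining, max_per_group)
        groups.append(size)
        remaining -= size
    return groups


groups = split_into_groups(72)
print(groups)                  # [64, 8]
# Ratio of available threads between the two groups, i.e. the expected
# slowdown for a well-scaling code that lands on the small group.
print(groups[0] / groups[1])   # 8.0
```

With hyperthreading disabled, the same model gives a single group of 36, which is why option 2 in the list further down avoids the split.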

I’ve tried setting environment variables like
OMP_NUM_THREADS = 72
OMP_PROC_BIND = spread
OMP_PLACES = cores

But this hasn’t had any effect. Therefore, as I see it, my options are:

  1. Force my application to always run on Node 0 so that I get 64 threads every time.

  2. Disable hyperthreading, so that I have only 1 thread for each core and get a single group of 36 threads.

  3. If I want to use all 72 threads with gfortran, I have to switch to a Linux environment (WSL) so that I can have a single group of 72 threads.

  4. If I want to use all 72 threads with Windows, I can switch to the Intel compiler and configure it to use both processor groups.

Am I understanding this correctly? I think option 3 makes the most sense, since if I’m doing HPC I should really just stop avoiding Linux.

Thanks,
Charles

I don’t know anything about your code, but hyperthreading is useless most of the time, and sometimes harmful, for most HPC codes.


Yeah…this is mostly it.


However, if you run Linux within Windows, I am not sure whether this removes the limitations of the underlying Windows system (I have, however, done no research on this specific problem).


Have you tried start?

$ start /node <0|1> prog.exe

Yes, this works. It took me a while to figure out though, since I normally use a PowerShell terminal and I wasn’t able to figure out how to do the same in PowerShell. However, if I’m running my program with fpm (i.e., start /NODE 0 fpm run), it doesn’t work, since the child application launched by fpm isn’t bound to node 0. I could still write a script which builds and runs my programs using start /NODE 0.
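A rough sketch of such a build-and-run script is given here, under a few assumptions: the helper name `node_bound_command` is made up, the executable path is something you pass in yourself (fpm places it somewhere under its build directory), and `start` is reached through `cmd /c` because it is a cmd.exe builtin. The `/WAIT` and `/B` switches are standard `start` options; whether this plays nicely with your particular setup is untested.

```python
import subprocess
import sys


def node_bound_command(exe, node=0, args=()):
    """Build a cmd.exe command line that launches exe via `start /NODE`."""
    # The empty string is the window title; without it, `start` would
    # treat a quoted program path as the title.
    # /WAIT blocks until the program exits, /B avoids opening a new window.
    return ["cmd", "/c", "start", "", "/NODE", str(node),
            "/WAIT", "/B", exe, *args]


if __name__ == "__main__" and sys.platform == "win32" and len(sys.argv) > 1:
    # Build with fpm first, then launch the executable directly so that
    # the program itself (not fpm) is the process bound to the node.
    subprocess.run(["fpm", "build"], check=True)
    result = subprocess.run(node_bound_command(sys.argv[1], node=0,
                                               args=sys.argv[2:]))
    sys.exit(result.returncode)
```

Saved under a name of your choice and called with the path to the built executable as its first argument, this sidesteps the problem of fpm's child process not inheriting the node binding.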

Regardless, I think the WSL solution is what I’m going to go with.

Interesting point. I will try the WSL solution and find out, I guess!

Unfortunately, I did some testing and in my case hyperthreading helps quite a lot: 64 threads perform about 30% faster than 36 threads without hyperthreading.

There seems to be a limit of 64 CPUs that can be “trivially” utilized by a single process in Windows 10. This limitation seems to have been lifted in Windows 11.

You may search for “Windows” and “Processor Group”. Again, I have no personal experience.

This page might be helpful:

I found that trying to double use (--oversubscribe) the physical cores on my computer just doubles the amount of time a thread takes. Hyperthreading is a marketing way of saying that another thread is ready to start executing when the first thread hits a cache miss, i.e., the data it needs is not already in cache and it must wait for L2/L3/system RAM to continue. If thread 0 of core 0 never has to wait on memory, thread 1 of core 0 will never get task time. That doesn’t help get an answer any faster!

I usually program with OpenCoarrays on Windows Subsystem for Linux because it’s nice and easy!