I have quite a specific problem which I’m not super equipped to debug myself, and was wondering if someone had some insight.
I’m running my Fortran program (gfortran 14.2.0 with OpenMP) on Windows 10, and my workstation has an Intel Xeon w9-3475X processor (36 cores, 72 threads). As I understand it, Windows automatically partitions the logical processors into groups of at most 64. This leads to an asymmetric grouping on my machine: one group with 64 threads (Node 0) and another with just 8 (Node 1). When I run my program, there seems to be a 50/50 chance of which NUMA node it lands on, and it only uses the resources of that node. If it runs on Node 0, I get 64 threads and can happily parallelize with OpenMP. If it runs on Node 1, I only get 8 threads, and the result is my code runs ~8-fold slower.
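A quick way to confirm what your process actually sees is to query the OpenMP runtime at startup. A minimal sketch (the program and file names are mine):

```fortran
! check_procs.f90 -- print how many logical processors this process can use.
! Build with: gfortran -fopenmp check_procs.f90 -o check_procs
program check_procs
   use omp_lib, only: omp_get_num_procs, omp_get_max_threads
   implicit none
   print '(a,i0)', 'Logical processors visible to this process: ', omp_get_num_procs()
   print '(a,i0)', 'Default OpenMP thread count:                ', omp_get_max_threads()
end program check_procs
```

On a machine with the asymmetric grouping described above, the first number should come back as either 64 or 8 depending on which group the process was assigned to.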
However, if you run Linux within Windows, I am not sure whether this removes the limitations of the underlying Windows system (I have not, however, done any research on this specific problem).
Yes, this works. It took me a while to figure out, though, since I normally use a PowerShell terminal and I wasn’t able to figure out how to do the same in PowerShell. However, if I run my program through fpm (i.e., start /NODE 0 fpm run), it doesn’t work, since the child process launched by fpm isn’t bound to node 0. I could still write a script which builds my program with fpm and then launches the executable itself with start /NODE 0.
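That two-step script could look something like the following Windows batch sketch (the path to the built executable is a placeholder; fpm’s build directory layout depends on your compiler and profile):

```bat
:: run_node0.bat -- build with fpm, then launch the binary bound to NUMA node 0.
:: Adjust the executable path below for your project; it is a placeholder here.
fpm build
start /NODE 0 /WAIT /B build\path\to\myapp.exe
```

The /WAIT and /B flags keep the program attached to the current console rather than opening a new window.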
Regardless, I think the WSL solution is what I’m going to go with.
Unfortunately, I did some testing, and in my case hyperthreading helps quite a lot: 64 threads performs about 30% faster than 36 threads without hyperthreading.
There seems to be a limit of 64 CPUs that can be “trivially” utilized by a single process in Windows 10. This limitation seems to have been lifted in Windows 11.
You may search for “Windows” and “Processor Group”. Again, I have no personal experience.
I found that trying to oversubscribe (--oversubscribe) the physical cores on my computer just doubles the amount of time a thread takes. Hyperthreading is a marketing way of saying that another thread is ready to start executing when the first thread incurs a cache miss, i.e., branch prediction loaded the wrong memory and the core must wait on L2/L3/system RAM to continue. If thread 0 of core 0 never fails to read a branch in time, thread 1 of core 0 will never get task time. That doesn’t help get an answer any faster!
I usually program with OpenCoarrays on Linux under Windows because it’s nice and easy!
@Knarfnarf It’s a little bit more complicated than that, and no, it is not just marketing. Hyperthreading (actually Simultaneous Multithreading = SMT) is primarily a hardware capability to genuinely execute instructions from 2 different threads simultaneously in the same core: for instance, an integer addition for thread 0 and a floating-point multiplication for thread 1, as these 2 operations do not use the same physical circuits.
Very roughly speaking, SMT can help when the two threads execute very different code, as that increases the chances they don’t need the same resources at the very same time. In contrast, the parallelization of simple loops is generally not more efficient with SMT, as all the threads perform the same kind of operations at the very same time.
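To measure how much SMT helps a given workload, one can time the same parallel loop at both thread counts. A minimal sketch (the array size and loop body are arbitrary choices of mine):

```fortran
! smt_test.f90 -- time a simple parallel loop; run with OMP_NUM_THREADS=36 and 72.
! Build with: gfortran -O2 -fopenmp smt_test.f90 -o smt_test
program smt_test
   use omp_lib, only: omp_get_wtime, omp_get_max_threads
   implicit none
   integer, parameter :: n = 20000000
   real, allocatable :: x(:), y(:)
   real :: a
   double precision :: t0, t1
   integer :: i
   allocate(x(n), y(n))
   x = 1.0; y = 2.0; a = 1.5
   t0 = omp_get_wtime()
   !$omp parallel do
   do i = 1, n
      y(i) = y(i) + a*x(i)
   end do
   !$omp end parallel do
   t1 = omp_get_wtime()
   print '(a,i0,a,f8.3,a)', 'threads = ', omp_get_max_threads(), &
                            ', time = ', t1 - t0, ' s'
end program smt_test
```

A memory-bound loop like this one would be expected to show little or no SMT benefit; code with more mixed instruction streams may behave differently.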
I went with WSL and it fixed my issues. Some notes for anyone who is curious.
Right away, there was no issue with NUMA nodes: querying numactl --hardware showed 72 CPUs on a single node. However, when running my program, Task Manager still showed it using at most 64 threads (still an improvement, because it could always use 64 threads, rather than getting 8 half of the time).
To get 72 threads, I had to run (from an elevated Windows command prompt, not the Linux environment) bcdedit /set hypervisorschedulertype Core and then reboot.
And now I can use up to 72 threads within WSL (running natively on Windows is, however, still limited to the 64 threads of node 0).
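For anyone checking their own WSL setup, the relevant queries look something like this (numactl may need to be installed first, e.g. sudo apt install numactl; myapp is a placeholder for your executable):

```shell
# How many logical CPUs the Linux environment exposes:
nproc

# NUMA layout as seen from inside WSL (skipped if numactl is not installed):
command -v numactl >/dev/null && numactl --hardware

# Then run your OpenMP program with an explicit thread count, e.g.:
#   OMP_NUM_THREADS=72 ./myapp
```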