I went with WSL and it fixed my issues. Some notes for anyone who is curious.
Right away, there was no issue with NUMA nodes: querying numactl --hardware showed 72 CPUs on a single node. However, when I ran my program, Task Manager still showed it using at most 64 threads. This was still an improvement, because it could always use 64 threads, rather than getting only 8 half the time.
To get 72 threads, I had to run the following from a Windows command prompt (not from the Linux environment):
bcdedit /set hypervisorschedulertype Core
Now I can use up to 72 threads within WSL. Running natively on Windows, however, is still limited to 64 threads on node 0.
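
For anyone who wants to check the count from inside WSL without watching Task Manager, here is a minimal sketch in Python. It is not what I originally ran; usable_cpu_count is just a helper name made up for this example. It assumes a Linux environment where os.sched_getaffinity is available and falls back to os.cpu_count() otherwise.

```python
import os


def usable_cpu_count():
    """Return the number of logical CPUs this process may be scheduled on."""
    # os.sched_getaffinity is only available on some platforms (Linux,
    # including WSL). Elsewhere, fall back to os.cpu_count(), which reports
    # the logical CPUs the OS sees rather than the ones this process may use.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(f"logical CPUs reported by the OS: {os.cpu_count()}")
    print(f"CPUs usable by this process:     {usable_cpu_count()}")
```

If the setting has taken effect, both numbers should presumably match the CPU count that numactl --hardware reports inside WSL.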