OMP_PROC_BIND / OMP_PLACES

Hi,

When using OpenMP, if I want to ensure that a given thread stays attached to the same physical core for the whole lifetime of the program, it seems that setting the environment variable OMP_PROC_BIND to true is enough. Is that correct?

Now, I have a machine with 2 CPUs, each one having 16 cores. When running a program with 32 threads, do I have to specify something special (possibly with OMP_PLACES?) to ensure that threads 0-15 are attached to the first CPU and threads 16-31 to the second CPU, or is that guaranteed by default?

Another case: on this 32-core machine, I want to use only 16 threads and be sure that they are all on the same CPU. I understand that I should set OMP_PLACES to sockets: is that correct?

And finally, about the allocations: in a NUMA scheme, I assume that the memory allocated within a given thread is, as far as possible, placed on the physical RAM that is “attached” to the CPU where the thread is running. And that this happens at the “first touch”, not necessarily when “allocate” is executed (at least on the major current OSes, which do “lazy allocation”). Is that correct?

1 Like

Does the CPU support hyper-threading? I assume it only has 2 NUMA domains (1 for each socket)?

I think the general answer is no: if you don’t specify anything, the behavior is implementation-defined. You can check what occurs in practice with OMP_DISPLAY_AFFINITY=1.
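
Any program with a parallel region will do for such a check; here is a minimal sketch (the program name and output format are just an illustration) that can be run with OMP_DISPLAY_AFFINITY=1 to see which OS proc set each thread ends up bound to:

program affinity_check
   use omp_lib
   implicit none
   ! Each thread reports its number; with OMP_DISPLAY_AFFINITY=1 the runtime
   ! additionally prints the OS proc set each thread is bound to.
   !$omp parallel
   print '(a,i0,a,i0)', 'thread ', omp_get_thread_num(), ' of ', omp_get_num_threads()
   !$omp end parallel
end program affinity_check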

For instance, in the libgomp docs they say threads can be moved between CPUs:

If OMP_PLACES and GOMP_CPU_AFFINITY are unset and OMP_PROC_BIND is either unset or false, threads may be moved between CPUs following no placement policy.

Moreover, the libgomp docs state:

When undefined, OMP_PROC_BIND defaults to TRUE when OMP_PLACES or GOMP_CPU_AFFINITY is set and FALSE otherwise.

So setting OMP_PLACES=sockets will automatically bind/pin the OpenMP threads when using libgomp (but not necessarily in libiomp).
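
You can also query the effective policy from inside the program, independently of which runtime’s defaults apply. A minimal sketch, assuming an OpenMP 4.5 (or later) runtime:

program show_binding
   use omp_lib
   implicit none
   integer (kind=omp_proc_bind_kind) :: policy
   ! Report the binding policy and place partition actually in effect,
   ! whatever the (implementation-defined) defaults of the runtime are.
   policy = omp_get_proc_bind()
   print '(a,i0)', 'proc_bind policy (0=false, 1=true, 2=master, 3=close, 4=spread): ', policy
   print '(a,i0)', 'number of places: ', omp_get_num_places()
   !$omp parallel
   print '(a,i0,a,i0)', 'thread ', omp_get_thread_num(), ' runs in place ', omp_get_place_num()
   !$omp end parallel
end program show_binding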


For libiomp, the thread affinity interface is documented here. The OpenMP variables are documented here.

By default, the thread binding is set to FALSE, which is equivalent to KMP_AFFINITY=none and means:

Does not bind OpenMP threads to particular thread contexts; however, if the operating system supports affinity, the compiler still uses the OpenMP thread affinity interface to determine machine topology.

I don’t understand the implications of the second part of the sentence.


Especially on HPC systems you need to be careful, because sometimes the variables are already set by the admins/modules. I’ve encountered an environment where KMP_ variables were set and took precedence over the OMP_ variables. To use the OpenMP interface, I had to explicitly unset the KMP_AFFINITY variable. If not set correctly, it’s easy to slow everything down by mixing up MPI ranks and OpenMP threads in ways that don’t respect the machine topology or clash at the software level.


Probably not relevant to your question, but on macOS I’m not sure OpenMP thread affinity is supported at all. The OS kernel schedules the threads as it deems best and the affinity variables are ignored. At least that’s what I’ve inferred from the following threads:

A while ago I did some experiments with XCode Instruments (the native Mac profiler), and one can see the thread switching which takes place.

2 Likes

The CPUs are 2x Intel Xeon Gold 6346. AFAIK a single Intel Xeon processor does not have several NUMA domains, so on this machine there are probably 2 NUMA domains (one for each CPU). And yes, it has hyper-threading, although I don’t want to use it.

But I was implicitly talking about the case where OMP_PROC_BIND would be set to true. I have tested with OMP_DISPLAY_AFFINITY=1, and at least with the Intel compiler the threads are bound to the cores in order:

% setenv OMP_NUM_THREADS 32 && setenv OMP_PROC_BIND true  && setenv OMP_DISPLAY_AFFINITY 1 && ./a.out
OMP: pid 300862 tid 300875 thread 13 bound to OS proc set {13}
OMP: pid 300862 tid 300862 thread 0 bound to OS proc set {0}
OMP: pid 300862 tid 300868 thread 6 bound to OS proc set {6}
OMP: pid 300862 tid 300863 thread 1 bound to OS proc set {1}
OMP: pid 300862 tid 300864 thread 2 bound to OS proc set {2}
OMP: pid 300862 tid 300877 thread 15 bound to OS proc set {15}
OMP: pid 300862 tid 300871 thread 9 bound to OS proc set {9}
OMP: pid 300862 tid 300865 thread 3 bound to OS proc set {3}
OMP: pid 300862 tid 300869 thread 7 bound to OS proc set {7}
OMP: pid 300862 tid 300876 thread 14 bound to OS proc set {14}
OMP: pid 300862 tid 300870 thread 8 bound to OS proc set {8}
OMP: pid 300862 tid 300867 thread 5 bound to OS proc set {5}
OMP: pid 300862 tid 300874 thread 12 bound to OS proc set {12}
OMP: pid 300862 tid 300873 thread 11 bound to OS proc set {11}
OMP: pid 300862 tid 300872 thread 10 bound to OS proc set {10}
OMP: pid 300862 tid 300866 thread 4 bound to OS proc set {4}
OMP: pid 300862 tid 300882 thread 20 bound to OS proc set {20}
OMP: pid 300862 tid 300886 thread 24 bound to OS proc set {24}
OMP: pid 300862 tid 300878 thread 16 bound to OS proc set {16}
OMP: pid 300862 tid 300890 thread 28 bound to OS proc set {28}
OMP: pid 300862 tid 300888 thread 26 bound to OS proc set {26}
OMP: pid 300862 tid 300889 thread 27 bound to OS proc set {27}
OMP: pid 300862 tid 300881 thread 19 bound to OS proc set {19}
OMP: pid 300862 tid 300884 thread 22 bound to OS proc set {22}
OMP: pid 300862 tid 300883 thread 21 bound to OS proc set {21}
OMP: pid 300862 tid 300893 thread 31 bound to OS proc set {31}
OMP: pid 300862 tid 300891 thread 29 bound to OS proc set {29}
OMP: pid 300862 tid 300887 thread 25 bound to OS proc set {25}
OMP: pid 300862 tid 300880 thread 18 bound to OS proc set {18}
OMP: pid 300862 tid 300892 thread 30 bound to OS proc set {30}
OMP: pid 300862 tid 300885 thread 23 bound to OS proc set {23}
OMP: pid 300862 tid 300879 thread 17 bound to OS proc set {17}

(setting OMP_PROC_BIND to close or spread gives the same pattern)

With only 16 threads, the threads are spread over the 2 CPUs:

% setenv OMP_NUM_THREADS 16 && setenv OMP_PROC_BIND true && setenv OMP_DISPLAY_AFFINITY 1 && ./a.out
OMP: pid 302731 tid 302731 thread 0 bound to OS proc set {0}
OMP: pid 302731 tid 302732 thread 1 bound to OS proc set {2}
OMP: pid 302731 tid 302733 thread 2 bound to OS proc set {4}
OMP: pid 302731 tid 302734 thread 3 bound to OS proc set {6}
OMP: pid 302731 tid 302735 thread 4 bound to OS proc set {8}
OMP: pid 302731 tid 302736 thread 5 bound to OS proc set {10}
OMP: pid 302731 tid 302737 thread 6 bound to OS proc set {12}
OMP: pid 302731 tid 302738 thread 7 bound to OS proc set {14}
OMP: pid 302731 tid 302739 thread 8 bound to OS proc set {16}
OMP: pid 302731 tid 302745 thread 14 bound to OS proc set {28}
OMP: pid 302731 tid 302742 thread 11 bound to OS proc set {22}
OMP: pid 302731 tid 302740 thread 9 bound to OS proc set {18}
OMP: pid 302731 tid 302741 thread 10 bound to OS proc set {20}
OMP: pid 302731 tid 302743 thread 12 bound to OS proc set {24}
OMP: pid 302731 tid 302746 thread 15 bound to OS proc set {30}
OMP: pid 302731 tid 302744 thread 13 bound to OS proc set {26}

OMP_PROC_BIND must be set to close to get them on a single CPU:

% setenv OMP_NUM_THREADS 16 && setenv OMP_PROC_BIND close && setenv OMP_DISPLAY_AFFINITY 1 && ./a.out
OMP: pid 303690 tid 303700 thread 10 bound to OS proc set {10}
OMP: pid 303690 tid 303704 thread 14 bound to OS proc set {14}
OMP: pid 303690 tid 303699 thread 9 bound to OS proc set {9}
OMP: pid 303690 tid 303692 thread 2 bound to OS proc set {2}
OMP: pid 303690 tid 303693 thread 3 bound to OS proc set {3}
OMP: pid 303690 tid 303690 thread 0 bound to OS proc set {0}
OMP: pid 303690 tid 303701 thread 11 bound to OS proc set {11}
OMP: pid 303690 tid 303703 thread 13 bound to OS proc set {13}
OMP: pid 303690 tid 303705 thread 15 bound to OS proc set {15}
OMP: pid 303690 tid 303691 thread 1 bound to OS proc set {1}
OMP: pid 303690 tid 303697 thread 7 bound to OS proc set {7}
OMP: pid 303690 tid 303694 thread 4 bound to OS proc set {4}
OMP: pid 303690 tid 303696 thread 6 bound to OS proc set {6}
OMP: pid 303690 tid 303695 thread 5 bound to OS proc set {5}
OMP: pid 303690 tid 303698 thread 8 bound to OS proc set {8}
OMP: pid 303690 tid 303702 thread 12 bound to OS proc set {12}

I tried playing with OMP_PLACES, but it seems that it’s not needed here.

1 Like

Some older ones do (did). The CoolMUC2 cluster has these dual-socket 28-core “Haswell” servers with two Intel Xeon E5-2697 v3 CPUs each:

ivan@cm2login2:~> lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              56
On-line CPU(s) list: 0-55
Thread(s) per core:  2
Core(s) per socket:  14
Socket(s):           2
NUMA node(s):        4
...
Model name:          Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
...
NUMA node0 CPU(s):   0-6,28-34
NUMA node1 CPU(s):   7-13,35-41
NUMA node2 CPU(s):   14-20,42-48
NUMA node3 CPU(s):   21-27,49-55

In this CPU package the 14 physical cores are divided into a dual-ring layout with two independent memory controllers (image source: Analysis of Intel’s Haswell Microarchitecture Using the ECM Model and Microbenchmarks | SpringerLink).

The 7 cores per NUMA domain are tied to this “Cluster on Die” mode (section 2.5 in the paper above), where the cores are divided equally among the memory controllers. This page contains some further die configurations: The Magic Inside the Uncore - Intel Xeon E5 Version 3: Up to 18 Haswell EP Cores

1 Like

I’ve never tried “thread pinning”, so I’m interested in what kinds of situations it is useful in. Is it basically for machines with multiple CPUs (like the Xeon above with two sockets)? If so, is it unnecessary to care about it for a single-CPU machine like a Ryzen? (I’ve tried asking ChatGPT 4o-mini, and the answer is roughly that it is also useful for Ryzen, so I am confused…)

Output of ChatGPT4o-mini [Q] Is thread pinning important for Ryzen?

[A] Yes, thread pinning can be important for AMD Ryzen CPUs, especially in certain contexts:

  1. Multi-core Performance: Ryzen CPUs often have multiple cores and threads. Pinning threads to specific cores can help improve cache locality and reduce cache misses, leading to better performance in multi-threaded applications.
  2. Game and Application Optimization: Many games and applications benefit from thread pinning, as it can help maintain consistent performance and reduce latency by keeping critical threads on dedicated cores.
  3. NUMA Considerations: Some Ryzen systems, particularly those with higher core counts, may exhibit NUMA-like behavior, where memory access times vary based on core and memory location. Pinning threads can help ensure optimal memory access patterns.
  4. High-Performance Workloads: For workloads that demand high performance, such as video rendering or scientific computations, thread pinning can optimize resource utilization and reduce contention.

While Ryzen CPUs manage threads well, especially with their efficient architecture, thread pinning can still provide benefits in specific scenarios where performance consistency and optimization are crucial.

I have been trying to understand this issue with a Ryzen 5900X (12 cores), under Windows 10 with gfortran version 11.

I have a large run that requires 10 threads, all of which use a shared 25 GByte array.
Keeping each thread on the same CPU, and having all threads run at the same rate, could optimise the use of the L3 cache, but this has not been achievable. (Poor alignment of threads has extended a 3-hour run to 8 hours, so it is a noticeable problem; the change was observed after OS updates.)

“Task Manager” indicates that the threads are moving about between the 24 logical CPUs.
I have disabled hyper-threading, as having 2 threads on the same physical core appears to give inferior performance, but the core usage still varies considerably.

I have not been able to improve performance by combining the environment variable OMP_PROC_BIND with gfortran on Windows.
I have not been able to lock each thread to a single CPU, so I would be interested in what you may be able to achieve.

However, by using !$OMP BARRIER I have improved performance, which I assume is related to better L3 cache efficiency.
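
For context, here is a minimal sketch (a hypothetical loop structure, not the poster’s code) of how an explicit barrier keeps the threads in step: with nowait on the work-sharing loop, the !$OMP BARRIER is what prevents any thread from running ahead to the next sweep of the shared array.

program barrier_sketch
   implicit none
   integer, parameter :: n = 4000000, nsweeps = 100
   real, allocatable :: a(:)
   integer :: i, k
   allocate(a(n))
   a = 1.0
   !$omp parallel private(i, k)
   do k = 1, nsweeps
      !$omp do schedule(static)
      do i = 1, n
         a(i) = 0.999 * a(i) + 0.001
      end do
      !$omp end do nowait
      ! Explicit synchronization: every thread finishes this sweep of the
      ! shared array before any thread starts the next one.
      !$omp barrier
   end do
   !$omp end parallel
   print *, a(1), a(n)
end program barrier_sketch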

I will be interested to see if Win 11 plus 24H2 helps in my case, as thread placement appears to be very much controlled by the OS.
I have no experience of Linux behaviour, so I don’t know how the OSes compare.

1 Like

OK, I checked this one, and there are 2 NUMA domains:

% lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              1
Core(s) per socket:              16
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           106
Model name:                      Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz
Stepping:                        6
CPU MHz:                         1067.798
CPU max MHz:                     3600.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6200.00
Virtualization:                  VT-x
L1d cache:                       1.5 MiB
L1i cache:                       1 MiB
L2 cache:                        40 MiB
L3 cache:                        72 MiB
NUMA node0 CPU(s):               0-15
NUMA node1 CPU(s):               16-31
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts 
                                 acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art ar
                                 ch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulq
                                 dq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4
                                 _1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm a
                                 bm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb
                                  stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjus
                                 t bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512
                                 ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xget
                                 bv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoi
                                 nvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip p
                                 ku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpop
                                 cntdq rdpid fsrm md_clear pconfig flush_l1d arch_capabilities

That is interesting. It looks like hyper-threading (HT) is disabled at the BIOS level; the Intel product specification for the Xeon Gold 6346 says that Intel HT is available. (HT is rarely useful in bandwidth-limited computational codes anyway.)

I was looking at an Intel Xeon Platinum 8380 (also a 3rd Gen. Xeon SP using Icelake cores), also dual socket but with 40 cores per CPU:

 Model name:            Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  40
    Socket(s):           2

Caches (sum of all):     
  L1d:                   3.8 MiB (80 instances)
  L1i:                   2.5 MiB (80 instances)
  L2:                    100 MiB (80 instances)
  L3:                    120 MiB (2 instances)
NUMA:                    
  NUMA node(s):          4
  NUMA node0 CPU(s):     0-19,80-99
  NUMA node1 CPU(s):     20-39,100-119
  NUMA node2 CPU(s):     40-59,120-139
  NUMA node3 CPU(s):     60-79,140-159

I failed to find any architecture images for these specific models, but according to the following articles these SMPs use a mesh architecture:

I believe in my case that Sub-NUMA Clustering (the CPU package is logically split into 2 NUMA domains) is turned on.

It can be beneficial in many different cases, but it seems to me that it is most useful on a NUMA architecture: that can be multiple sockets, but it can also be a single socket with a CPU that has an internal NUMA architecture (as mentioned by @ivanpribec).

1 Like

Today I re-read the thread affinity chapter in Using OpenMP—The Next Step: Affinity, Accelerators, Tasking, and SIMD. There is a lot to process (the chapter is 67 pages long!), but it provides a much more understandable explanation than the OpenMP standard text, including some architectural background on NUMA architectures.

To summarize the main concepts of the chapter:

  • OpenMP Places - A place is a hardware resource that can execute an OpenMP thread.
  • Affinity Policy - While the place list defines all the resources available to the OpenMP runtime scheduler, the affinity policy may be adjusted on each (nested) parallel region, and controls how the threads should be scheduled relative to each other.

The two concepts together specify the thread affinity. In the absence of either one of these, or both, implementation-defined default settings are used. The interaction is summarized in a table in the book (Fig. 4.4, page 160).

Here are the tips and tricks from the concluding remarks:

  • The OMP_PLACES and OMP_PROC_BIND environment variables have default settings. We strongly recommend setting these explicitly, or checking the documentation for the defaults.
  • The use of environment variable OMP_DISPLAY_ENV is strongly recommended to verify the affinity settings. [The book was published before OMP_DISPLAY_AFFINITY was introduced in OpenMP 5.0.]
  • The place list is static. It is defined upon program startup and cannot be modified.
  • Threads are not allowed to migrate between places.
  • If a place contains multiple resource numbers, all numbers in the list are equal from a scheduling point of view.
  • The affinity policy [emphasis mine] may be adapted on each region.
  • The abstract names are preferred when specifying the place list.
  • An implementation may support additional abstract names to support specific architectural features.
  • When more control over the placement is needed, the interval notation provides for a compact way to define the places. This is not only less error-prone, but also easier to adapt to other platforms.

Is that 16 hardware threads (“places”) or OpenMP threads? You can limit the places to a single socket using OMP_PLACES=sockets(1), but you still can’t control which socket. To control exactly which one, you’d need to use the interval notation {<lower-bound> : <count> [: <stride>]}. So you could choose:

  • socket 1 (NUMA node0): OMP_PLACES={0:16:1}, which expands to {0,1,...,15}, or
  • socket 2 (NUMA node1): OMP_PLACES={16:16:1}, which expands to {16,17,...,31}.

I’ll quote from the book here,

With the First Touch placement policy, the thread (or process) that “touches” the data for the first time, has ownership of the corresponding page. This defines the home node for that page. More specifically, the first thread that accesses a page, memory capacity issues aside, has the data allocated in the memory closest to this thread.

This tends to work okay if each thread only needs to work on blocks of data it initialized and allocated itself. However, it potentially creates problems if a master thread initializes the data (or reads it via I/O) and other threads then need to access it. (For instance, if the master thread was on socket 2 and the worker threads in a parallel section got placed on socket 1.)
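
The usual remedy is to initialize the data in a parallel loop with the same schedule as the later compute loops, so each thread first-touches (and thereby places) the pages it will use. A minimal sketch, assuming a shared array and pinned threads (without pinning, the placement gained by first touch can be lost again):

program first_touch
   implicit none
   integer, parameter :: n = 100000000
   real, allocatable :: a(:)
   integer :: i

   allocate(a(n))   ! with lazy allocation, no physical pages are committed yet

   ! First touch in parallel: each thread writes, and thereby places, the pages
   ! of the part of the array it will work on later.
   !$omp parallel do schedule(static)
   do i = 1, n
      a(i) = 0.0
   end do
   !$omp end parallel do

   ! Compute loop with the same static schedule: each thread accesses memory
   ! that sits on its own NUMA node.
   !$omp parallel do schedule(static)
   do i = 1, n
      a(i) = 2.0 * a(i) + 1.0
   end do
   !$omp end parallel do

   print *, a(1), a(n)
end program first_touch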

1 Like

Just wanted to confirm this on macOS (Intel Inside): using GCC (libgomp) I get the warning:

$ OMP_PLACES=cores OMP_NUM_THREADS=4 ./build/main

libgomp: Affinity not supported on this configuration

If I ask to display affinity information it shows:

level 1 thread 0x7ff84cf22dc0 affinity 0-11
level 1 thread 0x700002f21000 affinity 0-11
level 1 thread 0x700003124000 affinity 0-11
level 1 thread 0x700003327000 affinity 0-11
OMP: pid 4891 tid 259 thread 0 bound to OS proc set {undefined}
OMP: pid 4891 tid 7171 thread 3 bound to OS proc set {undefined}
[... truncated ...]

AFAIK, the ARM-based Macs don’t have SMT, but they do have performance- and efficiency-cores. Would be interesting to find out if libomp (LLVM) supports affinity on the newer Macs.

1 Like