I can imagine multiple sources of variability:
- threads have different workloads
- frequency differences between cores (say, downclocking triggered by heavy AVX instructions)
- the CPU has heterogeneous cores (for example, mobile devices have fast and slow cores),
- latency due to the non-uniform memory system (measured here: Core to Core Latency Data on Large Systems – Chips and Cheese)
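To narrow down which of these you're hitting, it can help to time each thread's share of a parallel loop separately. Here is a minimal sketch of what I mean (the do_work kernel and the 256-thread upper bound are just placeholders standing in for your real code, not something from it):

```c
/* Compile with e.g.:  gcc -O2 -fopenmp timing.c -lm */
#include <math.h>
#include <stdio.h>
#include <omp.h>

#define N 1000000
#define MAX_THREADS 256   /* assumption: no more than 256 OpenMP threads */

/* Stand-in workload whose cost grows with i, to mimic uneven iterations. */
static double do_work(int i) {
    double s = 0.0;
    for (int k = 0; k < i % 1000; ++k)
        s += sin((double)k);
    return s;
}

int main(void) {
    double t_thread[MAX_THREADS] = {0.0};  /* per-thread elapsed time */
    double sum = 0.0;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        double t0 = omp_get_wtime();

        /* nowait: measure only this thread's own iterations, not the barrier wait */
        #pragma omp for nowait reduction(+:sum)
        for (int i = 0; i < N; ++i)
            sum += do_work(i);

        t_thread[tid] = omp_get_wtime() - t0;
    }

    /* A large spread between threads hints at imbalance, frequency, or NUMA effects. */
    for (int t = 0; t < omp_get_max_threads(); ++t)
        printf("thread %3d: %.3f s\n", t, t_thread[t]);
    printf("checksum: %f\n", sum);
    return 0;
}
```

If the per-thread times come out roughly equal, load imbalance is probably not your problem, and the other items on the list become the more likely suspects.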
If your tasks are very heterogeneous, perhaps adding schedule(dynamic) to the work-sharing constructs could help?
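For example, something like the sketch below (uneven_task is just a made-up stand-in for whatever makes your iterations cost different amounts):

```c
/* Compile with e.g.:  gcc -O2 -fopenmp dynamic.c -lm */
#include <math.h>
#include <stdio.h>
#include <omp.h>

#define N 100000

/* Stand-in for a task whose cost varies wildly from iteration to iteration. */
static double uneven_task(int i) {
    double s = 0.0;
    for (int k = 0; k < (i % 97) * 200; ++k)
        s += cos((double)k);
    return s;
}

int main(void) {
    double total = 0.0;

    /* schedule(dynamic, 16): chunks of 16 iterations are handed out on demand,
       so threads that finish early simply grab more work instead of idling. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < N; ++i)
        total += uneven_task(i);

    printf("total = %f\n", total);
    return 0;
}
```

The chunk size (16 here) is a tuning knob: too small and the scheduling overhead starts to show, too large and the imbalance creeps back in. schedule(guided), or schedule(runtime) together with the OMP_SCHEDULE environment variable, are also worth a try.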
I’ve found these four presentations, all from Ruud van der Pas, very informative:
- Shared Memory Parallel Performance To The Extreme
- How To Befriend NUMA
- Mastering OpenMP Performance
- Make OpenMP Go Fast (+ Video)
(Edit: since it’s SC23 week, there is a nice series of OpenMP booth talks newly available here: Supercomputing 2023 - OpenMP)