-fopenmp on gfortran

I was testing AMD CPUs with our software. We use the -fopenmp flag. As seen in this table, the time to complete (in minutes, on 128 cores) gets worse with newer compiler versions:

gfortran-10: 5.8
gfortran-11: 8.7
gfortran-12: 11.3
gfortran-13: 12.5

I tested this again on our workstation with an AMD Ryzen 9 7950X, same story (see attached picture). When I additionally use the flags ‘-march=native -mtune=native’, I get similar performance with gfortran 12 as with gfortran 10. How can this be? Shouldn’t the compiler normally choose the correct architecture by itself?
[attached image: openmpi-amd-gfortran]

It can detect it, but this doesn’t mean it’s desirable to target only this one. Your CPU may have SSE2 instructions, but you may want the produced executable to run also on machines without SSE2. By default, compilers generally generate executables that can run on a wide range of CPU models, including older ones.

EDIT: you should give the full gfortran command line you are using

We use:

gfortran -O3 -W -Wall -Wno-target-lifetime -cpp -fPIC -Wno-compare-reals -Wno-uninitialized -funroll-all-loops -finline-functions -Wtabs -Wunused-variable -std=legacy -fopenmp -c -o mod_rate_calculation.o mod_rate_calculation.f90

but on gfortran 12 and AMD (in this case an AMD Ryzen 9 7950X, but also newer ones) this produces OpenMP code that runs slower than with gfortran 10.
If I use this command line for compilation:

gfortran -O3 -W -Wall -Wno-target-lifetime -cpp -fPIC -Wno-compare-reals -Wno-uninitialized -funroll-all-loops -finline-functions -Wtabs -Wunused-variable -std=legacy -fopenmp -march=native -mtune=native -c -o mod_rate_calculation.o mod_rate_calculation.f90

gfortran 12 on AMD produces code with the same scale-up as gfortran 10.

[attached image: openmpi-amd-gfortran-native]

Another variation of this idea is when you are running on a parallel cluster with a mixture of generations of CPUs. In an SPMD implementation, that single executable program will be run on all of the nodes.

This looks like a problem I had with AMD CPUs. Our numerical code runs and scales fine on Intel CPUs, both with ifort and with gfortran, but on AMD CPUs it was bad. As it turned out, the reason was that proper (i.e. spread) thread scheduling on AMD does not work out of the box. On Intel CPUs, threads are put onto different physical processors with different core ids (in /proc/cpuinfo wording). On AMD, threads are all over the place. With both compilers!

You can check this by disabling hyperthreading (e.g. echo "0" > /sys/devices/system/cpu/cpuXY/online). For a 16-core/32-processor AMD CPU, processors i and i+16 have the same core id, so disable i+16 (0 <= i < 16). Check the processor and core ids in /proc/cpuinfo first; I do not have a 7950X.
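
A minimal sketch of that check on a Linux system with sysfs (the sibling numbering depends on your CPU, so verify it first):

grep . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list   # which logical processors share a physical core
echo 0 | sudo tee /sys/devices/system/cpu/cpu16/online              # take the SMT sibling of core 0 offline (adjust the number)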

You can also check this by using htop which shows individual load for all processors. This made it pretty obvious that scheduling did not work.

I tried OMP_PLACES to tell the runtime about the cores, as it obviously had no idea about them. But this does not help if you want to run several instances, each with a few threads. For example, starting 4 simulations with 4 threads each and OMP_PLACES set, all 4 simulations were scheduled on processors 0…3, with abysmal performance.
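
If you do want to run several instances side by side, a possible workaround (just a sketch; the executable name and processor numbers are placeholders) is to give each instance its own, disjoint place list:

OMP_NUM_THREADS=4 OMP_PROC_BIND=close OMP_PLACES="{0},{1},{2},{3}" ./simulation case1 &
OMP_NUM_THREADS=4 OMP_PROC_BIND=close OMP_PLACES="{4},{5},{6},{7}" ./simulation case2 &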

My solution was to set OMP_PLACES to {0…31} (on a 32-core Threadripper), disallowing processors 32…63 for OpenMP. System threads were scheduled on those processors, so this was still more helpful than disabling hyperthreading.
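
In OMP_PLACES syntax that restriction can be written as a single place covering the first 32 logical processors (a sketch; adjust the count to your CPU):

export OMP_PLACES="{0:32}"   # logical processors 0…31; OpenMP threads stay off 32…63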

Anyway, the problem really felt strange and unexpected, in particular as none of the OMP scheduling options helped solve this issue, and I tried quite a number of them.

Has anybody had a similar problem and a better solution?


AMD CPUs generally have a NUMA architecture, which requires special care for the performance of OpenMP programs.


No, I really am talking about simple scheduling: one thread per core, no two threads on a single core. For me it was as simple as that, except that the runtime does not know about the processor layout…


By default (if OMP_PLACES and OMP_PROC_BIND are not set) an OpenMP runtime library just spawns threads, and it is the OS that is in charge of scheduling them on the cores. Obviously, the OS knows about the CPU layout.
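
Binding only happens when you opt in; a small sketch of the usual knobs (OMP_DISPLAY_AFFINITY needs a reasonably recent libgomp):

export OMP_PLACES=cores          # one place per physical core
export OMP_PROC_BIND=spread      # spread the threads across those places
export OMP_DISPLAY_AFFINITY=true # print where each thread actually lands
./a.out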


Unfortunately, it must be more complicated than that. The kernel obviously understands the processor layout. And the OpenMP runtime from the Fortran compiler has great influence on where threads are placed. In my case it pretty much looked like the runtime has no idea about the layout. For example, setting OMP_PLACES=cores or other options had no influence; threads were still placed on the same core id. The best I could do was switch off SMT/hyperthreading via .../cpu/cpuXY/online or restrict the processors to a set of distinct core ids via OMP_PLACES={0:32} (on a 3970X).

I just re-ran different settings that I had tried a year ago and had kept with comments. But now it works as expected. I do not need any OMP_PLACES options anymore :astonished: :face_with_monocle: Something must have been fixed in the last year?! As both compilers required OMP_PLACES a year ago and now both work without any settings, this looks like some kernel issue (or pthread?). Anyway, glad to see that this issue seems to be fixed.


Thanks for the hint. However, if the scheduling is the problem, why would using the additional options -march=native -mtune=native fix it?

But I will try OMP_PLACES and see if it fixes the problem without the ‘native’ compile settings.

Sorry, it looks like I missed your point and misread your first diagram. The second diagram posted later made that clear. I thought that you were concerned about the scaling at higher core counts (8 -> 16).

gfortran -Q --help=target shows you the default options. In particular, gfortran -Q --help=target | grep march shows the default -march option.
Maybe your gfortran versions have been configured with different default targets?
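
To see what -march=native actually resolves to on the build machine, the same query can be combined with the flag (a small sketch):

gfortran -march=native -Q --help=target | grep march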

Setting OMP_PLACES={0:16} and OMP_PROC_BIND=close did not improve the poor gfortran 12 behaviour. The binding did work, as seen in htop, but the performance did not improve. Only setting the -march and -mtune options has helped so far.

This gives: -march= x86-64

Valid options are:
Known valid arguments for -march= option:
i386 i486 i586 pentium lakemont pentium-mmx winchip-c6 winchip2 c3 samuel-2 c3-2 nehemiah c7 esther i686 pentiumpro pentium2 pentium3 pentium3m pentium-m pentium4 pentium4m prescott nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm intel geode k6 k6-2 k6-3 athlon athlon-tbird athlon-4 athlon-xp athlon-mp x86-64 x86-64-v2 x86-64-v3 x86-64-v4 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 znver4 btver1 btver2 generic native

What should it be for an AMD Ryzen 9 7950X 16-Core Processor? And what should it be to still be able to execute the code on a range of different CPU variants?

That should be -march=znver4 according to the specs; however, -march=native will automatically adapt to the processor you are compiling on. For more details I’d refer you to the documentation of -march=...

I think you’d want to use -mtune=znver4, which would tune the code for Zen 4 but still allow it to execute on older processors that may not have the Zen 4 specific instructions (do you really need this?).

Since most commodity x86-64 processors in use already support SSE and AVX instructions, you could also optimize for a particular microarchitecture level, e.g. -march=x86-64-v3 -mtune=znver4 (the target CPU supports all x86-64-v3 features, but tune the code as if the processor were an AMD Zen 4 architecture).
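
Applied to the compile line from earlier in the thread, that would look roughly like this (shortened to the relevant flags; a sketch, not a tested recommendation):

gfortran -O3 -fopenmp -march=x86-64-v3 -mtune=znver4 -c -o mod_rate_calculation.o mod_rate_calculation.f90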

Generally, that’s not the case, because the compiler assumes you’d like your code to work on all possible micro-architectures, or that you are cross-compiling (compiling on one system, but running on a system with a different CPU). That is why your default target is -march=x86-64 (the base instruction set).

Addendum: with -march=x86-64-v3 you’d be missing out on the fancy AVX512 instructions your Zen 4 CPU has to offer. But in general it’s good to profile your application and check whether they provide any benefit.

Still, there is the question why gfortran-10 manages to produce OpenMP code that is faster than gfortran-12. There is no difference in the -march default between the two compilers, and the other default settings are also mostly similar (except for additional options in gfortran-12 and a longer list of available architectures).

You are referring to the case without the machine-specific optimization flag. GCC 12 introduced a number of changes, including changes to the vectorizer cost model and to the OpenMP library. Other changes related to safety, correctness, and other things could also be the cause of the observed performance difference.

At the same time, the znver4 architecture flag was first introduced in GCC 12, so if you want to fully exploit your processor’s AVX512 capabilities you should stick with v12 (or upgrade to v13).

If you’re truly interested in what causes the difference, I’d recommend running your code with a sampling-based profiler like gprofng, which does not require instrumenting your build. After identifying the “offending” routines which cause the slow-down, you could study those in Compiler Explorer and check how things are scheduled.
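
For reference, a minimal sketch of such a profiling run (the experiment directory name test.1.er is the tool’s default and may differ on your system):

gprofng collect app ./my_program            # record a sampling experiment while the program runs
gprofng display text -functions test.1.er   # list functions sorted by CPU time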

There are two performance issues being demonstrated in this thread that I have also struggled with:

  1. reduced performance of Gfortran Ver 12 in comparison to Ver 10 (and Ver 8 & 9), which may be recovered in Ver 13
  2. Stalling of scale-up after 8 threads (very noticeable on my 5900X) and after 8 and 16 threads on 7950X.

For “1) reduced performance” with Gfortran Ver 12, I have not identified which OS you are using. It may well be that you have to check the new AMD support options for Gfortran and may need different compile options for the AMD 7950X. My use of Gfortran and -fopenmp since Ver 8 has shown reduced performance, but this has been mitigated by updating the compiler hardware and optimisation options for each version. Does the AMD 7950X support AVX512, and does Gfortran Ver 12 handle this well?

For “2) stalling” I have tried lots of options, as this has been a constant problem in my use of direct linear solvers in structural finite element calculations.

Is it memory bandwidth limitations?
Is it hyper-threading?
Is it Win OS changes to support Intel efficiency threads?

For my (large vector, direct linear solver) calculations, all of these appear possible.

Memory
Both the 5900X and 7950X have only dual-channel memory, which does appear to stall after a few threads, depending on the calculation type. Memory bandwidth has not kept pace with core count!
32 threads with dual-channel memory access does not suit my type of calculation. The problem is that more cores sell, and marketing is a big influence on processor development.

OS support for E-cores
I have been using Windows Ver 8 and Ver 10 (not yet 11), and there are well documented problems with early Ver 11. My experience of Win Ver 10 updates has also shown similar problems for my calculation, as the OS thread-to-core (re)allocation algorithm has reduced my scale-up performance.
All this tracks back to Intel’s introduction of different core types (P-cores and E-cores); their support by the OS has adversely affected AMD processors.

Hyper-threading
A similar thread performance issue can occur with Intel’s hyper-threading or AMD’s simultaneous multithreading. Some OpenMP calculations benefit from synchronised threads, especially for cache sharing. To overcome this, I have turned simultaneous multithreading off on AMD. The graph shows that this is easily justified: I don’t get any better performance with 12 threads vs 24 on the 5900X, and this post for the 7950X shows no better performance with 16 threads vs 32.
As yet, I don’t think the OS and Gfortran are doing a good job with hyper-threading on AMD.
My Intel 8700K has the same problem, so this is not just an AMD problem.

The other black art that can affect all these problems is efficient cache usage, especially of L3 between threads.
My problem with this and the other possible explanations is that mine has been an empirical analysis. I really don’t know what the main cause of these problems is.

I get very similar performance problems on my AMD 5900X to those presented in this thread. I need to test a processor with more memory channels, but I doubt this would be the magic solution. I need a bigger budget!

@solej, you could read the Gfortran Ver 12 and Ver 13 release notes and see if there are more appropriate compile options for the Ryzen 9 7950X, and please post your results if you find any recommendations.

My present Gfortran compile options include (combined as sketched below):
set basic=-c -fimplicit-none -fallow-argument-mismatch -march=native
set vector=-O3 -ffast-math -funroll-loops --param max-unroll-times=2
set omp=-fopenmp -fstack-arrays
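
These option sets are then combined on the compile line, roughly like this (a sketch assuming a Windows batch file; my_solver.f90 is a placeholder file name):

gfortran %basic% %vector% %omp% -o my_solver.o my_solver.f90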

I really don’t know if -ffast-math does much when memory bandwidth limitations become significant.

If you are solving large sparse matrices, then you are probably limited by memory bandwidth. My experience over the last couple of years is that you roughly need 4-6 GB/s of bandwidth per thread (Krylov subspace methods). High-end desktop processors do not have that much bandwidth in general. For example, the 3970X gets saturated at about 16-20 threads, if not fewer. (These observations also hold for Intel processors.)
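
As a rough back-of-the-envelope check for the 7950X (assuming dual-channel DDR5-5200, i.e. about 2 × 8 bytes × 5200 MT/s ≈ 83 GB/s theoretical peak bandwidth): at 4-6 GB/s per thread that budget is used up somewhere around 14-20 threads, which would be consistent with the scale-up stalling well before 32 threads.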