Reproducibility of data with the same OS/compilers/options but with different CPUs?

Hello,

I have recently been using both AMD (Ryzen) and Intel (Xeon) machines, both installed with the same OS (Ubuntu 22) and compilers (here gfortran 11.3). All software has been updated with apt to keep the two machines as close as possible on the software side. Under this condition, is it reasonable to expect that the two CPUs give identical results if the program used is completely identical (with no randomness in the input)? Or is it better to assume that the two results can differ very slightly because of the difference in CPUs? Also, does the answer depend on whether the program is serial (a single thread) or multi-threaded? (FWIW, my code uses OpenMP with OMP_SCHEDULE="static,10". It can also run in an MPI + OpenMP hybrid manner.)

EDIT: My code does not use any third-party binary libraries (like BLAS etc.), so the dependence on such libraries is minimal (I think). But if exp() etc. depends on the system’s math library, which may in turn differ between CPU types, that could also lead to a slight difference in results…?

It would be safer to assume that the results will differ at least at the round-off level, and then (re)think your algorithms to be as agnostic as possible to the order in which local operations are done and, more importantly, to round-off errors. Vectorization and multithreading will exacerbate such differences, so you had better guard yourself against them.

Once one is aware that even the sum of the elements of a vector is not associative in finite-precision arithmetic at the round-off level, the problem is no longer library dependencies, but the dependence on the order in which operations are carried out.

Yes, just to share the anecdote that made me aware of this: several years ago I was working on the implementation of a linear solver for electric charge conservation on 3D bodies, the simplest PDE ( divergence( elec_conductivity( Temperature(time) ) * gradient( electric_potential ) ) = 0 ). Solving a rather simple case on 1, 2 and 3 MPI processes gave me the same results, but with 4 MPI processes I would get, out of the blue, a divergence at one node of the FEM mesh, and only with -O3. After a lot of debugging I found that a vector synchronization was being carried out in a different sequence on each process; this accumulated a very small difference between the processes, but enough that when the iterative linear solver hit an error close to machine precision, it would diverge. Changing the preconditioner for the linear solver and tweaking the way the synchronization was done helped me recover proper convergence.
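To make the non-associativity point above concrete, here is a minimal sketch (illustrative data only, not related to the solver above; double precision assumed) showing that summing the same array in two different orders already changes the result in the last bits:

```fortran
! Minimal sketch: the same numbers summed in two different orders.
! In double precision the two sums typically differ in the last bits,
! which is exactly the order dependence discussed above.
program sum_order
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000
  real(dp) :: x(n), s_fwd, s_bwd
  integer :: i

  do i = 1, n
     x(i) = 1.0_dp / real(i, dp)**2   ! terms of widely varying magnitude
  end do

  s_fwd = 0.0_dp                      ! sum from largest to smallest term
  do i = 1, n
     s_fwd = s_fwd + x(i)
  end do

  s_bwd = 0.0_dp                      ! same terms, smallest to largest
  do i = n, 1, -1
     s_bwd = s_bwd + x(i)
  end do

  print *, 'forward  sum =', s_fwd
  print *, 'backward sum =', s_bwd
  print *, 'difference   =', s_fwd - s_bwd   ! usually nonzero, ~1e-16
end program sum_order
```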

4 Likes

Thanks very much for the detailed explanation!

Indeed, up to now I have always assumed that the results could differ between CPUs, at least at the round-off level (because the code is compiled specifically for each target CPU). The behavior of my program does not depend on such small differences, so this point is not a problem at the moment; i.e., I am not expecting those machines to give identical results, but rather to give different ones. More specifically, I am currently running a certain kind of dynamical simulation. Because the system is strongly chaotic, the time-dependent state of the system diverges rather quickly (say after 500 steps of time evolution) depending on slight differences in the compiler options etc. Statistical (thermodynamic) properties do converge to a well-defined value in a long simulation, though, and that value does not depend on compilers etc.

What surprised me was that two simulations run on the Ryzen and the Xeon (with the same OS/compilers/options/etc.) gave exactly the same state of the system even after 4000000 steps of time evolution. Here the Xeon is a pretty old one, while the Ryzen is relatively new. The compiler options are the same, here "-O2 + -fcheck=all" with no CPU-specific options. However, this may be exceptional, and I guess I will probably get different dynamics with higher optimizations (I will check this later). (FYI, the job below uses 4 MPI processes, each of which uses 4 OpenMP threads.)

[Xeon]
Model name:  Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
compiler version =   GCC version 11.3.0
compiler options =   -cpp -I /usr/lib/x86_64-linux-gnu/openmpi/lib/../../fortran/gfortran-mod-15/openmpi
-I /usr/lib/x86_64-linux-gnu/openmpi/lib -imultiarch x86_64-linux-gnu -D_REENTRANT
-D MPI -mtune=generic -march=x86-64 -g -O2 -Wall -Wextra -Wno-maybe-uninitialized
-Wno-unused-function -fimplicit-none -ffree-line-length-none -fmax-errors=1 -fopenmp
-fcheck=all -fbacktrace -finit-real=snan -finit-integer=-999 -finit-derived -ffpe-trap=invalid,zero,overflow
-fpre-include=/usr/include/finclude/math-vector-fortran.h

[Ryzen]
Model name:  AMD Ryzen 9 5950X 16-Core Processor
compiler version =   GCC version 11.3.0
compiler options = -cpp -I /usr/lib/x86_64-linux-gnu/openmpi/lib/../../fortran/gfortran-mod-15/openmpi 
-I /usr/lib/x86_64-linux-gnu/openmpi/lib -imultiarch x86_64-linux-gnu -D_REENTRANT
-D MPI -mtune=generic -march=x86-64 -g -O2 -Wall -Wextra -Wno-maybe-uninitialized
-Wno-unused-function -fimplicit-none -ffree-line-length-none -fmax-errors=1 -fopenmp
-fcheck=all -fbacktrace -finit-real=snan -finit-integer=-999 -finit-derived -ffpe-trap=invalid,zero,overflow
-fpre-include=/usr/include/finclude/math-vector-fortran.h

Here is an SC13 presentation I gave on the topic: http://sc13.supercomputing.org/sites/default/files/WorkshopsArchive/pdfs/wp129s1.pdf

4 Likes

Direct solvers typically show minimal differences in results, but iterative solvers can show more significant differences between processors. This happens when the test for convergence results in a different number of iterations, which leads to more significant differences in the results (see the sketch below).
Use of multi-threading (!$OMP) also produces solutions whose differences are significantly greater than round-off.
I use a multi-threaded iterative eigen-solver for structural finite element analysis, where the eigenvector results differ between repeated runs on the same processor!
These differences can highlight the actual accuracy of an iterative solver approach.
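As a rough illustration of the convergence-test point (a toy fixed-point iteration, not the eigen-solver mentioned above):

```fortran
! Toy fixed-point iteration (x = cos(x)), standing in for a generic
! iterative solver.  The exit decision is a comparison of the residual
! against a tolerance; if round-off makes the residual land just above
! or below tol on some iteration, the iteration count changes, and the
! returned answer then differs by more than plain round-off.
program convergence_test
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), parameter :: tol = 1.0e-12_dp
  real(dp) :: x, x_new, residual
  integer :: iter

  x = 1.0_dp
  do iter = 1, 1000
     x_new = cos(x)
     residual = abs(x_new - x)
     x = x_new
     if (residual < tol) exit   ! a last-bit change near tol flips this test
  end do
  print *, 'converged after', iter, 'iterations, x =', x
end program convergence_test
```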

2 Likes

The architecture of both CPUs is the same (x86_64), so it’s not totally surprising. Still, bit-for-bit reproducible results are not guaranteed even in this case.

It would be more surprising between two different architectures.

This is expected as soon as there are reductions involved (either the built-in ones or “manual” ones); see the sketch below.
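For example, a minimal sketch (assuming gfortran with -fopenmp) comparing a plain serial sum with an OpenMP reduction over the same data:

```fortran
! Minimal sketch: serial sum versus OpenMP reduction over the same data.
! The order in which the per-thread partial sums are combined is not
! specified by the standard, so the reduced result can differ from the
! serial one (and can change with the thread count) in the last bits.
program reduction_order
  use omp_lib, only: omp_get_max_threads
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000
  real(dp) :: x(n), s_serial, s_omp
  integer :: i

  do i = 1, n
     x(i) = sin(real(i, dp))          ! arbitrary data of mixed sign
  end do

  s_serial = 0.0_dp
  do i = 1, n
     s_serial = s_serial + x(i)
  end do

  s_omp = 0.0_dp
  !$omp parallel do reduction(+:s_omp) schedule(static,10)
  do i = 1, n
     s_omp = s_omp + x(i)
  end do
  !$omp end parallel do

  print *, 'threads            =', omp_get_max_threads()
  print *, 'serial - reduction =', s_serial - s_omp   ! typically nonzero
end program reduction_order
```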

1 Like

This could be due to the use of random numbers within the algorithm. Many eigensolver algorithms use random numbers to generate local subspace unit vectors, and those could differ from run to run. Mathematically, those unit vector choices do not affect the final solutions, but they can affect the local results at that step. Also, for degenerate eigenvalues, there is often ambiguity in exactly which vectors are computed within that subspace. BTW, these particular random numbers do not need to be high quality, so they are often generated internally with a simple linear congruential generator (LCG) step (a multiply and an add), rather than by calling an external routine.
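For what it’s worth, here is a minimal sketch of such an in-lined LCG step (the constants a, c, m below are the classic 31-bit textbook values, chosen purely for illustration and not taken from any particular eigensolver):

```fortran
! Sketch of an in-lined LCG step: one multiply and one add (plus a mod)
! per number, used here to fill and normalize a "random" starting vector.
! Low quality, but cheap and good enough for seeding subspace vectors.
program lcg_start_vector
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)   ! 64-bit, avoids overflow
  integer, parameter :: dp = kind(1.0d0)
  integer(i8), parameter :: a = 1103515245_i8, c = 12345_i8, m = 2147483648_i8
  integer(i8) :: seed
  real(dp) :: v(5)
  integer :: i

  seed = 123456789_i8
  do i = 1, size(v)
     seed = mod(a*seed + c, m)             ! the LCG step: multiply, add, mod
     v(i) = real(seed, dp) / real(m, dp)   ! map to [0,1)
  end do
  v = v / norm2(v)                         ! normalize to a unit vector
  print *, v
end program lcg_start_vector
```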