The reason in this MRE is actually obvious…
At each iteration of the outer loop on `i`:
- In the fast version, the inner loop on `j` is executed only `N_samples=1000` times, which means that only 1000 elements of `pack_tot` are updated.
- In the slow version, all the `N_grid_pack=800000` elements of `pack_tot` are updated (and the whole `line_pack` is set to zero, although you really need only 1000 elements).
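
To make the difference in work concrete, here is a minimal pure-Python sketch of the two access patterns. It is not the code from the question: `N_outer`, `idx`, `val`, `fast_version` and `slow_version` are hypothetical names introduced for illustration, and the random indices and values are made-up sample data. Only `N_samples`, `N_grid_pack`, `pack_tot` and `line_pack` come from the discussion above.

```python
import random

N_grid_pack = 800_000  # size of pack_tot and line_pack (from the text)
N_samples = 1_000      # elements actually needed per outer iteration (from the text)
N_outer = 3            # hypothetical number of outer iterations on i

random.seed(0)
# Hypothetical sample data: for each i, N_samples distinct indices and values.
idx = [random.sample(range(N_grid_pack), N_samples) for _ in range(N_outer)]
val = [[random.random() for _ in range(N_samples)] for _ in range(N_outer)]


def fast_version():
    pack_tot = [0.0] * N_grid_pack
    for i in range(N_outer):
        # Inner loop on j runs N_samples times: only those elements change.
        for j in range(N_samples):
            pack_tot[idx[i][j]] += val[i][j]
    return pack_tot


def slow_version():
    pack_tot = [0.0] * N_grid_pack
    line_pack = [0.0] * N_grid_pack
    for i in range(N_outer):
        # The whole line_pack is reset, although only N_samples entries are used.
        for k in range(N_grid_pack):
            line_pack[k] = 0.0
        for j in range(N_samples):
            line_pack[idx[i][j]] = val[i][j]
        # All N_grid_pack elements of pack_tot are updated.
        for k in range(N_grid_pack):
            pack_tot[k] += line_pack[k]
    return pack_tot
```

Under these assumptions both functions return the same `pack_tot`, but per outer iteration the slow one touches 800000 elements of `pack_tot` instead of 1000 (a factor of 800), plus another full pass to zero `line_pack`.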