Hi everyone,
I’m working on optimizing a Fortran code that performs calculations on large double-precision arrays (around 800,000 elements each). Inside a loop, the code computes various weights and accumulates them into a temporary array. After the loop, I scale the temporary array and add it to a permanent array:
pack_tot(:) = pack_tot(:) + (S_pack / tot) * line_pack(:)
and set the temporary array to zero:
line_pack(:) = 0.0d0
These two operations, especially the first one, take up most of the runtime, even though the weight calculations themselves involve far more arithmetic. To reduce the time, I tried aggressive optimization and vectorization. I compile with
gfortran -O3 -ftree-vectorize -funroll-loops -ffast-math -march=native
and verified with -fopt-info-vec that the line actually gets vectorized; it does. Even so, the performance remains very slow. If I skip the accumulation into the permanent array and instead use the permanent array directly in every calculation, the code becomes roughly 25 times faster (from 21 s to 0.8 s). I also tried precomputing the scaling factor S_pack / tot outside the statement, which didn’t help, and I reproduced the same behavior in a different environment.
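For reference, here is a stripped-down sketch that reproduces the access pattern. The inner loop is only a stand-in for the real weight calculation, and the values of S_pack, tot, and n_iter are placeholders chosen to make the timing visible, not the real ones:

program pack_demo
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, parameter :: n = 800000, n_iter = 1000
   real(dp), allocatable :: pack_tot(:), line_pack(:)
   real(dp) :: S_pack, tot, t0, t1
   integer :: it, i

   allocate (pack_tot(n), line_pack(n))
   pack_tot = 0.0_dp
   line_pack = 0.0_dp
   S_pack = 2.0_dp
   tot = 3.0_dp

   call cpu_time(t0)
   do it = 1, n_iter
      do i = 1, n
         ! stand-in for the real weight calculation
         line_pack(i) = line_pack(i) + real(mod(i + it, 7), dp)
      end do
      ! the two statements that dominate the runtime
      pack_tot(:) = pack_tot(:) + (S_pack / tot) * line_pack(:)
      line_pack(:) = 0.0_dp
   end do
   call cpu_time(t1)

   ! printing a checksum keeps the compiler from optimizing the work away
   print *, 'time (s):', t1 - t0, 'checksum:', sum(pack_tot)
end program pack_demo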
Questions:
Do you know what the issue could be?
Could this be related to memory bandwidth limitations or cache inefficiencies due to the large size of the arrays?
I am using an Apple M2 Pro chip (ARM architecture) and GNU Fortran (Homebrew GCC 14.2.0_1) 14.2.0.
Many thanks in advance; any suggestions would be much appreciated!