Optimizing vectorized array operations

Thanks for the clarification. Reinhold Bader covered this issue in his Fortran programming course:

IMO, one can avoid the pitfalls with some discipline, e.g. having an outer scope which does allocation, and having an inner scope where the actual operations happen and aliasing guarantees apply.
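A minimal sketch of that discipline, under some assumptions: the module and procedure names (`scaled_add_demo`, `scaled_add`, `run_scaled_add`) are made up for illustration, and the kernel is a simple scaled addition rather than anything from the course. The outer routine owns the allocatable arrays; the inner routine only sees plain dummy arguments, which the compiler is allowed to treat as non-overlapping.

```fortran
module scaled_add_demo
  implicit none
  private
  public :: scaled_add, run_scaled_add

contains

  ! Inner scope (hypothetical kernel): x and y are plain dummy arguments
  ! without the TARGET or POINTER attribute. Because y is modified, the
  ! caller must not pass overlapping arrays, so the compiler may assume
  ! there is no aliasing when it vectorizes the loop.
  pure subroutine scaled_add(n, a, x, y)
    integer, intent(in)    :: n
    real,    intent(in)    :: a
    real,    intent(in)    :: x(n)
    real,    intent(inout) :: y(n)
    integer :: i

    do i = 1, n
      y(i) = y(i) + a * x(i)
    end do
  end subroutine scaled_add

  ! Outer scope (hypothetical driver): does the allocation and passes two
  ! distinct arrays to the kernel, which keeps the no-aliasing promise.
  subroutine run_scaled_add(n, a, checksum)
    integer, intent(in)  :: n
    real,    intent(in)  :: a
    real,    intent(out) :: checksum
    real, allocatable :: x(:), y(:)

    allocate(x(n), y(n))
    x = 1.0
    y = 2.0

    call scaled_add(n, a, x, y)

    checksum = sum(y)
    ! x and y are deallocated automatically when the routine returns.
  end subroutine run_scaled_add

end module scaled_add_demo
```

The point of the split is that all the bookkeeping (allocation, and making sure the actual arguments are distinct) stays in one place, while the kernel stays free of pointers and targets. Something like `call scaled_add(n, a, y, y)` would break the rule, and the compiler is not required to catch it.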

There are so many performance pitfalls for newcomers that I really doubt this is the main one: broken data structures, wrong access patterns, scalar functions that completely block vectorization; the list goes on and on. I’ve seen researchers “brag” about their excellent multi-threaded scaling, and it turned out they weren’t compiling the code with any optimization flags at all (I think there is a quote from a famous parallel programmer, “the easiest way to make code scale well is to make it sequentially slow”). The big waste IMO in scientific codes is when you have a Python script that, due to misconfiguration, uses a sequential BLAS library via NumPy and runs on 1 out of 28 cores of a server processor. But Fortran, C, and C++ programmers aren’t immune to these types of issues either.