I’ve noticed a significant performance improvement in Fortran by restructuring an expression. Initially, the expression is:
result = (ev(1)**0.5) * u1 + (ev(2)**0.5) * u2 + (ev(3)**0.5) * u3
Where ev(1), ev(2), and ev(3) are updated in each loop iteration, and u1, u2, and u3 are matrices which also change with each iteration of the loop.
These expressions are in a function which is called in every iteration.
After separating the square root calculations outside the main expression:
a = ev(1)**0.5
b = ev(2)**0.5
c = ev(3)**0.5
result = a * u1 + b * u2 + c * u3
The improvement in calculation time from this rearrangement is substantial: the code that took 103 seconds to run with the original version takes only 75 seconds with the restructured version.
Can anyone kindly help me understand what the possible reason for this could be? It would also greatly help if you could point me to any literature or documentation on the reason, as I could not find anything when I browsed the internet.
Not answering your question, but I would write sqrt(x) instead of x**0.5 in case the compiler does not do this internally. sqrt(x) is probably faster than x**y for general y.
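To make the suggestion concrete, here is a minimal sketch of the two spellings (the variable names are illustrative, not from the original code). The point is that x**0.5 may be compiled as a general power-function call unless the compiler recognizes the special exponent, while sqrt(x) maps directly to the dedicated intrinsic:

```fortran
program pow_vs_sqrt
  implicit none
  real :: x, a, b
  x = 2.0
  a = x**0.5   ! may be lowered to a general pow(x, 0.5) call
  b = sqrt(x)  ! dedicated intrinsic, typically a single instruction
  print *, a, b
end program pow_vs_sqrt
```

Whether the two compile to the same code depends on the compiler and optimization level, so it is worth checking the generated assembly or simply timing both forms.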
Probably more code needs to be seen to find the answer, particularly the variable declarations. For example, is ev() an array whose values change, or a function whose result changes each iteration? Is everything in the expression the same type and kind, or is it a mixture of integer, real, and complex where some implicit type and/or kind conversions are required? As others have pointed out, sqrt(x) is probably better than x**0.5, but I think there are some corner cases that are different, depending on the operand types.
I’m not sure if it helps, but the order of the nested loops is not efficient.
In Fortran, arrays are stored in column-major order, so the most efficient way of computing, say z1, is by computing z1(1,1), then z1(2,1), then z1(3,1), then z1(1,2), then z1(2,2) and so on. That is, the order of the do-constructs should be reversed.
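As a sketch of the loop-order point (assuming z1 is a 2-D real array; the names and bounds here are illustrative), the inner loop should run over the first index so that consecutive iterations touch contiguous memory:

```fortran
program loop_order
  implicit none
  integer, parameter :: n = 3
  real :: z1(n, n)
  integer :: i, j
  ! Column-major friendly order: j (second index) outermost,
  ! i (first index) innermost, so z1(1,1), z1(2,1), z1(3,1), ...
  ! are written in the order they sit in memory.
  do j = 1, n
     do i = 1, n
        z1(i, j) = real(i + j)  ! placeholder for the real computation
     end do
  end do
  print *, z1
end program loop_order
```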
Since the CPU cache tends to work in terms of contiguous storage, right now you’re making it jump positions and potentially discard storage that must be re-cached later.
z1, z2, and z3 (and consequently, Z) are symmetric matrices, so one only needs to calculate the upper triangular part, and then copy those Z elements to the elements below the diagonal (assuming that subsequent calculations are not aware that Z is symmetric).
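A minimal sketch of that symmetric-fill pattern (names and the placeholder computation are illustrative): compute only the elements with i <= j, and mirror each one across the diagonal as you go, roughly halving the arithmetic:

```fortran
program symmetric_fill
  implicit none
  integer, parameter :: n = 3
  real :: z(n, n)
  integer :: i, j
  do j = 1, n
     do i = 1, j               ! upper triangle only (i <= j)
        z(i, j) = real(i * j)  ! placeholder for the real computation
        z(j, i) = z(i, j)      ! copy to the lower triangle
     end do
  end do
  print *, z
end program symmetric_fill
```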