How can Fortran performance improve by precomputing scalars used in an expression?

@fortran4ever @Fortran

I’ve noticed a significant performance improvement in Fortran by restructuring an expression. Initially, the expression is:

result = (ev(1)**0.5) * u1 + (ev(2)**0.5) * u2 + (ev(3)**0.5) * u3

where ev(1), ev(2), and ev(3) are updated in each loop iteration, and u1, u2, and u3 are matrices that also change with each iteration of the loop.

These expressions are in a function which is called in every iteration.

After separating the square root calculations outside the main expression:

a = ev(1)**0.5
b = ev(2)**0.5
c = ev(3)**0.5

result = a * u1 + b * u2 + c * u3

The improvement in calculation time due to this rearrangement is massive.

To give an idea: my code took 103 seconds to run with the older version and only 75 seconds with the restructured version.
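
To make the pattern concrete, here is a minimal sketch of what the code roughly looks like (the names, sizes, and the random data are only placeholders; the real code is different):

module demo_mod
  implicit none
contains
  ! Hypothetical stand-in for the real routine: names and shapes are assumptions.
  function combine(ev, u1, u2, u3) result(res)
    real, intent(in) :: ev(3)
    real, intent(in) :: u1(:,:), u2(:,:), u3(:,:)
    real :: res(size(u1,1), size(u1,2))
    real :: a, b, c
    ! The scalar square roots are evaluated once, outside the array expression.
    a = sqrt(ev(1))
    b = sqrt(ev(2))
    c = sqrt(ev(3))
    res = a*u1 + b*u2 + c*u3
  end function combine
end module demo_mod

program demo
  use demo_mod
  implicit none
  integer, parameter :: n = 100, nsteps = 1000   ! illustrative sizes only
  real :: ev(3), u1(n,n), u2(n,n), u3(n,n), res(n,n)
  integer :: it
  do it = 1, nsteps
    call random_number(ev)   ! ev and the matrices change every iteration
    call random_number(u1); call random_number(u2); call random_number(u3)
    res = combine(ev, u1, u2, u3)
  end do
  print *, res(1,1)
end program demo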

Can anyone kindly help me understand the possible reason for this? It would also greatly help if you could point me to any literature or documentation on the reason, as I could not find anything when I browsed the internet.

Thank you very much in advance !!

Welcome…

Could you give the compiler/version you are using, and your compilation options? I would expect any decent compiler to precompute these scalars.

Not answering your question, but I would write sqrt(x) instead of x**0.5 in case the compiler does not do this internally. sqrt(x) is probably faster than x**y for general y.
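
For illustration (x is just a made-up non-negative value here):

program sqrt_demo
  implicit none
  real :: x, y1, y2
  x = 2.0
  y1 = x**0.5    ! general real power; may go through a pow-style library call
  y2 = sqrt(x)   ! dedicated intrinsic; typically cheaper for a square root
  print *, y1, y2
end program sqrt_demo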

Hi @PierU, Thanks for the reply !!

I use intel/2019.4

and the compilation options are:
-O2 -safe_cray_ptr -assume byterecl,buffered_io,protect_parens -warn nousage -zero -ftz -fp-model precise -mP2OPT_hpo_dist_factor=21 -diag-disable 10212,10010 -traceback -pad -DLINUX -DIFORT -DNET_SECURITY -DADDR64 -DINTEL -DXEON64 -DFCC80 -DTIMER=cycle_time -D_GLIBCXX_USE_CXX11_ABI=0 -DSSE2 -DOVERRIDE -DSHARELIB=libmppdyna_d_R13.1-138-g8429c8a10f_sse2.so -DUSEMDLU -DMPP -DMPICH -DHPMPI -DAUTODOUBLE -DNEWIO -DMKL18 -DHAVE_BLAS_LAPACK -DHAVE_I8R8_LIBRARY -DMF3_SYM -DRELEASE -DNEW_UNITS -DEXTENDED -DBATTERY -DDUALCESE -DUSING_MDLU -DCOOLPROP -DREFPROP -DUSING_MDLU -DCOOLPROP -DUSING_MDLU -DREFPROP -DUSE_R123 -DLSTCODE -DBIGID -DENABLE_HASH3 -DFFTW -DROTORDYN -DMUMPS -DLSGPART -DMF3_SYM -DMCMS -DUSES_CXX -DTS_CPPSTD_98 -DPFEM -DMUMPSVER=stable -DUSES_CPP -DPTHREADS -i8 -r8 -fimf-arch-consistency=true -qno-opt-dynamic-align -xSSE2 -align array16byte -llapack -lblas -fPIC

I am relatively new to Fortran. Excuse me if this is not what you had requested.

Kindly do let me know; I can provide any other information you need.

Thanks in Advance !!

Hi @Beliavsky, Thank you very much for the reply.

I did use sqrt() instead of **0.5, which improved the timing a bit further.

But my main goal is to know the reason for the improvement due to the mere restructuring, as I had stated before.

Any insights on that would really help me.

Thanks in advance :smile:

OK… It’s hard to tell; seeing that part of the code (with the variable declarations…) may help.

I laid the options out one per line and then took out all the defines for the preprocessor. A few of them stood out to me.

You can find information on exactly what these options do by reading Intel’s documentation, in links like https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2025-0/fprotect-parens-qprotect-parens.html but make sure it’s for the version that you are using.
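
For what it’s worth, protect_parens controls whether the compiler may reassociate floating-point expressions across parentheses. A toy illustration with made-up values:

program parens_demo
  implicit none
  real :: a, b, c, x
  a = 1.0e20; b = -1.0e20; c = 1.0
  ! With parentheses honoured, (a + b) is evaluated first and x becomes 1.0;
  ! if the compiler were free to reassociate to a + (b + c), rounding would give 0.0.
  x = (a + b) + c
  print *, x
end program parens_demo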

1 Like

Probably more code needs to be seen to find the answer, particularly the variable declarations. For example, is ev() an array whose values change, or a function whose result changes each iteration? Is everything in the expression the same type and kind, or is it a mixture of integer, real, and complex where some implicit type and/or kind conversions are required? As others have pointed out, sqrt(x) is probably better than x**0.5, but I think there are some corner cases that are different, depending on the operand types.
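
For instance, if everything is declared with a single kind, both the power and the sqrt stay in that kind and no implicit conversion is needed (a sketch; wp and the values are just placeholders):

program kind_demo
  use, intrinsic :: iso_fortran_env, only: wp => real64
  implicit none
  real(wp) :: ev(3), a, b
  ev = [2.0_wp, 3.0_wp, 5.0_wp]
  a = ev(1)**0.5_wp   ! kind-matched exponent: no default-real literal mixed in
  b = sqrt(ev(1))     ! the intrinsic returns the same kind as its argument
  print *, a, b
end program kind_demo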

Hi @RonShepard, thanks for the reply.

I have changed the names of the variables in the code, as it is critical, and I won’t be able to share more than this. I hope this is sufficient.

If it helps, “x” and “y” are the outputs obtained from a different function.

Thanks in advance !!!

Hi @themos , Thank you very much for the insights.

I will look into it and try to see if I can get something out of it.

Thank you again !!

@PierU

I have attached a snapshot of the variable declarations in the code in the comments below. Kindly have a look at it.

Thanks !!

I’m not sure if it helps, but the order of the nested loops is not efficient.

In Fortran, arrays are stored in column-major order, so the most efficient way of computing, say z1, is by computing z1(1,1), then z1(2,1), then z1(3,1), then z1(1,2), then z1(2,2) and so on. That is, the order of the do-constructs should be reversed.

Since the CPU cache works in terms of contiguous storage, right now you’re making it jump between positions and potentially discard cached data that must be re-fetched later.
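
A sketch of the cache-friendly ordering (the size and the right-hand side are placeholders, not your actual formula):

program loop_order_demo
  implicit none
  integer, parameter :: n = 3          ! illustrative size
  real :: z1(n, n), u1(n, n), a
  integer :: i, j
  call random_number(u1)
  a = 2.0
  ! Column index j in the outer loop, row index i in the inner loop,
  ! so consecutive iterations touch contiguous memory (column-major order).
  do j = 1, n
     do i = 1, n
        z1(i, j) = a * u1(i, j)        ! placeholder for the actual element formula
     end do
  end do
  print *, z1(1, 1)
end program loop_order_demo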

z1, z2, and z3 (and consequently, Z) are symmetric matrices, so one only needs to calculate the upper triangular part, and then copy those Z elements to the elements below the diagonal (assuming that subsequent calculations are not aware that Z is symmetric).
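
A sketch of that idea (again with a placeholder element formula):

program sym_fill_demo
  implicit none
  integer, parameter :: n = 3
  real :: z(n, n)
  integer :: i, j
  ! Compute only the upper triangle (i <= j) ...
  do j = 1, n
     do i = 1, j
        z(i, j) = real(i + j)          ! placeholder for the actual symmetric formula
     end do
  end do
  ! ... then mirror it below the diagonal.
  do j = 1, n - 1
     do i = j + 1, n
        z(i, j) = z(j, i)
     end do
  end do
  print *, z
end program sym_fill_demo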