I’ve noticed a significant performance improvement in Fortran by restructuring an expression. Initially, the expression is:
result = (ev(1)**0.5) * u1 + (ev(2)**0.5) * u2 + (ev(3)**0.5) * u3
Where ev(1), ev(2), and ev(3) are updated in each loop iteration, and u1, u2, and u3 are matrices which also change with each iteration of the loop.
These expressions are in a function which is called in every iteration.
After separating the square root calculations outside the main expression:
a = ev(1)**0.5
b = ev(2)**0.5
c = ev(3)**0.5
result = a * u1 + b * u2 + c * u3
The improvement in calculation time from this rearrangement is substantial: the code that took 103 seconds to run with the original version takes only 75 seconds with the restructured version.
Can anyone kindly help me understand what the possible reason for this could be? It would also greatly help if you could point me to any literature or documentation on the reason, as I could not find anything when I browsed the internet.
Not answering your question, but I would write sqrt(x) instead of x**0.5 in case the compiler does not do this internally. sqrt(x) is probably faster than x**y for general y.
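To make the suggestion concrete, here is a minimal sketch of the two spellings (the variable names are illustrative, not from the original code). The point is that x**0.5 may be compiled as a general power-function call unless the compiler recognizes the special exponent, while sqrt(x) maps directly to the dedicated intrinsic:

```fortran
program pow_vs_sqrt
  implicit none
  real :: x, a, b
  x = 2.0
  a = x**0.5   ! may be lowered to a general pow(x, 0.5) call
  b = sqrt(x)  ! dedicated intrinsic, typically a single instruction
  print *, a, b
end program pow_vs_sqrt
```

Whether the two compile to the same code depends on the compiler and optimization level, so it is worth checking the generated assembly or simply timing both forms.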
Probably more code needs to be seen to find the answer, particularly the variable declarations. For example, is ev() an array whose values change, or a function whose result changes each iteration? Is everything in the expression the same type and kind, or is it a mixture of integer, real, and complex where some implicit type and/or kind conversions are required? As others have pointed out, sqrt(x) is probably better than x**0.5, but I think there are some corner cases that are different, depending on the operand types.
I’m not sure if it helps, but the order of the nested loops is not efficient.
In Fortran, arrays are stored in column-major order, so the most efficient way of computing, say z1, is by computing z1(1,1), then z1(2,1), then z1(3,1), then z1(1,2), then z1(2,2) and so on. That is, the order of the do-constructs should be reversed.
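As a sketch of the loop-order point (assuming z1 is a 2-D real array; the names and bounds here are illustrative), the inner loop should run over the first index so that consecutive iterations touch contiguous memory:

```fortran
program loop_order
  implicit none
  integer, parameter :: n = 3
  real :: z1(n, n)
  integer :: i, j
  ! Column-major friendly order: j (second index) outermost,
  ! i (first index) innermost, so z1(1,1), z1(2,1), z1(3,1), ...
  ! are written in the order they sit in memory.
  do j = 1, n
     do i = 1, n
        z1(i, j) = real(i + j)  ! placeholder for the real computation
     end do
  end do
  print *, z1
end program loop_order
```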
Since the CPU cache tends to work in terms of contiguous storage, right now you’re making it jump positions and potentially discard storage that must be re-cached later.
z1, z2, and z3 (and consequently, Z) are symmetric matrices, so one only needs to calculate the upper triangular part, and then copy those Z elements to the elements below the diagonal (assuming that subsequent calculations are not aware that Z is symmetric).
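A minimal sketch of that symmetric-fill pattern (names and the placeholder computation are illustrative): compute only the elements with i <= j, and mirror each one across the diagonal as you go, roughly halving the arithmetic:

```fortran
program symmetric_fill
  implicit none
  integer, parameter :: n = 3
  real :: z(n, n)
  integer :: i, j
  do j = 1, n
     do i = 1, j               ! upper triangle only (i <= j)
        z(i, j) = real(i * j)  ! placeholder for the real computation
        z(j, i) = z(i, j)      ! copy to the lower triangle
     end do
  end do
  print *, z
end program symmetric_fill
```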