Speed of array intrinsics

RonShepard · April 13, 2025, 5:28pm

Yes, my reply about the intrinsic was in the context of comparing the intrinsic function to the compiled Fortran. In the previous posts, the compiled Fortran was reported to be much faster than the intrinsic, presumably because of the overhead of processing the intrinsic's optional arguments, which do not exist in the Fortran code.

Also, it surprised me to see that the intrinsic timings depended on the optimization level. Any idea why that might occur? Is that also due to compile time optimizations related to the optional arguments?

aledinola · April 13, 2025, 11:33pm

Interesting! From your results it seems that the intrinsic and the notemp are the fastest. I wonder why the temporary variable slows down the code so much, though.

urbanjost · April 14, 2025, 1:24am

Making a BIG assumption that the instructions written in the code are basically the same as what ends up in the machine code, the cost of copying all the values sequentially to the temp variable could exceed the cost of accessing the location where the maximum value is stored.

To really answer that, particularly if you are not turning off all optimizations, you would need to look at the machine code.

But you asked for variants that might be more efficient, and that was a possible improvement worth trying at the coarse, high-level, empirical level this method uses.
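To make the temp-versus-notemp trade-off concrete, here is a minimal sketch of what such variants might look like. The routine names (`loc_temp`, `loc_notemp`), the loop bodies, and the array size are assumptions for illustration only; they are not the code that was actually timed in this thread, and the timings it prints will vary with compiler, switches, and hardware.

```fortran
program temp_vs_notemp
   ! Hypothetical sketch: names, loop bodies, and array size are assumptions,
   ! not the code that was benchmarked in this thread.
   use, intrinsic :: iso_fortran_env, only: dp => real64, int64
   implicit none
   integer, parameter :: n = 5000000
   real(dp), allocatable :: a(:)
   integer :: i_temp, i_notemp, i_intr
   integer(int64) :: t0, t1, rate

   allocate(a(n))
   call random_number(a)

   call system_clock(t0, rate)
   i_temp = loc_temp(a)
   call system_clock(t1)
   print '(a,f10.6)', 'temp      (s): ', real(t1 - t0, dp) / real(rate, dp)

   call system_clock(t0)
   i_notemp = loc_notemp(a)
   call system_clock(t1)
   print '(a,f10.6)', 'notemp    (s): ', real(t1 - t0, dp) / real(rate, dp)

   call system_clock(t0)
   i_intr = maxloc(a, dim=1)
   call system_clock(t1)
   print '(a,f10.6)', 'intrinsic (s): ', real(t1 - t0, dp) / real(rate, dp)

   ! All three should agree on the index of the (first) maximum.
   if (i_temp /= i_intr .or. i_notemp /= i_intr) error stop 'variants disagree'

contains

   pure function loc_temp(x) result(imax)
      ! Copies each element to a scalar temporary before comparing.
      real(dp), intent(in) :: x(:)
      integer :: imax, i
      real(dp) :: tmp, xmax
      imax = 1
      xmax = x(1)
      do i = 2, size(x)
         tmp = x(i)
         if (tmp > xmax) then
            xmax = tmp
            imax = i
         end if
      end do
   end function loc_temp

   pure function loc_notemp(x) result(imax)
      ! Re-reads the current maximum from the array through its index.
      real(dp), intent(in) :: x(:)
      integer :: imax, i
      imax = 1
      do i = 2, size(x)
         if (x(i) > x(imax)) imax = i
      end do
   end function loc_notemp

end program temp_vs_notemp
```

With optimization enabled, a compiler may well reduce both loops to the same instructions, so a single timing like this only hints at what is happening; checking the generated assembly (for example with the `-S` option of gfortran) is the way to confirm it.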

The answer very much depends on the compiler, the compiler switches, and the hardware.
If you have not done so already, and wallclock time is an issue, running realistic problems under a profiling tool (e.g., gprof(1)) is a great place to start. Once you identify where your code is spending the bulk of its time, getting down in the weeds, trying many variants, and looking at the machine code is worth it. Otherwise you can easily get caught up in spending too much time tuning code for little return, which sort of defeats a lot of the reasons for using high-level languages, I think.

Amusingly, running on a small Linux platform with -O3 on three compilers, I saw the intrinsic come out fastest with one compiler and slowest with another, and temp fastest in one case and notemp fastest in another. The best time was with ifx, where temp and notemp generated the same machine code and were fastest, and the intrinsic was slowest. loc always came in either last or second-to-last.

The difference between slowest and fastest was nearly a factor of four. The difference between the fastest and slowest intrinsic was nearly 2x. I found that somewhat surprising.
