Speed of array intrinsics

Yes, my reply about the intrinsic was in the context of comparing the intrinsic function to the compiled Fortran. In the previous posts, the compiled Fortran was reported to be much faster than the intrinsic, presumably because of the overhead of processing the intrinsic's optional arguments, an overhead that does not exist for the Fortran code.

Also, it surprised me to see that the intrinsic timings depended on the optimization level. Any idea why that might occur? Is that also due to compile time optimizations related to the optional arguments?

Interesting! From your results it seems that the intrinsic and the notemp variant are the fastest. I wonder why the temporary variable slows down the code so much, though.

Making the BIG assumption that the instructions written in the code are basically the same as what ends up in the machine code, the cost of copying all the values sequentially to the temp variable could exceed the cost of accessing the location where the maximum value is stored.

To really answer that, particularly if you are not turning off all optimizations, you would need to look at the machine code.

But you asked for variants that might be more efficient, and that was a possible improvement worth trying at the coarse, high-level empirical level this method is using.
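
For readers who have not seen the earlier posts, here is a minimal sketch of what the "temp" and "notemp" variants might look like. This is an assumption, not the code that was actually timed: it guesses that the intrinsic under discussion is maxloc on a rank-1 real array, and the names maxloc_temp, maxloc_notemp, and check_variants are made up for illustration.

```fortran
module maxloc_variants
   implicit none
   private
   public :: maxloc_temp, maxloc_notemp
contains

   ! "temp" variant (assumed form): keep the running maximum in a scalar.
   pure function maxloc_temp(a) result(imax)
      real, intent(in) :: a(:)
      integer :: imax
      integer :: i
      real :: vmax
      imax = 0                      ! same result as maxloc for a zero-size array
      if (size(a) == 0) return
      imax = 1
      vmax = a(1)
      do i = 2, size(a)
         if (a(i) > vmax) then      ! strict ">" keeps the first occurrence
            vmax = a(i)             ! copy the new maximum into the temporary
            imax = i
         end if
      end do
   end function maxloc_temp

   ! "notemp" variant (assumed form): re-read the current maximum from the array.
   pure function maxloc_notemp(a) result(imax)
      real, intent(in) :: a(:)
      integer :: imax
      integer :: i
      imax = 0
      if (size(a) == 0) return
      imax = 1
      do i = 2, size(a)
         if (a(i) > a(imax)) imax = i
      end do
   end function maxloc_notemp

end module maxloc_variants

program check_variants
   use maxloc_variants, only: maxloc_temp, maxloc_notemp
   implicit none
   real :: a(1000)
   call random_number(a)
   ! Both hand-written loops should agree with the intrinsic.
   if (maxloc_temp(a) /= maxloc(a, dim=1)) error stop "temp variant disagrees"
   if (maxloc_notemp(a) /= maxloc(a, dim=1)) error stop "notemp variant disagrees"
   print *, "all variants agree"
end program check_variants
```

Whether the two loops compile to different machine code at all depends on the compiler and flags, so a sketch like this is only a starting point for timing and for inspecting the generated assembly.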

The answer very much depends on the compiler, the compiler switches, and the hardware.
If you have not done so already, and wallclock time is an issue, running realistic problems under a profiling tool (e.g. gprof(1)) is a great place to start. Once you identify where your code is spending the bulk of its time, getting down in the weeds, trying many variants, and looking at the machine code is worth it. Otherwise you can easily get caught up spending too much time tuning code for little return, which sort of defeats a lot of the reasons for using high-level languages, I think.

Amusingly, running on a small Linux platform and using -O3 on three compilers, I got the intrinsic fastest in one, the intrinsic slowest in one, temp fastest, and notemp fastest. The best time was with ifx, where temp and notemp generated the same machine code and were fastest and the intrinsic was slowest. loc always came in either last or second-to-last.

The difference between slowest and fastest was nearly a factor of four. The difference between the fastest and slowest intrinsic was nearly 2x. I found that somewhat surprising.
