Struggling with a strange performance issue

MatthewMacGregor · May 8, 2022, 4:18pm

It’s a rainy Sunday morning here and I enjoy a good perf investigation, so I pulled together a minimal reproducible example of nothing but the get_pointer and get_pointer_dummy subs. The internals of both are identical, and the loops that exercise them are identical. This allowed me to isolate and exercise the assertion that execution times are substantially longer with the four param signature (view) than the three param signature.

My work is here: GitHub - matthew-macgregor/fort-lang-disc-3372: Testing a hunch related to a thread in the Fortran Discourse group.

The example is set up to build and run with fpm to make it especially easy to clone and run if anyone is curious to try it out.

Compiler: gfortran 11.2.0 (mingw, build from source from the nuwen distribution)
OS: Windows 10 Pro

I found a measurable difference at 100M iterations between the get_pointer_dummy and get_pointer subroutines when run in isolation with an unoptimized build, but nothing like what @Arjen noted. Optimized with -O3 I found no difference between the two variants. This seems to track with what @mecej4 posted.

I also explored the generated assembly for any clues. In the unoptimized case, the compiler generated 7 extra instructions for the loop and get_pointer than it did for the loop and get_pointer_dummy. As expected, get_pointer_dummy was ~ 12% faster than get_pointer, despite the fact that the subs themselves are doing the same thing (essentially nothing). I believe the additional instructions account for this difference for the slower case.

Meanwhile, the optimized code (-O3) had identical performance at 100M iterations and get_pointer generated only two additional loc to set up the extra argument to the subroutine call. You can see all of this in the asm/ folder in the repo I posted if you’re curious, and there are instructions in the README.md if you want to try to reproduce it. With enough iterations, I’m sure that the difference would become measurable (those two instructions to pass the extra param will add up), but certainly seems to rule out anything unexpected related to the subroutine call itself.

# Unoptimized
 :: testing get_pointer
Wall clock (s): 0.656000
CPU time (s):   0.656250

 :: testing get_pointer_dummy
Wall clock (s): 0.578000
CPU time (s):   0.578125

# Optimized with -O3
 :: testing get_pointer
Wall clock (s): 0.281000
CPU time (s):   0.281250

 :: testing get_pointer_dummy
Wall clock (s): 0.281000
CPU time (s):   0.281250

Caveats

I’m not a Fortran programmer.
This is my first post on Discourse, so I’m still in the zero trust tier, which limits my ability to link/attach.

Topic		Replies	Views
Why a function returns an array is much slower than a subroutine returns an array? (real MWE included)	14	1073	September 20, 2021
Optimizing vectorized array operations Help	79	1497	May 5, 2025
Testing the performance of `matmul` under default compiler settings Help	37	2778	August 11, 2022
Performance impact of how a large array is accessed Help	49	1907	June 4, 2023
Trying to make a program using OpenMP give same output as non parallel counterpart Help	2	483	May 9, 2022

Struggling with a strange performance issue

Caveats

Related topics