Struggling with a strange performance issue

I have encountered a problem with the performance of a program that I do not understand. (Apologies for the length mail). The thing is this:

  • I have a bunch of arrays that act as a single array (they are catenated - see an older thread about the Mathematics of Arrays). Via a pointer function I get the pointer to an element of one of the arrays. So far so good.
  • The performance of this implementation is quite bad in comparison to a direct accessing of elements of a single array. That is understandable, the program has to do more work than in the case of a plain array.
  • But analysing the cause of the bad performance has shown that this code::
FUNCTION get_elem_ndim(view, i) result(elem)
    CLASS(moa_view_type), INTENT(INOUT) :: view
    INTEGER, DIMENSION(:), INTENT(IN)   :: i
    INTEGER, POINTER                    :: elem

    logical                             :: found

    CALL get_pointer( view, i, elem, found )

END FUNCTION

takes up to four times as much computation time as:

    ...
    CALL get_pointer( i, elem, found )
    ...

whereas in both cases the routine get_pointer does not do more than:

    found = .TRUE.
    elem => dummy
    return

To give a better idea:

  • Skipping the call to get_pointer all together: 10-11 seconds
  • Using the version without the class(moa_view_type) argument: 19-21 seconds
  • Using the version with the “view” argument: 44-46 seconds

I have uploaded the complete source code for convenience. The input file I use is:

# Example of using the moa_measure program
#
# First define the view, then sample the result
#
report-file view1chk.out

use-view                   # just to be sure
allocate-view 10 10000     # total size of the view
number-repetitions 100000  # repeat any procedure a thousand times

sequential-get 1 10000     # simple sequential access
random-get 10000           # simple random access

sequential-get 10 10000    # simple sequential access with a step of 10
sequential-get 100 10000    # simple sequential access with a step of 10
sequential-get 179  10000    # simple sequential access with a step of 10
sequential-get 357  10000    # simple sequential access with a step of 10
sequential-get 1000 10000    # simple sequential access with a step of 10

The timings I cited were achieved with slight variations of the code in “moa_view_ndim_flat_v6.f90”.

I think I know how I can improve the performance, but like I said, I simply don’t understand why passing the “view” argument should cost so much time, as it is passed through two functions before it arrives at "get_pointer already.

cmdparse.f90 (9.7 KB)
moa_measure.f90 (8.7 KB)
moa_view_ndim_flat_v6.f90 (10.5 KB)
view_general_v6.f90 (3.2 KB)

1 Like

I forgot to add that most of the experiments were done on Linux using the gfortran compiler version 10.1.0. However, I also used Intel Fortran oneAPI on Windows with the Inspector tool, but that was before I arrived at the current problem.

1 Like

Sorry I may not help directly. But I can say such strange performance issue occasionally also occur to me. Perhaps for some reason the compiler did not put some operations or data is in the cache anymore so it can be several times slower.
If Intel Fortran also have such strange performance issue, may also post this issue on Intel’s Forum, their guru may give you some solutions.

@Arjen:

I think that you expect a bit too much from readers here. In order to check the behavior that you reported, one has to modify unfamiliar code, and one is not sure if the modifications follow your directions. I tried, and found that in all three cases (assuming that I followed your directions correctly) the runs took about 19 seconds.

I suggest that you modify your code, with a module variable, say, VERSION, and use

use global, only : VERSION
SELECT CASE(VERSION)
case(1)...
case(2)...
case(3)...
END SELECT

to choose the alternative code paths in your code. This way, one can run the program with VERSION=1, 2, 3, without deviating from your expectations as to what the code should do.

1 Like

@mecej4 Yes, that is fair enough :slight_smile: . In the meantime I did try some radical alternatives and nothing helped. Preparing a version like you suggest will take me some time though. Which compiler did you use by the way?

1 Like

I had used Gfortran 12.0 from Equation-com on Windows, with -O2

Using Cygwin Gfortran 11.2 instead, I noted 22-23 seconds.

Using Intel 2022 with /O3, I find that the three versions take 33-35 seconds.

1 Like

I have been using gfortran 10.1.0 on Linux and 10.2.0 on Windows. The results on Windows are, well, erratic to say the least. Huge differences in the performance from one run to the next.

Meanwhile, I have had a surprise with Intel Fortran oneAPI wrt plain arrays: the random access test usually takes longer than the sequential tests, because of cache misses and that sort of things. But in this case the random access test took ten times less time! When I used the “view” version, it was indeed slower.

Well, it is clear to me that there is a lot more to this than I had expected.

1 Like

Well, either I made a mistake or something different was happening. I prepared three versions of the main module to illustrate the problem, but now I get consistently the following results:

  • Not calling the routine get_pointer at all: 10 seconds.
  • Calling the routine get_pointer with or without the type(moa_view_type) argument that seemed to cause the increase: both 20 seconds. I tried several times, the same result.

I will just put it down to the phase of the tide.

1 Like

I am surprised that calling a routine that does practically nothing should cause the run time to double. Have you considered generating a profile of a run? Would you really see as much as 50 percent of the run time reported as consumed by get_pointer()?

1 Like

Well, I have used Intel Inspector on WIndows - @jbdv pointed out, early in this project, that one specific statement was using a lot of CPU. I will see about profiling the current code.

1 Like

There are lots of situations where calling a routine that does “practically nothing” must expend a lot of effort. An intent(out) argument, for example, might require a long sequence of recursive deallocations. An assumed size array argument might require copy-in/copy-out semantics on the actual argument, even if the routine doesn’t do anything to the argument.

2 Likes

@RonShepard, good points, but in the code under discussion, there are no assumed size arguments, and the only intent(out) arguments are scalar variables of intrinsic types (logical, integer, real). Furthermore, using his code I have not been able to reproduce @Arjen’s findings that (i) calling get_pointer() cause the run time to double, and (ii) his random access test took ten times less time.

I prepared an abbreviated version of his code, with more compact output, and reduced the number of repetitions by a factor of 10 (file inp1.txt.f90, please rename to inp1.txt). Here are the results (Windows, Gfortran11.2 Cygwin -O2).

A. With the statement

    elem => get_elem_ndim( view, inew )

in function get_elem_single commented out.

Compiler version:     GCC version 11.2.0  Compiler options:           -mtune=generic -march=x86-64 -O2
Note: using view (catenation)
Allocate view: n_chunks: 10   chunk size: 10000    0.000 (s)      0.000 (s)
Number of repetitions: 10000

     ACTION        Step size  N steps   Wall_clk          CPU
Sequential get             1   10000    0.273 (s)      0.265 (s)
Random     get             1   10000    0.266 (s)      0.266 (s)
Sequential get            10   10000    0.236 (s)      0.250 (s)
Sequential get           100   10000    0.237 (s)      0.234 (s)
Sequential get           179   10000    0.237 (s)      0.235 (s)
Sequential get           357   10000    0.236 (s)      0.234 (s)
Sequential get          1000   10000    0.237 (s)      0.234 (s)
Done

B. With the same statement reinstated:

Compiler version:     GCC version 11.2.0  Compiler options:           -mtune=generic -march=x86-64 -O2
Note: using view (catenation)
Allocate view: n_chunks: 10   chunk size: 10000    0.000 (s)      0.000 (s)
Number of repetitions: 10000

     ACTION        Step size  N steps   Wall_clk          CPU
Sequential get             1   10000    0.250 (s)      0.250 (s)
Random     get             1   10000    0.259 (s)      0.265 (s)
Sequential get            10   10000    0.236 (s)      0.235 (s)
Sequential get           100   10000    0.237 (s)      0.234 (s)
Sequential get           179   10000    0.236 (s)      0.235 (s)
Sequential get           357   10000    0.237 (s)      0.250 (s)
Sequential get          1000   10000    0.263 (s)      0.250 (s)
Done

cmdparse.f90 (9.7 KB)
flat1.f90 (8.3 KB)
inp1.txt.f90 (775 Bytes)
measure.f90 (8.4 KB)
view_general_v6.f90 (3.2 KB)

P.S. Quirk of upload facility: the file that I selected to upload is named vg.f90 on my PC. After uploading, its name got changed to view_general_v6.f90, which is the name of the file that @Arjen provided!

1 Like

It’s a rainy Sunday morning here and I enjoy a good perf investigation, so I pulled together a minimal reproducible example of nothing but the get_pointer and get_pointer_dummy subs. The internals of both are identical, and the loops that exercise them are identical. This allowed me to isolate and exercise the assertion that execution times are substantially longer with the four param signature (view) than the three param signature.

My work is here: GitHub - matthew-macgregor/fort-lang-disc-3372: Testing a hunch related to a thread in the Fortran Discourse group.

The example is set up to build and run with fpm to make it especially easy to clone and run if anyone is curious to try it out.

  • Compiler: gfortran 11.2.0 (mingw, build from source from the nuwen distribution)
  • OS: Windows 10 Pro

I found a measurable difference at 100M iterations between the get_pointer_dummy and get_pointer subroutines when run in isolation with an unoptimized build, but nothing like what @Arjen noted. Optimized with -O3 I found no difference between the two variants. This seems to track with what @mecej4 posted.

I also explored the generated assembly for any clues. In the unoptimized case, the compiler generated 7 extra instructions for the loop and get_pointer than it did for the loop and get_pointer_dummy. As expected, get_pointer_dummy was ~ 12% faster than get_pointer, despite the fact that the subs themselves are doing the same thing (essentially nothing). I believe the additional instructions account for this difference for the slower case.

Meanwhile, the optimized code (-O3) had identical performance at 100M iterations and get_pointer generated only two additional loc to set up the extra argument to the subroutine call. You can see all of this in the asm/ folder in the repo I posted if you’re curious, and there are instructions in the README.md if you want to try to reproduce it. With enough iterations, I’m sure that the difference would become measurable (those two instructions to pass the extra param will add up), but certainly seems to rule out anything unexpected related to the subroutine call itself.

# Unoptimized
 :: testing get_pointer
Wall clock (s): 0.656000
CPU time (s):   0.656250

 :: testing get_pointer_dummy
Wall clock (s): 0.578000
CPU time (s):   0.578125
# Optimized with -O3
 :: testing get_pointer
Wall clock (s): 0.281000
CPU time (s):   0.281250

 :: testing get_pointer_dummy
Wall clock (s): 0.281000
CPU time (s):   0.281250

Caveats

  • I’m not a Fortran programmer.
  • This is my first post on Discourse, so I’m still in the zero trust tier, which limits my ability to link/attach.
5 Likes

@MatthewMacGregor ,

Welcome to this Discourse! That’s very impressive investigation indeed, all the more remarkable since you don’t consider yourself a Fortran programmer. Hope you will find it interesting and useful to continue to look at Fortran and contribute on this forum, Fortran and its practitioners surely will benefit tremendously from perspective such as yours.

2 Likes

Thanks everyone, here are the results I mentioned last week (limited to two test cases):

A: no call to get_pointer      B: get_pointer points to dummy    C: get_pointer has no argument
                                                                    "view"

Sequential get:                Sequential get:                   Sequential get:
    Step size:       1             Step size:       1                Step size:       1
    Number of steps: 10000         Number of steps: 10000            Number of steps: 10000
Wall clock (s):  10.6770       Wall clock (s):  20.0920          Wall clock (s):  19.2840
CPU time (s):    10.6753       CPU time (s):    20.0913          CPU time (s):    19.2785

Random get:                    Random get:                       Random get:
    Number of steps: 10000         Number of steps: 10000            Number of steps: 10000
Wall clock (s):  12.6090       Wall clock (s):  24.1400          Wall clock (s):  23.5040
CPU time (s):    12.6087       CPU time (s):    24.1384          CPU time (s):    23.5031

I have attached the code for reference.
cmdparse.f90 (9.7 KB)
moa_measure.f90 (8.7 KB)
moa_view_ndim_flat_a.f90 (7.7 KB)
moa_view_ndim_flat_b.f90 (7.9 KB)
moa_view_ndim_flat_c.f90 (7.9 KB)

To make sure I didn’t make mistakes building the versions and saving the output, I used a small shell script:

gfortran -c cmdparse.f90 -march=native -mtune=native
gfortran -c view_general_v6.f90 -march=native -mtune=native
gfortran -c moa_view_ndim_flat_a.f90 -march=native -mtune=native
gfortran -c moa_view_ndim_flat_b.f90 -march=native -mtune=native
gfortran -c moa_view_ndim_flat_c.f90 -march=native -mtune=native
gfortran -c moa_view_ndim_flat_v7.f90 -march=native -mtune=native

gfortran -o test_moa_measure_a moa_measure.f90 cmdparse.o view_general_v6.o moa_view_ndim_flat_a.o -march=native -mtune=native
gfortran -o test_moa_measure_b moa_measure.f90 cmdparse.o view_general_v6.o moa_view_ndim_flat_b.o -march=native -mtune=native
gfortran -o test_moa_measure_c moa_measure.f90 cmdparse.o view_general_v6.o moa_view_ndim_flat_c.o -march=native -mtune=native
gfortran -o test_moa_measure_v7 moa_measure.f90 cmdparse.o view_general_v6.o moa_view_ndim_flat_v7.o -march=native -mtune=native

./test_moa_measure_a view1chk.inp
mv view1chk.out view1a.out

./test_moa_measure_b view1chk.inp
mv view1chk.out view1b.out

./test_moa_measure_c view1chk.inp
mv view1chk.out view1c.out

./test_moa_measure_v7 view1chk.inp
mv view1chk.out view1v7.out 

(Version v7 is a variation that is not important here.)

I have not yet tried to profile them.

I used your latest files and the latest script. I removed the lines from the script with “v7” in them, since you did not provide the source file for v7. I added -O2 to the compilation commands. On Windows 11, using Cygwin and Gfortran 11.2, the results (run times):

A: 17.470 s
B: 17.596 s
C: 17.603 s

I repeated the run another time, just to note what variations to expect.

A: 17.512 s
B: 17.521 s
C: 17.779 s

The differences between A, B, and C appear to be about as much as the variations between repeated runs. As I noted before, the effects on the run time of (i) not calling, (ii) calling with a pointer to a dummy, and (iii) calling with fewer arguments are insignificant.

Using Ifort 2021.5.0 /O2 /Qxhost, I obtained these timings:

A: 30.231 s
B: 30.487 s
C: 30.493 s

What is unexpected is that some aspect of your code causes the Ifort-compiled program to run at 2/3 the the speed of the Gfortran-compiled program!

Yes, they are insignificant. Then why do I see such large differences in my experiments? Is it merely a different compiler version or different flags?

To get more info, how about comparing the output of -fdump-tree-optimized option in gfortran? Something like

gfortran-10 -fdump-tree-optimized test.f90

vs

gfortran-10 -fdump-tree-optimized -O2 test.f90

(both of which create a report like test.f90.234t.optimized that contains an intermediate code created by the compiler). I’ve just tried a few simple codes, which suggests that with -O2, the creation of a temporary descriptor (= address of original data + “vtable”) is avoided and also inlining occurs (for the case of my simple code).
Similar things may happen when -O2 is attached in your codes.

2 Likes

Sorry for a naive question.
If you run the code on Linux, do you observe the issue you mentioned?
gfortran on Windows sometimes does not provide the same performance/behavior as on Linux.

I have run all these experiments on Linux - the experiments on Windows give very erratic results which are basically useless (even if the machine is apparently only doing these calculations)

1 Like