Performance impact of how a large array is accessed

I have a code containing a large allocatable 3D array defined in a module,
real(wp), allocatable, dimension( :, : ,: ) :: hist

The array is allocated immediately in main,
allocate(hist(6,0 : nhistpoints,npart))

The array contains the history of every particle at every timestep of a simulation.

There is one subroutine that takes nearly all the execution time in my code, and it needs this array. Originally I accessed it via a USE statement:

subroutine blah(…)
use probcons, only : hist

But the routine needs just a 2D subarray, so I decided to see what would happen if I passed that as a subroutine argument,

call blah ( … , hist(1:6,0:nhistpoint,i) …)

and the subroutine defines a declaration for an assumed-shape 2D array,
subroutine blah(… , histarg,…)
real(wp), dimension(:,0:) :: histarg

I’ve run the code both ways (accessing the 3D array via USE, accessing by passing a 2D subarray). The version with the 2D subarray is typically 4% slower.

I’m not too bothered by 4% performance loss, but I would like to understand it. Can someone shed light on why a code would run slower when accessing a subarray through an argument list vs accessing the whole array via a USE statement?

The assumed shape dummy has to construct a new “descriptor” to keep track of where in memory its elements are (as opposed to where where the original array was) and what its bounds are. You’re passing a contiguous portion so that’s all it has to do, but since you don’t declare the dummy as contiguous, the subroutine can’t be sure and has to do stride calculations to determine where in memory each element is. Adding contiguous might get you almost all the performance back. Assumed size (or explicit shape) dummy arguments imply contiguous (and don’t create a new descriptor to pass), so you could go that route, but I tend to avoid it as it has its own pitfalls (no more shape and bounds checking).

1 Like

I changed the declaration in the subroutine to:

real(wp), contiguous, dimension( : , 0: ) :: hist

but the code is now running extremely slowly. Is there something else that I need to do?

My mistake. The subroutine calls a function, and I forgot to declare contiguous in the function. Rerunning now…

Ah, I see. Try removing the explicit ranges from the call blah. I.e. call blah ( … , hist(:,:,i) …). That way the compiler knows for sure that slice is contiguous.

Nothing stands out as incorrect. The contiguous attribute might be triggering copy-in/copy-out association. What is the intent of the argument? An explicit intent(in|out|inout) might be useful in avoiding an unnecessary copy-in/copy-out or perhaps to avoid some runtime checks.

In this case, I would stick with your “USE” approach, rather than as an argument.

You could try a non-standard “F77” approach, to avoid any possibility the compiler is copying array sections.

call blah ( … , hist(1,0,i) …)

subroutine blah(… , histarg,…)
real(wp), dimension(6,0:*) :: histarg

Also, as “hist” is a large array, to minimise memory access delays, it would be better to access histarg in a contiguous way, such as

do j = 1,n
  do i = 1,6
    hist(i,j) = computation
  end do
end do

rather than do i = 1,6 ; do j = 1,n

Unfortunately the “computation” does not always allow this.

Another possible change that might provide a small help could be to change from
real(wp), dimension(:,0:) :: histarg
to
real(wp), dimension(6,0:) :: histarg

Certainly in subroutine BLAH, if you can reference histarg as histarg(:,j), to encourage simd instructions, this may also help.
However, as compilers are reported to becoming smarter, these changes would already be identified by the compiler !

To check array temporaries, -fcheck=all or -fcheck=array-temp may be useful (for gfortran). I remember ifort also has a similar option, but forgot the name…

program main
    implicit none
    integer :: a(2,2)
    a = reshape( [1,2,3,4], [2,2] )

    call show( a( :,   1 ) )   !! no array-temp
    call show( a( 1:2, 1 ) )   !! no array-temp
    call show( a( 2, 1:2 ) )   !! array-temp (Line 8)

contains
    subroutine show( arr )
        !! integer :: arr(:)
        !! integer, contiguous :: arr(:)   !! array-temp
        !! integer, intent(in) :: arr(:)
        integer, intent(in), contiguous :: arr(:)   !! array-temp
        print *, arr(:)
    end
end program

$ gfortran -fcheck=array-temp test.f90  && ./a.out

At line 8 of file test.f90
Fortran runtime warning: An array temporary was created
           1           2
           1           2
           2           4

Thanks for all these suggestions. Regarding the question from RonShepard, I have not specified the intent of the argument, but I can do that, it’s intent(in).

So far I have only tried implementing the suggestion of everythingfunctional who mentioned removing explicit ranges in the call statement. FYI all these runs are on the CPU nodes of perlmutter at NERSC. I’ve tried 3 different compilers, using -O2, and here’s what I’ve found:

gnu compiler:
original code (with USE) : 402 sec
2D array code with explicit ranges: did not finish (>200x slowdown)
2D array code without explicit ranges: 412 sec

cray compiler:
original code (with USE): 189 sec
2D array code with explicit ranges: 212 sec
2D array code without explicit ranges: 203 sec

Nvidia compiler:
original code (with USE): 448 sec
2D array code with explicit ranges: 452 sec
2D array code without explicit ranges: 454 sec

In summary the timing for Nvidia is insensitive to the implementation. The cray result differs from the original by 12% with explicit ranges, 7% without explicit ranges. The gnu result is pretty insensitive without explicit ranges, but it’s incredibly slow with explicit ranges. I don’t know what to make of the gnu result, except to say that I will not use explicit ranges in that implementation.

I will give this a try.

This >200x slowdown is odd and I would say unexpected. In fact, any slowdown at all is odd behavior, even 5%. If you use the gfortran compiler flag -fcheck=array-temp as suggested above to warn about temporary array allocation, does it give any insight?

I don’t know about 200x, but this explainer from Intel says that passing non-contiguous arrays to functions with explicit bounds can cause temporary copies to be made.

Given how large OP said their array is, I wouldn’t be surprised if this had significant penalties.

I’ll be away from my computer today but I’ll try the gfortran compiler flags when I return.

Edit: Ignore this. I see that this is a runtime check. I will run the code now.

Old post:
Perlmutter was down all day yesterday. But this morning I logged in and recompiled using the gnu compiler. I tried -fcheck=all and I tried -fcheck=array-temp . Neither showed anything new compared with compiling without those compiler flags. (The only warnings I get are related to related to a rank mismatch in the MPI calls, so I compile with -fallow-argument-mismatch.) The was nothing new at all with -fcheck flags.

I compiled with -fcheck=array-temp and then I ran the code. The version of the code with 2D array arguments with explicit ranges (the extremely slow version) reports that “An array temporary was created.” In the 2D version without explicit arguments there are no reports of array temporaries.

I implemented the suggestion of John Campbell (assumed-size) and compiled with the gnu and Cray compilers. Collecting along with the previous results, they are:

gnu compiler:
assumed size (* as last dimension) 2D array argument: 393 sec
USE 3D array from module : 402 sec
assumed-shape 2D array argument without explicit ranges: 409 sec
assumed-shape 2D array argument with explicit ranges: >200x slowdown (runtime report of array temporaries)

cray compiler:
assumed size (* as last dimension) 2D array argument: 184 sec
USE 3D array from module: 189 sec
assumed-shape 2D array argument without explicit ranges: 203 sec
assumed-shape 2D array argument with explicit ranges: 212 sec

I am interested in this difference (between gnu and cray compilers); is the code completely the same for these two runs? Also, are the call statement and the declaration of subroutine (dummy) argument something like this?

!! "hist" is an allocatable array defined in a module?
!! (if so, guaranteed to be contiguous)

call blah (..., hist( 1:6, 0:nhistpoint, i ), ...)

subroutine blah(..., histarg, ...)
real(wp) :: histarg( :, 0: )                  !! (*)

If the above line (marked with *) is written like the following (marked with **), some compiler might make a copy depending on situations, but I guess it won’t happen if the actual argument is an allocatable array in a module.

real(wp) :: histarg( 1:6, 0:nhistpoint )    !! (**)

So I am wondering if there is some intermediate stage where a compiler is confused / cannot decide whether a variable is contiguous. (Possibly, is the subroutine not in a module? (like written in a legacy-style library file?))

Not a serious problem but just out of interest…

Could you give some more code detail of:

  • array declaration in module,
  • call blah
  • blah array declaration

It would be interesting to identify what has gone wrong with Gfortran; most likely being temporary arrays for array section ?

I would have expected that the following would have implied contiguous memory use :
subroutine blah(… , histarg,…)
real(wp) :: histarg(6,0:*)

The compiler can assert at compile time that hist is contiguous if:

  • hist is declared in foo()
  • or hist is declared in a module used byfoo()
  • or hist is an argument of the calling routine, with an assumed size or explicit shape
  • or hist is an argument of the calling routine, with an assumed shape and the allocatable attribute
  • or hist is an argument of the calling routine, with an assumed shape and the contiguous attribute

But knowing that hist is contiguous is not enough: the compiler has now to determine if the array section hist( 1:6, 0:nhistpoint, i ) is contiguous or not. This is possible at compile time if:

  • hist is statically declared in foo() and nhistpoint is a parameter
  • or hist is statically declared in a module used by foo() and nhistpoint is a parameter`
  • or hist is an argument of foo() with the assumed size or explicit shape, and the first two dimensions set with parameter values, and nhistpoint is a parameter`

Maybe I’ve forgotten some cases, but I think that in all other cases the compiler cannot know at compile time if the section is contiguous or not. So either it unconditionnaly generates a copy-in (it seems that this is what gfortran does in the example in this discussion), or it generates runtime tests about the contiguity to decide at runtime about copy-in or not copy-in.

@PierU

Regarding “call blah (…, hist( 1:6, 0:nhistpoint, i ), …)”

Would "call blah (…, hist( :, :, i ), …)` provide the compiler sufficient information to know hist is contiguous ?

Otherwise, there needs to be more understanding of how to avoid temporary arrays when array sections are used as arguments.

I have always tried to avoid array sections as routine arguments.