Performance impact of how a large array is accessed

I would say that it provides more information, as the compiler here just has to check whether hist is contiguous. If it is, then hist(:, :, i) is necessarily contiguous as well.

Are there any compiler options that would cause gfortran to say why a copy is being made of a particular actual argument? I would presume that the compiler is going through some elimination process to arrive at its conclusion, so something looks to be wrong with that process in this case.

@septc , @JohnCampbell – Regarding your questions – The codes are the same, I simply compiled one with PrgEnv-gnu and the other with PrgEnv-cray on Perlmutter. In the gnu run with 200x slowdown, the CALL has explicit ranges:
CALL BLAH (…, hist(1:6, 0:nhistpoints, ipart) ,…)

The call without explicit ranges is:
CALL BLAH (… hist(:, :, ipart) ,…)

The dummy argument declaration within BLAH is the same in both:
real(wp), contiguous, dimension(:,0:) :: hist

Based on compiling/running with -fcheck=array-temps, the gnu compiler creates a temporary array when the routine is called with an explicit range. But without the explicit range, there’s no temporary created.

I should mention, the array is defined in a module as:
real(wp), allocatable, dimension( :,:,: ) :: hist

In the main program it is allocated as:
allocate( hist(6, 0:nhistpoints, npart_lcl ) )
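For anyone who wants to try this, the pieces described above can be put together roughly as follows. This is only a sketch and not the original code: the module names blah_mod and hist_mod, wp = real64, the values of nhistpoints and npart_lcl, and the trivial body of BLAH are all assumptions made for illustration.

```fortran
! Sketch only. Assumed: module names, wp = real64, array sizes, body of blah.
! Build with, e.g.:  gfortran -fcheck=array-temps repro.f90
module blah_mod
  use iso_fortran_env, only: real64
  implicit none
  integer, parameter :: wp = real64   ! assumed kind
contains
  subroutine blah(hist)
    ! Dummy declaration as quoted in the thread
    real(wp), contiguous, dimension(:,0:) :: hist
    hist = hist + 1.0_wp              ! placeholder work
  end subroutine blah
end module blah_mod

module hist_mod
  use blah_mod, only: wp
  implicit none
  ! Module array as quoted in the thread
  real(wp), allocatable, dimension(:,:,:) :: hist
end module hist_mod

program repro
  use blah_mod, only: wp, blah
  use hist_mod, only: hist
  implicit none
  integer, parameter :: nhistpoints = 100, npart_lcl = 1000  ! assumed sizes
  integer :: ipart

  allocate( hist(6, 0:nhistpoints, npart_lcl) )
  hist = 0.0_wp

  do ipart = 1, npart_lcl
    ! Form 1: explicit ranges (the form reported to trigger a temporary)
    call blah( hist(1:6, 0:nhistpoints, ipart) )
    ! Form 2: whole-dimension sections (no temporary reported)
    call blah( hist(:, :, ipart) )
  end do

  ! Every element has been incremented twice, with or without a copy
  print *, hist(1, 0, 1)
end program repro
```

Whether this small version shows the same temporary (or any slowdown) will depend on the compiler version and options, so treat it as a starting point for a reproducer rather than a confirmed one.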

Just out of curiosity, if you remove the contiguous attribute in the dummy array declaration, does this eliminate the array copy from occurring?

If everything is the way you describe, then this looks like an error of some kind in the compiler logic. In all cases, the actual argument should, to my eye, be the same, and no copy should be required for any of these actual argument specifications.

@RonShepard I removed the CONTIGUOUS attribute as you suggested and compiled with PrgEnv-gnu. The code ran normally! Instead of >200x slower, it ran in 419 sec.

Any thoughts on why removing CONTIGUOUS would do this?

I think it is an error in the compiler. It thinks the actual argument is not contiguous, so it is making an unnecessary copy. Including the contiguous attribute for the dummy argument can potentially make the subroutine more efficient, but in this case it is also triggering the compiler error, so it ends up doing more harm than good.

I don’t know what “contiguous” means in this declaration.
Is this standard Fortran or a Fortran extension ?
Does it mean the supplied array is contiguous, or that the compiler should make a copy to ensure it is contiguous ?

My preference for the declaration would be the following to hopefully imply contiguous:
real(wp), dimension(6,0:*) :: hist

Array sections are a performance nightmare, which I have always avoided.

It is fully standard Fortran, and instructs the compiler to make a copy in the case the actual argument is not contiguous (or in practice in the case the compiler cannot determine the contiguity).

This has the same effect, with the additional drawbacks of assumed size.

That’s obvious: because without contiguous the compiler can avoid the copy-in.

This all depends on the sophistication of the compiler, as the decision by the compiler to provide instructions for a temporary copy is made at compile time, not run time. (similar to the problem switching “large” arrays between stack and heap)
That “hist” is an allocatable array makes a lot of difference !

If the interface of the called routine requires that the received array must be contiguous, then there is a greater possibility of making a temporary copy.

In the routine calling; if the array is allocatable, then it could be harder to confirm a contiguous section of “hist(1:6, 0:nhistpoints, ipart)” than for “hist(:, :, ipart)”.
The compiler may still not be certain “hist(:, :, ipart)” is contiguous, depending if it can see the array allocation or declaration.

There is a requirement in the calling routine that for “CALL BLAH (…, hist(1:6, 0:nhistpoints, ipart) ,…)”, there is sufficient information to guarantee it would be contiguous. How hard should the compiler try ?

Again I am not sure if the declaration in the receiving routine “real(wp), contiguous, dimension(:,0:) :: hist” confirms or requires that hist is contiguous.

What should the compiler do ?

My strategy is to avoid the issue by avoiding array sections as arguments.
I also incorrectly assume that all array arguments are contiguous ( this assumption is a requirement of a F77 wrapper approach )

In many cases the decision can be made only at runtime.

It ensures that the dummy argument is contiguous, even if the actual argument is not.

Just to be clear about this, it ensures the argument is contiguous by making a copy (and also copying back the result if necessary). In this sense, it does the same thing as association with an explicit shape or an assumed size dummy argument, while keeping the other features of assumed shape arguments, so these three choices are not all exactly the same.
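To make the three choices concrete, here is a small sketch with one routine per kind of dummy, each called with the same strided (non-contiguous) section. All names (dummies_mod, the three subroutines, big, n) are made up for the example, and wp = real64 is an assumption.

```fortran
! Sketch only: all names here are hypothetical, wp = real64 is assumed.
module dummies_mod
  use iso_fortran_env, only: real64
  implicit none
  integer, parameter :: wp = real64
contains
  ! Assumed shape + contiguous: keeps shape/bounds information in the dummy
  subroutine assumed_shape_contig(a)
    real(wp), contiguous, intent(inout) :: a(:,0:)
    a = a + 1.0_wp
  end subroutine assumed_shape_contig

  ! Explicit shape: extents are passed separately
  subroutine explicit_shape(a, n)
    integer, intent(in) :: n
    real(wp), intent(inout) :: a(6,0:n)
    a = a + 1.0_wp
  end subroutine explicit_shape

  ! Assumed size: last extent unknown, so whole-array operations are not allowed
  subroutine assumed_size(a, n)
    integer, intent(in) :: n
    real(wp), intent(inout) :: a(6,0:*)
    a(:,0:n) = a(:,0:n) + 1.0_wp
  end subroutine assumed_size
end module dummies_mod

program three_dummies
  use dummies_mod
  implicit none
  integer, parameter :: n = 4
  real(wp) :: big(12, 0:n)

  big = 0.0_wp
  ! big(1:12:2, :) takes every other row: a non-contiguous 6 x (n+1) section.
  ! In all three calls the dummy must see contiguous data, so a
  ! copy-in/copy-out of the section is expected.
  call assumed_shape_contig( big(1:12:2, :) )
  call explicit_shape( big(1:12:2, :), n )
  call assumed_size( big(1:12:2, :), n )

  ! Odd rows were incremented three times, even rows were never touched
  print *, big(1, 0), big(2, 0)
end program three_dummies
```

The results are the same in all three cases; what differs is how much the routine knows about the array (shape and bounds) and whether whole-array syntax can be used inside it.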

Maybe there is something missing in the explanations given about the code so far. It might be nice to see a small full-code example of this that shows the 200x slowdown. We could then compile it, look at the intermediate code or the assembler code, and maybe determine what is happening.

The computed results are presumably correct in any case, right? So if this is a compiler error, it is just in the optimization analysis steps.

I think that ignores a lot of capability of the language. In principle, there should be no performance penalty just for using array sections, either in expressions or as arguments. Of course, that does not necessarily mean there is no penalty in practice, so the programmer must sometimes make copies of data, e.g. contiguous copies, to help the compiler achieve optimal performance.

No, what I do is manage my data structures in a better way so that I minimise the requirement for making temporary copies of array sections. When array sections are required, I try to set up a section as a temporary array to minimise the frequency of creating the temporary sections.
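As an illustration of setting up the section once as a temporary array, here is a minimal sketch. The program, the array sizes, and the routine update are all invented for the example; the point is only that the gather and scatter happen once, outside the loop, rather than on every call.

```fortran
! Sketch only: names and sizes are hypothetical.
program hoist_copy
  use iso_fortran_env, only: real64
  implicit none
  integer, parameter :: wp = real64
  integer, parameter :: n = 200, nsteps = 1000
  real(wp), allocatable :: a(:,:), row(:)
  integer :: istep

  allocate( a(n, n), row(n) )
  a = 1.0_wp

  ! a(5,:) is strided in memory (Fortran is column-major),
  ! so gather it into a contiguous work array once...
  row = a(5, :)

  do istep = 1, nsteps
    ! ...and pass the contiguous copy in the inner loop,
    ! so no temporary is needed per call.
    call update(row)
  end do

  ! Scatter the result back once at the end
  a(5, :) = row

  ! Row 5 was incremented nsteps times, row 4 is unchanged
  print *, a(5, 1), a(4, 1)
contains
  subroutine update(v)
    real(wp), contiguous, intent(inout) :: v(:)
    v = v + 1.0_wp   ! placeholder work
  end subroutine update
end program hoist_copy
```

Passing a(5,:) directly inside the loop would give the same answer, but would leave it to the compiler to create (or avoid) a temporary on each of the nsteps calls.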

This approach has been based on using F95+ compilers that are not “smart” with array sections.
Placing array sections in inner loops of an analysis is asking for inefficiency.

I also try to avoid arrays that can produce a non-contiguous array structure, which ifort (and apparently Gfortran) have introduced to improve the performance. Having learnt Fortran with pre-F90, non-contiguous arrays look to be a bad approach for efficient memory use.

@RonShepard I’m preparing a stripped down version of the code that illustrates the issue.

But in the meantime I have a question about an earlier statement:

“it [CONTIGUOUS] ensures the argument is contiguous by making a copy (and also copying back the result if necessary).”

Just to clarify: The check I’m doing is a runtime check, and it reports when an array temporary was created. The thing I don’t understand is, if it’s clear during execution that a certain call will pass a contiguous array section, then why would the code generate an array temporary?

Once again, with an assumed shape dummy argument + contiguous, the compiler will make a copy if:

  • the actual argument is not contiguous
  • or the compiler cannot determine if the actual argument is contiguous or not

It might be clear to you (the programmer), but it might not be clear to the compiler, either at compile time or at run time. You might know more about your data structures than the compiler can infer, or there is also the possibility that the logic the compiler uses to test for contiguity is incorrect. That is why having a small test case that shows the 200x slowdown would be useful.
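One way to see what the running program itself reports is the Fortran 2008 intrinsic is_contiguous. The sketch below applies it to sections shaped like the ones in this thread; the sizes and wp = real64 are assumptions, and it needs a compiler that implements the intrinsic.

```fortran
! Sketch only: sizes and wp = real64 are assumed.
program contig_check
  use iso_fortran_env, only: real64
  implicit none
  integer, parameter :: wp = real64
  integer, parameter :: nhistpoints = 10, npart_lcl = 4
  real(wp), allocatable :: hist(:,:,:)
  integer :: ipart

  allocate( hist(6, 0:nhistpoints, npart_lcl) )
  hist = 0.0_wp
  ipart = 2

  ! Whole leading dimensions, one index in the last: contiguous (T)
  print *, is_contiguous( hist(:, :, ipart) )

  ! Explicit ranges covering the full extents: the same elements as above,
  ! so by the standard's definition this should also be contiguous
  print *, is_contiguous( hist(1:6, 0:nhistpoints, ipart) )

  ! Fixed first index: elements are 6 apart in memory, not contiguous (F)
  print *, is_contiguous( hist(1, :, :) )
end program contig_check
```

Note that this only tells you what the runtime inquiry returns. It does not tell you what the compiler decided when it generated the call, which is the part in question here.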

Yes, but noncontiguous arrays occur in various algorithms, even in pre-F90 computer codes. If you look at the level-1 blas, for example, which were developed in the 1970s (pre-f77 even), they all have incx and incy arguments specifically to handle the noncontiguous cases. The array slice notation introduced in f90 is simply a continuation of that tradition of treating noncontiguous arrays. I’m not suggesting that anyone should go out of their way to use noncontiguous arrays, but rather that the array slice notation provides a clear and convenient way to use them when necessary or appropriate.
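For readers who have not seen the increment-argument convention, here is a sketch of the idea next to the equivalent array slice. The routine axpy_strided is written for this example in the style of the level-1 BLAS (it is not the library routine itself), it handles positive increments only, and all names and sizes are made up.

```fortran
! Sketch only: axpy_strided is a hypothetical stand-in written in the
! style of the level-1 BLAS, restricted to positive increments.
program stride_demo
  implicit none
  integer, parameter :: np = 5, stride = 3
  real :: x(np*stride), y(np)
  integer :: k

  x = [(real(k), k = 1, np*stride)]   ! x = 1, 2, ..., 15

  ! Pre-F90 style: pass the whole array plus an increment
  y = 0.0
  call axpy_strided(np, 2.0, x, stride, y, 1)
  print *, y      ! uses x(1), x(4), x(7), x(10), x(13)

  ! F90 style: the same strided elements written as an array section
  y = 0.0
  y = y + 2.0 * x(1:np*stride:stride)
  print *, y      ! same values as above
contains
  subroutine axpy_strided(n, a, x, incx, y, incy)
    integer, intent(in)    :: n, incx, incy
    real,    intent(in)    :: a, x(*)
    real,    intent(inout) :: y(*)
    integer :: i, ix, iy
    ix = 1
    iy = 1
    do i = 1, n
      y(iy) = y(iy) + a * x(ix)   ! y := y + a*x on the strided elements
      ix = ix + incx
      iy = iy + incy
    end do
  end subroutine axpy_strided
end program stride_demo
```

In the first form the stride is handled inside the routine and no copy is ever made; in the second the compiler decides how to walk the section.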

This is a problem, as the compiler has to provide instructions at compile time.

Typically in this case, this would lead to providing a temporary copy, as I am not aware of compilers providing both options, then choosing at run time.
( I have asked for a similar approach with stack overflow, where at run time, if the array does not fit on the stack, it is sent to the heap. Imagine how much easier this would be for the user ! After all it is just a new memory address + a bit of clean up)

I expect if the need for a temporary array is not clear, then a temporary array would be provided (unless a non-contiguous array can be handled by the called routine)

Ron, Did you run programs in 1970’s on mini computers with disk based paging ?
If you did you would not remember fondly the use of " incx and incy arguments"
That was a performance penalty !!