Optimizing vectorized array operations

Imagine a loop like

do i = 1, n
   a(i) = b(i) * c(i)
   d(i) = fred(a(i)**2 + 2.0)
   e(i) = d(i) / b(i) + a(i)
end do

with the function fred defined in a different translation unit (a separate file/module)

! implicit typing used!
function fred(x)
data c0, c1, c2, c3, c4, c5, c6, c7, c8, c9 &
   / .1, .2, .3, .4, .5, .6, .7, .8, .9, 1. /

fred = c0 + x*(c1 + x*(c2 + x*(c3 + &
            x*(c4 + x*(c5 + x*(c6 + &
            x*(c7 + x*(c8 + x*(c9 + x)))))))))
end function

At the point the loop is compiled, the function body remains unknown and the compiler will most likely produce a scalar loop (scalar meaning it doesn’t use SIMD instructions).

Levesque discusses five approaches to fix this issue:

  1. Splitting the loop to isolate the external reference into a loop of its own
  2. Replacing an external function definition with a statement function
  3. Pulling the code of the external routine into the referencing loop
  4. Pushing the loop into the sub-program
  5. Restructuring a scalar function into a vector subroutine

Number 2 would be replaced by an internal procedure today. Which fix to choose depends from case to case. Today there is also array expressions and elemental functions as option 6 and OpenMP SIMD as option 7.

1 Like

Thanks for your answer and for the reference to Levesque book! It’s very readable, at a first glance.

The Levesque book that you referenced dates back to 1988. While it is a great source, it is based on pre-Fortran90 unfortunately. Would you recommend similar but more recent books/material? Even just Fortran 90 would be nice. Thanks

Consider the following:

program assign
   integer :: b(-1:1)=[-1,0,1], c(-2:0) = [-2,-1,0]
   integer, allocatable :: a(:)
   a=b; write(*,*) lbound(a), lbound(b)
   a=c; write(*,*) lbound(a), lbound(c)
end program assign

$ gfortran assign.f90 && a.out
          -1          -1
          -1          -2

I understand why that happens, but I can see why it might be surprising to a new programmer.

Yes, that is exactly the problem sentence. As I stated above, it does not specify what happens if those conditions are not satisfied.

The best argument that the lhs should not be reallocated is that f90 and f95 required the array extents on the lhs to match those of the rhs and these standards did not allow reallocation of the lhs. F2003 and more recent standards allow the array extents to differ, and in that case the lhs is reallocated. One might assume that the behavior of f2003 and later should match the behavior of the earlier standards, but this is not stated, either directly or indirectly. My personal preference would be for the later standards to specify what happens in that case, rather than to refer back to previous standard behavior. That is, I think each standard should be complete and self-contained.

In this particular case, the only change that would be needed is 1) to change “if” to “only if” in the above quoted sentence, or 2) add the statement that says the lhs is not reallocated if the shapes agree.

One new feature that would be useful in the language would be a way to change the bounds of an existing allocatable array. The only way to achieve that now is to make a temporary copy of the array contents, deallocate the original, allocate the original with the new bounds, and copy the contents back.

Yes… Bounds and (re)allocation on assignment has sometimes unexpected behaviors. And a = b(:) in the first place is even a different case.

I understand your concern, but this is a more general questioning: is a processor authorized to relocate in memory any variable (allocatable or not) during the lifetime of this variable?

Everybody would say “no”. And yet, I don’t think the standard clearly and explicitely state that.

The best I can think of is

“Introduction to High Performance Computing for Scientists and Engineers” by Georg Hager and Gerhard Wellein

Book webpage here: Georg Hager's Blog | HPC Book (a new edition is in preparation).

Levesque also published other books later in his career, but I didn’t have the chance to read them yet. There is this quote from Levesque,

Ask not what your compiler can do for you, ask what you can do for your compiler.

My observation is that sometimes different compilers “expect” you to program in different ways if you want to get the auto-vectorization to succeed. Anecdotally, Levesque isn’t the biggest fan of array operations (maybe he says it during this tutorial, but I might be mistaken). Edit: the inefficiency of array operations appears to be a section of his 2018 book,

I think array operations can be very expressive and helpful. If profiling shows they introduce too much over-head they can be refactored. One can include this refactoring cost under the “Plan to throw one away” rule (also called the “Design it twice” principle).

2 Likes

I think the old (1998) HPC book by Dowd and Severence has a section on loop optimization and code tuning
https://www.amazon.com/s?k=High%20Performance%20Computing&i=stripbooks&rh=p_27%3ADowd&s=relevancerank&Adv-Srch-Books-Submit.x=0&Adv-Srch-Books-Submit.y=0&unfiltered=1

This old NASA report also has some good info on reordering loops for cache performance. Its a bit dated though and focuses mostly on RISC processors from the 1990s

1 Like

The standard also doesn’t specify what happens if the Sun goes nova. What it does specify is the series of steps taken during intrinsic assignment when the LHS is an allocatable variable. The sentence I quoted is part of a longer paragraph that spells out more:

If the conditions are not satisfied, such that the shape, length, dynamic type, etc. all match, then no deallocation or reallocation is done.

Keep in mind that the Fortran standard specifies the behavior of a standard-conforming program.

As I said, the programmer might hope that reallocation does not occur, but that sentence, and that paragraph, do not specify that behavior. The words “if” and “only if” have distinct meanings, one cannot just declare that they are equivalent in this case.

edit: I just did a quick search of the f2023 draft document, and I found “only if” occurs on 69 pages of the document. So the standard authors certainly do know the difference in the meaning of “if” and “only if” and they know how to use the two terms appropriately.

This raises the further question of what might happen if reallocation does occur in a standard conforming program? Would there be observable consequences? Consider that the lhs array has the target attribute, and that pointers have been assigned to either the whole array or to strides within that array. If stealth reallocation occurs, then those pointers would lose their association, and that would then have observable consequences. If those pointers are local, then the compiler might be able to detect the possible problem and avoid the stealth reallocation, but if there are nonlocal pointers associated with the original array, that would be more difficult or even impossible for the compiler to detect.

Another observable consequence of stealth reallocation would be the possible change of bounds of the lhs array. This is the third bullet point in the quoted paragraph above, along with the assumption that stealth reallocation would have the same semantics as the defined reallocation situation.

I think this question also gets to this same issue. Certainly the compiler is free to make local copies of variables, including arrays. Register copies for scalar variables and copy-in/copy-out argument association are two examples of this. But in these cases, the local variables are eventually copied back to the original memory locations so that, for example, pointers remain associated. But if stealth reallocation occurs, those pointers would no longer remain associated, and that would have observable consequences.

The scope of local variables and dummy arguments and the consequences of the target attribute seem to be accounted for in the language specification. For example an intent(in) dummy argument with target attribute cannot be reallocated; its allocation status must remain unchanged, and it is the programmers responsibility to ensure that reallocation does not occur. This means that nonlocal pointers to that array will remain associated, as expected by the programmer. But an intent(inout) dummy argument with the target attribute can be reallocated, in which case any previously associated nonlocal pointers lose their association.

Another example in the standard where this is mentioned is in the move_alloc() intrinsic. In this case, the language specifies that any previous pointer associations with the from= argument retain their associations and are transferred to the to= argument.

So while the language is fairly specific, and I think completely covers all possible situations, in other parts of the standard, it seems odd that this one particular situation involving the lhs assignment is not fully specified. I do not think this neglect was intentional, giving the compiler the freedom to reallocate, I think it was an oversight. Consistent with my previous comments about f90/f95 compatibility in this situation, I think “only if” was intended all along.

Let’s see it differently: it’s not because the standard describes a case where deallocation/reallocation does occur, that the compiler can do whatever he wants in the other cases.

Before F2003, so without this paragraph in the standard, were you questionning the following?

real, allocatable :: a(:), b(:)
...
allocate( a(-2:7), b(10) )   ! both size=10
...
a = b
deallocate( b )
...

The compiler could optimize the last 2 lines by modifying the descriptor of a such that its hidden data pointer point to the memory area of b, and free the memory originally pointed by a. The rest of the descriptor of a, including the bounds, would be unchanged. And since a is not a target you wouldn’t be able to notice the trick. At the end the program would run as expected.

That’s actually a general rule: as long as a standard-conforming program runs as expected, the compiler can do whatever he wants (including some stealth reallocations). If the program doesn’t run as expected, then it’s a compiler bug.

Before f2003, reallocation was not a possibility in the assignment, so I would say the semantics of that assignment+deallocation would have required the bounds of a to remain unchanged. With no pointers associated with a, no storage sequence requirements, etc., then I do think it would have been allowed for a compiler to do the shallow copy optimization to achieve the same ends. That is, the “as if” rule applies to that assignment the same as any other assignment. At least I can’t think of anything else in the language at that time what would have precluded that optimization. But if there are pointers and targets involved, or if storage sequence issues were involved, then such an optimization would not have been allowed because it would no longer satisfy the “as if” rule. As I said before, I don’t know of any compiler that actually did that kind of optimization, but at least as an academic question, I think it would have been allowed in some situations (where the compiler could verify that “as if” applied).

But after f2003, reallocation on assignment became possible, so the language needed to describe fully when c_f_pointer() results change, when the targets of pointers can change, when dummy argument aliases might occur, when bounds are transfered and when they aren’t, and so on. Those are all things that don’t just occur behinds the curtains, they are now things that have observable consequences within the fortran language. I think the language since f2003 does describe all of those situations fully. I mentioned the intent(in) and intent(inout) situations regarding pointer association as just one example of how careful the standard was in describing the standard behavior. However, this one common situation with assignment is not fully specified by the language. Since so many other similar situations involving allocatables, targets, and pointers are described fully in the language, one can assume that assignment case is just an oversight, that “only if” was intended, and that there are no special rules involving pointer associations or dummy argument aliases or bounds transfer that need to be considered. I think all of that should be clearly specified in the standard, not just assumed.

Here’s a simplified version of something that tripped us up in our code:

program test
implicit none
integer, parameter :: n=10000
integer i,j
integer a(n),b(n),c(n),d(n)
real(8) cpu0,cpu1

do i=1,n
  a(i)=i; b(i)=i; c(i)=i
end do

call cpu_time(cpu0)
do j=1,1000000
  do i=1,n
    d(i)=a(b(c(i)))
  end do
  c(1)=1
end do
call cpu_time(cpu1)
print *,'time loop',cpu1-cpu0
print *

call cpu_time(cpu0)
do j=1,1000000
  d(1:n)=a(b(c(1:n)))
  c(1)=1
end do
call cpu_time(cpu1)
print *,'time array op',cpu1-cpu0
print *

print *,d(1)

end program

There are no allocatable arrays and all ranges are explicit. Nevertheless compiling with gfortran 14.2.0 using

gfortran -O3 -Warray-temporaries test.f90

yields

   26 |   d(1:n)=a(b(c(1:n)))
      |           1
Warning: Creating array temporary at (1) [-Warray-temporaries]

It appears that having a triple nested index causes gfortran to create a temporary array. This reduces performance of the above code from 3.6s to 5.8s. Double nested index arrays like

d(1:n)=a(c(1:n))

don’t have this problem.

Intel Fortran doesn’t create a temporary array and the array operation is slightly faster than the explicit loop.

2 Likes

As a Fortran programmer who has embraced F90’s addition of automatic and allocate, I have found F03’s inclusion of auto (re-) allocate to be a tricky inclusion for (what is only F90) legacy codes.
One way I have tried to avoid the problem is by using “F90 wrapper” routines where the allocatable attribute is not included in the wrapper routine.
Use of array sections may help, but the main development Fortran compiler I am using has inefficiencies associated with array sections ( due to creating unnecessary temporary copies)

The problem is not just that you need to be careful with array syntax in new code, but the uncertainty of array syntax used in previous F90 codes.

1 Like

The following change helps, but an “optimising compiler” should be better than us !

integer a(n),b(n),c(n),d(n),k(n)

call cpu_time(cpu0)

  k = b(c(1:n))

do j=1,loops
  d(1:n)=a(k(1:n))
  c(1)=1
end do
call cpu_time(cpu1)
print *,'time new array op',cpu1-cpu0
print *

The “as if” rule still applies. IMO one should not assume that deallocation/reallocation occur, except in the cases explicitly described by the standard. And if it occurs besides these cases, it should be “as if” it doesn’t occur.

Nonetheless, if you think that the standard needs to be edited on this point (with an “if and only if” instead of “if”, or with an additional note to make things fully clear), the best way is to raise the topic on the J3 mailing list, and if people there agree with you, to submit a proposal describing the edits.

Do you use explicit intents? I read recently in the gfortran developer mailing list,

this patch addresses a long-standing difference between gfortran and
other brands: when an array actual argument was passed to a procedure,
and the dummy argument had no intent specified, we would often create
packing and unpacking code. Only the case of the dummy argument
having intent(in) did avoid the unpacking.

[…]

(*) There is a missed-optimization in that we do not simply create
suitable array descriptors when passing to assumed-shape dummies,
which may avoid the packing.

I agree that one should not assume that that the dealloction+allocation sequence occurs, but the ambiguity I see is rather that the programmer must assume that it does not occur because that case is not fully described by the standard.

To use the sun analogy above, suppose the standard goes to great detail describing what happens when the sun does not go nova, but ignores the case where it does go nova. What should the programmer assume about his code in the nova case?

This is the other part of the ambiguity. If it does occur, then should the programmer assume “as if” f90 semantics apply, e.g. the lhs array bounds remain unchanged, or should the programmer assume “as if” f2003 semantics apply according to 10.2.1.3 (posted above in this thread), which states that the bounds are copied from the rhs expression.

BTW, I have noticed one other observable difference (besides the bounds, pointer association, and argument aliasing situations), and that is when the deallocation triggers a finalization operation. Without deallocation of the lhs, no finalization should occur, but with deallocation the programmer might expect it to occur and it doesn’t, or he might expect it to not occur and it does.

I’ll consider that. I guess I’m surprised that this has not already been corrected over the past 20+ years or so. I thought I might be overlooking some relevant section of the standard, or not understanding some convention in the standard (e.g. that “if” always implies “only if”). So far, that does not appear to be the case. No one so far has pointed out any other relevant parts of the standard, and the standard does indeed use correctly “only if” when that is the intended meaning.

I can see how this avoids the unnecessary compiler-generated temp array, but it does so because the programmer instead creates the unnecessary temp array. :slight_smile:

2 Likes

@RonShepard
“the programmer instead creates the unnecessary temp array” only once !
( and possibly would update it a minimal amount )

1 Like

I have frequently had this problem at compile time.

However, it doesn’t necessarily mean that the program will do this.

A better check is to do -fcheck=array-temps which will print-out at run-time whether a temporary array was created. This is way more accurate.

However, as you say your test program does indicate the temporary, so this is merely a comment for other readers. :slight_smile:

1 Like