Optimizing vectorized array operations

JohnCampbell · March 23, 2025, 4:46am

I just do not understand this statement. Is this a new revelation or rubbish ?
How is a C-style loop different from a Fortran DO loop ?

Multi-dimensional Fortran array syntax is quite powerful as it can provide the compiler more flexible optimisation approaches for memory management.

You mention “optimizing for L1/L2 cache utilization” but isn’t the problem more optimizing for L1/L3 cache utilization. You previously mentioned “manual memory management”. I am not sure what this is but if this applies to interfering with cache management, this could be a disastrous approach.
I have spent years trying to understand how to “optimize for L1/L3 cache utilization” and my impression is the best influence can be via designing array subscript order, while manual cache controls should be left to the operating system.
It is a bad sign when you test a memory strategy and get inconclusive test results.

FORALL and DO CONCURRENT are examples of Fortran language applying rules that may help the compiler when optimising, but the benefits are not easy to identify.
It is disappointing that OpenMP can not derive more efficiency from Fortran array syntax. The inefficency of OpenMP thread management has hinderd the benefits of FORALL and DO CONCURRENT.

Gfortran’s MATMUL implementation is an interesting example of L1/L3 cache optimisation, although tailored to the L1 cache sizes at that time.

The other problem with some memory strategies is they can easily become outdated. A good example is memory alignment issues with AVX registers, which appear to have since been addressed in the recent hardware.

PierU · March 23, 2025, 8:06am

I think that you missed the point: these 2 operations in your MRE do update the whole 800000 elements of the arrays, while only 1000 of them are really modified (that is about 0.1%). Your code is performing useless tasks here. If you want a temporary array and update it with array syntax, fine, but do it only where it matters (i1 and i2 are just readability):

i1 = rand_num + 1
i2 = rand_num + N_samples
line_pack(i1:i2) = 0.0d0 
pack_tot(i1:i2) = pack_tot(i1:i2) + line_tot(i1:i2)

PierU · March 23, 2025, 8:26am

As @JohnCampbell wrote, this is a harsh statement.

Typically even a HPC code is made of >95% of non performance-critical sections. Using array syntax there is most of time fine.

What about the few % that are critical? Depends…

I tend to write array syntax in the first place as long as the array syntax is at least as readable as the equivalent loop, because it also corresponds to cases where the compilers have got quite good at optimizing them (we are no longer in the 90’s where compilers were struggling very hard with the array syntax).

As soon as the full array syntax gets difficult to read, I write it as a mix of loops and simpler array syntax. Again with readability/simplicity in mind.

At the end, when profiling the code, if a section with some array syntax tend to take a lot of time, I investigate and possibly try rewriting it with only loops. But in my experience, the full “do loop” approach is rarely really more efficient than a mixed “do loop”+“array syntax” (although it sometimes happens, but again with some experience I can often spot this cases early).

zerothi · March 27, 2025, 1:18pm

There are lots of really good answers here! So I won’t add to those.

Let me just add a final note which makes array-syntax sub-optimal, and in this suggestion, it can kill performance.

When you do

real, allocatable :: a(:,:,:)
...

a = b

what the compiler will do (because of the standard) is to check if a has the correct shape, and if not, deallocate it, then re-allocate it, then copy. If it has the correct shape, it will just copy. However, the checks inserts lots of unnecessary code…

While this may be intended by the programmer, it can hurt performance when one is doing that call frequently, and you intentionally don’t want to check shape. So I would encourage users to adopt this standard:

a(:,:,:) = b

whenever you really don’t want to deallocate. The slices tells the compiler that you want to overwrite values in the array, and thus the compiler will not put in checks for shape, and re-alloc statements.
The point is that you’ll be removing lots of branches, and gain free performance.
Once gotten use to it will also be more obvious when you want what.

Beliavsky · March 27, 2025, 1:55pm

@sblionel wrote a post Doctor, it hurts when I do this! in 2008 giving different advice, although his comment at the bottom in 2021 in response to a question is consistent with what you wrote.

As an alternative to writing x(:,:,:) on the LHS, which looks a bit cluttered, once arrays have been allocated they can be passed to a procedure as fixed-size arrays.

It would be nice to have a benchmark where the speeds of using x vs. x(:,:,:) vs. passing the allocatable arrays as fixed-size arguments to a procedure are compared.

zerothi · March 27, 2025, 2:02pm

Sure, it is only applicable when the variable is declared as an allocatable. Argument, or local variable aside.

RonShepard · March 27, 2025, 3:47pm

This would be good information to have, but of course the results would depend on the compiler, the compiler version, and also on the compiler options that have been invoked at compile time and possibly even the run time environment at the moment that the statement is executed.

I also wanted to point out that there is an ambiguity in the standard regarding the array assignent a=b when the allocatable array has been allocated and when it has consistent sizes in each dimension. The programmer might expect that no reallocation would occur in this case, but this situation is not actually fully specified by the standard. It is possible also for the compiler to reallocate even in that case. An example of when this might be desirable is the code sequence

a = b
deallocate(b)

A clever compiler might realize that

call move_alloc( from=b, to=a )

achieves the same thing but with an inexpensive shallow copy instead of an expensive deep copy and is also consistent with the text of the standard. Another situation where reallocation might be desirable is to enforce some kind of memory or cache locality on the array a.

I don’t know of any compilers that do this optimization, so this is just an academic observation at this time.

themos · March 27, 2025, 4:40pm

An alternative suggested by the editor of the Fortran Standard, but in a slightly different context, is to use POINTER.
“There is no auto-reallocation for pointers.”

https://mailman.j3-fortran.org/pipermail/j3/2015-April/008047.html

certik · March 27, 2025, 4:50pm

zerothi:

When you do
real, allocatable :: a(:,:,:)
...

a = b
what the compiler will do (because of the standard) is to check if a has the correct shape, and if not, deallocate it, then re-allocate it, then copy. If it has the correct shape, it will just copy. However, the checks inserts lots of unnecessary code…

Yes, I am not a fan on this either. I think Fortran should never be doing unnecessary checks or operations. If the user requests it via a compiler option (in Debug mode) then the compiler should check that all shapes agree. But in Release mode it should not be checking these things. For most compilers one can get this (non-standard) behavior via compiler options (for some it’s even the default).

ivanpribec · March 27, 2025, 5:00pm

In this particular case, re-allocation of an allocatable argument, the checks are necessary IMO. Not deallocating a (if allocated) would be a memory a leak. Checking the shape can save the effort of a redundant malloc. I don’t see anything unnecessary.

It isn’t that different in other languages say Python + NumPy, where you need to write a += alpha * b instead of a = a + alpha * b to prevent creating a temporary array on the RHS.

RonShepard · March 27, 2025, 5:57pm

I think they are necessary, and of course standard conforming, too. I think @certik’s comments rather apply to the statement a(:,:,:)=b. In that case, the compiler is not required by the standard to test array extents, but many programmers would like the ability to turn on such checks in debug mode and also sometimes even in production mode. However, such extent tests are standard conforming in any case, whether the programmer wants them or not.

zerothi · March 27, 2025, 8:43pm

I agree, this should have never been part of the standard IMHO!

And having compilers handling this differently just makes things worse to ensure compliance…

certik · March 27, 2025, 9:53pm

The two modes are automatic reallocation of LHS, and no reallocation. In the no reallocation mode there is no leak and no checking necessary. In the reallocation mode, you have to indeed check even in Release mode and free the old array and allocate it to the proper size.

PierU · March 28, 2025, 10:27am

zerothi:

real, allocatable :: a(:,:,:)
...
a = b
what the compiler will do (because of the standard) is to check if a has the correct shape, and if not, deallocate it, then re-allocate it, then copy. If it has the correct shape, it will just copy. However, the checks inserts lots of unnecessary code…

While this may be intended by the programmer, it can hurt performance when one is doing that call frequently, and you intentionally don’t want to check shape. So I would encourage users to adopt this standard:
a(:,:,:) = b
whenever you really don’t want to deallocate. The slices tells the compiler that you want to overwrite values in the array, and thus the compiler will not put in checks for shape, and re-alloc statements. The point is that you’ll be removing lots of branches, and gain free performance.
Once gotten use to it will also be more obvious when you want what.

I agree, and I tend to always use the slice notation on the LHS, if I know that the shapes are supposed to be identical. However, it should be noted that the performance gain by doing so is noticeable only with small arrays. If the arrays have 10**3 elements the cost of the additional test is negligible…

But even with large arrays there is another (and better IMO) reason to not write a = b when we know that the shapes are supposed to be identical: if there’s a bug somewhere in the code and b doesn’t have the expected shape, a = b does not allow detecting the bug: the compiler simply reallocates a and all is good. In contrast, a(:,:,:) = b is an illegal statement, which can be catched at runtime by compiling with bound checking enabled.

RonShepard:

I also wanted to point out that there is an ambiguity in the standard regarding the array assignent a=b when the allocatable array has been allocated and when it has consistent sizes in each dimension. The programmer might expect that no reallocation would occur in this case, but this situation is not actually fully specified by the standard. It is possible also for the compiler to reallocate even in that case. An example of when this might be desirable is the code sequence
a = b
deallocate(b)
A clever compiler might realize that
call move_alloc( from=b, to=a )
achieves the same thing but with an inexpensive shallow copy instead of an expensive deep copy and is also consistent with the text of the standard. Another situation where reallocation might be desirable is to enforce some kind of memory or cache locality on the array a.

I don’t know of any compilers that do this optimization, so this is just an academic observation at this time.

I can’t find the discussion, but I’m almost certain that @sblionel answered to that in the past, and certified that the standard does require the processor to not reallocate a when the shapes conform.

I disagree. The reallocation on assignment is also a useful feature in some cases. And it’s simple to avoid it by using the slice notation on the LHS. And if a compiler does not handle properly, then it’s a compiler bug.

The only real downside of the reallocation on assignment is the behavior of the bounds, which can can be surprising (although logical):

real, allocatable :: a(:)
real :: b(16), c(20)
allocate( a(-5:10) )    ! size(a) == 16, lbound(a) == -5
a(:) = b   ! lbound(a) == -5
a    = b   ! lbound(a) == -5
a    = c   ! lbound(a) == 1

PierU · March 28, 2025, 10:35am

I think that such an option is dangerous (and I hope that no compiler has it by default nowadays).

zerothi · March 28, 2025, 12:18pm

The problem is that this is a sematic pitfall. Knowledgeable people will know about it, and try to avoid it. But new-comers to the language won’t notice this, and might end up in perforrmance problems due to a single (:), which a code shouldn’t.

Cases where it can be useful are easy to find, but its how common users interact with the language that is important. And here, the small sematic difference is not something a common user would recognize.

The same thing could be said about the scalar definitions of parameters

pi = 3.1415
pi = 3.1415_8

are not the same.

Even for well versed fortran programmers the above mistake is happening. And the LHS-reallocation poses the same problems. It’s easy to forget…

We don’t have to agree. I just think this part of the standard was wasteful as it could be bypassed with simple routines quite easily…

PierU · March 28, 2025, 12:52pm

My usual answer to that: all languages have a learning curve, and no language is 100% intuitive. I’ve been bitten more than once by some C/C++ features because they behave differently from Fortran, and because I am primarily a Fortran programmer with little experience in C/C++

And one can actually argue that the reallocation on assignment is far more natural for a complete newcomer: a = b meaning that the object a is identical to the object b after completion looks logical… Also, the only other popular language that extensively uses array syntax is Matlab/Octave, where reallocation on assignment is everywhere.

Reallocation on assignment is actually the most disturbing for Fortran programmer who learned the array syntax starting with Fortran 90 and got used to the behavior without reallocation on assignment.

zerothi:

The same thing could be said about the scalar definitions of parameters
pi = 3.1415
pi = 3.1415_8
are not the same.

Even for well versed fortran programmers the above mistake is happening. And the LHS-reallocation poses the same problems. It’s easy to forget…

No, it’s not easy to forget. It’s like bicycle: Once you know, you know…

I can agree on the fact that it was not a must-have addition to the language. I use it now that it’s here, but I could easily do without it.

sblionel · March 28, 2025, 3:01pm

First, the standard really does say that if the shapes of the LHS and RHS match, no reallocation is done.

10.2.1.3 Interpretation of intrinsic assignments

Paragraph 3 spells out what happens if the LHS is an allocated allocatable variable. The first step is “it is deallocated if expr is an array of different shape, any corresponding length type parameter values of the variable and expr differ, or the variable is polymorphic and the dynamic type or any corresponding kind type parameter values of the variable and expr differ.” In essence, if the end result is going to be the same shape (and same other attributes), it doesn’t get deallocated and reallocated. Yes, this does take some time for the checks. But note how extraordinarily useful this is for deferred-length allocatable character variables so that if the lengths differ, the variable now has the length of the expr.

As for use of (:,:,:), I agree that if you know that the shapes will be the same that this can save you a bit of time with the checks, though as noted, if the arrays are large the difference will be noise. I still don’t care for the superfluous bounds, but if you have a good reason for them, I don’t object.

ivanpribec · March 28, 2025, 5:09pm

Thanks for clarification. Reinhold Bader covered this issue in his Fortran programming course:

IMO, one can avoid the pitfalls with some discipline, e.g. having an outer scope which does allocation, and having an inner scope where the actual operations happen and aliasing guarantees apply.

There are so many performance pitfalls for newcomers I really doubt this is the main one. From broken data structures, wrong access patterns, completely blocking vectorization with scalar functions, the list goes on and on. I’ve seen researchers “brag” about their excellent multi-threaded scaling, and it turned out they weren’t compiling the code with any optimization flags at all (I think there is a quote from a famous parallel programmer, “the easiest way to make code scale well is to make it sequentially slow”). The big waste IMO in scientific codes is when you have a Python script using a sequential BLAS library via NumPy (due to misconfiguration), and running on 1 out of 28 cores of a server processor. But Fortran, C, or C++ programmers also aren’t immune to these types of issues.

aledinola · March 28, 2025, 8:37pm

Can you give an example of “blocking vectorization with scalar functions”?
Thannks!

Topic		Replies	Views
Fortran: Array Language (video) Advocacy	20	1072	February 3, 2024
Performance impact of how a large array is accessed Help	49	1973	June 4, 2023
Will using Vectorization speed up the program? Help	22	1928	November 26, 2023
Implied-do array constructor, type specs, and differences between GFortran, Intel, and LFortran Help	49	1355	May 26, 2024
Fortran is dead – Long live Fortran! Tutorials	9	886	June 27, 2025

Optimizing vectorized array operations

Related topics