How to vectorize a loop of a constant stride access pattern

MetMan · July 16, 2025, 12:44pm

Hi all, I have a code that I want to vectorize, but according to intel ifort vectorization report it doesn’t vectorize effectively

do nb=1,nblks
   blen = blksz(nb)
   do ix=1,blen
      i=index(nb)%ii(ix)
      j=index(nb)%jj(ix)
      n=index(nb)%ff(ix)
     do k=1,npz
         IPD_Data(nb)%Statein%tgrs(ix,k) = t_dyn(k,i,j,n)
         IPD_Data(nb)%Statein%ugrs(ix,k) = u_dyn(k,i,j,n)
    enddo
  enddo
enddo

According to the vectorization report there are several problems:

t_dyn/u_dyn variables are non-aligned access, but I can ensure that the k-dimension size is a power of 2(make sure 64bytes aligned). How can I use intel ifort compiler’s directives to specify certain variable data alignment.
For ipd_data(:)%Statein%tgrs and ugrs variables, the compiler prompts the following problem

irregularly indexed store was emulated for the variable <... >, part of index is nonlinearly computed

or

non-unit strided was emulated for the variable <... >, stride is unknown to compiler

But I can make sure that the ix dimension size is a power of 2 number, like 32, which convert from i/j two-dimensional mesh. How do I specify constant-stride memory access pattern

Thank you.

ivanpribec · July 16, 2025, 12:49pm

Which loop do you want to vectorize? The k loop or the ix loop?

Can you share what are the typical values for nblks, blen and npz?

MetMan · July 16, 2025, 12:54pm

Currently I wish to vectorize the k-dimension. The blen can be controlled by a runtime parameter with a typical value of 32 for cache friendly. nzp is 64, and typical values of nblks are the number of horizontal grid points based on the problem size divided by blen.

RonShepard · July 16, 2025, 5:23pm

The most obvious way to do this would be to transpose the tgrs(:,:) and ugrs(:,:) arrays. Of course, that might cause problems elsewhere in the code, so sometimes easier said than done.

Something else to try might be to make the first dimension of these arrays (the one indexed by ix) not a power of two but rather an odd value, and just ignore the last element of the array. I remember seeing this trick long ago with the linpack supercomputer benchmark – the leading dimension was sometimes changed from 300 to 301 to improve performance.

rwmsu · July 16, 2025, 6:46pm

I remember reading something about this in some very old Cray documentation a long time ago. Basically, you always got better performance by not using the full vector register length (probably 64) so 63 or 65 could give much better performance. Why that was the case has been wiped from my memory banks by Father Time.

RonShepard · July 16, 2025, 8:55pm

I think the issue was the stride across a row of a matrix, much like the original post in this thread. Here is the way I understood it. On a Cray, the memory was interleaved with even addresses in one bank and odd addresses in the other bank. These were 64-bit word addresses, of course, not byte addresses. You could not initiate a memory reference to the same bank on two consecutive clock cycles, you had to wait at least one cycle in between. In the first row of a 300x300 array, the address sequence would be 1, 301, 601, 901… and so on, all odd. In the second row the addresses would be 2, 302, 602, 902, and so on, all even. The same for all the other rows. So it was not possible to copy the elements of a row into the vector registers one element per clock cycle, it took twice as long as expected. The same issue also applied to copying results back from the vector registers to memory.

But if the array were dimensioned 301x300, and the last row is simply ignored in the algorithm, then the addresses in the first row would be 1, 302, 603, 904…and so on, alternating even and odd, and memory could be referenced one element per clock cycle. Same for all the other rows. Of course, referencing the elements sequentially within a column always resulted in alternating even-odd addresses, so that was not a problem in the first place.

I think later models had 4-way or 8-way memory interleaving, and then it became even more difficult for the programmer to tune algorithms, but the same kind of memory access restrictions applied in these cases too.

rwmsu · July 17, 2025, 12:49am

Found the following circa 1991 (Cray Y-MP) vectorization manual on the web. I think some of the examples and code kernels are still valid.

One of the examples shows increasing the row dimension to an odd number as one of the vectorization techniques.

MetMan · July 17, 2025, 3:38am

@RonShepard @rwmsu
If I adjust the length of the innermost dimension, I can’t use the more efficient data alignment load/store vectorization instructions which can improve performance in other loops.Is this modification worth it?

PierU · July 17, 2025, 9:41am

No! This trick was valid for a very specific architecture (vector Cray computers) and has no reason to be useful on current computers

JohnCampbell · July 17, 2025, 10:23am

I would use array section notation.
You could replace
do k=1,npz
IPD_Data(nb)%Statein%tgrs(ix,k) = t_dyn(k,i,j,n)
IPD_Data(nb)%Statein%ugrs(ix,k) = u_dyn(k,i,j,n)
enddo

with
IPD_Data(nb)%Statein%tgrs(ix,1:npz) = t_dyn(1:npz,i,j,n)
IPD_Data(nb)%Statein%ugrs(ix,1:npz) = u_dyn(1:npz,i,j,n)

and see what the vectorisation report returns.

If this is such a time critical loop, I would consider restructuring “IPD_Data(nb)%Statein%ugrs”

MetMan · July 17, 2025, 12:09pm

If I use array section notation for k loop, report gave this message,

remark #15344: loop was not vectorized: vector depenence prevents vectorization
remark #15346: vector dependence: assume FLOW dependence between ipd_data(nb,ix,:)(226,11) and ipd_data(nb) (226:11)
remark #15346: vector dependence: assume ANTI dependence between ipd_data(nb,ix,:)(226,11) and ipd_data(nb) (226:11)

FYI, the defintion of IPD_Data like:

type IPD_Data_type
   real,pointer :: tgrs(:,:) => null()
   ...
end type

I also add contiguous attribute for component of DDT, it’s no useful.

Arjen · July 17, 2025, 12:25pm

Do you have a small program that demonstrates the problem? It would be easier for others to experiment with it.

RonShepard · July 17, 2025, 2:49pm

Ah, it is a potential alias situation. The answer is to avoid using pointers, or to use them within a scope where the pointer attribute is hidden from the compiler.

If the data in the statement really are aliased, then of course the statement cannot be vectorized in principle and must be executed in a scalar way, one element at a time. It is not allowed to use a do concurrent loop in this case.

Topic		Replies	Views
Fortran is dead – Long live Fortran! Tutorials	9	916	June 27, 2025
Will using Vectorization speed up the program? Help	22	2020	November 26, 2023
Aligned allocation Help	2	1370	November 20, 2023
Multi-D pointer with constant stride and Rank R-1 into array of Rank R	7	363	October 9, 2023
Function inline in F90 code Help	11	706	June 24, 2024

How to vectorize a loop of a constant stride access pattern

Related topics