Hi all, I have a code that I want to vectorize, but according to intel ifort vectorization report it doesn’t vectorize effectively
do nb=1,nblks
blen = blksz(nb)
do ix=1,blen
i=index(nb)%ii(ix)
j=index(nb)%jj(ix)
n=index(nb)%ff(ix)
do k=1,npz
IPD_Data(nb)%Statein%tgrs(ix,k) = t_dyn(k,i,j,n)
IPD_Data(nb)%Statein%ugrs(ix,k) = u_dyn(k,i,j,n)
enddo
enddo
enddo
According to the vectorization report there are several problems:
t_dyn/u_dyn variables are non-aligned access, but I can ensure that the k-dimension size is a power of 2(make sure 64bytes aligned). How can I use intel ifort compiler’s directives to specify certain variable data alignment.
For ipd_data(:)%Statein%tgrs and ugrs variables, the compiler prompts the following problem
irregularly indexed store was emulated for the variable <... >, part of index is nonlinearly computed
or
non-unit strided was emulated for the variable <... >, stride is unknown to compiler
But I can make sure that the ix dimension size is a power of 2 number, like 32, which convert from i/j two-dimensional mesh. How do I specify constant-stride memory access pattern
Currently I wish to vectorize the k-dimension. The blen can be controlled by a runtime parameter with a typical value of 32 for cache friendly. nzp is 64, and typical values of nblks are the number of horizontal grid points based on the problem size divided by blen.
The most obvious way to do this would be to transpose the tgrs(:,:) and ugrs(:,:) arrays. Of course, that might cause problems elsewhere in the code, so sometimes easier said than done.
Something else to try might be to make the first dimension of these arrays (the one indexed by ix) not a power of two but rather an odd value, and just ignore the last element of the array. I remember seeing this trick long ago with the linpack supercomputer benchmark – the leading dimension was sometimes changed from 300 to 301 to improve performance.
I remember reading something about this in some very old Cray documentation a long time ago. Basically, you always got better performance by not using the full vector register length (probably 64) so 63 or 65 could give much better performance. Why that was the case has been wiped from my memory banks by Father Time.
I think the issue was the stride across a row of a matrix, much like the original post in this thread. Here is the way I understood it. On a Cray, the memory was interleaved with even addresses in one bank and odd addresses in the other bank. These were 64-bit word addresses, of course, not byte addresses. You could not initiate a memory reference to the same bank on two consecutive clock cycles, you had to wait at least one cycle in between. In the first row of a 300x300 array, the address sequence would be 1, 301, 601, 901… and so on, all odd. In the second row the addresses would be 2, 302, 602, 902, and so on, all even. The same for all the other rows. So it was not possible to copy the elements of a row into the vector registers one element per clock cycle, it took twice as long as expected. The same issue also applied to copying results back from the vector registers to memory.
But if the array were dimensioned 301x300, and the last row is simply ignored in the algorithm, then the addresses in the first row would be 1, 302, 603, 904…and so on, alternating even and odd, and memory could be referenced one element per clock cycle. Same for all the other rows. Of course, referencing the elements sequentially within a column always resulted in alternating even-odd addresses, so that was not a problem in the first place.
I think later models had 4-way or 8-way memory interleaving, and then it became even more difficult for the programmer to tune algorithms, but the same kind of memory access restrictions applied in these cases too.
@RonShepard@rwmsu
If I adjust the length of the innermost dimension, I can’t use the more efficient data alignment load/store vectorization instructions which can improve performance in other loops.Is this modification worth it?
I would use array section notation.
You could replace
do k=1,npz
IPD_Data(nb)%Statein%tgrs(ix,k) = t_dyn(k,i,j,n)
IPD_Data(nb)%Statein%ugrs(ix,k) = u_dyn(k,i,j,n)
enddo
with
IPD_Data(nb)%Statein%tgrs(ix,1:npz) = t_dyn(1:npz,i,j,n)
IPD_Data(nb)%Statein%ugrs(ix,1:npz) = u_dyn(1:npz,i,j,n)
and see what the vectorisation report returns.
If this is such a time critical loop, I would consider restructuring “IPD_Data(nb)%Statein%ugrs”
If I use array section notation for k loop, report gave this message,
remark #15344: loop was not vectorized: vector depenence prevents vectorization
remark #15346: vector dependence: assume FLOW dependence between ipd_data(nb,ix,:)(226,11) and ipd_data(nb) (226:11)
remark #15346: vector dependence: assume ANTI dependence between ipd_data(nb,ix,:)(226,11) and ipd_data(nb) (226:11)
FYI, the defintion of IPD_Data like:
type IPD_Data_type
real,pointer :: tgrs(:,:) => null()
...
end type
I also add contiguous attribute for component of DDT, it’s no useful.
Ah, it is a potential alias situation. The answer is to avoid using pointers, or to use them within a scope where the pointer attribute is hidden from the compiler.
If the data in the statement really are aliased, then of course the statement cannot be vectorized in principle and must be executed in a scalar way, one element at a time. It is not allowed to use a do concurrent loop in this case.