What really is vectorization and how to implement it?

Sorry for this perhaps stupid question.
I always feel not exactly clear what is vectorization, never really and fully understand it. I know there are many materials I can find on internet, google. But still not very clear what is vectorization.

Could anyone give some small examples? Thanks in advance!

Like, all the below code are using -O3 -xHost (or -Ofast -march=native in gfortran) optimization, and we are using a AVX2 CPU (256 bit register or something whatever). Actually all the below examples are using Intel Fortran, I am not sure if gfortran behave simialrly but I guess so.

Anyway,
I know in the following code, sorry for the sloppy code, see example1

example1

integer, parameter :: r8=selected_real_kind(15,9)
subroutine f(a,b,c)
real(kind=r8) :: a(1000), b(1000), c(1000)
c(:)=a(:)+b(:)
return
end

The above code should be vectorized,right? The real is real 8 type, it should took 64 bit in the ram. For AVX2 it is with 256 bit register or something, The above code is like doing c(1:4) = a(1:4) + b(1:4) in one clock cycle or something, and 4 is because 256/64=4.

But why the hell many times I found that actually doing the loop is faster? see example 2,

example2

integer, parameter :: r8=selected_real_kind(15,9)
subroutine f(a,b,c)
real(kind=r8) :: a(1000), b(1000), c(1000)
integer :: i
do i=1,1000
    c(i)=a(i)+b(i)
endddo
return
end

I notice that the compiler in example1 may complain that the array are not aligned or something. Therefore actually example2 is faster than example1.
But I mean in principle both example1 and 2 should be the same speed right? In example2, the compiler should be smart enough to vectorize the code correspondingly.

Finally, in the below code, example3, is there any vectorization?

example3

integer, parameter :: r8=selected_real_kind(15,9)
subroutine f(a,b,c)
real(kind=r8) :: a(1000), b(1000), c(1000)
c(:) = exp(a(:)) + log(b(:))
return
end

or slightly complicated things like, example4
example4

integer, parameter :: r8=selected_real_kind(15,9)
subroutine f(a,b,c)
real(kind=r8) :: a(1000), b(1000), c(1000)
c(:) = exp(a(:)) + log(b(:)) + exp(a(:)) * log(b(:))
return
end

I mean is vectorization is only for those very basic operatons like, op = + - * /

a(:) = b(:) op c(:)

However, for something more complicated, like

 a(:) = log(b(:) + c(:)*exp(a(:))*ln(b(:)))

Then vectorization may not work?

It seems in many case using a do loop is faster than writing things like a(:)=b(:)+c(:). It seem the compiler are doing very good job or highly optimized in just doing the do loops.

I also noticed an author wrote a guide for optimization Fortran code, is what he said in the manual still true?

1 Like

It is possible to write vectorized versions of elementary functions like exp and log that are faster than the scaler versions. That said calling them from a language without metaprogramming will be ugly and annoying.

subroutine f(a,b,c)
    real(kind=r8),intent(in) :: a(:), b(:)
    real(kind=r8),intent(in) :: c(:)
    c = a + b
end

Assuming you have explicit interfaces (e.g. subroutine is in a module) I would have made the example as above. I have seen examples of compilers getting confused on optimisation so make things as clear as possible. If it is a whole array operation do not specify a slice for example and add intent. That said you simple examples should vectorise. The do loop example is simple if there were other operations in the loop the compiler might have to work hard or be quite clever to work out if it can be vectorised. Maybe some of the values change in more than one place in the loop etc. Give the compiler the best chance possible to optimise is my rule. Also look at the optimisation reports (if any) to see what was actually done…

2 Likes

You can experiment with the code compilers produce, here: Compiler Explorer

2 Likes

On first reading, there are some good points and some misunderstandings in there. Perhaps I will have time to comment further.

1 Like

That subroutine has a bug: c must be declared with intent(out) not in.

@CRquantum One other thing that is probably contributing to the confusion here, is that there are two completely different things that are sometimes referred to as “vectorization”.

  1. Vectorization can mean the use of SIMD instructions like AVX, which allow the CPU to perform multiple operations in the time it would otherwise take to perform one, as you know.

  2. Vectorization can mean using array statements instead of loops. For example, some people refer to c(:) = a(:) + b(:) as “vectorized”, where an equivalent do loop (like in your example) would be “non-vectorized”. This difference doesn’t matter that much in Fortran, where do loops are still fast, but it matters a lot in interpreted languages like matlab and python, where a for loop might be tens or hundreds of times slower than using a numpy array statement.

But the key thing that I think is causing some of your confusion is that those two different uses of the word “vectorization” are completely different. Using an array statement doesn’t necessarily mean that the compiler is more likely to emit SIMD instructions than if you use a do loop.

As for why you sometimes find that do loops are faster than array statements, I don’t know for sure but I can speculate. At the end of the day, array statements are just syntactic shorthand for loops - in the simple example you posted, I think a compiler could very well emit the exact same assembly code for the two versions. But in a more complicated example, there could be differences. Suppose, for example, you have a series of array statements, and you’re comparing that with a do-loop that does a series of operations in the body of the loop. If the compiler emits a separate loop for each array statement, then the array statement version might well be slower than that do-loop version, because it has to walk over the data multiple times. See for example section 2.3.8 in the optimization guide that you linked (I haven’t read the whole guide yet so I won’t comment on its overall accuracy).

5 Likes

When you’re talking “vectorization” in terms of the compiler emitting vector SIMD instructions (e.g. AVX2), AFAIK it is only turned on at higher levels of compiler optimization (-O2 or above for nvfortran; -O3 or above for ifort) or has to be enabled manually (as in specifying -march and -mtune for gfortran).

1 Like

These are my comments on the PDF document

Page 7, last para:
In fortran, data in arrays are stored in such an order that the most left-hand side digit is growing rst.

Not specified in modern Fortran but the practice is universal among compilers. “Array element order” is what is specified in the Standard and it makes sense to make array layout in linearly addressed memory as close to array element order as possible.

Page 8, penultimate para:
Unfortunately that implicit shaped declarations also instruct the compiler that these dummy arrays are not necessarily contiguous in memory.

Correct because the subroutine can now be called with actual arguments being sections of arrays (which may include vector subscripts).

Page 8
Implicit shaped declarations should not be used

Disagree (and the term is assumed-shape)! Instead you can use intrinsic IS_CONTIGUOUS to determine at runtime if the actual argument is contiguous, and dispatch to more specific code:

subroutine add(a,b,c)
  Real, Intent(Out), Target:: a(:)
  Real, Intent(In), Target:: b(:),c(:)
  If (IS_CONTIGUOUS(a).AND.IS_CONTIGUOUS(b).AND. IS_CONTIGUOUS(c)) Then
    Block 
      Real, Pointer, Contiguous :: p(:),q(:),r(:)
      p=>a;q=>b;r=>c
      p(:)=q(:)+r(:)
    End Block
  Else
    a(:)=b(:)+c(:)
  End If
End Subroutine

Advanced compilers might insert such code speculatively. You can easily check the generated code from some compilers in Compiler Explorer website. Intel compiler produced AVX2 SIMD code for the above loop.

Page 9
Do not use array syntax in computational loops

Disagree! Advanced compilers could perform loop fusion. Do not underestimate their power!

3 Likes

GCC 12 Enables Auto-Vectorization For -O2 Optimization Level

https://www.phoronix.com/scan.php?page=news_item&px=GCC-12-Auto-Vec-O2

4 Likes