Sorry for this perhaps stupid question.
I have never been entirely clear on what vectorization is, and I have never really and fully understood it. I know there is plenty of material to be found online, but it is still not very clear to me what vectorization is.
Could anyone give some small examples? Thanks in advance!
All the code below is compiled with -O3 -xHost (or -Ofast -march=native in gfortran), and we are using an AVX2 CPU (256-bit registers, or something like that). All the examples below actually use Intel Fortran; I am not sure whether gfortran behaves similarly, but I guess so.
Anyway, consider the following code (sorry for the sloppiness), see example1.
example1
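subroutine f(a,b,c)
  ! r8 is declared inside the subroutine so the fragment compiles on its own
  integer, parameter :: r8=selected_real_kind(15,9)
  real(kind=r8) :: a(1000), b(1000), c(1000)
  c(:)=a(:)+b(:)   ! whole-array syntax
  return
end subroutine f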
The above code should be vectorized, right? The reals are 8-byte reals, so each one should take 64 bits in memory. AVX2 has 256-bit registers (or something like that), so the above code is like doing c(1:4) = a(1:4) + b(1:4) in one clock cycle or so, where 4 comes from 256/64 = 4.
But why have I so often found that writing the loop explicitly is actually faster? See example2.
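To make that picture concrete, here is a minimal sketch of how I imagine it. It is only a mental model written in plain Fortran, not what the compiler necessarily generates: the names simd_sketch, width, c_scalar and c_chunked are made up for illustration, and width = 4 is just the 256/64 assumption from above. Whether one chunk really costs one clock cycle depends on the hardware, and the sketch does not show that.

```fortran
program simd_sketch
  implicit none
  integer, parameter :: r8 = selected_real_kind(15, 9)
  ! width = 256 bits / 64 bits per element (assumption from the text above)
  integer, parameter :: n = 1000, width = 4
  real(kind=r8) :: a(n), b(n), c_scalar(n), c_chunked(n)
  integer :: i

  call random_number(a)
  call random_number(b)

  ! Scalar view: one addition per iteration.
  do i = 1, n
    c_scalar(i) = a(i) + b(i)
  end do

  ! Conceptual vector view: each iteration handles `width` elements,
  ! standing in for a single 256-bit SIMD add. n is a multiple of
  ! width here, so no remainder loop is needed.
  do i = 1, n, width
    c_chunked(i:i+width-1) = a(i:i+width-1) + b(i:i+width-1)
  end do

  ! Both views perform the same additions, so the results must match.
  if (any(c_scalar /= c_chunked)) error stop "results differ"
  print *, "chunked and scalar results agree"
end program simd_sketch
```

If this model is right, vectorization changes how many elements are processed per instruction, not the result.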
example2
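subroutine f(a,b,c)
  ! r8 is declared inside the subroutine so the fragment compiles on its own
  integer, parameter :: r8=selected_real_kind(15,9)
  real(kind=r8) :: a(1000), b(1000), c(1000)
  integer :: i
  do i=1,1000
    c(i)=a(i)+b(i)   ! explicit element-by-element loop
  end do
  return
end subroutine f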
I notice that for example1 the compiler may complain that the arrays are not aligned, or something like that. As a result, example2 is actually faster than example1.
But in principle example1 and example2 should run at the same speed, right? In example2 the compiler should be smart enough to vectorize the loop accordingly.
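For reference, this is roughly how one could time the two forms from example1 and example2 side by side. It is a rough sketch under my own assumptions, not a careful benchmark: the program name compare_forms, the repetition count nrep and the checksum trick are all made up for illustration, and the timings will vary with compiler, flags and machine.

```fortran
program compare_forms
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: r8 = selected_real_kind(15, 9)
  integer, parameter :: n = 1000, nrep = 100000   ! nrep is an arbitrary choice
  real(kind=r8) :: a(n), b(n), c(n), checksum
  integer(int64) :: t0, t1, t2, rate
  integer :: i, rep, k

  call random_number(a)
  call random_number(b)
  checksum = 0.0_r8

  call system_clock(t0, rate)
  do rep = 1, nrep
    c(:) = a(:) + b(:)              ! array syntax, as in example1
    k = mod(rep, n) + 1
    a(k) = 0.5_r8 * c(k)            ! change the input a little each pass so
    checksum = checksum + c(k)      ! the work cannot simply be hoisted away
  end do
  call system_clock(t1)

  do rep = 1, nrep
    do i = 1, n                     ! explicit loop, as in example2
      c(i) = a(i) + b(i)
    end do
    k = mod(rep, n) + 1
    a(k) = 0.5_r8 * c(k)
    checksum = checksum + c(k)
  end do
  call system_clock(t2)

  print *, "array syntax  (s):", real(t1 - t0, r8) / real(rate, r8)
  print *, "explicit loop (s):", real(t2 - t1, r8) / real(rate, r8)
  print *, "checksum (printed so the results are used):", checksum
end program compare_forms
```

To see what the compiler actually did with each form, a vectorization report is probably more reliable than timing alone, for example -fopt-info-vec with gfortran or -qopt-report with Intel Fortran; the exact options depend on the compiler version, so check its documentation.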
Finally, in the below code, example3, is there any vectorization?
example3
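subroutine f(a,b,c)
  ! r8 is declared inside the subroutine so the fragment compiles on its own
  integer, parameter :: r8=selected_real_kind(15,9)
  real(kind=r8) :: a(1000), b(1000), c(1000)
  c(:) = exp(a(:)) + log(b(:))   ! elemental intrinsics applied to whole arrays
  return
end subroutine f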
Or slightly more complicated things, like example4.
example4
subroutine f(a,b,c)
  ! r8 is declared inside the subroutine so the fragment compiles on its own
  integer, parameter :: r8=selected_real_kind(15,9)
  real(kind=r8) :: a(1000), b(1000), c(1000)
  c(:) = exp(a(:)) + log(b(:)) + exp(a(:)) * log(b(:))
  return
end subroutine f
I mean, is vectorization only for very basic operations, such as op = + - * / in
a(:) = b(:) op c(:)
whereas for something more complicated, like
a(:) = log(b(:) + c(:)*exp(a(:))*log(b(:)))
vectorization may not work?
It seems that in many cases using a do loop is faster than writing things like a(:)=b(:)+c(:). It seems that compilers do a very good job with, or are highly optimized for, plain do loops.
I also noticed that an author wrote a guide to optimizing Fortran code. Is what he says in that guide still true?