In my Fortran code there are three nested loops over the x, y, and z directions. I read that vectorization can speed up code, so I rewrote the loops in vectorized (array-syntax) form. However, the CPU times of the two versions are almost the same. Both versions were compiled with only the -O3 flag.
So I’m curious whether what I’m doing is correct and, if so, why the speed hasn’t improved significantly. When does vectorization apply?
Old version:

    do iz = 1,lz
      do iy = 1,ly
        do ix = 0,lx
          ! hdt divided by the local x spacing, recomputed for each ix
          factor = xc(ix+1)-xc(ix)
          factor = hdt/factor
          do ip = 0,npop-1
            ! facy and facz are not set inside these loops
            gx=(fb(ip,ix+1,iy,iz)-fb(ip,ix,iy,iz))*factor
            gy=(fix(ip,ix,iy+1,iz)-fix(ip,ix,iy-1,iz))*facy
            gz=(fix(ip,ix,iy,iz+1)-fix(ip,ix,iy,iz-1))*facz
            grad=cix(ip)*gx+ciy(ip)*gy+ciz(ip)*gz
            fi(ip,ix,iy,iz)= fix(ip,ix,iy,iz)-grad
          enddo
        enddo
      enddo
    enddo
New version:

    ! hdt divided by the local x spacing, stored for every grid point
    do ix = 0, lx
      factorx_flux_x(:,ix,:,:)=hdt/(xc(ix+1)-xc(ix))
    enddo
    facy = hdt/dy/2.
    facz = hdt/dz/2.
    ! whole-array differences, each stored in a temporary array
    gx_flux_x(:,:,:,:)=(fb(:,1:lx+1,1:ly,1:lz)-fb(:,0:lx,1:ly,1:lz))*factorx_flux_x(:,:,:,:)
    gy_flux_x(:,:,:,:)=(fix(:,0:lx,2:ly+1,1:lz)-fix(:,0:lx,0:ly-1,1:lz))*facy
    gz_flux_x(:,:,:,:)=(fix(:,0:lx,1:ly,2:lz+1)-fix(:,0:lx,1:ly,0:lz-1))*facz
    do ip=0, npop-1
      gradx(ip,:,:,:)=cix(ip)*gx_flux_x(ip,:,:,:) &
                     +ciy(ip)*gy_flux_x(ip,:,:,:) &
                     +ciz(ip)*gz_flux_x(ip,:,:,:)
    enddo
    fi(:,0:lx,1:ly,1:lz) = fix(:,0:lx,1:ly,1:lz)-gradx(:,0:lx,1:ly,1:lz)
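
For anyone who wants to reproduce the comparison, here is a minimal, self-contained sketch of how an explicit loop and the equivalent array-syntax statement could be timed side by side. It is not the code above: the program name, the arrays `a`, `b`, `c`, the sizes `n` and `nrep`, and the 1D centered difference are all made up for illustration, and it assumes the `cpu_time` intrinsic is an adequate timer. An optimizing compiler may also simplify the repeated loops, so the numbers are only a rough guide.

```fortran
program time_loops
  implicit none
  ! Hypothetical sizes, chosen only for illustration
  integer, parameter :: n = 2000000, nrep = 50
  real(8), allocatable :: a(:), b(:), c(:)
  real(8) :: t0, t1, t2, fac
  integer :: i, irep

  allocate(a(0:n+1), b(1:n), c(1:n))
  call random_number(a)
  fac = 0.5d0

  ! Version 1: explicit loop over the grid index
  call cpu_time(t0)
  do irep = 1, nrep
    do i = 1, n
      b(i) = (a(i+1) - a(i-1)) * fac
    enddo
  enddo
  call cpu_time(t1)

  ! Version 2: the same centered difference in array syntax
  do irep = 1, nrep
    c(1:n) = (a(2:n+1) - a(0:n-1)) * fac
  enddo
  call cpu_time(t2)

  print *, 'explicit loop  (s):', t1 - t0
  print *, 'array syntax   (s):', t2 - t1
  ! Both versions should produce the same values
  print *, 'max difference    :', maxval(abs(b - c))
end program time_loops
```

Building this with the same -O3 flag as the real code would show whether the two styles differ at all in a case small enough to inspect.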