Will using Vectorization speed up the program?

In my Fortran code there are three nested loops over the x, y, and z directions. I learned that vectorization can help accelerate code, so I modified the code to vectorize the loops. However, I found that the CPU times of the two versions are almost the same, and I used only the -O3 flag in both compile commands.

So I’m curious whether what I’m doing is correct, and if so, why the speed isn’t significantly improved. When does vectorization apply?

old:

    do iz = 1,lz
      do iy = 1,ly
        do ix = 0,lx
          factor = xc(ix+1)-xc(ix)
          factor = hdt/factor
          do ip = 0,npop-1
            gx = (fb(ip,ix+1,iy,iz)-fb(ip,ix,iy,iz))*factor
            gy = (fix(ip,ix,iy+1,iz)-fix(ip,ix,iy-1,iz))*facy
            gz = (fix(ip,ix,iy,iz+1)-fix(ip,ix,iy,iz-1))*facz
            grad = cix(ip)*gx+ciy(ip)*gy+ciz(ip)*gz
            fi(ip,ix,iy,iz) = fix(ip,ix,iy,iz)-grad
          enddo
        enddo
      enddo
    enddo

new:

    do ix = 0, lx
      factorx_flux_x(:,ix,:,:) = hdt/(xc(ix+1)-xc(ix))
    enddo
    facy = hdt/dy/2.
    facz = hdt/dz/2.
    gx_flux_x(:,:,:,:) = (fb(:,1:lx+1,1:ly,1:lz)-fb(:,0:lx,1:ly,1:lz))*factorx_flux_x(:,:,:,:)
    gy_flux_x(:,:,:,:) = (fix(:,0:lx,2:ly+1,1:lz)-fix(:,0:lx,0:ly-1,1:lz))*facy
    gz_flux_x(:,:,:,:) = (fix(:,0:lx,1:ly,2:lz+1)-fix(:,0:lx,1:ly,0:lz-1))*facz
    do ip = 0, npop-1
      gradx(ip,:,:,:) = cix(ip)*gx_flux_x(ip,:,:,:) &
                      + ciy(ip)*gy_flux_x(ip,:,:,:) &
                      + ciz(ip)*gz_flux_x(ip,:,:,:)
    enddo
    fi(:,0:lx,1:ly,1:lz) = fix(:,0:lx,1:ly,1:lz) - gradx(:,0:lx,1:ly,1:lz)

A terminology remark: what you are doing here is called “array syntax” in the Fortran context. I know that this is called “vectorization” in the context of some other languages, but that is rather confusing, as “vectorization” also means something totally different in computer science: the ability of a CPU core to perform the same operation (fully or partially) simultaneously on multiple elements of an array.

Array syntax is crucial in interpreted languages like Python, because explicit loops are dramatically slow to execute there. It is quite different in compiled languages like C or Fortran, because 1) there is no particular performance penalty for explicit loops, and 2) the compilers are usually quite good at translating loops into binary code that uses the vectorization capabilities of the CPU.

That said, array syntax can help in some cases where the compiler is not able to automatically recognize instructions that can be safely vectorized. But if the compiler already manages to vectorize the loop, you gain nothing performance-wise from the array syntax. And to make things more complex, there are even cases where array syntax actually hurts performance.
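
As a minimal illustration of that last point (a sketch with hypothetical arrays a and b): when the two sides of an array assignment overlap, the compiler may have to create a hidden temporary to honor the “evaluate the whole right-hand side first” semantics, while an explicit loop lets you choose a traversal order that needs no temporary:

    ! The shifted assignment must behave as if all of a(1:n-1) were read
    ! before any element of a(2:n) is written, which may cost a temporary:
    a(2:n) = a(1:n-1) + b(2:n)

    ! The equivalent backward loop reads each a(i-1) before it can be
    ! overwritten, so no temporary is needed:
    do i = n, 2, -1
      a(i) = a(i-1) + b(i)
    end do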

Performance aside, array syntax is also nice for getting more compact and more readable code. But again, too much array syntax can also lead to less readable code…

Note that, generally speaking, it is the inner loop that determines whether hardware vectorization can happen or not, and that it is entirely possible to mix array syntax and explicit loops. For instance, your code could be turned into:

    allocate( gx(0:npop-1), gy(0:npop-1), gz(0:npop-1), grad(0:npop-1) )
    do iz = 1,lz
      do iy = 1,ly
        do ix = 0,lx
          factor = xc(ix+1)-xc(ix)
          factor = hdt/factor
          gx(:)   = ( fb(:,ix+1,iy  ,iz  ) -  fb(:,ix,iy  ,iz  ))*factor
          gy(:)   = (fix(:,ix  ,iy+1,iz  ) - fix(:,ix,iy-1,iz  ))*facy
          gz(:)   = (fix(:,ix  ,iy  ,iz+1) - fix(:,ix,iy  ,iz-1))*facz
          grad(:) = cix(:)*gx(:) + ciy(:)*gy(:) + ciz(:)*gz(:)
          fi(:,ix,iy,iz) = fix(:,ix,iy,iz) - grad(:)
        enddo
      enddo
    enddo

Thank you for your reply. Yes, I was a little confused about vectorization. According to what you said, if I use a compiled language like Fortran there is generally no need to deliberately vectorize the code, because the Fortran compiler usually performs such optimizations by itself. But array syntax can still be applied where appropriate to improve code readability, such as on the inner loop in my example. Is my understanding correct?

Vectorization can be done automatically by the compiler, semi-automatically by writing code in a way that is easy for compilers to vectorize, or manually with SIMD vector intrinsics (possible in C/C++, but not in standard Fortran).

Your definition of vectorization is wrong.

It’s not writing the equations in the form x(:) = y(:) + z(:).

Yes, the compiler can vectorize it, but simple do loops would be equally vectorizable in most cases.

My suggestion would be to not care about vectorization initially.

It’s significantly easier to optimize your code by changing the algorithm and reducing its big-O time complexity. This is the best option in most cases.
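
A trivial illustration (hypothetical code, not from the posts above): computing all running sums of an array.

    ! O(n**2): each partial sum recomputed from scratch
    do i = 1, n
      s(i) = sum(a(1:i))
    end do

    ! O(n): carry the running sum along; for large n this wins by far
    ! more than any vectorization of the version above could
    s(1) = a(1)
    do i = 2, n
      s(i) = s(i-1) + a(i)
    end do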

Low level optimizations like vectorization should be mainly done by the compilers, not by developers.


As has been said, most compilers will vectorize your do loops IF they deem it safe; otherwise they will not. So using array syntax or explicit do loops will not make much difference, whether your data is well arranged in memory or badly.

The important thing is to avoid jumps in memory in your innermost loop (the index to the left, in Fortran) and to minimize recursive dependencies.
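
A minimal sketch of what “index to the left” means in practice (hypothetical arrays):

    ! Fortran is column-major: the first index is contiguous in memory.
    ! Good: innermost loop runs over the leftmost index, stride-1 access
    do iz = 1, lz
      do iy = 1, ly
        do ix = 1, lx
          c(ix,iy,iz) = a(ix,iy,iz) + b(ix,iy,iz)
        end do
      end do
    end do

    ! Bad: innermost loop runs over the rightmost index, so consecutive
    ! iterations jump lx*ly elements apart in memory
    do ix = 1, lx
      do iy = 1, ly
        do iz = 1, lz
          c(ix,iy,iz) = a(ix,iy,iz) + b(ix,iy,iz)
        end do
      end do
    end do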

Their definition is not completely wrong: this is the terminology used in the Python (and probably Julia) world, and it is a rather poor terminology as it collides with hardware vectorization. I was confused for a long time as well, since I was not aware that “vectorization” had a fully different meaning for them.

At least the terminology “array syntax” is clear and unambiguous.


Mostly yes. Still, one can help the compiler exploit the vector capabilities of the CPU by writing appropriate code, and array syntax can be one way of doing that.

Yes. The appropriate amount of array syntax can be determined by the following rule of thumb (which I like a lot): “If you have to think more than 10 seconds about how to express your calculation with array syntax, then you’d probably better write a loop instead.”
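
For instance (a made-up illustration, with hypothetical arrays):

    ! Natural as array syntax - write it that way:
    p(:) = u(:) + dt*v(:)

    ! A matrix-vector product (ignoring matmul for the sake of the
    ! example) *can* be written with array syntax via spread():
    !    y = sum(spread(x, dim=1, ncopies=m)*A, dim=2)
    ! but the loop takes far less than 10 seconds to understand:
    do i = 1, m
      y(i) = sum(A(i,:)*x(:))
    end do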

I think your rule will be helpful. It’s a great way to decide when to use array syntax versus loops. It helps keep the code clear and efficient. Thanks for sharing!

Compilers tend to be quite conservative in the instruction sets they will target. While the vast majority of x86 processors nowadays support Streaming SIMD Extensions (SSE), the higher-end processors have larger vector registers used by the AVX, AVX2 and AVX-512 extensions.

You must use the flag -march=native with gfortran or -xHost with Intel Fortran to target your specific processor. You can also choose to target a particular architecture; mine, for example, is Skylake. See -march=cpu-type or -x for more information.
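
For example (with main.f90 standing in for your source file):

    # target everything the build machine supports
    gfortran -O3 -march=native main.f90
    ifort -O3 -xHost main.f90

    # or pin a specific microarchitecture
    gfortran -O3 -march=skylake main.f90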

You can also ask the compiler for a report using,

  • -fopt-info with gfortran (and -fopt-info-missed for missed opportunities), or
  • -qopt-report=3 -qopt-report-phase=vec -qopt-report-file=stdout with ifort,

to check that vectorization is occurring. For example, for a mock-up of your loop I see the following:

LOOP BEGIN at /app/example.f90(28,1)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at /app/example.f90(29,5)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at /app/example.f90(30,9)
         remark #15542: loop was not vectorized: inner loop was already vectorized

         LOOP BEGIN at /app/example.f90(32,13)
            remark #15300: LOOP WAS VECTORIZED
            remark #15448: unmasked aligned unit stride loads: 3 
            remark #15450: unmasked unaligned unit stride loads: 7 
            remark #15451: unmasked unaligned unit stride stores: 1 
            remark #15475: --- begin vector cost summary ---
            remark #15476: scalar cost: 33 
            remark #15477: vector cost: 9.000 
            remark #15478: estimated potential speedup: 2.320 
            remark #15488: --- end vector cost summary ---
         LOOP END

         LOOP BEGIN at /app/example.f90(32,13)
         <Remainder loop for vectorization>
         LOOP END
      LOOP END
   LOOP END
LOOP END

I am not as pessimistic as you appear: I have previously found with ifort, and now find with gfortran, that vectorisation is implemented most of the time.
I choose to use -march=native and at least -O2 when compiling, so this is not difficult to achieve.

I also prefer to replace the inner loop with array syntax where I can, such as gx(:) = (fb(:,ix+1,iy,iz) - fb(:,ix,iy,iz))*factor, which implies contiguous memory usage suitable for AVXnnn. Array syntax also helps document the AVX / vectorisation intent.

In terms of AVX performance, my impression is that the vectors (such as fb(:,ix+1,iy,iz) and fb(:,ix,iy,iz) in the above example) must be in cache for an effective performance improvement; i.e., with large-stride addressing, memory transfer bandwidth can stall any AVX performance gains.

Aren’t vectorisation and SIMD instructions the same thing? Fortran array syntax can be a way of implying the same thing.
I would expect any optimising compiler to replace x(:) = y(:) + z(:) with SIMD instructions.

As for the question: Will using Vectorization speed up the program?
The appropriate answer is mostly YES.

At the hardware level there are many different features that can fit under the generic term “vectorization”: superscalar execution, pipelining, SIMD registers and instructions, …

I would think SIMD is a generic description also, while SSE and AVX are specific instruction groups

In some sense, yes. But “vectorization” is to me even “more generic”.

I’m not pessimistic either. @xsong mentioned using -O3, but nothing about using machine-specific flags.

The Intel Fortran documentation states this clearly:

If option -x or -march is not specified (Linux), or if option /Qx or /arch is not specified (Windows), the default target architecture supports Intel® SSE2 instructions. [emphasis added]

With gfortran, if you only specify -O3, you will get -mtune=generic -march=x86-64 by default. The latter is for “a generic CPU with 64-bit extensions”.

With both compilers you can use -### to see the full output being passed to the compiler driver:

For example part of what I get with -O3:

COLLECT_GCC_OPTIONS='-Wall' '-O3' '-shared-libgcc' '-mtune=generic' 
'-march=x86-64'
 /usr/lib/gcc/x86_64-linux-gnu/9/f951 test_char.f90 -quiet -dumpbase 
test_char.f90 "-mtune=generic" "-march=x86-64" -auxbase test_char -O3 -Wall 
-fintrinsic-modules-path /usr/lib/gcc/x86_64-linux-gnu/9/finclude 
"-fpre-include=/usr/include/finclude/math-vector-fortran.h" -o /tmp/cc04xV4I.s

and with -O3 -march=native:

COLLECT_GCC_OPTIONS='-Wall' '-O3' '-march=native' '-shared-libgcc'
 /usr/lib/gcc/x86_64-linux-gnu/9/f951 test_char.f90 "-march=icelake-client" 
-mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe 
-maes -msha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi 
-mno-sgx -mbmi2 -mno-pconfig -mno-wbnoinvd -mno-tbm -mavx -mavx2 -msse4.2 
-msse4.1 -mlzcnt -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw 
-madx -mfxsr -mxsave -mxsaveopt -mavx512f -mno-avx512er -mavx512cd 
-mno-avx512pf -mno-prefetchwt1 -mclflushopt -mxsavec -mxsaves -mavx512dq 
-mavx512bw -mavx512vl -mavx512ifma -mavx512vbmi -mno-avx5124fmaps 
-mno-avx5124vnniw -mno-clwb -mno-mwaitx -mno-clzero -mpku -mrdpid -mgfni 
-mno-shstk -mavx512vbmi2 -mavx512vnni -mvaes -mvpclmulqdq -mavx512bitalg 
-mavx512vpopcntdq -mno-movdiri -mno-movdir64b -mno-waitpkg -mno-cldemote 
-mno-ptwrite --param "l1-cache-size=48" --param "l1-cache-line-size=64" --param 
"l2-cache-size=16384" "-mtune=generic" -quiet -dumpbase test_char.f90 -auxbase 
test_char -O3 -Wall -fintrinsic-modules-path 
/usr/lib/gcc/x86_64-linux-gnu/9/finclude 
"-fpre-include=/usr/include/finclude/math-vector-fortran.h" -o /tmp/ccPu89Qm.s

Just like the Intel compiler, gfortran does appear to generate some SSE instructions even without an explicit -msse option when using the -O2 or -O3 flags, or at least it does so in recent releases (12 and 13). (Here is a linked news item from Intel: Build Innovation and Performance with GCC* 12.)

As an example, let’s look at this code:

subroutine arraysum(a,b)
    integer, parameter :: dp = kind(1.0d0)
    real(dp) :: a(16), b(16)
    a = a + b
end subroutine

With GCC 11.4 we get:

arraysum_:
        xor     eax, eax
.L2:
        movsd   xmm0, QWORD PTR [rdi+rax]
        addsd   xmm0, QWORD PTR [rsi+rax]
        movsd   QWORD PTR [rdi+rax], xmm0
        add     rax, 8
        cmp     rax, 128
        jne     .L2
        ret

But with GCC 12.1 you get:

arraysum_:
        xor     eax, eax
.L2:
        movupd  xmm0, XMMWORD PTR [rdi+rax]
        movupd  xmm1, XMMWORD PTR [rsi+rax]
        addpd   xmm0, xmm1
        movups  XMMWORD PTR [rdi+rax], xmm0
        add     rax, 16
        cmp     rax, 128
        jne     .L2
        ret

Do you spot the difference? The first one is using addsd (add scalar double) and incrementing the loop counter rax in steps of 8 bytes, or 1 element. The newer version is using addpd (add packed double), which is part of the SSE2 family, and incrementing the loop counter in steps of 16 bytes (two elements).

If I turn on the flag -march=icelake-client, it will use vaddpd, adding 4 doubles in parallel using the YMM registers. If I additionally turn on -mprefer-vector-width=512 it becomes:

arraysum_:
        vmovupd zmm1, ZMMWORD PTR [rdi]
        vaddpd  zmm0, zmm1, ZMMWORD PTR [rsi]
        vmovupd ZMMWORD PTR [rdi], zmm0
        vmovupd zmm0, ZMMWORD PTR [rsi+64]
        vaddpd  zmm0, zmm0, ZMMWORD PTR [rdi+64]
        vmovupd ZMMWORD PTR [rdi+64], zmm0
        vzeroupper
        ret

summing 16 doubles element-wise with two vaddpd instructions, using the AVX-512F instruction set and the ZMM registers.

This is also a reason why you should update your compiler. For example, the -march=icelake-client flag appeared with the GCC 8.1 release; for Sapphire Rapids, released this year, you’ll need at least GCC 11. To take advantage of the latest hardware features you can’t just stick to old tools.

The reason both compilers target the SSE extensions by default is that they’ve become ubiquitous. Look for example at the Steam Hardware and Software Survey (keeping in mind it only captures people who use Steam): practically every workstation has SSE2–SSE4 nowadays.

@ivanpribec,

Thanks for your post, as you highlight that my “default” compiler options are not “the” default settings.
It is also interesting that you have identified changes in gfortran’s use of AVX instructions from v11 to v12. Is this due to recognising newer hardware with -march=native (which I default to)?

The background problem I have with AVX, and with the initial question relating vectorisation to program speed, is: “Why don’t AVX instructions speed up my program?”

I especially find this when using my default gfortran settings of:
-fimplicit-none -fallow-argument-mismatch -O3 -march=native -ffast-math -fopenmp -fstack-arrays
where the problem of memory transfer bandwidth limits the ability to fill the AVX registers, especially when using multiple threads.
I have yet to check the performance with faster DDR5 memory, but I am reluctant to keep chasing the latest slightly faster hardware. I am convinced that recent Intel and AMD dual-channel-memory processors provide more threads/cores than their memory architecture can effectively support.

Ignoring all the other discussion about the failings of -ffast-math, the combination of -ffast-math and -fopenmp can be questioned where the computational bottleneck is memory bandwidth.

I am still firmly convinced that vectorization will speed up the program, but there are fairly common conditions where the delays in filling the AVX registers reduce its effectiveness.
My programming strategy for avoiding this problem is to provide the conditions under which the vectors are most likely already in the CPU cache. We know how easy this is to prescribe, using Fortran.
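
For instance, a generic loop-tiling sketch (illustrative only, with hypothetical arrays):

    ! Tile the outer dimensions so data reused by neighbouring iy
    ! iterations (here a y-direction stencil) stays in cache while the
    ! contiguous ix index feeds the AVX registers.
    integer, parameter :: blk = 16   ! tile size - tune for your cache
    do iz0 = 1, lz, blk
      do iy0 = 2, ly-1, blk
        do iz = iz0, min(iz0+blk-1, lz)
          do iy = iy0, min(iy0+blk-1, ly-1)
            do ix = 1, lx            ! stride-1: vectorizable
              c(ix,iy,iz) = a(ix,iy-1,iz) - 2.*a(ix,iy,iz) + a(ix,iy+1,iz)
            enddo
          enddo
        enddo
      enddo
    enddo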

That’s right, and it illustrates that in the HPC world the developer definitely has to be aware of what really happens at the low level when writing high-level code. This is not optional.


We usually compile our codes with:

ifort -O3 -xHost -ipo -qopenmp -qmkl=parallel

and

gfortran -Ofast -march=native -mtune=native -fomit-frame-pointer -fopenmp

for Intel Fortran and GFortran, respectively.

Most of the time, the compiler vectorizes the code and it runs very fast. There are some instances where hints in the form of !$OMP SIMD directives can improve the speed. However, most of the time these do nothing or even slow the code down, indicating that the compiler usually knows best.

Try to avoid dependencies between loop iterations (like cumulative variables) and conditional statements. In other words, make your loops as simple as possible so that the compiler can spot any potential optimizations.

Here are two !$OMP SIMD examples from our code which do actually speed it up:

!$OMP SIMD PRIVATE(a1,b1,a2,b2,t1,t2) SIMDLEN(8)
do i=1,n
  a1=dble(wf1(i)); b1=aimag(wf1(i))
  a2=dble(wf2(i)); b2=aimag(wf2(i))
  t1=a1**2+b1**2; t2=a2**2+b2**2
  mag1(i)=mag1(i)+wo2*(a1*a2+b1*b2)
  mag2(i)=mag2(i)+wo2*(a1*b2-b1*a2)
  mag3(i)=mag3(i)+wo*(t1-t2)
  rho(i)=rho(i)+wo*(t1+t2)
end do

and

!$OMP SIMD LASTPRIVATE(f1,f2,f3,f4) SIMDLEN(8)
do i=3,n-2
  f1=f(i-1); f2=f(i); f3=f(i+1); f4=f(i+2)
  df(i)=wc(1,i)*f1+wc(2,i)*f2+wc(3,i)*f3+wc(4,i)*f4
end do

Here is an example of a loop which doesn’t vectorize well:

do i=2,n-2
  j=i*4+1
  zsm=zsm+wp(j)*f(i-1)+wp(j+1)*f(i)+wp(j+2)*f(i+1)+wp(j+3)*f(i+2)
  g(i+1)=zsm
end do

Here we’ve helped a bit by using j=i*4+1 instead of j=j+4, so there is no inter-iteration dependency on j. If we wanted only the total zsm, we could use a reduction clause, REDUCTION(+:zsm), and each SIMD lane would add up its own zsm and sum the results at the end of the loop. However, we also have to store each partial sum as g(i+1)=zsm, which makes vectorization difficult. (Anybody have an idea how to vectorize this with !$OMP SIMD?)
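
One untested idea would be the OpenMP 5.0 scan construct, which exists precisely for loops that need every partial sum; whether it actually vectorizes well depends on the compiler’s scan support:

    !$OMP SIMD REDUCTION(INSCAN,+:zsm)
    do i=2,n-2
      j=i*4+1
      ! input phase: update the running sum
      zsm=zsm+wp(j)*f(i-1)+wp(j+1)*f(i)+wp(j+2)*f(i+1)+wp(j+3)*f(i+2)
      !$OMP SCAN INCLUSIVE(zsm)
      ! scan phase: each iteration sees its inclusive prefix sum
      g(i+1)=zsm
    end do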

However, it is important not to assume that !$OMP SIMD will speed up your code: you absolutely have to test each case. A corollary to this is to keep testing as the years go by; as @ivanpribec showed above, compilers evolve with time and your present optimizations may become future anti-optimizations.


For some problems you may very well be memory bandwidth limited, and improvements to the cores’ integer/floating point math performance will just let you realize that even faster.

In my own practice, I have found that if you stick to strict IEEE arithmetic you get very poorly optimized code. I have also noticed very little benefit, or even a speed loss, from using -ffast-math over -funsafe-math-optimizations with gfortran in my use cases (typically image convolutions and matrix operations). My established set of flags for compiling my codes is -O3 -funsafe-math-optimizations -march=native.

Does anyone have memory bandwidth numbers handy for RAM, level-2 cache, and level-1 cache for some of the current Intel, AMD, and ARM CPUs?

Typically, if there is only one floating-point operation per memory fetch, then RAM bandwidth determines the performance. However, for operations like matrix-matrix products each memory fetch is reused N times, where N is the matrix dimension, so cache bandwidth determines the performance.
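
As a rough sketch of the two regimes (hypothetical arrays, my own illustration):

    ! Bandwidth-bound: one multiply-add per two loads and one store,
    ! so RAM speed, not the FPU, sets the pace (a "stream triad")
    do i = 1, n
      y(i) = y(i) + alpha*x(i)
    end do

    ! Compute-bound: a naive matrix product reuses each element of
    ! a and b N times, so once a block fits in cache the floating
    ! point units become the limit
    do j = 1, n
      do k = 1, n
        do i = 1, n
          c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
      end do
    end do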