Performance of vectorized code in ifort and ifx

Here are some changes that made it faster with ifx:

  • use of explicit-shape arrays
  • use of a derived type instead of global variables
  • the maxloc intrinsic from ifort is used, as it turns out to be slightly faster (better compiler vectorization); ifx versions prior to 2024.0 use a scalar loop
  • use of !$omp simd directives to encourage vectorization of the inner loop
module mod_vfi

    implicit none
    private

    public :: bellman_op, params
    
    type :: params
        integer :: n_a, n_z
        real(8) :: beta, gamma, R
        real(8), allocatable :: a_grid(:)
        real(8), allocatable :: ap_grid(:)
        real(8), allocatable :: z_grid(:)
        real(8), allocatable :: z_tran(:,:)
    end type

contains
    
    subroutine bellman_op(v_new, pol_ap, v, p)
        implicit none
        ! Inputs:
        type(params), intent(in) :: p
        real(8), intent(in) :: v(p%n_a,p%n_z)
        ! Outputs:
        real(8), intent(out) :: v_new(p%n_a,p%n_z), pol_ap(p%n_a,p%n_z)

        ! Locals:
        real(8) :: ev(p%n_a,p%n_z)
        real(8) :: cons(p%n_a), v_temp(p%n_a)

        real(8) :: a_val, z_val
        integer :: a_c, z_c, ap_ind

        ! Compute expected value function EV(a',z) = sum over z' of v(a',z')*z_tran(z,z')
        EV = matmul(v,transpose(p%z_tran))

        ! Step through the state space
        do z_c=1,p%n_z
            do a_c=1,p%n_a

                ! Current states
                a_val = p%a_grid(a_c)
                z_val = p%z_grid(z_c)

                ! Array of constraints
                cons = p%R*a_val + z_val - p%ap_grid

                call max_ap_loc(p%n_a, cons, ev(:,z_c), p%beta, p%gamma, &
                    v_temp,  ap_ind)

                v_new(a_c,z_c)  = v_temp(ap_ind)
                pol_ap(a_c,z_c) = p%ap_grid(ap_ind)
            enddo
        enddo

    contains
            
    ! Choose a' optimally by stepping through all possible values      
    subroutine max_ap_loc(n, cons, ev, beta, gamma, vt, id)
        implicit none
        integer, intent(in) :: n
        real(8), intent(in) :: cons(n), ev(n), beta, gamma
        real(8), intent(out) :: vt(n)
        integer, intent(out) :: id

        interface
            ! ifx versions < 2024.0 generate scalar loops.
            ! The ifort version on the other hand, seems to be 
            ! vectorized well when compiled with -O3 -xHOST
            pure function ifort_maxloc(n,v) result(idx)
               integer, intent(in) :: n
               real(8), intent(in) :: v(n)
               integer :: idx
            end function
        end interface

        real(8), parameter :: dbl_min = -huge(1.0d0)
        integer :: i

        !$omp simd
        do i = 1, n
            if (cons(i) > 0.0d0) then
                vt(i) = ff_util(cons(i), gamma) + beta*ev(i)
            else
                vt(i) = dbl_min
            end if
        end do

#if defined(__INTEL_LLVM_COMPILER) || defined(__GFORTRAN__) 
        id = ifort_maxloc(n,vt)
#else
        id = maxloc(vt,dim=1)
#endif
    end subroutine

    pure function ff_util(c, gamma) result(util)
    !$omp declare simd uniform(gamma) inbranch

        ! Purpose: utility function
        ! Assumption: c is positive, this is enforced
        ! elsewhere in the code
        real(8), intent(in) :: c, gamma
        real(8) :: util
        
        if (abs(gamma - 1.0d0) < 1.0d-6) then
            util = log(c)
        else
            util = (c**(1.0d0 - gamma))/(1.0d0 - gamma)
        endif
    end function

    end subroutine 

end module mod_vfi

(Edit: Added a missing end subroutine statement, that got cut while copying. The code now compiles.)

I get the same error as the original:

~/bellman$ make main FC=ifx FFLAGS="-O3 -xHOST -qopenmp-simd -mprefer-vector-width=512 -fopenmp -qmkl"
ifort -c -O3 -xHOST -diag-disable=10448 ifort_maxloc.f90
ifx -o main -O3 -xHOST -qopenmp-simd -mprefer-vector-width=512 -fopenmp -qmkl -fopenmp -L/opt/intel/oneapi/compiler/latest/lib  main.F90 ifort_maxloc.o 
~/bellman$ ./main 0 1000
 Starting program...
 Starting VFI...
 Method =            0
 n_a =         1000
 Iteration:          276
 Error:  9.800499998213752E-007
~/bellman$ make original FFLAGS="-O3 -march=native -fopenmp"
gfortran-13 -o original -O3 -march=native -fopenmp original.F90 ifort_maxloc.o
~/bellman$ ./original 1 1000
 Starting program...
 Starting VFI...
 Method =            1
 Iteration:          276
 Error:   9.8004999982137520E-007

When I plot the new one in the same chart as earlier, it looks like this:

perf_opt

If I look again at the APS performance snapshot (the numbers in bracket are the initial values from ifort), the various metrics have also improved:

  Elapsed Time:                               0.35 s             (0.63 s)
  SP GFLOPS:                                  0.00               (0.65)
  DP GFLOPS:                                 33.87               (12.05)
  Average CPU Frequency:                      4.91 GHz           (4.76 GHz)
  IPC Rate:                                   1.93               (1.46)
  Vectorization:                             99.50%
     Instruction Mix:
       SP FLOPs:                              0.00% of uOps
       DP FLOPs:                             19.40% of uOps      (7.20% of uOps)
          Packed:                            99.50% from DP FP   (94.60% from DP FP)
             128-bit:                         0.00%
             256-bit:                         0.10%
             512-bit:                        99.40%              (94.50%)
          Scalar:                             0.50% from DP FP
       Non-FP:                               80.60% of uOps      (92.30% of uOps

So ifx can perform as fast or even faster as ifort. Unfortunately, auto-vectorization is not a panacea, and is prone to failure. Matt Pharr put it like this:

The problem with an auto-vectorizer is that as long as vectorization can fail (and it will), then if you’re a programmer who actually cares about what code the compiler generates for your program, you must come to deeply understand the auto-vectorizer. Then, when it fails to vectorize code you want to be vectorized, you can either poke it in the right ways or change your program in the right ways so that it works for you again. This is a horrible way to program; it’s all alchemy and guesswork and you need to become deeply specialized about the nuances of a single compiler’s implementation—something you wouldn’t otherwise need to care about one bit.


Edit: here is how the multi-threaded speed-up looks like, I increased n_z = 100, and set the tolerance to 1.0d-7 so there would be a bit more work. For 1-4 threads it’s roughly linear, but then it starts to veer downward for some reason.

scaling_plot

The APS snapshot looks like this:

5 Likes