Performance of vectorized code in ifort and ifx

aledinola · April 7, 2024, 12:55pm

I created a small Fortran example of discrete dynamic programming that occurs quite often in macroeconomics. It is an optimal savings problem for a consumer who received idiosyncratic shocks to their income.

I was interested in comparing loop-based Fortran code vs vectorized code, both in ifort and in ifx. Just to avoid misunderstandings, the definition of “vectorizing” that I am using is the following: I vectorize if I replace this code

do i=1,n
yvec(i) = exp(xvec(i))
enddo

with the following code:

yvec = exp(xvec)

With this in mind, I coded the dynamic programming example in three different subroutines:

bellman_op, which is the benchmark based entirely on loops (three do loops)
bellman_op_vec, which is partially vectorized (I eliminated the innermost loop over a’)
bellman_op_vec2, which is even more vectorized (I eliminated the two innermost loops)

bellman_op is here:

subroutine bellman_op(v_new,pol_ap, v)
    ! Purpose: Bellman operator, it does one step of the VFI
    ! This version is loop-based and can be parallelized with OpenMP
    ! For vectorized versions, see subroutine bellman_op_vec, bellman_op_vec2
    implicit none
    ! Inputs:
    real(8), intent(in) :: v(:,:)
    ! Outputs:
    real(8), intent(out) :: v_new(:,:), pol_ap(:,:)
    ! Locals:
    real(8), allocatable :: EV(:,:)
    integer :: a_c, z_c, ap_c, ap_ind
    real(8) :: a_val, z_val, aprime_val, cons, v_max, v_temp

    ! Compute expected value function EV(a',z) = sum over z' of v(a',z')*z_tran(z,z')
    EV = matmul(v,transpose(z_tran))

    ! Step through the state space
    !$omp parallel if (par_fortran==1) default(shared) & 
    !$ private(z_c,a_c,a_val,z_val,v_max,ap_ind,ap_c,aprime_val,cons,v_temp)
    !$omp do collapse(2)
    do z_c=1,n_z
        do a_c=1,n_a
            ! Current states
            a_val = a_grid(a_c)
            z_val = z_grid(z_c)
            ! Initialize v_new
            v_max = large_negative
            ap_ind = 0
            ! Choose a' optimally by stepping through all possible values
            do ap_c=1,n_a
                aprime_val = ap_grid(ap_c)
                cons = R*a_val + z_val - aprime_val
                if (cons>0.0d0) then
                    v_temp = f_util(cons) + beta*EV(ap_c,z_c)
                    !v_temp = f_util(cons) + beta*sum(v(ap_c,:)*z_tran(z_c,:))
                    if (v_temp>v_max) then
                        v_max = v_temp
                        ap_ind = ap_c
                    end if
                endif
            enddo !end a'
            v_new(a_c,z_c)  = v_max
            pol_ap(a_c,z_c) = ap_grid(ap_ind)
        enddo
    enddo
    !$omp enddo
    !$omp end parallel
    
    end subroutine bellman_op

bellman_op_vec:

subroutine bellman_op_vec(v_new,pol_ap, v)
     ! Purpose: Bellman operator, it does one step of the VFI
     ! This version is partially vectorized (eliminated loop over a')
     ! For a loop-based version, see subroutine bellman_op
     ! For a fully vectorized version, see subroutine bellman_op_vec2
     implicit none
     ! Inputs:
     real(8), intent(in) :: v(:,:)
     ! Outputs:
     real(8), intent(out) :: v_new(:,:), pol_ap(:,:)
     ! Locals:
     real(8), allocatable :: EV(:,:), cons(:), v_temp(:)
     integer :: a_c,z_c,ap_ind,ap_c
     real(8) :: a_val,z_val
 
     allocate(cons(n_a),v_temp(n_a))
 
   ! Compute expected value function EV(a',z) = sum over z' of v(a',z')*z_tran(z,z')
     EV = matmul(v,transpose(z_tran))
 
     ! Step through the state space
     !$omp parallel if (par_fortran==1) default(shared) &
     !$ private(z_c,a_c,a_val,z_val,ap_ind,ap_c,cons,v_temp)
     !$omp do collapse(2)
     do z_c=1,n_z
         do a_c=1,n_a
             ! Current state
             a_val = a_grid(a_c)
             z_val = z_grid(z_c)
             ! Compute consumption
             cons = R*a_val + z_val - ap_grid ! (n_ap,1)
             ! NOTE: where and merge are slower than forall
             ! NOTE: forall and do concurrent are equivalent with ifort
             ! but do concurrent is very slow with ifx!
             !where (cons>0.0d0)
             !    util = f_util(cons)  ! (n_ap,1)
             !elsewhere
             !    util = large_negative
             !end where
             !util = merge(f_util(cons),large_negative,cons>0.0d0)
             ! v_temp = large_negative
             ! do concurrent (ap_c=1:n_a, cons(ap_c)>0.0d0)
             !    v_temp(ap_c) = f_util(cons(ap_c))+beta*EV(ap_c,z_c)
             ! enddo
             v_temp = large_negative
             forall (ap_c=1:n_a, cons(ap_c)>0.0d0)
                v_temp(ap_c) = f_util(cons(ap_c))+beta*EV(ap_c,z_c)
             end forall
             ap_ind = maxloc(v_temp,dim=1)
             v_new(a_c,z_c) = v_temp(ap_ind)
             pol_ap(a_c,z_c) = a_grid(ap_ind)
         enddo
     enddo
     !$omp enddo
     !$omp end parallel
 
     end subroutine bellman_op_vec

and finally bellman_op_vec2:

subroutine bellman_op_vec2(v_new,pol_ap, v)
     ! Purpose: Bellman operator, it does one step of the VFI
     ! This version is partially vectorized (eliminated loops over a and a')
     ! For a loop-based version, see subroutine bellman_op
     implicit none
     ! Inputs:
     real(8), intent(in) :: v(:,:)
     ! Outputs:
     real(8), intent(out) :: v_new(:,:), pol_ap(:,:)
     ! Locals:
     real(8), allocatable :: EV(:,:), cons(:,:), v_temp(:,:)
     integer, allocatable :: ap_ind(:)
     integer :: z_c,ap_c,a_c
     real(8) :: z_val
 
     allocate(cons(n_ap,n_a),v_temp(n_ap,n_a))
 
     ! Compute expected value function EV(a',z) = sum over z' of v(a',z')*z_tran(z,z')
     EV = matmul(v,transpose(z_tran))
 
     ! Step through the state space
     do z_c=1,n_z
         ! Current state
         z_val = z_grid(z_c) 
         ! Compute consumption
         cons = R*a_grid2 + z_val - ap_grid2 ! (n_ap,n_a)
         v_temp = large_negative
         forall (a_c=1:n_a,ap_c=1:n_ap, cons(ap_c,a_c)>0.0d0)
             v_temp(ap_c,a_c) = f_util(cons(ap_c,a_c))+beta*EV(ap_c,z_c)
         end forall
         ! v_temp is a 2-dim array(a',a), we need to find the max along a' 
         ap_ind = maxloc(v_temp,dim=1)
         v_new(:,z_c) = maxval(v_temp,dim=1)
         pol_ap(:,z_c) = a_grid(ap_ind)
     enddo
     
     end subroutine bellman_op_vec2

What I have found, to my partial surprise, is that the vectorized versions are either the same or worse than the loops-based version and that ifx is much slower than ifort (these might be two totally separate issues, but I didn’t want to write two posts). Here are the timings (I report the median time across 10 runs for each case)

Column 1	Column 2	Column 3
Version	Time in secs (IFORT)	Time in secs (IFX)
All loops (bellman_op)	0.78	1.63
Vectorized loop over a’ (bellman_op_vec)	0.75	1.8
Vectorized loops over a’ and a (bellman_op_vec2)	1.3	2.18

Other observations:

When working with arrays in the vectorized code, I found that the constructs where and merge are slower than forall or do concurrent. In ifort, forall and do concurrent give similar performance, whereas in ifx do concurrent is super slow. All the timings reported above are based on forall for the vectorized versions.
As compilation flags I use the following
/O3 /fast /Qmkl /Qopenmp /Qparallel
but ifx returns the warning that /Qparallel is not supported. This might be a problem because can the compiler optimize the forall and do concurrent without the parallel option?

If someone is interested in replicating my code, I attach here below the full version. The above code is also available on github here

! Title: Solving the income fluctuation problem with vectorization/parallelization
! Author: Alessandro Di Nola
! References: 
!  (1) https://julia.quantecon.org/dynamic_programming/ifp.html
!  (2) Dynamic Programming on the GPU via JAX
!   QE note https://notes.quantecon.org/submission/622ed4daf57192000f918c61
!===============================================================================!
module mod_numerical
    implicit none
    
    private
    public :: linspace
    
    contains
    
    function linspace(my_start, my_stop, n)
        ! Purpose: replicate Matlab function <linspace>
        implicit none
        ! Inputs:
        integer :: n
        real(8) :: my_start, my_stop
        ! Function result:
        real(8) :: linspace(n)
        ! Locals:
        integer :: i
        real(8) :: step, grid(n)

        if (n == 1) then
            grid(n) = (my_start+my_stop)/2.d0
        elseif (n>=2) then
            grid(1) = my_start
            if (n>2) then
                step = (my_stop-my_start)/(real(n-1,8))
                do i = 2, n-1
                    grid(i) = grid(i-1) + step
                end do
            end if
            grid(n) = my_stop
        endif
        linspace = grid
    
    end function linspace

end module mod_numerical
!===============================================================================!

module mod_globals
implicit none

real(8), parameter :: large_negative = -1.0d10
character(len=*), parameter :: savedir = "output\"
real(8) :: R, beta, gamma, a_min, a_max, rho, sig_z
integer :: n_a, n_z, n_ap, par_fortran

real(8), allocatable :: a_grid(:), ap_grid(:), z_grid(:), z_tran(:,:)
real(8), allocatable :: a_grid2(:,:), ap_grid2(:,:) !needed for vectorized code

contains
    
    elemental function f_util(c) result(util)
        ! Purpose: utility function
        ! Assumption: c is positive, this is enforced
        ! elsewhere in the code
        real(8), intent(in) :: c
        real(8) :: util
        
        if (abs(gamma-1.0d0)<1.0d-6) then
            util = log(c)
        else
            util = c**(1.0d0-gamma)/(1.0d0-gamma)
        endif
    
    end function f_util
    
end module mod_globals
!===============================================================================!

module mod_vfi 
    use mod_globals
    use omp_lib
    implicit none
    
    private
    public :: bellman_op, bellman_op_vec, bellman_op_vec2
    
    contains
    
    subroutine bellman_op(v_new,pol_ap, v)
    ! Purpose: Bellman operator, it does one step of the VFI
    ! This version is loop-based and can be parallelized with OpenMP
    ! For vectorized versions, see subroutine bellman_op_vec, bellman_op_vec2
    implicit none
    ! Inputs:
    real(8), intent(in) :: v(:,:)
    ! Outputs:
    real(8), intent(out) :: v_new(:,:), pol_ap(:,:)
    ! Locals:
    real(8), allocatable :: EV(:,:)
    integer :: a_c, z_c, ap_c, ap_ind
    real(8) :: a_val, z_val, aprime_val, cons, v_max, v_temp

    ! Compute expected value function EV(a',z) = sum over z' of v(a',z')*z_tran(z,z')
    EV = matmul(v,transpose(z_tran))

    ! Step through the state space
    !$omp parallel if (par_fortran==1) default(shared) & 
    !$ private(z_c,a_c,a_val,z_val,v_max,ap_ind,ap_c,aprime_val,cons,v_temp)
    !$omp do collapse(2)
    do z_c=1,n_z
        do a_c=1,n_a
            ! Current states
            a_val = a_grid(a_c)
            z_val = z_grid(z_c)
            ! Initialize v_new
            v_max = large_negative
            ap_ind = 0
            ! Choose a' optimally by stepping through all possible values
            do ap_c=1,n_a
                aprime_val = ap_grid(ap_c)
                cons = R*a_val + z_val - aprime_val
                if (cons>0.0d0) then
                    v_temp = f_util(cons) + beta*EV(ap_c,z_c)
                    !v_temp = f_util(cons) + beta*sum(v(ap_c,:)*z_tran(z_c,:))
                    if (v_temp>v_max) then
                        v_max = v_temp
                        ap_ind = ap_c
                    end if
                endif
            enddo !end a'
            v_new(a_c,z_c)  = v_max
            pol_ap(a_c,z_c) = ap_grid(ap_ind)
        enddo
    enddo
    !$omp enddo
    !$omp end parallel
    
    end subroutine bellman_op

    subroutine bellman_op_vec(v_new,pol_ap, v)
    ! Purpose: Bellman operator, it does one step of the VFI
    ! This version is partially vectorized (eliminated loop over a')
    ! For a loop-based version, see subroutine bellman_op
    ! For a fully vectorized version, see subroutine bellman_op_vec2
    implicit none
    ! Inputs:
    real(8), intent(in) :: v(:,:)
    ! Outputs:
    real(8), intent(out) :: v_new(:,:), pol_ap(:,:)
    ! Locals:
    real(8), allocatable :: EV(:,:), cons(:), v_temp(:)
    integer :: a_c,z_c,ap_ind,ap_c
    real(8) :: a_val,z_val

    allocate(cons(n_a),v_temp(n_a))

    ! Compute expected value function EV(a',z) = sum over z' of v(a',z')*z_tran(z,z')
    EV = matmul(v,transpose(z_tran))

    ! Step through the state space
    !$omp parallel if (par_fortran==1) default(shared) & 
    !$ private(z_c,a_c,a_val,z_val,ap_ind,ap_c,cons,v_temp)
    !$omp do collapse(2)
    do z_c=1,n_z
        do a_c=1,n_a
            ! Current state
            a_val = a_grid(a_c)
            z_val = z_grid(z_c)
            ! Compute consumption
            cons = R*a_val + z_val - ap_grid ! (n_ap,1)
            ! NOTE: where and merge are slower than forall
            ! NOTE: forall and do concurrent are equivalent with ifort
            ! but do concurrent is very slow with ifx!
            !where (cons>0.0d0)
            !    util = f_util(cons)  ! (n_ap,1)
            !elsewhere
            !    util = large_negative
            !end where
            !util = merge(f_util(cons),large_negative,cons>0.0d0)
            ! v_temp = large_negative
            ! do concurrent (ap_c=1:n_a, cons(ap_c)>0.0d0)
            !    v_temp(ap_c) = f_util(cons(ap_c))+beta*EV(ap_c,z_c)
            ! enddo
            v_temp = large_negative
            forall (ap_c=1:n_a, cons(ap_c)>0.0d0)
               v_temp(ap_c) = f_util(cons(ap_c))+beta*EV(ap_c,z_c)
            end forall
            ap_ind = maxloc(v_temp,dim=1)
            v_new(a_c,z_c) = v_temp(ap_ind)
            pol_ap(a_c,z_c) = a_grid(ap_ind)
        enddo
    enddo
    !$omp enddo
    !$omp end parallel

    end subroutine bellman_op_vec

subroutine bellman_op_vec2(v_new,pol_ap, v)
    ! Purpose: Bellman operator, it does one step of the VFI
    ! This version is partially vectorized (eliminated loops over a and a')
    ! For a loop-based version, see subroutine bellman_op
    implicit none
    ! Inputs:
    real(8), intent(in) :: v(:,:)
    ! Outputs:
    real(8), intent(out) :: v_new(:,:), pol_ap(:,:)
    ! Locals:
    real(8), allocatable :: EV(:,:), cons(:,:), v_temp(:,:)
    integer, allocatable :: ap_ind(:)
    integer :: z_c,ap_c,a_c
    real(8) :: z_val

    allocate(cons(n_ap,n_a),v_temp(n_ap,n_a))

    ! Compute expected value function EV(a',z) = sum over z' of v(a',z')*z_tran(z,z')
    EV = matmul(v,transpose(z_tran))

    ! Step through the state space
    do z_c=1,n_z
        ! Current state
        z_val = z_grid(z_c) 
        ! Compute consumption
        cons = R*a_grid2 + z_val - ap_grid2 ! (n_ap,n_a)
        v_temp = large_negative
        forall (a_c=1:n_a,ap_c=1:n_ap, cons(ap_c,a_c)>0.0d0)
            v_temp(ap_c,a_c) = f_util(cons(ap_c,a_c))+beta*EV(ap_c,z_c)
        end forall
        ! v_temp is a 2-dim array(a',a), we need to find the max along a' 
        ap_ind = maxloc(v_temp,dim=1)
        v_new(:,z_c) = maxval(v_temp,dim=1)
        pol_ap(:,z_c) = a_grid(ap_ind)
    enddo
    
    end subroutine bellman_op_vec2

end module mod_vfi
!===============================================================================!

program main
    use mod_globals
    use mod_vfi
    use mod_utilities, only: disp,writescalar,writedim
    use mod_numerical, only: linspace
    use omp_lib 

    implicit none
    
    ! Declarations
    integer :: istat,it,verbose,maxit,z_c,a_c,method 
    real(8) :: t1,t2,err,tol
    !integer, allocatable :: pol_ap_ind(:,:)
    real(8), allocatable :: V(:,:),V_new(:,:),pol_ap(:,:),pol_c(:,:)
    
    write(*,*) "Starting program..."
    
    ! ---------------------- Set numerical parameters ----------------------!
    method      = 1       ! 0: loop-based, 1: vectorized, 2: vectorized2
    verbose     = 1       ! If 1, print iteration info
    par_fortran = 0       ! If 1, parallelize with OpenMP
    tol         = 1.0d-6  ! Tolerance for VFI
    maxit       = 1000    ! Maximum number of iterations

    ! ---------------------- Set economic parameter values ----------------------!
    R     = 1.03d0 ! Gross interest rate, R = 1+r
    beta  = 0.96d0 ! Discount factor
    gamma = 1.0d0  ! Coeff. rel. risk aversion (if 1, log utility)
    
    ! ---------------------- Set grids and probabilities ------------------------!
    a_min = 0.0d0 ! Minimum value of asset grid
    a_max = 4.0d0 ! Maximum value of asset grid
    n_a   = 1000  ! Number of grid points for a (state variable)
    n_ap  = n_a   ! Number of grid points for a' (choice variable)
    a_grid  = linspace(a_min,a_max,n_a)
    ap_grid = a_grid
    !call disp("a_grid = ",a_grid(1:100))
    ! Expand a_grid to 2D
    allocate(a_grid2(n_a,n_a),ap_grid2(n_a,n_a),stat=istat)
    if (istat/=0) error stop "Allocation of a_grid2 and ap_grid2 failed!"
    
    do a_c=1,n_a
        a_grid2(a_c,:)  = a_grid !varies along rows
        ap_grid2(:,a_c) = ap_grid !varies along columns
    enddo

    !call disp("a_grid2  = ",a_grid2)
    !call disp("ap_grid2 = ",ap_grid2)
    !pause 
    
    n_z   = 2 ! Number of grid points for z (shock)
    allocate(z_grid(n_z),z_tran(n_z,n_z),stat=istat)
    if (istat/=0) error stop "Allocation of z_grid and z_tran failed!"
    
    z_grid = [0.5d0,1.0d0]
    ! Transition matrix for z (rows sum to 1)
    z_tran(1,:) = [0.6d0,0.4d0]
    z_tran(2,:) = [0.05d0,0.95d0]
    
    !call disp("z_grid = ",z_grid)
    !call disp("z_tran = ",z_tran)

    ! ---------------------- Initialize value function ----------------------!
    allocate(V(n_a,n_z),V_new(n_a,n_z),pol_ap(n_a,n_z),pol_c(n_a,n_z),stat=istat)
    if (istat/=0) error stop "Allocation of V and pol_ap failed!"

    V = 0.0d0 ! Initial guess for value function
    it = 1
    err = tol+1.0d0
    
    write(*,*) "Starting VFI..."
    write(*,*) "Method = ",method
    t1 = omp_get_wtime()

    do while (err>tol .and. it<maxit)

        if (method==0) then
            call bellman_op(V_new,pol_ap, V)
        elseif (method==1) then
            call bellman_op_vec(V_new,pol_ap, V)
        elseif (method==2) then
            call bellman_op_vec2(V_new,pol_ap, V)
        else
            error stop "Invalid method!"
        endif
        err = maxval(abs(V_new-V))

        if (verbose==1) then
            write(*,*) "Iteration ",it," error = ",err
        endif

        ! Update
        V = V_new
        it = it + 1
    enddo
    
    t2 = omp_get_wtime()
    
    write(*,*) "============================================================="
    write(*,*) 'VFI runs for',real(t2-t1),'seconds.'
    write(*,*) "============================================================="
    !pause 
    
    ! Compute policy function for consumption
    do z_c=1,n_z
        pol_c(:,z_c) = R*a_grid + z_grid(z_c) - pol_ap(:,z_c)
    enddo
    
    write(*,*) "Writing results to file..."
    call sub_write()
    
    
    contains
    
    subroutine sub_write()
    ! NOTE: The subroutines writescalar and writedim are defined in mod_utilities
    ! If you don't have mod_utilities, just comment out the calls to these subroutines
        call writescalar(n_a,trim(savedir)//'n_a.txt')
        call writescalar(n_z,trim(savedir)//'n_z.txt')
        
        ! Write grids and probs to files
        call writedim(z_grid,trim(savedir)//'z_grid.txt')
        call writedim(a_grid,trim(savedir)//'a_grid.txt')
        call writedim(V,trim(savedir)//'V.txt')
        call writedim(pol_ap,trim(savedir)//'pol_ap.txt')
        call writedim(pol_c,trim(savedir)//'pol_c.txt')
        !call writedim(mu,trim(savedir)//'mu.txt')
    
    end subroutine sub_write
    
end program main

Note that in the code there are some openMP directives but they are not used at the moment (since the variable par_fortran is set to zero).

Any comment or suggestion on how to improve the code is welcome! (Including on how to get rid of the smile face in some array definitions
Overall, I find it quite odd that ifx gives a code that is significantly slower compared to ifort and that now Intel says that ifort is deprecated.

urbanjost · April 7, 2024, 1:07pm

    ```text
    put text here like the table to get fixed-font text
    ``
    ```fortran
    put fortran code here
    ```

start the line with three grave and an optional type like “text” or “bash” or “fortran” and then follow that line with your text or code and then close the text block with a line that is just three grave characters

Markdown Guide:

https://markdown-it.github.io/

You can edit what you already entered and add the block delimiters to make the post text appear properly

aledinola · April 7, 2024, 1:33pm

Thanks for your suggestions about typesetting. I hope now the code is easier to read. I did not manage to improve much the layout of the table, though.

ivanpribec · April 7, 2024, 1:47pm

Two comments:

mod_utilities are missing, so the internal subroutine sub_write can be removed
The OpenMP directives are not accepted by gfortran, due to the way the lines are broken. gfortran expects full clauses on a single line, e.g.

    !$omp parallel if (par_fortran==1) default(shared) &
    !$omp private(z_c,a_c,a_val,z_val, ap_ind,ap_c,cons,v_temp)
    !$omp do collapse(2)

urbanjost · April 7, 2024, 2:27pm

For the table it was an actual markdown table not fixed text, so do not use the grave delimiters, just start the lines with a vertical line character and no blank lines between rows

Version	Time in secs (IFORT)	Time in secs (IFX)
All loops (bellman_op)	0.78	1.63
Vectorized loop over a’ (bellman_op_vec)	0.75	1.8
Vectorized loops over a’ and a (bellman_op_vec2)	1.3	2.18

aledinola · April 7, 2024, 2:44pm

Thanks for your comments

The module mod_utilities is available on this github link. If someone wants to replicate my example, the best way is to download this project of mine from github.
I have updated the openMP directives so that they are ok also for gfortran, thanks!

Chuckyvt · April 7, 2024, 2:44pm

Similar questions have been posted in the Intel forum. They first thing they are going to recommend is to compile and link with Qipo to see the compiler can inline the functions. Outside of that, I believe IFX performance wise is still a work in progress relative to iFort.

Edit: see now /fast, which should have applied /Qipo.

aledinola · April 7, 2024, 2:47pm

Thanks, but I have the option /fast which should include /Qipo, if I am not mistaken.

Edit: saw your edit. Just to be on the safe side, I added /Qipo on top of /fast and the timing is the same

On ifort vs ifx: I agree with you, but if ifx is still a work in progress, why label ifort as deprecated? This seems to suggest that programmers should stop using ifort

ivanpribec · April 7, 2024, 3:27pm

To study the performance, the more relevant metric is iterations per second (or the inverse, time per iteration), which is independent of the tolerance.

Since the vectorization applies along the dimension of size n_a it would be nice to define a measure that includes the problem size. Something like “updates per grid point per second”: n_a * it / elapsed_time. Does the total amount of work scale quadratically with n_a?

perf_O2

perf_O3

(The same version of gfortran was in both plots.)

aledinola · April 7, 2024, 3:35pm

That’s quite interesting! I’ll try to do the same in ifort/ifx. The message from your graph is however that the code with loops is always faster than the vectorized code, right? And the more you vectorize, the worse the performance is.

ivanpribec · April 7, 2024, 3:52pm

It took me a while to comprehend what you mean here, and I’m not sure this can be inferred from the graph. Yes, the performance of the version using array expressions is worse, but I don’t think this disparity in runtime grows bigger as the problem size increases.

But a metric which included n_a would help determine if the CPU is being used effectively.

Using Intel’s Application Performance Snapshot, gives the following numbers (method = 0):

gfortran -O2 -march=native -fopenmp:

  Elapsed Time:                               1.68 s
  SP GFLOPS:                                  0.00
  DP GFLOPS:                                  6.43
  Average CPU Frequency:                      4.96 GHz
  IPC Rate:                                   3.36
  Physical Core Utilization:                 13.50%
  Average Physical Core Utilization:          1.08 out of 8 Physical Cores
  Memory Stalls:                              3.70% of Pipeline Slots
    Cache Stalls:                             2.60% of Cycles
    DRAM Stalls:                              0.90% of Cycles
    DRAM Bandwidth
       Peak:                                  5.23 GB/s
       Average:                               0.09 GB/s
       Bound:                                 0.00%
  Vectorization:                              0.00%
     Instruction Mix:
       SP FLOPs:                              0.00% of uOps
       DP FLOPs:                             31.80% of uOps
          Packed:                             0.00% from DP FP
             128-bit:                         0.00%
             256-bit:                         0.00%
             512-bit:                         0.00%
          Scalar:                           100.00% from DP FP
       Non-FP:                               68.20% of uOps
     FP Arith/Mem Rd Instr. Ratio:            1.21
     FP Arith/Mem Wr Instr. Ratio:           13.53
 Memory Footprint:
 Resident:         20.00 MB
 Virtual:          26.00 MB

Notably, vectorization (in the sense of using vector registers) is at 0 %.

aledinola · April 7, 2024, 4:05pm

Thanks, I used the term vectorization (as I explained in the intro to my post) to refer to the process of replacing do loops with array expressions. In interpreted languages like Matlab this brings a big increase in performance. Fortran is of course different, but one of the advantages of Fortran with respect to C is that it is array-oriented. However, it seems from my example and from other sources that in the end if you want good performance you have to program with loops, thus “losing” this nice feature of Fortran.

The second point that I wanted to make with my post is that ifx performs quite bad relative to ifort

ivanpribec · April 7, 2024, 5:46pm

In the f_util function, isn’t the power function superfluous?

        if (abs(gamma-1.0d0)<1.0d-6) then
            util = log(c)
        else
            util = c**(1.0d0-gamma)/(1.0d0-gamma)
        endif

aledinola · April 7, 2024, 6:29pm

It’s the general way of coding an isoelastic utility function (see here). Of course, given than gamma=1, I could just write log(c), but I’d like to keep the code general enough.

I could write

if (gamma==1.0d0) then
  util = log(c)
else
  util = c**(1.0d0-gamma)/(1.0d0-gamma)
endif

but I think it would not be robust to round-off errors since gamma is a real, not an integer.

aledinola · April 7, 2024, 6:56pm

Since in the code I use the matmul intrinsic function to multiply two matrices, I was looking into ways of optimizing this (without invoking the subroutine dgemm, since I would like to keep the code at a basic level). So I came across this excellent post by @zaikunzhang. One suggestion given in the comments to that post by @ivanpribec was to compile with

ifort -O3 -xHost -qopenmp -qmkl=parallel -heap-arrays 40 -qopt-matmul test_matmul.f90

In my example I am compiling with these flags:

SWITCH = /O3 /fast /Qipo /Qmkl=parallel /Qopenmp /Qparallel -qopt-matmul

The problem is that ifort does not recognize -qopt-matmul. In particular, it gives this warning:

ifort: command line warning #10006: ignoring unknown option ‘/qopt-matmul’

I am a bit puzzled by this…

ivanpribec · April 7, 2024, 7:52pm

I’ve also observed this with some programs. The two compilers have reached feature parity, but Intel Fortran compiler team has said on a previous occasion that their goal for 2024 is to reach performance parity.

When I plot the performance in terms of updates per second (only for method=0), I see the following:

The difference is due to the success of vectorization on the inner most loop (the a’ one).

From the performance snapshot for ifx:

Elapsed Time:                               2.06 s
  SP GFLOPS:                                  0.18
  DP GFLOPS:                                  5.54
  IPC Rate:                                   2.16
     DP FLOPs:                             21.60% of uOps
          Packed:                             0.00% from DP FP
             128-bit:                         0.00%
             256-bit:                         0.00%
             512-bit:                         0.00%
          Scalar:                           100.00% from DP FP
       Non-FP:                               77.70% of uOps

Whereas with ifort:

Elapsed Time:                               0.63 s
  SP GFLOPS:                                  0.65
  DP GFLOPS:                                 12.05
  Average CPU Frequency:                      4.76 GHz
  IPC Rate:                                   1.46
       DP FLOPs:                              7.20% of uOps
          Packed:                            94.60% from DP FP
             128-bit:                         0.10%
             256-bit:                         0.10%
             512-bit:                        94.50%
          Scalar:                             5.40% from DP FP
       Non-FP:                               92.30% of uOps

If you look at the results with Intel Advisor, it shows that the majority of time is spent calculating the logarithm function (the one inlined from f_util):

The remaining question now is what can be done to nudge gfortran and ifx to emit the desired vector instructions.

urbanjost · April 7, 2024, 9:01pm

To reduce any duplication of effort in making the github project gfortran-compatible, adding command-line parameters for the method and reorganizing the application for use with fpm see

wget “ftp://ftp.urbanjost.altervista.org/REMOVE/timing1.tgz”

If you have trouble with the wget(1) command this should work

https://urbanjost.altervista.org/REMOVE/timing1.tgz

Looks like the best bet is a temporary github repository

which is set up to just running “run.sh” runs a variety of compiler options and program options via the “fpm.rsp” file; but this looks to have already progressed to the point that is mainly of use in saving an fpm(1) user from setting up to run with multiple compilers.

I will remove the tar file in a few weeks.

WIthout even significantly changing the code logic and just using various compilers and options I was seeing about a 10x speedup from slowest case to fastest case that I tried; which is always interesting.

aledinola · April 7, 2024, 10:16pm

Thanks @urbanjost! The link that you have provided does not seem to work. When I click on it, nothing happens
I have copied and pasted it on my browser but I get a strange notification: Open Pick an app

urbanjost · April 7, 2024, 10:32pm

Odd. Seems ftp and https are not being accepted. Have not had trouble before. Added an https link and getting an unexpected warning. I can probably add it to a github site instead while I check what is up.

Not that there is anything earth-shaking there; but the github link should work. I will probably take it down in a few weeks like the REMOVE/ directory one. Just a few changes but if you want to create ClI numeric arguments to pass in and/or read from a namelist file or curious about fpm and fpm response files it might be interesting. Also shows how to log the compiler and compiler options from the program if you have not run across that before; useful in particular for testing or creating a program pedigree.

ivanpribec · April 8, 2024, 1:37am

Here are some changes that made it faster with ifx:

use of explicit-shape arrays
use of a derived type instead of global variables
the maxloc intrinsic from ifort is used, as it turns out to be slightly faster (better compiler vectorization); ifx versions prior to 2024.0 use a scalar loop
use of !$omp simd directives to encourage vectorization of the inner loop

module mod_vfi

    implicit none
    private

    public :: bellman_op, params
    
    type :: params
        integer :: n_a, n_z
        real(8) :: beta, gamma, R
        real(8), allocatable :: a_grid(:)
        real(8), allocatable :: ap_grid(:)
        real(8), allocatable :: z_grid(:)
        real(8), allocatable :: z_tran(:,:)
    end type

contains
    
    subroutine bellman_op(v_new, pol_ap, v, p)
        implicit none
        ! Inputs:
        type(params), intent(in) :: p
        real(8), intent(in) :: v(p%n_a,p%n_z)
        ! Outputs:
        real(8), intent(out) :: v_new(p%n_a,p%n_z), pol_ap(p%n_a,p%n_z)

        ! Locals:
        real(8) :: ev(p%n_a,p%n_z)
        real(8) :: cons(p%n_a), v_temp(p%n_a)

        real(8) :: a_val, z_val
        integer :: a_c, z_c, ap_ind

        ! Compute expected value function EV(a',z) = sum over z' of v(a',z')*z_tran(z,z')
        EV = matmul(v,transpose(p%z_tran))

        ! Step through the state space
        do z_c=1,p%n_z
            do a_c=1,p%n_a

                ! Current states
                a_val = p%a_grid(a_c)
                z_val = p%z_grid(z_c)

                ! Array of constraints
                cons = p%R*a_val + z_val - p%ap_grid

                call max_ap_loc(p%n_a, cons, ev(:,z_c), p%beta, p%gamma, &
                    v_temp,  ap_ind)

                v_new(a_c,z_c)  = v_temp(ap_ind)
                pol_ap(a_c,z_c) = p%ap_grid(ap_ind)
            enddo
        enddo

    contains
            
    ! Choose a' optimally by stepping through all possible values      
    subroutine max_ap_loc(n, cons, ev, beta, gamma, vt, id)
        implicit none
        integer, intent(in) :: n
        real(8), intent(in) :: cons(n), ev(n), beta, gamma
        real(8), intent(out) :: vt(n)
        integer, intent(out) :: id

        interface
            ! ifx versions < 2024.0 generate scalar loops.
            ! The ifort version on the other hand, seems to be 
            ! vectorized well when compiled with -O3 -xHOST
            pure function ifort_maxloc(n,v) result(idx)
               integer, intent(in) :: n
               real(8), intent(in) :: v(n)
               integer :: idx
            end function
        end interface

        real(8), parameter :: dbl_min = -huge(1.0d0)
        integer :: i

        !$omp simd
        do i = 1, n
            if (cons(i) > 0.0d0) then
                vt(i) = ff_util(cons(i), gamma) + beta*ev(i)
            else
                vt(i) = dbl_min
            end if
        end do

#if defined(__INTEL_LLVM_COMPILER) || defined(__GFORTRAN__) 
        id = ifort_maxloc(n,vt)
#else
        id = maxloc(vt,dim=1)
#endif
    end subroutine

    pure function ff_util(c, gamma) result(util)
    !$omp declare simd uniform(gamma) inbranch

        ! Purpose: utility function
        ! Assumption: c is positive, this is enforced
        ! elsewhere in the code
        real(8), intent(in) :: c, gamma
        real(8) :: util
        
        if (abs(gamma - 1.0d0) < 1.0d-6) then
            util = log(c)
        else
            util = (c**(1.0d0 - gamma))/(1.0d0 - gamma)
        endif
    end function

    end subroutine 

end module mod_vfi

(Edit: Added a missing end subroutine statement, that got cut while copying. The code now compiles.)

I get the same error as the original:

~/bellman$ make main FC=ifx FFLAGS="-O3 -xHOST -qopenmp-simd -mprefer-vector-width=512 -fopenmp -qmkl"
ifort -c -O3 -xHOST -diag-disable=10448 ifort_maxloc.f90
ifx -o main -O3 -xHOST -qopenmp-simd -mprefer-vector-width=512 -fopenmp -qmkl -fopenmp -L/opt/intel/oneapi/compiler/latest/lib  main.F90 ifort_maxloc.o 
~/bellman$ ./main 0 1000
 Starting program...
 Starting VFI...
 Method =            0
 n_a =         1000
 Iteration:          276
 Error:  9.800499998213752E-007

~/bellman$ make original FFLAGS="-O3 -march=native -fopenmp"
gfortran-13 -o original -O3 -march=native -fopenmp original.F90 ifort_maxloc.o
~/bellman$ ./original 1 1000
 Starting program...
 Starting VFI...
 Method =            1
 Iteration:          276
 Error:   9.8004999982137520E-007

When I plot the new one in the same chart as earlier, it looks like this:

perf_opt

If I look again at the APS performance snapshot (the numbers in bracket are the initial values from ifort), the various metrics have also improved:

  Elapsed Time:                               0.35 s             (0.63 s)
  SP GFLOPS:                                  0.00               (0.65)
  DP GFLOPS:                                 33.87               (12.05)
  Average CPU Frequency:                      4.91 GHz           (4.76 GHz)
  IPC Rate:                                   1.93               (1.46)
  Vectorization:                             99.50%
     Instruction Mix:
       SP FLOPs:                              0.00% of uOps
       DP FLOPs:                             19.40% of uOps      (7.20% of uOps)
          Packed:                            99.50% from DP FP   (94.60% from DP FP)
             128-bit:                         0.00%
             256-bit:                         0.10%
             512-bit:                        99.40%              (94.50%)
          Scalar:                             0.50% from DP FP
       Non-FP:                               80.60% of uOps      (92.30% of uOps

So ifx can perform as fast or even faster as ifort. Unfortunately, auto-vectorization is not a panacea, and is prone to failure. Matt Pharr put it like this:

The problem with an auto-vectorizer is that as long as vectorization can fail (and it will), then if you’re a programmer who actually cares about what code the compiler generates for your program, you must come to deeply understand the auto-vectorizer. Then, when it fails to vectorize code you want to be vectorized, you can either poke it in the right ways or change your program in the right ways so that it works for you again. This is a horrible way to program; it’s all alchemy and guesswork and you need to become deeply specialized about the nuances of a single compiler’s implementation—something you wouldn’t otherwise need to care about one bit.

Edit: here is how the multi-threaded speed-up looks like, I increased n_z = 100, and set the tolerance to 1.0d-7 so there would be a bit more work. For 1-4 threads it’s roughly linear, but then it starts to veer downward for some reason.

scaling_plot

The APS snapshot looks like this:

Topic		Replies	Views
Will using Vectorization speed up the program? Help	22	1887	November 26, 2023
Fortran is dead – Long live Fortran! Tutorials	9	863	June 27, 2025
Simple summation 8x slower than in Julia	89	15023	April 2, 2022
How to use IFX and offload openMP to GPU?	0	1625	April 2, 2022
Parallelization on GPU with Intel compiler Intel	55	2653	September 20, 2024

Performance of vectorized code in ifort and ifx

Markdown Guide:

Related topics