Here are some changes that made it faster with ifx:

- use of explicit-shape arrays
- use of a derived type instead of global variables
- use of the `maxloc` intrinsic from ifort, which turns out to be slightly faster (better compiler vectorization); ifx versions prior to 2024.0 generate a scalar loop (see the sketch of `ifort_maxloc.f90` after the listing)
- use of `!$omp simd` directives to encourage vectorization of the inner loop
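
For reference, the operator the code below computes is the standard Bellman update for a savings problem (assuming the usual income-fluctuation setup, with CRRA utility $u$ as in `ff_util` below):

$$
V_{\text{new}}(a,z) = \max_{a'} \; u(R\,a + z - a') + \beta \sum_{z'} \pi(z,z')\, V(a',z')
$$

so $EV(a',z) = \sum_{z'} V(a',z')\,\pi(z,z')$ can be precomputed once per iteration via `matmul`.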
```fortran
module mod_vfi
   implicit none
   private
   public :: bellman_op, params

   type :: params
      integer :: n_a, n_z
      real(8) :: beta, gamma, R
      real(8), allocatable :: a_grid(:)
      real(8), allocatable :: ap_grid(:)
      real(8), allocatable :: z_grid(:)
      real(8), allocatable :: z_tran(:,:)
   end type

contains

   subroutine bellman_op(v_new, pol_ap, v, p)
      implicit none
      ! Inputs:
      type(params), intent(in) :: p
      real(8), intent(in) :: v(p%n_a,p%n_z)
      ! Outputs:
      real(8), intent(out) :: v_new(p%n_a,p%n_z), pol_ap(p%n_a,p%n_z)
      ! Locals:
      real(8) :: ev(p%n_a,p%n_z)
      real(8) :: cons(p%n_a), v_temp(p%n_a)
      real(8) :: a_val, z_val
      integer :: a_c, z_c, ap_ind

      ! Compute expected value function EV(a',z) = sum over z' of v(a',z')*z_tran(z,z')
      ev = matmul(v, transpose(p%z_tran))

      ! Step through the state space
      do z_c = 1, p%n_z
         do a_c = 1, p%n_a
            ! Current states
            a_val = p%a_grid(a_c)
            z_val = p%z_grid(z_c)
            ! Array of consumption values implied by each a' choice
            cons = p%R*a_val + z_val - p%ap_grid
            call max_ap_loc(p%n_a, cons, ev(:,z_c), p%beta, p%gamma, &
                            v_temp, ap_ind)
            v_new(a_c,z_c)  = v_temp(ap_ind)
            pol_ap(a_c,z_c) = p%ap_grid(ap_ind)
         end do
      end do

   contains

      ! Choose a' optimally by stepping through all possible values
      subroutine max_ap_loc(n, cons, ev, beta, gamma, vt, id)
         implicit none
         integer, intent(in) :: n
         real(8), intent(in) :: cons(n), ev(n), beta, gamma
         real(8), intent(out) :: vt(n)
         integer, intent(out) :: id
         interface
            ! ifx versions < 2024.0 generate scalar loops for maxloc.
            ! The ifort version, on the other hand, seems to be
            ! vectorized well when compiled with -O3 -xHOST
            pure function ifort_maxloc(n, v) result(idx)
               integer, intent(in) :: n
               real(8), intent(in) :: v(n)
               integer :: idx
            end function
         end interface
         real(8), parameter :: dbl_min = -huge(1.0d0)
         integer :: i

         !$omp simd
         do i = 1, n
            if (cons(i) > 0.0d0) then
               vt(i) = ff_util(cons(i), gamma) + beta*ev(i)
            else
               vt(i) = dbl_min
            end if
         end do

#if defined(__INTEL_LLVM_COMPILER) || defined(__GFORTRAN__)
         id = ifort_maxloc(n, vt)
#else
         id = maxloc(vt, dim=1)
#endif
      end subroutine

      pure function ff_util(c, gamma) result(util)
         !$omp declare simd uniform(gamma) inbranch
         ! Purpose: utility function
         ! Assumption: c is positive, this is enforced
         ! elsewhere in the code
         real(8), intent(in) :: c, gamma
         real(8) :: util
         if (abs(gamma - 1.0d0) < 1.0d-6) then
            util = log(c)
         else
            util = (c**(1.0d0 - gamma))/(1.0d0 - gamma)
         end if
      end function

   end subroutine

end module mod_vfi
```
(Edit: added a missing `end subroutine` statement that got cut while copying. The code now compiles.)
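
The external `ifort_maxloc.f90` isn't reproduced in the post. A minimal sketch consistent with the interface above would be the following; my assumption is that it simply wraps the `maxloc` intrinsic so that ifort's well-vectorized implementation gets used:

```fortran
! ifort_maxloc.f90 -- hypothetical sketch, compiled separately with ifort.
! Assumes the function just defers to the maxloc intrinsic, which ifort
! vectorizes well at -O3 -xHOST.
pure function ifort_maxloc(n, v) result(idx)
   implicit none
   integer, intent(in) :: n
   real(8), intent(in) :: v(n)
   integer :: idx
   idx = maxloc(v, dim=1)
end function ifort_maxloc
```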
Running it, I get the same iteration count and convergence error as the original:
```
~/bellman$ make main FC=ifx FFLAGS="-O3 -xHOST -qopenmp-simd -mprefer-vector-width=512 -fopenmp -qmkl"
ifort -c -O3 -xHOST -diag-disable=10448 ifort_maxloc.f90
ifx -o main -O3 -xHOST -qopenmp-simd -mprefer-vector-width=512 -fopenmp -qmkl -fopenmp -L/opt/intel/oneapi/compiler/latest/lib main.F90 ifort_maxloc.o
~/bellman$ ./main 0 1000
Starting program...
Starting VFI...
Method = 0
n_a = 1000
Iteration: 276
Error: 9.800499998213752E-007
~/bellman$ make original FFLAGS="-O3 -march=native -fopenmp"
gfortran-13 -o original -O3 -march=native -fopenmp original.F90 ifort_maxloc.o
~/bellman$ ./original 1 1000
Starting program...
Starting VFI...
Method = 1
Iteration: 276
Error: 9.8004999982137520E-007
```
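
Both binaries run the same driver. The actual main.F90 isn't shown in the post, but a minimal sketch of the VFI loop it presumably contains is below (the grids and calibration here are made-up placeholders, not the post's values):

```fortran
! Hypothetical driver; the real main.F90 is not reproduced in the post.
program main_sketch
   use mod_vfi, only: bellman_op, params
   implicit none
   type(params) :: p
   real(8), allocatable :: v(:,:), v_new(:,:), pol_ap(:,:)
   real(8) :: err
   integer :: iter, i

   ! Placeholder calibration and grids (assumptions for illustration only)
   p%n_a = 1000;  p%n_z = 2
   p%beta = 0.96d0;  p%gamma = 2.0d0;  p%R = 1.03d0
   allocate(p%a_grid(p%n_a), p%ap_grid(p%n_a), p%z_grid(p%n_z), &
            p%z_tran(p%n_z,p%n_z))
   p%a_grid  = [(0.01d0*(i-1), i = 1, p%n_a)]
   p%ap_grid = p%a_grid
   p%z_grid  = [0.5d0, 1.5d0]
   p%z_tran  = reshape([0.9d0, 0.1d0, 0.1d0, 0.9d0], [2,2])

   allocate(v(p%n_a,p%n_z), v_new(p%n_a,p%n_z), pol_ap(p%n_a,p%n_z))
   v = 0.0d0

   ! Standard VFI: iterate the Bellman operator to a fixed point
   do iter = 1, 10000
      call bellman_op(v_new, pol_ap, v, p)
      err = maxval(abs(v_new - v))   ! sup-norm distance between iterates
      v = v_new
      if (err < 1.0d-6) exit
   end do
   print *, 'Iteration:', iter
   print *, 'Error:', err
end program main_sketch
```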
When I plot the new one in the same chart as earlier, it looks like this:
If I look again at the APS performance snapshot (the numbers in brackets are the initial values from ifort), the various metrics have also improved:

```
Elapsed Time:            0.35 s    (0.63 s)
SP GFLOPS:               0.00      (0.65)
DP GFLOPS:              33.87     (12.05)
Average CPU Frequency:   4.91 GHz  (4.76 GHz)
IPC Rate:                1.93      (1.46)
Vectorization:          99.50%
Instruction Mix:
  SP FLOPs:   0.00% of uOps
  DP FLOPs:  19.40% of uOps (7.20% of uOps)
    Packed:  99.50% from DP FP (94.60% from DP FP)
      128-bit:  0.00%
      256-bit:  0.10%
      512-bit: 99.40% (94.50%)
    Scalar:   0.50% from DP FP
  Non-FP:    80.60% of uOps (92.30% of uOps)
```
So ifx can perform as fast as, or even faster than, ifort. Unfortunately, auto-vectorization is not a panacea, and it is prone to failure. Matt Pharr put it like this:
> The problem with an auto-vectorizer is that as long as vectorization can fail (and it will), then if you’re a programmer who actually cares about what code the compiler generates for your program, you must come to deeply understand the auto-vectorizer. Then, when it fails to vectorize code you want to be vectorized, you can either poke it in the right ways or change your program in the right ways so that it works for you again. This is a horrible way to program; it’s all alchemy and guesswork and you need to become deeply specialized about the nuances of a single compiler’s implementation—something you wouldn’t otherwise need to care about one bit.
Edit: here is what the multi-threaded speed-up looks like. I increased n_z to 100 and set the tolerance to 1.0d-7 so there would be a bit more work. For 1-4 threads it's roughly linear, but then it starts to veer downward for some reason.
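
The threaded variant isn't shown in the post; my assumption is that the outer state-space loops in `bellman_op` are parallelized along these lines:

```fortran
! Hypothetical threading of the state-space loops in bellman_op;
! the actual threaded code is not shown in the post.
!$omp parallel do collapse(2) default(shared) &
!$omp private(z_c, a_c, a_val, z_val, cons, v_temp, ap_ind)
do z_c = 1, p%n_z
   do a_c = 1, p%n_a
      a_val = p%a_grid(a_c)
      z_val = p%z_grid(z_c)
      cons = p%R*a_val + z_val - p%ap_grid
      call max_ap_loc(p%n_a, cons, ev(:,z_c), p%beta, p%gamma, &
                      v_temp, ap_ind)
      v_new(a_c,z_c)  = v_temp(ap_ind)
      pol_ap(a_c,z_c) = p%ap_grid(ap_ind)
   end do
end do
!$omp end parallel do
```

If that is the structure, one plausible suspect for the flattening beyond 4 threads is memory-bandwidth saturation, since each inner call streams `cons`, `ev`, and `vt` through cache; that's conjecture without profiling, though.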
The APS snapshot looks like this: