Your code has a forall construct, which is not recommended. FORALL (intel.com) says:
The FORALL construct is an obsolescent language feature in Fortran 2018.
I know that, but in practice the forall construct performs better than the alternatives that I have tested (see bellman_op_vec and the commented lines therein). In particular:

- forall is faster than using where or merge
- forall performs on par with do concurrent in ifort, but do concurrent is much slower when using ifx.

To give some background on why I am using these constructs: I have to compute the utility of consumption, defined in the function f_util in mod_globals, for each element of the vector cons (cons is a vector with n_a elements). However, I don't want to compute utility if cons is negative, so I have to loop over all elements of a' and take the log of cons only if cons is positive. Taking the log is time-consuming, so I don't want to do unnecessary operations.
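For concreteness, here is a minimal sketch of the formulations being compared (my reconstruction, assuming the log branch of f_util and the declarations used in bellman_op further down; a merge version would evaluate log(cons) unconditionally, which is exactly the wasted work described above):

! (1) forall with a mask: obsolescent in Fortran 2018, but fast in practice
v_temp = -huge(1.0d0)
forall (ap_c = 1:n_a, cons(ap_c) > 0.0d0)
   v_temp(ap_c) = log(cons(ap_c)) + beta*EV(ap_c,z_c)
end forall

! (2) where/elsewhere: log only matters under the mask, but some
! compilers evaluate it over the full array before masking
where (cons > 0.0d0)
   v_temp = log(cons) + beta*EV(:,z_c)
elsewhere
   v_temp = -huge(1.0d0)
end where

! (3) do concurrent with a mask: the standard-conforming replacement
v_temp = -huge(1.0d0)
do concurrent (ap_c = 1:n_a, cons(ap_c) > 0.0d0)
   v_temp(ap_c) = log(cons(ap_c)) + beta*EV(ap_c,z_c)
end do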
Thanks a lot for sharing this on GitHub!
Very interesting, thanks! A couple of comments:

- One can compile with /heap-arrays0, but this degrades performance, sometimes significantly.
- You changed f_util from being a module procedure to an internal function of bellman_op. Could you please explain the logic of this change?

Allocating a temporary array should be fine here; I mainly did it to make the function shorter. This kernel isn't affected by the one-time overhead.
Does the derived type help? My reasoning was that it would simplify the compiler analysis, allowing the auto-vectorizer to do a better job. Switching to contiguous arrays and removing references to global data led to a two- to three-fold increase, but ultimately, this wasn't the bottleneck. I tracked my iterations in this chart: [chart not shown]
Typically, compilers will try to use vector instructions on the innermost loop, as that's where most of the work lies. That is, the innermost loop over a' will be repeated n_a*n_z times, so that's the optimization target. I split the body into a separate function to focus exclusively on it, as it accounts for 95% of the runtime. But perhaps that's not the most natural way to structure this calculation.
Today, I had another look at this, and it turns out the biggest boost comes from using Intel's Short Vector Math Library (SVML). That is to say, the time difference comes down to whether you are using the AVX-512 versions of the log and pow functions or not (and, to a minor extent, the same for the maxloc intrinsic).
To demonstrate this point, I wrote the following C vector version of the utility function:
// v_util.c
#include <stddef.h>
#include <float.h>
#include <math.h>
#define VLEN 8
#define HUGE_NEG (-DBL_MAX)
/* CRRA utility: log(c) when gamma is numerically 1, else c^(1-gamma)/(1-gamma) */
static inline double f_util(double c, double gamma)
{
    const double p = 1.0 - gamma;
    return (fabs(gamma - 1.0) < 1.0e-6) ? log(c) : pow(c, p) / p;
}
/* Vectorized utility function */
void v_util(int n,
const double* __restrict c,
const double* __restrict ev,
double* __restrict v,
double beta,
double gamma)
{
#pragma omp simd simdlen(VLEN)
for (int i = 0; i < n; i++)
{
if (c[i] > 0.0) {
v[i] = f_util(c[i],gamma) + beta*ev[i];
} else {
v[i] = HUGE_NEG;
}
}
}
This file will be compiled with the icx compiler using the following Make rule (shown together with the rule for the ifort_maxloc helper, which is compiled with ifort):
v_util.o: v_util.c
icx -O3 -xHOST -qopenmp-simd -mprefer-vector-width=512 -g -c $<
ifort_maxloc.o: ifort_maxloc.f90
ifort -c -O3 -xHOST -diag-disable=10448 -g $<
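For completeness, the objects also need to be linked together; a hypothetical link rule (the main.o name is my assumption, as the full Makefile is not shown) could look like this, with ifort pulling in the SVML and OpenMP runtime libraries automatically:

bellman: main.o v_util.o ifort_maxloc.o
	ifort -O3 -xHOST -qopenmp -o $@ $^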
To make sure the flags have the desired effect, I inspected the object file to verify that the vectorized math functions are called:
~/bellman$ nm v_util.o
0000000000000000 r .LCPI0_0
0000000000000008 r .LCPI0_1
0000000000000010 r .LCPI0_2
0000000000000018 r .LCPI0_3
0000000000000020 r .LCPI0_4
U __svml_log8_mask_z0
U __svml_pow8_mask_z0
0000000000000000 T v_util
gcc (paired with the right set of flags) also supports vectorized math functions through the libmvec library, i.e.
...
U log
U pow
0000000000000000 T v_util
U _ZGVdN4v_log
U _ZGVdN4vv_pow
U _ZGVeN8v_log
U _ZGVeN8vv_pow
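The exact flags are not reproduced here, but a combination along these lines should produce the libmvec symbols above (my assumption: -ffast-math makes glibc expose the SIMD declarations of log and pow, and -fopenmp-simd activates the omp simd pragma):

gcc -O3 -march=native -ffast-math -fopenmp-simd -c v_util.c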
But in all my tests, the object produced by the Intel compiler is faster (perhaps at the expense of strict floating-point accuracy?).
The Bellman operator now looks like this:
subroutine bellman_op(v_new,pol_ap, v)
! Purpose: Bellman operator, it does one step of the VFI
! This version is loop-based and can be parallelized with OpenMP
! For vectorized versions, see subroutine bellman_op_vec, bellman_op_vec2
implicit none
! Inputs:
real(8), intent(in) :: v(:,:)
! Outputs:
real(8), intent(out) :: v_new(:,:), pol_ap(:,:)
! Locals:
real(8), allocatable :: EV(:,:)
integer :: a_c, z_c, ap_c, ap_ind
real(8) :: a_val, z_val, aprime_val
interface
! For ifx versions < 2024.0, a serial loop is used.
! The ifort version on the other hand, seems to be
! vectorized well when compiled with -O3 -xHOST
pure function ifort_maxloc(n,v) result(idx)
integer, intent(in) :: n
real(8), intent(in) :: v(n)
integer :: idx
end function
! Vectorized utility function implemented in C
! and compiled with icx for use of the SVML
subroutine v_util(n,c,ev,v,beta,gamma) bind(c,name="v_util")
use, intrinsic :: iso_c_binding
integer(c_int), value :: n
real(c_double), intent(in) :: c(n), ev(n)
real(c_double), intent(out) :: v(n)
real(c_double), value :: beta, gamma
end subroutine
end interface
real(8), allocatable :: cons(:), v_temp(:)
allocate(cons,mold=ap_grid)
allocate(v_temp,mold=ap_grid)
! Compute expected value function EV(a',z) = sum over z' of v(a',z')*z_tran(z,z')
EV = matmul(v,transpose(z_tran))
! Step through the state space
!$omp parallel if (par_fortran==1) default(shared) &
!$omp private(z_c,a_c,a_val,z_val,ap_ind,ap_c,aprime_val,cons,v_temp)
!$omp do collapse(2)
do z_c=1,n_z
do a_c=1,n_a
! Current states
a_val = a_grid(a_c)
z_val = z_grid(z_c)
! ----- modified section -----------
cons = R*a_val + z_val - ap_grid
call v_util(n_a,cons,ev(:,z_c),v_temp,beta,gamma) ! icx
ap_ind = ifort_maxloc(n_a,v_temp) ! ifort
! -----------------------------------
v_new(a_c,z_c) = v_temp(ap_ind)
pol_ap(a_c,z_c) = ap_grid(ap_ind)
enddo
enddo
!$omp end do
!$omp end parallel
end subroutine bellman_op
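The source of ifort_maxloc is not shown in this post; a minimal sketch of what ifort_maxloc.f90 could contain is a plain wrapper around the maxloc intrinsic, so that the reduction is compiled by ifort with -O3 -xHOST (the actual version also has the serial-loop fallback for older ifx mentioned in the interface comment):

! ifort_maxloc.f90 (sketch)
pure function ifort_maxloc(n,v) result(idx)
implicit none
integer, intent(in) :: n
real(8), intent(in) :: v(n)
integer :: idx
idx = maxloc(v, dim=1)
end function ifort_maxloc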
The performance plot looks like this: [plot not shown]
The ifx version I provided yesterday is still 15-20% faster (keep reading, as I'll come to that), but we see that replacing the body of the loop with the C function has drastically increased the performance of the hybrid executable.
Now, the reason I placed f_util in the body of the function was that I was hoping the compiler would perform an optimization known as loop unswitching. Since gamma is constant for the duration of the loop, it would be beneficial if the check that \gamma \approx 1 could be moved out of the loop. However, it's hard for a compiler to tell that gamma won't change across iterations when the same global symbol is referenced from multiple places.
We can also perform this optimization manually. Here's how it can be done in C (in this case without use of the preprocessor, although that may be necessary if the compiler doesn't do inlining very well):
#include <stddef.h>
#include <float.h>
#include <math.h>
#include <stdbool.h>
#define VLEN 8
#define HUGE_NEG (-DBL_MAX)
static inline double f_util(
bool gamma_is_one,
double c,
double gamma)
{
double p = 1.0 - gamma;
return gamma_is_one ? log(c) : pow(c, p) / p;
}
/* After inlining, gamma_is_one is a compile-time constant in each call
   below, so the log/pow branch vanishes from the vectorized loop. */
static inline void apply_utility(
bool gamma_is_one,
int n,
const double* __restrict c,
const double* __restrict ev,
double* __restrict v,
double beta,
double gamma)
{
#pragma omp simd simdlen(VLEN)
for (int i = 0; i < n; i++) {
if (c[i] > 0.0) {
v[i] = f_util(gamma_is_one, c[i], gamma) + beta * ev[i];
} else {
v[i] = HUGE_NEG;
}
}
}
void v_util(int n,
const double* __restrict c,
const double* __restrict ev,
double* __restrict v,
double beta,
double gamma)
{
const bool gamma_is_one = fabs(gamma - 1.0) < 1.0e-6;
if (gamma_is_one) {
apply_utility(true, n, c, ev, v, beta, gamma);
} else {
apply_utility(false, n, c, ev, v, beta, gamma);
}
}
With this optimization (and with a considerable loss of simplicity) the performance matches the ifx line (there is some measurement noise coming from warm-up effects, system jitter, etc.).
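The same unswitching can also be written in pure Fortran (a sketch using the names from bellman_op, for readers who prefer to avoid C; whether the compiler then calls the vectorized log/pow is still compiler- and flag-dependent, which is the point of the C detour above):

! Branch on gamma once, outside the hot loop over a'
if (abs(gamma - 1.0d0) < 1.0d-6) then
   do ap_c = 1, n_a
      if (cons(ap_c) > 0.0d0) then
         v_temp(ap_c) = log(cons(ap_c)) + beta*EV(ap_c,z_c)
      else
         v_temp(ap_c) = -huge(1.0d0)
      end if
   end do
else
   do ap_c = 1, n_a
      if (cons(ap_c) > 0.0d0) then
         v_temp(ap_c) = cons(ap_c)**(1.0d0-gamma)/(1.0d0-gamma) + beta*EV(ap_c,z_c)
      else
         v_temp(ap_c) = -huge(1.0d0)
      end if
   end do
end if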
To summarize what I learned from this example:

- The biggest boost comes from using vectorized math functions (SVML via icx, or libmvec via gcc) for log and pow.
- Hoisting the loop-invariant gamma check out of the hot loop (loop unswitching) is needed to match the ifx version.
- Contiguous arrays and removing references to global data help the auto-vectorizer, but were not the main bottleneck here.
This is great, thanks! Do you have a repo where I can see a complete version of the ifx/ifort implementation? (I don't know C, so I don't plan to use hybrid C/Fortran approaches.)
Can you visit Time difference using if and where statement - Help - Fortran Discourse (fortran-lang.discourse.group)? I had a somewhat similar test two years ago. Perhaps you need to take the where construct outside the loop, as suggested by others in my previous link.
Thanks for your comment. Unfortunately, due to the structure of my code, I cannot do that, since the value of cons depends on the nested loop over z and a. In your example, the matrix c was independent of the loops i and j, so the where construct could be taken out.
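To make the distinction concrete (a constructed illustration, not code from either thread):

! Hoistable: c does not depend on the loop indices, so the masked
! computation can be done once, before the loops.
where (c > 0.0d0) u = log(c)

! Not hoistable: as in bellman_op, cons is rebuilt from the current
! states on every iteration, so the mask must stay inside.
do z_c = 1, n_z
   do a_c = 1, n_a
      cons = R*a_grid(a_c) + z_grid(z_c) - ap_grid
      where (cons > 0.0d0)
         v_temp = log(cons) + beta*EV(:,z_c)
      elsewhere
         v_temp = -huge(1.0d0)
      end where
   enddo
enddo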