Performance of vectorized code in ifort and ifx

Shahid · April 8, 2024, 6:35am

Your code has forall construct which is not recommended. FORALL (intel.com) says

The FORALL construct is an obsolescent language feature in Fortran 2018.

aledinola · April 8, 2024, 10:43am

I know that, but in practice the forall construct performs better than the alternatives that I have tested (see bellman_op_vec, and the commented lines therein). In particular,

The forall is faster than using where or merge
The forall performs on par with do concurrent in ifort but do concurrent is really much slower when using ifx.

To give some background on why I am using these constructs:

I have to compute the utility of consumption, defined in the function f_util in mod_globals, for each element of the vector cons (cons is a vector with n_a elements). However, I don’t want to compute utility if cons is negative, so I have to loop over all elements in a’ and take the log of cons only if cons is positive. Taking the log is time-consumping, so I don’t want to do unnecessary operations

aledinola · April 8, 2024, 11:08am

Thanks a lot for sharing this on github!

aledinola · April 8, 2024, 1:03pm

Very interesting, thanks! A couple of comments:

Automatic arrays. Usually I refrain from using automatic arrays because in ifort on Windows they often give stack overflow errors. Then I have to set /heap-arrays0 but this degrades performance, sometimes significantly.
Use of a derived type instead of global variables. Do you think in your experience that this would help also in ifort? By the way, I know that using global variables is not a very good programming practice but I did not know that it would have adverse impact on performance.
I noticed that you outsourced the innermost loop over a’ to an internal subroutine and you also moved the function f_util from being a module procedure to an internal function of bellman_op. Could you please explain the logic of this change?

ivanpribec · April 8, 2024, 10:09pm

Allocating a temporary array should be fine here, I mainly did it to make the function shorter. This kernel isn’t affected by the one-time overhead.
Does the derived type help? My reasoning was it would simplify the compiler analysis, allowing the auto-vectorized to do a better job. Switching to contiguous arrays and removing references to global data led to a two- to three-fold increase, but ultimately, this wasn’t the bottleneck. I tracked my iterations in this chart:

image1088×795 106 KB
Typically, compilers will try to use vector instructions on the inner-most loop, as that’s where the most work lies. I.e. The inner-most loop over a’ will be repeated n_a*n_z times, so that’s the optimization target. I split the body into a function to focus exclusively on that function as it accounts for 95 % of the runtime. But perhaps that’s not most the natural way to structure this calculation.

Today, I had another look at this, and it turns out the biggest boost comes from using Intel’s Short Vector Math Library (SVML). That is to say the time difference comes down to whether you are using the AVX512 versions of the log and pow functions or not (and to a minor extent the same for the maxloc intrinsic).

To demonstrate this point, I wrote the following C vector version of the utility function:

// v_util.c

#include <stddef.h>
#include <float.h>
#include <math.h>

#define VLEN 8
#define HUGE_NEG (-DBL_MAX)

static inline double f_util(double c, double gamma)
{
    const double p = 1 - gamma;
    return (fabs(gamma - 1.0) < 1.0e-6) ? log(c) : pow(c,p)/p;
}

/* Vectorized utility function */
void v_util(int n, 
            const double* __restrict c, 
            const double* __restrict ev, 
            double* __restrict v,
            double beta,
            double gamma)
{
    #pragma omp simd simdlen(VLEN)
    for (size_t i = 0; i < n; i++)
    {
        if (c[i] > 0.0) {
            v[i] = f_util(c[i],gamma) + beta*ev[i];
        } else {
            v[i] = HUGE_NEG;
        }
    }
}

This file will be compiled with the icx compiler by the following Make rule:

v_util.o: v_util.c
	icx -O3 -xHOST -qopenmp-simd -mprefer-vector-width=512 -g -c $<
ifort_maxloc.o: ifort_maxloc.f90
	ifort -c -O3 -xHOST -diag-disable=10448 -g $<

To make sure the flags lead to the desired effect, I’ve inspected the object file to make sure the vector math functions are called:

~/bellman$ nm v_util.o
0000000000000000 r .LCPI0_0
0000000000000008 r .LCPI0_1
0000000000000010 r .LCPI0_2
0000000000000018 r .LCPI0_3
0000000000000020 r .LCPI0_4
                 U __svml_log8_mask_z0
                 U __svml_pow8_mask_z0
0000000000000000 T v_util

gcc (paired with the right set of flags) also supports vectorized math functions using the libmvec library, i.e.

...
                 U log
                 U pow
0000000000000000 T v_util
                 U _ZGVdN4v_log
                 U _ZGVdN4vv_pow
                 U _ZGVeN8v_log
                 U _ZGVeN8vv_pow

but in all my tests, the object produced by the Intel compiler is faster (perhaps at the expense of strict floating point accuracy?).

The Bellman operator now looks like this:

    subroutine bellman_op(v_new,pol_ap, v)
    ! Purpose: Bellman operator, it does one step of the VFI
    ! This version is loop-based and can be parallelized with OpenMP
    ! For vectorized versions, see subroutine bellman_op_vec, bellman_op_vec2
    implicit none
    ! Inputs:
    real(8), intent(in) :: v(:,:)
    ! Outputs:
    real(8), intent(out) :: v_new(:,:), pol_ap(:,:)
    ! Locals:
    real(8), allocatable :: EV(:,:)
    integer :: a_c, z_c, ap_c, ap_ind
    real(8) :: a_val, z_val, aprime_val

    interface
        ! For ifx versions < 2024.0, a serial loop is used.
        ! The ifort version on the other hand, seems to be 
        ! vectorized well when compiled with -O3 -xHOST
        pure function ifort_maxloc(n,v) result(idx)
           integer, intent(in) :: n
           real(8), intent(in) :: v(n)
           integer :: idx
        end function
        ! Vectorized utility function implemented in C
        ! and compiled with icx for use of the SVML
        subroutine v_util(n,c,ev,v,beta,gamma) bind(c,name="v_util")
            use, intrinsic :: iso_c_binding
            integer(c_int), value :: n
            real(c_double), intent(in) :: c(n), ev(n)
            real(c_double), intent(out) :: v(n)
            real(c_double), value :: beta, gamma
        end subroutine
    end interface

    real(8), allocatable :: cons(:), v_temp(:)

    allocate(cons,mold=ap_grid)
    allocate(v_temp,mold=ap_grid)

    ! Compute expected value function EV(a',z) = sum over z' of v(a',z')*z_tran(z,z')
    EV = matmul(v,transpose(z_tran))

    ! Step through the state space
    !$omp parallel if (par_fortran==1) default(shared) & 
    !$omp private(z_c,a_c,a_val,z_val,ap_ind,ap_c,aprime_val,cons,v_temp)
    !$omp do collapse(2)
    do z_c=1,n_z
        do a_c=1,n_a
            ! Current states
            a_val = a_grid(a_c)
            z_val = z_grid(z_c)

    ! ----- modified section -----------
            cons = R*a_val + z_val - ap_grid
            call v_util(n_a,cons,ev(:,z_c),v_temp,beta,gamma)  ! icx            
            ap_ind = ifort_maxloc(n_a,v_temp)   ! ifort
    ! -----------------------------------
            v_new(a_c,z_c)  = v_temp(ap_ind)
            pol_ap(a_c,z_c) = ap_grid(ap_ind)
        enddo
    enddo
    !$omp end do
    !$omp end parallel

    end subroutine bellman_op

The performance plot looks like this:

perf_with_svml

The ifx version I provided yesterday is still 15-20% faster (keep reading as I’ll come to that), but we see that replacing the body of the loop with the C function has increased the performance of the hybrid executable drastically.

Now the reason I placed f_util in the body of the function was because I was hoping the compiler would perform an optimization known as loop unswitching. Since gamma is constant for the duration of the loop, it would be beneficial if this check that \gamma \approx 1 could be moved out of the loop. However it’s hard for a compiler to tell that gamma won’t change across iterations, when the same global symbol is referenced from multiple places.

We can also perform this optimization manually. Here’s how it can be done in C (in this case without use of the preprocessor, although that may be necessary if the compiler doesn’t do inlining very well):

#include <stddef.h>
#include <float.h>
#include <math.h>
#include <stdbool.h>

#define VLEN 8
#define HUGE_NEG (-DBL_MAX)

static inline double f_util(
        bool gamma_is_one, 
        double c, 
        double gamma) 
{
    double p = 1.0 - gamma;
    return gamma_is_one ? log(c) : pow(c, p) / p;
}

static inline void apply_utility(
        bool gamma_is_one, 
        int n, 
        const double* __restrict c, 
        const double* __restrict ev, 
        double* __restrict v, 
        double beta, 
        double gamma) 
{
    #pragma omp simd simdlen(VLEN)
    for (int i = 0; i < n; i++) {
        if (c[i] > 0.0) {
            v[i] = f_util(gamma_is_one, c[i], gamma) + beta * ev[i];
        } else {
            v[i] = HUGE_NEG;
        }
    }
}

void v_util(int n, 
        const double* __restrict c, 
        const double* __restrict ev, 
        double* __restrict v, 
        double beta, 
        double gamma) 
{
    const bool gamma_is_one = fabs(gamma - 1.0) < 1.0e-6;
    if (gamma_is_one) {
        apply_utility(true, n, c, ev, v, beta, gamma);
    } else {
        apply_utility(false, n, c, ev, v, beta, gamma);
    }
}

With this optimization (and with a considerable loss of simplity) the performance matches the ifx line (there is some measurement noise coming from warm-up effects, system jitter, etc.),

perf_with_svml

To summarize what I learned on this example:

vector math libraries make a big difference, and it’s worth checking if they are used
compilers struggle to optimize across function/module boundaries, unless given some help
we can use multiple compilers together for better performance (the tradeoff being the extra build complexity)

aledinola · April 8, 2024, 10:50pm

This is great thanks! Do you have a repo where I can see a complete version of the ifx/ ifort implementation? (I don’t know C, so I don’t plan to use hybrid C/Fortran approaches).

Shahid · April 9, 2024, 1:34am

Can you visit Time difference using if and where statement - Help - Fortran Discourse (fortran-lang.discourse.group)

I had a sort of similar test two years ago.

Shahid · April 9, 2024, 4:24am

perhaps you need to take it outside the loop as suggested by others in my previous link.

aledinola · April 9, 2024, 2:27pm

Thanks for your comment. Unfortunately, due to the structure of my code, I cannot do that, since the value of cons depends on the nested loop over z and a. In your example, the matrix c was independent of the loops i and j, so the where construct could be taken out.

Topic		Replies	Views
Will using Vectorization speed up the program? Help	22	1888	November 26, 2023
Fortran is dead – Long live Fortran! Tutorials	9	863	June 27, 2025
Simple summation 8x slower than in Julia	89	15024	April 2, 2022
How to use IFX and offload openMP to GPU?	0	1625	April 2, 2022
Parallelization on GPU with Intel compiler Intel	55	2653	September 20, 2024

Performance of vectorized code in ifort and ifx

Related topics