Practice with elemental function

I’m trying to see if using elemental functions can improve performance of my code, but my example does not work as intended.

I would like to compute a matrix defined as

F(a’,a) = \log(cash - a’)
where cash = (1+r)a + wz. If cash - a’ \leq 0, then F is equal to a large negative number.

I calculate the matrix F in three different ways

  • Calling the function f_return within two loops
  • Calling the function f_return_vec using array syntax: I pass a vector and the function returns a vector with the same shape. Inside the function I use the intrinsic merge (but I could use where, forall or other constructs, I don’t think it is relevant).
  • Calling the function f_return_el using elemental function.

All three should give the same result of course but the third method, based on elemental syntax, fails. Any idea why?

Thanks!

! Compute F(a',a) = log(cash - a') where cash = (1+r)a + wz
! If cash - a' <= 0, return a large negative number
module mymod
    implicit none
    
    contains
    
    pure function f_return(cash,aprime) result(F)
    ! Scalar-syntax version with scalar input and output
        implicit none
        real(8), intent(in) :: cash, aprime
        real(8) :: F
        real(8) :: consumption
        
        consumption = cash - aprime
        
        if (consumption > 0.0d0) then
            F = log(consumption)
        else
            F = -1.0d10
        endif
    
    end function f_return
    
    pure function f_return_vec(cash,aprime) result(F)
    ! Array-syntax version with vector input and output
        implicit none
        real(8), intent(in) :: cash
        real(8), intent(in) :: aprime(:)
        real(8) :: F(size(aprime))
        real(8) :: consumption(size(aprime))
        
        consumption = cash - aprime
        
        F = merge(log(consumption), -1.0d10, consumption > 0.0d0)
    
    end function f_return_vec
    
    elemental pure function f_return_el(cash,aprime) result(F)
    ! Elemental version with all inputs/output as scalars
        implicit none
        real(8), intent(in) :: cash
        real(8), intent(in) :: aprime
        real(8) :: F
        real(8) :: consumption
        
        consumption = cash - aprime
        F = merge(log(consumption), -1.0d10, consumption > 0.0d0)
        !if (consumption > 0.0d0) then
        !    F = log(consumption)
        !else
        !    F = -1.0d10
        !endif
    
    end function f_return_el
    
end module mymod
    
program main
    use mymod, only: f_return, f_return_vec, f_return_el
    implicit none
    integer, parameter :: n_a = 4000
    real(8), parameter :: a_min = 0.0d0, a_max = 20.0d0
    real(8), parameter :: w=1.0d0, z=1.0d0
    integer :: i, a_c, ap_c
    real(8) :: cash, step_size, diff, diff3
    real(8) :: t_start1, t_end1, t_start2, t_end2, t_start3, t_end3
    real(8) :: a_grid(n_a)
    real(8), allocatable :: ret_matrix1(:,:), ret_matrix2(:,:), ret_matrix3(:,:)
    
    ! Create equally spaced grid for assets
    step_size = (a_max-a_min)/(real(n_a-1,8))
    do i = 1,n_a
        a_grid(i) = a_min + step_size*(i - 1)
    enddo
    
    allocate(ret_matrix1(n_a,n_a), ret_matrix2(n_a,n_a), ret_matrix3(n_a,n_a))
    
    ! Measure time for three different ways to evaluate the
    call cpu_time(t_start1)
    do a_c = 1,n_a
        do ap_c = 1,n_a
            cash = (1.0d0+a_grid(a_c)) + w*z
            ! ret_matrix(a',a) gives the return for asset today a and asset tomorrow a'
            ret_matrix1(ap_c,a_c) = f_return(cash,a_grid(ap_c))
        enddo
    enddo
    call cpu_time(t_end1)
    
    call cpu_time(t_start2) 
    do a_c = 1,n_a
        cash = (1.0d0+a_grid(a_c)) + w*z
        ! ret_matrix(a',a) gives the return for asset today a and asset tomorrow a'
        ret_matrix2(:,a_c) = f_return_vec(cash,a_grid)
    enddo
    call cpu_time(t_start2)
    
    call cpu_time(t_start3)
    do a_c = 1,n_a
        cash = (1.0d0+a_grid(a_c)) + w*z
        ! ret_matrix(a',a) gives the return for asset today a and asset tomorrow a'
        ret_matrix3(:,a_c) = f_return_el(cash,a_grid)
    enddo
    call cpu_time(t_start3)
    
    diff = maxval(abs(ret_matrix1 - ret_matrix2))
    diff3 = maxval(abs(ret_matrix1 - ret_matrix3))
    
    write(*,*) 'Maximum difference between scalar and vectorized evaluations: ', diff
    write(*,*) 'Maximum difference between scalar and elemental evaluations: ', diff3
    write(*,*) 'Time taken for scalar evaluations: ', t_end1 - t_start1
    write(*,*) 'Time taken for vectorized evaluations: ', t_start2 - t_end2
    write(*,*) 'Time taken for elemental evaluations: ', t_start3 - t_end3
    
end program main

The output of this program is:

Maximum difference between scalar and vectorized evaluations:
  0.000000000000000E+000
 Maximum difference between scalar and elemental evaluations:
   10000000000.0000
Press any key to continue . . .

Update

I updated this MWE fixing a bug, replaced the if condition with merge in the second function (thanks @zaikunzhang) and added the timing.

There is just a typo in your third loop, if you change ret_matrix2 for ret_matrix3 you get your expected result. Also, you could use the same expression

consumption = cash - aprime
F = merge(log(consumption), -1.0d10, consumption > 0.0d0)

For the 3 variants, but that’s just extra

1 Like

For performance, you should to check if your compiler is using a SIMD math library like Intel SVML or libmvec to evaluate log. Typically this will require the -ffast-math option (or similar).

1 Like

Thanks for spotting the typo!! Re the merge suggestion, I’ve noticed in the past that the if condition was faster, so I tend not to use it if I can avoid it. This goes back to array-syntax not being very performant in Fortran. I had an example here some time ago showing a simple value function iteration loop-based vs array-syntax and the loop-based version was much faster.

Indeed, even in this case, the “all loops-based” code is the clear winner. The result might depend on the compiler and on my computer of course: I would be interested is someone else can run the MWE shown in my first post.

I think you might be packaging together different things. Array syntax might or might not render the best performance possible compared to a plain do depending on the compiler and also what is the exact operation being carried out.

merge vs if is a different discussion. My experience so far is that merge get’s inlined and vectorized quite nicely at least with ifort and gfortran. But to be sure, one would need to check the ASM for the specific case at hand. In some scenarios they might be translated to the same ASM, sometimes not …

1 Like

To maximize performance I suggest not using merge when tsource or fsource is an expression, since the compiler may evaluate both.

1 Like

Besides the if condition, is there an alternative safe way of doing this? By safe I mean not evaluate the log of a negative number.

Would where behave differently from merge?

Quoting The new features of Fortran 2023 by Reid,

Conditional expressions, expressions whose value is one of several alternatives, are added. A simple example is

value = ( a>0.0 ? a : 0.0)

which is a short way of writing

if (a>0.0) then
   value = a 
else 
   value = 0.0 
end if

I don’t know which compilers have added this feature.

1 Like

Nothing I know of guarantees short-circuiting other than that new syntax that I know of; but you can make an elemental function that does the if/else/endif and call that, or (as mentioned) use WHERE.
WHERE is tricky though (or at least I got surprised) in that you can end up operating on all the elements if not careful.

    where (A > 0.0)
       A = LOG (A)           ! LOG is invoked only for positive elements.
       A = A / SUM (LOG (A)) ! LOG is invoked for all elements
                             ! because SUM is transformational.
    elsewhere 
    endwhere
1 Like

All three expressions in merge are required to be evaluated before choosing which results to use.

F2023 conditional expressions are in the latest gfortran development trunk. They actually work quite well!

As far as using and timing elemental procedures, be sure your compiler supports and has inlining turned on.

1 Like

Didn’t double-check but I think it is up to the compiler and that it does not have to evaluate the expressions for all values regardless of the mask, it is just that it is also allowed to (?). I was using MERGE() early on with optional values as in VALUE=MERGE(A,B,PRESENT(A))
and got away with it on several compilers, then it blew up on one and pretty sure I found the standard said it was up to the compiler when I looked it up.

That is the way I now understand merge() to work too.

MERGE works like any other function and must evaluate all its arguments. I suppose if the third argument could be evaluated at compile time, and neither of the first two expressions contained function calls with side-effects, an optimizing compiler could simplify things. But generally speaking this is the need for conditional expressions.

I like using conditional expressions in other languages. They’ve been around since ALGOL-60, LISP, and APL. In fact John McCarthy wrote that he developed conditional expressions based on his experience with a (MERGE-like) function he wrote in early Fortran called XIF. He claimed XIF made his code a lot more readable than using arithmetic IFs. But apparently couldn’t convince the IBMers at the time to only evaluate one expression or the other based on the conditional. So he put it in LISP, and got the ALGOL-60 committee to also include them. It is well past time that Fortran should support them too.

Please be careful @aledinola of your results as you have typos in time monitoring: you queried cpu_time(start_2) instead of cpu_time(end_2) for both loops 2 and 3 in your example. With these fixed, the elemental version is much faster on my machine. Unfortunately Fortran’s N-D array syntax is not good enough for these symmetric matrix cases (spread would lead to huge slowdowns). Here is what I get witht he time fix on a Mac M1 with my version, and -O3:

 Maximum difference between scalar and vectorized evaluations:    0.0000000000000000     
 Maximum difference between scalar and elemental evaluations:    0.0000000000000000     
 Time taken for scalar evaluations:    5.9456000000000002E-002
 Time taken for vectorized evaluations:    4.2580000000000007E-002
 Time taken for elemental evaluations:    3.9827000000000001E-002

My combustion engineering practice would suggest:

  • keep as much physics as possible inside the elemental function (everything is passed purely by reference)
  • If you’re going to optimize this, try to make it at least C0 (no hard step near 0), perhaps you can save more from the better convergence of the optimizer rather than by optimizing the function evaluation.
    elemental pure function f2d(a,aprime,w,z) result(F)
    ! Elemental version with all physics embedded
        implicit none
        real(8), intent(in) :: a,aprime,w,z
        real(8) :: F
        real(8) :: consumption, cash
        real(8), parameter :: MIN_CONSUMPTION = 10*tiny(0.0_8)
        
        cash = (1.0d0+a) + w*z        
        consumption = max(cash - aprime, MIN_CONSUMPTION)
        F = log(consumption)

    end function f2d
1 Like