Practice with elemental function

I’m trying to see if using elemental functions can improve performance of my code, but my example does not work as intended.

I would like to compute a matrix defined as

F(a’,a) = \log(cash - a’)
where cash = (1+r)a + wz. If cash - a’ \leq 0, then F is equal to a large negative number.

I calculate the matrix F in three different ways

  • Calling the function f_return within two loops
  • Calling the function f_return_vec using array syntax: I pass a vector and the function returns a vector with the same shape. Inside the function I use the intrinsic merge (but I could use where, forall or other constructs, I don’t think it is relevant).
  • Calling the function f_return_el using elemental function.

All three should give the same result of course but the third method, based on elemental syntax, fails. Any idea why?

Thanks!

! Compute F(a',a) = log(cash - a') where cash = (1+r)a + wz
! If cash - a' <= 0, return a large negative number
module mymod
    implicit none
    
    contains
    
    pure function f_return(cash,aprime) result(F)
    ! Scalar-syntax version with scalar input and output
        implicit none
        real(8), intent(in) :: cash, aprime
        real(8) :: F
        real(8) :: consumption
        
        consumption = cash - aprime
        
        if (consumption > 0.0d0) then
            F = log(consumption)
        else
            F = -1.0d10
        endif
    
    end function f_return
    
    pure function f_return_vec(cash,aprime) result(F)
    ! Array-syntax version with vector input and output
        implicit none
        real(8), intent(in) :: cash
        real(8), intent(in) :: aprime(:)
        real(8) :: F(size(aprime))
        real(8) :: consumption(size(aprime))
        
        consumption = cash - aprime
        
        where (consumption > 0.0d0)
            F = log(consumption)
        elsewhere
            F = -1.0d10
        end where
    
    end function f_return_vec
    
    elemental pure function f_return_el(cash,aprime) result(F)
    ! Elemental version with all inputs/output as scalars
        implicit none
        real(8), intent(in) :: cash
        real(8), intent(in) :: aprime
        real(8) :: F
        real(8) :: consumption
        
        consumption = cash - aprime
        if (consumption > 0.0d0) then
            F = log(consumption)
        else
            F = -1.0d10
        endif
    
    end function f_return_el
    
end module mymod
    
program main
    use mymod, only: f_return, f_return_vec, f_return_el
    implicit none
    integer, parameter :: n_a = 5000
    real(8), parameter :: a_min = 0.0d0, a_max = 20.0d0
    real(8), parameter :: w=1.0d0, z=1.0d0, r=0.04d0
    integer :: i, a_c, ap_c
    real(8) :: cash, step_size, diff, diff3
    real(8) :: t_start1, t_end1, t_start2, t_end2, t_start3, t_end3
    real(8) :: a_grid(n_a)
    real(8), allocatable :: ret_matrix1(:,:), ret_matrix2(:,:), ret_matrix3(:,:)
    
    ! Create equally spaced grid for assets
    step_size = (a_max-a_min)/(real(n_a-1,8))
    do i = 1,n_a
        a_grid(i) = a_min + step_size*(i - 1)
    enddo
    
    allocate(ret_matrix1(n_a,n_a), ret_matrix2(n_a,n_a), ret_matrix3(n_a,n_a))
    
    ! Measure time for three different ways to evaluate the
    call cpu_time(t_start1)
    do a_c = 1,n_a
        cash = (1.0d0 + r)*a_grid(a_c) + w*z
        do ap_c = 1,n_a
            ! ret_matrix(a',a) gives the return for asset today a and asset tomorrow a'
            ret_matrix1(ap_c,a_c) = f_return(cash,a_grid(ap_c))
        enddo
    enddo
    call cpu_time(t_end1)
    
    call cpu_time(t_start2) 
    do a_c = 1,n_a
        cash = (1.0d0 + r)*a_grid(a_c) + w*z
        ! ret_matrix(a',a) gives the return for asset today a and asset tomorrow a'
        ret_matrix2(:,a_c) = f_return_vec(cash,a_grid)
    enddo
    call cpu_time(t_end2)
    
    call cpu_time(t_start3)
    do a_c = 1,n_a
        cash = (1.0d0 + r)*a_grid(a_c) + w*z
        ! ret_matrix(a',a) gives the return for asset today a and asset tomorrow a'
        ret_matrix3(:,a_c) = f_return_el(cash,a_grid)
    enddo
    call cpu_time(t_end3)
    
    diff = maxval(abs(ret_matrix1 - ret_matrix2))
    diff3 = maxval(abs(ret_matrix1 - ret_matrix3))
    
    write(*,*) 'Maximum difference between scalar and vectorized evaluations: ', diff
    write(*,*) 'Maximum difference between scalar and elemental evaluations: ', diff3
    write(*,*) 'Time taken for scalar evaluations:     ', t_end1 - t_start1
    write(*,*) 'Time taken for vectorized evaluations: ', t_end2 - t_start2
    write(*,*) 'Time taken for elemental evaluations:  ', t_end3 - t_start3
    
end program main

The output of this program is:

Maximum difference between scalar and vectorized evaluations:
  0.000000000000000E+000
 Maximum difference between scalar and elemental evaluations:
   10000000000.0000
Press any key to continue . . .

Update

I updated this MWE fixing a bug, replaced the if condition with merge in the second function (thanks @zaikunzhang) and added the timing.

Update 2

More typos fixed. Also, I eliminated the use of merge since it is not safe: it may evaluate the log of a negative number (see discussion below).

1 Like

There is just a typo in your third loop, if you change ret_matrix2 for ret_matrix3 you get your expected result. Also, you could use the same expression

consumption = cash - aprime
F = merge(log(consumption), -1.0d10, consumption > 0.0d0)

For the 3 variants, but that’s just extra

2 Likes

For performance, you should to check if your compiler is using a SIMD math library like Intel SVML or libmvec to evaluate log. Typically this will require the -ffast-math option (or similar).

2 Likes

Thanks for spotting the typo!! Re the merge suggestion, I’ve noticed in the past that the if condition was faster, so I tend not to use it if I can avoid it. This goes back to array-syntax not being very performant in Fortran. I had an example here some time ago showing a simple value function iteration loop-based vs array-syntax and the loop-based version was much faster.

1 Like

Indeed, even in this case, the “all loops-based” code is the clear winner. The result might depend on the compiler and on my computer of course: I would be interested is someone else can run the MWE shown in my first post.

1 Like

I think you might be packaging together different things. Array syntax might or might not render the best performance possible compared to a plain do depending on the compiler and also what is the exact operation being carried out.

merge vs if is a different discussion. My experience so far is that merge get’s inlined and vectorized quite nicely at least with ifort and gfortran. But to be sure, one would need to check the ASM for the specific case at hand. In some scenarios they might be translated to the same ASM, sometimes not …

2 Likes

To maximize performance I suggest not using merge when tsource or fsource is an expression, since the compiler may evaluate both.

2 Likes

Besides the if condition, is there an alternative safe way of doing this? By safe I mean not evaluate the log of a negative number.

Would where behave differently from merge?

1 Like

Quoting The new features of Fortran 2023 by Reid,

Conditional expressions, expressions whose value is one of several alternatives, are added. A simple example is

value = ( a>0.0 ? a : 0.0)

which is a short way of writing

if (a>0.0) then
   value = a 
else 
   value = 0.0 
end if

I don’t know which compilers have added this feature.

2 Likes

Nothing I know of guarantees short-circuiting other than that new syntax that I know of; but you can make an elemental function that does the if/else/endif and call that, or (as mentioned) use WHERE.
WHERE is tricky though (or at least I got surprised) in that you can end up operating on all the elements if not careful.

    where (A > 0.0)
       A = LOG (A)           ! LOG is invoked only for positive elements.
       A = A / SUM (LOG (A)) ! LOG is invoked for all elements
                             ! because SUM is transformational.
    elsewhere 
    endwhere
2 Likes

All three expressions in merge are required to be evaluated before choosing which results to use.

F2023 conditional expressions are in the latest gfortran development trunk. They actually work quite well!

As far as using and timing elemental procedures, be sure your compiler supports and has inlining turned on.

3 Likes

Didn’t double-check but I think it is up to the compiler and that it does not have to evaluate the expressions for all values regardless of the mask, it is just that it is also allowed to (?). I was using MERGE() early on with optional values as in VALUE=MERGE(A,B,PRESENT(A))
and got away with it on several compilers, then it blew up on one and pretty sure I found the standard said it was up to the compiler when I looked it up.

1 Like

That is the way I now understand merge() to work too.

2 Likes

MERGE works like any other function and must evaluate all its arguments. I suppose if the third argument could be evaluated at compile time, and neither of the first two expressions contained function calls with side-effects, an optimizing compiler could simplify things. But generally speaking this is the need for conditional expressions.

I like using conditional expressions in other languages. They’ve been around since ALGOL-60, LISP, and APL. In fact John McCarthy wrote that he developed conditional expressions based on his experience with a (MERGE-like) function he wrote in early Fortran called XIF. He claimed XIF made his code a lot more readable than using arithmetic IFs. But apparently couldn’t convince the IBMers at the time to only evaluate one expression or the other based on the conditional. So he put it in LISP, and got the ALGOL-60 committee to also include them. It is well past time that Fortran should support them too.

2 Likes

Please be careful @aledinola of your results as you have typos in time monitoring: you queried cpu_time(start_2) instead of cpu_time(end_2) for both loops 2 and 3 in your example. With these fixed, the elemental version is much faster on my machine. Unfortunately Fortran’s N-D array syntax is not good enough for these symmetric matrix cases (spread would lead to huge slowdowns). Here is what I get witht he time fix on a Mac M1 with my version, and -O3:

 Maximum difference between scalar and vectorized evaluations:    0.0000000000000000     
 Maximum difference between scalar and elemental evaluations:    0.0000000000000000     
 Time taken for scalar evaluations:    5.9456000000000002E-002
 Time taken for vectorized evaluations:    4.2580000000000007E-002
 Time taken for elemental evaluations:    3.9827000000000001E-002

My combustion engineering practice would suggest:

  • keep as much physics as possible inside the elemental function (everything is passed purely by reference)
  • If you’re going to optimize this, try to make it at least C0 (no hard step near 0), perhaps you can save more from the better convergence of the optimizer rather than by optimizing the function evaluation.
    elemental pure function f2d(a,aprime,w,z) result(F)
    ! Elemental version with all physics embedded
        implicit none
        real(8), intent(in) :: a,aprime,w,z
        real(8) :: F
        real(8) :: consumption, cash
        real(8), parameter :: MIN_CONSUMPTION = 10*tiny(0.0_8)
        
        cash = (1.0d0+a) + w*z        
        consumption = max(cash - aprime, MIN_CONSUMPTION)
        F = log(consumption)

    end function f2d
3 Likes

Thanks, I fixed the remaining typos in my OP.

The timing results of the elemental version have improved, but on my machine the only-loops-based version is still the fastest.

1 Like

Related to `merge`: I fed my MWE to ChatGPT 5.2 to make sure there are no further bugs. It actually flagged my using of `merge` as a bug, giving this message:

1) Big correctness bug: merge(log(consumption), ...) still evaluates log(consumption)

In Fortran, both “true” and “false” arguments to merge(tsource, fsource, mask) are evaluated before selection. So:

F = merge(log(consumption), -1.0d10, consumption > 0.0d0)

will still attempt log(consumption) for entries where consumption <= 0, which can generate NaNs, floating invalid exceptions, or inconsistent results depending on flags (especially with -ffast-math-like settings).

This affects both:

  • f_return_vec

  • f_return_el

:white_check_mark: Fix: use where / masked assignment (safe), or explicitly guard the log

Then ChatGPT suggested keeping the if condition in the elemental function and using where in the version that uses array syntax.

module mymod
    use, intrinsic :: iso_fortran_env, only: real64
    implicit none
contains

    pure function f_return(cash, aprime) result(F)
        real(real64), intent(in) :: cash, aprime
        real(real64) :: F
        real(real64) :: consumption

        consumption = cash - aprime
        if (consumption > 0.0_real64) then
            F = log(consumption)
        else
            F = -1.0e10_real64
        end if
    end function f_return

    pure function f_return_vec(cash, aprime) result(F)
        real(real64), intent(in) :: cash
        real(real64), intent(in) :: aprime(:)
        real(real64) :: F(size(aprime))
        real(real64) :: consumption(size(aprime))

        consumption = cash - aprime
        F = -1.0e10_real64
        where (consumption > 0.0_real64)
            F = log(consumption)
        end where
    end function f_return_vec

    elemental pure function f_return_el(cash, aprime) result(F)
        real(real64), intent(in) :: cash, aprime
        real(real64) :: F
        real(real64) :: consumption

        consumption = cash - aprime
        if (consumption > 0.0_real64) then
            F = log(consumption)
        else
            F = -1.0e10_real64
        end if
    end function f_return_el

end module mymod

1 Like

Right. If you feed a bad argument into log, it is a problem. You can use a second merge, inside the expression passed to log, to detect and prevent the bad argument. Same situation occurs when trying to use merge to avoid a divide-by-zero.

2 Likes

Thanks for your comments.

I prefer to compute the variable cash outside of the function and pass it as input. The reason is simple: cash depends only on a and not on a’. So in the nested loops version I do the following:

do a_c = 1,n_a
    cash = (1.0d0 + r)*a_grid(a_c) + w*z
    do ap_c = 1,n_a
        ret_matrix1(ap_c,a_c) = f_return(cash,a_grid(ap_c))
    enddo
enddo

In your version, the function f2dhas both a and a’ as inputs. So if I call it with loops I have to write

do a_c = 1,n_a
    do ap_c = 1,n_a
        ret_matrix1(ap_c,a_c) = f2d(a_grid(a_c),a_grid(ap_c),w,z)
        ! f2d calculates cash=(1+r)*a+w*z, cons=cash-aprime and log(cons)
    enddo
enddo

which is inefficient because it computes cash n_a*n_a times instead of n_a.

2 Likes

@FedericoPerini I tried also a version using forall but is slower (as slow as where, to be precise). Here is the function using forall:

pure function f_return_vec(cash,aprime) result(F)
! Array-syntax version with vector input and output
    implicit none
    real(8), intent(in) :: cash
    real(8), intent(in) :: aprime(:)
    real(8) :: F(size(aprime))
    real(8) :: consumption(size(aprime))
    integer :: i
    
    consumption = cash - aprime
    F = -1.0d10
    forall (i=1:size(aprime), consumption(i)>0.0d0)
        F(i) = log(consumption(i))
    end forall

end function f_return_vec

and here is how it’s called:

do a_c = 1,n_a
    cash = (1.0d0 + r)*a_grid(a_c) + w*z
    ret_matrix3(:,a_c) = f_return_el(cash,a_grid)
enddo
2 Likes