Testing LogSumExp (or RealSoftMax)

I wrote a simple function to compute the log sum of exponentials in Fortran. My first version uses array syntax. As a curiosity, I implemented the same function using loops. The loop-based version turned out to be faster.

To provide some context, I am using this function in a larger project where it is called many times within several nested loops, so performance is important.

I like more the version with array syntax (fun_logsum) because it is closer to a similar function I have in Matlab.

module mymodule
    implicit none
    private
    public :: fun_logsum, fun_logsum_loops
    
    contains
    
    ! fun_logsum, fun_logsum_loops
    ! DESCRIPTION
	! Calculates the log-sum. The computation avoids
	! overflows.
	! https://gregorygundersen.com/blog/2020/02/09/log-sum-exp/
	! https://nhigham.com/2021/01/05/what-is-the-log-sum-exp-function/
	! INPUTS
	!   V:     Vector with values of the different choices
	!   sigma: Standard deviation of the taste shock
    !   n:     Size of the vector V
	! OUTPUTS
	!   LogSum: Log sum of exponentials
	! AUTHOR
	! Alessandro Di Nola, March 2022.
    
    function fun_logsum(V,sigma,n) result(LogSum)
    ! Uses array syntax
    implicit none
    ! Declare inputs and function result:
    integer, intent(in) :: n      ! Size of the vector V
    real(8), intent(in) :: V(n)   ! Input vector
    real(8), intent(in) :: sigma  ! St. dev. of the preference shock
	real(8) :: LogSum             ! Result is log(sum(exp(v)))
	! Declare locals:
	real(8) :: max_val, sum_exp

     ! Step 1: Find the maximum value to ensure numerical stability
    max_val = maxval(V)
	
	! Step 2: Compute sum of exp( (v_i - maxval)/sigma )
	sum_exp = sum( exp((V-max_val)/sigma) )
    
    ! Step 3: Return max_val + sigma*log(sum_exp)
	LogSum = max_val + sigma*log(sum_exp)	
    
    end function fun_logsum
    
function fun_logsum_loops(V, sigma, n) result(LogSum)
    ! Uses loops
    implicit none
    ! Declare inputs and function result:
    integer, intent(in) :: n      ! Size of the vector V
    real(8), intent(in) :: V(n)   ! Input vector
    real(8), intent(in) :: sigma  ! St. dev. of the preference shock
	real(8) :: LogSum             ! Result is log(sum(exp(v)))
	! Declare locals:
	real(8) :: max_val, sum_exp
    integer :: i

    ! Step 1: Find the maximum value to ensure numerical stability
    max_val = V(1)
    do i = 2, n
        if (V(i) > max_val) then
            max_val = V(i)
        endif
    enddo

    ! Step 2: Compute sum of exp( (v_i - maxval)/sigma )
    sum_exp = 0.0d0
    do i = 1, n
        sum_exp = sum_exp + exp((V(i) - max_val)/sigma)
    enddo

    ! Step 3: Return max_val + sigma*log(sum_exp)
    LogSum = max_val + sigma*log(sum_exp)
    
end function fun_logsum_loops

    
end module mymodule
!===============================================================================!
    
program main
    use mymodule, only: fun_logsum, fun_logsum_loops
    implicit none
    ! Declare variables:
    integer, parameter :: n = 5 ! Number of alternatives
    integer, parameter :: Nsim = 100000000 ! Number of iterations
    real(8), parameter :: sigma = 0.02d0 ! Standard deviation of the preference shock
    real(8) :: v_vec5(n), tic, toc, tic2, toc2
    real(8) :: Vbar, Vbar2, Prob(n)
    integer :: ii
    
    ! Execution starts here:
    
    ! CASE 1: simple
    v_vec5 = [0.1d0, 0.2d0, 0.3d0, 0.4d0, 0.5d0] ! Example vector of utilities
    Vbar = fun_logsum(v_vec5, sigma, n)
    Vbar2 = fun_logsum_loops(v_vec5, sigma, n)
    
    write(*,'(A,5E15.6)') "v_vec5 is: ", v_vec5
    write(*,*) "fun_logsum: ", Vbar
    write(*,*) "fun_logsum_loops: ", Vbar2
    
    ! CASE 2: big numbers
    v_vec5 = [1.0d10, 2.0d10, 3.0d10, 4.0d10, 5.0d10] ! Example vector of utilities
    Vbar = fun_logsum(v_vec5, sigma, n)
    Vbar2 = fun_logsum_loops(v_vec5, sigma, n)
    
    write(*,'(A,5E15.6)') "v_vec5 is: ", v_vec5
    write(*,*) "fun_logsum: ", Vbar
    write(*,*) "fun_logsum_loops: ", Vbar2
    
    ! CASE 3: all zeros
    v_vec5 = [0.0d0, 0.0d0, 0.0d0, 0.0d0, 0.0d0] ! Example vector of utilities
    Vbar = fun_logsum(v_vec5, sigma, n)
    Vbar2 = fun_logsum_loops(v_vec5, sigma, n)
    
    write(*,'(A,5E15.6)') "v_vec5 is: ", v_vec5
    write(*,*) "fun_logsum: ", Vbar
    write(*,*) "fun_logsum_loops: ", Vbar2
    
    pause "Press Enter to continue..."
    
    write(*,*) "Now testing speed..."
    
    call cpu_time(tic) ! Start timer
    do ii = 1, Nsim
        Vbar = fun_logsum(v_vec5, sigma, n)
    enddo
    call cpu_time(toc) ! Stop timer
    
    call cpu_time(tic2) ! Start timer
    do ii = 1, Nsim
        Vbar = fun_logsum_loops(v_vec5, sigma, n)
    enddo
    call cpu_time(toc2) ! Stop timer
       
    write(*,*) "Time taken for fun_logsum:       ", toc-tic,   " seconds"
    write(*,*) "Time taken for fun_logsum_loops: ", toc2-tic2, " seconds"
    
end program main
2 Likes

I’ve tried the code on my Mac and the behavior seems interesting… :slightly_smiling_face:

First, it seems that the loops over Nsim in the original code may have been optimized away (no or little calculations done because the results are not used), because the timing remains almost the same for my machine even for n=100 etc (roughly a few micro seconds). Also, the pause statement caused a run-time error (IEEE_UNDERFLOW_FLAG) and the program stopped before time measurement, so it may be better to delete that line (for copy-paste testing…)

Then, I’ve tried a modified version like below, and it seems that the speed comparison depends considerably on the vector size n (though your main interest may be in a short vector like n=5…). What I found interesting is that the speed changes suddenly from n=17 (fast) to n=18 (slow). I’ve also tried different memory allocations (like “auto”, “save”, “alloc”), which also affected the comparison.

So anyway, it may be interesting to modify the original code such that it uses the result of each loop and measure timing on your machine. For comparison, my results are something like the following (on MacM1 + gfortran-11 -O3). So overall, the loop-based routine may be better for short vectors…? (though the results may depend on machines/compilers/options/etc) Another approach may be to do some “batch” calculations for a set of short vectors (by combining them into a larger array).

!! (the module has not been modified)

program main
    call test()
end program
    
subroutine test()
    use mymodule, only: fun_logsum_arr => fun_logsum, fun_logsum_loops
    implicit none

    !>>>>
    !![ vector size ]
    integer, parameter :: n = 5
    !! integer, parameter :: n = 17   !! "fast" for n <= 17 (also depends on "Mem:xxx")
    !! integer, parameter :: n = 18   !! "slow" for n >= 18
    !! integer, parameter :: n = 30
    !<<<<

    integer, parameter :: Nsim = 100000000 ! Number of iterations
    real(8), parameter :: sigma = 0.02d0 ! Standard deviation of the preference shock

    real(8) :: Vbar_arr, Vbar_loop, tmp
    integer :: ii

    real(8) :: t_arr, t_loop
    integer(8) :: t1, t2, t_rate

    !>>>> allocation of vector
    real(8) :: v_vec5(n)      !! (Mem:auto)
    !! real(8), save :: v_vec5(n)   !! (Mem:save)
    !! real(8), allocatable :: v_vec5(:); allocate( v_vec5(n) )  !! (Mem:alloc)
    !<<<<

    print *, "n = ", n
    v_vec5 = [( 1.0d0 / ii, ii = 1, n)]

    !------------------------------
    !! Use fun_logsum_arr().
    
    call system_clock(t1)

    tmp = 0
    do ii = 1, Nsim
        tmp = tmp + fun_logsum_arr(v_vec5, sigma, n)
    enddo
    Vbar_arr = tmp / Nsim

    call system_clock(t2, t_rate)
    t_arr = (t2 - t1) / dble(t_rate)

    !------------------------------
    !! Use fun_logsum_loops().

    call system_clock(t1)

    tmp = 0
    do ii = 1, Nsim
        tmp = tmp + fun_logsum_loops(v_vec5, sigma, n)
    enddo
    Vbar_loop = tmp / Nsim

    call system_clock(t2, t_rate)
    t_loop = (t2 - t1) / dble(t_rate)

    !------------------------------

    print *, "Vbar_arr  = ", Vbar_arr
    print *, "Vbar_loop = ", Vbar_loop
    print *
    print "(a,f17.8,a)", "time (fun_logsum_arr)   : ", t_arr,   " seconds"
    print "(a,f17.8,a)", "time (fun_logsum_loops) : ", t_loop,  " seconds"
    
end subroutine

Result (macM1 + gfortran-11 -O3)

n = 5
(Mem:auto)
time (fun_logsum_arr)   :        0.13470300 seconds
time (fun_logsum_loops) :        0.09383300 seconds
(Mem:save)
time (fun_logsum_arr)   :        2.04203100 seconds
time (fun_logsum_loops) :        0.09372100 seconds
(Mem:alloc)
time (fun_logsum_arr)   :        0.13349300 seconds
time (fun_logsum_loops) :        0.09369500 seconds

n = 17
(Mem:auto)
time (fun_logsum_arr)   :        0.13826400 seconds
time (fun_logsum_loops) :        0.09373500 seconds
(Mem:save)
time (fun_logsum_arr)   :        6.34116300 seconds
time (fun_logsum_loops) :        0.09369600 seconds
(Mem:alloc)
time (fun_logsum_arr)   :        5.52861200 seconds
time (fun_logsum_loops) :        0.09375800 seconds

n = 18
(Mem:auto)
time (fun_logsum_arr)   :        6.09161000 seconds
time (fun_logsum_loops) :        5.68330500 seconds
(Mem:save)
time (fun_logsum_arr)   :        6.27594900 seconds
time (fun_logsum_loops) :        5.81525500 seconds
(Mem:alloc)
time (fun_logsum_arr)   :        6.08113200 seconds
time (fun_logsum_loops) :        5.68257800 seconds

n = 30
(Mem:auto)
time (fun_logsum_arr)   :        9.38762100 seconds
time (fun_logsum_loops) :        9.37687700 seconds
(Mem:save)
time (fun_logsum_arr)   :        9.61152000 seconds
time (fun_logsum_loops) :        9.40967200 seconds
(Mem:alloc)
time (fun_logsum_arr)   :        9.58302400 seconds
time (fun_logsum_loops) :        9.40681800 seconds
1 Like

Thanks! I will test your code on my computer.

Trivial question: how do I do the Fortran syntax highlighting when I copy and paste a piece of code? I used to enclose code in ``fortran CODE HERE``` but it does not work as before.

For timing different implementation of the same code, you may be interested in (self advertisement here :face_with_peeking_eye:) benchmark.f. It will also collect hardware and compiler info to ease the comparison.

2 Likes

I experimented with this last version a little, and here is what I get on MacOS with M2 arm64 hardware for n=30 and with the (Mem:auto) declaration line for the array.

$ gfortran -O3 logsum.f90 && a.out
 n =           30
 Vbar_arr  =    1.0000000000000002     
 Vbar_loop =    1.0000000000000002     

time (fun_logsum_arr)   :       10.81233600 seconds
time (fun_logsum_loops) :       10.77006100 seconds

$ flang -O3 logsum.f90 && a.out
 n =  30
 Vbar_arr  =  1.0000000000000002
 Vbar_loop =  1.0000000000000002

time (fun_logsum_arr)   :        0.10813100 seconds
time (fun_logsum_loops) :        0.08587600 seconds

I see a huge difference in the timing with the two compilers. I don’t know why. It is not some anomaly with the timer, I really do see those differences when I run the code. Also, if I change Nsim, then the timings change accordingly, so it is not an issue with one compiler optimizing away the Nsim loop. Basically, when n is changed, the flang timings are almost constant, whereas the gfortran timings change as described previously. This is with gfortran 15.1.0 and flang 20.1.8.

There are a few style comments I might make about using REAL(8) instead of REAL(wp), and using dble() and the 0.0d0 constant within the code. Also, there is the sigma argument that is inconsistent with the documentation comments within the two functions. But let’s skip over that and focus on just the timing question. Anyone know what might be going on here?

edit: Some further experimentation shows this:

$ flang -O3 logsum.f90 && a.out
 n =  33
 Vbar_arr  =  1.0000000000000002
 Vbar_loop =  1.0000000000000002

time (fun_logsum_arr)   :        0.10870100 seconds
time (fun_logsum_loops) :        0.08581800 seconds

$ flang -O3 logsum.f90 && a.out
 n =  34
 Vbar_arr  =  1.0000000000000002
 Vbar_loop =  1.0000000000000002

time (fun_logsum_arr)   :        8.98841500 seconds
time (fun_logsum_loops) :        8.96371500 seconds

So like the gfortran timings, there is an abrupt change in the flang timings with the array dimension, it just occurs for a different value of n.

2 Likes

In my case, I just enclose a code with triple backticks followed by the language name “fortran”

```fortran
program main
end
```
which gives

program main
end

Possibly the behavior also depends on browsers etc…?

1 Like

After reading Ron’s post, I’ve just noticed that my modified code above may also be affected by compiler optimizations that eliminate most of the calculations, because the loop body consists of the evaluation of fun_logsum_xxx() that is independent of the loop index (so essentially “constant”…). So, to avoid such a potential optimization, I’ve tried another modified code that changes the first element of v_vec5 in each iteration, like the following:

program main
    call test_ver2()
end program

subroutine test_ver2()
    use mymodule, only: fun_logsum_arr => fun_logsum, fun_logsum_loops
    implicit none

    integer, parameter :: Nsim = 100000000 ! Number of iterations
    real(8), parameter :: sigma = 0.02d0 ! Standard deviation of the preference shock

    real(8) :: Vbar_arr, Vbar_loop, tmp
    integer :: ii

    real(8) :: t_arr, t_loop
    integer(8) :: t1, t2, t_rate

    !>>>>
    !![ vector size ]
    !! integer, parameter :: n = 2
    integer, parameter :: n = 5
    !! integer, parameter :: n = 17
    !! integer, parameter :: n = 18
    !! integer, parameter :: n = 30
    !! integer, parameter :: n = 34
    !<<<<

    !>>>>
    !![ memory allocation ]
    real(8) :: v_vec5(n)         !! (Mem:auto)
    !! real(8), save :: v_vec5(n)   !! (Mem:save)
    !! real(8), allocatable :: v_vec5(:); allocate( v_vec5(n) )  !! (Mem:alloc)
    !<<<<

    print *, "n = ", n
    v_vec5 = [( 1.0d0 / ii, ii = 1, n)]

    !------------------------------
    !! Use fun_logsum_arr().
    
    call system_clock(t1)

    tmp = 0
    do ii = 1, Nsim
        v_vec5(1) = 1.0d0 / ii
        tmp = tmp + fun_logsum_arr(v_vec5, sigma, n)
    enddo
    Vbar_arr = tmp / Nsim

    call system_clock(t2, t_rate)
    t_arr = (t2 - t1) / dble(t_rate)

    !------------------------------
    !! Use fun_logsum_loops().

    call system_clock(t1)

    tmp = 0
    do ii = 1, Nsim
        v_vec5(1) = 1.0d0 / ii
        tmp = tmp + fun_logsum_loops(v_vec5, sigma, n)
    enddo
    Vbar_loop = tmp / Nsim

    call system_clock(t2, t_rate)
    t_loop = (t2 - t1) / dble(t_rate)

    !------------------------------

    print *, "Vbar_arr  = ", Vbar_arr
    print *, "Vbar_loop = ", Vbar_loop
    print *
    print "(a,f17.8,a)", "time (fun_logsum_arr)   : ", t_arr,   " seconds"
    print "(a,f17.8,a)", "time (fun_logsum_loops) : ", t_loop,  " seconds"
    
end subroutine

Then, it seems that the timing becomes more “regular” than the previous results (please see blow). But, it is still not very clear to me how such a potential optimization (e.g. for constant expressions) is affected by n and memory allocation in the previous results…

Anyway, the updated timing of the newer code (test_ver2() above) are like these. (I’ve measured the timing only once, so there may be some fluctuations. From these data, it seems that flang-20 is faster for very short vectors, while gfortran-11 is faster for longer vectors, though the speed may vary with different options…)

[ macM1 + gfortran-11 -O3 ]

------------------------------
n = 2
(Mem:auto)
time (fun_logsum_arr)   :        1.10061400 seconds
time (fun_logsum_loops) :        0.79768200 seconds
(Mem:save)
time (fun_logsum_arr)   :        1.08158700 seconds
time (fun_logsum_loops) :        1.06369200 seconds
(Mem:alloc)
time (fun_logsum_arr)   :        1.24887400 seconds
time (fun_logsum_loops) :        0.79780700 seconds
------------------------------
n = 5
(Mem:auto)
time (fun_logsum_arr)   :        2.41570200 seconds
time (fun_logsum_loops) :        1.93685900 seconds
(Mem:save)
time (fun_logsum_arr)   :        2.31261500 seconds
time (fun_logsum_loops) :        2.15986400 seconds
(Mem:alloc)
time (fun_logsum_arr)   :        2.11550800 seconds
time (fun_logsum_loops) :        1.93653800 seconds
------------------------------
n = 17
(Mem:auto)
time (fun_logsum_arr)   :        7.12022300 seconds
time (fun_logsum_loops) :        6.00388600 seconds
(Mem:save)
time (fun_logsum_arr)   :        6.72878300 seconds
time (fun_logsum_loops) :        7.54748400 seconds
(Mem:alloc)
time (fun_logsum_arr)   :        5.58244600 seconds
time (fun_logsum_loops) :        5.40773300 seconds
------------------------------
n = 18
(Mem:auto)
time (fun_logsum_arr)   :        6.16025200 seconds
time (fun_logsum_loops) :        5.85428700 seconds
(Mem:save)
time (fun_logsum_arr)   :        6.25142200 seconds
time (fun_logsum_loops) :        5.85311600 seconds
(Mem:alloc)
time (fun_logsum_arr)   :        6.07515800 seconds
time (fun_logsum_loops) :        5.85482800 seconds
------------------------------
n = 30
(Mem:auto)
time (fun_logsum_arr)   :        9.74580900 seconds
time (fun_logsum_loops) :        9.42603800 seconds
(Mem:save)
time (fun_logsum_arr)   :        9.74411400 seconds
time (fun_logsum_loops) :        9.44398600 seconds
(Mem:alloc)
time (fun_logsum_arr)   :        9.74785300 seconds
time (fun_logsum_loops) :        9.43061000 seconds
------------------------------
n = 34
(Mem:auto)
time (fun_logsum_arr)   :       10.85358000 seconds
time (fun_logsum_loops) :       10.58217700 seconds
(Mem:save)
time (fun_logsum_arr)   :       10.85584600 seconds
time (fun_logsum_loops) :       10.55483500 seconds
(Mem:alloc)
time (fun_logsum_arr)   :       10.85865900 seconds
time (fun_logsum_loops) :       10.58505700 seconds
------------------------------

[ macM1 + flang-20.1 -O3 ]

------------------------------
n = 2
(Mem:auto)
time (fun_logsum_arr)   :        0.94063100 seconds
time (fun_logsum_loops) :        0.88547500 seconds
(Mem:save)
time (fun_logsum_arr)   :        0.93390800 seconds
time (fun_logsum_loops) :        0.88412700 seconds
(Mem:alloc)
time (fun_logsum_arr)   :        1.15461800 seconds
time (fun_logsum_loops) :        1.09323100 seconds
------------------------------
n = 5
(Mem:auto)
time (fun_logsum_arr)   :        1.73628800 seconds
time (fun_logsum_loops) :        1.68453700 seconds
(Mem:save)
time (fun_logsum_arr)   :        1.72946000 seconds
time (fun_logsum_loops) :        1.69121300 seconds
(Mem:alloc)
time (fun_logsum_arr)   :        2.23714500 seconds
time (fun_logsum_loops) :        2.11967700 seconds
------------------------------
n = 17
(Mem:auto)
time (fun_logsum_arr)   :        5.01231200 seconds
time (fun_logsum_loops) :        4.92869400 seconds
(Mem:save)
time (fun_logsum_arr)   :        4.81544900 seconds
time (fun_logsum_loops) :        4.77432100 seconds
(Mem:alloc)
time (fun_logsum_arr)   :        9.76461800 seconds
time (fun_logsum_loops) :        9.65898600 seconds
------------------------------
n = 18
(Mem:auto)
time (fun_logsum_arr)   :        5.27591300 seconds
time (fun_logsum_loops) :        5.19503600 seconds
(Mem:save)
time (fun_logsum_arr)   :        6.95131800 seconds
time (fun_logsum_loops) :        6.83111100 seconds
(Mem:alloc)
time (fun_logsum_arr)   :       10.11385300 seconds
time (fun_logsum_loops) :       10.00436800 seconds
------------------------------
------------------------------
n = 30
(Mem:auto)
time (fun_logsum_arr)   :        8.50785500 seconds
time (fun_logsum_loops) :        8.46470500 seconds
(Mem:save)
time (fun_logsum_arr)   :       11.26487300 seconds
time (fun_logsum_loops) :       11.16486700 seconds
(Mem:alloc)
time (fun_logsum_arr)   :       14.47837800 seconds
time (fun_logsum_loops) :       14.36105300 seconds
------------------------------
n = 34
(Mem:auto)
time (fun_logsum_arr)   :       13.62200400 seconds
time (fun_logsum_loops) :       13.51473400 seconds
(Mem:save)
time (fun_logsum_arr)   :       13.62338600 seconds
time (fun_logsum_loops) :       13.51262800 seconds
(Mem:alloc)
time (fun_logsum_arr)   :       13.73558600 seconds
time (fun_logsum_loops) :       13.66850400 seconds
------------------------------
1 Like

Regarding style, what is the recommended way of declaring a double precision variable? Something like

integer, parameter :: dp = kind(0.0)

And then declare double precision variables as

real(dp) :: Vbar

I have read about this, but is more verbose than real(8). Do you think it matters in this example?

Also, how would you write 0.0d0?

Thanks for any feedback/comments!

If your original code had been written in this portable way, then I could have easily used the nagfor compiler too. (Actually nagfor has options to change its default behavior, but I never use them, so it would have taken me some effort to either fix the code or to look up the compiler option.) I think there are other compilers for which the kind value of 8 is either not supported or not the desired value.

Here are some ways to specify the desired KIND value. The best option depends on various factors within your code.

integer, parameter :: wp = kind(0.0D0)
integer, parameter :: wp = selected_real_kind(14)
use, intrinsic :: iso_fortran_env, only: wp=>real64

0.0_wp

1 Like

Is that why there is a jump in the timings? If the compiler were doing that optimization, then why would changing the array length from 17 to 18 change that optimization?

I originally thought that maybe it was a cache effect, but then the same question, why would changing the array length from 17 to 18 on a machine with 128kB of level-1 data cache make a difference?

2 Likes

I have modified the test program to:

  • test alternative values for n
  • modified the v_vec5 values, as the supplied code had v_vec5 = 0 for timing tests.
  • overcome the -O3 Gfortran efficiency which removed the function call code
  • included a delta_second elapsed time measure, based on System_Clock 64-bit integers for better Gfortran/win64 precision clock. Note cpu_time is an imprecise timer on Win32/64 and some other OS.
  • use more general integer and real kind parameters for 64-bit variables

This modified code now shows a significant increase in run time than the initial posted code when using Gfortran -O3.

However there is a consistent increase in runtime for the array syntax method vs the loop method for most reported tests in this thread, although only small.

This may be due to the “Step 2” where “exp((V-max_val)/sigma)” may be replaced by a temporary array before performing the sum.

! Step 2: Compute sum of exp( (v_i - maxval)/sigma )
sum_exp = sum( exp((V-max_val)/sigma) )

While array syntax is helpful in coding, the compiler still has to implement some effective loop calculation. Both the array and loop syntax approaches would be modified by the compiler for AVX instructions.

The small difference in run times suggests to me that AVX instructions are being utilised for both coding methods and the use of either coding approach depends on user preference for documentation of the calculation.

My altered code is:

module precision
    integer, parameter :: r8 = kind (1.d00)
    integer, parameter :: i8 = selected_int_kind ( 12 )
    contains
     function delta_second ()
       real(r8)    :: delta_second
       integer(i8) :: tick, rate, last_tick = 0
       call system_clock ( tick, rate )
       delta_second = dble (tick-last_tick) / dble (rate)
       last_tick    = tick
     end function delta_second
end module precision

module mymodule
    use precision
    implicit none
    private
    public :: fun_logsum, fun_logsum_loops
    contains
    
    ! fun_logsum, fun_logsum_loops
    ! DESCRIPTION
	! Calculates the log-sum. The computation avoids
	! overflows.
	! https://gregorygundersen.com/blog/2020/02/09/log-sum-exp/
	! https://nhigham.com/2021/01/05/what-is-the-log-sum-exp-function/
	! INPUTS
	!   V:     Vector with values of the different choices
	!   sigma: Standard deviation of the taste shock
        !   n:     Size of the vector V
	! OUTPUTS
	!   LogSum: Log sum of exponentials
	! AUTHOR
	! Alessandro Di Nola, March 2022.
    
  function fun_logsum(V,sigma,n) result(LogSum)
    ! Uses array syntax
    implicit none
    ! Declare inputs and function result:
    integer, intent(in) :: n      ! Size of the vector V
    real(r8), intent(in) :: V(n)   ! Input vector
    real(r8), intent(in) :: sigma  ! St. dev. of the preference shock
	real(r8) :: LogSum             ! Result is log(sum(exp(v)))
	! Declare locals:
	real(r8) :: max_val, sum_exp

     ! Step 1: Find the maximum value to ensure numerical stability
       max_val = maxval(V)
	
     ! Step 2: Compute sum of exp( (v_i - maxval)/sigma )
       sum_exp = sum( exp((V-max_val)/sigma) )

     ! Step 3: Return max_val + sigma*log(sum_exp)
       LogSum  = max_val + sigma*log(sum_exp)	
    
  end function fun_logsum
    
  function fun_logsum_loops(V, sigma, n) result(LogSum)
    ! Uses loops
    implicit none
    ! Declare inputs and function result:
    integer, intent(in) :: n      ! Size of the vector V
    real(r8), intent(in) :: V(n)   ! Input vector
    real(r8), intent(in) :: sigma  ! St. dev. of the preference shock
	real(r8) :: LogSum             ! Result is log(sum(exp(v)))
	! Declare locals:
	real(r8) :: max_val, sum_exp
    integer :: i

    ! Step 1: Find the maximum value to ensure numerical stability
      max_val = V(1)
      do i = 2, n
        if (V(i) > max_val) then
            max_val = V(i)
        endif
      enddo

    ! Step 2: Compute sum of exp( (v_i - maxval)/sigma )
      sum_exp = 0.0d0
      do i = 1, n
        sum_exp = sum_exp + exp((V(i) - max_val)/sigma)
      enddo

    ! Step 3: Return max_val + sigma*log(sum_exp)
      LogSum = max_val + sigma*log(sum_exp)
    
  end function fun_logsum_loops

    
end module mymodule
!===============================================================================!
    
subroutine n_test (n)
    use precision
    use mymodule, only: fun_logsum, fun_logsum_loops
    implicit none
    integer :: n ! Number of alternatives

    ! Declare variables:
    real(r8), parameter :: sigma = 0.02d0 ! Standard deviation of the preference shock
    real(r8) :: v_vec5(n)
    real(r8) :: Vbar1,Vbar2,Vbar3    ! , Prob(n)

    integer, parameter :: Nsim = 100000000 ! Number of iterations
    real(r8) :: tic, toc, elap1, tic2, toc2, elap2, tic3, toc3, elap3
    integer  :: ii
    
    write (*,'(//,A,i5)') "### Test for n: ", n

    ! Execution starts here:
    
    ! CASE 1: simple
    v_vec5 = [(ii*0.1d0, ii=1,n)] ! Example vector of utilities
    Vbar1 = fun_logsum(v_vec5, sigma, n)
    Vbar2 = fun_logsum_loops(v_vec5, sigma, n)
    write(*,11) "v_vec5 small is: ", v_vec5
    write(*,*) "fun_logsum:       ", Vbar1
    write(*,*) "fun_logsum_loops: ", Vbar2, vbar2-Vbar1
    
    ! CASE 2: big numbers
    v_vec5 = v_vec5 * 1.d10   ! Example vector of utilities
    Vbar1 = fun_logsum(v_vec5, sigma, n)
    Vbar2 = fun_logsum_loops(v_vec5, sigma, n)
    
    write(*,11) "v_vec5 large is: ", v_vec5
    write(*,*) "fun_logsum:       ", Vbar1
    write(*,*) "fun_logsum_loops: ", Vbar2, vbar2-Vbar1
    
    ! CASE 3: all zeros
    v_vec5 = 0.d0            ! Example vector of utilities
    Vbar1 = fun_logsum(v_vec5, sigma, n)
    Vbar2 = fun_logsum_loops(v_vec5, sigma, n)
    
    write(*,11) "v_vec5 zero is: ", v_vec5
    write(*,*) "fun_logsum:       ", Vbar1
    write(*,*) "fun_logsum_loops: ", Vbar2, vbar2-Vbar1

 11 format ( A/ (5E15.6/) )   

    if ( n==5 ) then    
!z    pause "Press Enter to continue..."
    end if
    
    write(*,*) "Now testing speed...", Nsim
    v_vec5 = [(ii*0.1d0, ii=1,n)] ! Example vector of utilities

! test loops first
    call cpu_time(tic2) ! Start timer
    elap2 = delta_second ()
    Vbar2 = 0
    do ii = 1, Nsim
        v_vec5(1) = ii*1.d-10
!        v_vec5(1) = v_vec5(1) + Vbar2*1.d-10
        Vbar2 = Vbar2 + fun_logsum_loops (v_vec5, sigma, n)
    enddo
    call cpu_time(toc2) ! Stop timer
    elap2 = delta_second ()

! test array function       
    call cpu_time(tic) ! Start timer
    elap1 = delta_second ()
    Vbar1 = 0
    do ii = 1, Nsim
        v_vec5(1) = ii*1.d-10
        Vbar1 = Vbar1 + fun_logsum (v_vec5, sigma, n)
    enddo
    call cpu_time(toc) ! Stop timer
    elap1 = delta_second ()
    
! test no function       
    call cpu_time(tic3) ! Start timer
    elap3 = delta_second ()
    Vbar3 = 0
    do ii = 1, Nsim
        v_vec5(1) = ii*1.d-10
        Vbar3 = Vbar3 + v_vec5(1)
    enddo
    call cpu_time(toc3) ! Stop timer
    elap3 = delta_second ()

! times
    write(*,*) "Time taken for fun_logsum:       ", toc-tic,   elap1, " seconds : sum", Vbar1
    write(*,*) "Time taken for fun_logsum_loops: ", toc2-tic2, elap2, " seconds : sum", Vbar2
    write(*,*) "Time taken for no fun:           ", toc3-tic3, elap3, " seconds : sum", Vbar3

end subroutine n_test

program main
  use precision

  write (*,*) i8,r8
  call n_test (5) 
  call n_test (15) 
  call n_test (25) 
end program main

ps : This new default rich text editor is not an improvement for me !!
Needs a warning ?

1 Like

If you want to avoid verbosity then write 0d0 instead of 0.0d0. But that’s only if you want double precision instead of 15 significant digits: they are the same with many compilers but not all.

1 Like

I have updated my test following some helpful suggestions by @RonShepard and @septc

module mymodule
    implicit none
    
    integer, parameter :: dp = kind(0.0d0)
    
    private
    public :: dp, fun_logsum_arr, fun_logsum_loops
    
    contains
    
    function fun_logsum_arr(V,sigma,n) result(LogSum)
    ! Uses array syntax
    implicit none
    ! Declare inputs and function result:
    real(dp), intent(in) :: V(n)   ! Input vector
    real(dp), intent(in) :: sigma  ! St. dev. of the preference shock
    integer, intent(in) :: n      ! Size of the vector V
	real(dp) :: LogSum             ! Result is log(sum(exp(v)))
	! Declare locals:
	real(dp) :: max_val, sum_exp

     ! Step 1: Find the maximum value to ensure numerical stability
    max_val = maxval(V)
	
	! Step 2: Compute sum of exp( (v_i - maxval)/sigma )
	sum_exp = sum( exp((V-max_val)/sigma) )
    
    ! Step 3: Return max_val + sigma*log(sum_exp)
	LogSum = max_val + sigma*log(sum_exp)	
    
    end function fun_logsum_arr
    
function fun_logsum_loops(V, sigma, n) result(LogSum)
    ! Uses loops
    implicit none
    ! Declare inputs and function result:
    real(dp), intent(in) :: V(n)   ! Input vector
    real(dp), intent(in) :: sigma  ! St. dev. of the preference shock
    integer, intent(in) :: n      ! Size of the vector V
	real(dp) :: LogSum             ! Result is log(sum(exp(v)))
	! Declare locals:
	real(dp) :: max_val, sum_exp
    integer :: i

    ! Step 1: Find the maximum value to ensure numerical stability
    max_val = V(1)
    do i = 2, n
        if (V(i) > max_val) then
            max_val = V(i)
        endif
    enddo

    ! Step 2: Compute sum of exp( (v_i - maxval)/sigma )
    sum_exp = 0.0d0
    do i = 1, n
        sum_exp = sum_exp + exp((V(i) - max_val)/sigma)
    enddo

    ! Step 3: Return max_val + sigma*log(sum_exp)
    LogSum = max_val + sigma*log(sum_exp)
    
end function fun_logsum_loops

    
end module mymodule
!===============================================================================!
    
program main
    use mymodule, only: dp, fun_logsum_arr, fun_logsum_loops
    implicit none
    ! Declare variables:
    integer, parameter :: n = 5 ! Number of alternatives
    integer, parameter :: Nsim = 1000000000 ! Number of iterations
    real(dp), parameter :: sigma = 0.1_dp ! Standard deviation of the preference shock
    real(dp) :: v_vec5(n), Vbar, Vbar2, Prob(n), Vbar_arr, Vbar_loop, tmp
    integer :: ii
    integer(8) :: tic, toc, tic2, toc2, t_rate
    real(8) :: t_arr, t_loop
    
    ! Execution starts here:
    
    ! CASE 1: simple
    v_vec5 = [0.1d0, 0.2_dp, 0.3_dp, 0.4_dp, 0.5_dp] ! Example vector of utilities
    Vbar = fun_logsum_arr(v_vec5, sigma, n)
    Vbar2 = fun_logsum_loops(v_vec5, sigma, n)
    
    write(*,'(A,5E15.6)') "v_vec5 is: ", v_vec5
    write(*,*) "fun_logsum_arr: ", Vbar
    write(*,*) "fun_logsum_loops: ", Vbar2
    
    write(*,*) "Now testing speed..."
    
    call system_clock(tic) ! Start timer
    tmp = 0.0_dp
    do ii = 1, Nsim
        tmp = tmp + fun_logsum_arr(v_vec5, sigma, n)
    enddo
    Vbar_arr = tmp / Nsim ! Average over Nsim iterations
    call system_clock(toc,t_rate) ! Stop timer
    t_arr = (toc - tic)/real(t_rate,dp) ! Calculate elapsed time for array version
    
    call system_clock(tic2) ! Start timer
    tmp = 0.0_dp
    do ii = 1, Nsim
        tmp = tmp + fun_logsum_loops(v_vec5, sigma, n)
    enddo
    Vbar_loop = tmp / Nsim ! Average over Nsim iterations
    call system_clock(toc2,t_rate) ! Stop timer
    t_loop = (toc2 - tic2)/real(t_rate,dp) ! Calculate elapsed time for loop version
       
    print *, "Vbar_arr  = ", Vbar_arr
    print *, "Vbar_loop = ", Vbar_loop
    print *
    write(*,"(a,f17.8,a)") "time (fun_logsum_arr)   : ", t_arr,   " seconds"
    write(*,"(a,f17.8,a)") "time (fun_logsum_loops) : ", t_loop,  " seconds"
    
end program main

Strangely, I get different results now.

v_vec5 is:    0.100000E+00   0.200000E+00   0.300000E+00   0.400000E+00   0.500000E+00
fun_logsum_arr:   0.545191439593759
fun_logsum_loops:   0.545191439593759

Now testing speed…
Vbar_arr  =   0.545191437747621
Vbar_loop =   0.545191437747621

time (fun_logsum_arr)   :        0.21900000 seconds
time (fun_logsum_loops) :        0.20900000 seconds
Press any key to continue . . .

Now it seems that the two versions take basically the same time. Given this, I prefer the array version since it is more concise.

P.S. In my code, `n` is typically a small number.

P.P.S. @RonShepard Following your suggestion, I have declared the parameter dp in the numerical module mymodule and I use it in the subprograms of mymodule and in the main program. However, in a more complex project, all files should see dp so every file should use mymodule. An alternative is to define dp in each module and in the main program, but it becomes verbose. Is there a recommended way of doing this?

1 Like

There is a further possible optimization in your array code, that is, to:

  • only perform one division
  • extract the constant exponential from the loop and apply it only once (will reduce accuracy for large numbers):
pure real(dp) function fun_logsum_optim(V, sigma, n) result(LogSum)
    integer, intent(in) :: n
    real(dp), intent(in) :: V(n)
    real(dp), intent(in) :: sigma
    real(dp) :: max_val, rsigma, sum_exp
 
    max_val = maxval(V)
    rsigma  = 1.0_dp / sigma
    sum_exp = sum(exp(V*rsigma))*exp(-max_val*rsigma)
    LogSum  = max_val + sigma * log(sum_exp)
end function fun_logsum_optim

Of course n=5 is not great for vectorization, but for n=64 on my Mac, I get:

...
 Vbar_arr  =    6.4458675156397689     
 Vbar_loop =    6.4458675156397689     
 Vbar_loop =    6.4458675156397689     
time (fun_logsum_arr)   :        2.51893000 seconds
time (fun_logsum_loops) :        2.49575800 seconds
time (fun_logsum_optim) :        0.00932900 seconds
Here is the complete implementation
module mymodule
    implicit none
    
    integer, parameter :: dp = kind(0.0d0)
    
    private
    public :: dp, fun_logsum_arr, fun_logsum_loops, fun_logsum_optim
    
    contains
    
    function fun_logsum_arr(V,sigma,n) result(LogSum)
    ! Uses array syntax
    implicit none
    ! Declare inputs and function result:
    real(dp), intent(in) :: V(n)   ! Input vector
    real(dp), intent(in) :: sigma  ! St. dev. of the preference shock
    integer, intent(in) :: n      ! Size of the vector V
	real(dp) :: LogSum             ! Result is log(sum(exp(v)))
	! Declare locals:
	real(dp) :: max_val, sum_exp

     ! Step 1: Find the maximum value to ensure numerical stability
    max_val = maxval(V)
	
	! Step 2: Compute sum of exp( (v_i - maxval)/sigma )
	sum_exp = sum( exp((V-max_val)/sigma) )
    
    ! Step 3: Return max_val + sigma*log(sum_exp)
	LogSum = max_val + sigma*log(sum_exp)	
    
    end function fun_logsum_arr
    
function fun_logsum_loops(V, sigma, n) result(LogSum)
    ! Uses loops
    implicit none
    ! Declare inputs and function result:
    real(dp), intent(in) :: V(n)   ! Input vector
    real(dp), intent(in) :: sigma  ! St. dev. of the preference shock
    integer, intent(in) :: n      ! Size of the vector V
	real(dp) :: LogSum             ! Result is log(sum(exp(v)))
	! Declare locals:
	real(dp) :: max_val, sum_exp
    integer :: i

    ! Step 1: Find the maximum value to ensure numerical stability
    max_val = V(1)
    do i = 2, n
        if (V(i) > max_val) then
            max_val = V(i)
        endif
    enddo

    ! Step 2: Compute sum of exp( (v_i - maxval)/sigma )
    sum_exp = 0.0d0
    do i = 1, n
        sum_exp = sum_exp + exp((V(i) - max_val)/sigma)
    enddo

    ! Step 3: Return max_val + sigma*log(sum_exp)
    LogSum = max_val + sigma*log(sum_exp)
    
end function fun_logsum_loops

pure real(dp) function fun_logsum_optim(V, sigma, n) result(LogSum)
    integer, intent(in) :: n
    real(dp), intent(in) :: V(n)
    real(dp), intent(in) :: sigma
    real(dp) :: max_val, rsigma, sum_exp
 
    max_val = maxval(V)
    rsigma  = 1.0_dp / sigma
    sum_exp = sum(exp(V*rsigma))*exp(-max_val*rsigma)
    LogSum  = max_val + sigma * log(sum_exp)
end function fun_logsum_optim

end module mymodule
!===============================================================================!
    
program main
    use mymodule, only: dp, fun_logsum_arr, fun_logsum_loops, fun_logsum_optim
    implicit none
    ! Declare variables:
    integer, parameter :: n = 64 ! Number of alternatives
    integer, parameter :: Nsim = 10000000 ! Number of iterations
    real(dp), parameter :: sigma = 0.1_dp ! Standard deviation of the preference shock
    real(dp) :: v_vec5(n), Vbar, Vbar2, Vbar3, Prob(n), Vbar_arr, Vbar_loop, Vbar_optim, tmp
    integer :: ii
    integer(8) :: tic, toc, tic2, toc2, tic3, toc3, t_rate
    real(8) :: t_arr, t_loop, t_optim
    
    ! Execution starts here:
    
    ! CASE 1: simple
    v_vec5 = [(0.1_dp*ii, ii=1,n)] ! Example vector of utilities
    Vbar = fun_logsum_arr(v_vec5, sigma, n)
    Vbar2 = fun_logsum_loops(v_vec5, sigma, n)
    Vbar3 = fun_logsum_optim(v_vec5, sigma, n)
    
    write(*,'(A,5E15.6)') "v_vec5 is: ", v_vec5(:5)
    write(*,*) "fun_logsum_arr  : ", Vbar
    write(*,*) "fun_logsum_loops: ", Vbar2
    write(*,*) "fun_logsum_optim: ", Vbar3
    
    write(*,*) "Now testing speed..."
    
    call system_clock(tic) ! Start timer
    tmp = 0.0_dp
    do ii = 1, Nsim
        tmp = tmp + fun_logsum_arr(v_vec5, sigma, n)
    enddo
    Vbar_arr = tmp / Nsim ! Average over Nsim iterations
    call system_clock(toc,t_rate) ! Stop timer
    t_arr = (toc - tic)/real(t_rate,dp) ! Calculate elapsed time for array version
    
    call system_clock(tic2) ! Start timer
    tmp = 0.0_dp
    do ii = 1, Nsim
        tmp = tmp + fun_logsum_loops(v_vec5, sigma, n)
    enddo
    Vbar_loop = tmp / Nsim ! Average over Nsim iterations
    call system_clock(toc2,t_rate) ! Stop timer
    t_loop = (toc2 - tic2)/real(t_rate,dp) ! Calculate elapsed time for loop version
       
    call system_clock(tic3) ! Start timer
    tmp = 0.0_dp
    do ii = 1, Nsim
        tmp = tmp + fun_logsum_optim(v_vec5, sigma, n)
    enddo
    Vbar_optim = tmp / Nsim ! Average over Nsim iterations
    call system_clock(toc3,t_rate) ! Stop timer
    t_optim = (toc3 - tic3)/real(t_rate,dp) ! Calculate elapsed time for loop version       
       
    print *, "Vbar_arr  = ", Vbar_arr
    print *, "Vbar_loop = ", Vbar_loop
    print *, "Vbar_loop = ", Vbar_optim
    print *
    write(*,"(a,f17.8,a)") "time (fun_logsum_arr)   : ", t_arr,   " seconds"
    write(*,"(a,f17.8,a)") "time (fun_logsum_loops) : ", t_loop,  " seconds"
    write(*,"(a,f17.8,a)") "time (fun_logsum_optim) : ", t_optim,  " seconds"
    
end program main

maybe something is getting optimized away? but then, why only the optimized function and none of the previous, that have identical signature?

1 Like

Thanks! just to understand, with the first trick you replaced n divisisions with n multiplications? I am not sure if I can do the second trick since accuracy is very important in my project and it is the main reason why I am computing the logsum in this way.

I implemented only your first suggestion as follows:

function fun_logsum_optim(V,sigma,n) result(LogSum)
    ! Uses array syntax
    implicit none
    ! Declare inputs and function result:
    real(dp), intent(in) :: V(n)   ! Input vector
    real(dp), intent(in) :: sigma  ! St. dev. of the preference shock
    integer, intent(in) :: n      ! Size of the vector V
	real(dp) :: LogSum             ! Result is log(sum(exp(v)))
	! Declare locals:
	real(dp) :: max_val, sum_exp, rsigma

     ! Step 1: Find the maximum value to ensure numerical stability
    max_val = maxval(V)
	rsigma  = 1.0_dp / sigma
    
	! Step 2: Compute sum of exp( (v_i - maxval)/sigma )
	sum_exp = sum( exp((V-max_val)*rsigma) )
    
    ! Step 3: Return max_val + sigma*log(sum_exp)
	LogSum = max_val + sigma*log(sum_exp)	
    
end function fun_logsum_optim

With these parameters,

integer, parameter :: n = 64 ! Number of alternatives
integer, parameter :: Nsim = 10000000 ! Number of iterations
real(dp), parameter :: sigma = 0.1_dp ! Standard deviation of the preference shock

I got these runtimes results:

 Now testing speed...
 Vbar_arr   =    6.44586751471583
 Vbar_loop  =    6.44586751471583
 Vbar_optim =    6.44586751471583

time (fun_logsum_arr)   :        1.36000000 seconds
time (fun_logsum_loops) :        1.01200000 seconds
time (fun_logsum_optim) :        1.38700000 seconds
Press any key to continue . . .

So the loop-based version is still the fastest. The oprimized version is actually slightly slower.

One should be careful with timing such loops. A smart compiler might chose to do the following (here compiled with gfortran -O3 -march=skylake -ffast-math -fno-inline):

        call    _gfortran_system_clock_8 
        mov     edx, OFFSET FLAT:.LC6
        mov     esi, OFFSET FLAT:.LC7
        mov     rdi, r13
        call    __mymodule_MOD_fun_logsum_optim.constprop.0
        mov     r14, QWORD PTR [rsp+144]
        xor     eax, eax
        vbroadcastsd    ymm3, xmm0
        vmulsd  xmm0, xmm0, QWORD PTR .LC18[rip]
        vbroadcastsd    ymm2, xmm0
        vmulpd  ymm0, ymm3, YMMWORD PTR .LC19[rip]
.L11:
        inc     eax
        vmovapd ymm1, ymm0
        vaddpd  ymm0, ymm0, ymm2
        cmp     eax, 2500000
        jne     .L11
        vaddpd  ymm0, ymm1, ymm3
        xor     edx, edx
        lea     rsi, [rsp+160]
        lea     rdi, [rsp+152]
        vextractf128    xmm0, ymm0, 0x1
        vunpckhpd       xmm0, xmm0, xmm0
        vmulsd  xmm0, xmm0, QWORD PTR .LC20[rip]
        vmovsd  QWORD PTR [rsp+88], xmm0
        vzeroupper
        call    _gfortran_system_clock_8

Notice the problem? The function is pure, so the result should be the same if the arguments don’t change. Why evaluate it within the loop (.L11), if you can just evaluate the result once? :wink:

Also related to the difference in timings observed, studying the output of -fopt-info shows that for small n, the compiler may choose to just inline and unroll the whole function. Even if you add -fno-inline, the compiler might do constant propagation (notice the .constprop.0), and just compile the variant you need.

To avoid some of these timing artefacts, it might be useful to place the kernels into a separate translation unit (read separate file), compile that first, and then link it with the main program.

Without the -ffast-math it’s perhaps a bit more obvious:

        call    _gfortran_system_clock_8
        mov     edx, OFFSET FLAT:.LC42
        mov     esi, OFFSET FLAT:.LC43
        mov     rdi, r14
        call    __mymodule_MOD_fun_logsum_optim.constprop.0
        mov     r13, QWORD PTR [rsp+144]
        mov     eax, 10000000                ; loop_ii = Nsim
        vmovapd xmm1, xmm0                   ; xmm1 = fun_logsum_optim(v_vec5,sigma,n)
        vxorpd  xmm0, xmm0, xmm0             ; tmp = 0
.L138:                                       ; loop unrolled by 2
        vaddsd  xmm0, xmm0, xmm1             ;   tmp += xmm1
        vaddsd  xmm0, xmm0, xmm1             ;   tmp += xmm1
        sub     eax, 2                       ;   loop_ii -= 2
        jne     .L138                        ;  if (loop_ii /= 0) goto .L138
        xor     edx, edx                          ; set count_max argument of system_clock to null pointer
        lea     rsi, [rsp+160]                    ; load address of t_rate for system_clock call
        lea     rdi, [rsp+152]                    ; load address of toc3 for system_clock call
        vdivsd  xmm0, xmm0, QWORD PTR .LC54[rip]  ; tmp = tmp / Nsim
        vmovsd  QWORD PTR [rsp+88], xmm0          ; Vbar_optim = tmp
        call    _gfortran_system_clock_8
3 Likes

Many projects have a single precision_defs.f90 or a my_kinds.f90 file that has a single module within it that defines the various real, integer, logical, etc. kind values. Then each subroutine or module can use precision_defs, only: wp as appropriate. This ensures that the whole project is compiled consistently with the same kind values, and it also makes it trivial to change the kind values for the whole project if you ever want to do that.

3 Likes

No, you definitely do not want to do that, even though they are mathematically the same. The reason you compute exp(V-max_val) in the first place is to eliminate floating point overflows and underflows.

Eliminating the divisions with the reciprocal is a good idea, but the compiler is likely to do that anyway, so it may or may not actually result in any practical change in the run time.

2 Likes

I haven’t double checked it, but I’m guessing that -ffast-math (gfortran, lang) or -fp-model fast (ifx) might be needed for this optimization to take place automatically.

1 Like

I was wondering what was going on with the timings, so this explains part of it. One question though, why does this occur for small values of n but not for larger ones (e.g. 18 in gfortran, 34 in flang)?

1 Like