The advantage of the loop based version is that no temp arrays need to be allocated, and that is probably what is happening in the array expression. If I were concerned about performance over a variety of compilers, I would use the max_val() intrinsic in step 1, and the loop version for step 2 (this would eliminate any temp array allocations), and I would use the rsigma version of the expression (just in case a compiler might not do that automatically).
The other issue with the timings looks like the constant expression problem, which is something that is common for benchmark codes. Getting useful benchmark results often requires tricking the compiler somehow so that it does not apply what it thinks is an obvious optimization. Sometimes whole loops are eliminated, sometimes (as in this case), the loop is not eliminated but the function call within it is and is replaced with a single evaluation and multiple references to the stored constant.
This version does everything in one loop instead of two, and it is quite faster on my Mac (removed pure thanks to @ivanpribecās suggestion):
Now testing speed...
Vbar_arr = 6.4458675156397689
Vbar_loop = 6.4458675156397689
Vbar_loop = 6.4458675156397689
time (fun_logsum_arr) : 2.52017300 seconds
time (fun_logsum_loops) : 2.49555300 seconds
time (fun_logsum_optim) : 2.06673700 seconds
function fun_logsum_optim2(V, sigma, n) result(LogSum)
! One-pass, numerically-stable log-sum-exp
implicit none
integer, intent(in) :: n
real(dp), intent(in) :: V(n)
real(dp), intent(in) :: sigma
real(dp) :: LogSum
real(dp) :: rsigma, max_val, sum_exp, factor, Vrel
integer :: i
rsigma = 1.0_dp/sigma
max_val = -huge(1.0_dp)
sum_exp = 0.0_dp
do i = 1, n
Vrel = V(i) - max_val
if (Vrel>0) then
factor = exp(-Vrel*rsigma) ! between old max_val and new max_val==V(i)
sum_exp = sum_exp * factor + 1 ! rescale to new max_val; then, new term is exp(0)
max_val = V(i)
else
sum_exp = sum_exp + exp(Vrel*rsigma)
end if
end do
LogSum = max_val + sigma*log(sum_exp)
end function fun_logsum_optim2
maybe worth checking on other platforms of course. I guess the speed depends on the maxloc of V(i), that in the example, is at n (V being sorted), so, the loop basically always experiences Vrel>0
Iām having all kinds of problems today trying to copy code from my browser window into my text editor window (e.g. missing blocks of code, extra characters). Is this because of some change here in this discussion group?