# Code optimization

I have two functions that need further optimization. Any advice?

``````
module opt
   use ieee_arithmetic, only: ieee_value, ieee_quiet_nan
   integer(8), parameter :: k = 270
   real(8), parameter :: pi = acos(-1d0)

contains

   function roll_bv(n, x) result(res)
      integer(8) :: n, i1, i2
      real(8) :: x(n), res(n)
      !$omp parallel do
      do i1 = 1, k - 1
         res(i1) = ieee_value(0d0, ieee_quiet_nan)
      end do
      !$omp parallel do
      do i1 = k, n
         res(i1) = 0d0
         do i2 = i1 - k + 1, i1 - 1
            res(i1) = res(i1) + pi/2d0*k/(k - 1d0)*abs(x(i2)*x(i2 + 1))
         end do
      end do
   end function

   function lm_jump(n, x, x_roll_bv) result(res)
      integer(8) :: n, i
      real(8) :: x(n), x_roll_bv(n), res, max_z, c, s, z
      real(8), allocatable :: z_vector(:)
      allocate(z_vector(n))
      !$omp parallel do
      do i = 1, n
         z_vector(i) = abs(x(i))/sqrt(x_roll_bv(i)/(k - 2d0))
      end do
      max_z = maxval(z_vector)
      deallocate(z_vector)
      c = (2d0*log(n*1d0))**(1d0/2d0)/8d-1 - &
          (log(pi) + log(log(n*1d0)))/(1.6d0*(2d0*log(pi))**(1d0/2d0))
      s = 1d0/(6d-1*(2d0*log(pi))**(1d0/2d0))
      z = (max_z - c)/s
      res = 1d0 - exp(-exp(-z))
   end function
end module
``````

Have you run them through a profiler? Do you know what the hotspots are? If not I suggest you use `gprof` or Intel’s tool (I don’t remember the name) to gain a better understanding of what is slowing you down.

Before you optimize the code, I’ll humbly suggest that you fix the bug(s).

The code runs fine and generates correct answers.

Well, we don’t know if the code runs correctly, because you did not provide sufficient information (like an entire functioning program). I suspect you either got lucky or `n < k`.

I suspect that there is an error here: s was set equal to the reciprocal of an expression rather than to the expression itself, probably to supplant the division by s in the next line by a faster multiplication by the reciprocal of s. However, the programmer forgot to replace ‘/’ by ‘*’ in that second line.

An inefficiency: `**(1d0/2d0)` can be replaced by `sqrt()`. Some processors may have a floating-point instruction to evaluate the square root in hardware or microcode; the x87 instruction set included SQRT as long ago as 40 years.

As @kargl correctly points out, if n < k the potentially expensive double DO loop in roll_bv() is not executed even once, so those source lines – in fact, the whole function – can be dispensed with.

In lm_jump(), there is an expensive evaluation for c. If the function is called with a single value of n throughout the calculation, c could be calculated during the first call and the resulting value saved for use in future calls.

I hope that you realize that without knowing the usage patterns (which arguments never change in a run, how often a function gets called, which portions of the code consume the most processor time, etc.), we are going to run into Knuth’s aphorism: Premature optimization is the root of all evil.

It’s more than just ‘n < k’ allowing the code to be removed; a more important issue exists for ‘n > k’. The inner statement is `res(i1) = res(i1) + ...`. This is invalid Fortran.

Agreed. The variable `res` is the result variable of the function, so it is undefined at function entry. The first DO loop of the code assigns values to res(1:k-1), so when n >= k, res(k:n) will be undefined.

However, I realize that such errors are not uncommon when some lines of code are lifted out of a longer section of program source and used for the sake of discussion in a forum post.

I would presume that if someone is asking for help optimizing a piece of code, it has not been edited for brevity. It’s all there. Looking at `lm_jump`, both `log` and `exp` can be expensive. Try writing the algorithm to minimize calls to `log`.
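For instance, two of the `log` calls in the expression for c can be merged using the identity log(a) + log(b) = log(a*b). A quick Python check of that identity on these particular terms (the sample n is arbitrary):

```python
import math

n = 50_000                    # arbitrary sample size for the check
log_n = math.log(n)

two_calls = math.log(math.pi) + math.log(log_n)  # as in the posted code
one_call = math.log(math.pi * log_n)             # one log call instead of two

assert abs(two_calls - one_call) < 1e-12
```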

Thanks for pointing this out. I have added
`res(i1) = 0d0`; now the code should make sense.

In my case, n is much much larger than k. That’s why I use 64-bit integers. You are right, I probably should have accounted for the possible situations when n is smaller than k.

I have added `res(i1) = 0d0` before `res(i1) = res(i1) + ...`.

It is correctly translated from a formula in a paper. I agree that the formula can be rearranged for better clarity.

Now that you have conforming code, look at the loop: what can be hoisted?

``````
block
   real(8) tmp
   tmp = (pi / 2) * k / (k - 1)
   !$omp parallel do
   do i1 = k, n
      res(i1) = 0d0
      do i2 = i1 - k + 1, i1 - 1
         res(i1) = res(i1) + abs(x(i2) * x(i2 + 1))
      end do
   end do
   res(k:n) = tmp * res(k:n)
end block
``````

Thanks for the advice. For this nested loop, is it possible to use OpenMP to parallelize it?

If so, the two lines

``````
s = 1d0/(6d-1*(2d0*log(pi))**(1d0/2d0))
z = (max_z - c)/s
``````

may be replaced by

``````
z = (max_z - c)*sqrt(2d0*log(pi))*6d-1
``````
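A quick numerical check that the one-liner matches the original two-step form, sketched in Python with arbitrary sample values for `max_z` and `c`:

```python
import math

max_z, c = 3.7, 2.9   # arbitrary sample values for the check

# Original two-step form: s is a reciprocal, then a division by s.
s = 1.0 / (0.6 * math.sqrt(2.0 * math.log(math.pi)))
z_two_step = (max_z - c) / s

# Simplified one-liner from the post above.
z_one_line = (max_z - c) * math.sqrt(2.0 * math.log(math.pi)) * 0.6

assert abs(z_two_step - z_one_line) < 1e-12
```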

Can you replace `abs(x(i2) * x(i2 + 1))` with `x_seq(i2)`?
`x_seq` is calculated as follows (the range of the loop is likely incorrect)

``````
real(8) tmp, tmp_sum
real(8), allocatable :: x_seq(:)

allocate(x_seq(n-1))
do i1 = 1, n-1
   x_seq(i1) = abs(x(i1)*x(i1+1))
end do
``````

This will create another array. Not sure why it will be faster.

It trades memory allocation outside of the nested loops for a floating point multiplication within the inner loop. It might or might not be faster. However, after that change, the innermost loop can be replaced with sum(), and maybe other simplifications will become apparent. I’d say it is at least worth exploring.
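To see the equivalence, here is a small Python sketch (0-based indexing, random sample data) checking that a windowed sum over the precomputed `x_seq` reproduces the original inner loop:

```python
import random

random.seed(1)
n, k = 40, 7
x = [random.gauss(0.0, 1.0) for _ in range(n)]

# Precompute the pairwise products once (x_seq from the post above).
x_seq = [abs(x[m] * x[m + 1]) for m in range(n - 1)]

# The inner loop of roll_bv then collapses to a windowed sum over x_seq.
for j in range(k - 1, n):
    windowed = sum(x_seq[j - k + 1 : j])
    direct = sum(abs(x[m] * x[m + 1]) for m in range(j - k + 1, j))
    assert abs(windowed - direct) < 1e-12
```

In Fortran the windowed sum would just be `sum(x_seq(i1-k+1:i1-1))`.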


Here is a go at it for the roll_bv() function. The integer parameters wp and ip need to be defined in the module.

``````
function roll_bv(n, x) result(res)
   integer(ip), intent(in) :: n
   real(wp), intent(in)    :: x(:)   ! (1:n)
   real(wp)                :: res(n)
   integer(ip) :: i1
   real(wp)    :: tmp, t
   tmp = ((pi / 2) * k) / (k - 1)             ! overall scale factor.
   res(1:k-1) = ieee_value(0.0_wp, ieee_quiet_nan)
   t = sum( abs( x(1:k-2) * x(2:k-1) ) )      ! precompute the first terms.
   do i1 = k, n
      t = t + abs( x(i1-1) * x(i1) )          ! add in the last term.
      res(i1) = tmp * t
      t = t - abs( x(i1-k+1) * x(i1-k+2) )    ! subtract off the first term.
   end do
end function roll_bv
``````

This precomputes the overall scale factor as others have suggested. It eliminates the inner do loop, which is the main “optimization” here. It does this by recognizing that a rolling sum was being computed inside that loop. Each successive pass involved the same rolling sum terms except for the first and last elements. So this code separates off those elements and adds and subtracts them as necessary.
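The add/subtract bookkeeping is easy to verify. Here is a small Python sketch (0-based indexing, random sample data) that checks the rolling sum against a full recompute of the window at every step:

```python
import random

random.seed(2)
n, k = 200, 30
x = [random.gauss(0.0, 1.0) for _ in range(n)]
term = lambda m: abs(x[m] * x[m + 1])   # the m-th pairwise product

# Seed the rolling sum with the first k-2 products, as in the Fortran above.
t = sum(term(m) for m in range(k - 2))
res_rolling, res_direct = [], []
for j in range(k - 1, n):
    t += term(j - 1)                    # add in the newest product
    res_rolling.append(t)
    res_direct.append(sum(term(m) for m in range(j - k + 1, j)))
    t -= term(j - k + 1)                # drop the oldest product

assert all(abs(a - b) < 1e-9 for a, b in zip(res_rolling, res_direct))
```

The rolling version does O(1) work per output element instead of O(k), which is where the speedup comes from.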

I wrote the initial t computation as an array expression here. However, I would turn on compiler warnings about array temporaries, and if any compiler generates an array temporary to evaluate that expression, then I would replace that expression with a do loop, which obviously would not require any temporary array allocations.

For large n values, this algorithm might be less accurate than the original code because errors can accumulate here where they were kept local before. In the actual application, if this is important, then the first and last terms can be added in with the Kahan summation trick to reduce the cumulative errors. If you don’t know what that is, it is worth the time to google it to see how it works. Another possibility might be to declare and compute t in an extended precision, and then convert just when storing into res(:).
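For reference, a minimal Python sketch of the Kahan update (an illustration, not code from this thread): the compensation term recovers the low-order bits that plain addition discards.

```python
def kahan_add(total, comp, value):
    # One compensated-summation step: comp carries the rounding error
    # that was lost the last time a value was folded into total.
    y = value - comp
    t = total + y
    comp = (t - total) - y
    return t, comp

# Add a million tiny increments to 1.0: each one is below half an ulp of
# 1.0, so naive accumulation discards them all, while the compensated
# sum keeps their combined contribution.
naive = total = 1.0
comp = 0.0
for _ in range(1_000_000):
    naive += 1e-16
    total, comp = kahan_add(total, comp, 1e-16)

exact = 1.0 + 1e-16 * 1_000_000          # = 1 + 1e-10
assert abs(total - exact) < abs(naive - exact)
```

In the rolling-sum loop, the same update would be applied to `t` when adding the newest term and subtracting the oldest one.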

Edit: As discussed in the next post, I changed the dummy array declaration to assumed shape.

Here is my attempt at the lm_jump() function.

``````
function lm_jump(n, x, x_roll_bv) result(res)
   integer(ip), intent(in) :: n
   real(wp), intent(in)    :: x(:), x_roll_bv(:)  ! (1:n)
   integer(ip) :: i
   real(wp)    :: res, max_z, c, z, log_n
   real(wp), parameter :: log_pi = log(pi), s2log_pi = sqrt(2*log_pi)
   log_n = log(real(n, kind=wp))
   max_z = -huge(max_z)
   do i = 1, n
      max_z = max( max_z, abs(x(i)) / sqrt(x_roll_bv(i) / (k - 2)) )
   end do
   c = sqrt(2*log_n) / 0.8_wp - (log_pi + log(log_n)) / (1.6_wp * s2log_pi)
   z = (max_z - c) * 0.6_wp * s2log_pi
   res = 1.0_wp - exp(-exp(-z))
end function lm_jump
``````

The main thing here is to eliminate the allocate() for the temporary array whose elements are only referenced once. I left in the do loop, but of course that could be replaced with an array expression. The compiler is probably smart enough to produce the same code in both cases. I probably prefer the do loop in this case to the array expression.

The other things are replacing all of the **(1.0/2.0) with SQRT() and factoring out some constant terms. An optimizing compiler would probably have done that last part anyway, but it also clarifies the code a little to see what is constant and what are variables. I also eliminated the double reciprocal in the expression, as others had already recognized.

I also changed the array declarations to assumed shape. I probably should have done that for the other function too. It isn’t a big deal, but if you were to ever call either of those original functions with an array slice of some kind, it would invoke copy-in/copy-out and would cost some extra overhead. The assumed shape declarations avoid that possibility.

The original codes had an annoying mishmash of real(8) declarations and d exponents. There is no need for that, just declare the integer parameter wp and use it consistently throughout. Then it is easy to change if you want to change kinds, and also there is no chance of mixed kinds in the expressions.