Code optimization

I have two functions that need further optimization. Any advice?

module opt
  use ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  integer(8), parameter :: k = 270
  real(8), parameter :: pi = acos(-1d0)


  function roll_bv(n, x) result(res)
    integer(8) :: n, i1, i2
    real(8) :: x(n), res(n)
    !$omp parallel do
    do i1 = 1, k - 1
      res(i1) = ieee_value(0d0, ieee_quiet_nan)  
    end do
    !$omp parallel do
    do i1 = k, n
      res(i1) = 0d0 
      do i2 = i1 - k + 1, i1 - 1
        res(i1) = res(i1) + pi/2d0*k/(k - 1d0)*abs(x(i2)*x(i2 + 1))  
      end do
    end do
  end function

  function lm_jump(n, x, x_roll_bv) result(res)
    integer(8) :: n, i
    real(8) :: x(n), x_roll_bv(n), res, max_z, c, s, z 
    real(8), allocatable :: z_vector(:)
    !$omp parallel do
    do i = 1, n
      z_vector(i) = abs(x(i))/sqrt(x_roll_bv(i)/(k - 2d0)) 
    end do
    max_z = maxval(z_vector)
    c = (2d0*log(n*1d0))**(1d0/2d0)/8d-1 - &
      (log(pi) + log(log(n*1d0)))/(1.6d0*(2d0*log(pi))**(1d0/2d0))
    s = 1d0/(6d-1*(2d0*log(pi))**(1d0/2d0))
    z = (max_z - c)/s
    res = 1d0 - exp(-exp(-z))
  end function
end module

Have you run them through a profiler? Do you know what the hotspots are? If not I suggest you use gprof or Intel’s tool (I don’t remember the name) to gain a better understanding of what is slowing you down.

Before you optimize the code, I’ll humbly suggest that you fix the bug(s).

The code runs fine and generates correct answers.

Well, we don’t know if the code runs correctly, because you did not provide sufficient information (like an entire functioning program). I suspect you either got lucky or n < k.

I suspect that there is an error here: s was set equal to the reciprocal of an expression rather than to the expression itself, probably to supplant the division by s in the next line by a faster multiplication by the reciprocal of s. However, the programmer forgot to replace ‘/’ by ‘*’ in that second line.

An inefficiency: ** (1d0/2d0) can be replaced by sqrt(). Some processors may have a floating point instruction to evaluate the square root in hardware or microcode. The X87 instruction set included SQRT as long as 40 years ago.

As @kargl correctly points out, if n < k the potentially expensive double DO loop in roll_bv() is not executed even once, so those source lines – in fact, the whole function – can be dispensed with.

In lm_jump(), there i an expensive evaluation for c. If the function is called with a single value of n throughout the calculation, c could be calculated during the first call and the resulting value saved for use in future calls.

I hope that you realize that without knowing the usage patterns (which arguments never change in a run, how often a function gets called, which portions of the code consume the most processor time, etc.), we are going to run into Knuth’s aphorism: Premature optimization is the root of all evil.

It’s more than just ‘n < k’ allows the code to be removed, a more important issue exists for ‘n > k’. The inner expression is res(i1) = res(i1) + ... This is invalid Fortran.

Agreed. The variable res is the result variable of the function, so it is undefined at function entry. The first DO loop of the code assigns values to res(1:k-1), so when n >= k, res(k:n) will be undefined.

However, I realize that such errors are not uncommon when some lines of code are lifted out of a longer section of program source and used for the sake of discussion in a forum post.

I would presume that if someone is asking for help in optimizing a piece that the code has not been edited for brevity. It’s all there. Looking at lm_jump, both log and exp can be expensive. Try writing the algorithm to minimize calls to log.

Thanks for pointing this out. I have added
res(i1) = 0d0, now the code should make sense.

In my case, n is much much larger than k. That’s why I use 64-bit integers. You are right, I probably should have accounted for the possible situations when n is smaller than k.

I have added res(i1) = 0d0 before res(i1) = res(i1) + ....

It is correctly translated from a formula on a paper. I agree that the formula can be rearranged for better clarity.

Now that you have conforming code, look at the loop and what can be hoisted?

   real(8) tmp
   tmp = (pi / 2) * k / (k - 1)
   !$omp parallel do
   do i1 = k, n
      res(i1) = 0d0 
      do i2 = i1 - k + 1, i1 - 1
         res(i1) = res(i1) + abs(x(i2) * x(i2 + 1))  
      end do
   end do
   res(k:n) = tmp * res(k:n)
end block
1 Like

thanks for the advice. for this nested loop, is it possible to use OpenMP to parallelize it?

If so, the two lines

    s = 1d0/(6d-1*(2d0*log(pi))**(1d0/2d0))
    z = (max_z - c)/s

may be replaced by

    z = (max_z - c)*sqrt(2d0*log(pi))*6d-1

Can you replace abs(x(i2) * x(i2 + 1)) with x_seq(i2)?
x_seq is calculated as follows (the range of the loop is likely incorrect)

   real(8) tmp, tmp_sum
   real(8), allocatable :: x_seq(:)
   do i1=1, n-1
        x_seq(i1) = abs(x(i1)*x(i1+1))
   end do

This will create another array. Not sure why it will be faster.

It trades memory allocation outside of the nested loops for a floating point multiplication within the inner loop. It might or might not be faster. However, after that change, the innermost loop can be replaced with sum(), and maybe other simplifications will become apparent. I’d say it is at least worth exploring.

1 Like

Here is a go at it for the roll_bv() function. The integer parameters wp and ip need to be defined in the module.

   function roll_bv(n, x) result(res)
      real(wp)                :: res(n)
      integer(ip), intent(in) :: n
      real(wp), intent(in)    :: x(:)   ! (1:n)
      integer(ip) :: i1
      real(wp)    :: tmp, t
      tmp = ((pi / 2) * k) / (k - 1)             ! overall scale factor.                                                            
      res(1:k-1) = ieee_value(0.0_wp, ieee_quiet_nan)                                                                            
      t = sum( abs( x(1:k-2) * x(2:k-1) ) )      ! precompute the first terms.                                                      
      do i1 = k, n
         t = t + abs( x(i1-1) * x(i1) )          ! add in the last term.                                                            
         res(i1) = tmp * t
         t = t - abs( x(i1-k+1) * x(i1-k+2) )    ! subtract off the first term.                                                     
   end function roll_bv

This precomputes the overall scale factor as others have suggested. It eliminates the inner do loop, which is the main “optimization” here. It does this by recognizing that a rolling sum was being computed inside that loop. Each successive pass involved the same rolling sum terms except for the first and last elements. So this code separates off those elements and adds and subtracts them as necessary.

I wrote the initial t computation as an array expression here. However, I would turn on compiler warnings about array temporaries, and if any compiler generates an array temporary to evaluate that expression, then I would replace that expression with a do loop, which obviously would not require any temporary array allocations.

For large n values, this algorithm might be less accurate than the original code because errors can accumulate here where they were kept local before. In the actual application, if this is important, then the first and last terms can be added in with the Kahan summation trick to reduce the cumulative errors. If you don’t know what that is, it is worth the time to google it to see how it works. Another possibility might be to declare and compute t in an extended precision, and then convert just when storing into res(:).

Edit: As discussed in the next post, I changed the dummy array declaration to assumed shape.

Here is my attempt at the lm_jump() function.

function lm_jump(n, x, x_roll_bv) result(res)
   integer(ip), intent(in) :: n
   real(wp), intent(in)    :: x(:), x_roll_bv(:)  ! (1:n)                                                                        
   integer(ip) :: i
   real(wp)    :: res, max_z, c, z, log_n
   real(wp), parameter :: log_pi = log(pi), s2log_pi = sqrt(2*log_pi)
   log_n = log(real(n,kind=wp))
   max_z = -huge(max_z)
   do i = 1, n
      max_z = max( max_z, (abs(x(i)) / sqrt(x_roll_bv(i) / (k - 2))) )
   end do
   c = sqrt(2*log_n) / 0.8_wp - (log_pi + log(log_n)) / (1.6_wp * s2log_pi)
   z = (max_z - c) * 0.6_wp * s2log_pi
   res = 1.0_wp - exp(-exp(-z))
end function lm_jump

The main thing here is to eliminate the allocate() for the temporary array whose elements are only referenced once. I left in the do loop, but of course that could be replaced with an array expression. The compiler is probably smart enough to produce the same code in both cases. I probably prefer the do loop in this case to the array expression.

The other things are replacing all of the **(1.0/2.0) with SQRT() and factoring out some constant terms. An optimizing compiler would probably have done that last part anyway, but it also clarifies the code a little to see what is constant and what are variables. I also eliminated the double reciprocal in the expression, as others had already recognized.

I also changed the array declarations to assumed shape. I probably should have done that for the other function too. It isn’t a big deal, but if you were to ever call either of those original functions with an array slice of some kind, it would invoke copy-in/copy-out and would cost some extra overhead. The assumed shape declarations avoid that possibility.

The original codes had an annoying mishmash of real(8) declarations and d exponents. There is no need for that, just declare the integer parameter wp and use it consistently throughout. Then it is easy to change if you want to change kinds, and also there is no chance of mixed kinds in the expressions.