Code optimization

fortran4r · June 9, 2022, 11:16pm

I have two functions that need further optimization. Any advice?

module opt
  use ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  integer(8), parameter :: k = 270
  real(8), parameter :: pi = acos(-1d0)

contains

  function roll_bv(n, x) result(res)
    integer(8) :: n, i1, i2
    real(8) :: x(n), res(n)
    !$omp parallel do
    do i1 = 1, k - 1
      res(i1) = ieee_value(0d0, ieee_quiet_nan)  
    end do
    !$omp parallel do
    do i1 = k, n
      res(i1) = 0d0 
      do i2 = i1 - k + 1, i1 - 1
        res(i1) = res(i1) + pi/2d0*k/(k - 1d0)*abs(x(i2)*x(i2 + 1))  
      end do
    end do
  end function

  function lm_jump(n, x, x_roll_bv) result(res)
    integer(8) :: n, i
    real(8) :: x(n), x_roll_bv(n), res, max_z, c, s, z 
    real(8), allocatable :: z_vector(:)
    allocate(z_vector(n))
    !$omp parallel do
    do i = 1, n
      z_vector(i) = abs(x(i))/sqrt(x_roll_bv(i)/(k - 2d0)) 
    end do
    max_z = maxval(z_vector)
    deallocate(z_vector)
    c = (2d0*log(n*1d0))**(1d0/2d0)/8d-1 - &
      (log(pi) + log(log(n*1d0)))/(1.6d0*(2d0*log(pi))**(1d0/2d0))
    s = 1d0/(6d-1*(2d0*log(pi))**(1d0/2d0))
    z = (max_z - c)/s
    res = 1d0 - exp(-exp(-z))
  end function
end module

gnikit · June 9, 2022, 11:23pm

Have you run them through a profiler? Do you know what the hotspots are? If not I suggest you use gprof or Intel’s tool (I don’t remember the name) to gain a better understanding of what is slowing you down.

fortran4r · June 10, 2022, 12:27am

The code runs fine and generates correct answers.

mecej4 · June 10, 2022, 12:50am

I suspect that there is an error here: s was set equal to the reciprocal of an expression rather than to the expression itself, probably to supplant the division by s in the next line by a faster multiplication by the reciprocal of s. However, the programmer forgot to replace ‘/’ by ‘*’ in that second line.

An inefficiency: ** (1d0/2d0) can be replaced by sqrt(). Some processors may have a floating point instruction to evaluate the square root in hardware or microcode. The X87 instruction set included SQRT as long as 40 years ago.

As @kargl correctly points out, if n < k the potentially expensive double DO loop in roll_bv() is not executed even once, so those source lines – in fact, the whole function – can be dispensed with.

In lm_jump(), there i an expensive evaluation for c. If the function is called with a single value of n throughout the calculation, c could be calculated during the first call and the resulting value saved for use in future calls.

I hope that you realize that without knowing the usage patterns (which arguments never change in a run, how often a function gets called, which portions of the code consume the most processor time, etc.), we are going to run into Knuth’s aphorism: Premature optimization is the root of all evil.

mecej4 · June 10, 2022, 1:29am

Agreed. The variable res is the result variable of the function, so it is undefined at function entry. The first DO loop of the code assigns values to res(1:k-1), so when n >= k, res(k:n) will be undefined.

However, I realize that such errors are not uncommon when some lines of code are lifted out of a longer section of program source and used for the sake of discussion in a forum post.

fortran4r · June 10, 2022, 2:36am

Thanks for pointing this out. I have added
res(i1) = 0d0, now the code should make sense.

fortran4r · June 10, 2022, 2:38am

In my case, n is much much larger than k. That’s why I use 64-bit integers. You are right, I probably should have accounted for the possible situations when n is smaller than k.

I have added res(i1) = 0d0 before res(i1) = res(i1) + ....

fortran4r · June 10, 2022, 2:46am

It is correctly translated from a formula on a paper. I agree that the formula can be rearranged for better clarity.

fortran4r · June 10, 2022, 3:14am

thanks for the advice. for this nested loop, is it possible to use OpenMP to parallelize it?

mecej4 · June 10, 2022, 3:22am

If so, the two lines

    s = 1d0/(6d-1*(2d0*log(pi))**(1d0/2d0))
    z = (max_z - c)/s

may be replaced by

    z = (max_z - c)*sqrt(2d0*log(pi))*6d-1

NaoMatch · June 13, 2022, 1:15pm

Can you replace abs(x(i2) * x(i2 + 1)) with x_seq(i2)?
x_seq is calculated as follows (the range of the loop is likely incorrect)

   real(8) tmp, tmp_sum
   real(8), allocatable :: x_seq(:)
   
   allocate(x_seq(n-1))
   do i1=1, n-1
        x_seq(i1) = abs(x(i1)*x(i1+1))
   end do

fortran4r · June 13, 2022, 8:34pm

This will create another array. Not sure why it will be faster.

RonShepard · June 13, 2022, 9:14pm

It trades memory allocation outside of the nested loops for a floating point multiplication within the inner loop. It might or might not be faster. However, after that change, the innermost loop can be replaced with sum(), and maybe other simplifications will become apparent. I’d say it is at least worth exploring.

RonShepard · June 13, 2022, 10:54pm

Here is a go at it for the roll_bv() function. The integer parameters wp and ip need to be defined in the module.

   function roll_bv(n, x) result(res)
      real(wp)                :: res(n)
      integer(ip), intent(in) :: n
      real(wp), intent(in)    :: x(:)   ! (1:n)
      integer(ip) :: i1
      real(wp)    :: tmp, t
      tmp = ((pi / 2) * k) / (k - 1)             ! overall scale factor.                                                            
      res(1:k-1) = ieee_value(0.0_wp, ieee_quiet_nan)                                                                            
      t = sum( abs( x(1:k-2) * x(2:k-1) ) )      ! precompute the first terms.                                                      
      do i1 = k, n
         t = t + abs( x(i1-1) * x(i1) )          ! add in the last term.                                                            
         res(i1) = tmp * t
         t = t - abs( x(i1-k+1) * x(i1-k+2) )    ! subtract off the first term.                                                     
      enddo
   end function roll_bv

This precomputes the overall scale factor as others have suggested. It eliminates the inner do loop, which is the main “optimization” here. It does this by recognizing that a rolling sum was being computed inside that loop. Each successive pass involved the same rolling sum terms except for the first and last elements. So this code separates off those elements and adds and subtracts them as necessary.

I wrote the initial t computation as an array expression here. However, I would turn on compiler warnings about array temporaries, and if any compiler generates an array temporary to evaluate that expression, then I would replace that expression with a do loop, which obviously would not require any temporary array allocations.

For large n values, this algorithm might be less accurate than the original code because errors can accumulate here where they were kept local before. In the actual application, if this is important, then the first and last terms can be added in with the Kahan summation trick to reduce the cumulative errors. If you don’t know what that is, it is worth the time to google it to see how it works. Another possibility might be to declare and compute t in an extended precision, and then convert just when storing into res(:).

Edit: As discussed in the next post, I changed the dummy array declaration to assumed shape.

RonShepard · June 14, 2022, 9:44pm

Here is my attempt at the lm_jump() function.

function lm_jump(n, x, x_roll_bv) result(res)
   integer(ip), intent(in) :: n
   real(wp), intent(in)    :: x(:), x_roll_bv(:)  ! (1:n)                                                                        
   integer(ip) :: i
   real(wp)    :: res, max_z, c, z, log_n
   real(wp), parameter :: log_pi = log(pi), s2log_pi = sqrt(2*log_pi)
   log_n = log(real(n,kind=wp))
   max_z = -huge(max_z)
   do i = 1, n
      max_z = max( max_z, (abs(x(i)) / sqrt(x_roll_bv(i) / (k - 2))) )
   end do
   c = sqrt(2*log_n) / 0.8_wp - (log_pi + log(log_n)) / (1.6_wp * s2log_pi)
   z = (max_z - c) * 0.6_wp * s2log_pi
   res = 1.0_wp - exp(-exp(-z))
end function lm_jump

The main thing here is to eliminate the allocate() for the temporary array whose elements are only referenced once. I left in the do loop, but of course that could be replaced with an array expression. The compiler is probably smart enough to produce the same code in both cases. I probably prefer the do loop in this case to the array expression.

The other things are replacing all of the **(1.0/2.0) with SQRT() and factoring out some constant terms. An optimizing compiler would probably have done that last part anyway, but it also clarifies the code a little to see what is constant and what are variables. I also eliminated the double reciprocal in the expression, as others had already recognized.

I also changed the array declarations to assumed shape. I probably should have done that for the other function too. It isn’t a big deal, but if you were to ever call either of those original functions with an array slice of some kind, it would invoke copy-in/copy-out and would cost some extra overhead. The assumed shape declarations avoid that possibility.

The original codes had an annoying mishmash of real(8) declarations and d exponents. There is no need for that, just declare the integer parameter wp and use it consistently throughout. Then it is easy to change if you want to change kinds, and also there is no chance of mixed kinds in the expressions.

JohnCampbell · June 15, 2022, 1:14pm

Here is my 2 cents worth. I replaced !$omp with an array = value and cleaned up some of the ugly d-1
(very similar to Ron’s, worth the same!; max_z can’t be -ve?)

module opt
  use ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  integer(8), parameter :: k = 270
  real(8),    parameter :: pi = acos(-1d0), one=1, two=2

contains

  function roll_bv (n, x) result(res)
    integer(8) :: n
    real(8)    :: x(n), res(n)

    integer(8) :: i0, i1, i2
    real(8)    :: ac

    res(1:k-1) = ieee_value (0d0, ieee_quiet_nan)

!   i1 = k
    ac = 0
    do i2 = 1, k - 1
      ac = ac + abs (x(i2)*x(i2+1))
    end do
    res(k) = pi/two * k/(k - one) * ac

    do i1 = k+1, n
      i0 = i1-k
      ac = ac - abs (x(i0)*x(i0+1))  &
              + abs (x(i1-1)*x(i1))
      res(i1) = pi/two * k/(k - one) * ac
    end do

  end function

  function lm_jump (n, x, x_roll_bv) result(res)
    integer(8) :: n, i
    real(8)    :: x(n), x_roll_bv(n), res
    real(8)    :: max_z, log_n, c, z
    real(8),  parameter :: sq_two_lpi = sqrt (two*log(pi))

    max_z = 0
    do i = 1, n
      max_z = max ( max_z, abs (x(i)) / sqrt ( x_roll_bv(i)/(k - two) ) )
    end do

    log_n = log(dble(n))
    c     = sqrt(two*log_n) / 0.8d0 - ( log(pi) + log(log_n) )/(1.6d0 * sq_two_lpi )
    z     = (max_z - c) * ( 0.6d0 * sq_two_lpi )
    res   = one - exp(-exp(-z))
  end function

end module

Topic		Replies	Views
Fortran openMP - make code faster	4	478	August 31, 2022
OpenMP and FORTRAN Help	32	1023	December 9, 2024
Fortran is dead – Long live Fortran! Tutorials	9	847	June 27, 2025
Compiling only part of code with -ffast-math Help	21	1362	April 21, 2022
Performance of vectorized code in ifort and ifx	28	1275	April 9, 2024

Code optimization

Related topics