Tips to make this toy program faster?

Here is a little Fortran subroutine which calculates the negative log-likelihood of a normally distributed random sample.

subroutine negloglik(x, n, mu, sigma, nll)
  integer n, i
  double precision x(n), z(n), mu, sigma, nll, ll, pi

  pi = 4*atan(1.0)

  z = (x - mu)/sigma

  ll = 0.0
  do i=1,n
    ll = ll - 0.5*z(i)**2 - log(sigma*sqrt(2*pi))
  end do

  nll = -ll
end subroutine negloglik

I originally wrote this so I could try calling Fortran from the statistical package R. Comparing this implementation (including wrapper code in R) with R’s built-in dnorm function I was surprised to see that the above Fortran code is 3-4 times faster. Since my code is just a naive “formula translation”, I’m interested to know if there are any small changes which could provide more of a speed-up.

I’d try with openmp directives wrapping the do loop. That log and sqrt function will very probably make the stencil compute-bounded, so it’s possible to see a quasi-linear scaling also on a cheap laptop.

1 Like

I suppose you could move this constant out of the loop, although the compiler might already do it himself.

2 Likes

Oops. Did not see it was a constant. However, depending on memory bandwidth, with openmp some improvements can be seen.

1 Like

You should also make sure to use 1.0d0 instead of 1.0. Literal constants of real variables in Fortran are by default single precision. Same goes for the 0.5 factor in the loop.

1 Like

If n is large or if this subroutine is called many times, I would expect some performance improvement by fusing the two loops and removing the temporary array z.

This would both remove a memory allocation from the subroutine overhead and improve the arithmetic intensity (flops per byte accessed) of the overall procedure by only iterating through memory once.

e.g.

subroutine negloglik(x, n, mu, sigma, nll)
  integer n, i
  double precision x(n), z, mu, sigma, nll, ll, pi

  pi = 4*atan(1.0)

  ll = 0.0
  do i=1,n
    z = (x(i) - mu)/sigma
    ll = ll - 0.5*z**2 - log(sigma*sqrt(2*pi))
  end do

  nll = -ll
end subroutine negloglik
3 Likes

Hi Roberto, Ivan and Laurence. Thank you for your replies. Moving the constant out of the loop reduced time taken by 5%. Fusing the loops reduced time taken by 20%. Applying both changes, this code

subroutine negloglik(x, n, mu, sigma, nll)
  integer n, i
  double precision x(n), z, mu, sigma, nll, ll, pi

  pi = 4*atan(1.0d0)

  ll = -n*log(sigma*sqrt(2*pi))
  do i=1,n
    z = (x(i) - mu)/sigma
    ll = ll - 0.5d0*z**2
  end do

  nll = -ll
end subroutine negloglik

has a running time which is 25% lower compared with the code in my original post.

@Rob777 When you mention openmp directives, is there a specific annotation that I can add to the do loop? I haven’t used openmp before (or Fortran for that matter… just getting started now).

2 Likes
  ll = 0.0
  do i=1,n
    z = (x(i) - mu)/sigma
    ll = ll - 0.5*z**2 - log(sigma*sqrt(2*pi))
  end do

You should mathematically factorize everything that can be factorized. For example, -log(sigma*sqrt(2*pi)) is substracted n times.

ll = 0.0
do i=1,n
    z = (x(i) - mu)/sigma
    ll = ll - 0.5*z**2
end do
ll = ll - n*log(sigma*sqrt(2*pi))

Here, n subtractions are replaced by one substraction and one multiplication.
The 0.5* operation can now be also factorized. And probably the /sigma …

Each single mathematical operation is eating some CPU…

2 Likes

Glad that helped @catulus. Forgot to say: welcome to the Discourse!

May I also ask what compiler flags you are using to compile the Fortran code? With full optimisation, I would expect most trivial things like factorising/moving constants to be done by the compiler already.

A good set of optimisation flags to start-off with are:
-O3 -march=native -mtune=native

2 Likes

If I made no error, the only thing that the loop should compute is:
\sum_1^n{x(i)*(x(i)-2*mu)}

2 Likes

I’m compiling using the command R CMD SHLIB negloglik.f90 which calls

gfortran  -fpic -g -O2 -fdebug-prefix-map=/build/r-base-xoRi9J/r-base-3.5.2=. -fstack-protector-strong  -c  negloglik.f90 -o negloglik.o
gfortran -shared -L/usr/lib/R/lib -Wl,-z,relro -o negloglik.so negloglik.o -L/usr/lib/R/lib -lR
1 Like

As @lkedward said, the compiler should make the most obvious factorization with the -O3 flag. But perhaps not those coming from the **2 operation. It would be interesting to compare the CPU time…

2 Likes

Given all the factorization that others have suggested so far, I would write it like the following. Good optimizing compiler ought to be able to parallelize/vectorize this quite well then.

subroutine negloglik(x, n, mu, sigma, nll)
  integer, intent(in) :: n
  double precision, intent(in) :: x(n), mu, sigma
  double precision, intent(out) :: nll

  double precision, parameter :: pi = 4*atan(1.0d0)

  nll = sum(term(x, mu, sigma)) + n*log(sigma * sqrt(2 * pi))

  contains
    elemental function term(x_, mu_, sigma_)
      double precision, intent(in) :: x_, mu_, sigma_
      double precision :: term

      term = 0.5d0 * ((x_ - mu_)/sigma_)**2
    end function
end subroutine
1 Like

For the code

module negloglik_mod
implicit none
private
public :: negloglik
contains
subroutine negloglik(x, n, mu, sigma, nll, imethod)
  integer         , intent(in)  :: n,imethod
  double precision, intent(in)  :: x(n), mu, sigma
  double precision, intent(out) :: nll
  double precision, parameter   :: pi = 4*atan(1.0d0), sqrt_2_pi = sqrt(2.0d0*pi)
  double precision              :: z(n),ll,zscalar
  integer                       :: i
  if (imethod < 3) z = (x - mu)/sigma
  if (imethod == 1) then
     ll = 0.0d0
     do i=1,n
        ll = ll - 0.5*z(i)**2 - log(sigma*sqrt_2_pi)
     end do
  else if (imethod == 2) then ! avoid computing log(sigma*sqrt(2*pi)) in loop
     ll = -n * log(sigma*sqrt_2_pi) - 0.5d0*sum(z**2)
  else if (imethod == 3) then ! loop fusion
     ll = 0.0d0
     do i=1,n
       zscalar = (x(i) - mu)/sigma
       ll = ll - 0.5d0*zscalar**2 - log(sigma*sqrt_2_pi)
     end do
  end if
  nll = -ll
end subroutine negloglik
end module negloglik_mod

program xnegloglik
use negloglik_mod  , only: negloglik
use iso_fortran_env, only: compiler_version,compiler_options
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: n = 1000000, niter = 1000, nmethods = 3
real(kind=dp)      :: x(n),mu,sigma,nll,nll_sum,tt(0:nmethods)
integer            :: iter,imethod
call cpu_time(tt(0))
do imethod=1,nmethods
   nll_sum = 0.0_dp
   do iter=1,niter
      call random_number(x)
      mu = sum(x)/n
      sigma = sqrt(sum((x-mu)**2)/n)
      call negloglik(x,n,mu,sigma,nll,imethod)
      nll_sum = nll_sum + nll
   end do
   write (*,"('imethod, nll_sum = ',i0,1x,f0.6)") imethod,nll_sum
   call cpu_time(tt(imethod))
end do
write (*,"('times (s) for methods:',100(1x,f0.4))") tt(1:) - tt(0:nmethods-1)
write (*,"('compiler_version: ',a)") trim(compiler_version())
end program xnegloglik

I get for Intel Fortran and gfortran on Windows 10

times (s) for methods: 5.6094 5.4375 4.9062
compiler_version: Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.1 Build 20201112_000000
compiler_options: /c /fast

times (s) for methods: 6.2969 5.5312 5.6094
compiler_version: Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.1 Build 20201112_000000
compiler_options: /c /fast /check:bounds /F512000000

times (s) for methods: 11.0781 9.8438 8.4375
compiler_version: GCC version 11.0.0 20200927 (experimental)
compiler_options: -mtune=generic -march=x86-64 -O2

times (s) for methods: 10.9219 9.8906 8.4219
compiler_version: GCC version 11.0.0 20200927 (experimental)
compiler_options: -mtune=generic -march=x86-64 -O2 -fbounds-check

so loop fusion helps for gfortran. For Intel Fortran, which is generally faster for this code, loop fusion helps only when bounds checking is disabled.

1 Like

Same program with GCC 8.3.0 and -O2 or -O3 flags:

$ gfortran speed.f90 -O2 && ./a.out
imethod, nll_sum = 1 176470500.363913
imethod, nll_sum = 2 176459692.189152
imethod, nll_sum = 3 176472639.104788
times (s) for methods: 14.1439 13.3749 12.4523
compiler_version: GCC version 8.3.0

$ gfortran speed.f90 -O3 && ./a.out
imethod, nll_sum = 1 176483880.530271
imethod, nll_sum = 2 176502438.163822
imethod, nll_sum = 3 176486385.252288
times (s) for methods: 12.1894 11.3992 12.4512
compiler_version: GCC version 8.3.0

Interestingly, with -O2, method 3 is faster, with -O3 it’s method 2 (on that CPU, but not on another one…). You can also try with different versions of gfortran… The latest is not always the fastest…

Note also that mathematical factorization limits the number of operations and therefore of accumulating rounding errors.

2 Likes

Thanks @Beliavsky, nice bench-marking. I notice method 2 has taken the constant out of the loop whereas method 3 has not in your implementation; a stricter analysis of loop fusion versus the sum intrinsic would probably benefit from doing the same for method 3.

Good point about bounds checking; if bounds-checked performance is important then clearly using the sum intrinsic is beneficial since it isn’t affected compared to hand-written array operations. This is not something I had considered before.

1 Like

also -Ofast, but read Optimize Options (Using the GNU Compiler Collection (GCC))

1 Like

Your improvement is now incorporated in method 4, which is faster:

times (s) for methods: 11.8750 10.6562 9.1562 8.2188
compiler_version: GCC version 11.0.0 20200927 (experimental)
compiler_options: -mtune=generic -march=x86-64 -O2

for the code

module negloglik_mod
implicit none
private
public :: negloglik
contains
subroutine negloglik(x, n, mu, sigma, nll, imethod)
  integer         , intent(in)  :: n,imethod
  double precision, intent(in)  :: x(n), mu, sigma
  double precision, intent(out) :: nll
  double precision, parameter   :: pi = 4*atan(1.0d0), sqrt_2_pi = sqrt(2.0d0*pi)
  double precision              :: z(n),ll,zscalar,log_sigma_sqrt_2_pi
  integer                       :: i
  if (imethod < 3) z = (x - mu)/sigma
  if (imethod == 1) then
     ll = 0.0d0
     do i=1,n
        ll = ll - 0.5*z(i)**2 - log(sigma*sqrt_2_pi)
     end do
  else if (imethod == 2) then ! avoid computing log(sigma*sqrt(2*pi)) in loop
     ll = -n * log(sigma*sqrt_2_pi) - 0.5d0*sum(z**2)
  else if (imethod == 3) then ! loop fusion
     ll = 0.0d0
     do i=1,n
       zscalar = (x(i) - mu)/sigma
       ll = ll - 0.5d0*zscalar**2 - log(sigma*sqrt_2_pi)
     end do
  else if (imethod == 4) then ! better loop fusion with log outside loop
     ll = -n * log(sigma*sqrt_2_pi)
     do i=1,n
       zscalar = (x(i) - mu)/sigma
       ll = ll - 0.5d0*zscalar**2
     end do
  end if
  nll = -ll
end subroutine negloglik
end module negloglik_mod

program xnegloglik
use negloglik_mod  , only: negloglik
use iso_fortran_env, only: compiler_version,compiler_options
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: n = 1000000, niter = 1000, nmethods = 4
real(kind=dp)      :: x(n),mu,sigma,nll,nll_sum,tt(0:nmethods)
integer            :: iter,imethod
call cpu_time(tt(0))
do imethod=1,nmethods
   nll_sum = 0.0_dp
   do iter=1,niter
      call random_number(x)
      mu = sum(x)/n
      sigma = sqrt(sum((x-mu)**2)/n)
      call negloglik(x,n,mu,sigma,nll,imethod)
      nll_sum = nll_sum + nll
   end do
   write (*,"('imethod, nll_sum = ',i0,1x,f0.6)") imethod,nll_sum
   call cpu_time(tt(imethod))
end do
write (*,"('times (s) for methods:',100(1x,f0.4))") tt(1:) - tt(0:nmethods-1)
write (*,"('compiler_version: ',a)") trim(compiler_version())
write (*,"('compiler_options: ',a)") trim(compiler_options())
end program xnegloglik
3 Likes

Method 5 is same as method 4 improved by factorizing the sigma division and 0.5d0 multiplication (and removing zscalar):

$ gfortran speed.f90 -mtune=generic -march=x86-64 -O2 && ./a.out
imethod, nll_sum = 1 176513032.418208
imethod, nll_sum = 2 176504717.977701
imethod, nll_sum = 3 176486226.004816
imethod, nll_sum = 4 176480747.481051
imethod, nll_sum = 5 176491733.413728
times (s) for methods: 12.0716 10.4207 9.1838 8.7447 8.1351
compiler_version: GCC version 10.2.0
compiler_options: -mtune=generic -march=x86-64 -O2 -fpre-include=/usr/include/finclude/math-vector-fortran.h
module negloglik_mod
implicit none
private
public :: negloglik
contains
subroutine negloglik(x, n, mu, sigma, nll, imethod)
  integer         , intent(in)  :: n,imethod
  double precision, intent(in)  :: x(n), mu, sigma
  double precision, intent(out) :: nll
  double precision, parameter   :: pi = 4*atan(1.0d0), sqrt_2_pi = sqrt(2.0d0*pi)
  double precision              :: z(n),ll,zscalar,log_sigma_sqrt_2_pi
  integer                       :: i
  if (imethod < 3) z = (x - mu)/sigma
  if (imethod == 1) then
     ll = 0.0d0
     do i=1,n
        ll = ll - 0.5*z(i)**2 - log(sigma*sqrt_2_pi)
     end do
  else if (imethod == 2) then ! avoid computing log(sigma*sqrt(2*pi)) in loop
     ll = -n * log(sigma*sqrt_2_pi) - 0.5d0*sum(z**2)
  else if (imethod == 3) then ! loop fusion
     ll = 0.0d0
     do i=1,n
       zscalar = (x(i) - mu)/sigma
       ll = ll - 0.5d0*zscalar**2 - log(sigma*sqrt_2_pi)
     end do
  else if (imethod == 4) then ! better loop fusion with log outside loop
     ll = -n * log(sigma*sqrt_2_pi)
     do i=1,n
       zscalar = (x(i) - mu)/sigma
       ll = ll - 0.5d0*zscalar**2
     end do
  else if (imethod == 5) then ! better with 0.5d0/sigma**2 outside loop
      ll = 0.0d0
       do i=1,n
         ll = ll - ((x(i) - mu))**2
       end do
       ll = ll * 0.5d0 / (sigma**2)
       ll = ll -n * log(sigma*sqrt_2_pi)
  end if
  nll = -ll
end subroutine negloglik
end module negloglik_mod

program xnegloglik
use negloglik_mod  , only: negloglik
use iso_fortran_env, only: compiler_version,compiler_options
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: n = 1000000, niter = 1000, nmethods = 5
real(kind=dp)      :: x(n),mu,sigma,nll,nll_sum,tt(0:nmethods)
integer            :: iter,imethod
call cpu_time(tt(0))
do imethod=1,nmethods
   nll_sum = 0.0_dp
   do iter=1,niter
      call random_number(x)
      mu = sum(x)/n
      sigma = sqrt(sum((x-mu)**2)/n)
      call negloglik(x,n,mu,sigma,nll,imethod)
      nll_sum = nll_sum + nll
   end do
   write (*,"('imethod, nll_sum = ',i0,1x,f0.6)") imethod,nll_sum
   call cpu_time(tt(imethod))
end do
write (*,"('times (s) for methods:',100(1x,f0.4))") tt(1:) - tt(0:nmethods-1)
write (*,"('compiler_version: ',a)") trim(compiler_version())
write (*,"('compiler_options: ',a)") trim(compiler_options())
end program xnegloglik
3 Likes

Indeed, 4atan(1.) would be better as 4atan(1.0d0) to get the desired precision. Actually, acos(-1.d0) might be more accurate - in involves multiplication by 4 and would likely return the library’s internal version of pi without doing any computation. Alternatively, define pi as a named constant instead of “computing” it.

2 Likes