Tips to make this toy program faster?

catulus · March 10, 2021, 2:33pm

Here is a little Fortran subroutine which calculates the negative log-likelihood of a normally distributed random sample.

subroutine negloglik(x, n, mu, sigma, nll)
  integer n, i
  double precision x(n), z(n), mu, sigma, nll, ll, pi

  pi = 4*atan(1.0)

  z = (x - mu)/sigma

  ll = 0.0
  do i=1,n
    ll = ll - 0.5*z(i)**2 - log(sigma*sqrt(2*pi))
  end do

  nll = -ll
end subroutine negloglik

I originally wrote this so I could try calling Fortran from the statistical package R. Comparing this implementation (including wrapper code in R) with R’s built-in dnorm function I was surprised to see that the above Fortran code is 3-4 times faster. Since my code is just a naive “formula translation”, I’m interested to know if there are any small changes which could provide more of a speed-up.

Rob777 · March 10, 2021, 2:57pm

I’d try with openmp directives wrapping the do loop. That log and sqrt function will very probably make the stencil compute-bounded, so it’s possible to see a quasi-linear scaling also on a cheap laptop.

ivanpribec · March 10, 2021, 2:59pm

I suppose you could move this constant out of the loop, although the compiler might already do it himself.

Rob777 · March 10, 2021, 3:00pm

Oops. Did not see it was a constant. However, depending on memory bandwidth, with openmp some improvements can be seen.

ivanpribec · March 10, 2021, 3:01pm

You should also make sure to use 1.0d0 instead of 1.0. Literal constants of real variables in Fortran are by default single precision. Same goes for the 0.5 factor in the loop.

lkedward · March 10, 2021, 3:30pm

If n is large or if this subroutine is called many times, I would expect some performance improvement by fusing the two loops and removing the temporary array z.

This would both remove a memory allocation from the subroutine overhead and improve the arithmetic intensity (flops per byte accessed) of the overall procedure by only iterating through memory once.

e.g.

subroutine negloglik(x, n, mu, sigma, nll)
  integer n, i
  double precision x(n), z, mu, sigma, nll, ll, pi

  pi = 4*atan(1.0)

  ll = 0.0
  do i=1,n
    z = (x(i) - mu)/sigma
    ll = ll - 0.5*z**2 - log(sigma*sqrt(2*pi))
  end do

  nll = -ll
end subroutine negloglik

catulus · March 10, 2021, 3:46pm

Hi Roberto, Ivan and Laurence. Thank you for your replies. Moving the constant out of the loop reduced time taken by 5%. Fusing the loops reduced time taken by 20%. Applying both changes, this code

subroutine negloglik(x, n, mu, sigma, nll)
  integer n, i
  double precision x(n), z, mu, sigma, nll, ll, pi

  pi = 4*atan(1.0d0)

  ll = -n*log(sigma*sqrt(2*pi))
  do i=1,n
    z = (x(i) - mu)/sigma
    ll = ll - 0.5d0*z**2
  end do

  nll = -ll
end subroutine negloglik

has a running time which is 25% lower compared with the code in my original post.

@Rob777 When you mention openmp directives, is there a specific annotation that I can add to the do loop? I haven’t used openmp before (or Fortran for that matter… just getting started now).

vmagnin · March 10, 2021, 3:49pm

  ll = 0.0
  do i=1,n
    z = (x(i) - mu)/sigma
    ll = ll - 0.5*z**2 - log(sigma*sqrt(2*pi))
  end do

You should mathematically factorize everything that can be factorized. For example, -log(sigma*sqrt(2*pi)) is substracted n times.

ll = 0.0
do i=1,n
    z = (x(i) - mu)/sigma
    ll = ll - 0.5*z**2
end do
ll = ll - n*log(sigma*sqrt(2*pi))

Here, n subtractions are replaced by one substraction and one multiplication.
The 0.5* operation can now be also factorized. And probably the /sigma …

Each single mathematical operation is eating some CPU…

lkedward · March 10, 2021, 4:01pm

Glad that helped @catulus. Forgot to say: welcome to the Discourse!

May I also ask what compiler flags you are using to compile the Fortran code? With full optimisation, I would expect most trivial things like factorising/moving constants to be done by the compiler already.

A good set of optimisation flags to start-off with are:
-O3 -march=native -mtune=native

vmagnin · March 10, 2021, 4:07pm

If I made no error, the only thing that the loop should compute is:
\sum_1^n{x(i)*(x(i)-2*mu)}

catulus · March 10, 2021, 4:17pm

I’m compiling using the command R CMD SHLIB negloglik.f90 which calls

gfortran  -fpic -g -O2 -fdebug-prefix-map=/build/r-base-xoRi9J/r-base-3.5.2=. -fstack-protector-strong  -c  negloglik.f90 -o negloglik.o
gfortran -shared -L/usr/lib/R/lib -Wl,-z,relro -o negloglik.so negloglik.o -L/usr/lib/R/lib -lR

vmagnin · March 10, 2021, 4:22pm

As @lkedward said, the compiler should make the most obvious factorization with the -O3 flag. But perhaps not those coming from the **2 operation. It would be interesting to compare the CPU time…

everythingfunctional · March 10, 2021, 4:32pm

Given all the factorization that others have suggested so far, I would write it like the following. Good optimizing compiler ought to be able to parallelize/vectorize this quite well then.

subroutine negloglik(x, n, mu, sigma, nll)
  integer, intent(in) :: n
  double precision, intent(in) :: x(n), mu, sigma
  double precision, intent(out) :: nll

  double precision, parameter :: pi = 4*atan(1.0d0)

  nll = sum(term(x, mu, sigma)) + n*log(sigma * sqrt(2 * pi))

  contains
    elemental function term(x_, mu_, sigma_)
      double precision, intent(in) :: x_, mu_, sigma_
      double precision :: term

      term = 0.5d0 * ((x_ - mu_)/sigma_)**2
    end function
end subroutine

Beliavsky · March 10, 2021, 4:35pm

For the code

module negloglik_mod
implicit none
private
public :: negloglik
contains
subroutine negloglik(x, n, mu, sigma, nll, imethod)
  integer         , intent(in)  :: n,imethod
  double precision, intent(in)  :: x(n), mu, sigma
  double precision, intent(out) :: nll
  double precision, parameter   :: pi = 4*atan(1.0d0), sqrt_2_pi = sqrt(2.0d0*pi)
  double precision              :: z(n),ll,zscalar
  integer                       :: i
  if (imethod < 3) z = (x - mu)/sigma
  if (imethod == 1) then
     ll = 0.0d0
     do i=1,n
        ll = ll - 0.5*z(i)**2 - log(sigma*sqrt_2_pi)
     end do
  else if (imethod == 2) then ! avoid computing log(sigma*sqrt(2*pi)) in loop
     ll = -n * log(sigma*sqrt_2_pi) - 0.5d0*sum(z**2)
  else if (imethod == 3) then ! loop fusion
     ll = 0.0d0
     do i=1,n
       zscalar = (x(i) - mu)/sigma
       ll = ll - 0.5d0*zscalar**2 - log(sigma*sqrt_2_pi)
     end do
  end if
  nll = -ll
end subroutine negloglik
end module negloglik_mod

program xnegloglik
use negloglik_mod  , only: negloglik
use iso_fortran_env, only: compiler_version,compiler_options
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: n = 1000000, niter = 1000, nmethods = 3
real(kind=dp)      :: x(n),mu,sigma,nll,nll_sum,tt(0:nmethods)
integer            :: iter,imethod
call cpu_time(tt(0))
do imethod=1,nmethods
   nll_sum = 0.0_dp
   do iter=1,niter
      call random_number(x)
      mu = sum(x)/n
      sigma = sqrt(sum((x-mu)**2)/n)
      call negloglik(x,n,mu,sigma,nll,imethod)
      nll_sum = nll_sum + nll
   end do
   write (*,"('imethod, nll_sum = ',i0,1x,f0.6)") imethod,nll_sum
   call cpu_time(tt(imethod))
end do
write (*,"('times (s) for methods:',100(1x,f0.4))") tt(1:) - tt(0:nmethods-1)
write (*,"('compiler_version: ',a)") trim(compiler_version())
end program xnegloglik

I get for Intel Fortran and gfortran on Windows 10

times (s) for methods: 5.6094 5.4375 4.9062
compiler_version: Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.1 Build 20201112_000000
compiler_options: /c /fast

times (s) for methods: 6.2969 5.5312 5.6094
compiler_version: Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.1 Build 20201112_000000
compiler_options: /c /fast /check:bounds /F512000000

times (s) for methods: 11.0781 9.8438 8.4375
compiler_version: GCC version 11.0.0 20200927 (experimental)
compiler_options: -mtune=generic -march=x86-64 -O2

times (s) for methods: 10.9219 9.8906 8.4219
compiler_version: GCC version 11.0.0 20200927 (experimental)
compiler_options: -mtune=generic -march=x86-64 -O2 -fbounds-check

so loop fusion helps for gfortran. For Intel Fortran, which is generally faster for this code, loop fusion helps only when bounds checking is disabled.

vmagnin · March 10, 2021, 4:58pm

Same program with GCC 8.3.0 and -O2 or -O3 flags:

$ gfortran speed.f90 -O2 && ./a.out
imethod, nll_sum = 1 176470500.363913
imethod, nll_sum = 2 176459692.189152
imethod, nll_sum = 3 176472639.104788
times (s) for methods: 14.1439 13.3749 12.4523
compiler_version: GCC version 8.3.0

$ gfortran speed.f90 -O3 && ./a.out
imethod, nll_sum = 1 176483880.530271
imethod, nll_sum = 2 176502438.163822
imethod, nll_sum = 3 176486385.252288
times (s) for methods: 12.1894 11.3992 12.4512
compiler_version: GCC version 8.3.0

Interestingly, with -O2, method 3 is faster, with -O3 it’s method 2 (on that CPU, but not on another one…). You can also try with different versions of gfortran… The latest is not always the fastest…

Note also that mathematical factorization limits the number of operations and therefore of accumulating rounding errors.

lkedward · March 10, 2021, 5:25pm

Thanks @Beliavsky, nice bench-marking. I notice method 2 has taken the constant out of the loop whereas method 3 has not in your implementation; a stricter analysis of loop fusion versus the sum intrinsic would probably benefit from doing the same for method 3.

Good point about bounds checking; if bounds-checked performance is important then clearly using the sum intrinsic is beneficial since it isn’t affected compared to hand-written array operations. This is not something I had considered before.

vmagnin · March 10, 2021, 5:26pm

also -Ofast, but read Optimize Options (Using the GNU Compiler Collection (GCC))

Beliavsky · March 10, 2021, 6:00pm

Your improvement is now incorporated in method 4, which is faster:

times (s) for methods: 11.8750 10.6562 9.1562 8.2188
compiler_version: GCC version 11.0.0 20200927 (experimental)
compiler_options: -mtune=generic -march=x86-64 -O2

for the code

module negloglik_mod
implicit none
private
public :: negloglik
contains
subroutine negloglik(x, n, mu, sigma, nll, imethod)
  integer         , intent(in)  :: n,imethod
  double precision, intent(in)  :: x(n), mu, sigma
  double precision, intent(out) :: nll
  double precision, parameter   :: pi = 4*atan(1.0d0), sqrt_2_pi = sqrt(2.0d0*pi)
  double precision              :: z(n),ll,zscalar,log_sigma_sqrt_2_pi
  integer                       :: i
  if (imethod < 3) z = (x - mu)/sigma
  if (imethod == 1) then
     ll = 0.0d0
     do i=1,n
        ll = ll - 0.5*z(i)**2 - log(sigma*sqrt_2_pi)
     end do
  else if (imethod == 2) then ! avoid computing log(sigma*sqrt(2*pi)) in loop
     ll = -n * log(sigma*sqrt_2_pi) - 0.5d0*sum(z**2)
  else if (imethod == 3) then ! loop fusion
     ll = 0.0d0
     do i=1,n
       zscalar = (x(i) - mu)/sigma
       ll = ll - 0.5d0*zscalar**2 - log(sigma*sqrt_2_pi)
     end do
  else if (imethod == 4) then ! better loop fusion with log outside loop
     ll = -n * log(sigma*sqrt_2_pi)
     do i=1,n
       zscalar = (x(i) - mu)/sigma
       ll = ll - 0.5d0*zscalar**2
     end do
  end if
  nll = -ll
end subroutine negloglik
end module negloglik_mod

program xnegloglik
use negloglik_mod  , only: negloglik
use iso_fortran_env, only: compiler_version,compiler_options
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: n = 1000000, niter = 1000, nmethods = 4
real(kind=dp)      :: x(n),mu,sigma,nll,nll_sum,tt(0:nmethods)
integer            :: iter,imethod
call cpu_time(tt(0))
do imethod=1,nmethods
   nll_sum = 0.0_dp
   do iter=1,niter
      call random_number(x)
      mu = sum(x)/n
      sigma = sqrt(sum((x-mu)**2)/n)
      call negloglik(x,n,mu,sigma,nll,imethod)
      nll_sum = nll_sum + nll
   end do
   write (*,"('imethod, nll_sum = ',i0,1x,f0.6)") imethod,nll_sum
   call cpu_time(tt(imethod))
end do
write (*,"('times (s) for methods:',100(1x,f0.4))") tt(1:) - tt(0:nmethods-1)
write (*,"('compiler_version: ',a)") trim(compiler_version())
write (*,"('compiler_options: ',a)") trim(compiler_options())
end program xnegloglik

vmagnin · March 10, 2021, 6:27pm

Method 5 is same as method 4 improved by factorizing the sigma division and 0.5d0 multiplication (and removing zscalar):

$ gfortran speed.f90 -mtune=generic -march=x86-64 -O2 && ./a.out
imethod, nll_sum = 1 176513032.418208
imethod, nll_sum = 2 176504717.977701
imethod, nll_sum = 3 176486226.004816
imethod, nll_sum = 4 176480747.481051
imethod, nll_sum = 5 176491733.413728
times (s) for methods: 12.0716 10.4207 9.1838 8.7447 8.1351
compiler_version: GCC version 10.2.0
compiler_options: -mtune=generic -march=x86-64 -O2 -fpre-include=/usr/include/finclude/math-vector-fortran.h

module negloglik_mod
implicit none
private
public :: negloglik
contains
subroutine negloglik(x, n, mu, sigma, nll, imethod)
  integer         , intent(in)  :: n,imethod
  double precision, intent(in)  :: x(n), mu, sigma
  double precision, intent(out) :: nll
  double precision, parameter   :: pi = 4*atan(1.0d0), sqrt_2_pi = sqrt(2.0d0*pi)
  double precision              :: z(n),ll,zscalar,log_sigma_sqrt_2_pi
  integer                       :: i
  if (imethod < 3) z = (x - mu)/sigma
  if (imethod == 1) then
     ll = 0.0d0
     do i=1,n
        ll = ll - 0.5*z(i)**2 - log(sigma*sqrt_2_pi)
     end do
  else if (imethod == 2) then ! avoid computing log(sigma*sqrt(2*pi)) in loop
     ll = -n * log(sigma*sqrt_2_pi) - 0.5d0*sum(z**2)
  else if (imethod == 3) then ! loop fusion
     ll = 0.0d0
     do i=1,n
       zscalar = (x(i) - mu)/sigma
       ll = ll - 0.5d0*zscalar**2 - log(sigma*sqrt_2_pi)
     end do
  else if (imethod == 4) then ! better loop fusion with log outside loop
     ll = -n * log(sigma*sqrt_2_pi)
     do i=1,n
       zscalar = (x(i) - mu)/sigma
       ll = ll - 0.5d0*zscalar**2
     end do
  else if (imethod == 5) then ! better with 0.5d0/sigma**2 outside loop
      ll = 0.0d0
       do i=1,n
         ll = ll - ((x(i) - mu))**2
       end do
       ll = ll * 0.5d0 / (sigma**2)
       ll = ll -n * log(sigma*sqrt_2_pi)
  end if
  nll = -ll
end subroutine negloglik
end module negloglik_mod

program xnegloglik
use negloglik_mod  , only: negloglik
use iso_fortran_env, only: compiler_version,compiler_options
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: n = 1000000, niter = 1000, nmethods = 5
real(kind=dp)      :: x(n),mu,sigma,nll,nll_sum,tt(0:nmethods)
integer            :: iter,imethod
call cpu_time(tt(0))
do imethod=1,nmethods
   nll_sum = 0.0_dp
   do iter=1,niter
      call random_number(x)
      mu = sum(x)/n
      sigma = sqrt(sum((x-mu)**2)/n)
      call negloglik(x,n,mu,sigma,nll,imethod)
      nll_sum = nll_sum + nll
   end do
   write (*,"('imethod, nll_sum = ',i0,1x,f0.6)") imethod,nll_sum
   call cpu_time(tt(imethod))
end do
write (*,"('times (s) for methods:',100(1x,f0.4))") tt(1:) - tt(0:nmethods-1)
write (*,"('compiler_version: ',a)") trim(compiler_version())
write (*,"('compiler_options: ',a)") trim(compiler_options())
end program xnegloglik

billlong · March 11, 2021, 4:20am

Indeed, 4atan(1.) would be better as 4atan(1.0d0) to get the desired precision. Actually, acos(-1.d0) might be more accurate - in involves multiplication by 4 and would likely return the library’s internal version of pi without doing any computation. Alternatively, define pi as a named constant instead of “computing” it.

Topic		Replies	Views
Physical constants Announcements	67	4468	July 4, 2022
Testing LogSumExp (or RealSoftMax)	23	380	August 5, 2025
Normal random number generator Help	21	4574	June 14, 2022
Fortran code snippets	103	8478	May 20, 2025
What ChatGPT says about Fortran AI	53	5167	December 31, 2023

Tips to make this toy program faster?

Related topics