# Tips to make this toy program faster?

Here is a little Fortran subroutine which calculates the negative log-likelihood of a normally distributed random sample.

subroutine negloglik(x, n, mu, sigma, nll)
integer n, i
double precision x(n), z(n), mu, sigma, nll, ll, pi

pi = 4*atan(1.0)

z = (x - mu)/sigma

ll = 0.0
do i=1,n
ll = ll - 0.5*z(i)**2 - log(sigma*sqrt(2*pi))
end do

nll = -ll
end subroutine negloglik


I originally wrote this so I could try calling Fortran from the statistical package R. Comparing this implementation (including wrapper code in R) with Râ€™s built-in dnorm function I was surprised to see that the above Fortran code is 3-4 times faster. Since my code is just a naive â€śformula translationâ€ť, Iâ€™m interested to know if there are any small changes which could provide more of a speed-up.

Iâ€™d try with openmp directives wrapping the do loop. That log and sqrt function will very probably make the stencil compute-bounded, so itâ€™s possible to see a quasi-linear scaling also on a cheap laptop.

1 Like

I suppose you could move this constant out of the loop, although the compiler might already do it himself.

2 Likes

Oops. Did not see it was a constant. However, depending on memory bandwidth, with openmp some improvements can be seen.

1 Like

You should also make sure to use 1.0d0 instead of 1.0. Literal constants of real variables in Fortran are by default single precision. Same goes for the 0.5 factor in the loop.

1 Like

If n is large or if this subroutine is called many times, I would expect some performance improvement by fusing the two loops and removing the temporary array z.

This would both remove a memory allocation from the subroutine overhead and improve the arithmetic intensity (flops per byte accessed) of the overall procedure by only iterating through memory once.

e.g.

subroutine negloglik(x, n, mu, sigma, nll)
integer n, i
double precision x(n), z, mu, sigma, nll, ll, pi

pi = 4*atan(1.0)

ll = 0.0
do i=1,n
z = (x(i) - mu)/sigma
ll = ll - 0.5*z**2 - log(sigma*sqrt(2*pi))
end do

nll = -ll
end subroutine negloglik

3 Likes

Hi Roberto, Ivan and Laurence. Thank you for your replies. Moving the constant out of the loop reduced time taken by 5%. Fusing the loops reduced time taken by 20%. Applying both changes, this code

subroutine negloglik(x, n, mu, sigma, nll)
integer n, i
double precision x(n), z, mu, sigma, nll, ll, pi

pi = 4*atan(1.0d0)

ll = -n*log(sigma*sqrt(2*pi))
do i=1,n
z = (x(i) - mu)/sigma
ll = ll - 0.5d0*z**2
end do

nll = -ll
end subroutine negloglik


has a running time which is 25% lower compared with the code in my original post.

@Rob777 When you mention openmp directives, is there a specific annotation that I can add to the do loop? I havenâ€™t used openmp before (or Fortran for that matterâ€¦ just getting started now).

2 Likes
  ll = 0.0
do i=1,n
z = (x(i) - mu)/sigma
ll = ll - 0.5*z**2 - log(sigma*sqrt(2*pi))
end do


You should mathematically factorize everything that can be factorized. For example, -log(sigma*sqrt(2*pi)) is substracted n times.

ll = 0.0
do i=1,n
z = (x(i) - mu)/sigma
ll = ll - 0.5*z**2
end do
ll = ll - n*log(sigma*sqrt(2*pi))


Here, n subtractions are replaced by one substraction and one multiplication.
The 0.5* operation can now be also factorized. And probably the /sigma â€¦

Each single mathematical operation is eating some CPUâ€¦

2 Likes

Glad that helped @catulus. Forgot to say: welcome to the Discourse!

May I also ask what compiler flags you are using to compile the Fortran code? With full optimisation, I would expect most trivial things like factorising/moving constants to be done by the compiler already.

A good set of optimisation flags to start-off with are:
-O3 -march=native -mtune=native

2 Likes

If I made no error, the only thing that the loop should compute is:
\sum_1^n{x(i)*(x(i)-2*mu)}

2 Likes

Iâ€™m compiling using the command R CMD SHLIB negloglik.f90 which calls

gfortran  -fpic -g -O2 -fdebug-prefix-map=/build/r-base-xoRi9J/r-base-3.5.2=. -fstack-protector-strong  -c  negloglik.f90 -o negloglik.o
gfortran -shared -L/usr/lib/R/lib -Wl,-z,relro -o negloglik.so negloglik.o -L/usr/lib/R/lib -lR

1 Like

As @lkedward said, the compiler should make the most obvious factorization with the -O3 flag. But perhaps not those coming from the **2 operation. It would be interesting to compare the CPU timeâ€¦

2 Likes

Given all the factorization that others have suggested so far, I would write it like the following. Good optimizing compiler ought to be able to parallelize/vectorize this quite well then.

subroutine negloglik(x, n, mu, sigma, nll)
integer, intent(in) :: n
double precision, intent(in) :: x(n), mu, sigma
double precision, intent(out) :: nll

double precision, parameter :: pi = 4*atan(1.0d0)

nll = sum(term(x, mu, sigma)) + n*log(sigma * sqrt(2 * pi))

contains
elemental function term(x_, mu_, sigma_)
double precision, intent(in) :: x_, mu_, sigma_
double precision :: term

term = 0.5d0 * ((x_ - mu_)/sigma_)**2
end function
end subroutine

1 Like

For the code

module negloglik_mod
implicit none
private
public :: negloglik
contains
subroutine negloglik(x, n, mu, sigma, nll, imethod)
integer         , intent(in)  :: n,imethod
double precision, intent(in)  :: x(n), mu, sigma
double precision, intent(out) :: nll
double precision, parameter   :: pi = 4*atan(1.0d0), sqrt_2_pi = sqrt(2.0d0*pi)
double precision              :: z(n),ll,zscalar
integer                       :: i
if (imethod < 3) z = (x - mu)/sigma
if (imethod == 1) then
ll = 0.0d0
do i=1,n
ll = ll - 0.5*z(i)**2 - log(sigma*sqrt_2_pi)
end do
else if (imethod == 2) then ! avoid computing log(sigma*sqrt(2*pi)) in loop
ll = -n * log(sigma*sqrt_2_pi) - 0.5d0*sum(z**2)
else if (imethod == 3) then ! loop fusion
ll = 0.0d0
do i=1,n
zscalar = (x(i) - mu)/sigma
ll = ll - 0.5d0*zscalar**2 - log(sigma*sqrt_2_pi)
end do
end if
nll = -ll
end subroutine negloglik
end module negloglik_mod

program xnegloglik
use negloglik_mod  , only: negloglik
use iso_fortran_env, only: compiler_version,compiler_options
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: n = 1000000, niter = 1000, nmethods = 3
real(kind=dp)      :: x(n),mu,sigma,nll,nll_sum,tt(0:nmethods)
integer            :: iter,imethod
call cpu_time(tt(0))
do imethod=1,nmethods
nll_sum = 0.0_dp
do iter=1,niter
call random_number(x)
mu = sum(x)/n
sigma = sqrt(sum((x-mu)**2)/n)
call negloglik(x,n,mu,sigma,nll,imethod)
nll_sum = nll_sum + nll
end do
write (*,"('imethod, nll_sum = ',i0,1x,f0.6)") imethod,nll_sum
call cpu_time(tt(imethod))
end do
write (*,"('times (s) for methods:',100(1x,f0.4))") tt(1:) - tt(0:nmethods-1)
write (*,"('compiler_version: ',a)") trim(compiler_version())
end program xnegloglik


I get for Intel Fortran and gfortran on Windows 10

times (s) for methods: 5.6094 5.4375 4.9062
compiler_version: Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.1 Build 20201112_000000
compiler_options: /c /fast

times (s) for methods: 6.2969 5.5312 5.6094
compiler_version: Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.1 Build 20201112_000000
compiler_options: /c /fast /check:bounds /F512000000

times (s) for methods: 11.0781 9.8438 8.4375
compiler_version: GCC version 11.0.0 20200927 (experimental)
compiler_options: -mtune=generic -march=x86-64 -O2

times (s) for methods: 10.9219 9.8906 8.4219
compiler_version: GCC version 11.0.0 20200927 (experimental)
compiler_options: -mtune=generic -march=x86-64 -O2 -fbounds-check


so loop fusion helps for gfortran. For Intel Fortran, which is generally faster for this code, loop fusion helps only when bounds checking is disabled.

1 Like

Same program with GCC 8.3.0 and -O2 or -O3 flags:

$gfortran speed.f90 -O2 && ./a.out imethod, nll_sum = 1 176470500.363913 imethod, nll_sum = 2 176459692.189152 imethod, nll_sum = 3 176472639.104788 times (s) for methods: 14.1439 13.3749 12.4523 compiler_version: GCC version 8.3.0$ gfortran speed.f90 -O3 && ./a.out
imethod, nll_sum = 1 176483880.530271
imethod, nll_sum = 2 176502438.163822
imethod, nll_sum = 3 176486385.252288
times (s) for methods: 12.1894 11.3992 12.4512
compiler_version: GCC version 8.3.0


Interestingly, with -O2, method 3 is faster, with -O3 itâ€™s method 2 (on that CPU, but not on another oneâ€¦). You can also try with different versions of gfortranâ€¦ The latest is not always the fastestâ€¦

Note also that mathematical factorization limits the number of operations and therefore of accumulating rounding errors.

2 Likes

Thanks @Beliavsky, nice bench-marking. I notice method 2 has taken the constant out of the loop whereas method 3 has not in your implementation; a stricter analysis of loop fusion versus the sum intrinsic would probably benefit from doing the same for method 3.

Good point about bounds checking; if bounds-checked performance is important then clearly using the sum intrinsic is beneficial since it isnâ€™t affected compared to hand-written array operations. This is not something I had considered before.

1 Like

also -Ofast, but read Optimize Options (Using the GNU Compiler Collection (GCC))

1 Like

Your improvement is now incorporated in method 4, which is faster:

times (s) for methods: 11.8750 10.6562 9.1562 8.2188
compiler_version: GCC version 11.0.0 20200927 (experimental)
compiler_options: -mtune=generic -march=x86-64 -O2


for the code

module negloglik_mod
implicit none
private
public :: negloglik
contains
subroutine negloglik(x, n, mu, sigma, nll, imethod)
integer         , intent(in)  :: n,imethod
double precision, intent(in)  :: x(n), mu, sigma
double precision, intent(out) :: nll
double precision, parameter   :: pi = 4*atan(1.0d0), sqrt_2_pi = sqrt(2.0d0*pi)
double precision              :: z(n),ll,zscalar,log_sigma_sqrt_2_pi
integer                       :: i
if (imethod < 3) z = (x - mu)/sigma
if (imethod == 1) then
ll = 0.0d0
do i=1,n
ll = ll - 0.5*z(i)**2 - log(sigma*sqrt_2_pi)
end do
else if (imethod == 2) then ! avoid computing log(sigma*sqrt(2*pi)) in loop
ll = -n * log(sigma*sqrt_2_pi) - 0.5d0*sum(z**2)
else if (imethod == 3) then ! loop fusion
ll = 0.0d0
do i=1,n
zscalar = (x(i) - mu)/sigma
ll = ll - 0.5d0*zscalar**2 - log(sigma*sqrt_2_pi)
end do
else if (imethod == 4) then ! better loop fusion with log outside loop
ll = -n * log(sigma*sqrt_2_pi)
do i=1,n
zscalar = (x(i) - mu)/sigma
ll = ll - 0.5d0*zscalar**2
end do
end if
nll = -ll
end subroutine negloglik
end module negloglik_mod

program xnegloglik
use negloglik_mod  , only: negloglik
use iso_fortran_env, only: compiler_version,compiler_options
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: n = 1000000, niter = 1000, nmethods = 4
real(kind=dp)      :: x(n),mu,sigma,nll,nll_sum,tt(0:nmethods)
integer            :: iter,imethod
call cpu_time(tt(0))
do imethod=1,nmethods
nll_sum = 0.0_dp
do iter=1,niter
call random_number(x)
mu = sum(x)/n
sigma = sqrt(sum((x-mu)**2)/n)
call negloglik(x,n,mu,sigma,nll,imethod)
nll_sum = nll_sum + nll
end do
write (*,"('imethod, nll_sum = ',i0,1x,f0.6)") imethod,nll_sum
call cpu_time(tt(imethod))
end do
write (*,"('times (s) for methods:',100(1x,f0.4))") tt(1:) - tt(0:nmethods-1)
write (*,"('compiler_version: ',a)") trim(compiler_version())
write (*,"('compiler_options: ',a)") trim(compiler_options())
end program xnegloglik
3 Likes

Method 5 is same as method 4 improved by factorizing the sigma division and 0.5d0 multiplication (and removing zscalar):

\$ gfortran speed.f90 -mtune=generic -march=x86-64 -O2 && ./a.out
imethod, nll_sum = 1 176513032.418208
imethod, nll_sum = 2 176504717.977701
imethod, nll_sum = 3 176486226.004816
imethod, nll_sum = 4 176480747.481051
imethod, nll_sum = 5 176491733.413728
times (s) for methods: 12.0716 10.4207 9.1838 8.7447 8.1351
compiler_version: GCC version 10.2.0
compiler_options: -mtune=generic -march=x86-64 -O2 -fpre-include=/usr/include/finclude/math-vector-fortran.h

module negloglik_mod
implicit none
private
public :: negloglik
contains
subroutine negloglik(x, n, mu, sigma, nll, imethod)
integer         , intent(in)  :: n,imethod
double precision, intent(in)  :: x(n), mu, sigma
double precision, intent(out) :: nll
double precision, parameter   :: pi = 4*atan(1.0d0), sqrt_2_pi = sqrt(2.0d0*pi)
double precision              :: z(n),ll,zscalar,log_sigma_sqrt_2_pi
integer                       :: i
if (imethod < 3) z = (x - mu)/sigma
if (imethod == 1) then
ll = 0.0d0
do i=1,n
ll = ll - 0.5*z(i)**2 - log(sigma*sqrt_2_pi)
end do
else if (imethod == 2) then ! avoid computing log(sigma*sqrt(2*pi)) in loop
ll = -n * log(sigma*sqrt_2_pi) - 0.5d0*sum(z**2)
else if (imethod == 3) then ! loop fusion
ll = 0.0d0
do i=1,n
zscalar = (x(i) - mu)/sigma
ll = ll - 0.5d0*zscalar**2 - log(sigma*sqrt_2_pi)
end do
else if (imethod == 4) then ! better loop fusion with log outside loop
ll = -n * log(sigma*sqrt_2_pi)
do i=1,n
zscalar = (x(i) - mu)/sigma
ll = ll - 0.5d0*zscalar**2
end do
else if (imethod == 5) then ! better with 0.5d0/sigma**2 outside loop
ll = 0.0d0
do i=1,n
ll = ll - ((x(i) - mu))**2
end do
ll = ll * 0.5d0 / (sigma**2)
ll = ll -n * log(sigma*sqrt_2_pi)
end if
nll = -ll
end subroutine negloglik
end module negloglik_mod

program xnegloglik
use negloglik_mod  , only: negloglik
use iso_fortran_env, only: compiler_version,compiler_options
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: n = 1000000, niter = 1000, nmethods = 5
real(kind=dp)      :: x(n),mu,sigma,nll,nll_sum,tt(0:nmethods)
integer            :: iter,imethod
call cpu_time(tt(0))
do imethod=1,nmethods
nll_sum = 0.0_dp
do iter=1,niter
call random_number(x)
mu = sum(x)/n
sigma = sqrt(sum((x-mu)**2)/n)
call negloglik(x,n,mu,sigma,nll,imethod)
nll_sum = nll_sum + nll
end do
write (*,"('imethod, nll_sum = ',i0,1x,f0.6)") imethod,nll_sum
call cpu_time(tt(imethod))
end do
write (*,"('times (s) for methods:',100(1x,f0.4))") tt(1:) - tt(0:nmethods-1)
write (*,"('compiler_version: ',a)") trim(compiler_version())
write (*,"('compiler_options: ',a)") trim(compiler_options())
end program xnegloglik

3 Likes

Indeed, 4atan(1.) would be better as 4atan(1.0d0) to get the desired precision. Actually, acos(-1.d0) might be more accurate - in involves multiplication by 4 and would likely return the libraryâ€™s internal version of pi without doing any computation. Alternatively, define pi as a named constant instead of â€ścomputingâ€ť it.

2 Likes