Code slower than C++ version

I have translated a C++ code to Fortran code, but the Fortran code is about 50% slower. I need some advice regarding performance improvement:

subroutine three_median(x, y, z, res)
  real(8) :: x, y, z, res
  
  if ((y <= x .and. x <= z) .or. (z <= x .and. x <= y)) then
    res = x  
  else if ((x <= y .and. y <= z) .or. (z <= y .and. y <= x)) then
    res = y  
  else
    res = z
  end if
end subroutine

subroutine med_rv(n, x, res)
  integer(8) :: n
  real(8) :: x(n), res    
  real(8) :: param1, param2, param3, param4, pi
  integer(8) :: i
  real(8) :: x_median(n - 2)
  
  param1 = 2d0
  param2 = 3d0
  param3 = 4d0 
  param4 = 6d0
  pi = acos(-1d0)
  
  !$omp parallel do reduction(+:res)
  do i = 1, n - 2
    call three_median(abs(x(i)), abs(x(i + 1)), abs(x(i + 2)), x_median(i))      
    res = res + x_median(i)**param1  
  end do
  !$omp end parallel do 
  
  res = res*pi/(param4 - param3*sqrt(param2) + pi)*n/(n - param1)
end subroutine

!gfortran -O3 -fopenmp -fpic -shared med_rv.f95 -o med_rv.so

PS: Since the shared library needs to be loaded into R, I cannot use the -Ofast flag. Using -Ofast will crash R when x is a long vector.

C++ code as below:

#include <vector>
#include <cmath>
#include <algorithm>
using namespace std;

double three_median(double x, double y, double z)
{
  double res;
  
  if ((y <= x && x <= z) || (z <= x && x <= y))
  {
    res = x;  
  }
  else if ((x <= y && y <= z) || (z <= y && y <= x))
  {
    res = y;  
  }  
  else
  {
    res = z;
  }
    
  return res;
}

double med_rv(vector<double> x)
{
  long n = x.size();
  double param1 = 2.0, param2 = 3.0;
  double param3 = 4.0, param4 = 6.0;
  double pi = acos(-1.0);
  double res = 0.0;

  #pragma omp parallel for reduction(+:res)
  for (long i = 0; i < n - 2; i++)
  {
    res += pow(three_median(abs(x[i]), abs(x[i + 1]), abs(x[i + 2])), param1);  
  } 
  
  res *= pi/(param4 - param3*sqrt(param2) + pi)*n/(n - param1);

  return res;
}

Can you post the C++ code also?

Perhaps adding a main program calling the routine would be helpful to demonstrate how you expect to use the routine. It’s also not clear what kind of time gains you are expecting to see and whether they are worth pursuing.

Moreover if you already have a fast C++ routine, why the need for a Fortran version? Also showing the C++ code could be helpful to spot any obvious differences.

I’ve found the following resource contains some good advice on profiling (see Section 1) and also other interesting anecdotes:

Algorithms for Modern Hardware - Algorithmica

Other suggestions not strictly related to performance:

  • Use a kind parameter instead of 8
  • Specify intent
  • Make param1, param2, param3, param4, pi parameters
4 Likes

C++ code posted.

The most obvious difference that I can see is

    res = res + x_median(i)**param1  

Try to rework the Fortran code to be equivalent to the C++ code. Do not declare a temporary array if you don’t need it, as that might be the reason it is slower.

I’m also not sure what’s the purpose of using a real number as the exponent.

Even the C++ code looks like it could be improved. Why use long when x.size() returns an integer of type size_t. Why is the vector x passed by value?

1 Like

You could also experiment with a branchless median of three (c++ - Number of comparisons made in median of 3 function? - Stack Overflow), e.g.

median = max(min(a,b), min(max(a,b),c));

Applied to Fortran:

pure real(dp) function three_median(a,b,c)
  real(dp), intent(in) :: a, b, c
  three_median = max(min(a,b), min(max(a,b),c))
end function
1 Like

Here’s a small experiment for the three median:

module med3

implicit none

integer, parameter :: dp = kind(1.0d0)

contains

  pure function three_median1(a, b, c) result(res)
    real(dp), intent(in) :: a, b, c
    real(dp) :: res
    res = max(min(a,b),min(max(a,b),c))
  end function

  pure function three_median2(x, y, z) result(res)
    real(dp), intent(in) :: x, y, z
    real(dp) :: res
    if ((y <= x .and. x <= z) .or. (z <= x .and. x <= y)) then
      res = x  
    else if ((x <= y .and. y <= z) .or. (z <= y .and. y <= x)) then
      res = y  
    else
      res = z
    end if
  end function

end module

program test_med3

  use med3
  implicit none

  real(dp), allocatable :: r(:), m1(:), m2(:)

  character(len=6) :: argv
  integer :: n, i
  real(dp) :: tstart, tend, t1, t2

  call get_command_argument(1,argv)
  read(argv,*) n

  allocate(r(n))
  allocate(m1(n-2),m2(n-2))

  call random_number(r)

  ! Processor warm-up
  do i = 1, n-2
    m1(i) = three_median1(r(i),r(i+1),r(i+2))
  end do
  do i = 1, n-2
    m2(i) = three_median2(r(i),r(i+1),r(i+2))
  end do

  write(*,*) "diff = ", maxval(m1-m2)

  call cpu_time(tstart)
  do i = 1, n-2
    m1(i) = three_median1(r(i),r(i+1),r(i+2))
  end do
  call cpu_time(tend)
  t1 = (tend-tstart)
  write(*,*) "Median 1: ", t1

  call cpu_time(tstart)
  do i = 1, n-2
    m2(i) = three_median2(r(i),r(i+1),r(i+2))
  end do
  call cpu_time(tend)
  t2 = (tend-tstart)
  write(*,*) "Median 2: ", t2

  write(*,*) "Ratio: ", t2/t1

  ! Print output for verification
  ! write(*,*)
  ! do i = 1, n
  !   if (i <= n-2) then
  !     write(*,*) r(i), m1(i)
  !   else
  !     write(*,*) r(i)
  !   end if
  ! end do

end program

With compile flag -O3 and gfortran the min/max version appears to be between 5Γ— to 10Γ— faster than the branching version. Not sure what to make of the large variability in these measurements. Maybe due to the use of random numbers for the test, it has something to do with the success rate of the branch predictor. With ifort the variability is much lower but the PRNG is always seeded with the same value.

Edit: just for reference, both versions require less than a microsecond to compute the rolling three median for 1000000 numbers on a normal consumer laptop.

2 Likes

thanks for the advice. are there any other places that can be improved? I would like to push the limit a little bit.

The updated Fortran code is twice as fast as the updated C++ code.

function three_median(x, y, z)
  real(8) :: x, y, z, three_median

  three_median = max(min(x, y),min(max(x, y), z))
end function

subroutine med_rv(n, x, res)
  integer(8) :: n
  real(8) :: x(n), res    
  real(8) :: param1, param2, param3, param4, pi
  integer(8) :: i
  real(8) :: three_median

  param1 = 2d0
  param2 = 3d0
  param3 = 4d0 
  param4 = 6d0
  pi = acos(-1d0)

  !$omp parallel do reduction(+:res)
  do i = 1, n - 2
    res = res + three_median(abs(x(i)), abs(x(i + 1)), abs(x(i + 2)))**param1  
  end do
  !$omp end parallel do 

  res = res*pi/(param4 - param3*sqrt(param2) + pi)*n/(n - param1)
end subroutine
double three_median(double x, double y, double z)
{
  double res;
  res = max(min(x, y),min(max(x, y), z));
    
  return res;
}

double med_rv(const vector<double>& x)
{
  long n = x.size();
  double param1 = 2.0, param2 = 3.0;
  double param3 = 4.0, param4 = 6.0;
  double pi = acos(-1.0);
  double res = 0.0;

  #pragma omp parallel for reduction(+:res)
  for (long i = 0; i < n - 2; i++)
  {
    res += pow(three_median(abs(x[i]), abs(x[i + 1]), abs(x[i + 2])), param1);  
  } 
  
  res *= pi/(param4 - param3*sqrt(param2) + pi)*n/(n - param1);

  return res;
}
1 Like

One more idea would be to avoid repeating the abs calls (but maybe the compiler already does this for you):

real :: a1, a2, a3

a1 = abs(x(1))
a2 = abs(x(2))
do i = 1, n-2
  a3 = abs(x(i+2))
  res = res + three_median(a1, a2, a3)**2
  a1 = a2
  a2 = a3
end do
1 Like

I don’t have the time right now to do it myself, but at least I can tell you the general procedure to get and prove that you have the maximum best performance:

  • Count the clock counts per array element (double) for the body of your loop. The new version with min, max is predictable, you can lookup how long the instructions have to take. You do it by first counting the operations. I can see:
    • abs: 3
    • min/max: 4
    • power: 1, although param1=2, so then it’s just one multiplication
    • addition: 1 (the accumulation in res)
    • memory read: 3
    • memory write: 0
  • Lookup the (reciprocal) throughputs for these operations, add them up (it looks like the bottleneck will be computation, so you can ignore memory read). That gives you the theoretical performance peak.
  • Then you try to get it with the code. Usually one can do it from Fortran, sometimes one has to do it in assembly (this means the compilers need to improve). Typically once you can get within 50% of the peak, you can call it good. You know that you are within a factor of 2x from the best possible performance.
  • Finally, you then vary the algorithm, such as taking the abs values out of the loop as @ivanpribec suggested, and you redo the theoretical performance peak for each such algorithm and see which one is the fastest and that’s the one you choose, assuming you can get within 50% with actual code.

The code can be in any language, whether assembly, C++ or Fortran.

3 Likes

This is not directly related to this post. But relevant. I recently realized (or at least Metcalf’s book states) that the standard apparently requires integers to be promoted to real when multiplied to a real. Is that really the case? I thought compilers would convert integer multiplication to addition.

I also don’t understand why three_median1 and three_meidan2 have such a different performance. Could this be a bug in gfortran?

This guy’s code is really optimized.
I am not Fortran expert. But first of all, max, min those internal function is highly optimized.
Second, this guy’s version seem only contain 2 operations of min and 2 operations for max.
So it must be very fast.

But the if then elseif if if else blablaba version, you can see it needs to do a lot of x y z comparison judgement , and combine with many .and. .or. operation, also have many if then else judgements.
Definitely many more than 4 operations. That for sure is very slow.

I would say please try your best to prevent using too many if then else stuff, especially if those if then else are in big loops. Also if there are same thing repeated calculated in a loop, please take them outside the loop.

In conclusion, I believe the difference in speed is not a bug in gfortran.
Afterall, gfortran did all you tell it do as fast as it can. It is just that gfortran is not that intelligent to realize the if then else version can be simplified to the min max version. it is not the fault of gfortran :nerd_face:
In many cases, in order to make code fast, human need to do more work, it can make the life of computer much easier.

Probably because without the branches the compiler is able to use SIMD?

1 Like

I don’t see why you’d call it a bug. Both versions return the correct result. Both are extremely fast at least in my book.

I’ve tried to point you towards some resources that can help rationalize the difference:

Putting blame on the compiler is not helpful, IMO. Quoting from the Algorithmica page:

Writing optimizing compilers is very hard. The last Turing award went to Alfred Aho and Jeffrey Ullman for a 1000-page book they wrote about compilers, which is considered an introductory textbook.


@lmiq, using the Intel Fortran Report generator with flag -qopt-report-phase=loop,vec, I see that vectorization does not occur:

LOOP BEGIN at test_med3.f90(62,3)  [min,max version]
   remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override 
   remark #25439: unrolled with remainder by 8  
LOOP END

LOOP BEGIN at test_med3.f90(62,3)
<Remainder>
   remark #25436: completely unrolled by 7  
LOOP END

LOOP BEGIN at test_med3.f90(70,3) [if-else version]
   remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override 
LOOP END

The apparent variability of results may be depending on which OS you are using. The accuracy of CPU_TIME could be an issue. ( only accurate to 0.0156 seconds as value updates 64 times per second on windows O/S )
It may be better to time using β€œcall SYSTEM_CLOCK ( ticks )”, where ticks is an 8-byte integer (gFortran case on Win64)

I got curious about that difference, then I did some tests using Julia here (I find it easier to benchmark). The function I’m benchmarking sums the result of three_median for 10_000 points:

julia> function test(v,f)
           s = 0.
           for x in v
               s += f(x[1],x[2],x[3])
           end
           return s
       end
test (generic function with 1 method)

The three functions I tested are:

# original 
function three_median(x, y, z)
  if ((y <= x && x <= z) || (z <= x && x <= y))
    res = x  
  elseif ((x <= y && y <= z) || (z <= y && y <= x))
    res = y  
  else
    res = z
  end
  return res
end

# using max/min
function three_median1(a, b, c)
  res = max(min(a,b),min(max(a,b),c))
  return res
end

# identical to max/min, but with conditionals:
function three_median2(a,b,c)
    a1 = if a < b 
        a 
    else
        b
    end
    a2 = if a > b
        a
    else
        b
    end
    a3 = if a2 < c
        a2
    else
        c
    end
    res = if a3 > a1
        a3
    else
        a1
    end 
    return res
end

The results are:

julia> @btime test($x,$three_median)
  65.632 ΞΌs (0 allocations: 0 bytes)
5046.737882726079

julia> @btime test($x,$three_median1)
  47.879 ΞΌs (0 allocations: 0 bytes)
5046.737882726079

julia> @btime test($x,$three_median2)
  10.952 ΞΌs (0 allocations: 0 bytes)
5046.737882726079

Thus, function with max/min is indeed faster than the original one, but, curiously, the one with the conditionals written in a different way is quite faster than both options.

I don’t know if that will be the case with Fortran as well, the corresponding function would be:

function three_median(x, y, z)
  real(8) :: x, y, z, three_median, a1, a2, a3
  if (a < b) then
      a1 = a
  else
      a1 = b
  end if
  if (a > b) then
      a2 = a
  else
      a2 = b
  end if
  if (a2 < c) then
      a3 = a2
  else
      a3 = c
  end if
  if (a3 > a1) then
      three_median = a3
  else
      three_median = a1
  end if
end function

I do not understand very clearly lowered code, and those differences are clearly interesting. Anyway, if someone wants to check that out, here go the assembly codes of all these options:

Summary
julia> @code_native three_median(a,b,c)
	.text
; β”Œ @ three_median.jl:2 within `three_median'
; β”‚β”Œ @ float.jl:374 within `<='
	vucomisd	%xmm1, %xmm0
; β”‚β””
	jb	L12
	vucomisd	%xmm0, %xmm2
	jae	L24
L12:
	vucomisd	%xmm0, %xmm1
	jb	L25
	vucomisd	%xmm2, %xmm0
	jb	L25
; β”‚ @ three_median.jl:9 within `three_median'
L24:
	retq
; β”‚ @ three_median.jl:4 within `three_median'
L25:
	vcmpnlesd	%xmm1, %xmm2, %xmm3
	vblendvpd	%xmm3, %xmm2, %xmm1, %xmm3
	vcmpnlesd	%xmm0, %xmm1, %xmm4
	vblendvpd	%xmm4, %xmm2, %xmm3, %xmm3
	vcmpnlesd	%xmm2, %xmm1, %xmm2
	vblendvpd	%xmm2, %xmm3, %xmm1, %xmm2
	vcmpnlesd	%xmm1, %xmm0, %xmm0
	vblendvpd	%xmm0, %xmm3, %xmm2, %xmm0
	retq
	nopw	%cs:(%rax,%rax)
; β””

julia> @code_native three_median1(a,b,c)
	.text
; β”Œ @ three_median.jl:13 within `three_median1'
; β”‚β”Œ @ math.jl:726 within `min'
; β”‚β”‚β”Œ @ floatfuncs.jl:15 within `signbit'
	vmovq	%xmm1, %rcx
	vmovq	%xmm0, %rax
; β”‚β”‚β””
	vcmpordsd	%xmm0, %xmm0, %xmm3
	vblendvpd	%xmm3, %xmm1, %xmm0, %xmm3
	vcmpordsd	%xmm1, %xmm1, %xmm4
	vblendvpd	%xmm4, %xmm0, %xmm1, %xmm5
; β”‚β”‚β”Œ @ operators.jl:305 within `>'
; β”‚β”‚β”‚β”Œ @ bool.jl:84 within `<'
; β”‚β”‚β”‚β”‚β”Œ @ bool.jl:36 within `&'
	testq	%rcx, %rcx
; β”‚β”‚β””β””β””
	js	L52
	vmovapd	%xmm5, %xmm8
	vmovapd	%xmm3, %xmm6
; β”‚β”‚β”Œ @ operators.jl:305 within `>'
; β”‚β”‚β”‚β”Œ @ bool.jl:84 within `<'
; β”‚β”‚β”‚β”‚β”Œ @ bool.jl:33 within `!'
	testq	%rax, %rax
; β”‚β”‚β””β””β””
	jns	L69
	jmp	L65
L52:
	vmovapd	%xmm3, %xmm8
	vmovapd	%xmm5, %xmm6
; β”‚β”‚β”Œ @ operators.jl:305 within `>'
; β”‚β”‚β”‚β”Œ @ bool.jl:84 within `<'
; β”‚β”‚β”‚β”‚β”Œ @ bool.jl:33 within `!'
	testq	%rax, %rax
; β”‚β”‚β””β””β””
	jns	L69
L65:
	vmovapd	%xmm5, %xmm8
; β”‚β””
; β”‚β”Œ @ math.jl:722 within `max'
L69:
	js	L75
	vmovapd	%xmm5, %xmm6
; β”‚β””
; β”‚β”Œ @ math.jl within `min'
L75:
	vcmpltsd	%xmm0, %xmm1, %xmm7
; β”‚β””
; β”‚β”Œ @ math.jl:722 within `max'
	vcmpltsd	%xmm1, %xmm0, %xmm0
	vblendvpd	%xmm0, %xmm3, %xmm6, %xmm1
; β”‚β””
; β”‚β”Œ @ math.jl:726 within `min'
; β”‚β”‚β”Œ @ floatfuncs.jl:15 within `signbit'
	vmovq	%xmm2, %rax
	vmovq	%xmm1, %rcx
; β”‚β”‚β””
	vcmpordsd	%xmm1, %xmm1, %xmm0
	vblendvpd	%xmm0, %xmm2, %xmm1, %xmm4
	vcmpordsd	%xmm2, %xmm2, %xmm0
	vblendvpd	%xmm0, %xmm1, %xmm2, %xmm6
	vmovapd	%xmm4, %xmm5
; β”‚β”‚β”Œ @ operators.jl:305 within `>'
; β”‚β”‚β”‚β”Œ @ bool.jl:84 within `<'
; β”‚β”‚β”‚β”‚β”Œ @ bool.jl:33 within `!'
	testq	%rcx, %rcx
; β”‚β”‚β””β””β””
	jns	L136
	vmovapd	%xmm6, %xmm5
; β”‚β””
; β”‚β”Œ @ math.jl within `min'
L136:
	vblendvpd	%xmm7, %xmm3, %xmm8, %xmm0
; β”‚β””
; β”‚β”Œ @ math.jl:726 within `min'
; β”‚β”‚β”Œ @ operators.jl:305 within `>'
; β”‚β”‚β”‚β”Œ @ bool.jl:84 within `<'
; β”‚β”‚β”‚β”‚β”Œ @ bool.jl:36 within `&'
	testq	%rax, %rax
; β”‚β”‚β””β””β””
	js	L151
	vmovapd	%xmm6, %xmm5
L151:
	vcmpltsd	%xmm1, %xmm2, %xmm1
	vblendvpd	%xmm1, %xmm4, %xmm5, %xmm1
; β”‚β””
; β”‚β”Œ @ math.jl:722 within `max'
; β”‚β”‚β”Œ @ floatfuncs.jl:15 within `signbit'
	vmovq	%xmm1, %rcx
	vmovq	%xmm0, %rax
; β”‚β”‚β””
	vcmpordsd	%xmm0, %xmm0, %xmm2
	vblendvpd	%xmm2, %xmm1, %xmm0, %xmm2
	vcmpordsd	%xmm1, %xmm1, %xmm3
	vblendvpd	%xmm3, %xmm0, %xmm1, %xmm3
	vmovapd	%xmm2, %xmm4
; β”‚β”‚β”Œ @ bool.jl:84 within `<'
; β”‚β”‚β”‚β”Œ @ bool.jl:33 within `!'
	testq	%rcx, %rcx
; β”‚β”‚β””β””
	jns	L207
	vmovapd	%xmm3, %xmm4
; β”‚β”‚β”Œ @ bool.jl:84 within `<'
; β”‚β”‚β”‚β”Œ @ bool.jl:36 within `&'
L207:
	testq	%rax, %rax
; β”‚β”‚β””β””
	js	L216
	vmovapd	%xmm3, %xmm4
L216:
	vcmpltsd	%xmm1, %xmm0, %xmm0
	vblendvpd	%xmm0, %xmm2, %xmm4, %xmm0
; β”‚β””
; β”‚ @ three_median.jl:14 within `three_median1'
	retq
	nopw	%cs:(%rax,%rax)
; β””

julia> @code_native three_median2(a,b,c)
	.text
; β”Œ @ three_median.jl within `three_median2'
	vminsd	%xmm1, %xmm0, %xmm3
; β”‚ @ three_median.jl:23 within `three_median2'
	vmaxsd	%xmm1, %xmm0, %xmm0
; β”‚ @ three_median.jl within `three_median2'
	vminsd	%xmm2, %xmm0, %xmm0
; β”‚ @ three_median.jl:33 within `three_median2'
	vmaxsd	%xmm3, %xmm0, %xmm0
; β”‚ @ three_median.jl:38 within `three_median2'
	retq
	nopw	%cs:(%rax,%rax)
; β””

The assembly code of the third option is much more concise, curiously.

edit: In Julia, the reason for the difference is that there are specific checks for NaNs in the intrinsic max functions.

1 Like

That’s easy to understand. The three_median2 function in Julia translates to:

	vminsd	%xmm1, %xmm0, %xmm3
	vmaxsd	%xmm1, %xmm0, %xmm0
	vminsd	%xmm2, %xmm0, %xmm0
	vmaxsd	%xmm3, %xmm0, %xmm0

which is the optimal way to do it, it just uses the min/max vectorized assembly instructions. The version three_median1 however seems to inline the pure Julia implementations of min/max in math.jl:722 and those implementations seem to be using the vcmpordsd and vblendvpd instructions, so it’s a lot slower. If you follow the link to math.jl, you can see that the Julia’s implementations use signbit and handle nans. You can see this handling in the assembly also.

That’s unfortunate. I think the idea is that the Julia’s optimizer will eventually be able to optimize it out and generate the min/max assembly instructions, but it is not there yet. @oscardssmith, what is your understanding of this?

This is something that I’ve been thinking about a lot for LFortran also. We are also providing pure Fortran implementations of these basic functions. The reason is that it’s easier on the backends, they don’t have to implement these natively, so it’s easier to write it. And also potentially it might be possible to partially optimize these out in the optimizer for many cases (such as when the optimizer knows the argument x is positive for abs(x)). But what we are also planning is to add an option in the LLVM backend to translate these intrinsics in such a way into LLVM so that the min/max assembly instructions are generated.

1 Like

I think you’re are right. With gfortran 11.2 and -O3, it is clear that the compiler is able to inline the function (since it is in the same input unit) and then subsequently vectorise the loop.

1 Like