Comparing Fortran and Julia's Bessel function performance

Thank you for putting that together; the Compiler Explorer link was incredibly helpful. It is interesting to see the difference between what the two compilers output there and how that compares to Julia. The instructions look similar between the two compilers, but the way they evaluate the rational function appears to differ (please correct me if I’m wrong). ifx 2022.1.0 evaluates the first polynomial and then moves on to the second, whereas gfortran 7.3 evaluates the unrolled version of both, going back and forth between terms of the separate polynomials.

Both compilers emit vfmadd213ss instructions; they are just ordered a bit differently. I would imagine the gfortran ordering to be slightly better, but that should be tested. The only difference in the Julia-generated instructions is that it emits vfmadd213pd (on my computer, so it might differ elsewhere). It should also be mentioned that Julia has moved away from the macro version: the evalpoly that Bessels.jl uses is a function, and we rely on the compiler to do the unrolling. But I think there are two things that could explain the 2-3x speedup over the Cephes library, in addition to the overhead of the function call.
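As a small illustration of that unrolling (my own example, not code from Bessels.jl): when the coefficients are a tuple, the length is part of the type, so the compiler can fully unroll evalpoly into a straight chain of fused multiply-adds; with a Vector it falls back to a runtime loop.

coeffs = (1.0, -0.5, 0.25, -0.125)   # tuple: length known at compile time
evalpoly(1.2, coeffs)                # unrolls into straight-line fma code
evalpoly(1.2, collect(coeffs))       # Vector: evaluated with a runtime loop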

Of course, unrolling it like you have in the Fortran code is the first thing that will bring that difference down. The loopy version in the Cephes library is clearly not optimal. But the second thing might be even more costly to the Cephes library, as the rational function is actually evaluated by two separate functions. Horner’s scheme is pretty latency bound (you have to wait for one multiply-add to be done before starting the next), but the two polynomials of the rational function can be evaluated simultaneously. That is, P(x)/Q(x) should be able to exploit SIMD instructions fairly well and reduce the time by almost half (if the lengths of P and Q are similar).
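Here is a minimal sketch of that idea (my illustration, not the Cephes code): the numerator and denominator chains do not depend on each other, so the CPU can overlap their latencies, and the compiler may even pack them into SIMD registers.

function rational_eval(x, P, Q)
    p = P[end]
    q = Q[end]
    for i in length(P)-1:-1:1   # assumes length(P) == length(Q)
        p = muladd(x, p, P[i])  # numerator Horner chain
        q = muladd(x, q, Q[i])  # denominator chain, independent of p
    end
    return p / q
end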

In Julia, the erfinv function has this benchmark…

julia> @btime erfinv($(Ref(0.9))[])
  6.035 ns (0 allocations: 0 bytes)

Now, I’m going to completely replace Q(x) with just dividing by x, so we are benchmarking only one polynomial evaluation (and the division)…
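In toy form, the change looks like this (P and Q here are placeholder coefficients I made up, not the real erfinv tables):

P = ntuple(n -> (-1)^n / n, 10)   # placeholder numerator coefficients
Q = ntuple(n -> 1.0 / n, 10)      # placeholder denominator coefficients

original(x) = evalpoly(x, P) / evalpoly(x, Q)  # rational form
modified(x) = evalpoly(x, P) / x               # Q(x) replaced by plain x

Benchmarking the modified erfinv: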

julia> @btime erfinv($(Ref(0.9))[])
  5.369 ns (0 allocations: 0 bytes)

So this is of course faster, but not by as much as you might think considering I removed an entire polynomial of degree 10… If we benchmark just a single polynomial evaluation of degree 10…

julia> @btime evalpoly($(Ref(0.9))[], $(ntuple(n -> (-1)^n / n, 10)))
  3.084 ns (0 allocations: 0 bytes)

So ya, it looks like the Julia compiler is unrolling the tuple but also recognizing that P and Q can be computed at the same time. The difference between the two versions can probably be explained by the lengths of P and Q not being equal, so the evaluations cannot completely overlap and there are some leftover instructions. Optimal code will take advantage of that, which I doubt the Cephes library does. So I think all of those things, along with the calling overhead, could explain that difference. I have no doubt that we could slightly adjust its routine to give the same performance.

Bessels.jl tries to exploit these kinds of SIMD opportunities in all of its functions. We don’t have a lot of rational approximations, but there are times when we need to evaluate polynomials of large degree, and we can exploit some of the same principles to dramatically reduce the computation time. Instead of evaluating two polynomials separately, we can use a second-order Horner scheme that splits the polynomial into even and odd terms and computes each half independently. Here is a Julia version of that…

# Evaluates the polynomial with coefficients P (lowest degree first, as
# in evalpoly) by splitting it into even and odd terms:
#   p(x) = P_even(x^2) + x * P_odd(x^2)
# The two Horner chains are independent, so they can run in parallel.
@inline function even_odd_poly_add(x, P)
    N = length(P)
    xx = x*x

    out = P[end]      # accumulator for the chain ending at P[end]
    out2 = P[end-1]   # accumulator for the chain ending at P[end-1]

    for i in N-2:-2:2
        out = muladd(xx, out, P[i])
        out2 = muladd(xx, out2, P[i-1])
    end
    if iszero(rem(N, 2))           # even number of coefficients
        return muladd(out, x, out2)
    else                           # odd: fold in the last coefficient
        out = muladd(xx, out, P[1])
        return muladd(x, out2, out)
    end
end
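A quick sanity check that it agrees with evalpoly (the last couple of bits can differ since the operations are reordered):

julia> P = ntuple(n -> (-1)^n / n, 10);

julia> even_odd_poly_add(1.2, P) ≈ evalpoly(1.2, P)
true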

So if we benchmark these for different polynomial degrees…

# N = 5
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 5)))
  2.113 ns (0 allocations: 0 bytes)
-0.8627199999999999

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 5)))
  2.346 ns (0 allocations: 0 bytes)
-0.8627199999999999

# N = 10
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 10)))
  3.079 ns (0 allocations: 0 bytes)
-0.38845094765714294

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 10)))
  2.967 ns (0 allocations: 0 bytes)
-0.38845094765714316

# N = 50
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 50)))
  23.274 ns (0 allocations: 0 bytes)
81.32088598046435

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 50)))
  10.625 ns (0 allocations: 0 bytes)
81.32088598046445

# N = 150
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 150)))
  89.602 ns (0 allocations: 0 bytes)
2.2769534597705355e9

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 150)))
  41.843 ns (0 allocations: 0 bytes)
2.2769534597705407e9

So, as we would expect, evalpoly is slightly faster for smaller polynomials, but there is a transition point at around N = 10 where the even/odd scheme gains a clear advantage, growing to a 2x speedup for higher orders N > 50 (and of course we could use four accumulators as well; see the sketch below). One might also notice the slight difference in accuracy. The accuracy will depend on the coefficients (so that needs to be tested), but for this example the even/odd scheme is actually slightly more accurate. (I also tested the error in the erfinv function, and using the even/odd scheme actually decreased the max ULP error (~1.9 → ~1.7).) Here is the reference value computed in BigFloat:

julia> even_odd_poly_add(big"1.2", ntuple(n -> (-1)^n / n, 50))
81.32088598046447810940231091273091124409909302016418750000000000000000000001763
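For reference, this is roughly how each Float64 result can be compared against that value in ULPs (ulp here is a helper I made up for illustration):

P50 = ntuple(n -> (-1)^n / n, 50)
ref = even_odd_poly_add(big"1.2", P50)              # extended-precision reference
ulp(y) = Float64(abs(y - ref) / eps(Float64(ref)))  # error in units in the last place
ulp(evalpoly(1.2, P50)), ulp(even_odd_poly_add(1.2, P50))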
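And on the four-accumulator remark above, a minimal sketch of the same idea with four independent chains (my illustration; it assumes the number of coefficients is a multiple of 4 to keep the bookkeeping short):

# Evaluates p(x) by splitting the coefficients into four strided chains:
#   p(x) = A(x^4) + x*B(x^4) + x^2*C(x^4) + x^3*D(x^4)
function four_acc_poly_add(x, P)
    N = length(P)   # assumed to be a multiple of 4 here
    x2 = x * x
    x4 = x2 * x2
    out1 = P[N-3]; out2 = P[N-2]; out3 = P[N-1]; out4 = P[N]
    for i in N-7:-4:1
        out1 = muladd(x4, out1, P[i])     # four independent
        out2 = muladd(x4, out2, P[i+1])   # Horner chains that
        out3 = muladd(x4, out3, P[i+2])   # the CPU can execute
        out4 = muladd(x4, out4, P[i+3])   # in parallel
    end
    # recombine: out1 + x*out2 + x^2*out3 + x^3*out4
    return muladd(x2, muladd(x, out4, out3), muladd(x, out2, out1))
end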

Anyway, that is a fairly long-winded answer to say that compilers are very smart… we should set up our code to exploit these aspects as much as possible (though compilers don’t always auto-vectorize when they should, so it’s worth checking that they do) and let the compiler do its job. So yes, even though the algorithms appear to be close transcriptions of each other, it still isn’t really an apples-to-apples comparison between languages. Though how easy it is to get the compiler to do this is, I think, an important consideration. Please correct anything I may have said in error and I will fix it.

This is incredible. It’s awesome to see this work make so much progress over the past year, and thank you for all the hard work on that project. This (IMO :slight_smile: ) is such a cool application and demonstration of this work!

These are all excellent questions that I don’t feel knowledgeable enough to answer (speaking as someone who has not made a single contribution to the language :man_shrugging:). I personally think having your code be easy to use from any language is an advantage. I think projects like GitHub - JuliaPy/PythonCall.jl: Python and Julia in harmony. and GitHub - tshort/StaticCompiler.jl: Compiles Julia code to a standalone library (experimental) are awesome! I don’t really try to convince Python users (maybe if you are using Matlab :wink:) to switch to Julia at all. Honestly, it would be kind of nice if the Julia code could replace some of the math functions in SciPy that call the Cephes library. I think that would be beneficial to both ecosystems… I apologize for the long-winded answer. Again, please correct or add to anything I’ve gotten wrong and I will update.
