Comparing Fortran and Julia's Bessel function performance

Continuing the discussion from A Modern Fortran Scientific Programming Ecosystem:

if you’re sceptical benchmark it yourself. the script used is on GitHub and linked in the readme. most of the differences come from better asymptotic series that converge faster in various regions. bessels.jl tries pretty hard to avoid continued fraction expanding because they’re really slow on modern hardware. leveraging vectorization of the polynomial kernel is is another place where it’s ready to get a 4x speedup on modern processors.

2 Likes

also the benchmarks aren’t fully fair to bessels.jl since AMOS has significantly higher error for some of the oscillatory functions near zeros.

3 Likes

ah we will fix the link. the calling path for the comparison is SpecialFunctions.jl which calls openlibm which uses AMOS for it’s bessel functions (due to JIT this isn’t introducing significant call time overhead, stuff gets inlined). we probably should compare directly against AMOS to make it more clear.

if you think this benchmark is biased, please let me know how. good benchmarking is really hard, especially for complicated functions like these. if there are in fact places where AMOS is faster than Bessels.jl we would love to know about it since the library is still being actively developed.

I’ll try to take a look if I get a chance.

But, note that Bessel functions are now part of Fortran, so they are provided by the compiler (since Fortran 2008 I think?). So a benchmark against that would also be interesting.

2 Likes

Thank you for the blog post @jacobwilliams. Truly an amazing effort. I think these kinds of posts are very beneficial to the ecosystem and would love to have something like this in the Julia community. I sometimes feel open-source developments can be quite fractured which makes for a less cohesive experience especially for new users. I’m sorry that this discussion has led to just comparing Fortran to Julia (as fun as that is). Please feel free to split this topic into a new thread as I’d be happy to continue discussing and answer any questions.

The benchmarks at Bessels.jl are only meant to exemplify expected performance gains from users coming from SpecialFunctions.jl to Bessels.jl within the Julia community. SpecialFunctions.jl wraps several C and Fortran libraries (OpenLibm, OpenSpecFun, MPFR, etc.) which I do not differentiate in those benchmarks as they are just meant to signal the advantages of using Bessels.jl over SpecialFunctions.jl. With that said the performance gains are very real (one example). After all, I started this development after I discovered that Bessel function calculations were over 75% of the runtime in my research software. They now account for around 25% which is a noticeable decrease.

I am very interested in benchmarking these further against the standard gfortran compiler. I just installed this and ran my first Fortran code today! However, some help with setting up a proper benchmark would be very helpful so it is as fair as possible (and because I don’t know Fortran). I can then post all results here for those interested. Ideally, I would like to compute some distribution of runtimes for variable inputs. In Julia (and all the benchmarks on the Readme) are done like this…

# Bessels
julia> @benchmark Bessels.airybi(x) setup=(x=rand()*50)
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  11.940 ns … 91.978 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     24.514 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   29.572 ns Β± 12.198 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

         β–β–‡β–ƒβ–ˆβ–‚β–‡β–‚β–†β–β–…β–… β–„ β–„ β–ƒβ–ƒ β–ƒ β–ƒ ▂▂▁▂▁▂  β–‚ ▁ β–‚  β–‚ ▁ β–‚  β–‚ ▁ β–‚   β–‚
  β–†β–ƒβ–ƒβ–ˆβ–β–ˆβ–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–ˆβ–‡β–ˆβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–‡β–ˆβ–‡β–ˆβ–…β–†β–ˆβ–ƒβ–ˆβ–…β–ˆβ–ƒβ–ƒβ–ˆβ–„β–ˆβ–„β–ˆβ–ƒβ–ˆ β–ˆ
  11.9 ns      Histogram: log(frequency) by time      72.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

# SpecialFunctions.jl (wrapper to Amos)
julia> @benchmark SpecialFunctions.airybi(x) setup=(x=rand()*50)
BenchmarkTools.Trial: 5969 samples with 994 evaluations.
 Range (min … max):   36.339 ns …   3.881 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     591.031 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   841.066 ns Β± 545.141 ns  β”Š GC (mean Β± Οƒ):  0.03% Β± 1.51%

  β–ƒ      β–ƒ β–„β–ˆβ–‡β–†β–„β–„β–ƒβ–ƒβ–‚β–‚    β–‚ ▁▂▂▂▂▂▂▃▂▃▂▂▁▁▁                      β–‚
  β–ˆβ–‡β–β–β–β–β–…β–ˆβ–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–ˆβ–†β–‡β–†β–‡β–†β–†β–‡β–…β–…β–…β–†β–…β–„β–…β–†β–†β–… β–ˆ
  36.3 ns       Histogram: log(frequency) by time       2.93 ΞΌs <

 Memory estimate: 16 bytes, allocs estimate: 1.

You can see all the different branches hit for the different arguments.

In my 30 min of Fortran coding experience I have…

program calling_func
    real :: startT, endT
    real :: a
    real:: execTime
    real:: r

    r = 1.2
    call cpu_time(startT)

    a = bess(r) 
    
    call cpu_time(endT)
    execTime = (endT - startT) * 1000000
    print '("Time = ",f6.3," seconds.")',execTime

    Print *, r
    Print *, a
    
 end program calling_func
 
 
function bess (r)  
implicit none      
    real :: bess   
    real :: r     
    bess = BESSEL_JN(1, r)
end function bess

Any help improving this would be most appreciated.

-Michael

5 Likes

Welcome to the forum! Bessels.jl looks amazing. Great job on that! As you can tell we are always suspicious of benchmarks here. :slight_smile: I can definitely help with your benchmark when I get a chance (or maybe somebody else can step in).

4 Likes

Welcome to the forum. I believe the default floating point type in Julia is 64-bit (double precision). Note that

real :: x

in Fortran declares a single precision variable for which compilers will use 32 bits. You could write

program calling_func
implicit none
integer, parameter :: wp = kind(1.0d0)
real(kind=wp) :: a,r
end program calling_func

etc. to get double precision floating point variables.

2 Likes

Thank you for the suggestion! Is there a difference between initializing with double precision :: a?
This is what I have for now. Let me know if there are any more tips or improvements. I’m getting around 42 nanoseconds for each evaluation.

program bess
    integer, parameter :: p = selected_real_kind(15, 307)
    integer, parameter :: n = 50000000
    real(p) :: startT, endT, execTime
    real(p) :: r, v(n)
    real(p) :: a
    
    call random_number(r)
    call random_number(v)

    do i=1,n
        v(i) = 200*v(i)
    enddo

    call cpu_time(startT)

    do i=1,n
        a = BESSEL_JN(1, v(i))
    enddo
    
    call cpu_time(endT)
    execTime = (endT - startT)
    print '("Time = ",f6.3," seconds.")',execTime
    print *, a
 end program bess

Your code is standard Fortran, but it is less flexible, because if you want the code to be single or quadruple precision you need to change many declarations. Also note that Fortran floating point literals are single precision by default. I don’t think your

a = 0.0

causes a problem, but a common mistake is to write

double precision, parameter :: pi = 3.14159265359 ! RHS is single precision

when one should write

double precision, parameter :: pi = 3.14159265359d0 ! d0 makes RHS double precision

The Fortran-lang tutorial has a section on Floating Point Numbers.

1 Like

Thank you for the link. I updated my code above. I quite like that syntax and we have a similar problem in Julia but the other way around (literals promote Float32 to Float64). I originally thought if I didn’t include some operation within the loop that the compiler would optimize the loop away but it looks like the benchmarks are working as hoped. It looks like the BESSEL_JN only supports integer arguments for the order. I’ll have to look at other routines as well.

2 Likes

Mmm, which compiler flags are you using? With optimized flags I also would have that fear. Look at this answer, for a fairly similar example: Why a function returns an array is much slower than a subroutine returns an array? (real MWE included) - #2 by Beliavsky (spoiler, the function and the subroutine had the same performance, just the first call did not print nor did anything with the value so the compiler skipped it altogether :).

Of course you are getting the intended behavior with the flags you are using (or with no flags maybe). But if you intend to benchmark for speed feeding proper flags is crucial.

Yes these suspicions are true with more aggressive compiler options gfortran -O3 besselj.f90 -o bess optimizes away that as expected. I’ll need a more clever benchmark for that. Simply summing up the terms and having it also return the value seems to fix that issue though. Is there an accepted flag that I should run this with? I’m hoping to avoid any fastmath operations as well. But I didn’t notice any difference between -O1, -O2, -O3.

Thanks again for all the help everyone!

1 Like

Just about any optimization level other than zero with gfortran, ifort, nvfortran brought the time to zero
with the original code. This has an overhead in that the RHS is not something that could just go in a register, but at least prevents the compiler from eliminating most of the computation …

program bess
    integer, parameter :: p = selected_real_kind(15, 307)
    integer, parameter :: n = 50000000
    real(p) :: startT, endT, execTime
    real(p) :: r, v(n)
    
    call random_number(v)
    v = 200*v

    call cpu_time(startT)
    v = BESSEL_JN(1, v)
    call cpu_time(endT)

    execTime = (endT - startT)
    print '("Time = ",f6.3," seconds.")',execTime

    call random_number(r)
    print *, v(max(1,int(r*n)))
 end program bess
2 Likes

Thank you for cleaning up that syntax. How would you feel about a benchmark that measures how fast it can sum the array? So with slight modification…

program bess
    integer, parameter :: p = selected_real_kind(15, 307)
    integer, parameter :: n = 50000000
    real(p) :: startT, endT, execTime
    real(p) :: v(n)
    real(p) :: a

    a = 0.0_p
    call random_number(v)
    v = 200*v

    call cpu_time(startT)
    do i=1,n
        a = a + BESSEL_JN(1, v(i))
    enddo
    call cpu_time(endT)

    execTime = (endT - startT) / n * 1E9
    print '("Time = ",f6.3," nanoseconds.")',execTime
    print *, a
 end program bess

I see very similar times to your code as well (about 38 nanoseconds for each iteration).

One thing to be wary of is making making sure to be careful about whether you want to allow the compiler to vectorize the computation. I doubt it is in this case, but if you want to benchmark latency of scalar evaluation, you probably want to make sure that the output of one eval is required to compute the input of the next one.

After using Fortran for a while, you would probably tend to do the sum with a function; and more significantly for order one use the specific routine bessel_j1(3) instead of the more general bessel_jn(3)
I would guess; and use some optimization switches (very processor-dependent); although it would probably not have much affect here where you are timing an intrinsic, which would be pre-compiled.
And you might as well get in the habit of adding β€œimplicit none” :slight_smile:

g0 is a nice descriptor instead of f6.3, which might have to be changed for different values of N.

program bess
implicit none
integer, parameter :: p = selected_real_kind(15, 307)
integer, parameter :: n = 50000000
real(p) :: startT, endT, execTime
real(p) :: v(n)
real(p) :: a

    call random_number(v)
    v = 200*v

    call cpu_time(startT)
    !a = sum( BESSEL_JN(1, v) )
    a = sum( BESSEL_J1( v) )
    call cpu_time(endT)

    execTime = (endT - startT) 
    print '("Time = ",g0.3," nanoseconds per value.")',execTime/n * 1.0d9
    print '("Time = ",g0.3," total seconds.")',execTime
    print *, a
 end program bess
1 Like

While we’re discussing floating-point computation some people may like fuller details of it. I recommend google or duckduckgo with 'Goldberg floating point" which should lead you to Goldberg’s famous article on that. Years ago I wrote two Fortran programs called kinds.f90 in f95 and kinds03.f90 in f2003 to give details of all a compiler’s real and integer kinds that SELECTED_REAL_KIND and SELECTED_INT_KIND would give, but neither program includes IEEE_SELECTED_REAL_KIND. If you would like them they are downloadable in https://homepages.ecs.vuw.ac.nz/~harper/fortranstuff.shtml

3 Likes

Free access:

4 Likes

Thank you for those edits. I was actually running into the string issue for f6.3 :slight_smile:

This has been very informative. I have created a github repository with all these files and some preliminary benchmarks (GitHub - heltonmc/bessel-benchmarks: Benchmarking Bessel Function routines). I’ve already learned some very interesting things so I really appreciate this discussion. I also created an additional section for a sorted array as (at least in my research) we usually sum over a sorted array of roots in a Bessel series. This comparison was perhaps most interesting to me as you can really see the power of branch prediction.

Of course, the repository is very rough and hopefully over time I can continue improving it. Please add any PRs of benchmarks on your computers as they would be most welcome. I have noticed significant differences in runtimes on different machines and for different Julia versions. Hopefully, in the future I can also add comparisons to other Fortran libraries (Amos) as well as include other programming languages.

2 Likes