Continuing the discussion from A Modern Fortran Scientific Programming Ecosystem:
if youβre sceptical benchmark it yourself. the script used is on GitHub and linked in the readme. most of the differences come from better asymptotic series that converge faster in various regions. bessels.jl tries pretty hard to avoid continued fraction expanding because theyβre really slow on modern hardware. leveraging vectorization of the polynomial kernel is is another place where itβs ready to get a 4x speedup on modern processors.
also the benchmarks arenβt fully fair to bessels.jl since AMOS has significantly higher error for some of the oscillatory functions near zeros.
ah we will fix the link. the calling path for the comparison is SpecialFunctions.jl which calls openlibm which uses AMOS for itβs bessel functions (due to JIT this isnβt introducing significant call time overhead, stuff gets inlined). we probably should compare directly against AMOS to make it more clear.
if you think this benchmark is biased, please let me know how. good benchmarking is really hard, especially for complicated functions like these. if there are in fact places where AMOS is faster than Bessels.jl we would love to know about it since the library is still being actively developed.
Iβll try to take a look if I get a chance.
But, note that Bessel functions are now part of Fortran, so they are provided by the compiler (since Fortran 2008 I think?). So a benchmark against that would also be interesting.
Thank you for the blog post @jacobwilliams. Truly an amazing effort. I think these kinds of posts are very beneficial to the ecosystem and would love to have something like this in the Julia community. I sometimes feel open-source developments can be quite fractured which makes for a less cohesive experience especially for new users. Iβm sorry that this discussion has led to just comparing Fortran to Julia (as fun as that is). Please feel free to split this topic into a new thread as Iβd be happy to continue discussing and answer any questions.
The benchmarks at Bessels.jl are only meant to exemplify expected performance gains from users coming from SpecialFunctions.jl to Bessels.jl within the Julia community. SpecialFunctions.jl wraps several C and Fortran libraries (OpenLibm, OpenSpecFun, MPFR, etc.) which I do not differentiate in those benchmarks as they are just meant to signal the advantages of using Bessels.jl over SpecialFunctions.jl. With that said the performance gains are very real (one example). After all, I started this development after I discovered that Bessel function calculations were over 75% of the runtime in my research software. They now account for around 25% which is a noticeable decrease.
I am very interested in benchmarking these further against the standard gfortran compiler. I just installed this and ran my first Fortran code today! However, some help with setting up a proper benchmark would be very helpful so it is as fair as possible (and because I donβt know Fortran). I can then post all results here for those interested. Ideally, I would like to compute some distribution of runtimes for variable inputs. In Julia (and all the benchmarks on the Readme) are done like thisβ¦
# Bessels
julia> @benchmark Bessels.airybi(x) setup=(x=rand()*50)
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
Range (min β¦ max): 11.940 ns β¦ 91.978 ns β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 24.514 ns β GC (median): 0.00%
Time (mean Β± Ο): 29.572 ns Β± 12.198 ns β GC (mean Β± Ο): 0.00% Β± 0.00%
ββββββββββ
β
β β ββ β β ββββββ β β β β β β β β β β
ββββββββββββββββββββββββββββββββββββββββββββ
βββββ
ββββββββββ β
11.9 ns Histogram: log(frequency) by time 72.8 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
# SpecialFunctions.jl (wrapper to Amos)
julia> @benchmark SpecialFunctions.airybi(x) setup=(x=rand()*50)
BenchmarkTools.Trial: 5969 samples with 994 evaluations.
Range (min β¦ max): 36.339 ns β¦ 3.881 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 591.031 ns β GC (median): 0.00%
Time (mean Β± Ο): 841.066 ns Β± 545.141 ns β GC (mean Β± Ο): 0.03% Β± 1.51%
β β ββββββββββ β βββββββββββββββ β
βββββββ
ββ
βββββββββββββββββββββββββββββββββββββββββββ
β
β
ββ
ββ
βββ
β
36.3 ns Histogram: log(frequency) by time 2.93 ΞΌs <
Memory estimate: 16 bytes, allocs estimate: 1.
You can see all the different branches hit for the different arguments.
In my 30 min of Fortran coding experience I haveβ¦
program calling_func
real :: startT, endT
real :: a
real:: execTime
real:: r
r = 1.2
call cpu_time(startT)
a = bess(r)
call cpu_time(endT)
execTime = (endT - startT) * 1000000
print '("Time = ",f6.3," seconds.")',execTime
Print *, r
Print *, a
end program calling_func
function bess (r)
implicit none
real :: bess
real :: r
bess = BESSEL_JN(1, r)
end function bess
Any help improving this would be most appreciated.
-Michael
Welcome to the forum! Bessels.jl looks amazing. Great job on that! As you can tell we are always suspicious of benchmarks here. I can definitely help with your benchmark when I get a chance (or maybe somebody else can step in).
Welcome to the forum. I believe the default floating point type in Julia is 64-bit (double precision). Note that
real :: x
in Fortran declares a single precision variable for which compilers will use 32 bits. You could write
program calling_func
implicit none
integer, parameter :: wp = kind(1.0d0)
real(kind=wp) :: a,r
end program calling_func
etc. to get double precision floating point variables.
Thank you for the suggestion! Is there a difference between initializing with double precision :: a
?
This is what I have for now. Let me know if there are any more tips or improvements. Iβm getting around 42 nanoseconds for each evaluation.
program bess
integer, parameter :: p = selected_real_kind(15, 307)
integer, parameter :: n = 50000000
real(p) :: startT, endT, execTime
real(p) :: r, v(n)
real(p) :: a
call random_number(r)
call random_number(v)
do i=1,n
v(i) = 200*v(i)
enddo
call cpu_time(startT)
do i=1,n
a = BESSEL_JN(1, v(i))
enddo
call cpu_time(endT)
execTime = (endT - startT)
print '("Time = ",f6.3," seconds.")',execTime
print *, a
end program bess
Your code is standard Fortran, but it is less flexible, because if you want the code to be single or quadruple precision you need to change many declarations. Also note that Fortran floating point literals are single precision by default. I donβt think your
a = 0.0
causes a problem, but a common mistake is to write
double precision, parameter :: pi = 3.14159265359 ! RHS is single precision
when one should write
double precision, parameter :: pi = 3.14159265359d0 ! d0 makes RHS double precision
The Fortran-lang tutorial has a section on Floating Point Numbers.
Thank you for the link. I updated my code above. I quite like that syntax and we have a similar problem in Julia but the other way around (literals promote Float32 to Float64). I originally thought if I didnβt include some operation within the loop that the compiler would optimize the loop away but it looks like the benchmarks are working as hoped. It looks like the BESSEL_JN
only supports integer arguments for the order. Iβll have to look at other routines as well.
Mmm, which compiler flags are you using? With optimized flags I also would have that fear. Look at this answer, for a fairly similar example: Why a function returns an array is much slower than a subroutine returns an array? (real MWE included) - #2 by Beliavsky (spoiler, the function and the subroutine had the same performance, just the first call did not print nor did anything with the value so the compiler skipped it altogether :).
Of course you are getting the intended behavior with the flags you are using (or with no flags maybe). But if you intend to benchmark for speed feeding proper flags is crucial.
Yes these suspicions are true with more aggressive compiler options gfortran -O3 besselj.f90 -o bess
optimizes away that as expected. Iβll need a more clever benchmark for that. Simply summing up the terms and having it also return the value seems to fix that issue though. Is there an accepted flag that I should run this with? Iβm hoping to avoid any fastmath operations as well. But I didnβt notice any difference between -O1
, -O2
, -O3
.
Thanks again for all the help everyone!
Just about any optimization level other than zero with gfortran, ifort, nvfortran brought the time to zero
with the original code. This has an overhead in that the RHS is not something that could just go in a register, but at least prevents the compiler from eliminating most of the computation β¦
program bess
integer, parameter :: p = selected_real_kind(15, 307)
integer, parameter :: n = 50000000
real(p) :: startT, endT, execTime
real(p) :: r, v(n)
call random_number(v)
v = 200*v
call cpu_time(startT)
v = BESSEL_JN(1, v)
call cpu_time(endT)
execTime = (endT - startT)
print '("Time = ",f6.3," seconds.")',execTime
call random_number(r)
print *, v(max(1,int(r*n)))
end program bess
Thank you for cleaning up that syntax. How would you feel about a benchmark that measures how fast it can sum the array? So with slight modificationβ¦
program bess
integer, parameter :: p = selected_real_kind(15, 307)
integer, parameter :: n = 50000000
real(p) :: startT, endT, execTime
real(p) :: v(n)
real(p) :: a
a = 0.0_p
call random_number(v)
v = 200*v
call cpu_time(startT)
do i=1,n
a = a + BESSEL_JN(1, v(i))
enddo
call cpu_time(endT)
execTime = (endT - startT) / n * 1E9
print '("Time = ",f6.3," nanoseconds.")',execTime
print *, a
end program bess
I see very similar times to your code as well (about 38 nanoseconds for each iteration).
One thing to be wary of is making making sure to be careful about whether you want to allow the compiler to vectorize the computation. I doubt it is in this case, but if you want to benchmark latency of scalar evaluation, you probably want to make sure that the output of one eval is required to compute the input of the next one.
After using Fortran for a while, you would probably tend to do the sum with a function; and more significantly for order one use the specific routine bessel_j1(3) instead of the more general bessel_jn(3)
I would guess; and use some optimization switches (very processor-dependent); although it would probably not have much affect here where you are timing an intrinsic, which would be pre-compiled.
And you might as well get in the habit of adding βimplicit noneβ
g0 is a nice descriptor instead of f6.3, which might have to be changed for different values of N.
program bess
implicit none
integer, parameter :: p = selected_real_kind(15, 307)
integer, parameter :: n = 50000000
real(p) :: startT, endT, execTime
real(p) :: v(n)
real(p) :: a
call random_number(v)
v = 200*v
call cpu_time(startT)
!a = sum( BESSEL_JN(1, v) )
a = sum( BESSEL_J1( v) )
call cpu_time(endT)
execTime = (endT - startT)
print '("Time = ",g0.3," nanoseconds per value.")',execTime/n * 1.0d9
print '("Time = ",g0.3," total seconds.")',execTime
print *, a
end program bess
While weβre discussing floating-point computation some people may like fuller details of it. I recommend google or duckduckgo with 'Goldberg floating point" which should lead you to Goldbergβs famous article on that. Years ago I wrote two Fortran programs called kinds.f90 in f95 and kinds03.f90 in f2003 to give details of all a compilerβs real and integer kinds that SELECTED_REAL_KIND and SELECTED_INT_KIND would give, but neither program includes IEEE_SELECTED_REAL_KIND. If you would like them they are downloadable in https://homepages.ecs.vuw.ac.nz/~harper/fortranstuff.shtml
Free access:
Thank you for those edits. I was actually running into the string issue for f6.3
This has been very informative. I have created a github repository with all these files and some preliminary benchmarks (GitHub - heltonmc/bessel-benchmarks: Benchmarking Bessel Function routines). Iβve already learned some very interesting things so I really appreciate this discussion. I also created an additional section for a sorted array as (at least in my research) we usually sum over a sorted array of roots in a Bessel series. This comparison was perhaps most interesting to me as you can really see the power of branch prediction.
Of course, the repository is very rough and hopefully over time I can continue improving it. Please add any PRs of benchmarks on your computers as they would be most welcome. I have noticed significant differences in runtimes on different machines and for different Julia versions. Hopefully, in the future I can also add comparisons to other Fortran libraries (Amos) as well as include other programming languages.