Comparing Fortran and Julia's Bessel function performance

FortranFan · October 16, 2022, 3:53pm

Welcome to the forum and thank you for your great effort and wonderful dialogue with the special functions. Your work with Bessel functions is most impressive and it really showcases how highly talented computational mathematicians can use Julia, which is incorporating into the platform itself a tremendous amount of desired computational language facilities that will be of particularly of interest and delight to those in academia and research - to advance fundamental sciences.

Generally in the Fortran Community, you will find the folks will appreciate if the microbenchmarks are performed thoroughly and cleanly in an easily comprehensible manner across processors and platforms so that the readers can make proper apples-to-apples comparisons. Also, to get into discussions that strive to explain the algorithmic advances that yield computational benefits with special function evaluations, etc… This is as opposed to mere marketing materials that instead try to make blanket statements around language superiority (Julia 10x better than xxx, etc.)

In terms of Fortran evaluations, you have likely already realized one has to generally take good care to “wire in” the needed instrumentation to attempt computing performance evaluations and would need to keep in mind several aspects:

There are several targeted compiler implementations toward multiple platforms including commercial ones and the performance mileage can vary significantly across implementations. gfortran is one among several and thus gfortran \neq Fortran.
Fortran has an ISO IEC standard and the use of standard-conforming code across compilers and platforms can be a preferred way to level-set the code used toward benchmarks.
With microbenchmarks in sequential computations, CPU_TIME in the Fortran standard may not always provide the required resolution and SYSTEM_CLOCK can be more revealing.
You will know well as the data sizes used toward function evaluations increase, chances are high the system times can be dictated by data memory access and read/write. With Fortran processors, there can be large variances in this the memory data aspect across different compilers resulting in different performance behavior that overwhelm any of the computational aspects around function evaluations.
Simple-minded considerations of stats with a certain number of trial evaluations in the same run with the exclusion of the best and worst performance toward the determination of the “mean” helps with a compiler-based program execution of test cases.

With the above and a few points in mind, here is an edited version of your code you may want to review:

Click to see code

   integer, parameter :: Float32 = selected_real_kind( p=6 )
   integer, parameter :: Float64 = selected_real_kind( p=12 )
   integer, parameter :: WP = Float64

   ! Program data
   real(WP), parameter :: ONE = 1.0_wp
   integer, parameter :: N = 1000000
   real(WP), allocatable :: v(:), r(:)
   integer, parameter :: NTRIALS = 12
   real(Float64) :: Calc_Times(NTRIALS)
   real(Float64) :: Mean_Time
   real(WP) :: rsum(NTRIALS)
   real(WP) :: stddev

   allocate( v(N), r(N) )

   call random_number( v )
   v = v * 200.0_wp
   ! Set first element to a reference value for a numerical check
   v(1) = ONE

   call calc( Calc_Times, rsum )

   Mean_Time = ( sum(Calc_Times) - minval(Calc_Times, dim=1) - maxval(Calc_Times, dim=1) ) / (NTRIALS-2)

   print *, "BESSEL_J1 Evaluation: Mean compute time for ", N, " evaluations:"
   print *, Mean_Time, " seconds."
   stddev = norm2( rsum - sum(rsum)/NTRIALS )
   print *, "Standard deviation in calculated results: ", stddev

contains

   impure elemental subroutine calc( t, sumr )

      ! Argument list
      real(Float64), intent(inout) :: t
      real(WP), intent(inout) :: sumr

      ! Local variables
      real(Float64) :: t0
      real(Float64) :: t1

      call cpu_t( t0)

      r = bessel_j1( v )

      call cpu_t( t1 )

      t = t1 - t0

      ! Check calculation result
      if ( abs(r(1) - 0.44_wp) > 1E-3_wp ) then
         print *, "BESSEL_J1( x=1 ): ", r(1), " vs an expected value of approximately ", 0.44_wp
         error stop "Unexpected result with x = 1.0: " 
      end if
      
      sumr = sum( r ) 

   end subroutine
 
   subroutine cpu_t( time )

      use, intrinsic :: iso_fortran_env, only : I8 => int64

      !.. Argument list
      real(Float64), intent(inout) :: time

      !.. Local variables
      integer(I8) :: tick
      integer(I8) :: rate
      time = 0.0_Float64

      call system_clock (tick, rate)

      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )

      return

   end subroutine

end

C:\temp>gfortran -O3 p.f90 -o p.exe

C:\temp>p.exe
BESSEL_J1 Evaluation: Mean compute time for 1000000 evaluations:
4.9179059994639826E-002 seconds.
Standard deviation in calculated results: 0.0000000000000000

urbanjost · October 16, 2022, 5:58pm

Just a note that nvfortran, gfortran, and ifort produced very similar timings with the simpler test. That is a little surprising as each vendor is free to implement the intrinsics differently. With some of the intrinsics some compilers concentrate on precision, others on speed or error detection, etc …

The netlib bessel functions support fractional orders, and some like the Sandia Mathematical Program Library routines were about 15 to 20 percent faster than the intrinsics when compiled with high optimization last I timed them; but considerably slower than the intrinsics when compiled with common debug flags; for example ( I used the Sandia procedures for a long time, before the intrinsics were available on the majority of common compilers).

FortranFan · October 16, 2022, 8:46pm

@heltonmc ,

To add to above, from a simple-minded Fortran practitioner perspective, especially to clear away all the fog which comes about due to Julia benchmark tools and the various macro add-ons that would appear to do a lot of stuff “under the hood” which is not at all obvious to an unfamiliar reader of Julia commands, a straightforward computation somewhat similar to the Fortran coding approach shown above can be as follows:

using Bessels

function fn!(v::Array{Float64}, r::Array{Float64})
    for i = 1:length(v)
        r[i] = Bessels.besselj1(v[i])
    end
    return nothing
end

v = rand(1000000)*200.0
r = zeros(1000000)
@time x=fn!(v,r)

With above on the same platform where the gfortran-compiled program execution result I had provided previously, Julia yields:

C:\temp>julia b1.jl
  0.086703 seconds (13.91 k allocations: 777.971 KiB, 63.65% compilation time)

So you will notice the time of 0.086703 seconds reported by Julia. I don’t use Julia so I’m unsure how to interpret this timing value. It is about 1.7x greater than gfortran for 1,000,000 evaluations of the Bessel function of the 1st kind, order 1.

Does the Julia time include compilation and allocation even though the @time macro is applied on the function evaluation?

Mason · October 16, 2022, 10:41pm

Yes, if you scroll a bit to the right on the output you showed, you’ll see that it reported

63.65% compilation time

suggesting that the run-time of your benchmark was 0.02861 seconds.

It’s also worth noting that julia’s array indexing and setting is safe by default, so each iteration of your benchmark loop will always include at least one bounds check for r (the compiler will probably, but not always, figure out that the v indexing is inbounds and elide the check). You can annotate the inside of your loop body with the @inbounds macro to eliminate this.

heltonmc · October 16, 2022, 11:03pm

There are a couple things I would be wary of (you can see how I set up the Julia benchmarks at (bessel-benchmarks/besselj.jl at main · heltonmc/bessel-benchmarks · GitHub). v and r are global variables and as Mason said it is not making any inference about the length of r and v so a lot of time is spent simply checking out of bounds access and assignment. Here is an implementation that matches your Fortran code…

function bessel(f; N=50000000, Order=false)
    v = rand(N)*200
    r = similar(v)
    if Order
        v = sort(v)
    end
    tstart = time()
    @inbounds for ind in eachindex(v)
        r[ind] = f(v[ind])
    end
    tend = time()
    t = (tend-tstart) / N * 1e9
    return t
end

julia> bessel(besselj1)
29.67360019683838

julia> bessel(besselj1)
29.674997329711914

Which is about 29.7 ns per evaluation (which is what I report at GitHub - heltonmc/bessel-benchmarks: Benchmarking Bessel Function routines). The code per evaluation that you reported is about 49 ns per eval (which is also similar to what I got). If you wanted to enable bounds checks you can remove the inbounds. My timings removing that statement are …

julia> bessel(besselj1)
31.076979637145996

julia> bessel(besselj1)
30.705680847167972

julia> bessel(besselj1)
30.715522766113285

Mason · October 16, 2022, 11:08pm

They’re not globals inside the function, they’re just regular function arguments which is fine.

shahmoradi · October 16, 2022, 11:14pm

This thread should have stopped when Kargl rightly said so. The front page of Bessel.jl states,

There are instances where we don’t quite match that desired accuracy (within a digit or two) …

What is the goal of this benchmark? that two different algorithms have different performances?
Compare this attitude and vague benchmarking standards to https://github.com/JuliaLang/julia/issues/24568

Mason · October 16, 2022, 11:54pm

If you read the thread, the goal is abundantly clear. Someone in Hacker News brought up Bessels.jl’s performance claims and people doubted whether or not that was truly hitting the reported speed.

Yes, these are different algorithms and they have different performance and accuracy characteristics. The accuracy characteristics are definitely important context for any discussion on this topic, but I’ll note that perhaps @heltonmc was a bit modest in the Readme when mentioning accuracy. What I see, sampling 100_000 inputs between 0 and 200 is that the LibM routines called though SpecialFunctions do indeed have have an accuracy distribution that has more weight closer to zero, than Bessels, but also has a longer tail of low accuracy.

Bessels.jl accuracy compared to arbitrary precision:

                        ┌                                        ┐ 
   [-9.0e-17, -8.0e-17) ┤▏ 1                                       
   [-8.0e-17, -7.0e-17) ┤▏ 7                                       
   [-7.0e-17, -6.0e-17) ┤▏ 65                                      
   [-6.0e-17, -5.0e-17) ┤▍ 218                                     
   [-5.0e-17, -4.0e-17) ┤▌ 464                                     
   [-4.0e-17, -3.0e-17) ┤█▌ 1243                                   
   [-3.0e-17, -2.0e-17) ┤█████▌ 4730                               
   [-2.0e-17, -1.0e-17) ┤████████████████▉ 14749                   
   [-1.0e-17,  0.0    ) ┤█████████████████████████████████  28674  
   [ 0.0    ,  1.0e-17) ┤████████████████████████████████▋ 28454   
   [ 1.0e-17,  2.0e-17) ┤████████████████▋ 14449                   
   [ 2.0e-17,  3.0e-17) ┤█████▎ 4558                               
   [ 3.0e-17,  4.0e-17) ┤█▌ 1252                                   
   [ 4.0e-17,  5.0e-17) ┤▌ 431                                     
   [ 5.0e-17,  6.0e-17) ┤▎ 183                                     
   [ 6.0e-17,  7.0e-17) ┤▎ 135                                     
   [ 7.0e-17,  8.0e-17) ┤▎ 118                                     
   [ 8.0e-17,  9.0e-17) ┤▎ 85                                      
   [ 9.0e-17,  1.0e-16) ┤▏ 64                                      
   [ 1.0e-16,  1.1e-16) ┤▏ 54                                      
   [ 1.1e-16,  1.2e-16) ┤▏ 37                                      
   [ 1.2e-16,  1.3e-16) ┤▏ 19                                      
   [ 1.3e-16,  1.4e-16) ┤▏ 8                                       
   [ 1.4e-16,  1.5e-16) ┤▏ 2                                       
                        └                                        ┘

and OpenLibM

                        ┌                                        ┐ 
   [-2.4e-16, -2.2e-16) ┤▏ 1                                       
   [-2.2e-16, -2.0e-16) ┤  0                                       
   [-2.0e-16, -1.8e-16) ┤▏ 2                                       
   [-1.8e-16, -1.6e-16) ┤▏ 4                                       
   [-1.6e-16, -1.4e-16) ┤▏ 6                                       
   [-1.4e-16, -1.2e-16) ┤▏ 19                                      
   [-1.2e-16, -1.0e-16) ┤▏ 35                                      
   [-1.0e-16, -8.0e-17) ┤▏ 73                                      
   [-8.0e-17, -6.0e-17) ┤▎ 184                                     
   [-6.0e-17, -4.0e-17) ┤▍ 547                                     
   [-4.0e-17, -2.0e-17) ┤██▎ 2934                                  
   [-2.0e-17, -0.0    ) ┤████████████████████████████████▉ 46170   
   [ 0.0    ,  2.0e-17) ┤█████████████████████████████████  46214  
   [ 2.0e-17,  4.0e-17) ┤██▎ 3006                                  
   [ 4.0e-17,  6.0e-17) ┤▍ 533                                     
   [ 6.0e-17,  8.0e-17) ┤▎ 171                                     
   [ 8.0e-17,  1.0e-16) ┤▏ 55                                      
   [ 1.0e-16,  1.2e-16) ┤▏ 34                                      
   [ 1.2e-16,  1.4e-16) ┤▏ 8                                       
   [ 1.4e-16,  1.6e-16) ┤▏ 3                                       
   [ 1.6e-16,  1.8e-16) ┤▏ 1

So yes, one needs to think carefully about which accuracy measure is more important for their usecase. This is important context for benchmarks but obviously does not invalidate benchmarks of this type.

That’s a completely different case, where the stated goal is to compare the same algorithm. That’s a language benchmark where things like algorithmic differences would be an unwanted distraction.

If someone found a faster or better algorithm to implement a Bessel function, with acceptable error characteristics, that’s good news!

heltonmc · October 16, 2022, 11:56pm

Thank you for taking a comment I made in good faith to users that this package is still being developed. Something I do in the evenings because I simply enjoy it and something of a signal that there will be cases where there will not be the desired accuracy because at the end of the day it is a hobby project and I have no background in computer science or programming. In that repository I specifically mention that I was going to test accuracy as well. On quick inspection…

# for x = 1.5
# Bessels.jl
julia> besselj0(1.5)
0.5118276717359181
# gfortran
src % ./bess                          
  0.51182767173591814   

# on the exact first root
# Bessels.jl
julia> besselj0(2.404825557695772768622)
-6.10876525973673e-17
# gfortran
src % ./bess                          
  -6.1087652597367303E-017

# and here is what SpecialFunctions.jl gives
julia> besselj0(2.404825557695772768622)
-5.553876295239997e-17
# and here is what Arb provides
julia> besselj0(ArbFloat(2.404825557695772768622))
-6.108765259736730397082e-17

Bessels.jl and the gfortran return the exact results which also match Arb. I’m sure there are cases where they don’t and I’ll happily try my best to improve those.

I only commented because of general statements like not apples to apples and skepticism of the benchmarks and some general interest in trying Fortran. I commented here to answer any questions people had on these benchmarks in hopes that we could have some dialogue on actual implementation details. Possibly improving algorithms and perhaps share any knowledge on the algorithms that are faster.

Edit: Mason gave a much more thorough analysis that I provided and I thank him a lot for taking the time to do that.

Mason · October 17, 2022, 6:03am

Just an update on this, I tried doing an accuracy comparison between the Bessels.jl implementation and the BESSEL_J1 function provided by the Fortran compiler which the benchmarks here were done in comparison to.

However, what I found was that even with double precision inputs, the accuracy of the BESSEL_J1 function was only good up to single precision accuracy anyways, i.e. the answers only agreed with the arbitrary precision answer up to around 1 part in 10^7. In contrast, the Bessels.jl routine is more like one part in 10^17, since it’s a pretty accurate true double precision function.

Am I doing something wrong here? The fortran code I had was

module bess
  contains
    subroutine f90_bessel_j1(x)
      double precision :: x
      out = bessel_j1(x)
    end subroutine f90_bessel_j1
  end module bess

compiled as

gfortran bessel_j1.f90 -o bessel_j1_fort.so -shared -fPIC

and then linked to and called from julia. I also tried just printing out some values from bessel_j1 to make sure it wasn’t some weird FFI problem, but the values returned from this thing called on double precision inputs from inside a Fortran program had the same characteristics.

If that’s true, then the context for the above benchmarks would instead be that Bessels.jl is outperforming this fortran implementation handily while having much much much higher accuracy.

Is there a builtin double precision routine I can call instead?

rgba · October 17, 2022, 7:44am

Where are you declaring out? If there’s no implicit none and you’re just doing what I’m seeing than it’s single precision.

jacobwilliams · October 17, 2022, 2:04pm

Also, the compiler will matter too. With the code from @FortranFan above, ifort is a bit faster than gfortran on one of my systems:


# ifort (IFORT) 19.0.5.281 20190815
$ ifort -O3 bes.f90 -o bes_intel_O3
$ ./bes_intel_O3
 BESSEL_J1 Evaluation: Mean compute time for      1000000  evaluations:
  8.479578495025634E-002  seconds.
 Standard deviation in calculated results:   3.150582065370831E-012
$ ./bes_intel_O3
 BESSEL_J1 Evaluation: Mean compute time for      1000000  evaluations:
  8.608586788177490E-002  seconds.
 Standard deviation in calculated results:   3.150582065370831E-012

# GNU Fortran (conda-forge gcc 12.1.0-16) 12.1.0
$ gfortran -O3 bes.f90 -o bes_gfortran12_O3
$ ./bes_gfortran12_O3 
 BESSEL_J1 Evaluation: Mean compute time for      1000000  evaluations:
  0.12651509410934522       seconds.
 Standard deviation in calculated results:    3.1505820653708310E-012
$ ./bes_gfortran12_O3 
 BESSEL_J1 Evaluation: Mean compute time for      1000000  evaluations:
  0.13019724107580261       seconds.
 Standard deviation in calculated results:    0.0000000000000000

jacobwilliams · October 17, 2022, 2:13pm

I don’t think you have shown all your code (nothing is being done with out). But, since you didn’t declare it and you didn’t use implicit none, out is a single precision number (Congrats! This is a rite of passage that every Fortran programmer has been through for decades!) So, even though you are computing a double precision value, you are putting it into a single precision variable, which is causing the behavior you are seeing.

If you want a function that returns a double result, a modern version would look something like this:

module bess
  use iso_fortran_env, only: wp => real64
  implicit none
  contains
    pure function f90_bessel_j1(x) result(r)
      real(wp),intent(in) :: x
      real(wp) :: r
      r = bessel_j1(x)
    end function f90_bessel_j1
  end module bess

FortranFan · October 17, 2022, 3:13pm

jacobwilliams:

Also, the compiler will matter too. With the code from @FortranFan above, ifort is a bit faster than gfortran on one of my systems:

# ifort (IFORT) 19.0.5.281 20190815
..
 BESSEL_J1 Evaluation: Mean compute time for      1000000  evaluations:
  8.479578495025634E-002  seconds.
..
# GNU Fortran (conda-forge gcc 12.1.0-16) 12.1.0
$ gfortran -O3 bes.f90 -o bes_gfortran12_O3
$ ./bes_gfortran12_O3 
 BESSEL_J1 Evaluation: Mean compute time for      1000000  evaluations:
  0.12651509410934522       seconds.
..

Thanks @jacobwilliams for trying out the code on your system.

What you’re noticing is about 35-40% difference between IFORT and gfortran on your system for the Fortran standard implementation of the BESSEL_J1 intrinsic which is in line with my expectations in terms of variance one can notice on various processors between different Fortran compilers.

So with some intrinsics, Fortran compiler A can be somewhat faster than other compilers; with other mathematical functions, compiler B can be faster. and so forth. On other processors (hardware/OS combinations), it may even be the opposite.

I won’t be surprised if one were to thoroughly examine the standard intrinsics for mathematical functions including special functions such as Bessel in various Fortran compilers - gfortran, IFORT, NAG Fortran, LLVM-based ones including NVidia, AMD, etc., and specialized ones such as Cray/HPE Fortran, and also IBM Fortran across different architectures (x86_64, ARM, etc.) and OSs (Windows, Linux, MacOS, XLF, etc.), one will see a similar difference as reported by @jacobwilliams above, perhaps even somewhat larger than that, possibly up to 2X.

So if @heltonmc is noticing about 40% improvement over gfortran (29 vs 49 ns) on a certain processor used for the performance evaluation, that is a good result to study further academically on the algorithmic side. But that “improvement” is well within the variance one can see among Fortran compiler implementations which needs to be kept in mind.

So to read anything further in terms of Julia vs Fortran makes no sense, especially that HN comment about “Julia 10x faster.”

heltonmc · October 17, 2022, 3:13pm

I too got similar timings but am no expert I want to try and give an opinion about the implementation but must admit I am only assuming what gfortran is doing under the hood. All of the implementations are somewhat constrained by typically having three (or two) branches that consist of (a) power series for small arguments, (b) minimax polynomials for medium arguments, and (c) asymptotic expansion for large arguments. Within that framework all of the implementations are able to make their own tradeoffs.

For example Bessels.jl employs just two main branches (branched at x < 26) using fully compensated arithmetic around the bessel zeros for the small and medium arguments and using a large argument expansion that is not fully compensated. Therefore, Bessels.jl provides relative tolerances of less than 1.5 ULP for x < 26 but then after that we provide absolute tolerances of less than 1.5 ULP. After all, the approach we use for medium arguments is not feasible for x → Inf because we would have to know all the roots to compensate around.

My original tradeoffs of not fully compensating for x > 26 and only giving absolute tolerances was that the branch was over 3x faster. For x > 26 the routine runs in about ~8ns whereas for x < 26 the routine is around ~28 ns. One thing I am most appreciative of this thread is my initial consideration of that speed tradeoff compared to fully compensating. As you can see with the benchmarks we show in this thread Bessels.jl gives around ~30ns which is essentially dominated by the slowest branch. This is an excellent example of branch prediction and I am wondering if you sorted the array of random values in the Fortran code if you would also see a noticeable difference. When you sort the array in Bessels.jl the benchmark comes down to around ~12 ns where you can really see the effect of that faster branch.

If you don’t see a difference in the Fortran routine, my guess would be that the Fortran routine is continuing that compensation providing relative tolerances to higher arguments. One thing I will think much harder about now is the cost of branches and the idea of optimizing the “fastest” branch and those tradeoffs. Fully compensating the besselj0 routine for larger arguments would be a very welcome addition to Bessels.jl. Because with these benchmarks you can see that you are really only as fast as your slowest branch (at least for these single value functions that can be very optimized) if you can’t reasonably predict which branch to take.

Edit: So it looks like my suspicions were true. gfortran looks to appear to output what the Arb library does near higher zeros.

# gfortran
src % ./bess
8.5422227289043417E-016

# Bessels.jl
julia> besselj0(313.37426607752786)
8.5691282296808535e-16

# Arb
julia> ArbNumerics.besselj0(ArbNumerics.ArbFloat(313.37426607752786))
8.542222728904342131095e-16

# openlibm
julia> SpecialFunctions.besselj0(313.37426607752786)
8.542268540102593e-16

# converting to big
julia> SpecialFunctions.besselj0(big"313.37426607752786")
6.88717044447645288013177668498237112801935118418650963224675789967214652104016e-16

julia> ArbNumerics.besselj0(ArbNumerics.ArbFloat(big"313.37426607752786"))
6.88717044447645288013e-16

So some interesting details obviously. It’s important to notice the difference between the numbers though in extended precision (or rather what does it mean to convert a number to higher precision?)

julia> big"313.37426607752786"
313.3742660775278600000000000000000000000000000000000000000000000000000000000004

julia> ArbNumerics.ArbFloat(313.37426607752786)
313.3742660775278636720031499863

So I would be very interested to see what is going on under the hood of gfortran it definitely looks like there is some conversion to higher precision to match Arb but computing around zeros is difficult.

Edit 2: I apologize I was not aware that the fortran routine we were discussing was also calling libm.

heltonmc · October 17, 2022, 4:08pm

I’m not for sure what you are insinuating considering every benchmark I have reported has been identical to what was reported in this thread by Fortran users. In some cases it is ok for me to go through source code that has an MIT license but I try to implement code directly from papers and bridge different ideas. I try to avoid diving into the source code of these functions because most of them have vague licensing and I want all of the code at Bessels.jl to be MIT and open source. I would hate for someone to scour through random projects and file copyright claims. At the end of the day, the algorithms at Bessels.jl are clear and easily accessible. I was happy discussing any improvements from the repository that could possibly be beneficial to the scientific community.

Mason · October 17, 2022, 4:36pm

Thank you @rgba and @jacobwilliams, I thought I had also tried with implicit none and declaring an output variable type and didn’t see a change, but I must have accidentally kept around an old shared library or something, because doing this again, I do indeed see the full double precision output. My bad!

I updated my code with the proper output type annotations and it agrees with the nicer, more modern version @jacobwilliams showed, so I’m using their version here.

Here is what I see for the accuracy of the Fortran BESSEL_J1 relative to an arbitrary precision calculuation, and compared to Bessels.bessel_j1 for 1,000,000 random samples between 0 and 200.
acc

So the algorithm Fortran is using does indeed have a tighter error distribution near zero relative to Bessels, but also a significantly longer error tail. That makes for a rather interesting comparison where different people may value the different characteristics differently.

Click for full reproducible code

#+begin_src fortran
module bess
  use iso_fortran_env, only: wp => real64
  implicit none
  contains
    pure function f90_bessel_j1(x) result(r)
      real(wp),intent(in) :: x
      real(wp) :: r
      r = bessel_j1(x)
    end function f90_bessel_j1
  end module bess
#+end_src

shell> gfortran bessel_j1.f90 -o bessel_j1_fort.so -shared -fPIC

#+begin_src julia
using Bessels, ArbNumerics, Libdl, Plots

setworkingprecision(ArbFloat, 500)
setextrabits(128)

dlopen("/home/mason/bessel_j1_fort.so") do lib
    fptr = dlsym(lib, :__bess_MOD_f90_bessel_j1)
    f90_besselj1(x) = ccall(fptr, Float64, (Ref{Float64},), x)
    
    xs = 200 .* rand(1_000_000)
    
    v_bessels = Bessels.besselj1.(xs)
    v_fort    = f90_besselj1.(xs)
    v_arb     = ArbNumerics.besselj1.(ArbFloat.(xs))
    
    resid_bessels = float.(v_bessels .- v_arb) ./ eps()
    resid_fort    = float.(v_fort .- v_arb) ./ eps()
    
    plot(xlabel="error / eps", ylabel="counts")
    histogram!(resid_bessels, nbins=41, yscale=:log10, ylims=(1, Inf), label="Bessels.jl besselj1")
    histogram!(resid_fort, fillalpha=0.5, nbins=41, yscale=:log10, ylims=(1, Inf), label="Fortran BESSEL_J1")
end
#+end_src

Click for system details

shell> gfortran --version
GNU Fortran (GCC) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab7* (2022-08-17 13:38 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 12 × AMD Ryzen 5 5600X 6-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 6 on 12 virtual cores
Environment:
  JULIA_NUM_THREADS = 6

jacobwilliams · October 14, 2022, 11:22pm

I’m a little skeptical of the comment that the Julia Bessel functions are 10x faster than the Fortran ones… That may require some further investigation!

gnikit · October 14, 2022, 11:35pm

Yeah, I tend to agree with your skepticism. That’s a noticeable difference, sounds like the comparison was not apples to apples.

gnikit · October 16, 2022, 2:41pm

The reason why posts from that location were moved first was because that was the first post that brought up benchmarking with some additional context. I did want to include yours but it felt I would be doing your very detailed explanations a disservice by moving them without some context into a new thread.

Anyway, all Bessel posts have been moved now. I don’t see a way that they could be arranged to make the thread easier to follow, so apologies for the not so ideal order.

Topic		Replies	Views
Fortran port of the Bessels.jl library Announcements	41	2072	November 22, 2022
Julia: Fast as Fortran, Beautiful as Python	184	11700	November 13, 2022
Using Julia libraries in Fortran with StaticCompiler.jl	11	948	October 22, 2022
An interesting video: Python vs Fortran vs GNU Octave/MATLAB --- side by side performance comparison	5	1175	January 6, 2022
Simple summation 8x slower than in Julia	89	14978	April 2, 2022

Comparing Fortran and Julia's Bessel function performance

Related topics