Comparing Fortran and Julia's Bessel function performance

The compiler matters too. With the code from @FortranFan above, ifort is a bit faster than gfortran on one of my systems:


# ifort (IFORT) 19.0.5.281 20190815
$ ifort -O3 bes.f90 -o bes_intel_O3
$ ./bes_intel_O3
 BESSEL_J1 Evaluation: Mean compute time for      1000000  evaluations:
  8.479578495025634E-002  seconds.
 Standard deviation in calculated results:   3.150582065370831E-012
$ ./bes_intel_O3
 BESSEL_J1 Evaluation: Mean compute time for      1000000  evaluations:
  8.608586788177490E-002  seconds.
 Standard deviation in calculated results:   3.150582065370831E-012

# GNU Fortran (conda-forge gcc 12.1.0-16) 12.1.0
$ gfortran -O3 bes.f90 -o bes_gfortran12_O3
$ ./bes_gfortran12_O3 
 BESSEL_J1 Evaluation: Mean compute time for      1000000  evaluations:
  0.12651509410934522       seconds.
 Standard deviation in calculated results:    3.1505820653708310E-012
$ ./bes_gfortran12_O3 
 BESSEL_J1 Evaluation: Mean compute time for      1000000  evaluations:
  0.13019724107580261       seconds.
 Standard deviation in calculated results:    0.0000000000000000     
3 Likes

I don’t think you have shown all your code (nothing is being done with out). But since you didn’t declare it and you didn’t use implicit none, out is a single-precision variable. (Congrats! This is a rite of passage every Fortran programmer has been through for decades!) So even though you are computing a double-precision value, you are storing it in a single-precision variable, which causes the behavior you are seeing.
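The effect is easy to demonstrate; here is the same truncation in Julia (just an illustration of single vs. double precision, not the Fortran code itself):

julia> x = 1/3              # a full double-precision value
0.3333333333333333

julia> Float32(x)           # what storing it in a single-precision variable keeps
0.33333334f0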

If you want a function that returns a double result, a modern version would look something like this:

module bess
  use iso_fortran_env, only: wp => real64
  implicit none
contains
  pure function f90_bessel_j1(x) result(r)
    real(wp), intent(in) :: x
    real(wp) :: r
    r = bessel_j1(x)
  end function f90_bessel_j1
end module bess
5 Likes

Thanks @jacobwilliams for trying out the code on your system.

What you’re noticing is about a 35-40% difference between IFORT and gfortran on your system for the Fortran standard implementation of the BESSEL_J1 intrinsic, which is in line with my expectations of the variance one can see between different Fortran compilers on various processors.

So with some intrinsics, Fortran compiler A can be somewhat faster than other compilers; with other mathematical functions, compiler B can be faster, and so forth. On other processors (hardware/OS combinations), it may even be the opposite.

I won’t be surprised if, were one to thoroughly examine the standard intrinsics for mathematical functions (including special functions such as Bessel) in the various Fortran compilers - gfortran, IFORT, NAG Fortran, the LLVM-based ones including NVidia and AMD, specialized ones such as Cray/HPE Fortran, and IBM Fortran (XLF) - across different architectures (x86_64, ARM, etc.) and OSs (Windows, Linux, MacOS, etc.), one would see differences similar to what @jacobwilliams reports above, perhaps even somewhat larger, possibly up to 2X.

So if @heltonmc is noticing about a 40% improvement over gfortran (29 vs 49 ns) on a certain processor used for the performance evaluation, that is a good result to study further academically on the algorithmic side. But that “improvement” is well within the variance one can see among Fortran compiler implementations, which needs to be kept in mind.

So to read anything further into this in terms of Julia vs Fortran makes no sense, especially the HN comment about “Julia 10x faster.”

3 Likes

I too got similar timings, but am no expert :slight_smile: I want to try to give an opinion about the implementation, though I must admit I am only assuming what gfortran is doing under the hood. All of the implementations are somewhat constrained by typically having three (or two) branches: (a) a power series for small arguments, (b) minimax polynomials for medium arguments, and (c) an asymptotic expansion for large arguments. Within that framework, each implementation makes its own tradeoffs.
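For concreteness, here is what branch (a) can look like as runnable Julia (illustrative only: j1_series is a made-up name, and real kernels use tuned coefficients, cutoffs, and Horner forms):

# Branch (a) only: the ascending series J1(x) = sum_{k>=0} (-1)^k/(k!(k+1)!) (x/2)^(2k+1),
# truncated. Adequate for small x; branches (b)/(c) take over for larger x.
function j1_series(x::Float64; nterms::Int = 20)
    term = x / 2                                    # k = 0 term
    s = 0.0
    for k in 0:nterms
        s += term
        term *= -(x * x / 4) / ((k + 1) * (k + 2))  # ratio of consecutive terms
    end
    return s
end

j1_series(1.0)   # ≈ 0.4400505857…, matching besselj1(1.0) at small x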

For example, Bessels.jl employs just two main branches (with the branch at x = 26), using fully compensated arithmetic around the Bessel zeros for small and medium arguments, and a large-argument expansion that is not fully compensated. Therefore, Bessels.jl provides relative tolerances below 1.5 ULP for x < 26, but only absolute tolerances below 1.5 ULP after that. After all, the approach we use for medium arguments is not feasible as x → Inf, because we would have to know all the roots to compensate around.

My original rationale for not fully compensating for x > 26 and giving only absolute tolerances was that the branch was over 3x faster: for x > 26 the routine runs in about ~8 ns, whereas for x < 26 it takes around ~28 ns. One thing I most appreciate about this thread is that it has made me reconsider that speed tradeoff versus fully compensating. As the benchmarks in this thread show, Bessels.jl comes in around ~30 ns, which is essentially dominated by the slowest branch. This is an excellent example of branch prediction, and I wonder whether, if you sorted the array of random values in the Fortran code, you would also see a noticeable difference. When you sort the array in Bessels.jl, the benchmark comes down to around ~12 ns, where you can really see the effect of that faster branch.
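A minimal sketch of that sorting experiment with BenchmarkTools (timings are machine-dependent, and the array size is arbitrary):

using Bessels, BenchmarkTools

x  = 100 .* rand(10_000)   # arguments straddling the x < 26 branch point
xs = sort(x)               # same values, but now the branch is predictable

@btime Bessels.besselj1.($x);    # mixed branches: pays for mispredictions
@btime Bessels.besselj1.($xs);   # long runs of one branch: closer to the fast branch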

If you don’t see a difference in the Fortran routine, my guess would be that it continues the compensation, providing relative tolerances out to higher arguments. One thing I will now think much harder about is the cost of branches and the tradeoffs of optimizing the “fastest” branch. Fully compensating the besselj0 routine for larger arguments would be a very welcome addition to Bessels.jl, because as these benchmarks show, you are really only as fast as your slowest branch (at least for highly optimized single-value functions like these) if you can’t reasonably predict which branch will be taken.

Edit: So it looks like my suspicions were true. gfortran appears to output what the Arb library does near the higher zeros.

# gfortran
src % ./bess
8.5422227289043417E-016

# Bessels.jl
julia> besselj0(313.37426607752786)
8.5691282296808535e-16

# Arb
julia> ArbNumerics.besselj0(ArbNumerics.ArbFloat(313.37426607752786))
8.542222728904342131095e-16

# openlibm
julia> SpecialFunctions.besselj0(313.37426607752786)
8.542268540102593e-16

# converting to big
julia> SpecialFunctions.besselj0(big"313.37426607752786")
6.88717044447645288013177668498237112801935118418650963224675789967214652104016e-16

julia> ArbNumerics.besselj0(ArbNumerics.ArbFloat(big"313.37426607752786"))
6.88717044447645288013e-16

So some interesting details, obviously. It’s important to notice the difference between the numbers in extended precision, though (or rather: what does it mean to convert a number to higher precision?)

julia> big"313.37426607752786"
313.3742660775278600000000000000000000000000000000000000000000000000000000000004

julia> ArbNumerics.ArbFloat(313.37426607752786)
313.3742660775278636720031499863
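
One way to see the distinction directly (using plain BigFloat here; I am assuming the same conversion semantics as the ArbFloat call above): the parsed decimal literal and the widened Float64 are different numbers:

julia> big"313.37426607752786" == BigFloat(313.37426607752786)
false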

So I would be very interested to see what is going on under the hood of gfortran; it definitely looks like there is some conversion to higher precision to match Arb, but computing around zeros is difficult.

Edit 2: I apologize; I was not aware that the Fortran routine we were discussing was also calling libm.

I’m not sure what you are insinuating, considering every benchmark I have reported has been identical to what Fortran users reported in this thread. In some cases it is OK for me to go through source code that has an MIT license, but I try to implement code directly from papers and bridge different ideas. I try to avoid diving into the source code of these functions, because most of them have vague licensing and I want all of the code in Bessels.jl to be MIT-licensed and open source; I would hate for someone to scour through random projects and file copyright claims. At the end of the day, the algorithms in Bessels.jl are clear and easily accessible. I was happy to discuss any improvements from the repository that could benefit the scientific community.

2 Likes

Thank you @rgba and @jacobwilliams, I thought I had also tried with implicit none and declaring an output variable type and didn’t see a change, but I must have accidentally kept around an old shared library or something, because doing this again, I do indeed see the full double precision output. My bad!


I updated my code with the proper output type annotations and it agrees with the nicer, more modern version @jacobwilliams showed, so I’m using their version here.

Here is what I see for the accuracy of the Fortran BESSEL_J1 relative to an arbitrary-precision calculation, compared with Bessels.besselj1, for 1,000,000 random samples between 0 and 200.
[Figure: histograms of error / eps vs counts for Bessels.jl besselj1 and Fortran BESSEL_J1, measured against ArbNumerics]

So the algorithm Fortran is using does indeed have a tighter error distribution near zero than Bessels.jl, but also a significantly longer error tail. That makes for an interesting comparison, where different people may weigh those two characteristics differently.

Click for full reproducible code
#+begin_src fortran
module bess
  use iso_fortran_env, only: wp => real64
  implicit none
contains
  pure function f90_bessel_j1(x) result(r)
    real(wp), intent(in) :: x
    real(wp) :: r
    r = bessel_j1(x)
  end function f90_bessel_j1
end module bess
#+end_src
shell> gfortran bessel_j1.f90 -o bessel_j1_fort.so -shared -fPIC
#+begin_src julia
using Bessels, ArbNumerics, Libdl, Plots

setworkingprecision(ArbFloat, 500)
setextrabits(128)

dlopen("/home/mason/bessel_j1_fort.so") do lib
    fptr = dlsym(lib, :__bess_MOD_f90_bessel_j1)
    f90_besselj1(x) = ccall(fptr, Float64, (Ref{Float64},), x)
    
    xs = 200 .* rand(1_000_000)
    
    v_bessels = Bessels.besselj1.(xs)
    v_fort    = f90_besselj1.(xs)
    v_arb     = ArbNumerics.besselj1.(ArbFloat.(xs))
    
    resid_bessels = float.(v_bessels .- v_arb) ./ eps()
    resid_fort    = float.(v_fort .- v_arb) ./ eps()
    
    plot(xlabel="error / eps", ylabel="counts")
    histogram!(resid_bessels, nbins=41, yscale=:log10, ylims=(1, Inf), label="Bessels.jl besselj1")
    histogram!(resid_fort, fillalpha=0.5, nbins=41, yscale=:log10, ylims=(1, Inf), label="Fortran BESSEL_J1")
end
#+end_src
Click for system details
shell> gfortran --version
GNU Fortran (GCC) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab7* (2022-08-17 13:38 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 12 × AMD Ryzen 5 5600X 6-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 6 on 12 virtual cores
Environment:
  JULIA_NUM_THREADS = 6
3 Likes

I’m a little skeptical of the comment that the Julia Bessel functions are 10x faster than the Fortran ones… That may require some further investigation! :slight_smile:

3 Likes

Yeah, I tend to agree with your skepticism. That’s a noticeable difference; it sounds like the comparison was not apples to apples.

The reason posts from that location were moved first is that it was the first post to bring up benchmarking with some additional context. I did want to include yours, but I felt I would be doing your very detailed explanations a disservice by moving them into a new thread without context.

Anyway, all Bessel posts have been moved now. I don’t see a way they could be arranged to make the thread easier to follow, so apologies for the less-than-ideal order.

Yes, splitting should be done with extreme care. I am sorry about the mess. I think it’s still out of order here, but at least it’s all in one thread.

I am quite a bit out of my league here, but given that gcc and gfortran are open source, one does not need to speculate about the method used to evaluate the intrinsic Bessel functions. One can simply have a look! :slight_smile:
If I am not mistaken, the files below show what happens under the hood for Fortran and (out of curiosity) for C++.
bessel_r8.c.txt (4.6 KB)
bessel_function.tcc.txt (23.1 KB)

/* Implementation of the BESSEL_JN and BESSEL_YN transformational
   function using a recurrence algorithm.
   Copyright (C) 2010-2022 Free Software Foundation, Inc.
   Contributed by Tobias Burnus <burnus@net-b.de>

This file is part of the GNU Fortran runtime library (libgfortran).

Libgfortran is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public
License as published by the Free Software Foundation; either
version 3 of the License, or (at your option) any later version.

Libgfortran is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

Under Section 7 of GPL version 3, you are granted additional
permissions described in the GCC Runtime Library Exception, version
3.1, as published by the Free Software Foundation.

You should have received a copy of the GNU General Public License and
a copy of the GCC Runtime Library Exception along with this program;
see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
<http://www.gnu.org/licenses/>.  */

#include "libgfortran.h"



#define MATHFUNC(funcname) funcname

#if defined (HAVE_GFC_REAL_8)



#if defined (HAVE_JN)
extern void bessel_jn_r8 (gfc_array_r8 * const restrict ret, int n1,
				     int n2, GFC_REAL_8 x);
export_proto(bessel_jn_r8);

void
bessel_jn_r8 (gfc_array_r8 * const restrict ret, int n1, int n2, GFC_REAL_8 x)
{
  int i;
  index_type stride;

  GFC_REAL_8 last1, last2, x2rev;

  stride = GFC_DESCRIPTOR_STRIDE(ret,0);

  if (ret->base_addr == NULL)
    {
      size_t size = n2 < n1 ? 0 : n2-n1+1; 
      GFC_DIMENSION_SET(ret->dim[0], 0, size-1, 1);
      ret->base_addr = xmallocarray (size, sizeof (GFC_REAL_8));
      ret->offset = 0;
    }

  if (unlikely (n2 < n1))
    return;

  if (unlikely (compile_options.bounds_check)
      && GFC_DESCRIPTOR_EXTENT(ret,0) != (n2-n1+1))
    runtime_error("Incorrect extent in return value of BESSEL_JN "
		  "(%ld vs. %ld)", (long int) n2-n1,
		  (long int) GFC_DESCRIPTOR_EXTENT(ret,0));

  stride = GFC_DESCRIPTOR_STRIDE(ret,0);

  if (unlikely (x == 0))
    {
      ret->base_addr[0] = 1;
      for (i = 1; i <= n2-n1; i++)
        ret->base_addr[i*stride] = 0;
      return;
    }

  last1 = MATHFUNC(jn) (n2, x);
  ret->base_addr[(n2-n1)*stride] = last1;

  if (n1 == n2)
    return;

  last2 = MATHFUNC(jn) (n2 - 1, x);
  ret->base_addr[(n2-n1-1)*stride] = last2;

  if (n1 + 1 == n2)
    return;

  x2rev = GFC_REAL_8_LITERAL(2.)/x;

  for (i = n2-n1-2; i >= 0; i--)
    {
      ret->base_addr[i*stride] = x2rev * (i+1+n1) * last2 - last1;
      last1 = last2;
      last2 = ret->base_addr[i*stride];
    }
}

#endif

#if defined (HAVE_YN)
extern void bessel_yn_r8 (gfc_array_r8 * const restrict ret,
				     int n1, int n2, GFC_REAL_8 x);
export_proto(bessel_yn_r8);

void
bessel_yn_r8 (gfc_array_r8 * const restrict ret, int n1, int n2,
			 GFC_REAL_8 x)
{
  int i;
  index_type stride;

  GFC_REAL_8 last1, last2, x2rev;

  stride = GFC_DESCRIPTOR_STRIDE(ret,0);

  if (ret->base_addr == NULL)
    {
      size_t size = n2 < n1 ? 0 : n2-n1+1; 
      GFC_DIMENSION_SET(ret->dim[0], 0, size-1, 1);
      ret->base_addr = xmallocarray (size, sizeof (GFC_REAL_8));
      ret->offset = 0;
    }

  if (unlikely (n2 < n1))
    return;

  if (unlikely (compile_options.bounds_check)
      && GFC_DESCRIPTOR_EXTENT(ret,0) != (n2-n1+1))
    runtime_error("Incorrect extent in return value of BESSEL_JN "
		  "(%ld vs. %ld)", (long int) n2-n1,
		  (long int) GFC_DESCRIPTOR_EXTENT(ret,0));

  stride = GFC_DESCRIPTOR_STRIDE(ret,0);

  if (unlikely (x == 0))
    {
      for (i = 0; i <= n2-n1; i++)
#if defined(GFC_REAL_8_INFINITY)
        ret->base_addr[i*stride] = -GFC_REAL_8_INFINITY;
#else
        ret->base_addr[i*stride] = -GFC_REAL_8_HUGE;
#endif
      return;
    }

  last1 = MATHFUNC(yn) (n1, x);
  ret->base_addr[0] = last1;

  if (n1 == n2)
    return;

  last2 = MATHFUNC(yn) (n1 + 1, x);
  ret->base_addr[1*stride] = last2;

  if (n1 + 1 == n2)
    return;

  x2rev = GFC_REAL_8_LITERAL(2.)/x;

  for (i = 2; i <= n2 - n1; i++)
    {
#if defined(GFC_REAL_8_INFINITY)
      if (unlikely (last2 == -GFC_REAL_8_INFINITY))
	{
	  ret->base_addr[i*stride] = -GFC_REAL_8_INFINITY;
	}
      else
#endif
	{
	  ret->base_addr[i*stride] = x2rev * (i-1+n1) * last2 - last1;
	  last1 = last2;
	  last2 = ret->base_addr[i*stride];
	}
    }
}
#endif

#endif
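
For reference - this is a standard identity, not something specific to gfortran - both loops above apply the three-term recurrence satisfied by J_n and Y_n,

    C_{n+1}(x) = (2n/x) C_n(x) - C_{n-1}(x),    C in {J, Y},

with BESSEL_JN recursing downward in n (the numerically stable direction for J) and BESSEL_YN recursing upward (the stable direction for Y). The starting values come from libm's jn and yn via the MATHFUNC macro.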

Thanks, very interesting. I had never tried the -S flag. Still following the see-rather-than-speculate approach, I was about to suggest analyzing the content of libm, but I realize that you already did so, based on your comment from 8:29 pm (which I had missed).

1 Like

Haha. Douglas Adams’ answer to everything.

1 Like

To me, the thread has thus far proved that the “Julia 10x faster” claim in the Hacker News thread is entirely invalid.

It would be advisable for the likes of @oscardssmith, @Mason, @heltonmc et al. to be very mindful of what appear to be incessant false claims around Julia. If this persists, it will continue to harm the credibility of the Julia community - it already has.

Julia has been a fine effort thus far - it is a good advance, highly useful for certain computing pursuits, computational mathematics in academia and research being a particularly relevant example given this thread.

However, as one of the replies to the above Hacker News thread mentioned, ““often 10X faster” is just plain exaggeration looking at it positively, and plain falsehood spread by the for-profit JuliaComputing, looking at it realistically. The statement does not even reflect the numbers posted on the project’s page, which has, by the way, nothing to do with Fortran directly.”

In this instance, at the root of the matter are the “numbers posted on the project’s page”, which were not clear to begin with, and a tendency by some to then “run with them” to make claims around Julia. Those in academia and research, e.g. those pursuing PhDs, can help by taking great care with numbers posted online: either post only after having investigated matters comprehensively, understood what they mean, and developed proper context and explanations around them, or refrain from posting them online.

7 Likes

This I don’t understand. The Bessels.jl page claims that its implementation of a function (besselk0, specifically) is 10x faster than that of another Julia package, SpecialFunctions.jl. It does not mention Fortran at all (has it been edited?). The claim on the page, comparing the two Julia packages, seems to be easily verifiable:

julia> import Bessels, SpecialFunctions

julia> all( Bessels.besselk0(x) ≈ SpecialFunctions.besselk(0, x) for x in 100*rand(1000) )
true

julia> @benchmark SpecialFunctions.besselk(0, x) setup=(x=100*rand())
BenchmarkTools.Trial: 10000 samples with 909 evaluations.
 Range (min … max):   97.887 ns …  1.176 μs  ┊ GC (min … max): 0.00% … 84.24%
 Time  (median):     170.382 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   182.881 ns ± 44.042 ns  ┊ GC (mean ± σ):  0.16% ±  1.47%

             ▁██▇▇▇▆▆▆▆▅▅▄▃▃▃▂▂▂▁▁▁                            ▂
  ▄▂▃▄▆▅▆▅▅▆▄██████████████████████████▇██▇█▇█▆▇▇▆▆▇▇▆▆▇▅▅▆▅▄▆ █
  97.9 ns       Histogram: log(frequency) by time       368 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark Bessels.besselk0(x) setup=(x=100*rand())
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.264 ns … 22.253 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.315 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.460 ns ±  0.787 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                      ▄█▂▅▁▂           ▁                     ▁
  ▄▄▆▅▇▅▇▆▆▅▃▄▄▁▁▁▁▁▃▃██████▇██▇█▇▅▅▅▇████▆▇▄▅▃▃▄▃▁▁▃▄▁▁▃▁▄▄ █
  7.26 ns      Histogram: log(frequency) by time     12.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.


Even if the SpecialFunctions package calls some Fortran routine, and perhaps that routine could be improved, or compiled differently, etc., the claim on the page is about Julia packages. I’m pretty certain that if someone knew how to improve the performance of SpecialFunctions by tuning some compiler option of the underlying routines (tunable in the context of the Julia package), that would be greatly appreciated. No need for conspiracy theories here :slight_smile: . Neither should we take these social media comments, or their importance, too seriously. The communities of Fortran and Julia users far exceed the readership of these threads; most Fortran and Julia programmers I know don’t even know about the existence of HN, or even of these forums.

(edit: updated benchmark upon request)

6 Likes

@lmiq I agree that what’s in the README of Bessels.jl is legitimate and appropriate (if you launch a package targeting specific special functions while a SpecialFunctions.jl already exists, you have to show what the deal is). And thanks for clarifying that the numbers there are substantiated; this is important.

But I think the main truth of FortranFan’s comment lies in the second part of that sentence. What surprised me earlier is exactly how a quite ordinary benchmark between two Julia packages turned (somewhere, by someone - but it’s nothing unheard of when it comes to Julia marketing) into a trenchant statement about Julia vs Fortran. That’s all.

2 Likes

The statement wasn’t about Julia vs Fortran; it was about modern Julia code vs ancient Fortran code. Fortran is obviously perfectly capable of good performance, but older code often fails to hit that target because it was designed for previous architectures and based on out-of-date research. As far as I can tell, the claim that this represented Julia vs Fortran as languages, rather than as library implementations, originated on this Discourse as a strawman.

3 Likes

FWIW, I only came to hear about that here.

The exact phrase used at HN was:

For a simple example of how much of a difference this can make, look at GitHub - JuliaMath/Bessels.jl: Bessel functions for real arguments and orders which is often 10x faster than AMOS (the old reliable Fortran code).

My understanding is that they compared to the AMOS library, not Fortran in general.

Often when you write a new implementation, it’s quite easy to get better performance, especially if you are willing to accept slightly lower accuracy or a narrower argument range. As an example, here is a sin(x) implementation written in Fortran that is 28x faster than libm: blog/Blog-2.md · master · lfortran / math_intrinsics · GitLab

The speedup on my computer was real; we investigated it very carefully. But it had a narrower argument range and lower accuracy (although still 10^-15 relative accuracy, which might be enough for many applications). So the devil is in the details, as @FortranFan, @kargl and others said above.
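
To make that kind of tradeoff concrete, here is a toy version of the idea in Julia (not the LFortran kernel - toy_sin is a made-up name, and no speed claim is intended): a truncated Taylor polynomial with no argument reduction, valid only for |x| <= pi/2:

# sin(x) as a truncated odd Taylor polynomial, evaluated in Horner form via
# evalpoly; trades argument range for a short, branch-free kernel.
function toy_sin(x::Float64)
    x2 = x * x
    p = evalpoly(x2, (1.0, -1/6, 1/120, -1/5040, 1/362880,
                      -1/39916800, 1/6227020800, -1/1307674368000,
                      1/355687428096000, -1/121645100408832000,
                      1/51090942171709440000))   # 1/(2k+1)! with alternating signs
    return x * p
end

toy_sin(1.0) - sin(1.0)   # ≈ 1e-16 inside |x| <= pi/2; wrong outside that range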

8 Likes

Speaking of individuals making unscrupulous claims with little or no basis in reality, where do these conspiracy theories even come from? What on earth makes you think that JuliaComputing is paying people to shill on HackerNews threads?

The reality is that under-informed people are everywhere, and it is human nature to run one’s mouth on subjects one doesn’t fully understand.

2 Likes