Comparing Fortran and Julia's Bessel function performance

Yes, splitting should be done with extreme care. I am sorry about the mess. I think it’s still out of order here, but at least it’s all in one thread.

I am quite a bit out of my league here, but given that gcc and gfortran are open source, one does not need to speculate about the method used to evaluate the intrinsic Bessel functions. One can simply have a look! :slight_smile:
If I am not mistaken, the files below show what happens under the hood for Fortran and (out of curiosity) for C++.
bessel_r8.c.txt (4.6 KB)
bessel_function.tcc.txt (23.1 KB)

/* Implementation of the BESSEL_JN and BESSEL_YN transformational
   function using a recurrence algorithm.
   Copyright (C) 2010-2022 Free Software Foundation, Inc.
   Contributed by Tobias Burnus <burnus@net-b.de>

This file is part of the GNU Fortran runtime library (libgfortran).

Libgfortran is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public
License as published by the Free Software Foundation; either
version 3 of the License, or (at your option) any later version.

Libgfortran is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

Under Section 7 of GPL version 3, you are granted additional
permissions described in the GCC Runtime Library Exception, version
3.1, as published by the Free Software Foundation.

You should have received a copy of the GNU General Public License and
a copy of the GCC Runtime Library Exception along with this program;
see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
<http://www.gnu.org/licenses/>.  */

#include "libgfortran.h"



#define MATHFUNC(funcname) funcname

#if defined (HAVE_GFC_REAL_8)



#if defined (HAVE_JN)
extern void bessel_jn_r8 (gfc_array_r8 * const restrict ret, int n1,
				     int n2, GFC_REAL_8 x);
export_proto(bessel_jn_r8);

void
bessel_jn_r8 (gfc_array_r8 * const restrict ret, int n1, int n2, GFC_REAL_8 x)
{
  int i;
  index_type stride;

  GFC_REAL_8 last1, last2, x2rev;

  stride = GFC_DESCRIPTOR_STRIDE(ret,0);

  if (ret->base_addr == NULL)
    {
      size_t size = n2 < n1 ? 0 : n2-n1+1; 
      GFC_DIMENSION_SET(ret->dim[0], 0, size-1, 1);
      ret->base_addr = xmallocarray (size, sizeof (GFC_REAL_8));
      ret->offset = 0;
    }

  if (unlikely (n2 < n1))
    return;

  if (unlikely (compile_options.bounds_check)
      && GFC_DESCRIPTOR_EXTENT(ret,0) != (n2-n1+1))
    runtime_error("Incorrect extent in return value of BESSEL_JN "
		  "(%ld vs. %ld)", (long int) n2-n1,
		  (long int) GFC_DESCRIPTOR_EXTENT(ret,0));

  stride = GFC_DESCRIPTOR_STRIDE(ret,0);

  if (unlikely (x == 0))
    {
      ret->base_addr[0] = 1;
      for (i = 1; i <= n2-n1; i++)
        ret->base_addr[i*stride] = 0;
      return;
    }

  last1 = MATHFUNC(jn) (n2, x);
  ret->base_addr[(n2-n1)*stride] = last1;

  if (n1 == n2)
    return;

  last2 = MATHFUNC(jn) (n2 - 1, x);
  ret->base_addr[(n2-n1-1)*stride] = last2;

  if (n1 + 1 == n2)
    return;

  x2rev = GFC_REAL_8_LITERAL(2.)/x;

  for (i = n2-n1-2; i >= 0; i--)
    {
      ret->base_addr[i*stride] = x2rev * (i+1+n1) * last2 - last1;
      last1 = last2;
      last2 = ret->base_addr[i*stride];
    }
}

#endif

#if defined (HAVE_YN)
extern void bessel_yn_r8 (gfc_array_r8 * const restrict ret,
				     int n1, int n2, GFC_REAL_8 x);
export_proto(bessel_yn_r8);

void
bessel_yn_r8 (gfc_array_r8 * const restrict ret, int n1, int n2,
			 GFC_REAL_8 x)
{
  int i;
  index_type stride;

  GFC_REAL_8 last1, last2, x2rev;

  stride = GFC_DESCRIPTOR_STRIDE(ret,0);

  if (ret->base_addr == NULL)
    {
      size_t size = n2 < n1 ? 0 : n2-n1+1; 
      GFC_DIMENSION_SET(ret->dim[0], 0, size-1, 1);
      ret->base_addr = xmallocarray (size, sizeof (GFC_REAL_8));
      ret->offset = 0;
    }

  if (unlikely (n2 < n1))
    return;

  if (unlikely (compile_options.bounds_check)
      && GFC_DESCRIPTOR_EXTENT(ret,0) != (n2-n1+1))
    runtime_error("Incorrect extent in return value of BESSEL_JN "
		  "(%ld vs. %ld)", (long int) n2-n1,
		  (long int) GFC_DESCRIPTOR_EXTENT(ret,0));

  stride = GFC_DESCRIPTOR_STRIDE(ret,0);

  if (unlikely (x == 0))
    {
      for (i = 0; i <= n2-n1; i++)
#if defined(GFC_REAL_8_INFINITY)
        ret->base_addr[i*stride] = -GFC_REAL_8_INFINITY;
#else
        ret->base_addr[i*stride] = -GFC_REAL_8_HUGE;
#endif
      return;
    }

  last1 = MATHFUNC(yn) (n1, x);
  ret->base_addr[0] = last1;

  if (n1 == n2)
    return;

  last2 = MATHFUNC(yn) (n1 + 1, x);
  ret->base_addr[1*stride] = last2;

  if (n1 + 1 == n2)
    return;

  x2rev = GFC_REAL_8_LITERAL(2.)/x;

  for (i = 2; i <= n2 - n1; i++)
    {
#if defined(GFC_REAL_8_INFINITY)
      if (unlikely (last2 == -GFC_REAL_8_INFINITY))
	{
	  ret->base_addr[i*stride] = -GFC_REAL_8_INFINITY;
	}
      else
#endif
	{
	  ret->base_addr[i*stride] = x2rev * (i-1+n1) * last2 - last1;
	  last1 = last2;
	  last2 = ret->base_addr[i*stride];
	}
    }
}
#endif

#endif
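
If I read the code correctly, libm is asked only for the two seed values in each routine: the loop in bessel_jn_r8 then implements the standard downward recurrence

J_{n-1}(x) = \frac{2n}{x} J_n(x) - J_{n+1}(x),

starting from jn(n2, x) and jn(n2 - 1, x), while bessel_yn_r8 runs the analogous recurrence upward, Y_{n+1}(x) = \frac{2n}{x} Y_n(x) - Y_{n-1}(x), starting from yn(n1, x) and yn(n1 + 1, x). So the evaluation of any single Bessel function is still delegated to libm; only the filling of the requested order range n1..n2 happens in libgfortran.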

Thanks, very interesting. I had never tried the -S flag. Still following the see-rather-than-speculate approach, I was about to suggest analyzing the content of libm, but I realize that you already did so, based on your comment from 8:29 pm (which I had missed).

1 Like

Haha. Douglas Adams’ answer to everything.

1 Like

To me, the thread has thus far proved that the “Julia 10x faster” claim in the Hacker News thread is entirely invalid.

It would be advisable for the likes of @oscardssmith , @Mason , @heltonmc et al. to be very mindful of what appear to be incessant false claims around Julia. If this persists, it will continue to harm the credibility of the Julia community - it already has.

Julia has been a fine effort thus far - it is a good advance, highly useful toward certain computing pursuits, computational mathematics in academia and research being a particularly relevant example given this thread.

However, as one of the replies to the above Hacker News thread mentioned, ““often 10X faster” is just plain exaggeration looking at it positively, and plain falsehood spread by the for-profit JuliaComputing, looking at it realistically. The statement does not even reflect the numbers posted on the project’s page, which has, by the way, nothing to do with Fortran directly.”

In this instance, at the root of the matter are the “numbers posted on the project’s page,” which were not clear to begin with, and a tendency by some to then “run with them” to make claims around Julia. Those in academia and research, e.g. those pursuing PhDs, can help by taking great care with numbers posted online: either post them only after having investigated matters comprehensively, understood what they mean, and developed proper context and explanations around them, or refrain from posting them online.

7 Likes

This I don’t understand. The Bessels.jl page claims that its implementation of some function (besselk0, specifically) is 10x faster than that of another Julia package, SpecialFunctions.jl. It does not mention Fortran at all (has it been edited?). The claim on the page, comparing the two Julia packages, seems easy to verify:

julia> import Bessels, SpecialFunctions

julia> all( Bessels.besselk0(x) ≈ SpecialFunctions.besselk(0, x) for x in 100*rand(1000) )
true

julia> @benchmark SpecialFunctions.besselk(0, x) setup=(x=100*rand())
BenchmarkTools.Trial: 10000 samples with 909 evaluations.
 Range (min … max):   97.887 ns …  1.176 μs  ┊ GC (min … max): 0.00% … 84.24%
 Time  (median):     170.382 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   182.881 ns ± 44.042 ns  ┊ GC (mean ± σ):  0.16% ±  1.47%

             ▁██▇▇▇▆▆▆▆▅▅▄▃▃▃▂▂▂▁▁▁                            ▂
  ▄▂▃▄▆▅▆▅▅▆▄██████████████████████████▇██▇█▇█▆▇▇▆▆▇▇▆▆▇▅▅▆▅▄▆ █
  97.9 ns       Histogram: log(frequency) by time       368 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark Bessels.besselk0(x) setup=(x=100*rand())
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.264 ns … 22.253 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.315 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.460 ns ±  0.787 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                      ▄█▂▅▁▂           ▁                     ▁
  ▄▄▆▅▇▅▇▆▆▅▃▄▄▁▁▁▁▁▃▃██████▇██▇█▇▅▅▅▇████▆▇▄▅▃▃▄▃▁▁▃▄▁▁▃▁▄▄ █
  7.26 ns      Histogram: log(frequency) by time     12.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.


Even if the SpecialFunctions package calls some Fortran routine, and perhaps that routine could be improved, or compiled differently, etc., the claim on the page is a comparison between Julia packages. I’m pretty certain that if someone knew how to improve the performance of SpecialFunctions by tuning some (tunable in the context of the Julia package) compiler option of the underlying routines, that would be greatly appreciated. No need for conspiracy theories here :slight_smile: . Nor should we take these social media comments, or their importance, too seriously. The communities of Fortran and Julia users far exceed the readership of these threads. Most Fortran and Julia programmers I know don’t even know about the existence of HN, or even of these forums.

(edit: updated benchmark upon request)

6 Likes

@lmiq I agree that what’s in the README of Bessels.jl is legitimate and appropriate (if you launch a package targeting specific special functions while a SpecialFunctions.jl exists, you have to show what the deal is). And thanks for clarifying that the numbers there are substantiated; this is important.

But I think that the main truth in FortranFan’s comment lies in the second part of that sentence. I was surprised earlier on exactly by how a quite ordinary benchmark between two Julia packages was turned (somewhere, by someone - but it’s nothing unheard of when it comes to Julia marketing) into a trenchant statement about Julia vs Fortran. That’s all.

2 Likes

The statement wasn’t one about Julia vs Fortran. It was about modern Julia code vs ancient Fortran code. Fortran is obviously perfectly capable of good performance, but older code often fails to hit that target due to being designed for previous architectures and being based on out-of-date research. As far as I can tell, the claim that this represented Julia vs Fortran as languages, rather than as library implementations, originated on this Discourse as a strawman.

3 Likes

FWIW, I only came to hear about that here.

The exact phrase used at HN was:

For a simple example of how much of a difference this can make, look at GitHub - JuliaMath/Bessels.jl: Bessel functions for real arguments and orders which is often 10x faster than AMOS (the old reliable Fortran code).

My understanding is that they compared to the AMOS library, not Fortran in general.

Often when you write a new implementation, it’s quite easy to get better performance, especially if you are willing to accept slightly lower accuracy or a narrower argument range. As an example, here is a 28x faster sin(x) implementation written in Fortran, compared to libm: blog/Blog-2.md · master · lfortran / math_intrinsics · GitLab

The speedup on my computer was real; we investigated it very carefully. But it had a narrower argument range and lower accuracy (although still 10^{-15} relative accuracy, which might be enough for many applications). So the devil is in the details, as @FortranFan, @kargl and others said above.
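
To illustrate the kind of trade-off involved, here is a toy sketch of mine (not the minimax kernel from the blog post above): if you accept a narrow argument range and reduced accuracy, even a truncated Taylor series collapses sin(x) into one short multiply-add chain.

elemental function fast_sin(x) result(s)
   ! Toy kernel: truncated Taylor series for sin(x). Valid only for
   ! |x| <= pi/2, with absolute error up to roughly 4e-6 near the ends of
   ! that range (far better near zero). Range reduction, NaN/Inf handling,
   ! and a proper minimax fit are all omitted on purpose.
   integer, parameter :: dp = kind(1.0d0)
   real(dp), intent(in) :: x
   real(dp) :: s, x2
   x2 = x*x
   ! Horner form of x*(1 - x^2/3! + x^4/5! - x^6/7! + x^8/9!)
   s = x*(1.0_dp + x2*(-1.0_dp/6.0_dp + x2*(1.0_dp/120.0_dp &
       + x2*(-1.0_dp/5040.0_dp + x2/362880.0_dp))))
end function fast_sin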

8 Likes

Speaking of individuals making unscrupulous claims with little or no basis in reality, where do these conspiracy theories even come from? What on earth makes you think that JuliaComputing is paying people to shill on HackerNews threads?

The reality is that under-informed people are everywhere, and it is human nature for one to run their mouth on subjects they don’t fully understand.

2 Likes

That first quote is not equivalent to the second quote. The first can be true (and is, as far as I know), but the second can be false (I would have no idea one way or the other).

1 Like

I appreciate the thought and effort that everyone has put into providing feedback and running their own benchmarks. Overall, the majority of interactions have been overwhelmingly positive, and I did find several aspects of the Fortran language that I liked. I think we all benefit from diverse advancements in different programming languages and algorithm designs.

At this point, I’m not sure where the discussion is going. There was a lot of skepticism about the benchmarks, which I think we have now verified; we then tested for accuracy, and the cycle repeats. I would be very happy to answer any questions about improved algorithms for these functions that would in any way help the Fortran community. No one here doubts that Fortran and Julia can deliver similar performance if the algorithms are the same.

I’ve found all the comments about somehow making false claims and being part of some marketing machine pretty offensive. Though, after looking at other threads (Language Marketing for Julia compared to Fortran - #59 by logankilpatrick), this is a very common theme, to which the community has given thoughtful replies. I’m really not sure how the Julia language benefits more from commercial interests than Fortran has over the decades and continues to now… Both languages suffer from difficulties in convincing people to use them, and both market their advantages. I don’t think this is particularly an issue, because if you have something cool you should want to tell people about it.

However, this really hasn’t been a positive experience at all. I say this with as much sincerity as I can, but the likelihood of me using Fortran again is very low. This discussion could have gone in more positive directions about sharing fun ideas and learning from each other. After all, I only started programming more after my experimental work was shut down due to the pandemic. So we might reflect on how some random grad student who started programming at 24 was able to write some code in Julia to outperform standard C/Fortran libraries. Now, of course, absolutely none of that could have happened without @oscardssmith immediately jumping on and contributing and teaching me the language because I posted some random question one day. And that goes for the greater Julia community, who have constantly answered my dumb questions over the past two years and continue to do so. I deeply thank them for that. So I do think it is important to think about how that happened. I would probably say that both the Julia community and language facilitated it, and I think the Fortran community could also continue to grow because of it (though I’m unsure after this thread). After all, I am not the first person who was able to implement special functions faster (Julia vs Fortran complaint - #7 by stevengj - General Usage - Julia Programming Language). It does seem like a particular advantage of Julia. But I also think it demonstrates that rewriting old code has many advantages.

Now of course there are limitations that I would love to see addressed. I am currently working on linking a lot of research code that I wrote in Julia so other people working in my research field (biomedical sciences) can use it in Python. I am hoping for a lot more synergy between Julia and Python in the future, because even if you have very fancy algorithms to efficiently compute complicated math problems, the real challenge is still accessibility and integration across languages and systems. This is something I think Fortran obviously does better than Julia. I would honestly like to see Bessels.jl become the test subject for the new static compilation work. Having the ability to distribute small binaries to other users in Python would be a very good thing for the language. After all, Bessels.jl appears to be a very good candidate for this with its strong types, no external dependencies, no allocations (no GC), and type stability. I believe (or rather hope) this is the next step for the language.

8 Likes

Hi @heltonmc, I think everyone appreciates your positive tone in the discussion.
I resisted the temptation for several days to participate any further in this thread because I do not see where it is heading. Such discussions could become more objective if people paid more attention to a few details.

  1. Python, SciPy, or any other wrapper is not Fortran. Please clarify to the Julia community that these three are not the same if you can. Testing Julia packages against Python wrappers that may or may not call some dynamic C/Fortran libraries is not the same as benchmarking them against pure C/Fortran. Please look at the link you posted in your comment and see how the accepted answer uses MATLAB, SciPy, and Fortran effectively synonymously.
    I remember @oscardssmith’s amazement on this Discourse at the fact that Fortran compilers can indeed unroll loops. As far as I remember from his comment, someone had shown him a Python-Fortran wrapper benchmark at JuliaCon 2019 that somehow proved Fortran compilers incapable of loop unrolling (I vaguely remember the discussion; apologies in advance if this is incomplete or inaccurate).

  2. When someone from a corporation comments on a topic in this forum, they always begin by introducing themselves, mentioning their affiliations, and revealing any conflicts of interest. It would certainly help if people from the Julia Discourse and JuliaComputing followed suit in this forum, to eliminate irrelevant and unproductive comments (and sometimes accusations).

3 Likes

Hi @heltonmc,

Thanks for sharing your view. I hope that in the future we gain access to your implementations of Bessel functions and other special functions from Fortran as well, not only Python. Whether this will happen via manual translation, transpilation, or directly linking with a Julia static library remains an open question.

The example Steven Johnson chose is a bit unfortunate/misinformed. The claim he made in his JuliaCon 2019 talk was that his pure Julia erfinv is 2-3× faster than the implementation in the Cephes library (presumably called through Python/SciPy):

  • Cephes is a library implemented in C, not Fortran as he claims in the video. The entry point is scipy/special/erfinv.c, which calls the function in ndtri.c. The original source code on Netlib (cephes) is all C.

  • A Fortran implementation of erfinv is available in the NSWC Mathematical library. It uses the same rational function approximations from Blair, which are used in SpecialFunctions.jl.

  • For the NSWC function ERFI the polynomials are unrolled manually; however, provided you use the right compilation flags, the unrolling can also be done automatically. Here’s the output from the NSWC routine in Compiler Explorer. Interestingly, their real-precision coefficients are closer to the tables used for Float64. Who knows what kind of precision/floating-point representation was available on the hardware at the Naval Surface Warfare Center.

  • If you really don’t trust the compiler to do the unrolling (perhaps because it’s a new compiler, or the optimizer for a particular hardware platform is less mature), implementing a Horner macro can be done with the fypp preprocessor or some other scripting language. This is of course less elegant than a built-in metaprogramming solution. Below is an implementation of erfinv I wrote; the code looks almost identical to the Julia version.

  • Unlike what Steven Johnson claims in the post you linked above at the Julia Discourse:

    In principle, the Fortran authors could have done the same inlining and gotten similar performance, but the code would have been much more painful to write by hand. (They could even have written a program to generate Fortran code, but that is even more painful.)

    the fypp @horner macro is only 15 lines of code and wasn’t difficult to write (a rough sketch follows below). Would I have done it had I not seen Steven’s talk? Probably not.
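
As a rough sketch of what such a macro can look like (simplified: the precision is passed positionally here rather than as a keyword, and the coefficients are listed lowest degree first; my actual version differs in the details):

#! Horner-expansion macro sketch for fypp. The coefficients arrive as plain
#! text; the macro folds them into a single nested multiply-add expression,
#! attaching the kind suffix to every literal.
#:def horner(res, x, prec, *coeffs)
#:set expr = coeffs[-1] + '_' + prec
#:for c in coeffs[-2::-1]
#:set expr = '(' + expr + ')*(' + x + ') + ' + c + '_' + prec
#:endfor
${res}$ = ${expr}$
#:enddef

And the erfinv implementation mentioned above: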


elemental function erfinv(x) result(res)
    use, intrinsic:: ieee_arithmetic, only: ieee_value, &
      ieee_positive_inf, ieee_negative_inf, ieee_quiet_nan
    real(dp), intent(in) :: x
    real(dp) :: a, t, res, p, q

    a = abs(x)
    if (a >= 1.0_dp) then
      if (x == 1.0_dp) then
        res = ieee_value(1._dp, ieee_positive_inf)
      else if (x == -1.0_dp) then
        res = ieee_value(1._dp, ieee_negative_inf)
      else
        ! domain error
        res = ieee_value(1._dp,ieee_quiet_nan)
      end if
    else if (a <= 0.75_dp) then ! Table 17 in Blair et al.
        t = x*x - 0.5625_dp
        @:horner(p,t,0.160304955844066229311e2,&
                      -0.90784959262960326650e2,&
                       0.18644914861620987391e3,&
                      -0.16900142734642382420e3,&
                       0.6545466284794487048e2,&
                      -0.864213011587247794e1,&
                       0.1760587821390590,prec=dp)
        @:horner(q,t,0.147806470715138316110e2,&
                    -0.91374167024260313936e2,&
                     0.21015790486205317714e3,&
                    -0.22210254121855132366e3,&
                     0.10760453916055123830e3,&
                    -0.206010730328265443e2,&
                     0.1e1,prec=dp)
        res = x * p / q

    else if (a <= 0.9375_dp) then ! Table 37 in Blair et al.
        t = x*x - 0.87890625_dp
        @:horner(p,t,-0.152389263440726128e-1, &
                      0.3444556924136125216, &
                     -0.29344398672542478687e1, &
                      0.11763505705217827302e2, &
                     -0.22655292823101104193e2, &
                      0.19121334396580330163e2, &
                     -0.5478927619598318769e1, &
                      0.237516689024448,prec=dp)
        @:horner(q,t,-0.108465169602059954e-1, & 
                      0.2610628885843078511, &
                     -0.24068318104393757995e1, &
                      0.10695129973387014469e2, &
                     -0.23716715521596581025e2, &
                      0.24640158943917284883e2, &
                     -0.10014376349783070835e2, &
                      0.1e1,prec=dp)
        res = x * p/q
    else ! Table 57 in Blair et al.
      t = 1.0_dp / sqrt(-log(1.0_dp - a))

        @:horner(p,t,0.10501311523733438116e-3, &
                    0.1053261131423333816425e-1, &
                    0.26987802736243283544516, &
                    0.23268695788919690806414e1, &
                    0.71678547949107996810001e1, &
                    0.85475611822167827825185e1, &
                    0.68738088073543839802913e1, &
                    0.3627002483095870893002e1, &
                    0.886062739296515468149,prec=dp) 
        @:horner(q,t,0.10501266687030337690e-3, &
                    0.1053286230093332753111e-1, &
                    0.27019862373751554845553, &
                    0.23501436397970253259123e1, &
                    0.76078028785801277064351e1, &
                    0.111815861040569078273451e2, &
                    0.119487879184353966678438e2, &
                    0.81922409747269907893913e1, &
                    0.4099387907636801536145e1, &
                    0.1e1,prec=dp)

        res = p / (sign(t,x) * q)
        
    end if
end function

That’s no problem at all. If you’re proficient in Julia, just use it to drive change and solve the mathematical/technical/societal problems you face. There’s no need to be apologetic. As Jeff Atwood says:

Who cares what technology you use, as long as it works, and both you and your users are happy with it?

I’m glad you see it that way. I feel like Julia is at a bit of a crossroads in terms of interoperability and usage. Is the goal to have everyone writing Julia? If Julia is compiled statically, will it use C-like interfaces, or something more advanced? Is the plan for Julia to interface easily with other languages, or will it be a walled garden exclusive to Julia developers?

8 Likes

@heltonmc I apologize for some of the remarks in this thread. Everybody, please let’s not make remarks about JuliaComputing. If you want to make such remarks, you can do so on their forum. This forum is about Fortran.

6 Likes

13 posts were split to a new topic: Using Julia libraries in Fortran with StaticCompiler.jl

Thank you for putting that together; the Compiler Explorer link was incredibly helpful. It is interesting to see the difference in what the two compilers output there and how that compares to Julia. The instructions look similar between the two compilers, but the way the rational function is evaluated appears to differ (please correct me if I’m wrong). ifx 2022.1.0 evaluates the first polynomial and then moves on to the second, whereas gfortran 7.3 evaluates the unrolled version of both, going back and forth between terms of the two polynomials.

Both compilers emit vfmadd213ss instructions; they are just ordered a bit differently. I would imagine the gfortran version to be slightly better? But that should be tested. The only difference in the instructions from the Julia code is that it gives vfmadd213pd (on my computer, so yours might differ). It should also be mentioned that Julia has moved away from the macro version; the evalpoly that Bessels.jl uses is a function, where we rely on the compiler to do the unrolling. But I think there are two things that could explain being 2-3x faster than the Cephes library, in addition to the overhead of the function call.

Of course, unrolling it like you have in the Fortran code is the first thing that will bring that difference down. The loopy version in the Cephes library is clearly not optimal. But the second thing might be even more costly for the Cephes library, as the rational function is actually evaluated by two separate function calls. Horner’s scheme is pretty latency-bound (you have to wait for one multiply-add to finish before starting the next), but the numerator and denominator of a rational function can be evaluated simultaneously. That is, P(x)/Q(x) should be able to exploit instruction-level parallelism and SIMD fairly well, reducing the time by roughly half (if the lengths of P and Q are similar).
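
As a sketch of that idea in Fortran (hypothetical code, not taken from Cephes or Bessels.jl): interleaving the two Horner recurrences keeps the chains free of mutual data dependencies, so the CPU can overlap their multiply-adds.

pure function rational_eval(x, p, q) result(r)
   ! Evaluate P(x)/Q(x) with the two Horner chains interleaved. The pacc and
   ! qacc updates are independent of each other, so their fused multiply-adds
   ! can execute in parallel. Coefficients are ordered highest degree first;
   ! for simplicity this assumes size(p) == size(q).
   integer, parameter :: dp = kind(1.0d0)
   real(dp), intent(in) :: x, p(:), q(:)
   real(dp) :: r, pacc, qacc
   integer :: i
   pacc = p(1)
   qacc = q(1)
   do i = 2, size(p)
      pacc = pacc*x + p(i)
      qacc = qacc*x + q(i)
   end do
   r = pacc/qacc
end function rational_eval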

In Julia, the erfinv function has this benchmark…

julia> @btime erfinv($(Ref(0.9))[])
  6.035 ns (0 allocations: 0 bytes)

Now, I’m going to completely replace Q(x) with just a division by x, so I’m benchmarking a single polynomial evaluation (plus the division)…

julia> @btime erfinv($(Ref(0.9))[])
  5.369 ns (0 allocations: 0 bytes)

So this is of course faster, but not by as much as you might think, considering I removed an entire polynomial of degree 10… If we benchmark just a single polynomial evaluation of degree 10…

julia> @btime evalpoly($(Ref(0.9))[], $(ntuple(n -> (-1)^n / n, 10)))
  3.084 ns (0 allocations: 0 bytes)

So yes, it looks like the Julia compiler is unrolling the tuple but also recognizing that P and Q can be computed at the same time. The difference between the two versions can probably be explained by the lengths of P and Q not being equal, so the evaluations cannot overlap completely and there are some leftover instructions. Optimal code will take advantage of that, which I doubt the Cephes library does. So I think all of those things, along with the calling overhead, could explain the difference. I have no doubt that we could slightly adjust its routine to give the same performance.

Bessels.jl tries to exploit these kinds of SIMD opportunities in all of its functions. We don’t have a lot of rational approximations, but there are times when we need to evaluate polynomials of large degree, and we can exploit some of the same principles to dramatically reduce the computation time. Instead of evaluating two polynomials separately, we can use a second-order Horner scheme that splits the polynomial into even and odd terms and accumulates them independently. Here is a Julia version of that…

@inline function even_odd_poly_add(x, P)
    N = length(P)
    xx = x*x

    out = P[end]
    out2 = P[end-1]

    for i in N-2:-2:2
        out = muladd(xx, out, P[i])
        out2 = muladd(xx, out2, P[i-1])
    end
    if iszero(rem(N, 2)) 
        return muladd(out, x, out2) 
    else 
        out = muladd(xx, out, P[1]) 
        return muladd(x, out2, out) 
    end
end

So if we benchmark these for different polynomial degrees…

# N = 5
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 5)))
  2.113 ns (0 allocations: 0 bytes)
-0.8627199999999999

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 5)))
  2.346 ns (0 allocations: 0 bytes)
-0.8627199999999999

# N = 10
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 10)))
  3.079 ns (0 allocations: 0 bytes)
-0.38845094765714294

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 10)))
  2.967 ns (0 allocations: 0 bytes)
-0.38845094765714316

# N = 50
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 50)))
  23.274 ns (0 allocations: 0 bytes)
81.32088598046435

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 50)))
  10.625 ns (0 allocations: 0 bytes)
81.32088598046445

# N = 150
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 150)))
  89.602 ns (0 allocations: 0 bytes)
2.2769534597705355e9

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 150)))
  41.843 ns (0 allocations: 0 bytes)
2.2769534597705407e9

So, as we would expect, for smaller polynomials evalpoly is slightly faster, but there is a transition point at around N = 10 where the even/odd scheme gains a clear advantage, reaching the 2x speedup for higher orders N > 50 (of course, we could also use four accumulators). One might also notice the slight difference in accuracy. The accuracy will depend on the coefficients (so that needs to be tested), but for this example the even/odd scheme is actually slightly more accurate. (I also tested the error in the erfinv function, and using the even/odd scheme actually decreased the max ULP error (~1.9 → ~1.7).)

julia> even_odd_poly_add(big"1.2", ntuple(n -> (-1)^n / n, 50))
81.32088598046447810940231091273091124409909302016418750000000000000000000001763

Anyway, that is a fairly long-winded answer to say that compilers are very smart… We should set up our code to exploit these aspects as much as possible (though compilers don’t always auto-vectorize when they should, so we should verify that they do) and let the compiler do its job. So even though the algorithms appear to be close transcriptions of each other, it still isn’t really an apples-to-apples comparison between languages. Though how easy it is to get the compiler to do this is, I think, an important consideration. Please correct anything I may have said in error and I will fix it.

This is incredible. It’s awesome to see this work make so much progress over the past year; thank you for all the hard work on that project. This (IMO :slight_smile: ) is such a cool application and demonstration of this work!

These are all excellent questions that I don’t feel knowledgeable enough to answer (speaking as someone who has not made a single contribution to the language :man_shrugging:). I personally think having your code be easy to use from any language is an advantage. I think works like GitHub - JuliaPy/PythonCall.jl: Python and Julia in harmony. and GitHub - tshort/StaticCompiler.jl: Compiles Julia code to a standalone library (experimental) are awesome! I don’t really try to convince Python users to switch to Julia at all (maybe if you are using Matlab :wink:). Honestly, it would be kind of nice if the Julia code could replace some of the math functions in SciPy that call the Cephes library. I think that would definitely be beneficial to both ecosystems… I apologize for the long-winded answer. Again, please correct or add to anything mistaken and I will update.

3 Likes

My feeling has been that interoperability has received somewhat one-sided attention (calling other languages from Julia is significantly more developed). But I do not think this is done with any malice. Rather, it’s just that Julians are naturally interested in calling widespread external libraries (mostly Python, C++, or Python-wrapped C++, I’d say) and so devote a lot of development time to that functionality. The reverse is more of a chore, and a thankless one, mostly because of the time-to-first-something problem. No one wants to start a Julia session inside their exe/script.

So, especially with this in mind, I find this

a very, very, very welcome surprise. I already knew about StaticCompiler.jl and had heard a lot about its limitations. It’s very nice to see it do so well here; this is an important milestone (in my view at least). I’m glad we are already here, even if only for simple, single objects. Then I have a question about deployment: here you show how to compile from an open REPL; how would it play within, say, a CMake build system? I mean, I would prefer to build the Julia object together with all the rest of the Fortran library/application that calls it. (Sorry if the question is silly; I realize that it could well be. I’m just very curious about what seems a very nice possibility.) If it’s possible within a CMake build, we should be able at some point to do it within fpm (the Fortran package manager) too, I guess.

Anyway, I’m installing Julia right now on my HPC account, with the goal of building your example and profiling it (vs a call to the intrinsic), all within Fortran. That way we could see whether there is really some imbalance due to ‘who calls who’, and if there is, quantify it.


Well, if you take offense at debatable comments / unpleasant jokes about Julia, maybe don’t do the same to other languages, right? :slight_smile:

More seriously, with both being open source and having a similar “target”, I actually think Pythonistas are an easier catch than people used to MATLAB: there, everything is done under the hood by MathWorks engineers; you have no explicit control over many things and just benefit from the trade-offs and optimizations made by that benevolent dictator. Within Julia you have a lot of manual switches, be it @inline, @inbounds, Threads.@threads or whatever. That bears much more resemblance to the experience of the average Pythonista, fighting (or playing with joy) with numba or similar ways to overcome the performance problems you hit when you are not just calling into numpy/scipy, than to the ‘know nothing, run it like a black box’ experience that MATLAB gives. A MATLAB user normally knows nothing about performance tricks beyond what MathWorks documents (i.e., cryptic statements about vectorization, avoiding loops, etc.), so he or she would feel totally lost and uninterested in dealing with all of Julia’s details and complexities. That’s at least what comes to my mind, as someone who has coded a lot in MATLAB and is now progressively transitioning to modern Fortran.

3 Likes

Maybe now I’ll take it too far, but somehow I feel the Fortran experience is also, in some respects, similar to the MATLAB deal: Fortran compilers are normally very heavily optimized, and in some cases they make quite opinionated choices to enhance performance for you (I’m thinking of the Intel compiler, which as far as I’m aware enables fast math by default), and very often you just trust the compiler to optimize things for you: no need to explicitly request inlining (it happens in many, many relevant cases), other things are switched on with global compiler flags (e.g. OpenMP parallelism, acting on both do concurrent and regular do constructs if the flag is given and the compiler thinks it’s worth it), loops are unrolled, and so forth(ran, pun intended).