Fortran port of the Bessels.jl library

I think using single precision, or even half precision when it becomes available in compilers, will be useful even in combustion. But those low-precision values will need to be used at just the right places and times in order to speed up the calculations without losing final accuracy.

1 Like

Strictly considering the standard with the general case of real literal constants, the answer is no as others have pointed out.

In the case of the specific constant of zero you show, you will be hard-pressed to find a processor that does not take the approximation of 0.0 with your kind BK as anything but 0.0_bk.

The standard leaves it as processor-dependent the approximation that comes in the intrinsic assignment such as x = 0.0, where x is of real type with kind BK that say is not the default real, and which is then equivalent to x = real( 0.0, kind=BK )

1 Like

Readers may be want to take a look at this suggestion at the J3 Fortran proposals site to alleviate the problem currently with default real which, colloquially stating, is single-precision:

Take a look also at the other thread at this site on NAMESPACEs.

Notice a comment I just added there with a thought I long had which is to employ @certik’s NAMESPACE concept further to give more control to the users to specify “package” wide (aka namespace wide) defaults for various things:

With respect to the original post here toward the proposal for a “proper” NAMESPACE facility in Fortran and given also discussions on this forum (#78) and requests elsewhere regarding the default kind of constants (and perhaps) intrinsics, the thought also is the NAMESPACE to be able to define such defaults and to have all the enclosing modules to “inherit” the defaults:

namespace n
    default real ( kind=selected_real_kind( p=12 ) )
    ..
    module m
    ..
end namespace
namespace(n) module m
    ! the default kind of real literal constants shall be the one defined in the enclosing module which has a precision of 12
    ..
end module
3 Likes

I would add that a very straightforward way to fix this would be to slightly extend the implicit keyword… with something like

implicit double precision(a-c,e-z,constants) 

i.e. that it applies to all real constants or whatever else keyword)

  • we all gotta type that anyways all the time
  • fully backward compatible
  • no need to add even more syntax (namespaces, selected kinds, etc)
  • all compilers mangle that already

Since we can’t get rid of implicit none, why not double down on that and make it better!

In the spirit of implied save and implicit none, why restrict it to real constants? There have been numerous instances where new Fortran programmers are caught out by integer(4) vs integer(8) (yes, I know 4 and 8 are non-standard). Let’s move implicit kind to the obsolescent list with a new implicit constant statement. The syntax can be

   implicit constant :: real(kind(1.0))
   implicit constant :: integer(selected_int_kind(20))
   implicit constant :: logical(kind(.true.))
   implicit constant :: character(selected_char_kind('iso_10646'))

If a user does not use an implicit constant statement, then enforce an interpretation that constants are those with maximum storage unit. This certainly cannot cause problems with backwards compatibility!

1 Like

Exactly! whatever the syntax, what we all want is the ability to control what’s the default (implicit, fortranly speaking) kind of all otherwise unspecified constants in the program unit.

I don’t get the point, here. Do you mean that the standard deviation of the errors of the Fortran library is lower, but the distribution of the errors have longer tails ? I’m asking because from the histograms I would say the standard deviation of the Julia library is actually the lowest one… But the total number of counts doesn’t seem to be the same between the txo histograms, so there’s something unclear to me here (I’m not even sure what is exactly “eps” here)…

EDIT: OK I missed it was a log vertical scale. Would be good to have the normal scale as well, and the calculated standard deviations of both histograms. I also notice that the Julia distribution is biased toward positive errors (but again to which point is difficult to evaluate on a log scale). The extracted mean values would be useful (and the standard deviations calculated for both an assumed mean at zero and the calculated mean).

1 Like

Let me bring this discussion back to share that @ivanpribec has replaced the dot_product in the polynomial evaluation function with a Horner loop, and the new implementation is now almost twice as fast as the intrinsic function! This is an unexpected plot twist, excellent work Ivan! Here are the results on my M1 Mac:

federico@Federicos-MacBook-Pro fortran-bessels % ./bessels_test.exe 
INTRINSIC time used:   46.9030 ns/eval, sum(z)=-88.451782122026188
PACKAGE   time used:   26.3650 ns/eval, sum(z)=-88.451782121434462
[bessels] 3 test completed: 3 passed, 0 failed.
STOP 0

For the records, I’ve just finished the implementation of besselj1(x) and it also look promising, with approximately ~2x speed-up on the M1:

[bessel_j0] INTRINSIC time used:   45.9751 ns/eval, sum(z)=99.706555563648948
[bessel_j0] PACKAGE   time used:   26.3218 ns/eval, sum(z)=99.706555564603875
[bessel_j1] INTRINSIC time used:   45.6402 ns/eval, sum(z)=48.275497720728119
[bessel_j1] PACKAGE   time used:   26.2781 ns/eval, sum(z)=48.275497722514878
4 Likes

I think I’ve gotten to the point people were talking about when a ~100x speed-up achievement was brought up. With my latest implementation of I0(x), I1(x), K0(x), K1(x), where a Fortran standard version is not available, the current implementation really is up to >100x faster than a reference implementation from the netlib:

[bessel_k0] NETLIB    time used:   50.7966 ns/eval, sum(z)=152003.48979146607
[bessel_k0] PACKAGE   time used:    5.9995 ns/eval, sum(z)=152003.48979146607
[bessel_k1] NETLIB    time used:   31.8502 ns/eval, sum(z)=27011.085779998772
[bessel_k1] PACKAGE   time used:    5.9887 ns/eval, sum(z)=173714.70475950497
[bessel_i0] NETLIB    time used: 1729.0142 ns/eval, sum(z)=0.10292708849736258E+264
[bessel_i0] PACKAGE   time used:   11.0356 ns/eval, sum(z)=0.10292708849736258E+264
[bessel_i1] NETLIB    time used:  332.6255 ns/eval, sum(z)=0.10805437774341595E+48
[bessel_i1] PACKAGE   time used:   11.0384 ns/eval, sum(z)=0.10750715731528420E+48
2 Likes

Is this the latest implementation: add `besselk0`, `besselk1` · perazz/fortran-bessels@111642b · GitHub, that already seems quite optimized. To speed it up further, one can work on the argument reduction, and/or change the polynomials, possibly use just a polynomial, no division. But all the “easy” speedups seems to have been done. :slight_smile:

It is. I’ve tried to replace some Horner loops with a manually unrolled version, but that has no practical effect and, tbh, it’s probably better to leave it as a loop and let the compiler decide what to do based on the cpu flags

2 Likes

You can read these two blog posts where we nailed the “sin(x)” implementation using similar techniques that you use. On one computer it is 30x faster than the libc’s sin(x) function, on another computer only 7x faster. Still, huge speedup!

In short, you need to write the function in such a way so that the compiler can efficiently vectorize it when applied to a large vector. It’s a lot of work, you have to setup benchmarking suite, compute theoretical performance peak, reason about the code, and then try various things. See the two blog posts where we have done exactly this for sin(x) and prove that we are pretty much within a factor of 2x from the best possible speed that anyone can ever get on the given CPU I was using.

5 Likes

A very smart thing I believe most compilers do is to return both sin and cos with 1 call only when both are requested; the smart move is that if you have a polynomial expansion for the first one, a shifted query (sin(x+pi/2)) or just the polynomial n-1 derivative gives you the latter in virtually no more cpu operations

1 Like

Can somebody explain to me how the values of the histogram in this post above have been computed?
If I understand it correctly, from a sample of 1 million values, every single one of them (except maybe a few hundreds for the case of Fortran) is below 1 eps? I assume eps here is 2^(-23) / 2^(-52) for 32/64-bit precision.

I’m asking because I just did a quick check in a small sample of 5000 values and the number of computed values with an error above 1 eps is much higher than the one shown in the histogram. I tested both the intrinsic (from gfortran) and the ported library announced in this thread. So I just want to know what I am doing wrong in the error computation.

I’m not trying to criticize the implementation in itself, I am aware it is not easy to provide such a high accuracy for these types of functions; I am implementing some non-trivial functions myself and getting around 1 eps everywhere is definitely taking some time.

PS: Offtopic, but I had to truncate lines 180-196 in the constants module in order to compile. Is the line-length limit 132 characters? Or was that me with too-strict flags?

I’m porting the library to Fortran but I’m not the one who ran those tests - I clicked on the link and it goes to this post in the forum, apparently the “error” is comparing Bessels.jl against another Julia library.

IMHO this type of tests should be run against an analytical solution, but of course we’re all lazy and end up comparing against just another library. In the post it looks like it’s a quad-precision library, so probably not bad?

You’re fine, so far the Fortran part is just a port, i.e., no attempts at changing the accuracy of the original library. If you could contribute your test case either here or on the github page, it would be a nice addition.

Right, the original Julia library has insanely long lines and some parts haven’t been fully refactored yet - as you see on the page you need to run with -ffree-line-length-none with gfortran, or equivalent commands from other compilers

I clicked on the link and it goes to this post in the forum,

Oh, I saw the other post but missed the (click to expand). So according to the code, it is just (y_comp - y_exact) / eps(). That’s one of the things I tried, I still get worse errors.

The other library is Arb; it’s a multiprecision library, and from the code it seems to be set to a precision of 500. I’ve used it, I will take those values as ‘exact’. I was also comparing against an ‘eact’ value, so that should not be the problem.

so far the Fortran part is just a port, i.e., no attempts at changing the accuracy

It’s a different compiler, but I assume it should give pretty much the same accuracy as the Julia version.

… you need to run with -ffree-line-length-none with gfortran

My bad. Did not read the README.

What’s the extent of the errors you get? Like, close to 2*eps, 10*eps, 100*eps ?

What’s the extent of the errors you get? Like, close to 2*eps, 10*eps, 100*eps ?

Below 10, in the range 1 - 5 usually.

But I think I found the mistake(s) in my computations.
I will test things once again and come back when done. This may even be favorable for the functions I was implementing and couldn’t get under 1 eps.

OK thank you,
so still acceptable for a “fast-but-approximate” library. The reason I’m asking is that this started off as a side hustle, and I dropped the effort soon after I realized it wasn’t so much faster than the compiler intrinsics. However, with some optimizations on the code side, and especially for some of the less trivial functions, some speed-ups are dramatic, so I’m trying to pull more work on this and get soon to a full working port

1 Like

This is a test of the total error, not the relative error. That kind of test would be meaningless in some contexts, e.g. where some number of correct significant digits is required for small function values such as near a root, or in the asymptotic regions of the function tails.

1 Like