Comparing Fortran and Julia's Bessel function performance

Welcome to the forum! Bessels.jl looks amazing. Great job on that! As you can tell we are always suspicious of benchmarks here. :slight_smile: I can definitely help with your benchmark when I get a chance (or maybe somebody else can step in).

4 Likes

Welcome to the forum. I believe the default floating point type in Julia is 64-bit (double precision). Note that

real :: x

in Fortran declares a single precision variable for which compilers will use 32 bits. You could write

program calling_func
implicit none
integer, parameter :: wp = kind(1.0d0)
real(kind=wp) :: a,r
end program calling_func

etc. to get double precision floating point variables.

2 Likes

Thank you for the suggestion! Is there a difference between initializing with double precision :: a?
This is what I have for now. Let me know if there are any more tips or improvements. I’m getting around 42 nanoseconds for each evaluation.

program bess
    integer, parameter :: p = selected_real_kind(15, 307)
    integer, parameter :: n = 50000000
    real(p) :: startT, endT, execTime
    real(p) :: r, v(n)
    real(p) :: a
    
    call random_number(r)
    call random_number(v)

    do i=1,n
        v(i) = 200*v(i)
    enddo

    call cpu_time(startT)

    do i=1,n
        a = BESSEL_JN(1, v(i))
    enddo
    
    call cpu_time(endT)
    execTime = (endT - startT)
    print '("Time = ",f6.3," seconds.")',execTime
    print *, a
 end program bess

Your code is standard Fortran, but it is less flexible, because if you want the code to be single or quadruple precision you need to change many declarations. Also note that Fortran floating point literals are single precision by default. I don’t think your

a = 0.0

causes a problem, but a common mistake is to write

double precision, parameter :: pi = 3.14159265359 ! RHS is single precision

when one should write

double precision, parameter :: pi = 3.14159265359d0 ! d0 makes RHS double precision

The Fortran-lang tutorial has a section on Floating Point Numbers.

1 Like

Thank you for the link. I updated my code above. I quite like that syntax and we have a similar problem in Julia but the other way around (literals promote Float32 to Float64). I originally thought if I didn’t include some operation within the loop that the compiler would optimize the loop away but it looks like the benchmarks are working as hoped. It looks like the BESSEL_JN only supports integer arguments for the order. I’ll have to look at other routines as well.

2 Likes

Mmm, which compiler flags are you using? With optimized flags I also would have that fear. Look at this answer, for a fairly similar example: Why a function returns an array is much slower than a subroutine returns an array? (real MWE included) - #2 by Beliavsky (spoiler, the function and the subroutine had the same performance, just the first call did not print nor did anything with the value so the compiler skipped it altogether :).

Of course you are getting the intended behavior with the flags you are using (or with no flags maybe). But if you intend to benchmark for speed feeding proper flags is crucial.

Yes these suspicions are true with more aggressive compiler options gfortran -O3 besselj.f90 -o bess optimizes away that as expected. I’ll need a more clever benchmark for that. Simply summing up the terms and having it also return the value seems to fix that issue though. Is there an accepted flag that I should run this with? I’m hoping to avoid any fastmath operations as well. But I didn’t notice any difference between -O1, -O2, -O3.

Thanks again for all the help everyone!

1 Like

Just about any optimization level other than zero with gfortran, ifort, nvfortran brought the time to zero
with the original code. This has an overhead in that the RHS is not something that could just go in a register, but at least prevents the compiler from eliminating most of the computation …

program bess
    integer, parameter :: p = selected_real_kind(15, 307)
    integer, parameter :: n = 50000000
    real(p) :: startT, endT, execTime
    real(p) :: r, v(n)
    
    call random_number(v)
    v = 200*v

    call cpu_time(startT)
    v = BESSEL_JN(1, v)
    call cpu_time(endT)

    execTime = (endT - startT)
    print '("Time = ",f6.3," seconds.")',execTime

    call random_number(r)
    print *, v(max(1,int(r*n)))
 end program bess
2 Likes

Thank you for cleaning up that syntax. How would you feel about a benchmark that measures how fast it can sum the array? So with slight modification…

program bess
    integer, parameter :: p = selected_real_kind(15, 307)
    integer, parameter :: n = 50000000
    real(p) :: startT, endT, execTime
    real(p) :: v(n)
    real(p) :: a

    a = 0.0_p
    call random_number(v)
    v = 200*v

    call cpu_time(startT)
    do i=1,n
        a = a + BESSEL_JN(1, v(i))
    enddo
    call cpu_time(endT)

    execTime = (endT - startT) / n * 1E9
    print '("Time = ",f6.3," nanoseconds.")',execTime
    print *, a
 end program bess

I see very similar times to your code as well (about 38 nanoseconds for each iteration).

One thing to be wary of is making making sure to be careful about whether you want to allow the compiler to vectorize the computation. I doubt it is in this case, but if you want to benchmark latency of scalar evaluation, you probably want to make sure that the output of one eval is required to compute the input of the next one.

After using Fortran for a while, you would probably tend to do the sum with a function; and more significantly for order one use the specific routine bessel_j1(3) instead of the more general bessel_jn(3)
I would guess; and use some optimization switches (very processor-dependent); although it would probably not have much affect here where you are timing an intrinsic, which would be pre-compiled.
And you might as well get in the habit of adding “implicit none” :slight_smile:

g0 is a nice descriptor instead of f6.3, which might have to be changed for different values of N.

program bess
implicit none
integer, parameter :: p = selected_real_kind(15, 307)
integer, parameter :: n = 50000000
real(p) :: startT, endT, execTime
real(p) :: v(n)
real(p) :: a

    call random_number(v)
    v = 200*v

    call cpu_time(startT)
    !a = sum( BESSEL_JN(1, v) )
    a = sum( BESSEL_J1( v) )
    call cpu_time(endT)

    execTime = (endT - startT) 
    print '("Time = ",g0.3," nanoseconds per value.")',execTime/n * 1.0d9
    print '("Time = ",g0.3," total seconds.")',execTime
    print *, a
 end program bess
1 Like

While we’re discussing floating-point computation some people may like fuller details of it. I recommend google or duckduckgo with 'Goldberg floating point" which should lead you to Goldberg’s famous article on that. Years ago I wrote two Fortran programs called kinds.f90 in f95 and kinds03.f90 in f2003 to give details of all a compiler’s real and integer kinds that SELECTED_REAL_KIND and SELECTED_INT_KIND would give, but neither program includes IEEE_SELECTED_REAL_KIND. If you would like them they are downloadable in https://homepages.ecs.vuw.ac.nz/~harper/fortranstuff.shtml

3 Likes

Free access:

4 Likes

Thank you for those edits. I was actually running into the string issue for f6.3 :slight_smile:

This has been very informative. I have created a github repository with all these files and some preliminary benchmarks (GitHub - heltonmc/bessel-benchmarks: Benchmarking Bessel Function routines). I’ve already learned some very interesting things so I really appreciate this discussion. I also created an additional section for a sorted array as (at least in my research) we usually sum over a sorted array of roots in a Bessel series. This comparison was perhaps most interesting to me as you can really see the power of branch prediction.

Of course, the repository is very rough and hopefully over time I can continue improving it. Please add any PRs of benchmarks on your computers as they would be most welcome. I have noticed significant differences in runtimes on different machines and for different Julia versions. Hopefully, in the future I can also add comparisons to other Fortran libraries (Amos) as well as include other programming languages.

2 Likes

@heltonmc ,

Welcome to the forum and thank you for your great effort and wonderful dialogue with the special functions. Your work with Bessel functions is most impressive and it really showcases how highly talented computational mathematicians can use Julia, which is incorporating into the platform itself a tremendous amount of desired computational language facilities that will be of particularly of interest and delight to those in academia and research - to advance fundamental sciences.

Generally in the Fortran Community, you will find the folks will appreciate if the microbenchmarks are performed thoroughly and cleanly in an easily comprehensible manner across processors and platforms so that the readers can make proper apples-to-apples comparisons. Also, to get into discussions that strive to explain the algorithmic advances that yield computational benefits with special function evaluations, etc… This is as opposed to mere marketing materials that instead try to make blanket statements around language superiority (Julia 10x better than xxx, etc.)

In terms of Fortran evaluations, you have likely already realized one has to generally take good care to “wire in” the needed instrumentation to attempt computing performance evaluations and would need to keep in mind several aspects:

  • There are several targeted compiler implementations toward multiple platforms including commercial ones and the performance mileage can vary significantly across implementations. gfortran is one among several and thus gfortran \neq Fortran.

  • Fortran has an ISO IEC standard and the use of standard-conforming code across compilers and platforms can be a preferred way to level-set the code used toward benchmarks.

  • With microbenchmarks in sequential computations, CPU_TIME in the Fortran standard may not always provide the required resolution and SYSTEM_CLOCK can be more revealing.

  • You will know well as the data sizes used toward function evaluations increase, chances are high the system times can be dictated by data memory access and read/write. With Fortran processors, there can be large variances in this the memory data aspect across different compilers resulting in different performance behavior that overwhelm any of the computational aspects around function evaluations.

  • Simple-minded considerations of stats with a certain number of trial evaluations in the same run with the exclusion of the best and worst performance toward the determination of the “mean” helps with a compiler-based program execution of test cases.

With the above and a few points in mind, here is an edited version of your code you may want to review:

Click to see code
   integer, parameter :: Float32 = selected_real_kind( p=6 )
   integer, parameter :: Float64 = selected_real_kind( p=12 )
   integer, parameter :: WP = Float64

   ! Program data
   real(WP), parameter :: ONE = 1.0_wp
   integer, parameter :: N = 1000000
   real(WP), allocatable :: v(:), r(:)
   integer, parameter :: NTRIALS = 12
   real(Float64) :: Calc_Times(NTRIALS)
   real(Float64) :: Mean_Time
   real(WP) :: rsum(NTRIALS)
   real(WP) :: stddev

   allocate( v(N), r(N) )

   call random_number( v )
   v = v * 200.0_wp
   ! Set first element to a reference value for a numerical check
   v(1) = ONE

   call calc( Calc_Times, rsum )

   Mean_Time = ( sum(Calc_Times) - minval(Calc_Times, dim=1) - maxval(Calc_Times, dim=1) ) / (NTRIALS-2)

   print *, "BESSEL_J1 Evaluation: Mean compute time for ", N, " evaluations:"
   print *, Mean_Time, " seconds."
   stddev = norm2( rsum - sum(rsum)/NTRIALS )
   print *, "Standard deviation in calculated results: ", stddev

contains

   impure elemental subroutine calc( t, sumr )

      ! Argument list
      real(Float64), intent(inout) :: t
      real(WP), intent(inout) :: sumr

      ! Local variables
      real(Float64) :: t0
      real(Float64) :: t1

      call cpu_t( t0)

      r = bessel_j1( v )

      call cpu_t( t1 )

      t = t1 - t0

      ! Check calculation result
      if ( abs(r(1) - 0.44_wp) > 1E-3_wp ) then
         print *, "BESSEL_J1( x=1 ): ", r(1), " vs an expected value of approximately ", 0.44_wp
         error stop "Unexpected result with x = 1.0: " 
      end if
      
      sumr = sum( r ) 

   end subroutine
 
   subroutine cpu_t( time )

      use, intrinsic :: iso_fortran_env, only : I8 => int64

      !.. Argument list
      real(Float64), intent(inout) :: time

      !.. Local variables
      integer(I8) :: tick
      integer(I8) :: rate
      time = 0.0_Float64

      call system_clock (tick, rate)

      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )

      return

   end subroutine

end

C:\temp>gfortran -O3 p.f90 -o p.exe

C:\temp>p.exe
BESSEL_J1 Evaluation: Mean compute time for 1000000 evaluations:
4.9179059994639826E-002 seconds.
Standard deviation in calculated results: 0.0000000000000000

2 Likes

Just a note that nvfortran, gfortran, and ifort produced very similar timings with the simpler test. That is a little surprising as each vendor is free to implement the intrinsics differently. With some of the intrinsics some compilers concentrate on precision, others on speed or error detection, etc …

The netlib bessel functions support fractional orders, and some like the Sandia Mathematical Program Library routines were about 15 to 20 percent faster than the intrinsics when compiled with high optimization last I timed them; but considerably slower than the intrinsics when compiled with common debug flags; for example ( I used the Sandia procedures for a long time, before the intrinsics were available on the majority of common compilers).

2 Likes

@heltonmc ,

To add to above, from a simple-minded Fortran practitioner perspective, especially to clear away all the fog which comes about due to Julia benchmark tools and the various macro add-ons that would appear to do a lot of stuff “under the hood” which is not at all obvious to an unfamiliar reader of Julia commands, a straightforward computation somewhat similar to the Fortran coding approach shown above can be as follows:

using Bessels

function fn!(v::Array{Float64}, r::Array{Float64})
    for i = 1:length(v)
        r[i] = Bessels.besselj1(v[i])
    end
    return nothing
end

v = rand(1000000)*200.0
r = zeros(1000000)
@time x=fn!(v,r)

With above on the same platform where the gfortran-compiled program execution result I had provided previously, Julia yields:

C:\temp>julia b1.jl
  0.086703 seconds (13.91 k allocations: 777.971 KiB, 63.65% compilation time)

So you will notice the time of 0.086703 seconds reported by Julia. I don’t use Julia so I’m unsure how to interpret this timing value. It is about 1.7x greater than gfortran for 1,000,000 evaluations of the Bessel function of the 1st kind, order 1.

Does the Julia time include compilation and allocation even though the @time macro is applied on the function evaluation?

1 Like

Yes, if you scroll a bit to the right on the output you showed, you’ll see that it reported

63.65% compilation time

suggesting that the run-time of your benchmark was 0.02861 seconds.

It’s also worth noting that julia’s array indexing and setting is safe by default, so each iteration of your benchmark loop will always include at least one bounds check for r (the compiler will probably, but not always, figure out that the v indexing is inbounds and elide the check). You can annotate the inside of your loop body with the @inbounds macro to eliminate this.

1 Like

There are a couple things I would be wary of (you can see how I set up the Julia benchmarks at (bessel-benchmarks/besselj.jl at main · heltonmc/bessel-benchmarks · GitHub). v and r are global variables and as Mason said it is not making any inference about the length of r and v so a lot of time is spent simply checking out of bounds access and assignment. Here is an implementation that matches your Fortran code…

function bessel(f; N=50000000, Order=false)
    v = rand(N)*200
    r = similar(v)
    if Order
        v = sort(v)
    end
    tstart = time()
    @inbounds for ind in eachindex(v)
        r[ind] = f(v[ind])
    end
    tend = time()
    t = (tend-tstart) / N * 1e9
    return t
end

julia> bessel(besselj1)
29.67360019683838

julia> bessel(besselj1)
29.674997329711914

Which is about 29.7 ns per evaluation (which is what I report at GitHub - heltonmc/bessel-benchmarks: Benchmarking Bessel Function routines). The code per evaluation that you reported is about 49 ns per eval (which is also similar to what I got). If you wanted to enable bounds checks you can remove the inbounds. My timings removing that statement are …

julia> bessel(besselj1)
31.076979637145996

julia> bessel(besselj1)
30.705680847167972

julia> bessel(besselj1)
30.715522766113285
2 Likes

They’re not globals inside the function, they’re just regular function arguments which is fine.

1 Like