Welcome to the forum! Bessels.jl looks amazing. Great job on that! As you can tell we are always suspicious of benchmarks here. I can definitely help with your benchmark when I get a chance (or maybe somebody else can step in).
Welcome to the forum. I believe the default floating point type in Julia is 64-bit (double precision). Note that
real :: x
in Fortran declares a single precision variable for which compilers will use 32 bits. You could write
program calling_func
implicit none
integer, parameter :: wp = kind(1.0d0)
real(kind=wp) :: a,r
end program calling_func
etc. to get double precision floating point variables.
Thank you for the suggestion! Is there a difference between initializing with double precision :: a
?
This is what I have for now. Let me know if there are any more tips or improvements. I’m getting around 42 nanoseconds for each evaluation.
program bess
integer, parameter :: p = selected_real_kind(15, 307)
integer, parameter :: n = 50000000
real(p) :: startT, endT, execTime
real(p) :: r, v(n)
real(p) :: a
call random_number(r)
call random_number(v)
do i=1,n
v(i) = 200*v(i)
enddo
call cpu_time(startT)
do i=1,n
a = BESSEL_JN(1, v(i))
enddo
call cpu_time(endT)
execTime = (endT - startT)
print '("Time = ",f6.3," seconds.")',execTime
print *, a
end program bess
Your code is standard Fortran, but it is less flexible, because if you want the code to be single or quadruple precision you need to change many declarations. Also note that Fortran floating point literals are single precision by default. I don’t think your
a = 0.0
causes a problem, but a common mistake is to write
double precision, parameter :: pi = 3.14159265359 ! RHS is single precision
when one should write
double precision, parameter :: pi = 3.14159265359d0 ! d0 makes RHS double precision
The Fortran-lang tutorial has a section on Floating Point Numbers.
Thank you for the link. I updated my code above. I quite like that syntax and we have a similar problem in Julia but the other way around (literals promote Float32 to Float64). I originally thought if I didn’t include some operation within the loop that the compiler would optimize the loop away but it looks like the benchmarks are working as hoped. It looks like the BESSEL_JN
only supports integer arguments for the order. I’ll have to look at other routines as well.
Mmm, which compiler flags are you using? With optimized flags I also would have that fear. Look at this answer, for a fairly similar example: Why a function returns an array is much slower than a subroutine returns an array? (real MWE included) - #2 by Beliavsky (spoiler, the function and the subroutine had the same performance, just the first call did not print nor did anything with the value so the compiler skipped it altogether :).
Of course you are getting the intended behavior with the flags you are using (or with no flags maybe). But if you intend to benchmark for speed feeding proper flags is crucial.
Yes these suspicions are true with more aggressive compiler options gfortran -O3 besselj.f90 -o bess
optimizes away that as expected. I’ll need a more clever benchmark for that. Simply summing up the terms and having it also return the value seems to fix that issue though. Is there an accepted flag that I should run this with? I’m hoping to avoid any fastmath operations as well. But I didn’t notice any difference between -O1
, -O2
, -O3
.
Thanks again for all the help everyone!
Just about any optimization level other than zero with gfortran, ifort, nvfortran brought the time to zero
with the original code. This has an overhead in that the RHS is not something that could just go in a register, but at least prevents the compiler from eliminating most of the computation …
program bess
integer, parameter :: p = selected_real_kind(15, 307)
integer, parameter :: n = 50000000
real(p) :: startT, endT, execTime
real(p) :: r, v(n)
call random_number(v)
v = 200*v
call cpu_time(startT)
v = BESSEL_JN(1, v)
call cpu_time(endT)
execTime = (endT - startT)
print '("Time = ",f6.3," seconds.")',execTime
call random_number(r)
print *, v(max(1,int(r*n)))
end program bess
Thank you for cleaning up that syntax. How would you feel about a benchmark that measures how fast it can sum the array? So with slight modification…
program bess
integer, parameter :: p = selected_real_kind(15, 307)
integer, parameter :: n = 50000000
real(p) :: startT, endT, execTime
real(p) :: v(n)
real(p) :: a
a = 0.0_p
call random_number(v)
v = 200*v
call cpu_time(startT)
do i=1,n
a = a + BESSEL_JN(1, v(i))
enddo
call cpu_time(endT)
execTime = (endT - startT) / n * 1E9
print '("Time = ",f6.3," nanoseconds.")',execTime
print *, a
end program bess
I see very similar times to your code as well (about 38 nanoseconds for each iteration).
One thing to be wary of is making making sure to be careful about whether you want to allow the compiler to vectorize the computation. I doubt it is in this case, but if you want to benchmark latency of scalar evaluation, you probably want to make sure that the output of one eval is required to compute the input of the next one.
After using Fortran for a while, you would probably tend to do the sum with a function; and more significantly for order one use the specific routine bessel_j1(3) instead of the more general bessel_jn(3)
I would guess; and use some optimization switches (very processor-dependent); although it would probably not have much affect here where you are timing an intrinsic, which would be pre-compiled.
And you might as well get in the habit of adding “implicit none”
g0 is a nice descriptor instead of f6.3, which might have to be changed for different values of N.
program bess
implicit none
integer, parameter :: p = selected_real_kind(15, 307)
integer, parameter :: n = 50000000
real(p) :: startT, endT, execTime
real(p) :: v(n)
real(p) :: a
call random_number(v)
v = 200*v
call cpu_time(startT)
!a = sum( BESSEL_JN(1, v) )
a = sum( BESSEL_J1( v) )
call cpu_time(endT)
execTime = (endT - startT)
print '("Time = ",g0.3," nanoseconds per value.")',execTime/n * 1.0d9
print '("Time = ",g0.3," total seconds.")',execTime
print *, a
end program bess
While we’re discussing floating-point computation some people may like fuller details of it. I recommend google or duckduckgo with 'Goldberg floating point" which should lead you to Goldberg’s famous article on that. Years ago I wrote two Fortran programs called kinds.f90 in f95 and kinds03.f90 in f2003 to give details of all a compiler’s real and integer kinds that SELECTED_REAL_KIND and SELECTED_INT_KIND would give, but neither program includes IEEE_SELECTED_REAL_KIND. If you would like them they are downloadable in https://homepages.ecs.vuw.ac.nz/~harper/fortranstuff.shtml
Free access:
Thank you for those edits. I was actually running into the string issue for f6.3
This has been very informative. I have created a github repository with all these files and some preliminary benchmarks (GitHub - heltonmc/bessel-benchmarks: Benchmarking Bessel Function routines). I’ve already learned some very interesting things so I really appreciate this discussion. I also created an additional section for a sorted array as (at least in my research) we usually sum over a sorted array of roots in a Bessel series. This comparison was perhaps most interesting to me as you can really see the power of branch prediction.
Of course, the repository is very rough and hopefully over time I can continue improving it. Please add any PRs of benchmarks on your computers as they would be most welcome. I have noticed significant differences in runtimes on different machines and for different Julia versions. Hopefully, in the future I can also add comparisons to other Fortran libraries (Amos) as well as include other programming languages.
Welcome to the forum and thank you for your great effort and wonderful dialogue with the special functions. Your work with Bessel functions is most impressive and it really showcases how highly talented computational mathematicians can use Julia, which is incorporating into the platform itself a tremendous amount of desired computational language facilities that will be of particularly of interest and delight to those in academia and research - to advance fundamental sciences.
Generally in the Fortran Community, you will find the folks will appreciate if the microbenchmarks are performed thoroughly and cleanly in an easily comprehensible manner across processors and platforms so that the readers can make proper apples-to-apples comparisons. Also, to get into discussions that strive to explain the algorithmic advances that yield computational benefits with special function evaluations, etc… This is as opposed to mere marketing materials that instead try to make blanket statements around language superiority (Julia 10x better than xxx, etc.)
In terms of Fortran evaluations, you have likely already realized one has to generally take good care to “wire in” the needed instrumentation to attempt computing performance evaluations and would need to keep in mind several aspects:
-
There are several targeted compiler implementations toward multiple platforms including commercial ones and the performance mileage can vary significantly across implementations.
gfortran
is one among several and thus gfortran \neq Fortran. -
Fortran has an ISO IEC standard and the use of standard-conforming code across compilers and platforms can be a preferred way to level-set the code used toward benchmarks.
-
With microbenchmarks in sequential computations,
CPU_TIME
in the Fortran standard may not always provide the required resolution andSYSTEM_CLOCK
can be more revealing. -
You will know well as the data sizes used toward function evaluations increase, chances are high the system times can be dictated by data memory access and read/write. With Fortran processors, there can be large variances in this the memory data aspect across different compilers resulting in different performance behavior that overwhelm any of the computational aspects around function evaluations.
-
Simple-minded considerations of stats with a certain number of trial evaluations in the same run with the exclusion of the best and worst performance toward the determination of the “mean” helps with a compiler-based program execution of test cases.
With the above and a few points in mind, here is an edited version of your code you may want to review:
Click to see code
integer, parameter :: Float32 = selected_real_kind( p=6 )
integer, parameter :: Float64 = selected_real_kind( p=12 )
integer, parameter :: WP = Float64
! Program data
real(WP), parameter :: ONE = 1.0_wp
integer, parameter :: N = 1000000
real(WP), allocatable :: v(:), r(:)
integer, parameter :: NTRIALS = 12
real(Float64) :: Calc_Times(NTRIALS)
real(Float64) :: Mean_Time
real(WP) :: rsum(NTRIALS)
real(WP) :: stddev
allocate( v(N), r(N) )
call random_number( v )
v = v * 200.0_wp
! Set first element to a reference value for a numerical check
v(1) = ONE
call calc( Calc_Times, rsum )
Mean_Time = ( sum(Calc_Times) - minval(Calc_Times, dim=1) - maxval(Calc_Times, dim=1) ) / (NTRIALS-2)
print *, "BESSEL_J1 Evaluation: Mean compute time for ", N, " evaluations:"
print *, Mean_Time, " seconds."
stddev = norm2( rsum - sum(rsum)/NTRIALS )
print *, "Standard deviation in calculated results: ", stddev
contains
impure elemental subroutine calc( t, sumr )
! Argument list
real(Float64), intent(inout) :: t
real(WP), intent(inout) :: sumr
! Local variables
real(Float64) :: t0
real(Float64) :: t1
call cpu_t( t0)
r = bessel_j1( v )
call cpu_t( t1 )
t = t1 - t0
! Check calculation result
if ( abs(r(1) - 0.44_wp) > 1E-3_wp ) then
print *, "BESSEL_J1( x=1 ): ", r(1), " vs an expected value of approximately ", 0.44_wp
error stop "Unexpected result with x = 1.0: "
end if
sumr = sum( r )
end subroutine
subroutine cpu_t( time )
use, intrinsic :: iso_fortran_env, only : I8 => int64
!.. Argument list
real(Float64), intent(inout) :: time
!.. Local variables
integer(I8) :: tick
integer(I8) :: rate
time = 0.0_Float64
call system_clock (tick, rate)
time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )
return
end subroutine
end
C:\temp>gfortran -O3 p.f90 -o p.exe
C:\temp>p.exe
BESSEL_J1 Evaluation: Mean compute time for 1000000 evaluations:
4.9179059994639826E-002 seconds.
Standard deviation in calculated results: 0.0000000000000000
Just a note that nvfortran, gfortran, and ifort produced very similar timings with the simpler test. That is a little surprising as each vendor is free to implement the intrinsics differently. With some of the intrinsics some compilers concentrate on precision, others on speed or error detection, etc …
The netlib bessel functions support fractional orders, and some like the Sandia Mathematical Program Library routines were about 15 to 20 percent faster than the intrinsics when compiled with high optimization last I timed them; but considerably slower than the intrinsics when compiled with common debug flags; for example ( I used the Sandia procedures for a long time, before the intrinsics were available on the majority of common compilers).
To add to above, from a simple-minded Fortran practitioner perspective, especially to clear away all the fog which comes about due to Julia benchmark tools and the various macro add-ons that would appear to do a lot of stuff “under the hood” which is not at all obvious to an unfamiliar reader of Julia commands, a straightforward computation somewhat similar to the Fortran coding approach shown above can be as follows:
using Bessels
function fn!(v::Array{Float64}, r::Array{Float64})
for i = 1:length(v)
r[i] = Bessels.besselj1(v[i])
end
return nothing
end
v = rand(1000000)*200.0
r = zeros(1000000)
@time x=fn!(v,r)
With above on the same platform where the gfortran-compiled program execution result I had provided previously, Julia yields:
C:\temp>julia b1.jl
0.086703 seconds (13.91 k allocations: 777.971 KiB, 63.65% compilation time)
So you will notice the time of 0.086703 seconds reported by Julia. I don’t use Julia so I’m unsure how to interpret this timing value. It is about 1.7x greater than gfortran for 1,000,000 evaluations of the Bessel function of the 1st kind, order 1.
Does the Julia time include compilation and allocation even though the @time
macro is applied on the function evaluation?
Yes, if you scroll a bit to the right on the output you showed, you’ll see that it reported
63.65% compilation time
suggesting that the run-time of your benchmark was 0.02861 seconds
.
It’s also worth noting that julia’s array indexing and setting is safe by default, so each iteration of your benchmark loop will always include at least one bounds check for r
(the compiler will probably, but not always, figure out that the v
indexing is inbounds and elide the check). You can annotate the inside of your loop body with the @inbounds
macro to eliminate this.
There are a couple things I would be wary of (you can see how I set up the Julia benchmarks at (bessel-benchmarks/besselj.jl at main · heltonmc/bessel-benchmarks · GitHub). v
and r
are global variables and as Mason said it is not making any inference about the length of r
and v
so a lot of time is spent simply checking out of bounds access and assignment. Here is an implementation that matches your Fortran code…
function bessel(f; N=50000000, Order=false)
v = rand(N)*200
r = similar(v)
if Order
v = sort(v)
end
tstart = time()
@inbounds for ind in eachindex(v)
r[ind] = f(v[ind])
end
tend = time()
t = (tend-tstart) / N * 1e9
return t
end
julia> bessel(besselj1)
29.67360019683838
julia> bessel(besselj1)
29.674997329711914
Which is about 29.7 ns per evaluation (which is what I report at GitHub - heltonmc/bessel-benchmarks: Benchmarking Bessel Function routines). The code per evaluation that you reported is about 49 ns per eval (which is also similar to what I got). If you wanted to enable bounds checks you can remove the inbounds. My timings removing that statement are …
julia> bessel(besselj1)
31.076979637145996
julia> bessel(besselj1)
30.705680847167972
julia> bessel(besselj1)
30.715522766113285
They’re not globals inside the function, they’re just regular function arguments which is fine.