Is there a good Fortran package for SDE, DDE and ODE?

@oscardssmith I think if the FLINT author really compare FLINT with Julia’s differentialequation.jl package, then the result is what it is.

I can be wrong, but it is difficult to believe the people who wrote, test, and maintain the Julia’s differentialequation.jl package did not aware of the stuff you mentioned. If those can really help they will do so already. Perhaps you could also make your comments and suggestions to Julia Forum and to see how they respond.

I can be absolutely wrong. But based on my very shallow experience with Julia, no offense, I feel that with or without the Julia’s optimization tricks you suggested, the speed of the Julia code will not really change much. 10-30% faster is possible, but like you said easily 10x faster is unlikely. Because as long as they follow the performance tips, Performance Tips · The Julia Language
recent versions of Julia should be intelligent enough to optimize those `stupid’ part of the code you pointed out automatically. If Julia cannot handle those optimization automatically and requires the user to write code with that level of awareness and carefulness you suggested, it will be really funny and seems lost its purpose because not everyone is software engineer level.

Like even in Fortran, based on my shallow experience, with modern compilers and hardware, even if I wrote things like A( i, : ) instead of column major A(:, i ) in the loop, 10-20% slower is possible (actually sometimes the difference is even smaller and even negligible). But the code will highly unlikely slow down by 10 times. In Julia I believe things are similar, the compiler will do some optimization work automatically and the modern hardware is fast enough. Besides, it is difficult to always maintain column major everywhere in the code. Sometime you just have to have some A(i, : ) no matter what.

This isn’t about code in DifferentialEquations.jl this is about the benchmarking code that the FLINT authors used to benchmark Julia against their FLINT. They wrote their own model separately in fortran and in julia, so their inefficiency in their code for the Julia model isn’t something DifferentialEquations is able to optimize for.

1 Like

Thank you! @oscardssmith
Eh, which jl test file in FLINT are you referring to, it seems I only find one jl file,

It seems the Julia script they wrote was not that bad.
Perhaps you could quickly do a test on you computer, with their original Julia script, and with your optimized script, to see if there is a 10x speedup?

Thank you very much.
May I ask a really stupid question, how to use the FOODIE library? It seems I cannot find installation guide from the github page.

I’m not going to bother fixing the whole thing, but just making the change I made to JacobiConstant results in

using BenchmarkTools
X = rand(6,1000)
@btime JacobiConstant(X) #6.7 ms
@btime JacobiConstant2(X) #0.19 ms

Thank you very much indeed!
Uhm, may I ask (sorry if it is very stupid), does this fix, influence the benchmark time for the differential equations much?
Like, totally the test take 10s, and 9s is spend in solving the differential equations.
Is it that your optimization managed to make the total time decreased to 9.01s, however the differential equation solver still took 9s?
Or, are you saying the optimization you made can decrease the total time to like, 1s or less?

Is it possible that those ‘stupid’ part only run once in their Julia code, therefore that do not influence the time of actually solving the differential equations very much?
For sure, if those ‘stupid’ part is repeatedly called during solving the differential equations, your optimization could help a lot!

Again, I am not expert in Julia in anyway, and I am just begin to dig into the world of differential equations, and I always tend to ask stupid questions.
Uhm, like, is it possible, perhaps your could check, in the Julia code the FLINT author wrote, how much time is really spent in calculating the differential equations? If majority of the time it is not calculating the differential equations, then for sure their benchmark timing is not meaningful.

EDIT: The timings below are not correct. See post below.

Yes it does, very much.

Making just the change suggested by @oscardssmith ,

Before:

julia> include("benchmark.jl")
FLINT Performance comparison with Julia DiffEq Package

rtol = 1.0 ./ 10.0 .^ 11 = 1.0e-11
atol = rtol * 0.001 = 1.0e-14
Running: DP5
Running: TSIT5
Running: DP8
Running: Vern6
Running: Vern9
tol: 1.0e-11

Solver: Julia time, FLINT time, Julia closing errors
==============================================================
    DP5: 1.42786e+01, 1.47000e+00 (FLINT   DOP54), 4.32869e-01
  TSIT5: 1.16113e+01, 1.47000e+00 (FLINT   DOP54), 4.32869e-01
    DP8: 1.92547e+01, 1.97000e+00 (FLINT  DOP853), 4.32869e-01
  Vern6: 1.48502e+01, 1.64000e+00 (FLINT Vern65E), 4.32869e-01
  Vern9: 4.67123e+01, 2.27000e+00 (FLINT Vern98R), 4.32869e-01

After:

julia> include("benchmark1.jl")
FLINT Performance comparison with Julia DiffEq Package

rtol = 1.0 ./ 10.0 .^ 11 = 1.0e-11
atol = rtol * 0.001 = 1.0e-14
Running: DP5
Running: TSIT5
Running: DP8
Running: Vern6
Running: Vern9
tol: 1.0e-11

Solver: Julia time, FLINT time, Julia closing errors
==============================================================
    DP5: 7.12239e+00, 1.47000e+00 (FLINT   DOP54), 4.32869e-01
  TSIT5: 8.75959e+00, 1.47000e+00 (FLINT   DOP54), 4.32869e-01
    DP8: 1.46508e+01, 1.97000e+00 (FLINT  DOP853), 4.32869e-01
  Vern6: 1.19671e+01, 1.64000e+00 (FLINT Vern65E), 4.32869e-01
  Vern9: 4.24091e+01, 2.27000e+00 (FLINT Vern98R), 4.32869e-01

Notice that the julia time for each of solver decreases very much. For the first solver it speeds up by about 50%! If this small change gives this much speed up, probably the speed will match the Fortran code easily by making the julia code “less suck” :slight_smile: .

2 Likes

The above timings are not correct. I found out that the reason for speedup in the benchmark was due to not including compilation time in the second run. When I ran the benchmark.jl again, I got similar timings as that of benchmark1.jl. I was thinking the compilation time was not included in the benchmarks. But it seems they do. Now it seems the improvement in timings is not much.

Timings without including compilation time:
Before:

julia> include("benchmark.jl")
FLINT Performance comparison with Julia DiffEq Package

rtol = 1.0 ./ 10.0 .^ 11 = 1.0e-11
atol = rtol * 0.001 = 1.0e-14
Running: DP5
Running: TSIT5
Running: DP8
Running: Vern6
Running: Vern9
tol: 1.0e-11

Solver: Julia time, FLINT time, Julia closing errors
==============================================================
    DP5: 7.12747e+00, 1.47000e+00 (FLINT   DOP54), 4.32869e-01
  TSIT5: 8.95990e+00, 1.47000e+00 (FLINT   DOP54), 4.32869e-01
    DP8: 1.46165e+01, 1.97000e+00 (FLINT  DOP853), 4.32869e-01
  Vern6: 1.18848e+01, 1.64000e+00 (FLINT Vern65E), 4.32869e-01
  Vern9: 4.24177e+01, 2.27000e+00 (FLINT Vern98R), 4.32869e-01

After:

julia> include("benchmark1.jl")
FLINT Performance comparison with Julia DiffEq Package

rtol = 1.0 ./ 10.0 .^ 11 = 1.0e-11
atol = rtol * 0.001 = 1.0e-14
Running: DP5
Running: TSIT5
Running: DP8
Running: Vern6
Running: Vern9
tol: 1.0e-11

Solver: Julia time, FLINT time, Julia closing errors
==============================================================
    DP5: 7.12068e+00, 1.47000e+00 (FLINT   DOP54), 4.32869e-01
  TSIT5: 8.87110e+00, 1.47000e+00 (FLINT   DOP54), 4.32869e-01
    DP8: 1.46547e+01, 1.97000e+00 (FLINT  DOP853), 4.32869e-01
  Vern6: 1.18421e+01, 1.64000e+00 (FLINT Vern65E), 4.32869e-01
  Vern9: 4.23092e+01, 2.27000e+00 (FLINT Vern98R), 4.32869e-01
1 Like

Thank you very much and welcome!

Sure, the compile time should not be included.

Uhm, so may I ask a stupid question, in your opinion, what is the conclusion, Julia version and FLINT, which is faster? Or their speed should be comparable?

I don’t think I’m qualified to judge that based on how little I know about them.

Alright, I am the author of FLINT, just made an account to reply to these comments.

First, which stupid benchmark code you are reading from, certainly not mine :slight_smile:. Because if you read my code properly you will see that the stupid JacobiConstant function is actually not part of the loop that is being timed. See line # 194-203 in my original file on the github link that Jacob Williams shared.

Second, this stupid julia code was written by me and was optimized by the author or one of the authors (Chris Rackauckas) of DifferentialEquations.jl package. By writing this Julia code (again the code that is being actually timed, not the extraneous things) as you write in C, he claimed he was able to make Julia code run as fast as the Fortran version (without ODE events) on his machine (and he acknowledges Fortran code is much more concise). You can read the entire discussion here. If I remember correctly, there are some interesting findings on the use of @view also there.

The latest benchmark code in my repository is exactly the optimized code from Chris plus the event condition that I added. I run the Julia code twice precisely to not account for compilation time in my second run.

As we all know there is always room to improve the codes and it is certainly true for this Julia code and maybe more so for FLINT.

5 Likes

Apologies for saying that JacobiConstant was part of the code being timed (I misread the code slightly).
The main problem in the code being benchmarked is the callbacks that you added. Since they are type unstable, and refer to a non-constant global variable, they kill performance.

To diagnose this, I’ve simplified your benchmark a bunch (just to make things easier to read).

using OrdinaryDiffEq, StaticArrays;

function cr3bpeom(dX, X, p, t)
    mu = 0.012277471
    @inbounds begin
        r1 = sqrt((X[1] + mu)^2 + X[2]^2)
        r2 = sqrt((X[1] - 1 + mu)^2 + X[2]^2)
        @views dX[1:3] = X[4:6]
        dX[4] = X[1] + 2*X[5] - (1 - mu)*(X[1] + mu)/r1^3 - mu*(X[1] - 1 + mu)/r2^3
        dX[5] = X[2] - 2*X[4] - (1 - mu)*X[2]/r1^3 - mu*X[2]/r2^3
        dX[6] = 0.0
    end
end

condition1 = function (u, t, integrator)
    if t > 0.5 && t < 0.5005
       1
    else
       -1
    end
end
function affect1(integrator)
end
function affect1_neg(integrator)
end
cb1 = ContinuousCallback(condition1, affect1, affect1_neg;
            rootfind=true,save_positions=(true,true),abstol=0.001)
            
Mask2 = false
function condition2(u, t, integrator)
    if Mask2 == false
        if u[1] < 0
            -1
        else
            u[2]
        end
    else
        -1
    end
end
function affect2(integrator)
    Mask2 = true
end
cb2 = ContinuousCallback(condition2, affect2; rootfind=true,save_positions=(true,true),abstol=1e-9)

Mask3 = false
function condition3(u, t, integrator)
    if Mask3 == false
        if u[1] > 0
            -1
        else
            u[2]
        end
    else
        -1
    end
end
function affect3(integrator)
    integrator.u .= X0
    Mask3 = true
end
cb3 = ContinuousCallback(condition3, affect3; rootfind=true,save_positions=(true,true),abstol=1e-9)
# callback set
cb = CallbackSet(cb1,cb2,cb3)

const X0 = [0.994, 0.0, 0.0, 0.0, -2.00158510637908252240537862224, 0.0]

#compile everything    
solve(ODEProblem(cr3bpeom, copy(X0), (0.0,.1)),Tsit5(),abstol=1e-14, reltol=1e-11,timeseries_errors=false, dense_errors=false, dense=false)
@time for i in 1:100
    # randomsize initial conditions
    X0[1] = X0[1] + .000000000001*X0[1].*rand()
    prob = ODEProblem(cr3bpeom, copy(X0), (0.0,42.66304140039491));                
    solve(prob,Tsit5(),abstol=1e-14, reltol=1e-11, timeseries_errors=false, dense_errors=false, dense=false,callback=cb)
end

On my computer, this takes 15 seconds. If we remove callback=cb, that drops to .2 seconds (or faster than FLINT).

We can then get another 10% performance improvement and a 20x drop in allocations by using the StaticVector approach that was mentioned when you originally asked for advice for on this (the other reason to do this is that it provides a much bigger speedup once you start adding back more optimized versions of the callbacks).

To optimize the callbacks, we need to make sure the conditions always return the same datatype, and don’t rely on non-const globals. For example,

condition1(u, t, integrator) = (t-0.5) * (t-0.5005)
cb1 = ContinuousCallback(condition1, nothing; rootfind=true,save_positions=(true,false), abstol=0.001)

const Mask2 = falsefunction condition2(u, t, integrator)
    if Mask2 == false
        if u[1] < 0
            -1.
        else
            u[2]
        end
    else
        -1.
    end
end
function affect2(integrator)
    Mask2 = true
end
cb2 = ContinuousCallback(condition2, affect2; rootfind=true,save_positions=(true,false),abstol=1e-9)

const Mask3 = false
function condition3(u, t, integrator)
    if Mask3 == false
        if u[1] > 0
            -1.
        else
            u[2]
        end
    else
        -1.
    end
end
function affect3(integrator)
    integrator.u = X0
    Mask3 = true
end
cb3 = ContinuousCallback(condition3, affect3; rootfind=true,save_positions=(true,true),abstol=1e-9)
cb = CallbackSet(cb1,cb2,cb3)

Using const for Mask2 and Mask3 isn’t great practice, but const variables allow you to change their value (but not type) for now, and it’s a good, easy way to show how much of a difference it makes. With these changes, the code runs in 3.5 seconds (or an over 4x speedup over the original code).

1 Like

We also get another 2.5x over what we had previously, by rewriting cr3bpeom as

function cr3bpeom(X, p, t)
    mu = 0.012277471
    @inbounds begin
        r1 = cbrt((X[1] + mu)^2 + X[2]^2)^2
        r2 = cbrt((X[1] - 1 + mu)^2 + X[2]^2)^2

        dX4 = X[1] + 2*X[5] - (1 - mu)*(X[1] + mu)*r1 - mu*(X[1] - 1 + mu)*r2
        dX5 = X[2] - 2*X[4] - (1 - mu)*X[2]*r1 - mu*X[2]*r2

        @SVector [X[4],X[5],X[6],dX4,dX5,0.0]
    end
end

Julia can’t do this re-write itself because it introduces floating point differences, but it will be as accurate, and much faster.

(Also, switching the algorithm to Vern8 brings this to under 1/3rd of a second)

Edit this rewrite is wrong, I got x^(2/3) != x^(-3/2)

2 Likes

Welcome @prince_mahajan, and thank you for posting!

All: I’ve been enjoying the conversation so far. Now that we all got a little bit of “stupid” out of the system, let’s please stay friendly and refrain from calling anybody’s work stupid, whether we think it or not. Thank you! :slight_smile:

7 Likes

@oscardssmith No problem. I think you missed the gist of my earlier post. Of course, it is possible to match Fortran performance with Julia by using static arrays and by avoiding many things that make a scripting language attractive to use. This was shown (for this particular differential equation problem) in that almost two year old post I had shared.

On my computer, this takes 15 seconds. If we remove callback=cb, that drops to .2 seconds (or faster than FLINT).

Your conclusion here is incorrect. You are comparing apples to oranges! Leme simplify for you. Those callbacks are for event handling. Turn off event handling in BOTH this Julia code as well as in FLINT to do a proper comparison. Otherwise, of course, you going to see Julia code taking less time than FLINT because the latter is still running event finding code.

I just did a test with the shortened script (not the one using static vector) from your post with event handling off (in which I see the main changes you did are just hard coding numbers and added @views). I increased the runs from 100 to 1000.

FLINT DOP54 solver time on my machine: ~2.4 s

JULIA TSIT5 DiffEq method on my machine: ~ 4.1 s

If you replace @views that you added with the plain assignment I had originally in the Julia script, you cut down by ~0.4 s: ~3.7 s (see below)

julia> include("JuliaDiffEq_CR3BP1.jl")
  3.720487 seconds (11.38 M allocations: 933.147 MiB, 3.18% gc time)

julia> include("JuliaDiffEq_CR3BP1.jl")
WARNING: redefinition of constant X0. This may fail, cause incorrect answers, or produce other errors.
  3.693575 seconds (11.38 M allocations: 933.090 MiB, 3.20% gc time)

julia> include("JuliaDiffEq_CR3BP1.jl")
WARNING: redefinition of constant X0. This may fail, cause incorrect answers, or produce other errors.
  3.843031 seconds (11.38 M allocations: 933.092 MiB, 3.05% gc time)
3 Likes

Just for curiosity, which Julia version are you using? From 1.5+ views should be pretty much for free, afaik. If not, maybe this is an issue that deserves being reported.

At the end, the claim that FLINT is an order of magnitude faster than DifferentialEquations.jl can be supported or not?

I’m kind of confused why you don’t consider StaticArrays fair game for Julia. They are an established part of the ecosystem for doing high performance computing in Julia, and I don’t think they make Julia less attractive to use as a scripting language.

the latest stable release. I don’t follow Julia updates religiously but I have heard this line for views multiple times!

julia> versioninfo()
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, haswell)

For your last question, my results are not updated in a while and hopefully, I will find time to update them with the new version, and then we will see. We should avoid generalizations one way or another. I am just doing benchmarking for one differential equation IVP and that is what I am reporting.

@oscardssmith I didn’t consider StaticArrays version of your code because I was out of time. I will run that also and report the numbers here. But without using that, I am definitely not seeing julia diffeq code matching FLINT numbers. And I think we can at least agree to that.

Update: with StaticArrays, Julia TSIT5 solver time: ~3.6 s.

1 Like

What type of solver is DOP54? (I haven’t come across it before). For these tolerances, I’m seeing Vern8 outperforming Tsit5 by about 3x without the callbacks, and about 2x with the callbacks.

For large systems of stiff ODEs the CVODE BDF method in Sundials is really great. I use it in my model of atmospheric chemistry. CVODE is in C, but they provide a great Fortran interface. Sundials makes it easy to use sparse jacobians (which is essential for big problems)

Example of how I use CVODE is here: Photochem/photochem.f90 at 11fcdda980e13983d816e9f8e42a4a4ea123be12 · Nicholaswogan/Photochem · GitHub

Sundials also has the ARKODE ODE solver, which is a collection of Runge Kutta methods (both implicit and explicit). I have tried these methods some, but without much success so far.

3 Likes