Julia: Fast as Fortran, Beautiful as Python

Well, there you are just measuring compilation time (in Julia, a function is compiled the first time it is called). Run the function twice and you will get its actual performance:

julia> @time mydot(a,b,N)
  0.013960 seconds (25.05 k allocations: 1.596 MiB, 89.23% compilation time)
250280.42409540198

julia> @time mydot(a,b,N)
  0.002155 seconds (1 allocation: 16 bytes)
250280.42409540198

Which is roughly the same as the Fortran version (Time: 0.00295899808406830 s).

So, no trickery there. The version using LinearAlgebra.dot is much faster, actually:

julia> @time c = dot(a, b)
  0.000527 seconds (1 allocation: 16 bytes)
250412.44f0

Because it is loop-vectorized. You can achieve that in the custom Julia version by adding the @inbounds and @simd macros (and fixing the fact that you were returning a Float64 by initializing with c=0.; the function can be made generic by using c = zero(eltype(a))):

julia> function mydot(a, b, N)
           c = 0.f0
           @inbounds @simd for i = 1:N
               c += a[i]*b[i]
           end
           c
       end
mydot (generic function with 1 method)

julia> @time mydot(a,b,N)
  0.022222 seconds (44.48 k allocations: 2.825 MiB, 97.20% compilation time)
250412.59497003

julia> @time mydot(a,b,N)
  0.000728 seconds (1 allocation: 16 bytes)
250412.59497003
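
For illustration, the generic version could look like this (a minimal sketch; the name mydot_generic is arbitrary):

function mydot_generic(a, b, N)
    c = zero(eltype(a))      # accumulator takes the element type of a
    @inbounds @simd for i = 1:N
        c += a[i]*b[i]
    end
    c
end

With Float32 input it accumulates in Float32, and with Float64 input in Float64, so no accidental promotion occurs.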

(unfortunately, -march=native didn't seem to provide the same speed in Fortran; I thought it would).

My point is: these micro-benchmarks are always a waste of resources, because all these languages can get close to the best performance. You just need to know what you are doing. Julia does have a startup/compilation time lag, which is mostly irrelevant for real HPC applications (though it can be annoying in some cases for interactive use, and this is being worked on extensively by the developers).

The point raised by @implicitall illustrates why the Julia community is so eager to show that Julia works: incorrect use or interpretation of simple tests can provide bad first impressions.

Concerning the comment of @shahmoradi: I cannot comment specifically about that video that claims a 3x speedup over Fortran. But I have the same experience with some packages of mine, and I have spent much more time working on the Fortran codes than on the Julia ones. The best performance I have obtained using Julia is because I could improve the algorithms in Julia more easily than I was able to in Fortran, basically because of better tooling. If I port the algorithms back to Fortran I will get similar performance, of course.

And that should not be a surprise here; after all, isn't it to provide better tooling for Fortran that LFortran is being developed? I will not be surprised if in a few years we see posts like "How LFortran helped me to improve the speed of my (Fortran) code 3x". That is what I expect to see, in fact, and this is the reason I am here in the forum: to follow the development of LFortran.

All these languages can generate optimal code. The choice of one over the other should not be made because of the performance of properly written and benchmarked code, because it will be the same. The choice should be about syntax, familiarity, stability, long-term support, distribution, tooling, community, etc.: a lot of subjective reasons, and some objective ones, that favor one or the other.

10 Likes

Thank you all for the interesting discussion. Along the lines of a few comments in this thread, I'm interested in this comparison: take Fortran and Julia code for small yet real-world examples that produce equivalent assembly; then compare the two sources for conciseness, readability (however that is defined), tooling needed to run them, and similar.

2 Likes

Interesting: when I run the code as a script it does not mention compilation time, so I can call it twice. When I call it twice I still get a 7x faster program in Fortran. I don't find that initialization with a float makes much difference. When I run your function with the metaprogramming help, I still get a 2.5x faster result in Fortran. What are your Fortran compiler flags?

1 Like

The experience shared by @Imiq here is another testament to the fact that the Fortran community needs volunteers to begin developing the missing new tools in Fortran on a massive scale, like in the '70s. (I am paying my share by developing tools for which I have the expertise, but a single developer or a handful of developers is nowhere near the sheer number of graduate students who develop almost one Julia package a day, as I see it on GitHub.) Now, to be fair, I don't think the Julia community's approach will be sustainable, and many of these packages will likely be abandoned in the future. But the Fortran community has the discipline and experience to develop a minimal long-lasting ecosystem built upon the existing infrastructure.

I think the Julia approach is sustainable, just like in Python. Some packages will be abandoned, but a package that has users will stay useful, and somebody will maintain it. We need a similar ecosystem in Fortran. Thanks to fpm and the fortran-lang community, it is already happening; we just need to keep the momentum going to grow much bigger.

8 Likes

Compilation time is printed by @time only from Julia 1.6 onwards.

Compilation occurs at the first call of each function in each Julia session. Thus, at every run of the script you will be measuring the compile time (unless you use some tool like this one). Julia is therefore not the most appropriate tool for running very fast scripts multiple times, and a quick program like this will, for a single run, always be slower than a compiled Fortran code.
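
If you want to take compilation out of the measurement entirely, the standard tool is the BenchmarkTools package; a minimal sketch, assuming the mydot defined earlier in the thread:

using BenchmarkTools

N = 1_000_000
a = rand(Float32, N)
b = rand(Float32, N)

# @btime runs the call many times and reports the minimum time,
# so the one-time compilation cost does not enter the measurement
@btime mydot($a, $b, $N)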

Let us see what happens with a much larger N, where compilation time is less important. For N=1_000_000_000 (one billion), for example:

% gfortran -O3 -march=native dot.f90 -o dot
% ./dot
   16777216.0    
Time: 1.23213911056518555 s

and for julia:

% julia dot.jl
  2.366178 seconds (26.00 k allocations: 1.658 MiB, 0.74% compilation time)
1.6777216e7

I don't know why Fortran is faster here, and I just asked that in the Julia forums.

Thus, Fortran wins in this example. Yet, LinearAlgebra.dot is still faster:

  0.450951 seconds (49.83 k allocations: 3.192 MiB, 3.35% compilation time)

Probably if one links BLAS to the Fortran code we would get the same result.
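
On the Julia side, LinearAlgebra.dot for these strided Float32/Float64 vectors dispatches to the BLAS routine under the hood; you can call it directly to check (a minimal sketch):

using LinearAlgebra

a = rand(Float32, 1_000_000)
b = rand(Float32, 1_000_000)

# BLAS.dot takes the length and the strides explicitly; for Float32
# vectors this is the same sdot that dot(a, b) ends up calling
c = LinearAlgebra.BLAS.dot(length(a), a, 1, b, 1)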

That last result can be reproduced in the custom loop with the LoopVectorization package (and that, of course, can be very useful if your loop does something less standard). Compilation time is longer with that (~10 s); thus, if your problem effectively takes 1-2 s to run, that is not a good option either if you are not running it more than once.
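
For reference, the LoopVectorization version of the loop could look like this (a sketch; in older versions of the package the macro was called @avx instead of @turbo):

using LoopVectorization

function mydot_turbo(a, b, N)
    c = zero(eltype(a))
    @turbo for i in 1:N      # replaces @inbounds @simd, with more aggressive vectorization
        c += a[i]*b[i]
    end
    c
end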

Again, I don't think that these micro-benchmarks are important. In both languages it is possible to get close-to-best performance, and in Julia you will always have some initialization/compilation delay (there is PackageCompiler, but I have never tried it; I have not had the need). But you will have to know how to work with the language, and one or the other might be better for the task and deployment you want, and for the workflow you find more comfortable. I have even been saying this to people in the Julia forums: don't expect Julia to be faster than C, C++ or Fortran. But do not expect it to be slower either. If one or the other is much faster, there is certainly something that can be improved on the slower side. Sometimes that optimization feels natural in one language, sometimes in the other.
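
For completeness, the usual PackageCompiler workflow bakes packages into a custom system image so their compilation cost is paid once, at build time (a sketch I have not tested; the file name is arbitrary):

using PackageCompiler

# compile LoopVectorization into a custom system image
create_sysimage(["LoopVectorization"]; sysimage_path="sys_custom.so")

# then start Julia with: julia --sysimage sys_custom.so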

Add-on: as you can see from the two-second answer to my post in the Julia forums, I still sometimes don't readily visualize the full picture of the programming model in Julia, after about 1.5 years programming with it. If it were my code I would have figured that out, because there are some tools to check for that type of drawback; but Fortran is certainly more robust in that regard, there is certainly less room for that kind of problem, which is introduced by the dynamic nature of Julia.
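
The tools I mean are things like @code_warntype, which flags type instabilities; a small constructed example:

# a classic instability: s starts as an Int but becomes a Float64
# inside the loop, so Julia has to carry a Union{Int64, Float64}
function unstable_sum(v)
    s = 0
    for x in v
        s += x
    end
    s
end

# julia> @code_warntype unstable_sum(rand(10))   # highlights the unstable s
# writing s = zero(eltype(v)) instead makes the function type-stable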

2 Likes

Sorry, I wasn't being explicitly clear; what I mean by "call it twice" is this:

function mydot(a, b, N)
    c = 0.f0
    @inbounds @simd for i = 1:N
        c += a[i]*b[i]
    end
    c
end

N = 1000000
a = rand(Float32, N)
b = rand(Float32, N)
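
# first call: @time includes the JIT compilation of mydot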
@time c = mydot(a, b, N)
println(c)

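# second call: mydot is already compiled, so only the run itself is timed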
@time c = mydot(a, b, N)
println(c)

In any case, that is all the time I have for these particular benchmarks. I do think the benchmarks are important and that Julia is slower. I may change my mind later if I ever see any program run faster on my computer.

Well, exactly that code is faster here (the second-time printout) in Julia than in Fortran. But really, the choice of which language to learn should not be based on this kind of benchmark. If you were my student I would tell you to learn both: the programming concepts, tools, and mental models are different, both interesting and complementary. And later you can make a more informed decision about whether to keep using one, the other, or both, for the tasks you feel they are better at.

3 Likes

Yeah, I agree about trying to make an informed decision. So I asked for your compiler flags to try to find out why I do not see better performance in Julia on this exact example or any code in this topic.

I tried: -O3, -O3 -march=native, -O3 -march=native -funroll-loops -fvariable-expansion-in-unroller, but I couldn't notice much of a difference. (Maybe the Julia version?)

Thanks! I narrowed it down to the -ffast-math flag. It gives a huge speedup, taking it from slower to faster. So I have reproduced your results.
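
For anyone comparing: the rough Julia analogue of -ffast-math is the @fastmath macro, which relaxes IEEE semantics for the annotated expressions (a sketch; note that @simd already allows reassociating the reduction, which is the main effect -ffast-math has on this loop):

function mydot_fast(a, b, N)
    c = zero(eltype(a))
    @inbounds @fastmath for i in 1:N
        c += a[i]*b[i]
    end
    c
end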

1 Like

Hi everyone,

One of my colleagues just pointed out this thread to me. I wrote a version in Fortran that matches the performance of the Julia code.

AFAICT the issue is entirely down to the poor intrinsics of gfortran, especially for complex functions. Using Intel MKL's vector math functions (VML) makes up for it (but introduces additional call overhead). Compiling the whole thing with ifort would probably also work.

Here is my code btw


function eikR(a, k, n) result(M)
    implicit none

    integer,            intent(in)  :: n
    real(kind=8),       intent(in)  :: a(n)
    complex(kind=8),    intent(in)  :: k
    complex(kind=8)                 :: M(n,n)

    ! Local
    integer         :: j
    real(kind=8)    :: rtmp(n), sinp(n), cosp(n), tmp_exp(n)
    real(kind=8)    :: a_pow_2(n)

    complex(kind=8) :: imk
    real(kind=8)    :: rimk
    real(kind=8)    :: iimk

    imk  = (0,1) * k
    rimk = real(imk)
    iimk = aimag(imk)   ! aimag is the standard intrinsic; imag is a GNU extension

    a_pow_2 = a * a

    !$OMP PARALLEL DO DEFAULT(private) SHARED(a_pow_2, M, n, rimk, iimk)
    do j = 1, n

        rtmp = a_pow_2 + a_pow_2(j)
        call vdsqrt(n, rtmp, rtmp)              ! MKL VML: rtmp = sqrt(rtmp), in place

        ! real part of the exponent: exp(rimk*r)
        call vdexp(n, rimk * rtmp, tmp_exp)

        ! imaginary part: sin and cos of iimk*r
        call vdsincos(n, iimk * rtmp, sinp, cosp)

        ! assemble output; kind=8 is needed, otherwise cmplx truncates
        ! to default single precision
        M(:, j) = cmplx(tmp_exp * cosp, tmp_exp * sinp, kind=8)
    end do
    !$OMP END PARALLEL DO

end function eikR

4 Likes

@clavigne, thank you for your analysis. Re: your comment about Intel and it being commercial, can you please confirm that you have seen this about Intel oneAPI being free and IFORT being available with it: Intel releases oneAPI Toolkit, free, Fortran 2018

Free, but not free software. I haven't looked at the oneAPI license in detail, but the main reason why I've used gfortran for that kind of mixed Python code in the past is that it makes it really easy to distribute the source to collaborators and have them build binaries if they want.

Talking about binaries, I'd also like to point out that Julia itself is like a 600 MB download, which makes it hard to distribute apps and libraries. For comparison, a statically linked large Fortran and MKL program like @awvwgk's xtb is 15 MB.

4 Likes

I'm not clear on whether the program is written in Fortran. Are you still running the Python program?

Welcome to the discourse, Cyrille, nice to see you around here.

Totally agree on this point; static linking is one of my most loved features of compiled languages (not only Fortran, but C and C++ as well). There are of course some remaining issues there, like GLIBC compatibility and distribution for different platforms/architectures, but static binaries tend to have the majority of users covered.

In C we can now take it a step further with Cosmopolitan; I wonder what it would take to compile the C++ and Fortran runtimes against it and build Actually Portable Executables in those languages as well.

1 Like

I think one thing to consider here is that for Fortran to "win" it can't tie. Most users will prefer a higher-level language if it offers the same performance. As such, I'm not sure Fortran can be improved enough to stop Julia (or Nim, Dex, etc.) from stealing its users.

Yes, I'm compiling with f2py and identical flags except for -lmkl, and running the same Python test.

Earlier in this thread I wrote the equivalent Fortran program for your Julia program.

Wait, sorry, I think there is some confusion. I wrote a Fortran program that matches the performance of the (original) Julia code by calling Intel MKL functions instead of using exp() and sin().

Sorry if it wasn't clear; I'm new to this platform.