Julia: Fast as Fortran, Beautiful as Python

Well, there you are just measuring compilation time (in Julia, a function is compiled the first time it is called). Run the function twice and you will get its actual performance:

julia> @time mydot(a,b,N)
  0.013960 seconds (25.05 k allocations: 1.596 MiB, 89.23% compilation time)
250280.42409540198

julia> @time mydot(a,b,N)
  0.002155 seconds (1 allocation: 16 bytes)
250280.42409540198

Which is roughly the same as the Fortran version (Time: 0.00295899808406830 s).

So, no trickery there. The version using LinearAlgebra.dot is much faster, actually:

julia> @time c = dot(a, b)
  0.000527 seconds (1 allocation: 16 bytes)
250412.44f0

Because it is loop-vectorized. You can achieve that in the custom Julia version by adding the @inbounds and @simd macros (and fixing the fact that you were returning a Float64 by initializing with c=0.; the function can be made generic by using c = zero(eltype(a))):

julia> function mydot(a, b, N)
           c = 0.f0
           @inbounds @simd for i = 1:N
               c += a[i]*b[i]
           end
           c
       end
mydot (generic function with 1 method)

julia> @time mydot(a,b,N)
  0.022222 seconds (44.48 k allocations: 2.825 MiB, 97.20% compilation time)
250412.59497003

julia> @time mydot(a,b,N)
  0.000728 seconds (1 allocation: 16 bytes)
250412.59497003
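
For illustration, the generic version could look like this (a minimal sketch; the name mydot_generic is arbitrary):

function mydot_generic(a, b, N)
    c = zero(eltype(a))      # accumulator takes the element type of a
    @inbounds @simd for i = 1:N
        c += a[i]*b[i]
    end
    c
end

With Float32 input it accumulates in Float32, and with Float64 input in Float64, so no accidental promotion occurs.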

(unfortunately, -march=native didn't seem to provide the same speed in Fortran; I thought it would).

My point is: these micro-benchmarks are always a waste of resources, because all these languages can get close to the best performance. You just need to know what you are doing. Julia does have a startup/compilation time lag, which is mostly irrelevant for real HPC applications (though it can be annoying in some cases for interactive use, and this is being worked on extensively by the developers).

The point raised by @implicitall illustrates why the Julia community is so eager to show that Julia works: incorrect use or interpretation of simple tests can provide bad first impressions.

Concerning the comment of @shahmoradi: I cannot comment specifically about that video that claims a 3x speedup over Fortran. But I have the same experience with some packages of mine, and I have spent much more time working on the Fortran codes than on the Julia ones. The best performance I have obtained using Julia is because I could improve the algorithms in Julia more easily than I was able to in Fortran, basically because of better tooling. If I port the algorithms back to Fortran I will get similar performance, of course.

And that should not be a surprise here; after all, isn't it to provide better tooling for Fortran that LFortran is being developed? I will not be surprised if in a few years we see posts like "How LFortran helped me to improve the speed of my (Fortran) code 3x". That is what I expect to see, in fact, and this is the reason I am here in the forum: to follow the development of LFortran.

All these languages can generate optimal code. The choice of one over the other should not be made because of the performance of properly written and benchmarked code, because it will be the same. The choice should be about syntax, familiarity, stability, long-term support, distribution, tooling, community, etc.: a lot of subjective reasons, and some objective ones, that favor one or the other.

10 Likes

Thank you all for the interesting discussion. Along the lines of a few comments in this thread, I'm interested in this comparison: take Fortran and Julia code for small yet real-world examples that produce equivalent assembly; then compare the two sources for conciseness, readability (however that is defined), tooling needed to run them, and similar.

2 Likes

Interesting: when I run the code as a script it does not mention compilation time, so I can call it twice. When I call it twice I still get a 7x faster program in Fortran. I don't find that initialization with a float makes much difference. When I run your function with the metaprogramming help, I still get a 2.5x faster result in Fortran. What are your Fortran compiler flags?

1 Like

The experience shared by @Imiq here is another testament to the fact that the Fortran community needs volunteers to begin developing the missing new tools in Fortran on a massive scale, like in the '70s. (I am paying my share by developing tools for which I have the expertise, but a single developer or a handful of developers is nowhere near the sheer number of graduate students who develop almost one Julia package a day, as I see it on GitHub.) Now, to be fair, I don't think the Julia community's approach will be sustainable, and many of these packages will likely be abandoned in the future. But the Fortran community has the discipline and experience to develop a minimal long-lasting ecosystem built upon the existing infrastructure.

I think the Julia approach is sustainable, just like in Python. Some packages will be abandoned, but a package that has users will stay useful, and somebody will maintain it. We need a similar ecosystem in Fortran. Thanks to fpm and the fortran-lang community, it is already happening; we just need to keep the momentum going to grow much bigger.

8 Likes

Compilation time is printed by @time only from Julia 1.6 onwards.

Compilation occurs at the first call of each function in each Julia session. Thus, at every run of the script you will be measuring the compile time (unless you use some tool like this one). Julia is therefore not the most appropriate tool for running very fast scripts multiple times, and a quick program like this will, for a single run, always be slower than a compiled Fortran code.
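
If you want to take compilation out of the measurement entirely, the standard tool is the BenchmarkTools package; a minimal sketch, assuming the mydot defined earlier in the thread:

using BenchmarkTools

N = 1_000_000
a = rand(Float32, N)
b = rand(Float32, N)

# @btime runs the call many times and reports the minimum time,
# so the one-time compilation cost does not enter the measurement
@btime mydot($a, $b, $N)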

Let us see what happens with a much larger N, where compilation time is less important. For N=1_000_000_000 (one billion), for example:

% gfortran -O3 -march=native dot.f90 -o dot
% ./dot
   16777216.0    
Time: 1.23213911056518555 s

and for julia:

% julia dot.jl
  2.366178 seconds (26.00 k allocations: 1.658 MiB, 0.74% compilation time)
1.6777216e7

I don't know why Fortran is faster here, and I just asked that in the Julia forums.

Thus, Fortran wins in this example. Yet, LinearAlgebra.dot is still faster:

  0.450951 seconds (49.83 k allocations: 3.192 MiB, 3.35% compilation time)

Probably if one links BLAS to the Fortran code we would get the same result.
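
On the Julia side, LinearAlgebra.dot for these strided Float32/Float64 vectors dispatches to the BLAS routine under the hood; you can call it directly to check (a minimal sketch):

using LinearAlgebra

a = rand(Float32, 1_000_000)
b = rand(Float32, 1_000_000)

# BLAS.dot takes the length and the strides explicitly; for Float32
# vectors this is the same sdot that dot(a, b) ends up calling
c = LinearAlgebra.BLAS.dot(length(a), a, 1, b, 1)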

That last result can be reproduced in the custom loop with the LoopVectorization package (and that, of course, can be very useful if your loop does something less standard). Compilation time is longer with that (~10 s); thus, if your problem effectively takes 1-2 s to run, that is not a good option either if you are not running it more than once.
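
For reference, the LoopVectorization version of the loop could look like this (a sketch; in older versions of the package the macro was called @avx instead of @turbo):

using LoopVectorization

function mydot_turbo(a, b, N)
    c = zero(eltype(a))
    @turbo for i in 1:N      # replaces @inbounds @simd, with more aggressive vectorization
        c += a[i]*b[i]
    end
    c
end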

Again, I don't think that these micro-benchmarks are important. In both languages it is possible to get close-to-best performance, and in Julia you will always have some initialization/compilation delay (there is PackageCompiler, but I have never tried it; I have not had the need). But you will have to know how to work with the language, and one or the other might be better for the task and deployment you want, and for the workflow you find more comfortable. I have even been saying this to people in the Julia forums: don't expect Julia to be faster than C, C++ or Fortran. But do not expect it to be slower either. If one or the other is much faster, there is certainly something that can be improved on the slower side. Sometimes that optimization feels natural in one language, sometimes in the other.
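
For completeness, the usual PackageCompiler workflow bakes packages into a custom system image so their compilation cost is paid once, at build time (a sketch I have not tested; the file name is arbitrary):

using PackageCompiler

# compile LoopVectorization into a custom system image
create_sysimage(["LoopVectorization"]; sysimage_path="sys_custom.so")

# then start Julia with: julia --sysimage sys_custom.so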

Add-on: as you can see from the two-second answer to my post in the Julia forums, I still sometimes don't readily visualize the full picture of the programming model in Julia, after about 1.5 years programming with it. If it were my code I would have figured that out, because there are some tools to check for that type of drawback; but Fortran is certainly more robust in that regard, there is certainly less room for that kind of problem, which is introduced by the dynamic nature of Julia.
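
The tools I mean are things like @code_warntype, which flags type instabilities; a small constructed example:

# a classic instability: s starts as an Int but becomes a Float64
# inside the loop, so Julia has to carry a Union{Int64, Float64}
function unstable_sum(v)
    s = 0
    for x in v
        s += x
    end
    s
end

# julia> @code_warntype unstable_sum(rand(10))   # highlights the unstable s
# writing s = zero(eltype(v)) instead makes the function type-stable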

2 Likes

Sorry, I wasn't being explicitly clear; what I mean by "call it twice" is this:

function mydot(a, b, N)
    c = 0.f0
    @inbounds @simd for i = 1:N
        c += a[i]*b[i]
    end
    c
end

N = 1000000
a = rand(Float32, N)
b = rand(Float32, N)
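
# first call: @time includes the JIT compilation of mydot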
@time c = mydot(a, b, N)
println(c)

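# second call: mydot is already compiled, so only the run itself is timed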
@time c = mydot(a, b, N)
println(c)

In any case, that is all the time I have for these particular benchmarks. I do think the benchmarks are important and that Julia is slower. I may change my mind later if I ever see any program run faster on my computer.

Well, exactly that code is faster here (the second-time printout) in Julia than in Fortran. But really, the choice of which language to learn should not be based on this kind of benchmark. If you were my student I would tell you to learn both: the programming concepts, tools, and mental models are different, both interesting and complementary. And later you can make a more informed decision about whether to keep using one, the other, or both, for the tasks you feel they are better at.

3 Likes

Yeah, I agree about trying to make an informed decision. So I asked for your compiler flags to try to find out why I do not see better performance in Julia on this exact example or any code in this topic.

I tried: -O3, -O3 -march=native, -O3 -march=native -funroll-loops -fvariable-expansion-in-unroller, but I couldn't notice much of a difference. (Maybe the Julia version?)

Thanks! I narrowed it down to the -ffast-math flag. It gives a huge speedup, taking it from slower to faster. So I have reproduced your results.
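
For anyone comparing: the rough Julia analogue of -ffast-math is the @fastmath macro, which relaxes IEEE semantics for the annotated expressions (a sketch; note that @simd already allows reassociating the reduction, which is the main effect -ffast-math has on this loop):

function mydot_fast(a, b, N)
    c = zero(eltype(a))
    @inbounds @fastmath for i in 1:N
        c += a[i]*b[i]
    end
    c
end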

1 Like

Hi everyone,

One of my colleagues just pointed out this thread to me. I wrote a version in Fortran that matches the performance of the Julia code.

AFAICT the issue is entirely down to the poor intrinsics of gfortran, especially for complex functions. Using Intel MKL's vector math functions (VML) makes up for it (but introduces additional call overhead). Compiling the whole thing with ifort would probably also work.

Here is my code btw


function eikR(a, k, n) result(M)
    implicit none

    integer,            intent(in)  :: n
    real(kind=8),       intent(in)  :: a(n)
    complex(kind=8),    intent(in)  :: k
    complex(kind=8)                 :: M(n,n)

    ! Local
    integer         :: j
    real(kind=8)    :: rtmp(n), sinp(n), cosp(n), tmp_exp(n)
    real(kind=8)    :: a_pow_2(n)

    complex(kind=8) :: imk
    real(kind=8)    :: rimk
    real(kind=8)    :: iimk

    imk  = (0,1) * k
    rimk = real(imk)
    iimk = aimag(imk)   ! aimag is the standard intrinsic; imag is a GNU extension

    a_pow_2 = a * a

    !$OMP PARALLEL DO DEFAULT(private) SHARED(a_pow_2, M, n, rimk, iimk)
    do j = 1, n

        rtmp = a_pow_2 + a_pow_2(j)
        call vdsqrt(n, rtmp, rtmp)              ! MKL VML: rtmp = sqrt(rtmp), in place

        ! real part of the exponent: exp(rimk*r)
        call vdexp(n, rimk * rtmp, tmp_exp)

        ! imaginary part: sin and cos of iimk*r
        call vdsincos(n, iimk * rtmp, sinp, cosp)

        ! assemble output; kind=8 is needed, otherwise cmplx truncates
        ! to default single precision
        M(:, j) = cmplx(tmp_exp * cosp, tmp_exp * sinp, kind=8)
    end do
    !$OMP END PARALLEL DO

end function eikR

4 Likes

@clavigne, thank you for your analysis. Re: your comment about Intel and it being commercial, can you please confirm that you have seen this about Intel oneAPI being free and IFORT being available with it: Intel releases oneAPI Toolkit, free, Fortran 2018

Free, but not free software. I haven't looked at the oneAPI license in detail, but the main reason why I've used gfortran for that kind of mixed Python code in the past is that it makes it really easy to distribute the source to collaborators and have them build binaries if they want.

Talking about binaries, I'd also like to point out that Julia itself is like a 600 MB download, which makes it hard to distribute apps and libraries. For comparison, a statically linked large Fortran and MKL program like @awvwgk's xtb is 15 MB.

4 Likes

I'm not clear on whether the program is written in Fortran. Are you still running the Python program?

Welcome to the discourse, Cyrille, nice to see you around here.

Totally agree on this point; static linking is one of my most loved features of compiled languages (not only Fortran, but C and C++ as well). There are of course some remaining issues there, like GLIBC compatibility and distribution for different platforms/architectures, but static binaries tend to have the majority of users covered.

In C we can now take it a step further with Cosmopolitan; I wonder what it would take to compile the C++ and Fortran runtimes against it and build Actually Portable Executables in those languages as well.

1 Like

I think one thing to consider here is that for Fortran to "win" it can't tie. Most users will prefer a higher-level language if it offers the same performance. As such, I'm not sure Fortran can be improved enough to stop Julia (or Nim, Dex, etc.) from stealing its users.

Yes, I'm compiling with f2py and identical flags except for -lmkl, and running the same Python test.

Earlier in this thread I wrote the equivalent Fortran program for your Julia program.

Wait, sorry, I think there is some confusion. I wrote a Fortran program that matches the performance of the (original) Julia code by calling Intel MKL functions instead of using exp() and sin().

Sorry if it wasn't clear; I'm new to this platform.