Julia: Fast as Fortran, Beautiful as Python

Here is an article and the associated Hacker News discussion where they compare Fortran and Julia and concluded that Fortran was slower:

Here is the Fortran benchmark:

Here is the Julia benchmark:

If somebody has time to investigate what is going on, that would be great. For simple benchmarks like this, Fortran should not be slower.

Update 6/27/2021: the blog post title was updated from Julia: Faster than Fortran, cleaner than Numpy to Julia: Fast as Fortran, Beautiful as Python, so I updated the name of this thread also.

1 Like

Here is a preliminary analysis (i_ is an imaginary unit (0,1)):

!$OMP PARALLEL DO PRIVATE(J)
do j=1,n
do i=1,n
    M(i,j) = exp(i_*k*sqrt(a(i)**2 + a(j)**2))
end do
end do
!$OMP END PARALLEL DO

It’s parallelized over j, so for a given j, the work to investigate is:

do i=1,n
    M(i,j) = exp(i_*k*sqrt(a(i)**2 + a(j)**2))
end do

Let’s rewrite the loop to only use real operations:

do i=1,n
    M(i,j) = cos(k*sqrt(a(i)**2 + a(j)**2)) + i_*sin(k*sqrt(a(i)**2 + a(j)**2))
end do

Perhaps let’s write it like this:

X = a(j)**2
do i=1,n
    Y = k*sqrt(a(i)*a(i) + X)
    M(i,j) = (cos(Y), sin(Y))
end do

Now let’s count the (real, i.e. not complex) operations:

+: 1
*: 2
sqrt: 1
sin/cos: 1 (one sin, one cos of the same argument)
R (memory read): 1 (only `a(i)` has to be read from memory in each iteration)
W (memory write): 2 (one complex write into `M(i,j)`, which is two real writes)

Assuming everything is vectorized and the memory reads/writes are from L1 cache, the number of clock cycles per operation, say, on Haswell will be:

+: 0.250
*: 0.125
R: 0.125
W: 0.250
sqrt: 4-7 (?)
sin/cos: ? 

It seems the most expensive operation will be the sqrt and sin/cos, so this will be computation bound.

To obtain the exact theoretical performance peak, we should lookup the costs exactly on more modern hardware, and then try to achieve the peak.

There are already some responses to weaknesses in the Fortran implementation of this benchmark in the Hacker news post like:

This example is not really testing Fortran vs. Julia, but rather testing the speed of the math library you linked to.

Forgive me for mixing subjective opinion with this objective discussion. But from my perspective, I do not find such benchmarks informative at all. I feel like these benchmarks are mostly done by graduate students with a strong passion for Julia and to prove to others that Julia is fast (perhaps it is). But in the end, I see Julia programmers in the Julia forum expressing their struggles with making Julia scripts performant in practice. To be fair, I have never struggled to make any Fortran code performant, it is fast and efficient, by default. In my experience, algorithmic efficiency tends to dominate other sources of runtime efficiency bottlenecks in compiled languages, for most real-world applications. Whether Fortran is 10% faster or slower than C or Julia in particular tasks becomes irrelevant in most cases.

5 Likes

This is an interesting one, actually there is nothing wrong with the Fortran code, but maybe the way it is used. Check the repository and you will notice it is used from Python. Now the question which interested me more than why Julia is faster, is how bad can it be to call Fortran from Python.

I took the original code from here made a trial run on my machine (nothing fancy, Intel Sky Lake i7-6500U CPU @ 2.50GHz):

❯ python test_f2py.py
1000 0.02498769760131836
2000 0.10134744644165039
3000 0.2162013053894043
4000 0.3785884380340576
5000 0.5802724361419678
6000 0.828559398651123
7000 1.1193299293518066
8000 1.4555211067199707
9000 1.8339998722076416
10000 2.9962995052337646

Maybe the implementation was not optimal, taking Ondřej’s suggestions to check if we can improve with this one here:

! test-f2py.f90
module mymodule
  use omp_lib
  implicit none
contains
  function eikR(a,k,n) result(M)
    integer,            intent(in)  :: n
    real(kind=8),       intent(in)  :: a(n)
    complex(kind=8),    intent(in)  :: k
    complex(kind=8)                 :: M(n,n)
    real(8)                         :: Y
    integer                         :: i, j
    !$omp parallel do simd private(J, Y) collapse(2)
    do j=1,n
      do i=1,n
        Y = k*sqrt(a(i)*a(i) + a(j)*a(j))
        M(i,j) = cmplx(cos(Y), sin(Y))
      end do
    end do
  end function eikR
end module

Indeed with a hand optimized implementation we get faster, but unfortunately not by much

❯ python test_f2py.py
1000 0.026028871536254883
2000 0.10151910781860352
3000 0.21696782112121582
4000 0.3690149784088135
5000 0.5648448467254639
6000 0.8063969612121582
7000 1.1044635772705078
8000 1.7926850318908691
9000 2.2729897499084473
10000 2.574355125427246

The actual question is how bad can the f2py wrapping and the usage in Python actually be, here is an equivalent Fortran main

! main.f90
use mymodule
use omp_lib
implicit none
integer, parameter :: dp = selected_real_kind(15)
real(dp), parameter :: pi = 4*atan(1.0_dp)
integer :: i, j, n
real(dp), allocatable :: a(:)
complex(dp) :: k
complex(dp), allocatable :: m(:, :)
real(dp) :: t1, t2
do i = 1, 10
   n = 1000*i
   a = [(2*j*pi, j = 0, n-1)]
   k = cmplx(100.0_dp, 1.0_dp)
   t1 = omp_get_wtime()
   m = eikr(a, k, n)
   t2 = omp_get_wtime()
   print *, n, t2 - t1
end do
end

Using the same compile flags as for the f2py module we get a surprising outcome:

❯ gfortran -O3 -march=native -funroll-loops -fopenmp -ffast-math -floop-nest-optimize test-f2py.f90 main.f90
❯ ./a.out
        1000   2.1003986003051978E-002
        2000   6.6137752000940964E-002
        3000  0.13244638699688949     
        4000  0.24651855599950068     
        5000  0.35877340099978028     
        6000  0.51193849599803798     
        7000  0.69200517200079048     
        8000  0.89750151500629727     
        9000   1.1269864560017595     
       10000   1.3638167019962566

It’s interesting to see that the f2py wrapper costs almost as much as the actual calculation we are performing.

5 Likes

Calling Fortran from Python can be really slow. I have seen it slow down the code by an order of magnitude or more, depending on the frequency and complexity of Fortran-Python interactions.

@certik,

As was shown in your other thread, there is much that can be questioned about such comparisons.

Ok so the previous thread established Julia implementation has somewhat faster trig functions compared to readily-accessible Fortran implementations such as gfortran currently and there are some macros with Julia developed by highly zealous enthusiasts that enable convenient non-sequential calculations (vectorized/threaded/parallel, etc.) but these are hardly aspects of the language itself, especially Fortran.

Thus, not only do the comparisons posted at GitHub by “mdmaas” appear entirely amateurish technically (who in their right mind posts benchmarks of Fortran function evaluation using a NumPy driver!) they have all the subtle elements of bias that can lead to pointless “language wars”.

6 Likes

Excellent illustration, I had the very same concern when I saw the benchmark site for the Fortran test but your post nails down the issue.

1 Like

Indeed, using f2py is a mistake when it comes to performance of such kernels. I noticed such an overhead also for some benchmarks in the past. One has to benchmark from Fortran itself and be careful to obtain accurate timings.

Disclaimer: Fortran programmer and Julia enthusiast here.

That micro benchmark, as many others, are really not very important for Fortran (or C, or C++). People are absolutely convinced that these languages can generate fast code, and except for beginners, it is clear that all of them can be tweaked to have close-to-optimal performance. And algorithmic differences, sometimes some compiler flag, is what differentiate one from the other.

Therefore, I have no doubt that with some small effort we can get a Fortran code to run as fast as the Julia code there.

On the other side, these benchmarks have some importance for Julia. Because not everybody is (yet) convinced that Julia can produce fast-as-possible code. Therefore, in the Julia community, every time a beginner appears with some odd performance issue, the community is rapid in trying to solve and provide the hints on what is going on.

It is also true that to write code in Julia as fast as Fortran code one must know how Julia works, and it is possible to get very slow code (interpreted-language slow) code if one commits some mistakes. Fortran is fast more out of the box, and less open to these mistakes.

Properly written code in any of these languages should be similar in performance. And one or the other can have a lead because of some micro-optimization, which might be easier or necessary in one language or in the other.

Again, reinforcing that for the Julia side showing that this is the case is important and because of that you will find many people putting time to teach others how to do that. On the Fortran side there is no need to do that. Nobody can sanely doubt that Fortran code can be fast.

9 Likes

The post in question at Hacker News does not convey any such relatable notions, rather it comes across as primal in intent and tribal in its pursuit starting with, “I’m switching to Julia …”

Well, I cannot speak for the poster (which declares he is a beginner), but what I see, from the Julia community perspective, which read that with joy, is: Someone tried Julia and obtained a performant code, and decided to seriously give Julia a try.

That is important publicity from the Julia side because it is more common to hear “I tried Julia and got a lousy performance, all those Julia fans are lying.”

Getting a good performance or a lousy performance in Julia without knowing the basics of how Julia works is basically a matter of chance. And that is bad for Julia first impressions.

I can even illustrate that. Get the original code of that post, in Julia:

julia> function eval_exp(N)
           a = range(0, stop=2*pi, length=N)
           A = Matrix{ComplexF64}(undef, N, N)
           @inbounds Threads.@threads for j in 1:N
           for i in 1:N
               A[i,j] = exp((100+im)*im*sqrt(a[i]^2 + a[j]^2))
           end
           end
           return A
       end
eval_exp (generic function with 1 method)

julia> N = 100;

julia> @btime eval_exp($N);
  304.139 μs (8 allocations: 156.92 KiB)

Ok, it takes ~300 microseconds.

Now lets only change one thing: don’t pass N as a parameter to the function:

julia> function eval_exp()
           a = range(0, stop=2*pi, length=N)
           A = Matrix{ComplexF64}(undef, N, N)
           @inbounds Threads.@threads for j in 1:N
           for i in 1:N
               A[i,j] = exp((100+im)*im*sqrt(a[i]^2 + a[j]^2))
           end
           end
           return A
       end
eval_exp (generic function with 2 methods)

julia> @btime eval_exp();
  1.875 ms (100112 allocations: 2.75 MiB)

The code just got 6x slower. (and that is because N is type-unstable).

This is more common as a first impression in Julia, and is very bad for its publicity.

5 Likes

The problem is that such Julia publicity are often accompanied with unfair comparisons with other languages, in particular, Fortran, simply because there are not so many people proficient in Fortran who could question the validity of the claims. I even remember watching a presentation by one of Julia’s founders explaining how a Julia translation of an old Fortran code was 3X faster. I was surprised by the fact that no one in the audience asked whether the authors ever tried to put the same amount of effort into the Fortran code to make it more performant or not. This would be essential for a fair ethical comparison.

6 Likes

As someone just getting into HPC and giving Fortran and Julia a try, I can confirm that I have had the experience of “I tried Julia and got lousy performance”. All I did in my test was use the dot_product/dot intrinsic. I find that a dot product of 1000000 random numbers performs 9x faster in Fortran. I will be continuing down the path of trying to use Fortran for more work.

Julia:

using LinearAlgebra

N = 1000000
a = rand(Float32, N)
b = rand(Float32, N)
@time c = dot(a, b)
println(c)

And Fortran:

real(real32) :: c
integer, parameter :: N = 1000000
real(real32), allocatable :: a(:), b(:)
real(real32) :: tarray(2), time
integer :: i

allocate(a(N), b(N))

call random_number(a)
call random_number(b)

call dtime(tarray, time)
c = dot_product(a, b)
call dtime(tarray, time)
print '(f18.11)', c
print '("Time: ",f19.17," s")', time
6 Likes

Welcome to the Fortran forum :raised_hands:

1 Like

The funny thing is, I wrote a program that has 50x better performance in Fortran without trying to do any sort of trickery or pathological cases. It is just basic code. Is that perhaps the point?

program dot
use, intrinsic :: iso_fortran_env, only: real32
implicit none

real(real32) :: c
integer, parameter :: N = 1000000
real(real32), allocatable :: a(:), b(:)
real(real32) :: time1, time2

allocate(a(N), b(N))

call random_number(a)
call random_number(b)

call cpu_time(time1)
c = mydot(a, b, N)
call cpu_time(time2)
print '(f18.11)', c
print '("Time: ",f19.17," s")', time2 - time1

contains

function mydot(a, b, N) result(c)
   real(real32), intent(in) :: a(:)
   real(real32), intent(in) :: b(:)
   integer, intent(in) :: N
   real(real32) :: c
   integer :: i
   c = 0.
   do i = 1, N
      c = c + a(i)*b(i)
   enddo
endfunction
endprogram
function mydot(a, b, N)
    c = 0.
    for i = 1:N
        c += a[i]*b[i]
    end
    c
end

N = 1000000
a = rand(Float32, N)
b = rand(Float32, N)
@time c = mydot(a, b, N)
println(c)
5 Likes

@implicitall yes, that is the point why I like Fortran also. Others too. Welcome to the Forum!

That being said, I still want Fortran compilers to do a better job on the original benchmark.

3 Likes

My equivalent program runs about 1.2x-1.4x faster (depending on N) in Fortran using gfortran -fopenmp with the best Julia performance with 24 threads.

module omp
interface
   double precision function omp_get_wtime()
   endfunction
endinterface
endmodule

program test
use omp
complex(kind=8), allocatable :: A(:, :)
integer :: N
real(kind=8) :: time1, time2

do N = 1000, 10000, 1000
   time1 = omp_get_wtime()
   A = eval_exp(N)
   time2 = omp_get_wtime()
   print *, time2 - time1
enddo

contains

function eval_exp(n) result(M)
   implicit none

   integer, intent(in) :: n
   complex(kind=8) :: M(n, n)

   ! Local
   real(kind=8) :: a(n)
   complex(kind=8) :: k
   integer :: i, j

   k = (100., 1.)

   do i = 1, n
      a(i) = (i - 1)*2*3.14159/(n - 1)
   enddo

   !$omp PARALLEL DO PRIVATE(J)
   do j = 1, n
      do i = 1, n

         ! A[i,j] = exp((100+im)*im*sqrt(a[i]^2 + a[j]^2))
         M(i, j) = exp(k*(0., 1.)*sqrt(a(i)**2 + a(j)**2))

      enddo
   enddo
   !$omp END PARALLEL DO

endfunction eval_exp

endprogram
2 Likes

The Fortran Wiki has a section Fortran Advocacy and Language Comparisons with several speed comparisons of Fortran, Julia, Pytthon, and Matlab. It was discussed in a recent Hacker News thread Is Julia Really Fast?, with the initial comment

[P]eople do find Julia to be faster than Python/Numpy, but it is not uniformly faster than Fortran. And Julia’s start-up time should not be ignored. Quoting the last link, “In fact the whole Fortran benchmark (300 integrations) finishes roughly in the time it takes to startup a Julia session and import all required libraries (Julia 1.5.1).”

A meta-analysis of when Fortran is faster and slower than Julia would be interesting.

2 Likes

Not sure if this has already been posted, but here’s another speed comparison:

Unfortunately I can’t find all the code for the results. The author has a number of other blog posts about Julia and Fortran which are interesting reads.


Benchmarks aside, I think @implicitall has made an excellent point in that Fortran does not require much specialist knowledge or ‘trickery’ to get performant code. (For me, this frees up mental energy to spend on the actual problem I’m solving.)

4 Likes

Thanks for this @implicitall. Can others replicate this result? If so, we should submit this to the author, as that is quite unfair to be using Python overhead to conclude that Fortran is slower than Julia, if it would be faster than Julia without calling it via Python.

3 Likes