Julia: Fast as Fortran, Beautiful as Python

Yeah, if what you’re trying to show is that Julia’s GC is slower than manual memory management, you’re right.

using LoopVectorization
function eval_exp_turbo!(A)
    N = size(A, 1)
    a = range(0, stop=2*pi, length=N)
    # View the complex matrix as a 2×N×N real array so the loop can be vectorized.
    _A = reinterpret(reshape, Float64, A)
    @tturbo for j in 1:N, i in 1:N
        x = sqrt(a[i]^2 + a[j]^2)
        prefac = exp(-x)
        s, c = sincos(100*x)
        _A[1, i, j] = prefac * c  # real part
        _A[2, i, j] = prefac * s  # imaginary part
    end
    return A
end

# Warm up: compile specializations for both a Matrix and a view argument.
eval_exp_turbo!(Matrix{ComplexF64}(undef, 5, 5));
eval_exp_turbo!(view(Matrix{ComplexF64}(undef, 5, 5), 1:5, 1:5));

Threads.nthreads()

# 1) Idiomatic: allocate a fresh GC-managed matrix every iteration.
@time begin
    for N in 1000:1000:10000
        A = Matrix{ComplexF64}(undef, N, N)
        eval_exp_turbo!(A)
    end
end

# 2) Allocate once up front and reuse the buffer through views.
@time begin
    Abig = Matrix{ComplexF64}(undef, 10000, 10000)
    for N in 1000:1000:10000
        A = eval_exp_turbo!(view(Abig, 1:N, 1:N))
    end
end

# 3) Manual memory management: malloc/free every iteration, bypassing the GC.
@time begin
    for N in 1000:1000:10000
        p = Libc.malloc(sizeof(ComplexF64) * N * N)
        A = unsafe_wrap(Array, Ptr{ComplexF64}(p), (N, N))
        eval_exp_turbo!(A)
        Libc.free(p)
    end
end

# 4) Manual memory management with a single buffer reused across iterations.
@time begin
    p = Libc.malloc(sizeof(ComplexF64) * 10_000^2)
    for N in 1000:1000:10000
        A = unsafe_wrap(Array, Ptr{ComplexF64}(p), (N, N))
        eval_exp_turbo!(A)
    end
    Libc.free(p)
end

Yields:

julia> Threads.nthreads()
8

julia> @time begin
       for N in 1000:1000:10000
           A = Matrix{ComplexF64}(undef, N, N)
           eval_exp_turbo!(A)
       end
       end
  0.437918 seconds (48 allocations: 5.737 GiB, 41.34% gc time)

julia> @time begin
       Abig = Matrix{ComplexF64}(undef, 10000, 10000)
       for N in 1000:1000:10000
           A = eval_exp_turbo!(view(Abig, 1:N, 1:N))
       end
       end
  0.264932 seconds (66 allocations: 1.490 GiB, 2.48% gc time)

julia> @time begin
       for N in 1000:1000:10000
           p = Libc.malloc(sizeof(ComplexF64) * N * N)
           A = unsafe_wrap(Array, Ptr{ComplexF64}(p), (N,N))
           eval_exp_turbo!(A)
           Libc.free(p)
       end
       end
  0.216784 seconds (50 allocations: 1.719 KiB)

julia> @time begin
       p = Libc.malloc(sizeof(ComplexF64) * 10_000^2)
       for N in 1000:1000:10000
           A = unsafe_wrap(Array, Ptr{ComplexF64}(p), (N,N))
           eval_exp_turbo!(A)
       end
       Libc.free(p)
       end
  0.232580 seconds (60 allocations: 2.172 KiB)

Out of these four, the first is the most idiomatic, and it will generally be slower than manual memory management in Fortran.
That is a valid point when comparing the languages. It’s also a valid point that Fortran makes large, dynamically sized stack-allocated arrays easy.

It is also worth noting that this can be worked around in Julia when you care. In practice many libraries don’t bother, which makes their development easier at the cost of crippling Julia’s multithreaded performance in many real-world workloads.

But from my perspective, I can’t fix how people manage memory in Julia programs or Julia’s GC from within LoopVectorization.jl, which is why those benchmarks are less interesting to me personally: while relevant and worth addressing, it isn’t something that can be addressed from within the libraries I like to work on.

EDIT:

> tail -n12 fortcomplexexp2.f90
t1 = omp_get_wtime()
allocate(m(10000,10000))
do i = 1, 10
   n = 1000*i
   a = [(2*j*pi/(n - 1), j=0, n - 1)]
   k = cmplx(100.0, 1.0)
   call eikr(a, k, n, m)
enddo
deallocate(m)
t2 = omp_get_wtime()
print *, n, t2 - t1, 'seconds'
end program

> ifort -qopenmp -mkl -fast fortcomplexexp2.f90 -o fortcomplex_v2

> OMP_NUM_THREADS=8 ./fortcomplex_v2
       10000  0.230892181396484      seconds

> OMP_NUM_THREADS=18 ./fortcomplex_v2
       10000  0.212796926498413      seconds

Performance is comparable now with 8 threads, but with 18 threads Julia slowed down from 0.23 s to 0.35 s while Fortran sped up slightly. Hmm.
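(A reproduction note from me, not the original poster: Julia’s thread count is fixed at startup, so rerunning the Julia side with 18 threads means restarting it, e.g.

> JULIA_NUM_THREADS=18 julia

or, on Julia 1.5 and later,

> julia --threads 18

and then confirming with Threads.nthreads().)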

2 Likes

Why is it that large static arrays become prohibitively slow to compile in Julia? Or is this the wrong thing to ask here?

1 Like

Also note that Fortran’s stack arrays are mutable, because there’s no reason they shouldn’t be.

Julia unfortunately doesn’t give you dynamically sized mutable stack-allocated arrays, but you can get statically sized mutable stack-allocated arrays. At some point I’d like to make a PR to base Julia to add support for the alloca intrinsic.
This intrinsic isn’t really accessible from llvmcall, because an alloca’s lifetime is its enclosing function’s scope, and llvmcall works by defining an alwaysinline function. Thus, even after inlining, if you define a function that just calls alloca, that memory has no valid lifetime: using the returned pointer risks aliasing with other allocas, so it doesn’t seem advisable.

If you want statically sized mutable stack-allocated arrays in Julia, use StaticArrays.MArray or StrideArrays.StrideArray.
For these arrays to be stack-allocated instead of heap-allocated, they must not escape the function that allocates them.
StrideArrays.@gc_preserve helps with this: it uses GC.@preserve to ensure the memory remains valid, while replacing the arrays in a function call with PtrArrays that aren’t tracked by the GC, so that the “does not escape” condition is met and the array can be stack-allocated.
If you are not using StrideArrays.@gc_preserve, you’ll need to make sure to (a) not return the array from the function and (b) ensure every function you pass it to gets inlined.
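A minimal sketch of the MArray route (my example, not from the original post; the function name and size are arbitrary):

using StaticArrays

# The MVector is mutable, statically sized, and never escapes this function,
# so the compiler is free to keep it on the stack.
function sum_squares()
    buf = MVector{8,Float64}(undef)
    for i in eachindex(buf)
        buf[i] = i^2
    end
    return sum(buf)  # only the scalar result escapes
end

@allocated sum_squares()  # expected to be 0 after the first (compiling) call

Whether the allocation is actually elided depends on everything inlining and on escape analysis, so it’s worth checking with @allocated on your Julia version.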

Why do static arrays have worse compile times in Julia?
Because they’re based on NTuple{N,T}.
For the mutable kinds (MArray, StrideArray), compile time is actually relatively small, because the NTuple itself is never really touched: data is loaded and stored through pointers, and the NTuple merely exists to reserve the right amount of memory.

StaticArrays.SArray, the statically sized immutable arrays, are another story.
These have bad compile times, because all operations on them must be unrolled.
They do touch the NTuples, and that causes problems: an NTuple in Julia is represented as an LLVM array, and LLVM arrays are really not well suited to what we normally think of as array operations. For one thing, all indexing must use compile-time-constant indices.
Hence everything must be unrolled, and hence nothing scales.
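One way to see the unrolling cost (my illustration, with arbitrary lengths): time the first call at each static length, which is dominated by compiling the fully unrolled code.

using StaticArrays

# First calls compile specialized, fully unrolled code for each static length,
# so the compilation component of @time grows with the length.
@time SVector{10}(1:10) .+ 1;
@time SVector{100}(1:100) .+ 1;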

4 Likes

If you are keen on making comparisons, make them on a consistent basis. Intel MKL performs best with Intel’s toolchain, so anyone comparing should either leave it out entirely (which would make sense) or use it consistently on Intel processors. The same applies to the licensing aspect, or the lack thereof: Intel MKL, IFORT, etc. are all part of the oneAPI toolkits. With oneAPI, Intel has reduced the whole question to a single sentence: “all of the oneAPI Toolkits are available for free download and use for commercial and non-commercial purposes.” This is effectively as permissive as it can get; the past and the future don’t matter here.

Re: “a comprehensive benchmark repository”: yes, that much would be intuitively obvious. It should be equally obvious not to build a blog post around a single questionable case, where it is unclear what is being benchmarked, and then proceed to a “faster than” generalization. Ideally the blog post would be withdrawn immediately.

1 Like

I disagree; it is a question of trade-offs, like most things in life. Some software tools offer advantages that may be worth any actual or perceived disadvantages (such as price or licence).

1 Like

With oneAPI, Intel has reduced the whole question to a single sentence: “all of the oneAPI Toolkits are available for free download and use for commercial and non-commercial purposes.”

You got this wrong. Read my post (now with a corrected title), which links to an Intel fact sheet that says otherwise. Also, I suggest you be skeptical about Intel’s marketing efforts and read the fine-print legalese if you are in doubt.

I disagree; it is a question of trade-offs, like most things in life. Some software tools offer advantages that may be worth any actual or perceived disadvantages (such as price or licence).

Yes, I agree with you. However, isn’t this the forum and community that strives to bring open-source tooling and community culture to Fortran?

1 Like

I think that is a question for others to answer. I did not get that impression. I am here to understand the needs of Fortran practitioners and wherever possible to share knowledge of the state of the art.

4 Likes

We have created fortran-lang.org and this forum as a vendor neutral (i.e. both open source and commercial) place where we can meet and discuss. The Fortran community is a mix of open source and commercial.

My personal goal is to find ways to collaborate (commercial and open source) where it makes sense, to get to know each other, and to help each other out. The Fortran Committee is like that too — it is a mix of commercial, open source, vendors, users, enthusiasts, etc., and it is in all of our interests to rejuvenate Fortran and make it a thriving language.

12 Likes

I guess I’d say that my perspective is that all else equal, I’d much rather have an open source solution. Sure, if there is a high performance proprietary solution that’s better than the open source offerings, I’ll use it and ask for seconds. But I suspect that @martin.d.maas is just expressing that if he’s looking at approximately equal levels of support, performance and features he’d want to choose the open source solution.

I think in modern language development these days, the advantages (and corresponding market share) of proprietary compilers seem to be shrinking. I’m absolutely sure we’ll never see them disappear entirely, but I think the internet has really made it possible for many different entities (including corporate ones) to collaborate on creating state-of-the-art language technologies under permissive licenses.

3 Likes

Yes, the market has changed in the past 30 years. Open source is a great model that, thanks to the internet and modern tooling (git, GitHub, GitLab, …), allows people to collaborate very effectively and to scale with the number of people and companies contributing. But commercial is also a great model, because ultimately you want to figure out how to employ the people working on the tools. So we want to have both.

2 Likes

Ok, that’s good to know.

I am under the impression that what you are looking to create for Fortran (like an ecosystem of open-source packages) depends crucially on building a strong open-source community, though.

I have no doubt that an open-source compiler is what will provide you the best support for open-source developers. For example, gfortran is not only open-source itself, but also the most standard-compliant compiler. That is, code that was written with gfortran in mind will be more easily shared online.

2 Likes

No, please see the table on page 9 of Compiler Support for the Fortran 2008 and 2018 Standards: Version 1 (April 2020). Gfortran and commercial compilers do accept non-standard features, since a great deal of older code relies on them, but compilers do have a strict-standard option, and if such an option is used, code should be portable. I believe a compiler that claims to be an F95/2003/2008/2018 compiler is required to have a mode that rejects non-standard programs.
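For example (flags from the compiler manuals as I recall them; check your version’s documentation): gfortran -std=f2018 makes gfortran reject most non-standard extensions, and ifort -stand f18 makes the Intel compiler diagnose them.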

I agree that gfortran is a crucial resource for Fortran programmers.

2 Likes

I have no idea what data gave you the impression that the second statement is true. The data I am aware of shows it is false.

The third statement is trivially true, if by “more easily shared online” you mean “bug compatible”.

The first statement is also trivially true if open-source developers are restricted to using open-source tools.

For me, the low-content marker was the use of “Fortran” instead of “XYZ Fortran processor”. I move to discourage threads like this on the grounds that they are mostly a waste of time. A more productive presentation would say “I tried this Fortran code with XYZ Fortran processor on ABC machine and I have reason to believe the performance is sub-optimal; how can I do better?”, and the answers could be “change your code”, “use this flag”, or “switch to processor QRS and send an email to XYZ’s support”.

3 Likes

I don’t think this thread or Performance, C vs. Fortran are net negatives for the forum, but I do suggest that threads be titled in a more neutral way, as the other one was. The OP used the title of the original blog post, which was too strong. That blog post is now titled less strongly, Julia: Fast as Fortran, Beautiful as Python, and incorporates some information from this thread.

1 Like

Performance, C vs. Fortran

I agree; I always feel the titles of these kinds of threads tend to be over-generalized,
and something like “Performance, my C vs Fortran codes” would be much better.
Moreover, I am not interested in absolute performance games (the end results are usually similar and not surprising), but rather in effective coding/tuning TIPS for both languages (e.g., yesterday I (re-?)learned that abs() in C is something to be careful about, since it takes an int and floating-point values need fabs()… XD)

3 Likes

I have just changed the title of this thread to conform to the new title of the blog post.

I must be odd: I find Python to be ugly with those damn underscores. Fortran reminds me of Ada – very easy to read.

4 Likes

I have done some simple benchmarking of Python against Fortran, C++ and Java.

The source files can be found at

http://www.rhymneyconsulting.co.uk/python/examples/

c2701.py
c2702.f90
c2703.cxx
c2704.java

A spreadsheet with the timing is also available

27.xls

at the above site in the same directory.

                 Python      Fortran              C++                   Java
                             NAG       Intel     Microsoft  g++
                 3.6.5       6.2       19        17.x       7.3.0      1.8.0_131

Initialisation   12.314372   0.433094  0.589469  0.394777   0.420932   0.609
Summation        14.283259   0.125     0.132813  0.148464   0.158285   0.164
Total            26.597631   0.558094  0.722282  0.543241   0.579217   0.773
Percentage                   2.10%     2.72%     2.04%      2.18%      2.91%

(Times in seconds; the percentage row is each language’s total relative to Python’s.)

2 Likes

Thanks. I think the Fortran main program from your site is

program array_sum
use timing_module
use precision_module, wp => dp
implicit none
integer, parameter :: n = 100000000
real(wp), dimension(n) :: x
real(wp) :: x_sum
  call start_timing()
  x = 1.0_wp
  print *, " array initialisation ", time_difference()
  x_sum = sum(x)
  print *, " array summation      ", time_difference()
  print *, x_sum
  call end_timing()
end program array_sum
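For reference, here is a rough Julia equivalent (my sketch, not from the linked site; timings obviously depend on the machine):

n = 100_000_000

@time x = fill(1.0, n)   # array initialisation
@time x_sum = sum(x)     # array summation
println(x_sum)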

There was a long thread Simple summation 8x slower than in Julia (another thread title to be edited?) where it was found that Julia computed trig functions faster than gfortran.