Julia: Fast as Fortran, Beautiful as Python

Yeah… be aware that he, with all respect, is just some random person who posted a personal blog post after trying to play with Julia for the first time two weeks ago. If communities took every comment on the web as representative of anything, everyone would go crazy. In parallel, there are fantastic packages, projects, and science being developed in Julia, Fortran, Python, R, etc… by people who do not even know about the existence of these benchmarks.

4 Likes

@shahmoradi I agree with you Re: Scientific conduct.

Then please write that. I think these would be great questions to ask on the Julia forum. I 100% support constructive criticism, deeper inquiry, and exposing all relevant context. But as a moderator, I ask that we do not make general disparaging statements about other communities on this forum. If you find specific statements that are inaccurate or dubious, please address them specifically.

6 Likes

@shahmoradi I also edited your second post to note that it is your opinion. I 100% agree with Milan.

@FortranFan, the author of this article, @martin.d.maas, posted above with the background. I think it is perfectly ok to post blog posts like that. If anybody is responsible, that would be myself for posting it here, but I think it is ok to look beyond the title and look at the benchmark, and if it shows that Fortran compilers could be improved, then we should improve them.

5 Likes

It is perfectly okay for people to post any scientific blog post they wish, as long as they keep it objective; otherwise it is either non-scientific or scholastic dishonesty. The title of this post says “Faster than Fortran,” while that is apparently not the case with the benchmark presented. An objective title for this post would be something like “Faster than Python mixed with Fortran”.

2 Likes

Yes, I understood @martin.d.maas’s post as non-scientific. It was his personal experience with the languages. Our goal should be to ensure that people’s personal, subjective experience with Fortran is that Fortran is fast. :slight_smile:

2 Likes

Thank you Imiq for the reminder.

I also don’t see blog posts in the same way as peer-reviewed publications. I treat my personal blog somewhat like a Twitter account.

I have apologized for not taking sufficient care with the online etiquette that is required for subjects that can turn into online holy wars. I have edited the title of my post to "My view of Julia: (…)" to better reflect that it is a personal opinion after a personal experiment.

By the way, I had said in the Julia forum (linked to in my post) way before this discussion started here:

I’m starting to wonder whether I’m being unfair on the Fortran side, and maybe there is some overhead in using F2PY vs a standalone code, or maybe it’s auto-parallelization vs OpenMP, or this is an issue of Gfortran vs ifort. Or Julia automatically relying on some optimized math library in the likes of Intel’s MKL vs having to link to MKL manually in gfortran (which I used to do, but I’m on a fresh OS so I haven’t even downloaded Intel’s MKL)…

I am in the process of going through those issues. I’d like to see this as an ongoing collaborative experiment, and I have received very valuable feedback (in the form of pull requests to the benchmark repository, or code posted here).

I will be updating both my website and my GitHub repository as I find out more about Fortran compilers and interoperability with Python, a topic in which I have considerable interest.

I don’t have a PhD in compiler theory; I have one in Applied Math. If any of you would like to share a sophomore-level tutorial on compiler theory, I would be a glad reader. In particular, I would like to know why there is such an overhead when using F2PY, which is supposed to build a C wrapper to a Fortran function using iso_c_binding…

Also, the F2PY performance report (-DF2PY_REPORT_ATEXIT) shows that the routine is spending about 10% of the time in the f2py interface, yet the comparison with the standalone Fortran timing is contradictory: even the time attributed to the wrapped Fortran code alone (1149 msec below) is well above the standalone run. This is for N=10,000.

                  /-----------------------\
                 < F2PY performance report >
                  \-----------------------/

Overall time spent in …
(a) wrapped (Fortran/C) functions : 1149 msec
(b) f2py interface, 1 calls : 142 msec
(c) call-back (Python) functions : 0 msec
(d) f2py call-back interface, 0 calls : 0 msec
(e) wrapped (Fortran/C) functions (actual) : 1149 msec

Standalone Fortran timing for the same case: 0.88 secs
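
To figure out where the extra ~0.27 s goes, one thing I could try is timing inside the Fortran routine itself, so the pure compute time is separated from whatever the wrapper adds (argument copying, array ordering, and so on). A minimal sketch, assuming a module mymodule exposing the eikR subroutine as in my repository (eikr_timed is just a hypothetical name):

! Hypothetical sketch (not in the repository): time inside the Fortran code
! itself with system_clock, so the compute time can be compared with both the
! f2py report above and the standalone run.
subroutine eikr_timed(a, k, n, M)
    use, intrinsic :: iso_fortran_env, only: real64, int64
    use mymodule, only: eikr          ! assumed: the routine being benchmarked
    implicit none
    integer,         intent(in)  :: n
    real(real64),    intent(in)  :: a(n)
    complex(real64), intent(in)  :: k
    complex(real64), intent(out) :: M(n, n)
    integer(int64) :: t0, t1, rate
    call system_clock(t0, rate)
    call eikr(a, k, n, M)
    call system_clock(t1)
    print *, 'time inside Fortran:', real(t1 - t0, real64) / rate, ' seconds'
end subroutine eikr_timed

If the time printed there lands near the standalone 0.88 s, the missing ~0.27 s is in the wrapper or in copying the result back to NumPy; if it lands near 1.15 s, it is in how the extension module gets compiled.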

Sadly, I read that there is no real maintenance going on in the F2PY library. But it is the standard for Fortran interoperability with Python, so this is a concern.

5 Likes

@martin.d.maas I’m likely commenting way beyond my pay grade on many fronts now, so apologies in advance :slight_smile:

My suggestion affects what is being benchmarked, but if one objective is to get closer correspondence between the timing of a standalone Fortran executable calling some routine and the same routine called from Python through F2PY (or rather to understand where the differences originate), it might make sense to pre-allocate the function return variable in Python and time only filling it with values.

Something like:

  M = np.empty((N, N), dtype=np.cdouble, order='F')
  start_time = time.time()
  M[:,:] = mymodule.eikr(a,k,N)
  print(str(time.time() - start_time))

instead of:

  start_time = time.time()
  M = mymodule.eikr(a,k,N)
  print(str(time.time() - start_time))
1 Like

@martin.d.maas thank you for your comment. Why not benchmark the Fortran code directly from Fortran itself? That would be my suggestion; then you avoid all these Python overheads.

We are trying to start improving all these tools around Fortran. It’s on my personal TODO list to help with the Python bindings, also via LFortran (Automatic wrappers Fortran -> Python (#133) · Issues · lfortran / lfortran · GitLab), but I will work on this later; right now I am concentrating on getting LFortran to compile enough Fortran features to make it useful for first users.

1 Like

@melissawm can you chime in on the state of f2py?

Thank you jbdv! That’s a good test. However, I just ran it and saw no difference whatsoever.

Everything I have read online about f2py pointed in the direction that there should not be any overhead.

I agree, and I will contribute my code to the fortran-benchmarks repository once it’s more polished (i.e. once the tests are automated with something better than a shell script and the output is some HTML).

However, right now I am personally only considering Fortran as part of a two-language solution, where the other language is Python. On these grounds, Fortran is not only competing with Julia; it is competing with the likes of numba, pythran, cython… you name it. I’d prefer to use Fortran for many reasons, but those other projects must certainly have gotten the Python-interface thing right.

The GitLab thread you pointed to is very interesting, by the way. So I’m guessing the roadmap here would be to improve gfort2py as much as possible and later incorporate it into LFortran and fpm?

One limitation that I have encountered when resorting to ctypes (as gfort2py is doing) is that there is no built-in support for complex numbers on the Python side yet, even though complex numbers are a part of the C99 standard. See this issue:

https://bugs.python.org/issue16899

so I’m guessing my test wouldn’t run there.
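
If I end up going the ctypes route anyway, one workaround I can think of (just a sketch, with a hypothetical shim name eikr_c, not something gfort2py provides) is a bind(c) wrapper that returns the real and imaginary parts as two separate real(c_double) arrays, so no complex support is needed on the Python side:

! Hypothetical bind(c) shim: split the complex result into two real arrays so
! a ctypes-based caller can avoid complex numbers entirely.
subroutine eikr_c(a, kr, ki, n, Mre, Mim) bind(c, name="eikr_c")
    use, intrinsic :: iso_c_binding, only: c_double, c_int
    use, intrinsic :: iso_fortran_env, only: real64
    use mymodule, only: eikr                  ! assumed: the wrapped routine
    implicit none
    integer(c_int), value, intent(in)  :: n
    real(c_double),        intent(in)  :: a(n)
    real(c_double), value, intent(in)  :: kr, ki   ! real and imaginary parts of k
    real(c_double),        intent(out) :: Mre(n, n), Mim(n, n)
    complex(real64), allocatable :: M(:, :)
    allocate(M(n, n))
    call eikr(a, cmplx(kr, ki, kind=real64), n, M)
    Mre = real(M)
    Mim = aimag(M)
end subroutine eikr_c

On the Python side the matrix could then be reassembled as Mre + 1j*Mim, at the cost of one extra copy.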

2 Likes

As part of a collaborative effort, I combined some of the different code that has been posted, gave it a try, and I see great results.

! test.f90
module mymodule
use, intrinsic :: iso_fortran_env, only: real64
contains
subroutine eikR(a,k,n,M)
    implicit none

    integer,            intent(in)    :: n
    real(real64),       intent(in)    :: a(n)
    complex(real64),    intent(in)    :: k
    complex(real64),    intent(out)   :: M(n,n)

    ! Local
    integer                         :: i, j
    real(real64)  :: rtmp(n), sinp(n), cosp(n), tmp_exp(n)
    real(real64)  :: itmp(n), a_pow_2(n)

    complex(real64) :: imk
    real(real64) :: rimk
    real(real64) :: iimk, tmp

    imk = (0,1) * k
    rimk = real(imk)
    iimk = aimag(imk)

    a_pow_2 = a * a
  
    !$OMP PARALLEL DO DEFAULT(private) SHARED(a_pow_2, M, n, rimk,iimk)
    do j=1,n
    
    rtmp = a_pow_2 + a_pow_2(j)
    ! MKL VML vectorized routines: vdsqrt, vdexp, vdsincos
    call vdsqrt(n, rtmp, rtmp)

    ! real part
    call vdexp(n, rimk * rtmp, tmp_exp)

    ! imag part
    call vdsincos(n, iimk * rtmp, sinp, cosp)

    ! assemble output
    M(:, j) = cmplx(tmp_exp * cosp, tmp_exp * sinp, kind=real64)
    end do
    !$OMP END PARALLEL DO

end subroutine eikR
end module

program test
use, intrinsic :: iso_fortran_env, only: real64
use mymodule
use omp_lib
implicit none
real(real64), parameter :: pi = 4*atan(1.0_real64)
integer :: i, j, n
real(real64), allocatable :: a(:)
complex(real64) :: k
complex(real64), allocatable :: m(:, :)
real(real64) :: t1, t2
do i = 1, 10
   n = 1000*i
   if(allocated(m)) deallocate(m)
   allocate(m(n, n))
   t1 = omp_get_wtime()
   a = [(2*j*pi/(n - 1), j=0, n - 1)]
   k = cmplx(100.0, 1.0)
   call eikr(a, k, n, m)
   t2 = omp_get_wtime()
   print *, n, t2 - t1, 'seconds'
enddo
end program
ifort -qopenmp -mkl test.f90 && ./a.out
        1000  5.237913131713867E-002 seconds
        2000  6.802082061767578E-003 seconds
        3000  1.521897315979004E-002 seconds
        4000  3.453397750854492E-002 seconds
        5000  3.808498382568359E-002 seconds
        6000  4.601097106933594E-002 seconds
        7000  6.288909912109375E-002 seconds
        8000  6.456899642944336E-002 seconds
        9000  8.035993576049805E-002 seconds
       10000  9.922003746032715E-002 seconds
# test_loopvec.jl
using LoopVectorization

function eval_exp_turbo(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)
    _A = reinterpret(reshape, Float64, A)
    @tturbo  for j in 1:N, i in 1:N
        x = sqrt(a[i]^2 + a[j]^2)
        prefac = exp(-x)
        s, c = sincos(100*x)

        _A[1,i,j] = prefac * c
        _A[2,i,j] = prefac * s
    end
    return A
end

eval_exp_turbo(5)
print(string("running loop on ", Threads.nthreads(), " threads \n"))
for N in 1000:1000:10000
    @time begin
    A = eval_exp_turbo(N)
    end
end
~/julia-1.6.1/bin/julia -t12 test_loopvec.jl 
running loop on 12 threads 
  0.001817 seconds (2 allocations: 15.259 MiB)
  0.032877 seconds (11 allocations: 61.036 MiB, 37.83% gc time)
  0.045050 seconds (13 allocations: 137.330 MiB, 19.86% gc time)
  0.057736 seconds (13 allocations: 244.141 MiB, 46.40% gc time)
  0.100725 seconds (13 allocations: 381.470 MiB, 62.53% gc time)
  0.102150 seconds (13 allocations: 549.317 MiB, 53.63% gc time)
  0.073573 seconds (13 allocations: 747.681 MiB, 20.40% gc time)
  0.090383 seconds (13 allocations: 976.563 MiB, 22.08% gc time)
  0.111058 seconds (13 allocations: 1.207 GiB, 23.14% gc time)
  0.134160 seconds (13 allocations: 1.490 GiB, 24.33% gc time)
5 Likes

I’m assuming that’s my original unoptimized Julia code that you are testing.

I just managed to link again to MKL, and I get slightly faster Fortran code now (standalone gfortran + MKL).

sh run_julia.sh 
running loop on 8 threads 
  0.004717 seconds (2 allocations: 15.259 MiB)
  0.030413 seconds (3 allocations: 61.035 MiB, 38.20% gc time)
  0.063101 seconds (6 allocations: 137.329 MiB, 22.85% gc time)
  0.113589 seconds (3 allocations: 244.141 MiB, 40.97% gc time)
  0.204455 seconds (5 allocations: 381.470 MiB, 56.04% gc time)
  0.222886 seconds (5 allocations: 549.317 MiB, 40.11% gc time)
  0.209522 seconds (5 allocations: 747.681 MiB, 15.50% gc time)
  0.279210 seconds (5 allocations: 976.563 MiB, 13.38% gc time)
  0.324246 seconds (5 allocations: 1.207 GiB, 13.93% gc time)
  0.403646 seconds (5 allocations: 1.490 GiB, 13.55% gc time)

./f_mkl
        1000   1.0492420988157392E-002 seconds
        2000   1.4469360990915447E-002 seconds
        3000   3.1452757015358657E-002 seconds
        4000   5.4610630031675100E-002 seconds
        5000   8.2629415963310748E-002 seconds
        6000  0.11908876401139423      seconds
        7000  0.16488010797183961      seconds
        8000  0.22062794101657346      seconds
        9000  0.28234076703665778      seconds
       10000  0.35623603704152629      seconds
2 Likes

What LFortran can provide is the symbolic representation of the code and semantics, so it’s much easier to write tools like gfort2py. Once it’s easier to write tools like that, I think the community will help improve all these tools.

3 Likes

edit again: it was fine (I had forgotten to start Julia with “-t 8”)

The tturbo version below runs roughly twice as fast as the original code, and should be similar to the Fortran version above. Still, I think neither of these is a version that a regular user would normally write, unless this turned out to be the ultimate bottleneck of a code. I think your original version is quite fine, except that I would probably allocate the matrix outside the function, which gives a small gain in performance and avoids all those allocations.

Code
julia> function eval_exp0(N)
           a = range(0, stop=2*pi, length=N)
           A = Matrix{ComplexF64}(undef, N, N)

           @inbounds Threads.@threads for j in 1:N
           for i in 1:N
               A[i,j] = exp((100+im)*im*sqrt(a[i]^2 + a[j]^2))
           end
           end

           return A

       end
eval_exp0 (generic function with 1 method)

julia> using LoopVectorization
       function eval_exp_turbo(N)
           a = range(0, stop=2*pi, length=N)
           A = Matrix{ComplexF64}(undef, N, N)
           _A = reinterpret(reshape, Float64, A)
           @tturbo  for j in 1:N, i in 1:N
               x = sqrt(a[i]^2 + a[j]^2)
               prefac = exp(-x)
               s, c = sincos(100*x)

               _A[1,i,j] = prefac * c
               _A[2,i,j] = prefac * s
           end
           return A
       end
eval_exp_turbo (generic function with 1 method)

julia> for N in 1000:1000:10000
                  @time begin
                  A = eval_exp0(N)
                  end
              end
  0.236365 seconds (457.42 k allocations: 41.644 MiB, 96.15% compilation time)
  0.043115 seconds (45 allocations: 61.039 MiB, 19.41% gc time)
  0.104456 seconds (45 allocations: 137.333 MiB, 24.52% gc time)
  0.238770 seconds (44 allocations: 244.144 MiB, 41.50% gc time)
  0.285508 seconds (47 allocations: 381.474 MiB, 28.06% gc time)
  0.330531 seconds (45 allocations: 549.320 MiB, 0.41% gc time)
  0.448299 seconds (46 allocations: 747.684 MiB, 0.38% gc time)
  0.559138 seconds (44 allocations: 976.566 MiB, 0.38% gc time)
  0.630563 seconds (45 allocations: 1.207 GiB, 2.85% gc time)
  0.750414 seconds (43 allocations: 1.490 GiB, 13.01% gc time)

julia> for N in 1000:1000:10000
                  @time begin
                  A = eval_exp_turbo(N)
                  end
              end
  0.037624 seconds (114.94 k allocations: 22.580 MiB)
  0.012085 seconds (2 allocations: 61.035 MiB)
  0.026431 seconds (9 allocations: 137.329 MiB)
  0.047758 seconds (9 allocations: 244.141 MiB)
  0.125676 seconds (9 allocations: 381.470 MiB, 42.38% gc time)
  0.105112 seconds (9 allocations: 549.317 MiB)
  0.155813 seconds (9 allocations: 747.681 MiB, 9.17% gc time)
  0.196100 seconds (9 allocations: 976.563 MiB, 4.57% gc time)
  0.246955 seconds (9 allocations: 1.207 GiB, 1.97% gc time)
  0.308183 seconds (9 allocations: 1.490 GiB, 1.84% gc time)

2 Likes

IIRC, per prior feedback from Intel support and comments by Intel users on its forums, gfortran + Intel MKL + OpenMP PARALLEL DO is suboptimal.

Given that Intel now releases the oneAPI Toolkit for free, with Fortran 2018 support, @implicitall 's use of ifort -qopenmp -mkl test.f90 && ./a.out is what makes sense if using the code posted in @implicitall 's analysis upthread, and not gfortran + MKL.

1 Like

Intel’s oneAPI is not entirely free: you have to pay for a certain “concurrent license”, which is what you might need for some use cases (I suspect related to cloud usage by third parties, but I haven’t checked). It is free right now, but I remember that many years ago I had a free Intel licence (for non-commercial or research use) that got discontinued, until this new non-concurrent license appeared. So you can’t know for how long this will remain free, or how long before we all fall under the “concurrent” use case. A permissive license like the one offered by gfortran or Julia (MKL is in a grey area) should always be preferred.

I think all compiler/wrapper comparisons are valid, and I plan to go along with what Certik seems to be suggesting, which is to have a comprehensive benchmark repository, rather than relying on forum comments or on what people from Intel support are claiming.

I will also add a test using oneAPI once I finish installing it, but it should be clarified that this is a closed-source compiler with a restrictive license.

I would dare to say that the Fortran libraries that are most widely used today (LAPACK comes to mind) have been compiled with gfortran and maybe even wrapped by f2py.

Hi everyone, author of LoopVectorization.jl here. A few comments:

  1. LoopVectorization.jl is bottlenecked by memory bandwidth on most computers. This requires more accurate timing at smaller sizes to observe.
  2. Given that, as long as MKL’s SIMD implementations of exp and sincos are also fast enough to exceed the computer’s memory bandwidth, this benchmark won’t compare the performance of those function implementations. It won’t be measuring much other than the CPU’s ability to churn through memory, and perhaps the efficacy of the language’s memory management and array-level optimizations.
  3. Note that Julia’s GC is contributing to the @time results, while the Fortran code was written with allocate and deallocate outside the begin/end timings. I assume the manual calls to allocate and deallocate are faster than Julia’s GC.

Timings with @btime to show decreased FLOPS with increased problem size:

julia> for N in 1000:1000:10_000
           @btime begin
           A = eval_exp_turbo($N)
           end
       end
  248.607 μs (2 allocations: 15.26 MiB)
  2.977 ms (2 allocations: 61.04 MiB)
  7.031 ms (2 allocations: 137.33 MiB)
  12.811 ms (2 allocations: 244.14 MiB)
  20.509 ms (2 allocations: 381.47 MiB)
  29.615 ms (2 allocations: 549.32 MiB)
  40.767 ms (2 allocations: 747.68 MiB)
  53.211 ms (2 allocations: 976.56 MiB)
  67.617 ms (2 allocations: 1.21 GiB)
  85.009 ms (2 allocations: 1.49 GiB)

julia> Threads.nthreads()
18

julia> 85.009e-3 / 248.607e-6
341.9412969063623

340x slowdown for 100x the work.

Note the extreme performance improvement I get in Julia from preallocating the memory, letting it be hot in cache:

julia> function eval_exp_turbo!(A)
           N = size(A,1);
           a = range(0, stop=2*pi, length=N)
           _A = reinterpret(reshape, Float64, A)
           @tturbo  for j in 1:N, i in 1:N
               x = sqrt(a[i]^2 + a[j]^2)
               prefac = exp(-x)
               s, c = sincos(100*x)

               _A[1,i,j] = prefac * c
               _A[2,i,j] = prefac * s
           end
           return A
       end
eval_exp_turbo! (generic function with 1 method)

julia> for N in 1000:1000:10000
           A = Matrix{ComplexF64}(undef, N, N)
           @btime begin
           eval_exp_turbo!($A)
           end
       end
  153.820 μs (0 allocations: 0 bytes)
  1.067 ms (0 allocations: 0 bytes)
  3.423 ms (0 allocations: 0 bytes)
  6.692 ms (0 allocations: 0 bytes)
  11.054 ms (0 allocations: 0 bytes)
  16.176 ms (0 allocations: 0 bytes)
  22.101 ms (0 allocations: 0 bytes)
  28.599 ms (0 allocations: 0 bytes)
  36.822 ms (0 allocations: 0 bytes)
  45.961 ms (0 allocations: 0 bytes)

julia> 45.961e-3 / 153.82e-6
298.79729554024186

I’m not making a Julia vs Fortran statement here. I’m just saying that this is a benchmark of memory bandwidth, not of computation, and I bet the gfortran + MKL or ifort-compiled version of the code would show the same substantial performance improvement from reusing memory that is hot in cache.
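
For reference, here is a rough (untested) sketch of what I mean on the Fortran side, reusing the mymodule/eikR code posted upthread: allocate and touch M once, then time only the eikR calls, analogously to eval_exp_turbo!:

! Hypothetical sketch: preallocate M and touch it once so the memory is hot,
! then time only the computation, analogous to the eval_exp_turbo! benchmark.
program bench_hot
    use, intrinsic :: iso_fortran_env, only: real64
    use mymodule              ! assumed: the module with eikR from the earlier post
    use omp_lib
    implicit none
    real(real64), parameter :: pi = 4*atan(1.0_real64)
    integer :: j, n, rep
    real(real64), allocatable :: a(:)
    complex(real64), allocatable :: m(:, :)
    real(real64) :: t1, t2
    n = 10000
    allocate(m(n, n))
    m = (1.0_real64, 1.0_real64)      ! single-threaded first touch / warm-up
    a = [(2*j*pi/(n - 1), j=0, n - 1)]
    do rep = 1, 3                     ! repeat so the effect of reuse is visible
        t1 = omp_get_wtime()
        call eikr(a, (100.0_real64, 1.0_real64), n, m)
        t2 = omp_get_wtime()
        print *, n, t2 - t1, 'seconds'
    end do
end program bench_hot

With something like that, the comparison would be apples to apples with the preallocated Julia version.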

EDIT: version using @time instead:

julia> for N in 1000:1000:10000
           A = Matrix{ComplexF64}(undef, N, N); A .= 1
           @time begin
           eval_exp_turbo!(A)
           end
       end
  0.000628 seconds
  0.001533 seconds
  0.003676 seconds
  0.006812 seconds
  0.011264 seconds
  0.016306 seconds
  0.022359 seconds
  0.028936 seconds
  0.036775 seconds
  0.046169 seconds

This helps even if we do a single-threaded dummy assignment before the multithreaded computation. If we remove the .= 1, performance regresses to match the non-preallocated version.

It could also show how good the language is at reclaiming and reusing hot memory after it is no longer referenced (in Julia) or deallocated (in Fortran), but it seems they’re more or less the same there.

EDIT:
Other than deleting the first julia> for N in 1000:1000:10_000 line, I don’t know how to fix the syntax highlighting of my first Julia code block.

EDIT: Benchmarking simply filling A with 1 + 1im:

julia> for N in 1000:1000:10000
           A = Matrix{ComplexF64}(undef, N, N);
           @time begin
               A .= 1 + 1im
           end
       end
  0.003449 seconds
  0.010881 seconds
  0.021526 seconds
  0.034084 seconds
  0.048245 seconds
  0.065157 seconds
  0.088886 seconds
  0.115866 seconds
  0.146583 seconds
  0.180962 seconds

julia> for N in 1000:1000:10000
           A = Matrix{ComplexF64}(undef, N, N);
           @time begin
               @tturbo reinterpret(Float64, A) .= 1
           end
       end
  0.000752 seconds
  0.002465 seconds (9 allocations: 288 bytes)
  0.005445 seconds (17 allocations: 544 bytes)
  0.008501 seconds (17 allocations: 544 bytes)
  0.013545 seconds (17 allocations: 544 bytes)
  0.018184 seconds (17 allocations: 544 bytes)
  0.026851 seconds (17 allocations: 544 bytes)
  0.034388 seconds (17 allocations: 544 bytes)
  0.044439 seconds (17 allocations: 544 bytes)
  0.053580 seconds (17 allocations: 544 bytes)

julia> for N in 1000:1000:10000
           A = Matrix{ComplexF64}(undef, N, N); A .= 1
           @time begin
               @tturbo reinterpret(Float64, A) .= 1
           end
       end
  0.000247 seconds
  0.001496 seconds
  0.003446 seconds
  0.006456 seconds
  0.010281 seconds
  0.014997 seconds
  0.020578 seconds
  0.026987 seconds
  0.034338 seconds
  0.042368 seconds
3 Likes

Rerunning the benchmarks with 8 threads instead of 18:

julia> using LoopVectorization

julia> for N in 1000:1000:10000
           A = Matrix{ComplexF64}(undef, N, N);
           @time begin
               A .= 1 + 1im
           end
       end
  0.002275 seconds
  0.007359 seconds
  0.016423 seconds
  0.028061 seconds
  0.043624 seconds
  0.062537 seconds
  0.085763 seconds
  0.112238 seconds
  0.141690 seconds
  0.174704 seconds

julia> for N in 1000:1000:10000
           A = Matrix{ComplexF64}(undef, N, N);
           @time begin
               @tturbo reinterpret(Float64, A) .= 1
           end
       end
  0.014089 seconds (114.93 k allocations: 7.320 MiB)
  0.001885 seconds
  0.003939 seconds
  0.007085 seconds
  0.010916 seconds
  0.015616 seconds
  0.020996 seconds (7 allocations: 224 bytes)
  0.027593 seconds
  0.034584 seconds (7 allocations: 224 bytes)
  0.042716 seconds (7 allocations: 224 bytes)

julia> for N in 1000:1000:10000
           A = Matrix{ComplexF64}(undef, N, N); A .= 1
           @time begin
               @tturbo reinterpret(Float64, A) .= 1
           end
       end
  0.000165 seconds
  0.001266 seconds
  0.003278 seconds
  0.006302 seconds
  0.009963 seconds
  0.014733 seconds
  0.020239 seconds
  0.026483 seconds
  0.033647 seconds
  0.041608 seconds

julia> function eval_exp_turbo(N)
           a = range(0, stop=2*pi, length=N)
           A = Matrix{ComplexF64}(undef, N, N)
           _A = reinterpret(reshape, Float64, A)
           @tturbo  for j in 1:N, i in 1:N
               x = sqrt(a[i]^2 + a[j]^2)
               prefac = exp(-x)
               s, c = sincos(100*x)

               _A[1,i,j] = prefac * c
               _A[2,i,j] = prefac * s
           end
           return A
       end
eval_exp_turbo (generic function with 1 method)

julia> eval_exp_turbo(5)
5×5 Matrix{ComplexF64}:
        1.0+0.0im             0.20788+2.04178e-16im    0.0432139+8.48888e-17im   0.00898329-2.28851e-16im    0.00186744+7.33676e-18im
    0.20788+2.04178e-16im  -0.0666538+0.0855526im      0.0243148-0.0172721im     0.00652112+0.00243801im     0.00135956+0.000721391im
  0.0432139+8.48888e-17im   0.0243148-0.0172721im    -0.00287652-0.0114048im     0.00223229+0.00265671im    0.000292883-0.000839933im
 0.00898329-2.28851e-16im  0.00652112+0.00243801im    0.00223229+0.00265671im    0.00116744+0.000514081im   0.000388203-3.11937e-17im
 0.00186744+7.33676e-18im  0.00135956+0.000721391im  0.000292883-0.000839933im  0.000388203-3.11937e-17im  -0.000121795+6.56123e-5im

julia> for N in 1000:1000:10_000
           @btime begin
           A = eval_exp_turbo($N)
           end
       end
  345.673 μs (2 allocations: 15.26 MiB)
  2.587 ms (2 allocations: 61.04 MiB)
  4.672 ms (2 allocations: 137.33 MiB)
  8.199 ms (2 allocations: 244.14 MiB)
  13.143 ms (2 allocations: 381.47 MiB)
  18.763 ms (2 allocations: 549.32 MiB)
  26.087 ms (2 allocations: 747.68 MiB)
  34.143 ms (2 allocations: 976.56 MiB)
  43.368 ms (2 allocations: 1.21 GiB)
  55.087 ms (2 allocations: 1.49 GiB)

julia> Threads.nthreads()
8

julia> function eval_exp_turbo!(A)
           N = size(A,1);
           a = range(0, stop=2*pi, length=N)
           _A = reinterpret(reshape, Float64, A)
           @tturbo  for j in 1:N, i in 1:N
               x = sqrt(a[i]^2 + a[j]^2)
               prefac = exp(-x)
               s, c = sincos(100*x)

               _A[1,i,j] = prefac * c
               _A[2,i,j] = prefac * s
           end
           return A
       end
eval_exp_turbo! (generic function with 1 method)

julia> for N in 1000:1000:10000
           A = Matrix{ComplexF64}(undef, N, N)
           @btime begin
           eval_exp_turbo!($A)
           end
       end
  349.395 μs (0 allocations: 0 bytes)
  1.383 ms (0 allocations: 0 bytes)
  3.334 ms (0 allocations: 0 bytes)
  6.552 ms (0 allocations: 0 bytes)
  10.416 ms (0 allocations: 0 bytes)
  15.214 ms (0 allocations: 0 bytes)
  20.839 ms (0 allocations: 0 bytes)
  27.589 ms (0 allocations: 0 bytes)
  34.480 ms (0 allocations: 0 bytes)
  43.248 ms (0 allocations: 0 bytes)

Performance is the same or even better than before.

Cutting down to just 4 threads, however, I start to see a substantial slowdown for eval_exp_turbo! (with hot memory), but only a small difference for eval_exp_turbo, while filling with 1 + 1im remains unchanged:

julia> using LoopVectorization

julia> for N in 1000:1000:10000
           A = Matrix{ComplexF64}(undef, N, N);
           @time begin
               A .= 1 + 1im
           end
       end
  0.001972 seconds
  0.007628 seconds
  0.016110 seconds
  0.028039 seconds
  0.043459 seconds
  0.062864 seconds
  0.086007 seconds
  0.112026 seconds
  0.141906 seconds
  0.175089 seconds

julia> for N in 1000:1000:10000
           A = Matrix{ComplexF64}(undef, N, N);
           @time begin
               @tturbo reinterpret(Float64, A) .= 1
           end
       end
  0.010743 seconds (56.12 k allocations: 3.589 MiB)
  0.001756 seconds
  0.003951 seconds
  0.006726 seconds
  0.010436 seconds
  0.014459 seconds
  0.020239 seconds (3 allocations: 96 bytes)
  0.026039 seconds
  0.033012 seconds (3 allocations: 96 bytes)
  0.039655 seconds (3 allocations: 96 bytes)

julia> for N in 1000:1000:10000
           A = Matrix{ComplexF64}(undef, N, N); A .= 1
           @time begin
               @tturbo reinterpret(Float64, A) .= 1
           end
       end
  0.000175 seconds
  0.001333 seconds
  0.003516 seconds
  0.006543 seconds
  0.010540 seconds
  0.015262 seconds
  0.020858 seconds
  0.027466 seconds
  0.034755 seconds
  0.043039 seconds

julia> function eval_exp_turbo(N)
           a = range(0, stop=2*pi, length=N)
           A = Matrix{ComplexF64}(undef, N, N)
           _A = reinterpret(reshape, Float64, A)
           @tturbo  for j in 1:N, i in 1:N
               x = sqrt(a[i]^2 + a[j]^2)
               prefac = exp(-x)
               s, c = sincos(100*x)

               _A[1,i,j] = prefac * c
               _A[2,i,j] = prefac * s
           end
           return A
       end
eval_exp_turbo (generic function with 1 method)

julia> eval_exp_turbo(5)
5×5 Matrix{ComplexF64}:
        1.0+0.0im             0.20788+2.04178e-16im    0.0432139+8.48888e-17im   0.00898329-2.28851e-16im    0.00186744+7.33676e-18im
    0.20788+2.04178e-16im  -0.0666538+0.0855526im      0.0243148-0.0172721im     0.00652112+0.00243801im     0.00135956+0.000721391im
  0.0432139+8.48888e-17im   0.0243148-0.0172721im    -0.00287652-0.0114048im     0.00223229+0.00265671im    0.000292883-0.000839933im
 0.00898329-2.28851e-16im  0.00652112+0.00243801im    0.00223229+0.00265671im    0.00116744+0.000514081im   0.000388203-3.11937e-17im
 0.00186744+7.33676e-18im  0.00135956+0.000721391im  0.000292883-0.000839933im  0.000388203-3.11937e-17im  -0.000121795+6.56123e-5im

julia> for N in 1000:1000:10_000
           @btime begin
           A = eval_exp_turbo($N)
           end
       end
  681.719 μs (2 allocations: 15.26 MiB)
  3.785 ms (2 allocations: 61.04 MiB)
  8.502 ms (2 allocations: 137.33 MiB)
  14.950 ms (2 allocations: 244.14 MiB)
  23.275 ms (2 allocations: 381.47 MiB)
  33.297 ms (2 allocations: 549.32 MiB)
  45.427 ms (2 allocations: 747.68 MiB)
  58.850 ms (2 allocations: 976.56 MiB)
  74.439 ms (2 allocations: 1.21 GiB)
  91.285 ms (2 allocations: 1.49 GiB)

julia> Threads.nthreads()
4

julia> function eval_exp_turbo!(A)
           N = size(A,1);
           a = range(0, stop=2*pi, length=N)
           _A = reinterpret(reshape, Float64, A)
           @tturbo  for j in 1:N, i in 1:N
               x = sqrt(a[i]^2 + a[j]^2)
               prefac = exp(-x)
               s, c = sincos(100*x)

               _A[1,i,j] = prefac * c
               _A[2,i,j] = prefac * s
           end
           return A
       end
eval_exp_turbo! (generic function with 1 method)

julia> for N in 1000:1000:10000
           A = Matrix{ComplexF64}(undef, N, N)
           @btime begin
           eval_exp_turbo!($A)
           end
       end
  682.330 μs (0 allocations: 0 bytes)
  2.733 ms (0 allocations: 0 bytes)
  6.148 ms (0 allocations: 0 bytes)
  11.020 ms (0 allocations: 0 bytes)
  17.214 ms (0 allocations: 0 bytes)
  24.778 ms (0 allocations: 0 bytes)
  33.800 ms (0 allocations: 0 bytes)
  44.228 ms (0 allocations: 0 bytes)
  56.712 ms (0 allocations: 0 bytes)
  69.088 ms (0 allocations: 0 bytes)

So this raises the question: what do we actually want to benchmark? What comparisons do we actually want to draw? Presumably, something that’s actionable.

4 Likes

That’s a great point, because you can run the test like this:

t1 = omp_get_wtime()
do i = 1, 10
   n = 1000*i
   if(allocated(m)) deallocate(m)
   allocate(m(n, n))
   a = [(2*j*pi/(n - 1), j=0, n - 1)]
   k = cmplx(100.0, 1.0)
   call eikr(a, k, n, m)
enddo
t2 = omp_get_wtime()
print *, t2 - t1, 'seconds'
ifort -qopenmp -mkl test.f90 && ./a.out
  0.604980945587158      seconds
==24665== HEAP SUMMARY:
==24665==     in use at exit: 2,994 bytes in 10 blocks
==24665==   total heap usage: 165 allocs, 155 frees, 112,309 bytes allocated
==24665== 
==24665== LEAK SUMMARY:
==24665==    definitely lost: 0 bytes in 0 blocks
==24665==    indirectly lost: 0 bytes in 0 blocks
==24665==      possibly lost: 0 bytes in 0 blocks
==24665==    still reachable: 2,994 bytes in 10 blocks
==24665==         suppressed: 0 bytes in 0 blocks
@time begin
for N in 1000:1000:10000
    A = Matrix{ComplexF64}(undef, N, N)
    eval_exp_turbo!(A)
end
end
~/julia-1.6.1/bin/julia -t12 test_loopvec.jl
running loop on 12 threads 
  0.738399 seconds (119 allocations: 5.737 GiB, 31.02% gc time)
2 Likes