Simple summation 8x slower than in Julia

I agree with the two examples you gave, but they are both functions with a very cleanly defined API that needs nothing from the host language. I’m not sure you can do the same with vectorization, for instance, which relies on “seeing” behind the function; e.g., LoopVectorization looks much more tied to the language than BLAS is. The same goes for several nice things going on in the Julia ecosystem (e.g. AD, GPUs, symbolic tracing, etc.). I do agree that tying libraries to a language is unfortunate, but it appears to be necessary in at least some cases.

1 Like

Though my experience with Julia is very limited, I guess a key difference here is that it allows low-level code transformations directly, using a set of built-in features of the language (e.g. symbols, expressions, and macros), making it possible to modify user code “on the fly” as desired. I think this is rather different from calling external libraries with a fixed API (which I guess is what antoine-levitt mentions above). In my understanding, it is like providing the ability to “extend” the language from the user side in various ways… (In this sense, it might be related to metaprogramming in other statically typed languages, but I guess the dynamic nature of Julia makes it more flexible.)
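As a toy illustration of that “extend the language” idea (my own minimal example, not from the posts above): a Julia macro receives the user’s code as an expression and can rewrite it before it is compiled, which is the same mechanism LoopVectorization’s @avx uses on whole loops.

# A macro gets the code as data (an Expr) and returns a rewritten Expr.
macro twice(ex)
    return :( $(esc(ex)) + $(esc(ex)) )   # rewrite `ex` into `ex + ex`
end

println(@twice(3 + 4))                # 14, i.e. (3 + 4) + (3 + 4)
println(@macroexpand @twice(3 + 4))   # prints the rewritten code: (3 + 4) + (3 + 4)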

3 Likes

RE “summation of sin(x) is not a useful benchmark”, I think the problem can quickly become more “practical” by making the function more complicated (for example, even the summation of abs(sin(i)) for i = 1…N does not have a closed-form solution?). I think one common scenario for this kind of summation is the evaluation of quadratures (or integrals) in several dimensions. A long time ago, I encountered one such integral that was difficult to evaluate accurately with Monte Carlo (because of a discontinuity in the integrand), so I used a “brute-force” combination of quadrature and an automatic integrator (using the NAG library, thanks! :slight_smile: ). But the calculation took hours, so I feel that fast evaluation of analytical functions is always welcome… (I remember MKL also provides some nice vector(?) routines for this, but I have not tried them so far.)

By the way, to get some experience, I’ve tried the following code with three different “integrands” (assuming a toy quadrature scenario), i.e.

  • \sin(x)
  • \sin( 2\pi x^2 )
  • \sqrt{ ( 1 - \exp( -x^2 ) ) / \cosh( -x^4 ) }

The Fortran code I used is like this:

program main
    implicit none
    integer, parameter :: dp = kind(0.d0)
    real(dp), parameter :: twopi = 2.0d0 * acos(-1.0d0)
    real(dp) :: t1, t2, val
    integer :: loop

    do loop = 1, 3
        call cpu_time(t1)
        val = integral( N = 10**8 )
        call cpu_time(t2)
        print *, "time = ", t2-t1
    enddo
    print *, "val = ", val

contains

function integral(N) result(r)
    integer, intent(in) :: N
    real(dp) :: r, x
    integer :: i

    r = 0
    do i = 1, N
        x = dble(i) / dble(N)
        r = r + sin( x )  !! [eq.1]
        ! r = r + sin( twopi * x**2 )  !! [eq.2]
        ! r = r + sqrt( ( 1 - exp( -x**2 ) ) / cosh( -x**4 ) )  !! [eq.3]
    end do
    r = r / dble(N)
end

end program

and the Julia code is like this:

#using LoopVectorization
const twopi = 2.0 * pi

function integral( N )
    s = 0.0

    #@avx for i in 1:N
    for i = 1 : N

        x = i / N
        s += sin( x )  # [eq.1]
        # s += sin( twopi * x^2 )  # [eq.2]
        # s += sqrt( ( 1 - exp( -x^2 ) ) / cosh( -x^4 ) )  # [eq.3]
    end
    return s / N
end

@time val = integral( 10^8 )
@time val = integral( 10^8 )
@time val = integral( 10^8 )
println( "val = ", val )

On my computer (Mac mini 2012, Core i7 2.3 GHz, extremely old!!), the results were as follows:

Environment:

gcc version 10.2.0 (Homebrew GCC 10.2.0) 

Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, ivybridge)
Below, "Fg", "J", "Jv" refer to these commands:

(Fg) gfortran -O3 -march=native
(J) julia (plain, no command-line option, @avx line commented out)
(Jv) julia (with @avx macro from LoopVectorization)

------------------------------
[eq.1] sin(x)

(Fg)
 time =   0.96957099999999996     
 time =   0.97305400000000009     
 time =   0.99764900000000001     
 val =   0.45969769833919766     

(J)
  1.344146 seconds (1 allocation: 16 bytes)
  1.376316 seconds (1 allocation: 16 bytes)
  1.372201 seconds (1 allocation: 16 bytes)
val = 0.45969769833919766

(Jv)
  0.404184 seconds (1 allocation: 16 bytes)
  0.421015 seconds (1 allocation: 16 bytes)
  0.414725 seconds (1 allocation: 16 bytes)
val = 0.4596976983392125

------------------------------
[eq.2] sin( 2pi * x**2 )

(Fg)
 time =    1.4407100000000002     
 time =    1.4383539999999999     
 time =    1.4315230000000003     
 val =   0.17170783918195714     

(J)
  2.000316 seconds (1 allocation: 16 bytes)
  2.000964 seconds (1 allocation: 16 bytes)
  2.022475 seconds (1 allocation: 16 bytes)
val = 0.17170783918195714

(Jv)
  0.456387 seconds (1 allocation: 16 bytes)
  0.445566 seconds (1 allocation: 16 bytes)
  0.443968 seconds (1 allocation: 16 bytes)
val = 0.17170783918184698

------------------------------
[eq.3] sqrt( ( 1 - exp( -x**2 ) ) / cosh( -x**4 ) )

(Fg)
 time =    2.2166120000000000     
 time =    2.2270540000000003     
 time =    2.2321980000000003     
 val =   0.42737032509713468     

(J)
 13.236536 seconds (1 allocation: 16 bytes)
  6.986087 seconds (1 allocation: 16 bytes)
  6.975467 seconds (1 allocation: 16 bytes)
val = 0.4273703250971347

(Jv)
  1.947188 seconds (1 allocation: 16 bytes)
  1.923063 seconds (1 allocation: 16 bytes)
  1.945297 seconds (1 allocation: 16 bytes)
val = 0.42737032509704814

Because my machine is too old for benchmarking, I am not sure to what extent I can trust the relative timings above (I guess they will change greatly on more recent machines). But is there possibly some trend that SIMD optimization becomes more challenging as the expression becomes more complicated (e.g. because of increased overhead), or is it possibly because some functions are accelerated (e.g. sin) while others are not (e.g. exp)? <— I don’t know the mechanics of SIMD, so I cannot imagine what’s happening here…

One more thing I find interesting is that gfortran, ifort, and nvfortran give very different timings (shown here).

Is this because of differences in vectorization(?) among the compilers…?

I’ve also tried using “SLEEF” on the Mac above, but the library installed via Homebrew did not work with gfortran-10 for some reason (the error says that _ZGVdN4v_sin or something similar is missing, even when linking the library). So I may need to build it from source…
https://sleef.org/

3 Likes

Thanks for the information. Here is pairwise summation in Fortran, which for 1E9 single-precision uniform deviates is not only correct but also slightly faster than SUM with gfortran -O2.

module m
implicit none
private
public :: sum_pair
contains
recursive pure function sum_pair(x) result(xsum)
real,   intent(in) :: x(:)
real               :: xsum
integer            :: n,imid
integer, parameter :: nmax = 1000000
n = size(x)
if (n <= nmax) then
   xsum = sum(x)
else
   imid = n/2
   xsum = sum_pair(x(:imid)) + sum_pair(x(imid+1:))
end if
end function sum_pair
end module m

program xsum_pair
use m, only: sum_pair
implicit none
integer, parameter :: n = 10**9, nt = 3, niter = 5
real               :: t(nt),mean_sum,mean_sum_pair
real, allocatable  :: x(:)
integer            :: iter
allocate (x(n))
do iter=1,niter
   call random_number(x)
   call cpu_time(t(1))
   mean_sum = sum(x)/n
   call cpu_time(t(2))
   mean_sum_pair = sum_pair(x)/n
   call cpu_time(t(3))
   write (*,"(/,3a10)") "method","sum","sum_pair"
   write (*,"(a10,2f10.7)") "means",mean_sum,mean_sum_pair
   write (*,"(a10,2f10.7)") "times",t(2:) - t(:nt-1)
end do
end program xsum_pair

Results:

method       sum  sum_pair
 means 0.0167772 0.5000080
 times 1.2031250 1.0468750

method       sum  sum_pair
 means 0.0167772 0.4999994
 times 1.1875000 1.0468750

method       sum  sum_pair
 means 0.0167772 0.4999858
 times 1.1875000 1.0468750

method       sum  sum_pair
 means 0.0167772 0.5000035
 times 1.2031250 1.0312500

method       sum  sum_pair
 means 0.0167772 0.4999901
 times 1.1875000 1.0468750

I wonder why gfortran sum(x) always gives 16777216.0.

Just trying to help here, because I wrote that test in the original Julia Discourse thread. The goal there was to check the overhead of multithreading vs. OpenMP (because a Julia user raised the concern that threading in Julia was slow). Chris Elrod came up with the @avxt example to show a very low-overhead multithreading scheme. That is the original context of that example (and that is why the inner function, sin, was used: it was supposed to be fast and, at the same time, prevent the compiler from computing the result analytically).

The complete set of Julia tests that I had from there is here (which in the results below is run with 4 threads):

using BenchmarkTools, Test
using LoopVectorization
using ThreadsX

# @inline g(i) = sin(i)
@inline g(i) = 1e-6*i

function f(N)
    s = 0.
    for i in 1:N
        s += g(i)
    end
    s
end

function f_fast(N)
    s = 0.
    @fastmath for i in 1:N
        s += g(i)
    end
    s
end

function f_avx(N)
    s = 0.
    @avx for i in 1:N
        s += g(i)
    end
    s
end

function f_avxt(N)
    s = 0.
    @avxt for i in 1:N
        s += g(i)
    end
    s
end

function f_simd(N)
    s = 0.
    @simd for i in 1:N
        s += g(i)
    end
    s
end

f_sumiter(N) = sum(g(i) for i in 1:N)

f_mapreduce(N) = mapreduce(g, +, 1:N)

f_threadx_mapreduce(N) = ThreadsX.mapreduce(g, +, 1:N)

N = 10000000

@test f(N) ≈ 
      f_fast(N) ≈  
      f_avx(N) ≈  
      f_avxt(N) ≈ 
      f_simd(N) ≈ 
      f_sumiter(N) ≈ 
      f_mapreduce(N) ≈ 
      f_threadx_mapreduce(N)

print("loop");println(@btime f($N))
print("fast");println(@btime f_fast($N))
print("avx");println(@btime f_avx($N))
print("avxt");println(@btime f_avxt($N))
print("simd");println(@btime f_simd($N))
print("sumiter");println(@btime f_sumiter($N))
print("mapreduce");println(@btime f_mapreduce($N))
print("threadsx.mapreduce");println(@btime f_threadx_mapreduce($N))

Results:

julia -t4 sum2.jl
loop  10.060 ms (0 allocations: 0 bytes)
5.0000005000000015e7
fast  6.266 ms (0 allocations: 0 bytes)
5.0000005e7
avx  2.506 ms (0 allocations: 0 bytes)
5.0000005e7
avxt  895.321 μs (0 allocations: 0 bytes)
5.0000005e7
simd  6.266 ms (0 allocations: 0 bytes)
5.0000005e7
sumiter  10.026 ms (0 allocations: 0 bytes)
5.0000005000000015e7
mapreduce  6.361 ms (0 allocations: 0 bytes)
5.0000005e7
threadsx.mapreduce  2.833 ms (254 allocations: 15.62 KiB)
5.0000005e7
5 Likes

Thanks @septc. That is exactly right, this benchmark can be a proxy for a simple numerical integration, among other things.

16777216 = 2^24, so once the sum gets there, adding a number less than 1 won’t change the result. Pairwise summation keeps all the numbers you are adding at roughly equal magnitude, so the rounding error stays small.
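A quick way to see this (a minimal illustration in Julia, using the same single precision as the Fortran run above):

s = 16777216.0f0           # = 2^24, where the naive single-precision sum gets stuck
println(s + 0.5f0 == s)    # true: adding a uniform deviate (mean 0.5) no longer changes s
println(s + 1.0f0 == s)    # true: even 1.0 is rounded away (round-to-nearest-even)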

2 Likes

I think the point of the demonstrator is that a special fast implementation of a trig function can be inlined to a great extent, preserving the ability to use wide registers to compute many requested values at a time and sum them efficiently without storing all the summands.

You can do something similar by calling MKL’s vdsin function on 100,000,000 elements, but vdsin gives you all 100,000,000 answers, which you can then SUM(). My timings of MKL suggest 1.4 nanoseconds for each trig evaluation and 0.3 nanoseconds for each summand. But I can write, in OpenMP Fortran, a much faster (4x) and more accurate (1000x) sum than SUM().

4 Likes

This is what I get with the integration suggested by @septc:

julia -t4 sum2.jl  # @avxt and ThreadsX use 4 threads
loop  1.434 s (0 allocations: 0 bytes)
0.4273703250971347
fast  1.512 s (0 allocations: 0 bytes)
0.4273703250971347
avx  832.967 ms (0 allocations: 0 bytes)
0.42737032509704814
avxt  315.714 ms (0 allocations: 0 bytes)
0.4273703250970827
simd  1.443 s (0 allocations: 0 bytes)
0.4273703250971347
sumiter  3.253 s (0 allocations: 0 bytes)
0.4273703250971347
mapreduce  3.299 s (0 allocations: 0 bytes)
0.4273703250970798
threadsx.mapreduce  1.127 s (293 allocations: 16.77 KiB)
0.4273703250970845

and with gfortran I get:

gfortran -O3 -march=native int.f95 -o int
./int 
 time =    2.1403490000000001     
 time =    2.1468080000000000     
 time =    2.0591350000000004     
 val =   0.42737032509713468     

The Julia code is:

using BenchmarkTools, Test
using LoopVectorization
using ThreadsX

#g(i) = 1e-6*i
@inline function g(i,N) 
   x = i/N
   sqrt( ( 1 - exp( -x^2 ) ) / cosh( -x^4 ) )
end

function f(N)
    s = 0.
    for i in 1:N
        s += g(i,N)
    end
    s/N
end

function f_fast(N)
    s = 0.
    @fastmath for i in 1:N
        s += g(i,N)
    end
    s/N
end

function f_avx(N)
    s = 0.
    @avx for i in 1:N
        s += g(i,N)
    end
    s/N
end

function f_avxt(N)
    s = 0.
    @avxt for i in 1:N
        s += g(i,N)
    end
    s/N
end

function f_simd(N)
    s = 0.
    @simd for i in 1:N
        s += g(i,N)
    end
    s/N
end

f_sumiter(N) = sum(g(i,N) for i in 1:N)/N

f_mapreduce(N) = mapreduce(i -> g(i,N), +, 1:N)/N

f_threadx_mapreduce(N) = ThreadsX.mapreduce(i -> g(i,N), +, 1:N)/N

N = 10^8

@test f(N) ≈ 
      f_fast(N) ≈  
      f_avx(N) ≈  
      f_avxt(N) ≈ 
      f_simd(N) ≈ 
      f_sumiter(N) ≈ 
      f_mapreduce(N) ≈ 
      f_threadx_mapreduce(N)

print("loop");println(@btime f($N));
print("fast");println(@btime f_fast($N));
print("avx");println(@btime f_avx($N));
print("avxt");println(@btime f_avxt($N));
print("simd");println(@btime f_simd($N));
print("sumiter");println(@btime f_sumiter($N));
print("mapreduce");println(@btime f_mapreduce($N));
print("threadsx.mapreduce");println(@btime f_threadx_mapreduce($N));

By the way, I have used Fortran for 20 years, and for about 1.5 years now I have been using Julia. Unfortunately, I pretty much got stuck with what I knew about Fortran. I only came to know very recently (on the Julia forums) that the syntax has evolved that much. My main programs are still Fortran.

4 Likes

I agree. The trend, especially in C++, is to have libraries like Kokkos do the heavy lifting and provide array support, and while that allows C++ to be used today with many compilers on GPUs, I think it is much better if the compiler itself can do this and implement these optimizations internally. Fortran, being a simpler language, is ideal for this kind of compiler optimization.

Actually, I have no doubt that @chriselrod could help us greatly with Fortran compilers. My experience with people contributing to LFortran is that they get up to speed in very little time. I would even argue it is easier to implement these optimizations in a compiler than as a library.

There are pros and cons to each approach (library vs. compiler), but both approaches work.

2 Likes

We seem to have great front-ends, great platform-specific back-ends, great runtime libraries, great instrumentation libraries, great optimization reporting and open-source compilers. Unfortunately, that’s about 6 different compilers!

5 Likes

Slightly tangential, but Steven G. Johnson, one of the two authors of FFTW, is a long-time Julia contributor and user, and gave this excellent keynote about metaprogramming in Julia at JuliaCon 2019:

This talk is particularly interesting because, rather than being some esoteric topic, metaprogramming is of great relevance to high-performance numerical programming: libraries like FFTW and BLAS, and pretty much any other kind of really high-performance library, end up having to do fairly sophisticated code generation in order to reach the highest levels of performance. FFTW, for example, uses OCaml to generate highly optimized C code. Julia is very much designed to make this kind of code generation easy. (It’s even possible at run time, based on the characteristics of specific problems, if necessary, without the user needing to do anything other than call a function.)
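A tiny sketch of that run-time specialization (my own example, not how FFTW itself is structured): a @generated function builds a specialized expression the first time it is called for a given input type, and the compiled result is reused afterwards.

# Generate a fully unrolled summation for a tuple of length N, known from the type.
@generated function unrolled_sum(x::NTuple{N,Float64}) where {N}
    ex = :(x[1])
    for i in 2:N
        ex = :($ex + x[$i])    # build x[1] + x[2] + ... + x[N] as an expression
    end
    return ex
end

println(unrolled_sum((1.0, 2.0, 3.0, 4.0)))  # 10.0, computed by the generated, unrolled code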

11 Likes

I get a 4x speedup for free just by putting @fastmath in front of g(i,N). This is partly because x^4 is computed in Julia using Base.power_by_squaring instead of the much faster but less accurate x*x*x*x. You can test this by replacing x^4 with x*x^3; in that case, x is multiplied by the optimized x^3 computed with the Base.literal_pow algorithm. Being a loyal lover of Fortran, I am still amazed by Julia every day and its impressive performance.
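For reference, a minimal sketch (my own, not the exact benchmark code) of the three variants being compared; the actual speedups will of course depend on the machine and Julia version:

using BenchmarkTools

g_pow(x)  = sqrt((1 - exp(-x^2)) / cosh(-x^4))           # x^4 via the generic power path
g_mul(x)  = sqrt((1 - exp(-x^2)) / cosh(-(x * x^3)))      # x*x^3 takes the literal_pow shortcut
g_fast(x) = @fastmath sqrt((1 - exp(-x^2)) / cosh(-x^4))  # @fastmath relaxes the math calls

quad(g, N) = sum(g(i / N) for i in 1:N) / N   # same toy integral as above

@btime quad($g_pow,  10^6)
@btime quad($g_mul,  10^6)
@btime quad($g_fast, 10^6)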

Of course, I tested my Intel compiler with the /fast flag for comparison and tried the x*x**3 trick, but nothing changes the timings of the Fortran version. Here are my benchmark results.

Intel Fortran:

 time =    3.8906250000000000     
 time =    3.9062500000000000     
 time =    3.9062500000000000     
 val =   0.42737032509713474

Julia 1.7.0-beta2:

loop      1.140    s (0 allocations: 0 bytes)  0.4273703250971348
fast      1.140    s (0 allocations: 0 bytes)  0.4273703250971348
avx       1.512    s (0 allocations: 0 bytes)  0.42737032509704814
avxt      388.076 ms (0 allocations: 0 bytes)  0.4273703250970827
simd      1.140    s (0 allocations: 0 bytes)  0.4273703250971348
sumiter   3.135    s (0 allocations: 0 bytes)  0.4273703250971348
mapreduce 3.135    s (0 allocations: 0 bytes)  0.4273703250970799
threadsx.mapreduce 412.653 ms (620 allocations: 44.83 KiB) 0.42737032509707623
3 Likes

You might want to give NVFortran a try. It has been established as the fastest Fortran compiler tried on this benchmark so far. Also, in another topic, Intel MKL was used to provide a speedup for Fortran.

3 Likes

@seif_shebl thank you for submitting your version and welcome to the forum!

I’ve recently opened an issue at Add summation of sin example · Issue #14 · fortran-lang/benchmarks · GitHub to include this example, both Fortran and Julia, in our “benchmarks” repository. So this work will not be lost; we’ll eventually get to including it.

2 Likes

I am amazed that people think there exist one-liners that matter in practice which decades of competitive Fortran compiler and program development have left on the tree to be picked.

3 Likes

A lot of the problem is that the optimal elementary-function implementation changes over time as hardware changes. One thing Julia does to get fast and accurate special functions is to rely heavily on fused multiply-add (FMA) instructions. On most processors from the last decade or so, these are 1-cycle instructions that do 2 operations with only 1 rounding. However, fma is a relatively new instruction, so if you wrote your elementary-function code before you could rely on it, you would use a different, less efficient algorithm.
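A small illustration of that single-rounding property (my own example):

a, b, c = 1.0 + 2.0^-30, 1.0 - 2.0^-30, -1.0

println(a * b + c)       # 0.0: a*b is rounded to 1.0 first, so the tiny residual is lost
println(fma(a, b, c))    # -8.67e-19 (exactly -2^-60): a*b + c computed with a single rounding
println(muladd(a, b, c)) # lets the compiler use the hardware FMA when it is profitable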

To some extent, I think there is a broader lesson here. Even if you are writing statically typed, low-level code, writing it in a fast, high-level language with good introspection capabilities has a lot of value. If you are using a language with 10 different compilers that doesn’t let you easily see how intrinsics are implemented, it is much less likely that users will spot places to optimize intrinsics, and the language will fall behind.

5 Likes

@oscardssmith, no doubt as a user I would like to see exactly how the compiler implements intrinsics and contribute to their implementation. I think Fortran intrinsic math functions should be written in Fortran, and the Fortran compiler(s) should be able to optimize Fortran code well enough that you can write them in Fortran. I suggested something along those lines in: Fortran runtime math library, but it wasn’t received well. I still think it’s a good idea and the way to go for LFortran, and it is still my goal.

5 Likes

A problem is that a different implementation may be optimal for serial vs. SIMD execution.
For example, serial implementations of many special functions are often written piecewise: break up the domain and write a fast and accurate approximation for each piece.
For SIMD versions, the ideal would be not to split up the domain at all, covering the entire range with a single algorithm – but a slower one, since you’d probably need higher-degree polynomials, for example.
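A toy sketch of the two styles (made-up coefficients, just truncated Taylor series for exp near 0; not production kernels):

# "Scalar style": a cheaper approximation per sub-domain, at the cost of a branch.
exp_piecewise(x) = x < 0 ? evalpoly(x, (1.0, 1.0, 0.5)) :
                           evalpoly(x, (1.0, 1.0, 0.5, 1/6))

# "SIMD style": one higher-degree polynomial over the whole range, with no branches,
# so every vector lane executes the same instructions.
exp_branchfree(x) = evalpoly(x, (1.0, 1.0, 0.5, 1/6, 1/24, 1/120))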

You can still make some exceptions, e.g. if the entire domain can’t be covered with one implementation, or if you don’t want to underflow on subnormals: you can check the entire vector and branch if any element was subnormal, without taking much of a performance hit on average, since subnormals are (hopefully) only rarely encountered.

In Julia, Base.log, which is optimized for scalars, is about twice as fast as SLEEFPirates.log, which is optimized for vectors. However, the Base.log algorithm is incompatible with SIMD, meaning 8 evaluations would take about 8 times longer, while SLEEFPirates.log is about as fast with 8 inputs as with 1 on my computer.

Another example of how implementations can change as a function of vector size is that the exp implementation used by LoopVectorization.jl on systems with AVX512 uses a small, 16-element lookup table. It uses this because the table can be implemented via a fairly fast vector shuffle instruction rather than a gather instruction, as would be needed with larger tables (a smaller table also takes up less cache space).
But using vector shuffles as lookup tables requires your vectors to be a certain size to actually hold the table.

I mean to say “it would be difficult”, not to be discouraging. There are also many cases where it would be feasible to get this to work well, and in general I’d love to see more of a push towards ISPC-like vectorization abilities and branch handling with masks.

4 Likes

Indeed, I think the compiler should use different implementations based on the vector length. It seems only two functions are needed: log_scalar (to compute just one value as fast as possible) and log_vector, which internally can still dispatch based on the vector length. Or is there more to this?

It’d be nice if the compiler knew that log_scalar and log_vector belong together and automatically used log_vector in vectorized loops. For intrinsic math functions like sin that can be done. For user-defined functions, an extension of the language might be needed.
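For comparison, in Julia this pairing can be written directly with dispatch (a hedged sketch with hypothetical names, not a real library API):

fast_log(x::Real)           = log(x)       # scalar path: fastest for a single value
fast_log(x::AbstractVector) = simd_log(x)  # vector path: would call a SIMD-friendly kernel

simd_log(x) = map(log, x)   # placeholder body; a real kernel would be branch-free over lanes

fast_log(2.0)               # picks the scalar method
fast_log([1.0, 2.0, 4.0])   # picks the vector method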

Yes, none of this is easy. But all of this is necessary (in my opinion) to realize the full potential of a language like Fortran.