Simple summation 8x slower than in Julia

@certik and everyone,

Since Julia touts its ability to call C with ccall and Fortran has interoperability with C, I’m inclined to think a better comparison would be to use a common basis in both, say the C code for computing sine using the CORDIC algorithm as given by John Burkardt here: https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c

and rerun the tests in Julia and Fortran against the same exact compiled C code.

3 Likes

That’s one of the maladies affecting Python codes, one that I hope Julia developers will take care to avoid. Five years from now, will avx still have any correct meaning? Or will it remain there forever as a historical misnomer?
The Fortran style of optimization hint would be something like do concurrent. The hardware-specific vectorization and optimizations are left to the compiler and compiler flags, which are free to do nothing, or vectorize the loop, or parallelize it on threads, or offload it to GPUs as the NVIDIA Fortran compiler does, or distribute the work across CPUs. The developer does not (and ideally should never need to) insert hardware-specific directives in the code. (I realize that do concurrent is irrelevant to this particular thread; it is just an example.)

do concurrent(i = 1:n)
    a(i) = sin(b(i))   ! iterations are independent; the compiler is free to vectorize, thread, or offload
end do

As pointed out above, @avx is a (historically named, but not limited to AVX) feature of a package (LoopVectorization), not a feature of the language. Julia’s language feature is @simd. It’s fair to assume SIMD will still be a thing several years from now. From what I understand, @simd is right now inferior to (but much simpler than) @avx; the hope is that LoopVectorization serves as a laboratory and, once the dust settles, parts of it get integrated into the language. This kind of process (having a feature live in a package under a macro while it gets ironed out) appears to be very effective.

5 Likes

One standard way to help the compiler with vectorization and parallelization is the OpenMP 4 SIMD directive, which is supported by all major compilers (Intel, GCC, LLVM [which borrows Intel’s OpenMP library]).

If one uses the Intel compiler with SVML, it has an even faster sin() function than Sleef.

1 Like

Adding !$omp simd reduction(+:r) to the loop as suggested by @Royi (welcome to the Discourse @Royi!) and compiling with -fopenmp is enough to get a vectorised version with a call to the libmvec _ZGVdN4v_sin function. To get this working on my Ubuntu 18 machine I also had to add the following to my Fortran source file (I think because of this bug):

!GCC$ builtin (sin) attributes simd (notinbranch) if('x86_64')

Disappointingly, the result with the OpenMP SIMD loop is about 30% slower than without (compiled with -O3 -march=native -fopenmp with an i7-9750H and gcc 10.1.0).

I was however able to get about 2.9x speedup once I linked against the sleef gnu abi (instead of libmvec) which, as others have already pointed out, suggests that much of the difference is to do with the sine implementation.

Since the summation values vary, I’m no longer sure that I’m comparing like-for-like in terms of function accuracy; however, this is not something I know a lot about.

$ # Original
$ gfortran -O3 -march=native sum.f90 -fopt-info-vec-missed && ./a.out 
sum.f90:19:10: missed: couldn't vectorize loop
...
 Time   1.2919830000000001     
   1.7136493465705458     

$ # Add OpenMP SIMD directive
$ gfortran -O3 -march=native -fopenmp sum_omp.f90 -fopt-info-vec && ./a.out 
sum_omp.f90:22:0: optimized: loop vectorized using 32 byte vectors
sum_omp.f90:24:0: optimized: loop vectorized using 32 byte vectors
 Time   1.6626889999999999     
   1.7136493465696878     

$ # Use the sleef vectorised implementation
$ gfortran -O3 -march=native -fopenmp sum_omp.f90 -fopt-info-vec -lsleefgnuabi && ./a.out 
sum_omp.f90:22:0: optimized: loop vectorized using 32 byte vectors
sum_omp.f90:24:0: optimized: loop vectorized using 32 byte vectors
 Time  0.44306599999999996     
   1.7136493465700504     

$ # Julia
$ julia --version && julia avx.jl
julia version 1.6.0
  0.212345 seconds (1 allocation: 16 bytes)
1.713649346570267

sum.f90
program avx
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: t1, t2, r

call cpu_time(t1)
r = f(100000000)
call cpu_time(t2)

print *, "Time", t2-t1
print *, r

contains

    real(dp) function f(N) result(r)
    integer, intent(in) :: N
    integer :: i
    r = 0
    do i = 1, N
        r = r + sin(real(i,dp))
    end do
    end function

end program
sum_omp.f90
!GCC$ builtin (sin) attributes simd (notinbranch) if('x86_64')
program avx

implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: t1, t2, r

call cpu_time(t1)
r = f(100000000)
call cpu_time(t2)

print *, "Time", t2-t1
print *, r

contains

    real(dp) function f(N) result(r)
    integer, intent(in) :: N
    integer :: i
    r = 0

    !$omp simd reduction(+:r)
    do i = 1, N
        r = r + sin(real(i,dp))
    end do
    end function

end program


avx.jl
using LoopVectorization

function f_avx(N)
    s = 0.0
    @avx for i in 1:N
        s += sin(i)
    end
    s
end

@time r = f_avx(100000000)
println(r)

Edit: This post in the Julia thread shows that vectorization is possible without an OpenMP clause when using -Ofast with various architecture flags, but the performance there is similarly disappointing.

3 Likes

Thank you for posting this algorithm, it’s very cool :+1:.

I’ve tried the following

double Sine(double a)
{
    double c, s;
    cossin_cordic(a, 40, &c, &s);
    return s;
}

+ include https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c
// Build: gcc -shared -Ofast -march=native -mtune=native cordic.c -o cordic.dll
using LoopVectorization

function f_avx(N)
    s = 0.0
    for i in 1:N
        s += ccall((:Sine, "cordic"), Float64, (Float64,), convert(Float64, i))
    end
    s
end

@time r = f_avx(100000000)
println(r)
julia cordic.jl
 23.677484 seconds (1 allocation: 16 bytes)
1.7136493499358774

GFortran version with Sine-Cordic:

GFortran-Cordic.exe
 Time   23.531250000000000
   1.7136493499358774

Unfortunately the Julia version with @avxt doesn’t compile.

1 Like

I think I am getting weary of these unrealistic benchmarks. An applied mathematician would never use such a thing if performance mattered. They would note that the requested sum is well known and would code it accordingly, making the code several hundred thousand times faster.
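
For concreteness, a minimal sketch of that closed-form approach (using the standard identity for a sum of sines quoted in a reply below; the program itself is only illustrative):

program sum_sines_closed_form
implicit none
integer, parameter :: dp = kind(0.d0)
integer, parameter :: N = 100000000
real(dp) :: r

! Closed form of sum(sin(k), k = 1..N): sin(N/2)*sin((N+1)/2)/sin(1/2)
r = sin(0.5_dp*N) * sin(0.5_dp*(N + 1)) / sin(0.5_dp)
print *, r   ! ~1.71364934657 for N = 10**8, matching the loop versions above, at negligible cost

end program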

1 Like

When you use a SIMD-based implementation of sin(), it makes no sense to use the SIMD directive.
It is useful for giving the compiler some hints about the data in case it decided, for some reason, not to vectorize the operation.

In your case, if you use Sleef as well (like the Julia code), I’d guess there are 3 other factors:

  1. You use -O3, while I believe the Julia code also uses fast math. So use -Ofast.
  2. You use a different compiler to generate the Sleef code (though, as Sleef is mostly hand coded, it won’t make much difference).
  3. Your compiler’s unrolling preferences are not as aggressive as Julia’s (which uses LLVM, which is aggressive in that department to begin with).

@une, thank you very much for posting the results using the CORDIC algorithm. If you or anyone wanted to follow up on it further, an option worth considering would be to take your C Sine function and port it as Julia and Fortran functions respectively, encoding the C interoperability with cossin_cordic in each. Please note that John Burkardt’s code, especially cossin_cordic, is effectively pure, so I would suggest marking the Fortran interface to it as PURE, along the lines of the sketch below.
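
Something like this on the Fortran side (just a sketch of the suggested binding; a post further down in this thread does exactly this as part of a complete program):

! Sketch of a PURE Fortran interface to Burkardt's routine:
! void cossin_cordic ( double beta, int n, double *c, double *s )
interface
   pure subroutine cossin_cordic( beta, n, c, s ) bind(C, name="cossin_cordic")
      use, intrinsic :: iso_c_binding, only : c_int, c_double
      real(c_double), intent(in), value :: beta
      integer(c_int), intent(in), value :: n
      real(c_double), intent(inout)     :: c
      real(c_double), intent(inout)     :: s
   end subroutine
end interface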

1 Like

Would the existence of a compiler that recognised the applicability of the identity

$$\sum_{k=1}^{N} \sin k \;=\; \frac{\sin\left(\frac{N}{2}\right)\,\sin\left(\frac{N+1}{2}\right)}{\sin\left(\frac{1}{2}\right)}$$

prove that the language in which it did it was in any sense superior?

2 Likes

Yes, exactly: in this case the SIMD directive is only used to coerce vectorisation and generate the SIMD library call for sin. I notice that specifying -ffast-math also generates a vectorised version, so the vectorised version presumably does not have strict arithmetic guarantees?
Using -ffast-math appears to give similar performance to the OpenMP SIMD version, which suggests that the resulting vectorised code is similar or equivalent. I get slightly worse performance with -funroll-loops, which reflects my personal experience that there are no guarantees with -funroll-loops.

I speculate that the remaining performance difference is related to your second point, which is that the vectorised functions I’ve used with Fortran are in a separate compiled object, whereas Julia is perhaps able to perform whole-program analysis with better inlining and IPO? (I know very little about how Julia works in this regard.) Time for a Fortran port of Sleef?

Brilliant, thank you for bringing up the sine series summation solution, a very useful math reminder and lesson.

It would be useful if Fortranners could take the lead in putting forth meaningful performance benchmarks in open forums.

What I come across in the public domain is either so trivial that the point is missed altogether, or so complex and specific to some domain that anyone without that domain expertise can hardly begin to comprehend what is being tested.

2 Likes

We’re conflating what the compiler can do with which implementation is used.
The SIMD directive I wrote was for the non-sin() form.

For the sin() function, it all depends on the implementation.
Sleef has 2-3 accuracy levels; I am not sure which one is used by avxt and which by the Fortran code.

Once the implementations are the same and both give the compiler the same hints, there is no reason not to get the same performance.

sin is not a language intrinsic in Julia; it’s implemented as a generic function.

Is there a way to tell Fortran compilers to maximally evaluate these sorts of things at compile time? Since all the values are known at compile time, the whole program could be reduced to a print of a known constant.

1 Like

It is up to each Fortran compiler to document such things.

@certik and everyone,

Notwithstanding the excellent point by @themos that the specific case at hand is not a true performance benchmark, on account of the mathematical identity: should we continue to use the same case but modify it as suggested upthread, viewing the sin calculation as some representative function computation for which a common library implementation is consumed in whichever programming paradigm is of interest, whether based in Julia or Fortran, and should we “up” the game a little bit to give the calculations some “scale”, i.e., raise the number N to 800,000,000 rather than 100,000,000, then a speedup close to 6x relative to Julia is viable with Fortran.

So with John Burkardt’s C code for the CORDIC algorithm (link upthread) as the basis, the Julia code and the result are as follows (as noted by @uwe, the avx* macro fails to compile, so it’s removed):

function cordic_sine(a)

   cos = Ref{Float64}(0.0)
   sin = Ref{Float64}(0.0)

   # Ref: Excellent work by John Burkardt
   # https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c
   # void cossin_cordic ( double beta, int n, double *c, double *s )
   ccall((:cossin_cordic, "cordic"), Cvoid, (Float64, Cint, Ref{Float64}, Ref{Float64}), a, 40, cos, sin)

   return sin[]

end

function f_avx(N)
   s = 0.0
   for i in 1:N
       s += cordic_sine(convert(Float64, i))
   end
   s
end

@time r = f_avx(800000000)
println(r)

Execution result with Julia:

C:\temp>julia avx.jl
172.551030 seconds (1 allocation: 16 bytes)
1.9006631241568408

With Fortran, say one pursues a rather simple-minded coarray approach, as shown in the fold below (while noting that coarrays are part of the base language):

Click to see Fortran code

Simple coarray example to divvy up the series summation across images

module cordic_m
   use, intrinsic :: iso_c_binding, only : c_int, c_double
   interface
      pure subroutine cossin_cordic( beta, n, c, s) bind(C, name="cossin_cordic")
      ! Ref: Excellent work by John Burkardt
      ! https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c
      ! void cossin_cordic ( double beta, int n, double *c, double *s )
         import :: c_int, c_double
         ! Argument list
         real(c_double), intent(in), value :: beta
         integer(c_int), intent(in), value :: n
         real(c_double), intent(inout)     :: c 
         real(c_double), intent(inout)     :: s 
      end subroutine
   end interface
contains
   pure function cordic_sine( a ) result(sin_a)
      ! Argument list
      real(c_double), intent(in) :: a
      ! Function result
      real(c_double) :: sin_a
      real(c_double) :: cos_a
      call cossin_cordic( a, 40, cos_a, sin_a )
   end function
end module
   use cordic_m
   integer, parameter :: M = 100000000
   real(c_double) :: r[*]
   real(c_double) :: t1, t2
   real(c_double) :: sn
   integer :: i

   if ( this_image() == 1 ) then
      call cpu_time(t1)
   end if
   sync all
       
   r = f(M)
   sync all

   if ( this_image() == 1 ) then
      sn = 0.0_c_double
      do i = 1, num_images()
         sn = sn + r[i]
      end do
      call cpu_time(t2)
      print *, "Time", t2-t1
      print *, sn
   end if

contains

   pure function f(N) result(r)
      integer, intent(in) :: N
      real(c_double) :: r
      integer :: i
      r = 0.0_c_double
      do i = (this_image()-1)*M+1, this_image()*M
         r = r + cordic_sine(real(i,c_double))
      end do
   end function

end program 

the timing result with IFORT on the same CPU as the above Julia case is:

C:\temp>ifort /c /O3 /Qcoarray:shared /Qcoarray-num-images=8 /QxHost /traceback p.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.2.0 Build 20210228_000000
Copyright (C) 1985-2021 Intel Corporation. All rights reserved.

C:\temp>link p.obj cordic.obj /subsystem:console /out:p.exe
Microsoft (R) Incremental Linker Version 14.27.29112.0
Copyright (C) Microsoft Corporation. All rights reserved.

C:\temp>p.exe
Time 30.2500000000000
1.90066312415625

Note the series summation result matches with that of Julia to the precision of c_double / Float64.

Perhaps one or more of the Julia enthusiasts can show here the parallel execution option(s) with Julia for the same exact case, i.e., their native equivalent of Fortran coarrays; now that would be a good learning exercise for readers here!

3 Likes

As far as I understand, Julia is quite strict about floating point semantics, so it will not allow transformations that violate those semantics (e.g. loop reordering). It does autovectorize (through LLVM), but only as far as it can prove the transformations produce the exact same results. I think Fortran does the same, so differences there are probably due to compilers more than languages. Julia annotations such as @simd or @fastmath relax the semantics, allowing the compiler to optimize more, so compared to Fortran it’s more granular than compiler switches. I sometimes wish Julia would allow more aggressive transforms automagically, but it’s not easy to find an API that both allows automatic transformations in the cases where they don’t matter and forbids them in the cases where they do. In any case, adding @simd in front of tight loops is not that big of a deal.

None of the languages will vectorize this without permission, because vectorization requires reassociating the order of the summation in violation of the IEEE 754 floating point standard. In C and Fortran that permission is given by passing the -ffast-math flag for an entire compilation unit. This flag also gives permission for other transformations that may speed up your code but may also change the answer, for better or worse, and definitely less predictably, since the scope of the transformations is unspecified and unbounded. (In bad cases arbitrarily wrong answers can result from careless application of the -ffast-math option.)

In Julia, a much more limited permission to change associativity is given by using the built-in @simd macro. This is a much more surgical approach since it only gives permission to reassociate operations as necessary to vectorize the annotated loop, and doesn’t allow any of the other potentially more problematic transformations that the -ffast-math option can cause. The @avx macro provided by the LoopVectorization package does a bit more than this: it explicitly unrolls the loop, and replaces calls to certain math functions (including sin) with vectorized versions that may give slightly different results. As @antoine-levitt mentioned, there is also a built-in @fastmath macro that gives permissions similar to what -ffast-math does (but locally).

7 Likes