Simple summation 8x slower than in Julia

We seem to have great front-ends, great platform-specific back-ends, great runtime libraries, great instrumentation libraries, great optimization reporting and open-source compilers. Unfortunately, that’s about 6 different compilers!


Slightly tangential, but Steven G. Johnson, one of the two authors of FFTW, is a long-time Julia contributor and user, and gave this excellent keynote about metaprogramming in Julia at JuliaCon 2019:

This talk is particularly interesting because, far from being an esoteric topic, metaprogramming is of great relevance to high-performance numerical programming: libraries like FFTW and BLAS, and pretty much any other really high-performance library, end up doing fairly sophisticated code generation to reach the highest levels of performance. FFTW, for example, uses OCaml to generate highly optimized C code. Julia is very much designed to make this kind of code generation easy. (It is even possible at run time, based on the characteristics of a specific problem, without the user needing to do anything beyond calling a function.)


I get a 4x speedup for free just by adding @fastmath in front of g(i,N). This is partly because x^4 is calculated in Julia using Base.power_by_squaring instead of the much faster but less accurate x*x*x*x. You can test this by replacing x^4 with x*x^3: in that case, x is multiplied by the optimized x^3 computed via the Base.literal_pow algorithm. Being a loyal lover of Fortran, I am still amazed by Julia's impressive performance every day.

Of course, I tested my Intel compiler with the /fast flag for comparison and tried the x*x**3 trick, but nothing changed the timings of the Fortran version. Here are my benchmark results.

Intel Fortran:

 time =    3.8906250000000000     
 time =    3.9062500000000000     
 time =    3.9062500000000000     
 val =   0.42737032509713474

Julia 1.7.0-beta2:

loop      1.140    s (0 allocations: 0 bytes)  0.4273703250971348
fast      1.140    s (0 allocations: 0 bytes)  0.4273703250971348
avx       1.512    s (0 allocations: 0 bytes)  0.42737032509704814
avxt      388.076 ms (0 allocations: 0 bytes)  0.4273703250970827
simd      1.140    s (0 allocations: 0 bytes)  0.4273703250971348
sumiter   3.135    s (0 allocations: 0 bytes)  0.4273703250971348
mapreduce 3.135    s (0 allocations: 0 bytes)  0.4273703250970799
threadsx.mapreduce 412.653 ms (620 allocations: 44.83 KiB) 0.42737032509707623

You might want to give NVFortran a try; it has proven to be the fastest Fortran compiler tried on this benchmark so far. Also, in another topic, Intel MKL was used to speed up the Fortran version.


@seif_shebl thank you for submitting your version and welcome to the forum!

I’ve recently opened an issue at Add summation of sin example · Issue #14 · fortran-lang/benchmarks · GitHub to include this example, both the Fortran and Julia versions, in our “benchmarks” repository. So this work will not be lost; we’ll eventually get to it.


I am amazed that people think there are one-liners that matter in practice which decades of competitive Fortran compiler and program development have left on the tree to be picked.


A lot of the problem is that the optimal elementary function implementation changes over time as hardware changes. One thing Julia does to get fast and accurate special functions is rely heavily on fused multiply-add (FMA) instructions. On most processors from the last decade or so, these perform two operations with only one rounding at roughly the cost of a single instruction. However, FMA is a relatively new instruction, so elementary function code written before you could rely on it typically uses a different, less efficient algorithm.
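As an illustration of the instruction itself, here is a sketch using Fortran 2018's IEEE_FMA from ieee_arithmetic (compiler support for it varies, and optimizing compilers will usually fuse a*b + c on their own at -O2 and above): a Horner-style polynomial evaluation costs one fused multiply-add per coefficient, each with a single rounding.

```fortran
program fma_demo
use, intrinsic :: ieee_arithmetic, only: ieee_fma   ! Fortran 2018; support varies by compiler
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: x, p

! Horner evaluation of the truncated series 1 + x + x**2/2 + x**3/6:
! each step computes p = p*x + c with a single rounding.
x = 0.1_dp
p = 1.0_dp/6.0_dp
p = ieee_fma(p, x, 0.5_dp)
p = ieee_fma(p, x, 1.0_dp)
p = ieee_fma(p, x, 1.0_dp)
print *, p   ! approximately exp(0.1)
end program fma_demo
```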

To some extent, I think there is a broader lesson here. Even if you are writing statically typed, low-level code, writing it in a fast, high-level language with good introspection capability has a lot of value. In a language with 10 different compilers that doesn't let you easily see how intrinsics are implemented, users are much less likely to spot places to optimize them, and the language will fall behind.


@oscardssmith, no doubt as a user I would like to see exactly how the compiler implements intrinsics and contribute to their implementation. I think Fortran's intrinsic math functions should be written in Fortran, and the Fortran compilers should optimize that code well enough to make this practical. I suggested something along those lines in: Fortran runtime math library, but it wasn't received well. I still think it's a good idea and the way to go for LFortran, and it remains my goal.


A problem is that different implementations may be optimal for serial versus SIMD execution.
For example, serial implementations of many special functions are often implemented piecewise. Break up the domain, and write a fast and accurate approximation for each piece.
For SIMD versions, the ideal would be to not split up the domain at all, covering the entire range with a single algorithm – but a slower one, as you’d probably require higher degree polynomials, for example.

You can still make some exceptions, e.g. when the entire domain can't be covered by one implementation, or when you don't want to underflow on subnormals: you can check the entire vector and branch if any element is subnormal, without taking much of a performance hit on average, since subnormals are (hopefully) encountered only rarely.

In Julia, Base.log, which is optimized for scalars, is about twice as fast as SLEEFPirates.log, which is optimized for vectors. However, the Base.log algorithm is incompatible with SIMD, so 8 evaluations take about 8 times longer,
whereas SLEEFPirates.log is about as fast with 8 inputs as with 1 on my computer.

Another example of how implementations can change as a function of vector sizes is that the exp implementation used by LoopVectorization.jl for systems with AVX512 uses a small, 16-element, lookup table. It uses this because the table can be implemented via a fairly fast vector shuffle instruction rather than a gather instruction, as would be needed with larger tables (a smaller table also takes up less cache space).
But using vector shuffles as lookup tables requires your vectors are a certain size to actually hold the table.

I meant “it would be difficult”, not to be discouraging. There are also many cases where it would be feasible to make this work well, and in general I'd love to see more of a push toward ISPC-like vectorization abilities and mask-based branch handling.


Indeed, I think the compiler should use different implementations based on the vector length. It seems only two functions are needed: log_scalar (computing a single value as fast as possible) and log_vector (which internally can still dispatch on the vector length). Or is there more to it?

It’d be nice if the compiler knew that log_scalar and log_vector belong together, and automatically used log_vector for vectorized loops. For intrinsic math functions like sin that can be done. For user defined functions maybe an extension of the language would be needed.

Yes, none of this is easy. But all of this is necessary (in my opinion) to realize the full potential of a language like Fortran.

Having a compiler know about the relationship between log_scalar and log_vector sounds to me like you really wish you had multiple dispatch :slight_smile:.

In Fortran you can write a generic function like this:

interface log
    module procedure log_scalar, log_vector
end interface


real(dp) function log_scalar(x) result(y)
real(dp), intent(in) :: x
end function

function log_vector(x) result(y)
real(dp), intent(in) :: x(:)
real(dp) :: y(size(x))
end function

But I don’t know if compilers would switch from log_scalar(x) to log_vector(x) when optimizing code like this one:

do i = 1, N
    A(i) = log(x(i))
end do
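For what it's worth, the closest existing Fortran mechanism to a single name covering both cases is an elemental function, though that merely applies the scalar algorithm element-wise; it does not let the compiler substitute a genuinely different vector algorithm. A minimal sketch (mylog here is a hypothetical stand-in for a hand-written scalar kernel):

```fortran
module mylog_mod
implicit none
integer, parameter :: dp = kind(0.d0)
contains
    ! elemental: one definition, usable on scalars and arrays alike
    elemental real(dp) function mylog(x) result(y)
    real(dp), intent(in) :: x
    y = log(x)   ! stand-in for a hand-written scalar kernel
    end function mylog
end module mylog_mod

program demo
use mylog_mod
implicit none
real(dp) :: a(4)
a = [1.0_dp, 2.0_dp, 4.0_dp, 8.0_dp]
print *, mylog(2.0_dp)   ! scalar call
print *, mylog(a)        ! element-wise over the whole array
end program demo
```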

Why does your Fortran code convert an integer to a real inside a hot loop?

Shouldn’t your iterator be of type real to get decent performance?


Welcome to the forum. Non-integer DO loop variables were deleted in Fortran 95 and later standards, as discussed here.

Thanks to @ricebunny for the suggestion. It appears that a loop with a real loop variable is slightly faster with gfortran -Ofast on Windows, with output

Time, result, method: 3.765625 1.71364934657028 original            
Time, result, method: 3.640625 1.71364934657028 real_loop_simulated 
Time, result, method: 3.625000 1.71364934657028 real_loop

for the code

program avx
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: t1, t2, r
integer :: iter
integer, parameter :: n = 100000000, nmethods = 3
character (len=*), parameter :: methods(nmethods) = [character (len=20) :: "original ","real_loop_simulated","real_loop"]
do iter=1,3
   call cpu_time(t1)
   if (iter == 1) then
      r = f(n)
   else if (iter == 2) then
      r = g(n)
   else
      r = h(n)
   end if
   call cpu_time(t2)
   write (*,"(a,1x,f0.6,1x,f0.14,1x,a)") "Time, result, method:", t2-t1, r, methods(iter)
end do

contains

    real(dp) function f(N) result(r)
    integer, intent(in) :: N
    integer :: i
    r = 0
    do i = 1, N
        r = r + sin(real(i,dp))
    end do
    end function f

    real(dp) function g(N) result(r)
    integer, intent(in) :: N
    integer :: i
    real(dp) :: x
    r = 0
    x = 0.0_dp
    do i = 1, N ! simulate loop with real loop variable
        x = x + 1.0_dp
        r = r + sin(x)
    end do
    end function g

    real(dp) function h(N) result(r)
    integer, intent(in) :: N
    real(dp) :: i
    r = 0
    do i = 1, N ! loop with real loop variable -- nonstandard
        r = r + sin(i)
    end do
    end function h

end program

With that overhead removed, are you not essentially comparing the implementation of the library sine function in Fortran and Julia?

The summation is likely only a small fraction of the total computational cost. Even when executed as a CPU instruction, sine has a latency between 150 and 200 cycles.

Secondly, are you sure that both libraries implement a vectorized version of the sine function? I am not a Fortran programmer, but I know that you sometimes need the right compiler flags to get vectorized code. Is your Fortran code using SVML? Maybe this helps: -fimf-use-svml (Linux), /Qimf-use-svml (Windows)


It’s not using the same sin implementation. The Julia version uses a highly optimized implementation, and LoopVectorization.jl inlines it properly into the loop and vectorizes everything. Fortran compilers should do the same, but currently they don’t. We are in the process of creating very fast pure-Fortran sin implementations to be used for examples like these.


Why not utilise Fortran’s advantage and adopt !$OMP?

program avx
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: t1, t2, r

call elapse_time(t1)
r = f(100000000)
call elapse_time(t2)

print *, "Time", t2-t1
print *, r

contains

    real(dp) function f(N) result(r)
    integer, intent(in) :: N
    integer  :: i
    real(dp) :: x, s
    s = 0
  !$omp parallel do shared (n), private (i,x) reduction (+ : s) schedule (static)
    do i = 1, N
        x = i
        s = s + sin(x)
    end do
  !$omp end parallel do
    r = s
    end function

    subroutine elapse_time (sec)
      real(dp), intent(out) :: sec
      integer(8) :: ticks, rate
      call system_clock ( ticks, rate )
      sec = dble(ticks) / dble(rate)
    end subroutine elapse_time

end program

Although this test is basically comparing the implementations of sin(x) for large x.


Sorry to bump this old thread. I recently ran into very slow loops (slower than the equivalent ones in MATLAB) with 500k or so iterations, each of which used cosh and cos/sin. Replacing these with the Intel MKL VML routines led to a 5x speedup. My processor only has AVX2 instructions; I think that on a processor with the AVX-512 instruction set this could be cut even further. I use Intel oneAPI under Linux.
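For reference, a VML call looks roughly like this. This is a sketch assuming Intel oneAPI MKL: the interfaces come from the mkl_vml.f90 file shipped with MKL, vdsin is the double-precision element-wise sine, and you must link against MKL (e.g. ifort -qmkl).

```fortran
include 'mkl_vml.f90'   ! ships with oneAPI MKL; provides the VML interfaces

program vml_demo
use mkl_vml
implicit none
integer, parameter :: dp = kind(0.d0)
integer, parameter :: n = 1000000
real(dp), allocatable :: x(:), s(:)
integer :: i

allocate(x(n), s(n))
x = [(real(i, dp), i = 1, n)]
call vdsin(n, x, s)   ! element-wise sine over the whole array in one call
print *, sum(s)
end program vml_demo
```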


Thank you @mohoree for the great news!
May I ask, could you perhaps show a small minimal working example of how to use the Intel MKL VML routines? Thank you very much indeed!

I tried a one-line solution for the function f, but it seems slightly slower; with Intel Fortran, -O3 -xHost, it took 0.65 seconds.

r = sum( sin( dble( (/(i, i = 1,N)/) ) ) )


The do loop takes 0.6 seconds.
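One possible reason: the implied-do in the one-liner materializes an N-element temporary (hundreds of MB for N = 10**8) before sum ever runs. A chunked variant keeps the array form while bounding the temporary to cache size (an illustrative sketch; the chunk size is an arbitrary choice):

```fortran
program chunked_sum
implicit none
integer, parameter :: dp = kind(0.d0)
integer, parameter :: n = 100000000, chunk = 4096
integer :: i, j
real(dp) :: r

r = 0
do j = 1, n, chunk
    ! the implied-do temporary is only `chunk` elements, so it stays in cache
    r = r + sum(sin(real([(i, i = j, min(j + chunk - 1, n))], dp)))
end do
print *, r
end program chunked_sum
```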