Simple summation 8x slower than in Julia

oscardssmith · June 30, 2021, 7:35pm

Having a compiler know about the relationship between log_scalar and log_vector sounds to me like you really wish you had multiple dispatch .

certik · June 30, 2021, 7:52pm

In Fortran you can write a generic function like this:

interface log
    module procedure log_scalar, log_vector
end interface

contains

subroutine log_scalar(x)
real(dp), intent(out) :: x
...
end subroutine

subroutine log_vector(x)
real(dp), intent(out) :: x(:)
...
end subroutine

But I don’t know if compilers would switch from log_scalar(x) to log_vector(x) when optimizing code like this one:

do i = 1, N
    A(i) = log(x(i))
end do

ricebunny · November 20, 2021, 1:18pm

Why does your Fortran code convert an integer to a real inside a hot loop?

Shouldn’t your iterator be of type real to get decent performance?

Beliavsky · November 20, 2021, 1:29pm

Welcome to the forum. Non-integer do loop variables were removed from Fortran 95 and later standards, as discussed here.

Edit:
Thanks to @ricebunny for the suggestion. It appears that a loop with real loop variable is slightly faster with gfortran -Ofast on Windows, with output

Time, result, method: 3.765625 1.71364934657028 original            
Time, result, method: 3.640625 1.71364934657028 real_loop_simulated 
Time, result, method: 3.625000 1.71364934657028 real_loop

for the code

program avx
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: t1, t2, r
integer :: iter
integer, parameter :: n = 100000000, nmethods = 3
character (len=*), parameter :: methods(nmethods) = [character (len=20) :: "original ","real_loop_simulated","real_loop"]
do iter=1,3
   call cpu_time(t1)
   if (iter == 1) then
      r = f(n)
   else if (iter == 2) then
      r = g(n)
   else
      r = h(n)
   end if
   call cpu_time(t2)
   write (*,"(a,1x,f0.6,1x,f0.14,1x,a)") "Time, result, method:", t2-t1, r, methods(iter)
end do

contains

    real(dp) function f(N) result(r)
    integer, intent(in) :: N
    integer :: i
    r = 0
    do i = 1, N
        r = r + sin(real(i,dp))
    end do
    end function f

    real(dp) function g(N) result(r)
    integer, intent(in) :: N
    integer :: i
    real(dp) :: x
    r = 0
    x = 0.0_dp
    do i = 1, N ! simulate loop with real loop variable
        x = x + 1.0_dp
        r = r + sin(x)
    end do
    end function g

    real(dp) function h(N) result(r)
    integer, intent(in) :: N
    real(dp) :: i
    r = 0
    do i = 1, N ! loop with real loop variable -- nonstandard
        r = r + sin(i)
    end do
    end function h

end program

ricebunny · November 20, 2021, 3:44pm

@Beliavsky
With that overhead removed, are you not essentially comparing the implementation of the library sine function in Fortran and Julia?

The summation is likely only a small fraction of the total computational cost. Even when executed as a CPU instruction, sine has a latency between 150 and 200 cycles.

Secondly: are you sure that both libraries implement a vectorized version of the sine function? I am not a Fortran programer, but I know that sometimes you need to use the correct compiler flags to get vectorized code. Is your Fortran code using SVML? Maybe this helps: fimf-use-svml, Qimf-use-svml

certik · November 20, 2021, 5:23pm

It’s not using the same sin implementation. The Julia version is using very optimized version and then LoopVectorization.jl inlines it properly in the loop and vectorizes everything. Fortran compilers should do the same, but currently they don’t. We are in the process of creating very fast pure Fortran sin implementations to be used for examples like these.

JohnCampbell · November 21, 2021, 3:44am

Why not utilise Fortran advantage and adopt !$OMP

program avx
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: t1, t2, r

call elapse_time(t1)
r = f(100000000)
call elapse_time(t2)

print *, "Time", t2-t1
print *, r

contains

    real(dp) function f(N) result(r)
    integer, intent(in) :: N
    integer  :: i
    real(dp) :: x, s
    s = 0
  !$omp parallel do shared (n), private (i,x) reduction (+ : s) schedule (static)
    do i = 1, N
        x = i
        s = s + sin(x)
    end do
  !$omp end parallel do
    r = s
    end function

    subroutine elapse_time (sec)
      real(dp) :: sec
      integer*8 ticks, rate
      call system_clock ( ticks, rate )
      sec = dble(ticks) / dble(rate)
    end subroutine elapse_time

end program

Although, this test is basically comparing the implementation of sin(x) for large x

mohoree · November 24, 2021, 9:34am

Sorry to bump up this old thread. I recently ran into very slow loops (slower than the equivalent ones in MATLAB) with 500k or so iterations which used cosh, cos/sin in each iteration. Replacing these by the Intel mkl vml routines lead to a 5X speed up. My processor only has avx2 instructions and I think if I try it on a processor with avx2-512 instruction set, this can be slashed even further. I use intel oneAPI under Linux.

CRquantum · April 2, 2022, 5:18am

Thank you @mohoree for the great news!
May I ask, could you perhaps may show a small minimal working example about how to use the intel MKL vml routines? Thank you very much indeed!

CRquantum · April 2, 2022, 8:58am

I tried a one line solution for function f but the speed seems slightly slower, Intel Fortran, -O3 -xHost, took 0.65 seconds.

r = sum( sin( dble( (/(i, i = 1,N)/) ) ) )

The do loop takes 0.6 seconds.

Topic		Replies	Views
How to use IFX and offload openMP to GPU?	0	1413	April 2, 2022
Julia: Fast as Fortran, Beautiful as Python	184	10657	November 13, 2022
An interesting video: Python vs Fortran vs GNU Octave/MATLAB --- side by side performance comparison	5	980	January 6, 2022
Comparing Fortran and Julia's Bessel function performance	69	4160	October 23, 2022
Improving Fortran Results in the Julia Micro-benchmarks Help	44	3785	June 23, 2022

Simple summation 8x slower than in Julia

Related Topics