Simple summation 8x slower than in Julia

Having a compiler know about the relationship between log_scalar and log_vector sounds to me like you really wish you had multiple dispatch :slight_smile:.

In Fortran you can write a generic function like this:

interface log
    module procedure log_scalar, log_vector
end interface

contains

subroutine log_scalar(x)
real(dp), intent(out) :: x
...
end subroutine

subroutine log_vector(x)
real(dp), intent(out) :: x(:)
...
end subroutine

But I don’t know if compilers would switch from log_scalar(x) to log_vector(x) when optimizing code like this one:

do i = 1, N
    A(i) = log(x(i))
end do
5 Likes

Why does your Fortran code convert an integer to a real inside a hot loop?

Shouldn’t your iterator be of type real to get decent performance?

3 Likes

Welcome to the forum. Non-integer do loop variables were removed from Fortran 95 and later standards, as discussed here.

Edit:
Thanks to @ricebunny for the suggestion. It appears that a loop with real loop variable is slightly faster with gfortran -Ofast on Windows, with output

Time, result, method: 3.765625 1.71364934657028 original            
Time, result, method: 3.640625 1.71364934657028 real_loop_simulated 
Time, result, method: 3.625000 1.71364934657028 real_loop

for the code

program avx
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: t1, t2, r
integer :: iter
integer, parameter :: n = 100000000, nmethods = 3
character (len=*), parameter :: methods(nmethods) = [character (len=20) :: "original ","real_loop_simulated","real_loop"]
do iter=1,3
   call cpu_time(t1)
   if (iter == 1) then
      r = f(n)
   else if (iter == 2) then
      r = g(n)
   else
      r = h(n)
   end if
   call cpu_time(t2)
   write (*,"(a,1x,f0.6,1x,f0.14,1x,a)") "Time, result, method:", t2-t1, r, methods(iter)
end do

contains

    real(dp) function f(N) result(r)
    integer, intent(in) :: N
    integer :: i
    r = 0
    do i = 1, N
        r = r + sin(real(i,dp))
    end do
    end function f

    real(dp) function g(N) result(r)
    integer, intent(in) :: N
    integer :: i
    real(dp) :: x
    r = 0
    x = 0.0_dp
    do i = 1, N ! simulate loop with real loop variable
        x = x + 1.0_dp
        r = r + sin(x)
    end do
    end function g

    real(dp) function h(N) result(r)
    integer, intent(in) :: N
    real(dp) :: i
    r = 0
    do i = 1, N ! loop with real loop variable -- nonstandard
        r = r + sin(i)
    end do
    end function h

end program
3 Likes

@Beliavsky
With that overhead removed, are you not essentially comparing the implementation of the library sine function in Fortran and Julia?

The summation is likely only a small fraction of the total computational cost. Even when executed as a CPU instruction, sine has a latency between 150 and 200 cycles.

Secondly: are you sure that both libraries implement a vectorized version of the sine function? I am not a Fortran programer, but I know that sometimes you need to use the correct compiler flags to get vectorized code. Is your Fortran code using SVML? Maybe this helps: fimf-use-svml, Qimf-use-svml

2 Likes

It’s not using the same sin implementation. The Julia version is using very optimized version and then LoopVectorization.jl inlines it properly in the loop and vectorizes everything. Fortran compilers should do the same, but currently they don’t. We are in the process of creating very fast pure Fortran sin implementations to be used for examples like these.

5 Likes

Why not utilise Fortran advantage and adopt !$OMP

program avx
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: t1, t2, r

call elapse_time(t1)
r = f(100000000)
call elapse_time(t2)

print *, "Time", t2-t1
print *, r

contains

    real(dp) function f(N) result(r)
    integer, intent(in) :: N
    integer  :: i
    real(dp) :: x, s
    s = 0
  !$omp parallel do shared (n), private (i,x) reduction (+ : s) schedule (static)
    do i = 1, N
        x = i
        s = s + sin(x)
    end do
  !$omp end parallel do
    r = s
    end function

    subroutine elapse_time (sec)
      real(dp) :: sec
      integer*8 ticks, rate
      call system_clock ( ticks, rate )
      sec = dble(ticks) / dble(rate)
    end subroutine elapse_time

end program

Although, this test is basically comparing the implementation of sin(x) for large x

1 Like

Sorry to bump up this old thread. I recently ran into very slow loops (slower than the equivalent ones in MATLAB) with 500k or so iterations which used cosh, cos/sin in each iteration. Replacing these by the Intel mkl vml routines lead to a 5X speed up. My processor only has avx2 instructions and I think if I try it on a processor with avx2-512 instruction set, this can be slashed even further. I use intel oneAPI under Linux.

2 Likes

Thank you @mohoree for the great news!
May I ask, could you perhaps may show a small minimal working example about how to use the intel MKL vml routines? Thank you very much indeed!

I tried a one line solution for function f but the speed seems slightly slower, Intel Fortran, -O3 -xHost, took 0.65 seconds.

r = sum( sin( dble( (/(i, i = 1,N)/) ) ) )

:sweat_smile:

The do loop takes 0.6 seconds.