Having a compiler know about the relationship between log_scalar
and log_vector
sounds to me like you really wish you had multiple dispatch .
In Fortran you can write a generic function like this:
interface log
module procedure log_scalar, log_vector
end interface
contains
subroutine log_scalar(x)
real(dp), intent(out) :: x
...
end subroutine
subroutine log_vector(x)
real(dp), intent(out) :: x(:)
...
end subroutine
But I donât know if compilers would switch from log_scalar(x)
to log_vector(x)
when optimizing code like this one:
do i = 1, N
A(i) = log(x(i))
end do
Why does your Fortran code convert an integer to a real inside a hot loop?
Shouldnât your iterator be of type real to get decent performance?
Welcome to the forum. Non-integer do loop variables were removed from Fortran 95 and later standards, as discussed here.
Edit:
Thanks to @ricebunny for the suggestion. It appears that a loop with real loop variable is slightly faster with gfortran -Ofast on Windows, with output
Time, result, method: 3.765625 1.71364934657028 original
Time, result, method: 3.640625 1.71364934657028 real_loop_simulated
Time, result, method: 3.625000 1.71364934657028 real_loop
for the code
program avx
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: t1, t2, r
integer :: iter
integer, parameter :: n = 100000000, nmethods = 3
character (len=*), parameter :: methods(nmethods) = [character (len=20) :: "original ","real_loop_simulated","real_loop"]
do iter=1,3
call cpu_time(t1)
if (iter == 1) then
r = f(n)
else if (iter == 2) then
r = g(n)
else
r = h(n)
end if
call cpu_time(t2)
write (*,"(a,1x,f0.6,1x,f0.14,1x,a)") "Time, result, method:", t2-t1, r, methods(iter)
end do
contains
real(dp) function f(N) result(r)
integer, intent(in) :: N
integer :: i
r = 0
do i = 1, N
r = r + sin(real(i,dp))
end do
end function f
real(dp) function g(N) result(r)
integer, intent(in) :: N
integer :: i
real(dp) :: x
r = 0
x = 0.0_dp
do i = 1, N ! simulate loop with real loop variable
x = x + 1.0_dp
r = r + sin(x)
end do
end function g
real(dp) function h(N) result(r)
integer, intent(in) :: N
real(dp) :: i
r = 0
do i = 1, N ! loop with real loop variable -- nonstandard
r = r + sin(i)
end do
end function h
end program
@Beliavsky
With that overhead removed, are you not essentially comparing the implementation of the library sine function in Fortran and Julia?
The summation is likely only a small fraction of the total computational cost. Even when executed as a CPU instruction, sine has a latency between 150 and 200 cycles.
Secondly: are you sure that both libraries implement a vectorized version of the sine function? I am not a Fortran programer, but I know that sometimes you need to use the correct compiler flags to get vectorized code. Is your Fortran code using SVML? Maybe this helps: fimf-use-svml, Qimf-use-svml
Itâs not using the same sin implementation. The Julia version is using very optimized version and then LoopVectorization.jl inlines it properly in the loop and vectorizes everything. Fortran compilers should do the same, but currently they donât. We are in the process of creating very fast pure Fortran sin implementations to be used for examples like these.
Why not utilise Fortran advantage and adopt !$OMP
program avx
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: t1, t2, r
call elapse_time(t1)
r = f(100000000)
call elapse_time(t2)
print *, "Time", t2-t1
print *, r
contains
real(dp) function f(N) result(r)
integer, intent(in) :: N
integer :: i
real(dp) :: x, s
s = 0
!$omp parallel do shared (n), private (i,x) reduction (+ : s) schedule (static)
do i = 1, N
x = i
s = s + sin(x)
end do
!$omp end parallel do
r = s
end function
subroutine elapse_time (sec)
real(dp) :: sec
integer*8 ticks, rate
call system_clock ( ticks, rate )
sec = dble(ticks) / dble(rate)
end subroutine elapse_time
end program
Although, this test is basically comparing the implementation of sin(x) for large x
Sorry to bump up this old thread. I recently ran into very slow loops (slower than the equivalent ones in MATLAB) with 500k or so iterations which used cosh, cos/sin in each iteration. Replacing these by the Intel mkl vml routines lead to a 5X speed up. My processor only has avx2 instructions and I think if I try it on a processor with avx2-512 instruction set, this can be slashed even further. I use intel oneAPI under Linux.
Thank you @mohoree for the great news!
May I ask, could you perhaps may show a small minimal working example about how to use the intel MKL vml routines? Thank you very much indeed!
I tried a one line solution for function f but the speed seems slightly slower, Intel Fortran, -O3 -xHost, took 0.65 seconds.
r = sum( sin( dble( (/(i, i = 1,N)/) ) ) )
The do loop takes 0.6 seconds.