We did some comparison on a simple (single core) sum in Julia and Fortran:
Here is the code in Julia:
using LoopVectorization
function f_avx(N)
s = 0.0
@avxt for i in 1:N
s += sin(i)
end
s
end
@time r = f_avx(100000000)
println(r)
Compile and run:
$ julia avx.jl
0.185973 seconds (1 allocation: 16 bytes)
1.713649346570267
And in Fortran:
program avx
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: t1, t2, r
call cpu_time(t1)
r = f(100000000)
call cpu_time(t2)
print *, "Time", t2-t1
print *, r
contains
real(dp) function f(N) result(r)
integer, intent(in) :: N
integer :: i
r = 0
do i = 1, N
r = r + sin(real(i,dp))
end do
end function
end program
Compile and run:
$ gfortran -Ofast avx.f90 && ./a.out
Time 1.4526789999999998
1.7136493465705178
So Julia is about 8x faster.
If you don’t use the @avxt macro, then Julia is about 2x slower than Fortran.
I checked the assembly, and without @avxt then neither Julia nor GFortran use the AVX instructions. With the @avxt macro, it seems Julia is using AVX.
I think it is -xCORE-AVX2, but -xHost should be equivalent to it on right hardware. I just tried and get the same timings for -xHost and -xCORE-AVX2, and slower timings without setting them up.
I just learned that @avxt is for parallel execution which you can control using julia -t4 test.jl for 4 threads. Can you change @avxt to @avx in Julia and rerun? That is the serial version, just in case.
On Windows, measuring wall clock time using the timethis.exe utility, gfortran and especially Intel Fortran are much faster than Julia:
c:\julia>timethis julia avx.jl
0.237056 seconds (1 allocation: 16 bytes)
1.713649346570267
TimeThis : Command Line : julia avx.jl
TimeThis : Start Time : Fri May 07 19:49:43 2021
TimeThis : End Time : Fri May 07 19:49:54 2021
TimeThis : Elapsed Time : 00:00:11.656
c:\julia>gfortran -O2 sum_sin.f90
c:\julia>timethis a.exe
1.7136493465702802
TimeThis : Command Line : a.exe
TimeThis : Start Time : Fri May 07 19:49:55 2021
TimeThis : End Time : Fri May 07 19:49:59 2021
TimeThis : Elapsed Time : 00:00:04.187
c:\julia>ifort -nologo -O2 sum_sin.f90
c:\julia>timethis sum_sin.exe
1.71364934656989
TimeThis : Command Line : sum_sin.exe
TimeThis : Start Time : Fri May 07 19:49:59 2021
TimeThis : End Time : Fri May 07 19:50:01 2021
TimeThis : Elapsed Time : 00:00:01.281
The sum_sin.f90 code is
module m
implicit none
integer, parameter :: dp = kind(1.0d0)
contains
pure function f(n) result(fsum)
integer, intent(in) :: n
real(kind=dp) :: fsum
integer :: i
fsum = 0.0_dp
do i=1,n
fsum = fsum + sin(real(i,kind=dp))
end do
end function f
end module m
!
program main
use m, only: f
implicit none
print*,f(100000000)
end program main
Julia appears to have a large fixed cost for doing calculations and a low variable cost. So running the calculation 10 times with
using LoopVectorization
function f_avx(N)
s = 0.0
@avxt for i in 1:N
s += sin(i)
end
return(s)
end
for i in 1:10
@time r = f_avx(100000000)
println(r)
end
takes only 13.8 s of wall time. For the corresponding Fortran program gfortran and Intel take 38 s and 11 s. Ideally the Fortran compilers would recognize that a PURE function of a constant argument needs to be computed only once. The Fortran code is
module m
implicit none
integer, parameter :: dp = kind(1.0d0)
contains
pure function f(n) result(fsum)
integer, intent(in) :: n
real(kind=dp) :: fsum
integer :: i
fsum = 0.0_dp
do i=1,n
fsum = fsum + sin(real(i,kind=dp))
end do
end function f
end module m
!
program main
use m, only: f
implicit none
integer :: i
do i=1,10
print*,f(100000000)
end do
end program main
This doesn’t have to do with caching or anything fancy like that. This is just that you are timing compile time and run-time together. To prove this, you can change it to
for i in 1:10
@time r = f_avx(i*100000000)
println(r)
end
so purity analysis can’t help you. Julia will still be faster in this case because it’s a faster program.
Julia has very fast trig functions. Removing the sin calculation and just computing the sum of integers
using LoopVectorization
function g_avx(N)
s = 0.0
@avxt for i in 1:N
s += Float64(i)
end
return(s)
end
for i in 1:100
@time r = g_avx(100000000)
println(r)
end
the time per loop is 0.025 s with output such as
0.025127 seconds
5.00000005e15
For Intel Fortran -O2 the wall time for the same calculation is 2.5 s, which is the same time per loop. The program is
module m
implicit none
integer, parameter :: dp = kind(1.0d0)
contains
pure function g(n) result(fsum)
integer, intent(in) :: n
real(kind=dp) :: fsum
integer :: i
fsum = 0.0_dp
do i=1,n
fsum = fsum + real(i,kind=dp)
end do
end function g
end module m
!
program main
use m, only: g
implicit none
integer :: i
do i=1,100
print*,g(100000000)
end do
end program main
@certik over the years, I have realized that a design philosophy of Fortran is to keep hardware-specific optimizations and directives away from the core language syntax and only provide hints to the compiler for optimizations. This has been likely one of the main reasons behind code longevity and backward compatibility in Fortran. From this perspective, does it even make sense to introduce a hardware-specific macro/keyword in the code as is done in the Julia script here? What happens to all @avx macros if a new technology dominates the market next year? I am not a professional in Julia, so I am asking those who are to shed light on this matter.
The @avx macro is from the LoopVectorization.jl package, not from the base language. The beauty of julia is that this sort of thing doesn’t necessarily need to be baked in.
That said, thanks to LLVM, I’m pretty sure that @avx already works on non-x86 hardware such as Apple’s M1 (though I guess the name @avx is a bit of a misnomer at that point because the instructions aren’t AVX instructions).