Simple summation 8x slower than in Julia

We did a comparison of a simple (single-core) sum in Julia and Fortran.

Here is the code in Julia:

using LoopVectorization

function f_avx(N)
    s = 0.0
    @avxt for i in 1:N
        s += sin(i)
    end
    s
end

@time r = f_avx(100000000)
println(r)

Compile and run:

$ julia avx.jl
  0.185973 seconds (1 allocation: 16 bytes)
1.713649346570267

And in Fortran:

program avx
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: t1, t2, r

call cpu_time(t1)
r = f(100000000)
call cpu_time(t2)

print *, "Time", t2-t1
print *, r

contains

    real(dp) function f(N) result(r)
    integer, intent(in) :: N
    integer :: i
    r = 0
    do i = 1, N
        r = r + sin(real(i,dp))
    end do
    end function

end program

Compile and run:

$ gfortran -Ofast avx.f90 && ./a.out
 Time   1.4526789999999998     
   1.7136493465705178     

So Julia is about 8x faster.

If you don’t use the @avxt macro, then Julia is about 2x slower than Fortran.

I checked the assembly, and without @avxt, neither Julia nor GFortran uses AVX instructions. With the @avxt macro, Julia appears to be using AVX.
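
For reference, one way to repeat that check on the Fortran side is to look at the generated assembly or ask gfortran for a vectorization report, e.g. (standard gfortran options):

$ gfortran -Ofast -march=native -S avx.f90
$ gfortran -Ofast -march=native -fopt-info-vec avx.f90

The first command writes avx.s for inspection; the second prints which loops were vectorized.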

Any ideas on how to speed up the Fortran version?
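
One untested idea, sketched below with a made-up name f_array (it would sit next to f in the contains section above), is to rewrite the loop as an array expression so that the compiler is free to dispatch sin to a vectorized math library, if it ships one. The temporary array costs about 800 MB for N = 100000000, so this is only a hint, not a drop-in replacement:

    real(dp) function f_array(N) result(r)
    ! Sketch: fill an array, apply the elemental sin to all of it, reduce with sum.
    ! Whether this is faster depends on the compiler and the math library it links.
    integer, intent(in) :: N
    real(dp), allocatable :: x(:)
    integer :: i
    allocate(x(N))
    do i = 1, N
        x(i) = real(i, dp)
    end do
    r = sum(sin(x))
    end function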

Let’s have at it

Here are my timings with different compilers:

$ gfortran -Ofast -march=native test.f90 && ./a.out
 Time   1.4684500000000000
   1.7136493465701697
$ ifort -fast -xHost test.f90 && ./a.out
ipo: warning #11021: unresolved __ehdr_start
        Referenced in libc.a(libc-start.o)
 Time  0.699488000000000
   1.71364934657041
$ nvfortran -fast test.f90 && ./a.out                                                                                                                                                       
/usr/bin/ld: warning: /home/pedro/software/nvidia/hpc_sdk/Linux_x86_64/21.3/compilers/lib/nvhpc.ld contains output sections; did you forget -T?
 Time   0.2167658805847168
    1.713649346569483
@pcosta, can you try Julia on your computer also? It looks like nvfortran should match Julia’s performance.

It seems like they do, yes.

I don’t recall the exact syntax, but ifort has a set of AVX compile-time flags. I wonder if that would help.

@themos’ link suggests -O4 -mavx2 for ifort.

@pcosta, can you try Julia on your computer also? It looks like nvfortran should match Julia’s performance.

Julia seems about 2x faster than nvfortran:

$ julia test.jl
  0.096500 seconds (1 allocation: 16 bytes)
1.7136493465698832
I think it is -xCORE-AVX2, but -xHost should be equivalent to it on the right hardware. I just tried it and get the same timings for -xHost and -xCORE-AVX2, and slower timings without either flag.

I just learned that @avxt is for parallel execution, which you can control with julia -t4 test.jl for 4 threads. Can you change @avxt to @avx in Julia and rerun? That is the serial version, just in case.

Ok! I see… I tried and I got the same timing with @avx.

Julia links openlibm, but uses its own implementations for almost everything at this point. sin (and the rest of the trig functions) is implemented in pure Julia as Base.sin.

LoopVectorization.@avx uses SLEEFPirates.sin_fast.

For reference, this is the asm I get on a Cascade Lake system:

        .text
        push    rbp
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        movabs  r15, offset .rodata.cst8
        movabs  r11, 139717300456024
        movabs  r14, 139717300456016
        movabs  r9, 139717300456032
        movabs  r10, 139717300456056
        movabs  r8, 139717300456064
        movabs  r12, 139717300456080
        movabs  rax, 139717300456088
        movabs  rsi, 139717300456144
        movabs  rbp, 139717300456120
        movabs  rdx, 139717300456128
        movabs  r13, 139717300456096
        cmp     rdi, 16
        jae     L187
        mov     r14, r12
        mov     r11, rdx
        vxorpd  xmm0, xmm0, xmm0
        mov     r12d, 1
        movabs  r8, 139717300456072
        movabs  r10, 139717300456048
        movabs  r15, 139717300456040
        jmp     L941
L187:
        mov     rbx, r13
        mov     qword ptr [rsp - 8], rdi
        lea     r13, [rdi - 15]
        mov     rdi, r12
        movabs  rcx, offset .rodata
        vmovapd zmm23, zmmword ptr [rcx]
        mov     r12d, 1
        vbroadcastsd    zmm24, qword ptr [r15]
        vbroadcastsd    zmm2, qword ptr [r14]
        vbroadcastsd    zmm3, qword ptr [r11]
        vbroadcastsd    zmm4, qword ptr [r9]
        movabs  rcx, 139717300456040
        vbroadcastsd    zmm5, qword ptr [rcx]
        movabs  rcx, 139717300456048
        vbroadcastsd    zmm6, qword ptr [rcx]
        vbroadcastsd    zmm7, qword ptr [r10]
        vbroadcastsd    zmm8, qword ptr [r8]
        movabs  rcx, 139717300456072
        vbroadcastsd    zmm9, qword ptr [rcx]
        vbroadcastsd    zmm10, qword ptr [rdi]
        vbroadcastsd    zmm11, qword ptr [rax]
        vbroadcastsd    zmm12, qword ptr [rbx]
        movabs  rcx, 139717300456104
        vbroadcastsd    zmm13, qword ptr [rcx]
        movabs  rcx, 139717300456112
        vbroadcastsd    zmm14, qword ptr [rcx]
        vbroadcastsd    zmm15, qword ptr [rbp]
        mov     r11, rdx
        vbroadcastsd    zmm16, qword ptr [rdx]
        movabs  rcx, 139717300456136
        vbroadcastsd    zmm18, qword ptr [rcx]
        vbroadcastsd    zmm19, qword ptr [rsi]
        vxorpd  xmm17, xmm17, xmm17
        movabs  rcx, 139717300456152
        vbroadcastsd    zmm20, qword ptr [rcx]
        movabs  rcx, 139717300456160
        vbroadcastsd    zmm22, qword ptr [rcx]
        vxorpd  xmm21, xmm21, xmm21
        nop     dword ptr [rax + rax]
L448:
        vcvtsi2sd       xmm0, xmm31, r12
        vbroadcastsd    zmm0, xmm0
        lea     rcx, [r12 + 8]
        vcvtsi2sd       xmm1, xmm31, rcx
        vaddpd  zmm0, zmm0, zmm23
        vbroadcastsd    zmm1, xmm1
        vaddpd  zmm1, zmm1, zmm23
        vmulpd  zmm25, zmm0, zmm24
        vmulpd  zmm26, zmm1, zmm24
        vrndscalepd     zmm25, zmm25, 11
        vrndscalepd     zmm26, zmm26, 11
        vmulpd  zmm27, zmm25, zmm3
        vmulpd  zmm28, zmm26, zmm3
        vmovapd zmm29, zmm2
        vfmsub213pd     zmm29, zmm0, zmm27      # zmm29 = (zmm0 * zmm29) - zmm27
        vmovapd zmm30, zmm2
        vfmsub213pd     zmm30, zmm1, zmm28      # zmm30 = (zmm1 * zmm30) - zmm28
        vrndscalepd     zmm29, zmm29, 12
        vrndscalepd     zmm30, zmm30, 12
        vfmadd231pd     zmm0, zmm25, zmm4       # zmm0 = (zmm25 * zmm4) + zmm0
        vfmadd231pd     zmm1, zmm26, zmm4       # zmm1 = (zmm26 * zmm4) + zmm1
        vfmadd231pd     zmm0, zmm29, zmm5       # zmm0 = (zmm29 * zmm5) + zmm0
        vfmadd231pd     zmm1, zmm30, zmm5       # zmm1 = (zmm30 * zmm5) + zmm1
        vfmadd231pd     zmm0, zmm25, zmm6       # zmm0 = (zmm25 * zmm6) + zmm0
        vfmadd231pd     zmm1, zmm26, zmm6       # zmm1 = (zmm26 * zmm6) + zmm1
        vfmadd231pd     zmm0, zmm29, zmm7       # zmm0 = (zmm29 * zmm7) + zmm0
        vfmadd231pd     zmm1, zmm30, zmm7       # zmm1 = (zmm30 * zmm7) + zmm1
        vfmadd231pd     zmm0, zmm8, zmm25       # zmm0 = (zmm8 * zmm25) + zmm0
        vfmadd231pd     zmm1, zmm8, zmm26       # zmm1 = (zmm8 * zmm26) + zmm1
        vfmadd231pd     zmm0, zmm29, zmm9       # zmm0 = (zmm29 * zmm9) + zmm0
        vaddpd  zmm25, zmm27, zmm29
        vaddpd  zmm26, zmm28, zmm30
        vfmadd231pd     zmm1, zmm30, zmm9       # zmm1 = (zmm30 * zmm9) + zmm1
        vfmadd213pd     zmm25, zmm10, zmm0      # zmm25 = (zmm10 * zmm25) + zmm0
        vfmadd213pd     zmm26, zmm10, zmm1      # zmm26 = (zmm10 * zmm26) + zmm1
        vmulpd  zmm0, zmm25, zmm25
        vmulpd  zmm1, zmm26, zmm26
        vcvttpd2qq      zmm27, zmm30
        vcvttpd2qq      zmm28, zmm29
        vpsllq  zmm28, zmm28, 63
        vpmovq2m        k1, zmm28
        vpsllq  zmm27, zmm27, 63
        vpmovq2m        k2, zmm27
        vxorpd  zmm25 {k1}, zmm25, zmm11
        vmovapd zmm27, zmm13
        vxorpd  zmm26 {k2}, zmm26, zmm11
        vfmadd213pd     zmm27, zmm0, zmm12      # zmm27 = (zmm0 * zmm27) + zmm12
        vmovapd zmm28, zmm13
        vfmadd213pd     zmm28, zmm1, zmm12      # zmm28 = (zmm1 * zmm28) + zmm12
        vfmadd213pd     zmm27, zmm0, zmm14      # zmm27 = (zmm0 * zmm27) + zmm14
        vfmadd213pd     zmm28, zmm1, zmm14      # zmm28 = (zmm1 * zmm28) + zmm14
        vfmadd213pd     zmm27, zmm0, zmm15      # zmm27 = (zmm0 * zmm27) + zmm15
        vfmadd213pd     zmm28, zmm1, zmm15      # zmm28 = (zmm1 * zmm28) + zmm15
        vfmadd213pd     zmm27, zmm0, zmm16      # zmm27 = (zmm0 * zmm27) + zmm16
        vfmadd213pd     zmm28, zmm1, zmm16      # zmm28 = (zmm1 * zmm28) + zmm16
        vfmadd213pd     zmm27, zmm0, zmm18      # zmm27 = (zmm0 * zmm27) + zmm18
        vfmadd213pd     zmm28, zmm1, zmm18      # zmm28 = (zmm1 * zmm28) + zmm18
        vfmadd213pd     zmm27, zmm0, zmm19      # zmm27 = (zmm0 * zmm27) + zmm19
        vfmadd213pd     zmm28, zmm1, zmm19      # zmm28 = (zmm1 * zmm28) + zmm19
        vfmadd213pd     zmm27, zmm0, zmm20      # zmm27 = (zmm0 * zmm27) + zmm20
        vfmadd213pd     zmm28, zmm1, zmm20      # zmm28 = (zmm1 * zmm28) + zmm20
        vfmadd213pd     zmm27, zmm0, zmm22      # zmm27 = (zmm0 * zmm27) + zmm22
        vfmadd213pd     zmm28, zmm1, zmm22      # zmm28 = (zmm1 * zmm28) + zmm22
        vmulpd  zmm27, zmm25, zmm27
        vmulpd  zmm28, zmm26, zmm28
        vfmadd213pd     zmm27, zmm0, zmm25      # zmm27 = (zmm0 * zmm27) + zmm25
        vaddpd  zmm21, zmm21, zmm27
        vfmadd213pd     zmm28, zmm1, zmm26      # zmm28 = (zmm1 * zmm28) + zmm26
        vaddpd  zmm17, zmm17, zmm28
        add     r12, 16
        cmp     r12, r13
        jle     L448
        vaddpd  zmm0, zmm21, zmm17
        movabs  r13, 139717300456096
        movabs  r14, 139717300456080
        movabs  r8, 139717300456072
        movabs  r10, 139717300456048
        movabs  r15, 139717300456040
        mov     rdi, qword ptr [rsp - 8]
L941:
        lea     rcx, [rdi - 7]
        movabs  rsi, 139717300456256
        cmp     r12, rcx
        movabs  rbp, 139717300456056
        movabs  rdx, 139717300456032
        movabs  rbx, 139717300456016
        jle     L998
        mov     r9, rsi
        jmp     L1488
L998:
        mov     r9, rsi
        vmovdqa64       zmm1, zmmword ptr [rsi]
        movabs  rsi, offset .rodata.cst8
        vbroadcastsd    zmm2, qword ptr [rsi]
        movabs  rsi, 139717300456024
        vbroadcastsd    zmm3, qword ptr [rsi]
        vbroadcastsd    zmm4, qword ptr [rbx]
        vbroadcastsd    zmm5, qword ptr [rdx]
        vbroadcastsd    zmm6, qword ptr [r15]
        vbroadcastsd    zmm7, qword ptr [r10]
        vbroadcastsd    zmm8, qword ptr [rbp]
        movabs  rsi, 139717300456064
        vbroadcastsd    zmm9, qword ptr [rsi]
        vbroadcastsd    zmm10, qword ptr [r8]
        vbroadcastsd    zmm11, qword ptr [r14]
        movabs  rsi, 139717300456152
        vbroadcastsd    zmm12, qword ptr [rsi]
        movabs  rsi, 139717300456160
        vbroadcastsd    zmm13, qword ptr [rsi]
        movabs  rsi, 139717300456136
        vbroadcastsd    zmm14, qword ptr [rsi]
        movabs  rsi, 139717300456144
        vbroadcastsd    zmm15, qword ptr [rsi]
        movabs  rsi, 139717300456120
        vbroadcastsd    zmm16, qword ptr [rsi]
        vbroadcastsd    zmm17, qword ptr [r11]
        vbroadcastsd    zmm18, qword ptr [r13]
        movabs  rsi, 139717300456112
        vbroadcastsd    zmm19, qword ptr [rsi]
        movabs  rsi, 139717300456104
        vbroadcastsd    zmm20, qword ptr [rsi]
        nop     word ptr [rax + rax]
L1232:
        vpbroadcastq    zmm21, r12
        vpaddq  zmm21, zmm21, zmm1
        vcvtqq2pd       zmm21, zmm21
        vmulpd  zmm22, zmm21, zmm2
        vrndscalepd     zmm22, zmm22, 11
        vmulpd  zmm23, zmm22, zmm3
        vmovapd zmm24, zmm23
        vfmsub231pd     zmm24, zmm21, zmm4      # zmm24 = (zmm21 * zmm4) - zmm24
        vrndscalepd     zmm24, zmm24, 12
        vfmadd231pd     zmm21, zmm22, zmm5      # zmm21 = (zmm22 * zmm5) + zmm21
        vfmadd231pd     zmm21, zmm24, zmm6      # zmm21 = (zmm24 * zmm6) + zmm21
        vfmadd231pd     zmm21, zmm22, zmm7      # zmm21 = (zmm22 * zmm7) + zmm21
        vfmadd231pd     zmm21, zmm24, zmm8      # zmm21 = (zmm24 * zmm8) + zmm21
        vfmadd231pd     zmm21, zmm22, zmm9      # zmm21 = (zmm22 * zmm9) + zmm21
        vfmadd231pd     zmm21, zmm24, zmm10     # zmm21 = (zmm24 * zmm10) + zmm21
        vaddpd  zmm22, zmm23, zmm24
        vfmadd132pd     zmm22, zmm21, zmm11     # zmm22 = (zmm22 * zmm11) + zmm21
        vmulpd  zmm21, zmm22, zmm22
        vcvttpd2qq      zmm23, zmm24
        vpsllq  zmm23, zmm23, 63
        vpmovq2m        k1, zmm23
        vxorpd  zmm22 {k1}, zmm22, qword ptr [rax]{1to8}
        vmulpd  zmm23, zmm21, zmm21
        vmovapd zmm24, zmm12
        vfmadd213pd     zmm24, zmm21, zmm13     # zmm24 = (zmm21 * zmm24) + zmm13
        vmovapd zmm25, zmm14
        vfmadd213pd     zmm25, zmm21, zmm15     # zmm25 = (zmm21 * zmm25) + zmm15
        vmovapd zmm26, zmm16
        vmovapd zmm27, zmm18
        vfmadd213pd     zmm26, zmm21, zmm17     # zmm26 = (zmm21 * zmm26) + zmm17
        vfmadd213pd     zmm27, zmm21, zmm19     # zmm27 = (zmm21 * zmm27) + zmm19
        vmulpd  zmm28, zmm23, zmm23
        vfmadd213pd     zmm25, zmm23, zmm24     # zmm25 = (zmm23 * zmm25) + zmm24
        vfmadd213pd     zmm27, zmm23, zmm26     # zmm27 = (zmm23 * zmm27) + zmm26
        vmulpd  zmm23, zmm28, zmm28
        vfmadd213pd     zmm27, zmm28, zmm25     # zmm27 = (zmm28 * zmm27) + zmm25
        vfmadd231pd     zmm27, zmm23, zmm20     # zmm27 = (zmm23 * zmm20) + zmm27
        vmulpd  zmm23, zmm22, zmm27
        vfmadd213pd     zmm23, zmm21, zmm22     # zmm23 = (zmm21 * zmm23) + zmm22
        vaddpd  zmm0, zmm0, zmm23
        add     r12, 8
        cmp     r12, rcx
        jle     L1232
L1488:
        cmp     r12, rdi
        ja      L1859
        vpbroadcastq    zmm1, r12
        vpaddq  zmm1, zmm1, zmmword ptr [r9]
        vcvtqq2pd       zmm1, zmm1
        movabs  rcx, offset .rodata.cst8
        vmulpd  zmm2, zmm1, qword ptr [rcx]{1to8}
        vrndscalepd     zmm2, zmm2, 11
        movabs  rcx, 139717300456024
        vmulpd  zmm3, zmm2, qword ptr [rcx]{1to8}
        vbroadcastsd    zmm4, qword ptr [rbx]
        vfmsub213pd     zmm4, zmm1, zmm3        # zmm4 = (zmm1 * zmm4) - zmm3
        vfmadd231pd     zmm1, zmm2, qword ptr [rdx]{1to8} # zmm1 = (zmm2 * mem) + zmm1
        vrndscalepd     zmm4, zmm4, 12
        vfmadd231pd     zmm1, zmm4, qword ptr [r15]{1to8} # zmm1 = (zmm4 * mem) + zmm1
        vfmadd231pd     zmm1, zmm2, qword ptr [r10]{1to8} # zmm1 = (zmm2 * mem) + zmm1
        vfmadd231pd     zmm1, zmm4, qword ptr [rbp]{1to8} # zmm1 = (zmm4 * mem) + zmm1
        movabs  rcx, 139717300456064
        vfmadd231pd     zmm1, zmm2, qword ptr [rcx]{1to8} # zmm1 = (zmm2 * mem) + zmm1
        vfmadd231pd     zmm1, zmm4, qword ptr [r8]{1to8} # zmm1 = (zmm4 * mem) + zmm1
        vaddpd  zmm2, zmm3, zmm4
        vfmadd132pd     zmm2, zmm1, qword ptr [r14]{1to8} # zmm2 = (zmm2 * mem) + zmm1
        vmulpd  zmm1, zmm2, zmm2
        vcvttpd2qq      zmm3, zmm4
        vpsllq  zmm3, zmm3, 63
        vpmovq2m        k1, zmm3
        vxorpd  zmm2 {k1}, zmm2, qword ptr [rax]{1to8}
        movabs  rax, 139717300456152
        vbroadcastsd    zmm3, qword ptr [rax]
        movabs  rax, 139717300456160
        vfmadd213pd     zmm3, zmm1, qword ptr [rax]{1to8} # zmm3 = (zmm1 * zmm3) + mem
        movabs  rax, 139717300456136
        vbroadcastsd    zmm4, qword ptr [rax]
        movabs  rax, 139717300456144
        vfmadd213pd     zmm4, zmm1, qword ptr [rax]{1to8} # zmm4 = (zmm1 * zmm4) + mem
        movabs  rax, 139717300456120
        vbroadcastsd    zmm5, qword ptr [rax]
        vfmadd213pd     zmm5, zmm1, qword ptr [r11]{1to8} # zmm5 = (zmm1 * zmm5) + mem
        vbroadcastsd    zmm6, qword ptr [r13]
        movabs  rax, 139717300456112
        vfmadd213pd     zmm6, zmm1, qword ptr [rax]{1to8} # zmm6 = (zmm1 * zmm6) + mem
        neg     dil
        and     dil, 7
        vmulpd  zmm7, zmm1, zmm1
        vmulpd  zmm8, zmm7, zmm7
        vfmadd213pd     zmm4, zmm7, zmm3        # zmm4 = (zmm7 * zmm4) + zmm3
        vfmadd213pd     zmm6, zmm7, zmm5        # zmm6 = (zmm7 * zmm6) + zmm5
        vmulpd  zmm3, zmm8, zmm8
        vfmadd213pd     zmm6, zmm8, zmm4        # zmm6 = (zmm8 * zmm6) + zmm4
        movabs  rax, 139717300456104
        vfmadd231pd     zmm6, zmm3, qword ptr [rax]{1to8} # zmm6 = (zmm3 * mem) + zmm6
        mov     al, -1
        mov     ecx, edi
        shr     al, cl
        kmovd   k1, eax
        vmulpd  zmm3, zmm2, zmm6
        vfmadd213pd     zmm3, zmm1, zmm2        # zmm3 = (zmm1 * zmm3) + zmm2
        vaddpd  zmm0 {k1}, zmm0, zmm3
L1859:
        vextractf64x4   ymm1, zmm0, 1
        vaddpd  zmm0, zmm0, zmm1
        vextractf128    xmm1, ymm0, 1
        vaddpd  xmm0, xmm0, xmm1
        vpermilpd       xmm1, xmm0, 1           # xmm1 = xmm0[1,0]
        vaddsd  xmm0, xmm0, xmm1
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        vzeroupper
        ret
        nop     word ptr cs:[rax + rax]
On Windows, measuring wall clock time using the timethis.exe utility, gfortran and especially Intel Fortran are much faster than Julia:

c:\julia>timethis julia avx.jl 
  0.237056 seconds (1 allocation: 16 bytes)
1.713649346570267

TimeThis :  Command Line :  julia avx.jl
TimeThis :    Start Time :  Fri May 07 19:49:43 2021
TimeThis :      End Time :  Fri May 07 19:49:54 2021
TimeThis :  Elapsed Time :  00:00:11.656

c:\julia>gfortran -O2 sum_sin.f90 

c:\julia>timethis a.exe 
   1.7136493465702802     

TimeThis :  Command Line :  a.exe
TimeThis :    Start Time :  Fri May 07 19:49:55 2021
TimeThis :      End Time :  Fri May 07 19:49:59 2021
TimeThis :  Elapsed Time :  00:00:04.187

c:\julia>ifort -nologo -O2 sum_sin.f90 

c:\julia>timethis sum_sin.exe 
   1.71364934656989     

TimeThis :  Command Line :  sum_sin.exe
TimeThis :    Start Time :  Fri May 07 19:49:59 2021
TimeThis :      End Time :  Fri May 07 19:50:01 2021
TimeThis :  Elapsed Time :  00:00:01.281

The sum_sin.f90 code is

module m
implicit none
integer, parameter :: dp = kind(1.0d0)
contains
pure function f(n) result(fsum)
integer, intent(in) :: n
real(kind=dp)       :: fsum
integer             :: i
fsum = 0.0_dp
do i=1,n
   fsum = fsum + sin(real(i,kind=dp))
end do
end function f
end module m
!
program main
use m, only: f
implicit none
print*,f(100000000)
end program main
Julia appears to have a large fixed cost for doing calculations and a low variable cost. So running the calculation 10 times with

using LoopVectorization

function f_avx(N)
    s = 0.0
    @avxt for i in 1:N
        s += sin(i)
    end
    return(s)
end

for i in 1:10
   @time r = f_avx(100000000)
   println(r)
end

takes only 13.8 s of wall time. For the corresponding Fortran program, gfortran and Intel Fortran take 38 s and 11 s, respectively. Ideally the Fortran compilers would recognize that a PURE function of a constant argument needs to be computed only once (a manually hoisted version is sketched after the listing). The Fortran code is

module m
implicit none
integer, parameter :: dp = kind(1.0d0)
contains
pure function f(n) result(fsum)
integer, intent(in) :: n
real(kind=dp)       :: fsum
integer             :: i
fsum = 0.0_dp
do i=1,n
   fsum = fsum + sin(real(i,kind=dp))
end do
end function f
end module m
!
program main
use m, only: f
implicit none
integer :: i
do i=1,10
   print*,f(100000000)
end do
end program main
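
For comparison, a manually hoisted main program, i.e. roughly what an ideal compiler could reduce the loop above to given that f is PURE and its argument is constant, would be (sketch only):

program main
use m, only: f
implicit none
integer, parameter :: dp = kind(1.0d0)
real(kind=dp) :: r
integer       :: i
r = f(100000000)   ! computed once
do i=1,10
   print*,r        ! printed ten times
end do
end program main
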
This doesn’t have anything to do with caching or anything fancy like that; you are just timing compile time and run time together. To prove this, you can change it to

for i in 1:10
   @time r = f_avx(i*100000000)
   println(r)
end

so purity analysis can’t help you. Julia will still be faster in this case because it’s a faster program.

Julia has very fast trig functions. Removing the sin calculation and just computing the sum of integers

using LoopVectorization

function g_avx(N)
    s = 0.0
    @avxt for i in 1:N
        s += Float64(i)
    end
    return(s)
end

for i in 1:100
   @time r = g_avx(100000000)
   println(r)
end

the time per loop is 0.025 s with output such as

  0.025127 seconds
5.00000005e15
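
As a sanity check, the exact value is N*(N+1)/2 = 100000000*100000001/2 = 5.00000005e15, which matches the printed result.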

For Intel Fortran with -O2, the wall time for the same calculation (100 iterations) is 2.5 s, i.e. the same 0.025 s per loop. The program is

module m
implicit none
integer, parameter :: dp = kind(1.0d0)
contains
pure function g(n) result(fsum)
integer, intent(in) :: n
real(kind=dp)       :: fsum
integer             :: i
fsum = 0.0_dp
do i=1,n
   fsum = fsum + real(i,kind=dp)
end do
end function g
end module m
!
program main
use m, only: g
implicit none
integer :: i
do i=1,100
   print*,g(100000000)
end do
end program main
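
A variant of program main that brackets each call with system_clock (a sketch, not what was timed above) would separate the per-call cost from process start-up and output:

program main
use m, only: g
implicit none
integer, parameter :: dp = kind(1.0d0)
integer(kind=8)     :: t0, t1, rate
integer             :: i
do i=1,100
   call system_clock(t0, rate)
   print*,g(100000000)
   call system_clock(t1)
   print*,'per call:', real(t1-t0,dp)/real(rate,dp), ' s'
end do
end program main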

Welcome @Mason and @oscardssmith!

@certik Over the years, I have realized that a design philosophy of Fortran is to keep hardware-specific optimizations and directives away from the core language syntax and to only provide hints to the compiler for optimization. This has likely been one of the main reasons behind code longevity and backward compatibility in Fortran. From this perspective, does it even make sense to introduce a hardware-specific macro/keyword in the code, as is done in the Julia script here? What happens to all the @avx macros if a new technology dominates the market next year? I am not a professional in Julia, so I am asking those who are to shed light on this matter.

The @avx macro is from the LoopVectorization.jl package, not from the base language. The beauty of Julia is that this sort of thing doesn’t necessarily need to be baked in.

That said, thanks to LLVM, I’m pretty sure that @avx already works on non-x86 hardware such as Apple’s M1 (though I guess the name @avx is a bit of a misnomer at that point because the instructions aren’t AVX instructions).
