Simple summation 8x slower than in Julia

I can’t get this cordic program to compile on my machine so I can’t run my own comparisons. If you want to do multithreading, the simplest way would be to just write

function f_certainly_not_avx(N)
   s = 0.0
   Threads.@threads for i in 1:N
       s += cordic_sine(convert(Float64, i))
   end
   s
end

Not really sure why or how this is relevant to the original question that @certik asked though.

If one is willing to roll up their sleeves, they could make this cordic_sine function work with LoopVectorization.@avx, they just have to add dispatches that act correctly and vectorize across a VectorizationBase.AbstractSIMD type, but it wouldn’t be super simple.

Chris explained a bit on how to do so here: https://julialang.zulipchat.com/#narrow/stream/137791-general/topic/LoopVectorization.20vectorize.20an.20arbitrary.20function

@Mason,

No, “multithreading” is not of interest; the curiosity given the original post is how to do parallel programming in Julia. Regardless, your suggestion of multithreading decreases the performance with the Julia case:

C:\temp>type avx.jl
function cordic_sine(a::Float64)

cos = Ref{Float64}(0.0)
sin = Ref{Float64}(0.0)

ccall((:cossin_cordic, “cordic”), Cvoid, (Float64, Cint, Ref{Float64}, Ref{Float64}), a, 40, cos, sin)

return sin[]

end

function f_avx(N)
s = 0.0
Threads.@threads for i in (Threads.threadid()-1)(N/Threads.nthreads()):Threads.threadid()(N/Threads.nthreads())
s += cordic_sine(convert(Float64, i))
end
s
end

@time r = f_avx(800000000)
println(r)

C:\temp>julia avx.jl
201.834736 seconds (1.60 G allocations: 23.847 GiB, 0.80% gc time, 0.02% compilation time)
1.9006631241572693

C:\temp>

You probably started julia with only 1 thread. On my 6 core machine this is what I see:

julia> function cordic_sine(a)

          cos = Ref{Float64}(0.0)
          sin = Ref{Float64}(0.0)

          # Ref: Excellent work by John Burkardt
          # https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c
          # void cossin_cordic ( double beta, int n, double *c, double *s )
          ccall((:cossin_cordic, "/home/mason/cordic/libcordic.so"), Cvoid, (Float64, Cint, Ref{Float64}, Ref{Float64}), a, 40, cos, sin)

          return sin[]

       end
cordic_sine (generic function with 1 method)

julia> function f_definitely_not_avx(N)
           s = 0.0
           Threads.@threads for i in 1:N
               s += cordic_sine(convert(Float64, i))
           end
           s
       end
f_definitely_not_avx (generic function with 1 method)
julia> @time r = f_definitely_not_avx(800000000)
 57.295734 seconds (1.60 G allocations: 23.846 GiB, 9.56% gc time, 0.04% compilation time)
-0.8053191287737147

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb* (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 6
1 Like

@Mason, here’re the steps I followed on Windows OS with the cordic.c and cordic.h from the FSU site I linked upthread: note on Windows, Julia seems to want a shared library (DLL) from what I could see. Here a Windows definition (.def) file of exported symbols from the library make things clearer in my view:

C:\temp>dir cordic.*
Volume in drive C is OS
Volume Serial Number is C4CA-BA12

Directory of C:\temp

05/08/2021 11:15 AM 63,494 cordic.c
05/08/2021 01:09 PM 50 cordic.def
05/08/2021 11:15 AM 1,196 cordic.h
3 File(s) 64,740 bytes
0 Dir(s) 23,240,077,312 bytes free

C:\temp>type cordic.def
LIBRARY cordic
EXPORTS
cossin_cordic @1

C:\temp>cl /c /O2 cordic.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.27.29112 for x64
Copyright (C) Microsoft Corporation. All rights reserved.

cordic.c

C:\temp>link cordic.obj /dll /def:cordic.def /out:cordic.dll
Microsoft (R) Incremental Linker Version 14.27.29112.0
Copyright (C) Microsoft Corporation. All rights reserved.

Creating library cordic.lib and object cordic.exp

C:\temp>

What exactly do you mean by parallel here? Maybe you mean what I’d recognize as distributed / multiprocess? In that case, you’d want to use the Distributed standard library or something like MPI.jl.

Here’s an example on my 6 core machine:

julia> using Distributed; addprocs(5); nprocs()
6

julia> @everywhere function cordic_sine(a)
           cos = Ref{Float64}() # don't waste time zeroing data that'll get thrown out
           sin = Ref{Float64}() 
           # Ref: Excellent work by John Burkardt
           # https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c
           # void cossin_cordic ( double beta, int n, double *c, double *s )
           ccall((:cossin_cordic, "/home/mason/cordic/libcordic.so"), Cvoid, (Float64, Cint, Ref{Float64}, Ref{Float64}), a, 40, cos, sin)
           return sin[]
       end

julia> function f_distributed(N)
           @distributed (+) for i in 1:N
               cordic_sine(convert(Float64, i))
           end
       end
f_distributed (generic function with 1 method)

julia> @time r = f_distributed(800000000)
 45.842816 seconds (933.50 k allocations: 54.914 MiB, 0.03% gc time, 0.67% compilation time)
1.9006631241562875
julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb* (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 6

Here is the time for the serial version:

julia> function f_serial(N)
          s = 0.0
          for i in 1:N
              s += cordic_sine(convert(Float64, i))
          end
          s
       end
f_avx (generic function with 1 method)

julia> @time r = f_serial(800000000)
216.155306 seconds (1 allocation: 16 bytes)
1.9006631241568408
1 Like

Again though, I really don’t understand that point of this benchmark you’re proposing @FortranFan. Why are we interested in a benchmark that just times Fortran / Julia’s foreign function interfaces in parallel?

As far as I understood, this thread was about @certik wanting to understand how LoopVectorization.jl was generating such fast code in julia and how it could be replicated in Fortran.

2 Likes

On a more general note, people might like to time their Fortran compiler’s SUM intrinsic on the largest array their system will handle and report the nanoseconds per item. When I do this, I find large variations, with the Cray compiler on Intel hardware clearly ahead. That merely demonstrates that vendors prioritise different aspects of performance. You will always find some examples of sub-optimal performance if that is what you set out to look for.
The Fortran community is well aware that GPUs are a challenge for modern Fortran but I don’t think that it has been demonstrated that Fortran cannot adapt to the challenge because of fundamental, baked-in language design issues. I don’t think that fast trig approximations is a battle worth fighting, any more than trying to beat, say, FFTW or MKL DGEMM in Fortran.

1 Like

I’m not really proposing this benchmark, as shown upthread it’s not realistic. The point is only there are 2 parts to this “micro” benchmark: one is some form of non-serial execution of “+” reduction (whether AVX or otherwise) and the next is function evaluation.

If placed on the same basis, there are hardly much of any differences between modern languages with calculations that have any heft to them.

What’s left is simple, convenient way of introducing aspects natively into the language whilst abstracting away the paradigms and separating the hardware-related considerations such as AVX.

For practitioners needing to use Fortran, the above is just a simple example using coarrays.

About "really don’t understand the point " is a question that can be raised of the ones by Jula-lang here.

1 Like

By default, Julia can use more memory than a program compiled by gfortran, so that on my Windows 10 PC, Julia runs

n = 10^9
x = rand(n)
println(sum(x)/n)

in 2.8 s (wall time) while gfortran -O2 cannot do the calculation for n = 3e8 and for 2e8

implicit none
integer, parameter :: n = 2*10**8
real(kind=kind(1.0d0)) :: x(n)
call random_number(x)
print*,n,sum(x)/n
end

takes 1.7 s. Is there a gfortran compiler option that permits larger arrays? I don’t think Julia “sees” that it need not store the entire array x, since for n = 5e9 the Julia program hangs my machine. Gfortran does compile a program with n = 1e9 using the options gfortran -fstack-arrays -fmax-stack-var-size=3000000000 -O2 , but running the program produces no output.

On Windows OS, better to use ALLOCATABLE arrays for the default stack-size is rather low and there are OS restrictions if you’re using 32-bit.

64-bit.

It wouldn’t be able to transform it anyway, so removing the using LoopVectorization and just using for i in would be fine.

LoopVectorization.jl uses a fork of a port of an old version of SLEEF for its sin function, so this is the fairest comparison. But as has been pointed out before, in Julia sin will be inlined, and some things like constant loads will get hoisted out of the loop.

I think it’s more likely that better language integration will go the other way: Julia’s compiler team wants to “rip open” the compiler, better exposing it to library authors to build their own.

LoopVectorization.jl is experimental, and will go through at least one major overhaul before I am comfortable tagging a “1.0” version.

It does. Others have confirmed it works on Power8 and Nvidea Jetson (ARM).
I bought an M1 Mac specifically so I could test and make sure performance isn’t accidentally bad, and can confirm it works there as well (you just need Julia master, and – until we get a Hwloc_jll version compatible with Apple ARM – you’d need to ] dev Hwloc to replace Hwloc_jll with a path to the library installed some other way; I used Homebrew).

Given that LoopVectorization is still at version “0”, now (or, “sooner rather than later”) would be the time to deprecate @avx, and choose a new name for the macro.
@avx seems to consistently cause confusion over whether platforms other than x86 are supported. The reaction to the name seems generally pretty negative.

I’m open to name suggestions here, but I’d probably ask on Julia’s Slack/Zulip.

Why did I pick the name @avx?

  1. Because it is short.
  2. Because @simd was taken.
  3. I didn’t choose @vectorize because there was already a vastly inferior macro with that name in LoopVectorization at the time that many of my packages were using, and I didn’t want to break them. It’s long since been removed. I don’t want the name anymore either, because aside from length, it seems like it’d cause similar confusion if/when matrix-extension support is added. The library name itself makes that mistake, but c’est la vie.

Perhaps @fast, to be fully open ended about how it gets there, and being short?

Depends on the loop.
If autovectorization requires violating IEEE, then it will require annotation of some kind, e.g. @fastmath or @simd from Base Julia.
LoopVectorization.@avx is a macro implemented by a Julia package that implements its own (experimental) autovectorizer.
It does more than just use special functions.

LoopVectorization.jl’s approach does not have that limitation, although I should cleanup erf support, which currently requires specifying verf.

LoopVectorization is also pretty good at nested loops. This is most relevant whenever anyone’s doing things not already easily/well expressed by BLAS, e.g. convolutions and tensor contractions (where expressing them as BLAS has some overhead).

gfortran code for the benchmark
module matrixmul
  use ISO_C_BINDING

contains
! gfortran -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC matmul.f90 -o libmatmul.so

  subroutine AmulB(C, A, B, M, K, N) BIND(C, name="AmulB")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(M, K), intent(in) :: A
    real(C_double), dimension(K, N), intent(in) :: B
    C = matmul(A, B)
  end subroutine AmulB
  subroutine AmulB7x7x7(C, A, B) BIND(C, name="AmulB7x7x7")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: A
    real(C_double), dimension(7, 7), intent(in) :: B
    C = matmul(A, B)
  end subroutine AmulB7x7x7
  subroutine AmulB8x8x8(C, A, B) BIND(C, name="AmulB8x8x8")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: A
    real(C_double), dimension(8, 8), intent(in) :: B
    C = matmul(A, B)
  end subroutine AmulB8x8x8

  subroutine AmulBt(C, A, Bt, M, K, N) BIND(C, name="AmulBt")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(M, K), intent(in) :: A
    real(C_double), dimension(N, K), intent(in) :: Bt
    C = matmul(A, transpose(Bt))
  end subroutine AmulBt
  subroutine AmulBt7x7x7(C, A, Bt) BIND(C, name="AmulBt7x7x7")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: A
    real(C_double), dimension(7, 7), intent(in) :: Bt
    C = matmul(A, transpose(Bt))
  end subroutine AmulBt7x7x7
  subroutine AmulBt8x8x8(C, A, Bt) BIND(C, name="AmulBt8x8x8")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: A
    real(C_double), dimension(8, 8), intent(in) :: Bt
    C = matmul(A, transpose(Bt))
  end subroutine AmulBt8x8x8

  subroutine AtmulB(C, At, B, M, K, N) BIND(C, name="AtmulB")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(K, M), intent(in) :: At
    real(C_double), dimension(K, N), intent(in) :: B
    C = matmul(transpose(At), B)
  end subroutine AtmulB
  subroutine AtmulB7x7x7(C, At, B) BIND(C, name="AtmulB7x7x7")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: At
    real(C_double), dimension(7, 7), intent(in) :: B
    C = matmul(transpose(At), B)
  end subroutine AtmulB7x7x7
  subroutine AtmulB8x8x8(C, At, B) BIND(C, name="AtmulB8x8x8")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: At
    real(C_double), dimension(8, 8), intent(in) :: B
    C = matmul(transpose(At), B)
  end subroutine AtmulB8x8x8

  subroutine AtmulBt(C, At, Bt, M, K, N) BIND(C, name="AtmulBt")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(K, M), intent(in) :: At
    real(C_double), dimension(N, K), intent(in) :: Bt
    C = transpose(matmul(Bt, At))
  end subroutine AtmulBt
  subroutine AtmulBt7x7x7(C, At, Bt) BIND(C, name="AtmulBt7x7x7")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: At
    real(C_double), dimension(7, 7), intent(in) :: Bt
    C = transpose(matmul(Bt, At))
  end subroutine AtmulBt7x7x7
  subroutine AtmulBt8x8x8(C, At, Bt) BIND(C, name="AtmulBt8x8x8")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: At
    real(C_double), dimension(8, 8), intent(in) :: Bt
    C = transpose(matmul(Bt, At))
  end subroutine AtmulBt8x8x8

  subroutine AmulB_loop(C, A, B, M, K, N) BIND(C, name="AmulB_loop")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(M, K), intent(in) :: A
    real(C_double), dimension(K, N), intent(in) :: B
    real(C_double) :: Cmn
    do concurrent(nn = 1:N, mm = 1:M)
       Cmn = 0.0
       do kk = 1,K
          Cmn = Cmn + A(mm,kk) * B(kk,nn)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AmulB_loop
  subroutine AmulB7x7x7_loop(C, A, B) BIND(C, name="AmulB7x7x7_loop")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: A
    real(C_double), dimension(7, 7), intent(in) :: B
    real(C_double) :: Cmn
    do concurrent(nn = 1:7, mm = 1:7)
       Cmn = 0.0
       do kk = 1,7
          Cmn = Cmn + A(mm,kk) * B(kk,nn)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AmulB7x7x7_loop
  subroutine AmulB8x8x8_loop(C, A, B) BIND(C, name="AmulB8x8x8_loop")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: A
    real(C_double), dimension(8, 8), intent(in) :: B
    real(C_double) :: Cmn
    do concurrent(nn = 1:8, mm = 1:8)
       Cmn = 0.0
       do kk = 1,8
          Cmn = Cmn + A(mm,kk) * B(kk,nn)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AmulB8x8x8_loop

  subroutine AmulBt_loop(C, A, Bt, M, K, N) BIND(C, name="AmulBt_loop")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(M, K), intent(in) :: A
    real(C_double), dimension(N, K), intent(in) :: Bt
    real(C_double) :: Cmn
    do concurrent(nn = 1:N, mm = 1:M)
       Cmn = 0.0
       do kk = 1,K
          Cmn = Cmn + A(mm,kk) * Bt(nn,kk)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AmulBt_loop
  subroutine AmulBt7x7x7_loop(C, A, Bt) BIND(C, name="AmulBt7x7x7_loop")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: A
    real(C_double), dimension(7, 7), intent(in) :: Bt
    real(C_double) :: Cmn
    do concurrent(nn = 1:7, mm = 1:7)
       Cmn = 0.0
       do kk = 1,7
          Cmn = Cmn + A(mm,kk) * Bt(nn,kk)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AmulBt7x7x7_loop
  subroutine AmulBt8x8x8_loop(C, A, Bt) BIND(C, name="AmulBt8x8x8_loop")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: A
    real(C_double), dimension(8, 8), intent(in) :: Bt
    real(C_double) :: Cmn
    do concurrent(nn = 1:8, mm = 1:8)
       Cmn = 0.0
       do kk = 1,8
          Cmn = Cmn + A(mm,kk) * Bt(nn,kk)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AmulBt8x8x8_loop

  subroutine AtmulB_loop(C, At, B, M, K, N) BIND(C, name="AtmulB_loop")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(K, M), intent(in) :: At
    real(C_double), dimension(K, N), intent(in) :: B
    real(C_double) :: Cmn
    do concurrent(nn = 1:N, mm = 1:M)
       Cmn = 0.0
       do kk = 1,K
          Cmn = Cmn + At(kk,mm) * B(kk,nn)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AtmulB_loop
  subroutine AtmulB7x7x7_loop(C, At, B) BIND(C, name="AtmulB7x7x7_loop")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: At
    real(C_double), dimension(7, 7), intent(in) :: B
    real(C_double) :: Cmn
    do concurrent(nn = 1:7, mm = 1:7)
       Cmn = 0.0
       do kk = 1,7
          Cmn = Cmn + At(kk,mm) * B(kk,nn)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AtmulB7x7x7_loop
  subroutine AtmulB8x8x8_loop(C, At, B) BIND(C, name="AtmulB8x8x8_loop")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: At
    real(C_double), dimension(8, 8), intent(in) :: B
    real(C_double) :: Cmn
    do concurrent(nn = 1:8, mm = 1:8)
       Cmn = 0.0
       do kk = 1,8
          Cmn = Cmn + At(kk,mm) * B(kk,nn)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AtmulB8x8x8_loop

  subroutine AtmulBt_loop(C, At, Bt, M, K, N) BIND(C, name="AtmulBt_loop")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(K, M), intent(in) :: At
    real(C_double), dimension(N, K), intent(in) :: Bt
    real(C_double) :: Cmn
    do concurrent(nn = 1:N, mm = 1:M)
       Cmn = 0.0
       do kk = 1,K
          Cmn = Cmn + At(kk,mm) * Bt(nn,kk)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AtmulBt_loop
  subroutine AtmulBt7x7x7_loop(C, At, Bt) BIND(C, name="AtmulBt7x7x7_loop")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: At
    real(C_double), dimension(7, 7), intent(in) :: Bt
    real(C_double) :: Cmn
    do concurrent(nn = 1:7, mm = 1:7)
       Cmn = 0.0
       do kk = 1,7
          Cmn = Cmn + At(kk,mm) * Bt(nn,kk)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AtmulBt7x7x7_loop
  subroutine AtmulBt8x8x8_loop(C, At, Bt) BIND(C, name="AtmulBt8x8x8_loop")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: At
    real(C_double), dimension(8, 8), intent(in) :: Bt
    real(C_double) :: Cmn
    do concurrent(nn = 1:8, mm = 1:8)
       Cmn = 0.0
       do kk = 1,8
          Cmn = Cmn + At(kk,mm) * Bt(nn,kk)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AtmulBt8x8x8_loop

  
  real(C_double) function dot3v2(x, A, y, M, N) BIND(C, name="dot3v2")
    integer(C_long), intent(in) :: M, N
    real(C_double), intent(in) :: x(M), A(M,N), y(N)
    real(C_double) :: t
    integer(C_long) :: mm, nn
    dot3v2 = 0.0d0
    do concurrent(nn = 1:N, mm = 1:M)
       dot3v2 = dot3v2 + x(mm) * A(mm, nn) * y(nn)
    end do
  end function dot3v2
  real(C_double) function dot3(x, A, y, M, N) BIND(C, name="dot3")
    integer(C_long), intent(in) :: M, N
    real(C_double), intent(in) :: x(M), A(M,N), y(N)
    real(C_double) :: t
    integer(C_long) :: mm, nn
    dot3 = 0.0d0
    do concurrent(nn = 1:N)
       t = 0.0d0
       do concurrent(mm = 1:M)
          t = t + x(mm) * A(mm, nn)
       end do
       dot3 = dot3 + t * y(nn)
    end do
  end function dot3
  real(C_double) function dot3builtin(x, A, y, M, N) BIND(C, name="dot3builtin")
    integer(C_long), intent(in) :: M, N
    real(C_double), intent(in) :: x(M), A(M,N), y(N)
    dot3builtin = dot_product(x, matmul(A, y))
  end function dot3builtin


end module matrixmul
Julia code and script for running benchmarks
using LoopVectorization, ArrayInterface, Static, LinearAlgebra

function mul_serial!(C, A, B)
  @avx for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
    Cmn = zero(eltype(C))
    for k ∈ indices((A,B), (2,1))
      Cmn += A[m,k] * B[k,n]
    end
    C[m,n] = Cmn
  end
end
function mul_threads!(C, A, B)
  @avxt for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
    Cmn = zero(eltype(C))
    for k ∈ indices((A,B), (2,1))
      Cmn += A[m,k] * B[k,n]
    end
    C[m,n] = Cmn
  end
end

const LIBFORTMUL = joinpath(pwd(), "libmatmul.so")
if !isfile(LIBFORTMUL)
  # run(`gfortran -Ofast -march=native -fdisable-tree-cunrolli -floop-nest-optimize -funroll-loops -mprefer-vector-width=$(LoopVectorization.pick_vector_width(Float64)*64) -shared -fPIC matmul.f90 -o $LIBFORTMUL`)
  run(`gfortran -Ofast -march=native -mprefer-vector-width=$(LoopVectorization.pick_vector_width(Float64)*64) -shared -fPIC matmul.f90 -o $LIBFORTMUL`)
end

struct SizedMatrix{M,N,T} <: DenseMatrix{T}
  data::Matrix{T}
end
@inline SizedMatrix{M,N}(data::Matrix{T}) where {M,N,T} = SizedMatrix{M,N,T}(data)
@inline Base.unsafe_convert(::Type{Ptr{T}}, A::SizedMatrix{M,N,T}) where {M,N,T} = Base.unsafe_convert(Ptr{T}, A.data)
@inline Base.size(A::SizedMatrix{M,N}) where {M,N} = (M,N)
@inline Base.elsize(A::SizedMatrix{M,N,T}) where {M,N,T} = sizeof(T)
@inline Base.length(A::SizedMatrix{M,N}) where {M,N} = M*N
@inline ArrayInterface.size(A::SizedMatrix{M,N}) where {M,N} = (static(M),static(N))
@inline ArrayInterface.axes(A::SizedMatrix{M,N}, ::StaticInt{1}) where {M,N} = static(1):static(M)
@inline ArrayInterface.axes(A::SizedMatrix{M,N}, ::StaticInt{2}) where {M,N} = static(1):static(N)
@inline ArrayInterface.axes(A::SizedMatrix{M,N}) where {M,N} = (static(1):static(M), static(1):static(N))
@inline Base.parent(A::SizedMatrix) = A.data
@inline ArrayInterface.parent_type(::Type{<:SizedMatrix{M,N,T}}) where {M,N,T} = Matrix{T}
@inline function Base.getindex(A::SizedMatrix, i...)
  @boundscheck checkbounds(A, i...)
  @inbounds A.data[i...]
end
@inline function Base.setindex!(A::SizedMatrix, v, i...)
  @boundscheck checkbounds(A, i...)
  @inbounds A.data[i...] = v
end

for At ∈ (false,true), Bt ∈ (false,true)
  fname = Symbol(At ? "At" : "A", "mul", Bt ? "Bt" : "B")
  for static_size ∈ (0,7,8)
    if static_size == 7
      Ctype = :(SizedMatrix{7,7,Float64})
      fname_sized = Symbol(fname, "7x7x7")
    elseif static_size == 8
      Ctype = :(SizedMatrix{8,8,Float64})
      fname_sized = Symbol(fname, "8x8x8")
    else
      Ctype = :(Matrix{Float64})
      fname_sized = fname
    end
    Atype = At ? :(Adjoint{Float64,$Ctype}) : Ctype
    Btype = Bt ? :(Adjoint{Float64,$Ctype}) : Ctype
    A = At ? :(parent(A)) : :A
    B = Bt ? :(parent(B)) : :B
    for loop ∈ (false, true)
      funcname! = loop ? :fortmul! : :fortmul_loop!
      cfname = QuoteNode(loop ? Symbol(fname_sized, "_loop") : fname_sized)
      matmul_quote = if static_size == 0
        quote
          function $funcname!(C::$Ctype, A::$Atype, B::$Btype)
            M, N = size(C); K = size(B,1)
            ccall(($cfname, LIBFORTMUL), Cvoid, (Ref{Float64},Ref{Float64},Ref{Float64},Ref{Clong},Ref{Clong},Ref{Clong}), C, $A, $B, M, K, N)
          end
        end
      else
        quote
          function $funcname!(C::$Ctype, A::$Atype, B::$Btype)
            ccall(($cfname, LIBFORTMUL), Cvoid, (Ref{Float64},Ref{Float64},Ref{Float64}), C, $A, $B)
          end
        end
      end
      @eval $matmul_quote
    end
  end
end

const Matrix7x7 = SizedMatrix{7,7}
const Matrix8x8 = SizedMatrix{8,8}
A7x7 = rand(7,7); B7x7 = rand(7,7); C7x7 = rand(7,7);
A8x8 = rand(8,8); B8x8 = rand(8,8); C8x8 = rand(8,8);

function print_matmul_summary(C, A, B)
  M, K = size(A); N = size(B,2);
  A_str = A isa Adjoint ? "A'" : "A"
  B_str = B isa Adjoint ? "B'" : "B"
  static_str = C isa SizedMatrix ? "compile" : "runtime"
  println("Running $(static_str)-sized $(M)×$(N) = $(M)×$(K) * $(K)×$(N) benchmarks for C = $A_str * $B_str")
end
function run_benchmark(C,A,B)
  print_matmul_summary(C,A,B)
  M, K = size(A); N = size(B,2)
  println("gfortran Builtin:")
  t = @belapsed fortmul!($C, $A, $B)
  println("GFLOPS: $(2e-9M*K*N/t)")
  Ccopy = copy(C)
  println("gfortran Loops:")
  t = @belapsed fortmul_loop!($C, $A, $B)
  println("GFLOPS: $(2e-9M*K*N/t)")
  @assert C ≈ Ccopy
  println("LoopVectorization.jl Single-Thread:")
  t = @belapsed mul_serial!($C, $A, $B)
  println("GFLOPS: $(2e-9M*K*N/t)")
  @assert C ≈ Ccopy
  if length(C) > 32^2
    println("LoopVectorization.jl Multiple-Threads:")
    t = @belapsed mul_threads!($C, $A, $B)
    println("GFLOPS: $(2e-9M*K*N/t)")
    @assert C ≈ Ccopy
  end
  println()
end
for (C,A,B) ∈ ((C7x7,A7x7,B7x7),(C8x8,A8x8,B8x8)), At ∈ (false,true), Bt ∈ (false,true), static_size ∈ (false,true)
  M,K = size(A); N = size(B,2)
  T = static_size ? (M == 7 ? Matrix7x7 : Matrix8x8) : identity
  Atemp = T(A); Btemp = T(B)
  Atemp = At ? Atemp' : Atemp
  Btemp = Bt ? Btemp' : Btemp
  run_benchmark(T(C), Atemp, Btemp)
end

for M ∈ (71, 72, 144, 216)
  K = N = M
  A = rand(M,K); B = rand(K,N);
  C = Matrix{Float64}(undef, M, N);

  run_benchmark(C, A, B)
  run_benchmark(C, A, B')
  run_benchmark(C, A', B)
  run_benchmark(C, A', B')
end
Results
Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 4.290648913738776
gfortran Loops:
GFLOPS: 3.8254906148406542
LoopVectorization.jl Single-Thread:
GFLOPS: 25.796296296296298

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 26.027454718779794
gfortran Loops:
GFLOPS: 15.402794157670328
LoopVectorization.jl Single-Thread:
GFLOPS: 65.19349315068494

Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.376153217330375
gfortran Loops:
GFLOPS: 3.8443792201472116
LoopVectorization.jl Single-Thread:
GFLOPS: 34.47618088427838

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 26.010593704748118
gfortran Loops:
GFLOPS: 13.077782718818277
LoopVectorization.jl Single-Thread:
GFLOPS: 65.18729192428422

Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 4.0271049910194305
gfortran Loops:
GFLOPS: 3.9683312655086853
LoopVectorization.jl Single-Thread:
GFLOPS: 10.530866711571298

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 18.178010471204193
gfortran Loops:
GFLOPS: 10.238181624159642
LoopVectorization.jl Single-Thread:
GFLOPS: 28.06786345150557

Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 4.315322113055289
gfortran Loops:
GFLOPS: 2.8338118977575006
LoopVectorization.jl Single-Thread:
GFLOPS: 24.073995697104365

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 16.877507447864946
gfortran Loops:
GFLOPS: 13.098295894655307
LoopVectorization.jl Single-Thread:
GFLOPS: 42.206275815301154

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 4.654805292161444
gfortran Loops:
GFLOPS: 7.049481636524185
LoopVectorization.jl Single-Thread:
GFLOPS: 34.24933544197315

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 75.37630919014605
gfortran Loops:
GFLOPS: 13.412733778948079
LoopVectorization.jl Single-Thread:
GFLOPS: 81.84462757020562

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.4824207638528932
gfortran Loops:
GFLOPS: 6.503686130838877
LoopVectorization.jl Single-Thread:
GFLOPS: 49.12558945241074

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 75.28191528545119
gfortran Loops:
GFLOPS: 47.59792997342533
LoopVectorization.jl Single-Thread:
GFLOPS: 81.79227632525786

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 8.2165948853584
gfortran Loops:
GFLOPS: 5.622450042883365
LoopVectorization.jl Single-Thread:
GFLOPS: 13.54091119775336

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 56.06721950683728
gfortran Loops:
GFLOPS: 47.56793060025185
LoopVectorization.jl Single-Thread:
GFLOPS: 49.458773374672994

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 4.89827710282352
gfortran Loops:
GFLOPS: 5.558826599853847
LoopVectorization.jl Single-Thread:
GFLOPS: 31.88760662318113

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 56.06721950683728
gfortran Loops:
GFLOPS: 11.955531821933105
LoopVectorization.jl Single-Thread:
GFLOPS: 48.73397298200392

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 5.865806789966648
gfortran Loops:
GFLOPS: 22.828868478122207
LoopVectorization.jl Single-Thread:
GFLOPS: 109.64066903565738
LoopVectorization.jl Multiple-Threads:
GFLOPS: 326.3791478798318

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.2668930328504797
gfortran Loops:
GFLOPS: 6.8133941234140165
LoopVectorization.jl Single-Thread:
GFLOPS: 107.9801484342002
LoopVectorization.jl Multiple-Threads:
GFLOPS: 317.0627491510409

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 14.328762735952917
gfortran Loops:
GFLOPS: 14.112931527375249
LoopVectorization.jl Single-Thread:
GFLOPS: 52.568260262906655
LoopVectorization.jl Multiple-Threads:
GFLOPS: 191.04598139393016

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 5.875484273425701
gfortran Loops:
GFLOPS: 20.93720202404282
LoopVectorization.jl Single-Thread:
GFLOPS: 96.3000033632664
LoopVectorization.jl Multiple-Threads:
GFLOPS: 153.8464734749436

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 5.997782455689288
gfortran Loops:
GFLOPS: 26.56380328802221
LoopVectorization.jl Single-Thread:
GFLOPS: 123.29809725158565
LoopVectorization.jl Multiple-Threads:
GFLOPS: 385.96556537924624

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.3204458717718337
gfortran Loops:
GFLOPS: 7.33131021478448
LoopVectorization.jl Single-Thread:
GFLOPS: 123.28995177379932
LoopVectorization.jl Multiple-Threads:
GFLOPS: 388.96206752813674

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 25.192224622030242
gfortran Loops:
GFLOPS: 24.992333188255387
LoopVectorization.jl Single-Thread:
GFLOPS: 68.62437948152235
LoopVectorization.jl Multiple-Threads:
GFLOPS: 200.95457298606908

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 5.9635236505108775
gfortran Loops:
GFLOPS: 24.258148376823844
LoopVectorization.jl Single-Thread:
GFLOPS: 110.31417171567905
LoopVectorization.jl Multiple-Threads:
GFLOPS: 146.8943299693588

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 5.601758954232673
gfortran Loops:
GFLOPS: 39.86547665934595
LoopVectorization.jl Single-Thread:
GFLOPS: 126.24657534246577
LoopVectorization.jl Multiple-Threads:
GFLOPS: 1125.972032806461

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 2.578293861451797
gfortran Loops:
GFLOPS: 8.234176150543389
LoopVectorization.jl Single-Thread:
GFLOPS: 108.0644916128332
LoopVectorization.jl Multiple-Threads:
GFLOPS: 975.3018029788348

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 25.676496764623693
gfortran Loops:
GFLOPS: 25.937223937771183
LoopVectorization.jl Single-Thread:
GFLOPS: 89.84456145629608
LoopVectorization.jl Multiple-Threads:
GFLOPS: 867.8420088935393

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 5.639534367631242
gfortran Loops:
GFLOPS: 37.20829153712439
LoopVectorization.jl Single-Thread:
GFLOPS: 114.82566478878657
LoopVectorization.jl Multiple-Threads:
GFLOPS: 661.0546823112686

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 4.885176232546832
gfortran Loops:
GFLOPS: 46.77228108770416
LoopVectorization.jl Single-Thread:
GFLOPS: 123.50268998394588
LoopVectorization.jl Multiple-Threads:
GFLOPS: 1846.7465640461794

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 2.501163629630972
gfortran Loops:
GFLOPS: 8.238294838975227
LoopVectorization.jl Single-Thread:
GFLOPS: 98.05113835376532
LoopVectorization.jl Multiple-Threads:
GFLOPS: 1477.1265665078784

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 25.663006899783166
gfortran Loops:
GFLOPS: 25.26621713606637
LoopVectorization.jl Single-Thread:
GFLOPS: 96.12499105775973
LoopVectorization.jl Multiple-Threads:
GFLOPS: 1474.9646542261253

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 4.846530403782313
gfortran Loops:
GFLOPS: 52.11586018586035
LoopVectorization.jl Single-Thread:
GFLOPS: 105.0252305768329
LoopVectorization.jl Multiple-Threads:
GFLOPS: 1462.9739420773753

versioninfo:

julia> versioninfo()
Julia Version 1.7.0-DEV.1052
Commit cc59947c03* (2021-05-02 02:36 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

The benchmarks are a bunch of matmuls via the Fotran intrinsic as well as loops.
The Julia code contrains the compilation command. I’m open to suggestions for better loops and compiler flags.
The benchmark contains benchmarks for runtime and compile-time sized 7x7 and 8x8 matrices for all permutations of tranposing A and B in C = A*B, and then runtime sized versions for 71x71, 72x72, 144x144, and 216x216.
The larger sizes also include multithreaded Julia matrix multiply (implemented as @avxt on 3 loops). Note that Julia must be started with multiple threads to support multithreading.
Also, @avx(t) is currently limited to simple rectangular loop nests; i.e. triangular loops or multiple loops at the same level in a nest don’t work at the moment, and neither does complex control flow (you can remove @avx if you want to benchmark that).

Also note that there’s only two versions of the matrix multiply function in Julia: 1 single threaded, another multithreaded. The same function is used for all transposed permutations, without us having to write and optimize a different version for each.

EDIT: Fixed names following @themos’ suggestion.

8 Likes

I rather object to you labelling the results from one compiler as the “Fortran” results. I also suspect that your work in producing fast loops would have produced similar results had you worked on a Fortran compiler that you had access to the internals of. Is there any reason that it wouldn’t?

It may well turn out to be the case that Fortran compilers have dropped the ball when it comes to delivering blistering performance in the cases that matter (thousands of communicating processes solving engineering and science problems) but it will take significantly more work to demonstrate that.

2 Likes

Unlike Julia, the Fortran compiler development efforts are not centralized. From what I have learned over the past years, some are very good at debugging (like NAG, GFortran) and some others are historically known for better optimizations (like Intel, Cray, IBM, PGI, …). But these could change at any time. Having and maintaining multiple high-quality compilers takes a lot of effort. For benchmarking purposes, it makes sense to compile Fortran code with all available compilers and report the best result. I think this issue highlights the importance of such activities as Google Summer of Code 2021

3 Likes

Julia correctly sums 10^8 32-bit uniform deviates, with the program

n = 10^8
x = Float32.(rand(n))
println(sum(x)/n)

giving output such as 0.49995995 but gfortran for

implicit none
integer, parameter :: n = 10**8
real :: x(n)
call random_number(x)
print*,sum(x)/n
end

gives 0.167772159 and g95 gives 0.16777216 but Intel Fortran gives 0.4999704
Compensated summation, which gets this correct, may eventually be in stdlib. Ideally a compiler would have an option to compute SUM carefully and the user could verify that toggling the option does not influence the program results.

By default, Julia uses pairwise summation, which explains this. On modern architectures, sum will usually be memory-bound, so the extra (very minor) cost of summing out of order ends up not mattering.

1 Like

Yes, indeed. I would like to keep it that way in Fortran. I also personally do not like to use macros like @avx and rather just use built-in general parallel features such as do concurrent.

3 Likes

Thanks @chriselrod, @StefanKarpinski, @antoine-levitt, @Royi and others from the Julia community to provide feedback on this benchmark and welcome to the forum!

I like this benchmark as it benchmarks a loop with a custom function and a sum. The custom function is a sin, but we can as well benchmark any other custom function (and as pointed above, Julia treats sin as any other function that the user could write). To achieve the best performance, the loop has to be unrolled and perhaps one has to inline the function implementation and rearrange operations.

It seems that Julia indeed is getting very good performance and that some Fortran compilers might be lagging behind.

I actually think being able to optimize code like this is very important, as many production Fortran codes have inner loops like this. I have written many like that. Of course, as @themos said, this is not the only consideration, and a large production code runs in parallel and is a lot more complex than this. But Julia is a young language, so they do not have as many large production codes yet. But they will. And they will beat Fortran also, if they can beat Fortran on simpler benchmarks.

Fortran has large codes and good track record on that, but unfortunately it has been my personal experience that Fortran compilers have been not delivering on the full potential that Fortran allows. This benchmark might show one such example, and Julia shows how to get the performance. As a user I would like the Fortran compilers to do better. Once LFortran can compile most codes, I would like to start looking into this and start implementing such optimizations.

9 Likes

I really don’t think that scientists/engineers SHOULD try to use their working language for optimising SUM(SIN(RANDOM_NUMBER())) on the supercomputer they got last month. They should be talking to their support people about installed libraries for random distributions and feed back any performance concerns. Then all codes could benefit from the work of a small team. If it turns out that Julia or C have the best code generator on that platform, then that is what the implementor of the library does and the user of the library can use it from whatever language they are currently working in. This is the model that took us from the megaflops to the near exaflops. The alternative, where everyone rolls their own using a single-source code generator with ad-hoc extensions, seems a very large risk.

3 Likes

Consider the fact that (as far as I know) LoopVectorization was developed entirely by @chriselrod (a PhD student in an unrelated area) as a hobby project, starting one year ago, with very little interaction with the julia language itself. Of course this has more to do with him being a genius than anything else, but still, I’m not sure that would have happened had he decided to work in Fortran. The model you argue for (of a strict separation between “scientists” and “support people”) makes sense in the context where you need a team of highly-trained engineers working in an environment with high barriers of entry (either because of closed-source, or just because the barrier to entry in any compiler is huge) to make any progress. When you break these barriers, you get pretty interesting results.

5 Likes

There haven’t been such barriers and my evidence is FFTW and GotoBLAS. Both were created by the tiniest of brilliant teams, addressed a specific, well-understood need and became overnight successes and universally found in most supercomputer libraries.

The point of the support team is that they act as librarians: they will tell the researcher (IF they are asked) which bits of performant software is around ready to use, which is the best fit to their workloads, any tradeoffs, etc. Ideally, they will feed back user experiences to the library developers.

There was never any suggestion. for very good reasons, that people should change their programming language to whatever FFTW or GotoBLAS used to develop the codes.

Another example I am aware of is Halide. It is specifically designed to produce the fastest kernels for image processing and uses its own specification language. Once the kernel is produced, it can be consumed from any programming language.