# Simple summation 8x slower than in Julia

I can’t get this `cordic` program to compile on my machine so I can’t run my own comparisons. If you want to do multithreading, the simplest way would be to just write

``````
function f_certainly_not_avx(N)
    s = 0.0
    Threads.@threads for i in 1:N
        s += cordic_sine(convert(Float64, i))
    end
    s
end
``````

Not really sure why or how this is relevant to the original question that @certik asked though.

If one is willing to roll up their sleeves, they could make this `cordic_sine` function work with `LoopVectorization.@avx`; they would just have to add dispatches that act correctly and vectorize across a `VectorizationBase.AbstractSIMD` type, but it wouldn’t be super simple.

Chris explained a bit on how to do so here: https://julialang.zulipchat.com/#narrow/stream/137791-general/topic/LoopVectorization.20vectorize.20an.20arbitrary.20function

No, “multithreading” is not of interest; the curiosity, given the original post, is how to do parallel programming in Julia. Regardless, your suggestion of multithreading decreases the performance in the Julia case:

``````
C:\temp>type avx.jl
function cordic_sine(a::Float64)
    cos = Ref{Float64}(0.0)
    sin = Ref{Float64}(0.0)
    ccall((:cossin_cordic, "cordic"), Cvoid, (Float64, Cint, Ref{Float64}, Ref{Float64}), a, 40, cos, sin)
    return sin[]
end

function f_avx(N)
    s = 0.0
    Threads.@threads for i in 1:N
        s += cordic_sine(convert(Float64, i))
    end
    s
end

@time r = f_avx(800000000)
println(r)

C:\temp>julia avx.jl
201.834736 seconds (1.60 G allocations: 23.847 GiB, 0.80% gc time, 0.02% compilation time)
1.9006631241572693

C:\temp>
``````

You probably started julia with only 1 thread. On my 6 core machine this is what I see:

``````
julia> function cordic_sine(a)
           cos = Ref{Float64}(0.0)
           sin = Ref{Float64}(0.0)
           # Ref: Excellent work by John Burkardt
           # https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c
           # void cossin_cordic ( double beta, int n, double *c, double *s )
           ccall((:cossin_cordic, "/home/mason/cordic/libcordic.so"), Cvoid, (Float64, Cint, Ref{Float64}, Ref{Float64}), a, 40, cos, sin)
           return sin[]
       end
cordic_sine (generic function with 1 method)

julia> function f_definitely_not_avx(N)
           s = 0.0
           Threads.@threads for i in 1:N
               s += cordic_sine(convert(Float64, i))
           end
           s
       end
f_definitely_not_avx (generic function with 1 method)
``````
``````julia> @time r = f_definitely_not_avx(800000000)
57.295734 seconds (1.60 G allocations: 23.846 GiB, 9.56% gc time, 0.04% compilation time)
-0.8053191287737147

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb* (2021-03-24 12:55 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 5 2600 Six-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, znver1)
Environment:
``````
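Note that the threaded result above (-0.805…) differs from the serial and distributed values (1.90066…): accumulating with a plain `s += …` inside `Threads.@threads` is a data race. A race-free sketch of the same reduction, using per-task partial sums (with `sin` standing in for `cordic_sine` so the snippet is self-contained; the function name is illustrative):

``````julia
# Race-free threaded reduction: each spawned task accumulates into
# its own local partial sum, and the partials are combined at the end.
function f_threaded(N, f = sin)   # `sin` stands in for `cordic_sine`
    chunks = Iterators.partition(1:N, cld(N, Threads.nthreads()))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            s = 0.0
            for i in chunk
                s += f(Float64(i))
            end
            s
        end
    end
    sum(fetch, tasks)
end
``````

Started with e.g. `julia --threads=6`, this gives a deterministic-up-to-reassociation result and scales with the core count for a compute-bound body.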

@Mason, here are the steps I followed on Windows with the `cordic.c` and `cordic.h` from the FSU site I linked upthread. Note that on Windows, Julia seems to want a shared library (DLL), from what I could see. Here, a Windows module-definition (.def) file listing the symbols exported from the library makes things clearer, in my view:

``````
C:\temp>dir cordic.*
Volume in drive C is OS
Volume Serial Number is C4CA-BA12

Directory of C:\temp

05/08/2021 11:15 AM 63,494 cordic.c
05/08/2021 01:09 PM 50 cordic.def
05/08/2021 11:15 AM 1,196 cordic.h
3 File(s) 64,740 bytes
0 Dir(s) 23,240,077,312 bytes free

C:\temp>type cordic.def
LIBRARY cordic
EXPORTS
cossin_cordic @1

C:\temp>cl /c /O2 cordic.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.27.29112 for x64

cordic.c

Microsoft (R) Incremental Linker Version 14.27.29112.0

Creating library cordic.lib and object cordic.exp

C:\temp>
``````

What exactly do you mean by parallel here? Maybe you mean what I’d recognize as distributed / multiprocess? In that case, you’d want to use the `Distributed` standard library or something like MPI.jl.

Here’s an example on my 6 core machine:

``````julia> using Distributed; addprocs(5); nprocs()
6

julia> @everywhere function cordic_sine(a)
           cos = Ref{Float64}() # don't waste time zeroing data that'll get thrown out
           sin = Ref{Float64}()
           # Ref: Excellent work by John Burkardt
           # https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c
           # void cossin_cordic ( double beta, int n, double *c, double *s )
           ccall((:cossin_cordic, "/home/mason/cordic/libcordic.so"), Cvoid, (Float64, Cint, Ref{Float64}, Ref{Float64}), a, 40, cos, sin)
           return sin[]
       end

julia> function f_distributed(N)
           @distributed (+) for i in 1:N
               cordic_sine(convert(Float64, i))
           end
       end
f_distributed (generic function with 1 method)

julia> @time r = f_distributed(800000000)
45.842816 seconds (933.50 k allocations: 54.914 MiB, 0.03% gc time, 0.67% compilation time)
1.9006631241562875
``````
``````julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb* (2021-03-24 12:55 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 5 2600 Six-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, znver1)
Environment:
``````

Here is the time for the serial version:

``````
julia> function f_serial(N)
           s = 0.0
           for i in 1:N
               s += cordic_sine(convert(Float64, i))
           end
           s
       end
f_serial (generic function with 1 method)

julia> @time r = f_serial(800000000)
216.155306 seconds (1 allocation: 16 bytes)
1.9006631241568408
``````

Again though, I really don’t understand the point of this benchmark you’re proposing, @FortranFan. Why are we interested in a benchmark that just times Fortran’s / Julia’s foreign function interfaces in parallel?

As far as I understood, this thread was about @certik wanting to understand how LoopVectorization.jl was generating such fast code in julia and how it could be replicated in Fortran.


On a more general note, people might like to time their Fortran compiler’s SUM intrinsic on the largest array their system will handle and report the nanoseconds per item. When I do this, I find large variations, with the Cray compiler on Intel hardware clearly ahead. That merely demonstrates that vendors prioritise different aspects of performance. You will always find some examples of sub-optimal performance if that is what you set out to look for.
The Fortran community is well aware that GPUs are a challenge for modern Fortran, but I don’t think it has been demonstrated that Fortran cannot adapt to the challenge because of fundamental, baked-in language design issues. I don’t think fast trig approximations are a battle worth fighting, any more than trying to beat, say, FFTW or MKL DGEMM in Fortran.
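For comparison, the same nanoseconds-per-item measurement is only a few lines in Julia (the array size here is an arbitrary choice):

``````julia
# Time summing a large array and report nanoseconds per element,
# mirroring the suggested timing of Fortran's SUM intrinsic.
x = rand(10^8)
sum(x)                       # warm-up call so compilation isn't timed
t = @elapsed sum(x)
println("ns per item: ", 1e9 * t / length(x))
``````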


I’m not really proposing this benchmark; as shown upthread, it’s not realistic. The point is only that there are two parts to this “micro” benchmark: one is some form of non-serial execution of the “+” reduction (whether AVX or otherwise), and the other is the function evaluation.

If placed on the same basis, there is hardly any difference between modern languages for calculations that have any heft to them.

What’s left is a simple, convenient way of introducing these aspects natively into the language whilst abstracting away the paradigms and separating out hardware-related considerations such as AVX.

For practitioners needing to use Fortran, the above is just a simple example using `coarray`s.

As for “really don’t understand the point”: the same question can be raised about the benchmarks posted by the Julia side here.


By default, Julia can use more memory than a program compiled by gfortran, so that on my Windows 10 PC, Julia runs

``````n = 10^9
x = rand(n)
println(sum(x)/n)
``````

in 2.8 s (wall time), while gfortran -O2 cannot do the calculation for n = 3e8; for n = 2e8 the program

``````implicit none
integer, parameter :: n = 2*10**8
real(kind=kind(1.0d0)) :: x(n)
call random_number(x)
print*,n,sum(x)/n
end
``````

takes 1.7 s. Is there a gfortran compiler option that permits larger arrays? I don’t think Julia “sees” that it need not store the entire array x, since for n = 5e9 the Julia program hangs my machine. Gfortran does compile a program with n = 1e9 using the options `gfortran -fstack-arrays -fmax-stack-var-size=3000000000 -O2`, but running the program produces no output.

On Windows, it is better to use `ALLOCATABLE` arrays, as the default stack size is rather low, and there are OS restrictions if you’re using 32-bit.

64-bit.

It wouldn’t be able to transform it anyway, so removing the `using LoopVectorization` and just using `for i in` would be fine.

`LoopVectorization.jl` uses a fork of a port of an old version of SLEEF for its `sin` function, so this is the fairest comparison. But as has been pointed out before, in Julia `sin` will be inlined, and some things like constant loads will get hoisted out of the loop.

I think it’s more likely that better language integration will go the other way: Julia’s compiler team wants to “rip open” the compiler, better exposing it to library authors to build their own.

LoopVectorization.jl is experimental, and will go through at least one major overhaul before I am comfortable tagging a “1.0” version.

It does. Others have confirmed it works on Power8 and Nvidia Jetson (ARM).
I bought an M1 Mac specifically so I could test and make sure performance isn’t accidentally bad, and can confirm it works there as well (you just need Julia master, and – until we get a Hwloc_jll version compatible with Apple ARM – you’d need to `] dev Hwloc` to replace `Hwloc_jll` with a path to the library installed some other way; I used Homebrew).

Given that LoopVectorization is still at version “0”, now (or, “sooner rather than later”) would be the time to deprecate `@avx`, and choose a new name for the macro.
`@avx` seems to consistently cause confusion over whether platforms other than x86 are supported. The reaction to the name seems generally pretty negative.

I’m open to name suggestions here, but I’d probably ask on Julia’s Slack/Zulip.

Why did I pick the name `@avx`?

1. Because it is short.
2. Because `@simd` was taken.
3. I didn’t choose `@vectorize` because there was already a vastly inferior macro with that name in LoopVectorization at the time that many of my packages were using, and I didn’t want to break them. It’s long since been removed. I don’t want the name anymore either, because aside from length, it seems like it’d cause similar confusion if/when matrix-extension support is added. The library name itself makes that mistake, but c’est la vie.

Perhaps `@fast`, to be fully open ended about how it gets there, and being short?

Depends on the loop.
If autovectorization requires violating IEEE, then it will require annotation of some kind, e.g. `@fastmath` or `@simd` from Base Julia.
`LoopVectorization.@avx` is a macro implemented by a Julia package that implements its own (experimental) autovectorizer.
It does more than just use special functions.
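For example, here is a minimal sketch of Base Julia’s `@simd` annotation, which licenses the compiler to reassociate the floating-point reduction (something strict IEEE left-to-right evaluation forbids):

``````julia
# Without @simd the += chain must be evaluated left to right;
# with it, the compiler may reassociate and vectorize the reduction.
function sum_simd(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    s
end
``````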

LoopVectorization.jl’s approach does not have that limitation, although I should clean up `erf` support, which currently requires specifying `verf`.

LoopVectorization is also pretty good at nested loops. This is most relevant whenever anyone’s doing things not already easily/well expressed by BLAS, e.g. convolutions and tensor contractions (where expressing them as BLAS has some overhead).
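As a concrete instance of such a nested loop, here is a 1-D convolution written as a plain rectangular loop nest; with LoopVectorization loaded one would annotate the outer loop with `@avx`, but plain loops are used here so the sketch stands alone (the function name is illustrative):

``````julia
# 1-D convolution as a rectangular loop nest -- the kind of kernel
# LoopVectorization handles well. Annotate with @avx when the package
# is loaded; this plain-loop version is self-contained.
function conv!(out, x, k)
    for i in eachindex(out)
        s = zero(eltype(out))
        @inbounds for j in eachindex(k)
            s += x[i + j - 1] * k[j]
        end
        out[i] = s
    end
    out
end

x = collect(1.0:8.0); k = [1.0, 2.0]
out = zeros(length(x) - length(k) + 1)
conv!(out, x, k)   # out[i] = x[i] + 2x[i+1]
``````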

gfortran code for the benchmark
``````module matrixmul
use ISO_C_BINDING

contains
! gfortran -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC matmul.f90 -o libmatmul.so

subroutine AmulB(C, A, B, M, K, N) BIND(C, name="AmulB")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(M, K), intent(in) :: A
real(C_double), dimension(K, N), intent(in) :: B
C = matmul(A, B)
end subroutine AmulB
subroutine AmulB7x7x7(C, A, B) BIND(C, name="AmulB7x7x7")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: A
real(C_double), dimension(7, 7), intent(in) :: B
C = matmul(A, B)
end subroutine AmulB7x7x7
subroutine AmulB8x8x8(C, A, B) BIND(C, name="AmulB8x8x8")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: A
real(C_double), dimension(8, 8), intent(in) :: B
C = matmul(A, B)
end subroutine AmulB8x8x8

subroutine AmulBt(C, A, Bt, M, K, N) BIND(C, name="AmulBt")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(M, K), intent(in) :: A
real(C_double), dimension(N, K), intent(in) :: Bt
C = matmul(A, transpose(Bt))
end subroutine AmulBt
subroutine AmulBt7x7x7(C, A, Bt) BIND(C, name="AmulBt7x7x7")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: A
real(C_double), dimension(7, 7), intent(in) :: Bt
C = matmul(A, transpose(Bt))
end subroutine AmulBt7x7x7
subroutine AmulBt8x8x8(C, A, Bt) BIND(C, name="AmulBt8x8x8")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: A
real(C_double), dimension(8, 8), intent(in) :: Bt
C = matmul(A, transpose(Bt))
end subroutine AmulBt8x8x8

subroutine AtmulB(C, At, B, M, K, N) BIND(C, name="AtmulB")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(K, M), intent(in) :: At
real(C_double), dimension(K, N), intent(in) :: B
C = matmul(transpose(At), B)
end subroutine AtmulB
subroutine AtmulB7x7x7(C, At, B) BIND(C, name="AtmulB7x7x7")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: At
real(C_double), dimension(7, 7), intent(in) :: B
C = matmul(transpose(At), B)
end subroutine AtmulB7x7x7
subroutine AtmulB8x8x8(C, At, B) BIND(C, name="AtmulB8x8x8")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: At
real(C_double), dimension(8, 8), intent(in) :: B
C = matmul(transpose(At), B)
end subroutine AtmulB8x8x8

subroutine AtmulBt(C, At, Bt, M, K, N) BIND(C, name="AtmulBt")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(K, M), intent(in) :: At
real(C_double), dimension(N, K), intent(in) :: Bt
C = transpose(matmul(Bt, At))
end subroutine AtmulBt
subroutine AtmulBt7x7x7(C, At, Bt) BIND(C, name="AtmulBt7x7x7")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: At
real(C_double), dimension(7, 7), intent(in) :: Bt
C = transpose(matmul(Bt, At))
end subroutine AtmulBt7x7x7
subroutine AtmulBt8x8x8(C, At, Bt) BIND(C, name="AtmulBt8x8x8")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: At
real(C_double), dimension(8, 8), intent(in) :: Bt
C = transpose(matmul(Bt, At))
end subroutine AtmulBt8x8x8

subroutine AmulB_loop(C, A, B, M, K, N) BIND(C, name="AmulB_loop")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(M, K), intent(in) :: A
real(C_double), dimension(K, N), intent(in) :: B
real(C_double) :: Cmn
do concurrent(nn = 1:N, mm = 1:M)
Cmn = 0.0
do kk = 1,K
Cmn = Cmn + A(mm,kk) * B(kk,nn)
end do
C(mm,nn) = Cmn
end do
end subroutine AmulB_loop
subroutine AmulB7x7x7_loop(C, A, B) BIND(C, name="AmulB7x7x7_loop")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: A
real(C_double), dimension(7, 7), intent(in) :: B
real(C_double) :: Cmn
do concurrent(nn = 1:7, mm = 1:7)
Cmn = 0.0
do kk = 1,7
Cmn = Cmn + A(mm,kk) * B(kk,nn)
end do
C(mm,nn) = Cmn
end do
end subroutine AmulB7x7x7_loop
subroutine AmulB8x8x8_loop(C, A, B) BIND(C, name="AmulB8x8x8_loop")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: A
real(C_double), dimension(8, 8), intent(in) :: B
real(C_double) :: Cmn
do concurrent(nn = 1:8, mm = 1:8)
Cmn = 0.0
do kk = 1,8
Cmn = Cmn + A(mm,kk) * B(kk,nn)
end do
C(mm,nn) = Cmn
end do
end subroutine AmulB8x8x8_loop

subroutine AmulBt_loop(C, A, Bt, M, K, N) BIND(C, name="AmulBt_loop")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(M, K), intent(in) :: A
real(C_double), dimension(N, K), intent(in) :: Bt
real(C_double) :: Cmn
do concurrent(nn = 1:N, mm = 1:M)
Cmn = 0.0
do kk = 1,K
Cmn = Cmn + A(mm,kk) * Bt(nn,kk)
end do
C(mm,nn) = Cmn
end do
end subroutine AmulBt_loop
subroutine AmulBt7x7x7_loop(C, A, Bt) BIND(C, name="AmulBt7x7x7_loop")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: A
real(C_double), dimension(7, 7), intent(in) :: Bt
real(C_double) :: Cmn
do concurrent(nn = 1:7, mm = 1:7)
Cmn = 0.0
do kk = 1,7
Cmn = Cmn + A(mm,kk) * Bt(nn,kk)
end do
C(mm,nn) = Cmn
end do
end subroutine AmulBt7x7x7_loop
subroutine AmulBt8x8x8_loop(C, A, Bt) BIND(C, name="AmulBt8x8x8_loop")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: A
real(C_double), dimension(8, 8), intent(in) :: Bt
real(C_double) :: Cmn
do concurrent(nn = 1:8, mm = 1:8)
Cmn = 0.0
do kk = 1,8
Cmn = Cmn + A(mm,kk) * Bt(nn,kk)
end do
C(mm,nn) = Cmn
end do
end subroutine AmulBt8x8x8_loop

subroutine AtmulB_loop(C, At, B, M, K, N) BIND(C, name="AtmulB_loop")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(K, M), intent(in) :: At
real(C_double), dimension(K, N), intent(in) :: B
real(C_double) :: Cmn
do concurrent(nn = 1:N, mm = 1:M)
Cmn = 0.0
do kk = 1,K
Cmn = Cmn + At(kk,mm) * B(kk,nn)
end do
C(mm,nn) = Cmn
end do
end subroutine AtmulB_loop
subroutine AtmulB7x7x7_loop(C, At, B) BIND(C, name="AtmulB7x7x7_loop")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: At
real(C_double), dimension(7, 7), intent(in) :: B
real(C_double) :: Cmn
do concurrent(nn = 1:7, mm = 1:7)
Cmn = 0.0
do kk = 1,7
Cmn = Cmn + At(kk,mm) * B(kk,nn)
end do
C(mm,nn) = Cmn
end do
end subroutine AtmulB7x7x7_loop
subroutine AtmulB8x8x8_loop(C, At, B) BIND(C, name="AtmulB8x8x8_loop")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: At
real(C_double), dimension(8, 8), intent(in) :: B
real(C_double) :: Cmn
do concurrent(nn = 1:8, mm = 1:8)
Cmn = 0.0
do kk = 1,8
Cmn = Cmn + At(kk,mm) * B(kk,nn)
end do
C(mm,nn) = Cmn
end do
end subroutine AtmulB8x8x8_loop

subroutine AtmulBt_loop(C, At, Bt, M, K, N) BIND(C, name="AtmulBt_loop")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(K, M), intent(in) :: At
real(C_double), dimension(N, K), intent(in) :: Bt
real(C_double) :: Cmn
do concurrent(nn = 1:N, mm = 1:M)
Cmn = 0.0
do kk = 1,K
Cmn = Cmn + At(kk,mm) * Bt(nn,kk)
end do
C(mm,nn) = Cmn
end do
end subroutine AtmulBt_loop
subroutine AtmulBt7x7x7_loop(C, At, Bt) BIND(C, name="AtmulBt7x7x7_loop")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: At
real(C_double), dimension(7, 7), intent(in) :: Bt
real(C_double) :: Cmn
do concurrent(nn = 1:7, mm = 1:7)
Cmn = 0.0
do kk = 1,7
Cmn = Cmn + At(kk,mm) * Bt(nn,kk)
end do
C(mm,nn) = Cmn
end do
end subroutine AtmulBt7x7x7_loop
subroutine AtmulBt8x8x8_loop(C, At, Bt) BIND(C, name="AtmulBt8x8x8_loop")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: At
real(C_double), dimension(8, 8), intent(in) :: Bt
real(C_double) :: Cmn
do concurrent(nn = 1:8, mm = 1:8)
Cmn = 0.0
do kk = 1,8
Cmn = Cmn + At(kk,mm) * Bt(nn,kk)
end do
C(mm,nn) = Cmn
end do
end subroutine AtmulBt8x8x8_loop

real(C_double) function dot3v2(x, A, y, M, N) BIND(C, name="dot3v2")
integer(C_long), intent(in) :: M, N
real(C_double), intent(in) :: x(M), A(M,N), y(N)
real(C_double) :: t
integer(C_long) :: mm, nn
dot3v2 = 0.0d0
do concurrent(nn = 1:N, mm = 1:M)
dot3v2 = dot3v2 + x(mm) * A(mm, nn) * y(nn)
end do
end function dot3v2
real(C_double) function dot3(x, A, y, M, N) BIND(C, name="dot3")
integer(C_long), intent(in) :: M, N
real(C_double), intent(in) :: x(M), A(M,N), y(N)
real(C_double) :: t
integer(C_long) :: mm, nn
dot3 = 0.0d0
do concurrent(nn = 1:N)
t = 0.0d0
do concurrent(mm = 1:M)
t = t + x(mm) * A(mm, nn)
end do
dot3 = dot3 + t * y(nn)
end do
end function dot3
real(C_double) function dot3builtin(x, A, y, M, N) BIND(C, name="dot3builtin")
integer(C_long), intent(in) :: M, N
real(C_double), intent(in) :: x(M), A(M,N), y(N)
dot3builtin = dot_product(x, matmul(A, y))
end function dot3builtin

end module matrixmul
``````
Julia code and script for running benchmarks
``````using LoopVectorization, ArrayInterface, Static, LinearAlgebra, BenchmarkTools  # BenchmarkTools provides @belapsed

function mul_serial!(C, A, B)
    @avx for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k ∈ indices((A,B), (2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end

function mul_threads!(C, A, B)
    @avxt for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k ∈ indices((A,B), (2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end

const LIBFORTMUL = joinpath(pwd(), "libmatmul.so")
if !isfile(LIBFORTMUL)
    # run(`gfortran -Ofast -march=native -fdisable-tree-cunrolli -floop-nest-optimize -funroll-loops -mprefer-vector-width=$(LoopVectorization.pick_vector_width(Float64)*64) -shared -fPIC matmul.f90 -o $LIBFORTMUL`)
    run(`gfortran -Ofast -march=native -mprefer-vector-width=$(LoopVectorization.pick_vector_width(Float64)*64) -shared -fPIC matmul.f90 -o $LIBFORTMUL`)
end

struct SizedMatrix{M,N,T} <: DenseMatrix{T}
data::Matrix{T}
end
@inline SizedMatrix{M,N}(data::Matrix{T}) where {M,N,T} = SizedMatrix{M,N,T}(data)
@inline Base.unsafe_convert(::Type{Ptr{T}}, A::SizedMatrix{M,N,T}) where {M,N,T} = Base.unsafe_convert(Ptr{T}, A.data)
@inline Base.size(A::SizedMatrix{M,N}) where {M,N} = (M,N)
@inline Base.elsize(A::SizedMatrix{M,N,T}) where {M,N,T} = sizeof(T)
@inline Base.length(A::SizedMatrix{M,N}) where {M,N} = M*N
@inline ArrayInterface.size(A::SizedMatrix{M,N}) where {M,N} = (static(M),static(N))
@inline ArrayInterface.axes(A::SizedMatrix{M,N}, ::StaticInt{1}) where {M,N} = static(1):static(M)
@inline ArrayInterface.axes(A::SizedMatrix{M,N}, ::StaticInt{2}) where {M,N} = static(1):static(N)
@inline ArrayInterface.axes(A::SizedMatrix{M,N}) where {M,N} = (static(1):static(M), static(1):static(N))
@inline Base.parent(A::SizedMatrix) = A.data
@inline ArrayInterface.parent_type(::Type{<:SizedMatrix{M,N,T}}) where {M,N,T} = Matrix{T}
@inline function Base.getindex(A::SizedMatrix, i...)
    @boundscheck checkbounds(A, i...)
    @inbounds A.data[i...]
end
@inline function Base.setindex!(A::SizedMatrix, v, i...)
    @boundscheck checkbounds(A, i...)
    @inbounds A.data[i...] = v
end

for At ∈ (false,true), Bt ∈ (false,true)
    fname = Symbol(At ? "At" : "A", "mul", Bt ? "Bt" : "B")
    for static_size ∈ (0,7,8)
        if static_size == 7
            Ctype = :(SizedMatrix{7,7,Float64})
            fname_sized = Symbol(fname, "7x7x7")
        elseif static_size == 8
            Ctype = :(SizedMatrix{8,8,Float64})
            fname_sized = Symbol(fname, "8x8x8")
        else
            Ctype = :(Matrix{Float64})
            fname_sized = fname
        end
        Atype = At ? :(Adjoint{Float64,$Ctype}) : Ctype
        Btype = Bt ? :(Adjoint{Float64,$Ctype}) : Ctype
        A = At ? :(parent(A)) : :A
        B = Bt ? :(parent(B)) : :B
        for loop ∈ (false, true)
            funcname! = loop ? :fortmul_loop! : :fortmul!
            cfname = QuoteNode(loop ? Symbol(fname_sized, "_loop") : fname_sized)
            matmul_quote = if static_size == 0
                quote
                    function $funcname!(C::$Ctype, A::$Atype, B::$Btype)
                        M, N = size(C); K = size(B,1)
                        ccall(($cfname, LIBFORTMUL), Cvoid, (Ref{Float64},Ref{Float64},Ref{Float64},Ref{Clong},Ref{Clong},Ref{Clong}), C, $A, $B, M, K, N)
                    end
                end
            else
                quote
                    function $funcname!(C::$Ctype, A::$Atype, B::$Btype)
                        ccall(($cfname, LIBFORTMUL), Cvoid, (Ref{Float64},Ref{Float64},Ref{Float64}), C, $A, $B)
                    end
                end
            end
            @eval $matmul_quote
        end
    end
end

const Matrix7x7 = SizedMatrix{7,7}
const Matrix8x8 = SizedMatrix{8,8}
A7x7 = rand(7,7); B7x7 = rand(7,7); C7x7 = rand(7,7);
A8x8 = rand(8,8); B8x8 = rand(8,8); C8x8 = rand(8,8);

function print_matmul_summary(C, A, B)
    M, K = size(A); N = size(B,2)
    A_str = A isa Adjoint ? "A'" : "A"
    B_str = B isa Adjoint ? "B'" : "B"
    static_str = C isa SizedMatrix ? "compile" : "runtime"
    println("Running $(static_str)-sized $(M)×$(N) = $(M)×$(K) * $(K)×$(N) benchmarks for C = $A_str * $B_str")
end
function run_benchmark(C,A,B)
    print_matmul_summary(C,A,B)
    M, K = size(A); N = size(B,2)
    println("gfortran Builtin:")
    t = @belapsed fortmul!($C, $A, $B)
    println("GFLOPS: $(2e-9M*K*N/t)")
    Ccopy = copy(C)
    println("gfortran Loops:")
    t = @belapsed fortmul_loop!($C, $A, $B)
    println("GFLOPS: $(2e-9M*K*N/t)")
    @assert C ≈ Ccopy
    t = @belapsed mul_serial!($C, $A, $B)
    println("GFLOPS: $(2e-9M*K*N/t)")
    @assert C ≈ Ccopy
    if length(C) > 32^2
        t = @belapsed mul_threads!($C, $A, $B)
        println("GFLOPS: $(2e-9M*K*N/t)")
        @assert C ≈ Ccopy
    end
    println()
end
for (C,A,B) ∈ ((C7x7,A7x7,B7x7),(C8x8,A8x8,B8x8)), At ∈ (false,true), Bt ∈ (false,true), static_size ∈ (false,true)
    M,K = size(A); N = size(B,2)
    T = static_size ? (M == 7 ? Matrix7x7 : Matrix8x8) : identity
    Atemp = T(A); Btemp = T(B)
    Atemp = At ? Atemp' : Atemp
    Btemp = Bt ? Btemp' : Btemp
    run_benchmark(T(C), Atemp, Btemp)
end

for M ∈ (71, 72, 144, 216)
    K = N = M
    A = rand(M,K); B = rand(K,N);
    C = Matrix{Float64}(undef, M, N);

    run_benchmark(C, A, B)
    run_benchmark(C, A, B')
    run_benchmark(C, A', B)
    run_benchmark(C, A', B')
end
``````
Results
``````Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 4.290648913738776
gfortran Loops:
GFLOPS: 3.8254906148406542
GFLOPS: 25.796296296296298

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 26.027454718779794
gfortran Loops:
GFLOPS: 15.402794157670328
GFLOPS: 65.19349315068494

Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.376153217330375
gfortran Loops:
GFLOPS: 3.8443792201472116
GFLOPS: 34.47618088427838

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 26.010593704748118
gfortran Loops:
GFLOPS: 13.077782718818277
GFLOPS: 65.18729192428422

Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 4.0271049910194305
gfortran Loops:
GFLOPS: 3.9683312655086853
GFLOPS: 10.530866711571298

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 18.178010471204193
gfortran Loops:
GFLOPS: 10.238181624159642
GFLOPS: 28.06786345150557

Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 4.315322113055289
gfortran Loops:
GFLOPS: 2.8338118977575006
GFLOPS: 24.073995697104365

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 16.877507447864946
gfortran Loops:
GFLOPS: 13.098295894655307
GFLOPS: 42.206275815301154

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 4.654805292161444
gfortran Loops:
GFLOPS: 7.049481636524185
GFLOPS: 34.24933544197315

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 75.37630919014605
gfortran Loops:
GFLOPS: 13.412733778948079
GFLOPS: 81.84462757020562

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.4824207638528932
gfortran Loops:
GFLOPS: 6.503686130838877
GFLOPS: 49.12558945241074

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 75.28191528545119
gfortran Loops:
GFLOPS: 47.59792997342533
GFLOPS: 81.79227632525786

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 8.2165948853584
gfortran Loops:
GFLOPS: 5.622450042883365
GFLOPS: 13.54091119775336

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 56.06721950683728
gfortran Loops:
GFLOPS: 47.56793060025185
GFLOPS: 49.458773374672994

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 4.89827710282352
gfortran Loops:
GFLOPS: 5.558826599853847
GFLOPS: 31.88760662318113

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 56.06721950683728
gfortran Loops:
GFLOPS: 11.955531821933105
GFLOPS: 48.73397298200392

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 5.865806789966648
gfortran Loops:
GFLOPS: 22.828868478122207
GFLOPS: 109.64066903565738
GFLOPS: 326.3791478798318

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.2668930328504797
gfortran Loops:
GFLOPS: 6.8133941234140165
GFLOPS: 107.9801484342002
GFLOPS: 317.0627491510409

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 14.328762735952917
gfortran Loops:
GFLOPS: 14.112931527375249
GFLOPS: 52.568260262906655
GFLOPS: 191.04598139393016

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 5.875484273425701
gfortran Loops:
GFLOPS: 20.93720202404282
GFLOPS: 96.3000033632664
GFLOPS: 153.8464734749436

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 5.997782455689288
gfortran Loops:
GFLOPS: 26.56380328802221
GFLOPS: 123.29809725158565
GFLOPS: 385.96556537924624

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.3204458717718337
gfortran Loops:
GFLOPS: 7.33131021478448
GFLOPS: 123.28995177379932
GFLOPS: 388.96206752813674

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 25.192224622030242
gfortran Loops:
GFLOPS: 24.992333188255387
GFLOPS: 68.62437948152235
GFLOPS: 200.95457298606908

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 5.9635236505108775
gfortran Loops:
GFLOPS: 24.258148376823844
GFLOPS: 110.31417171567905
GFLOPS: 146.8943299693588

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 5.601758954232673
gfortran Loops:
GFLOPS: 39.86547665934595
GFLOPS: 126.24657534246577
GFLOPS: 1125.972032806461

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 2.578293861451797
gfortran Loops:
GFLOPS: 8.234176150543389
GFLOPS: 108.0644916128332
GFLOPS: 975.3018029788348

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 25.676496764623693
gfortran Loops:
GFLOPS: 25.937223937771183
GFLOPS: 89.84456145629608
GFLOPS: 867.8420088935393

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 5.639534367631242
gfortran Loops:
GFLOPS: 37.20829153712439
GFLOPS: 114.82566478878657
GFLOPS: 661.0546823112686

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 4.885176232546832
gfortran Loops:
GFLOPS: 46.77228108770416
GFLOPS: 123.50268998394588
GFLOPS: 1846.7465640461794

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 2.501163629630972
gfortran Loops:
GFLOPS: 8.238294838975227
GFLOPS: 98.05113835376532
GFLOPS: 1477.1265665078784

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 25.663006899783166
gfortran Loops:
GFLOPS: 25.26621713606637
GFLOPS: 96.12499105775973
GFLOPS: 1474.9646542261253

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 4.846530403782313
gfortran Loops:
GFLOPS: 52.11586018586035
GFLOPS: 105.0252305768329
GFLOPS: 1462.9739420773753
``````

versioninfo:

``````julia> versioninfo()
Julia Version 1.7.0-DEV.1052
Commit cc59947c03* (2021-05-02 02:36 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
Environment:
``````

The benchmarks are a bunch of `matmul`s via the Fortran intrinsic as well as loops.
The Julia code contains the compilation command. I’m open to suggestions for better loops and compiler flags.
The benchmark contains runs for runtime- and compile-time-sized 7x7 and 8x8 matrices for all permutations of transposing `A` and `B` in `C = A*B`, and then runtime-sized versions for `71x71`, `72x72`, `144x144`, and `216x216`.
The larger sizes also include multithreaded Julia matrix multiply (implemented as `@avxt` on 3 loops). Note that Julia must be started with multiple threads to support multithreading.
Also, `@avx(t)` is currently limited to simple rectangular loop nests; i.e. triangular loops or multiple loops at the same level in a nest don’t work at the moment, and neither does complex control flow (you can remove `@avx` if you want to benchmark that).

Also note that there are only two versions of the matrix multiply function in Julia: one single-threaded, one multithreaded. The same function is used for all transpose permutations, without us having to write and optimize a separate version for each.
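For readers who want to see the shape of the loop variant without opening the linked repository, here is a minimal C sketch of the naive triple-loop kernel being compared (the actual benchmark sources are Fortran and Julia; the function name, column-major layout, and loop ordering here are illustrative assumptions):

```c
#include <stddef.h>

/* Naive triple-loop matmul: C = A * B, all matrices column-major n×n.
   This is the kind of loop nest that @avx (or a Fortran compiler) must
   reorder, tile, and vectorize to approach peak GFLOPS. */
void matmul_loops(double *C, const double *A, const double *B, size_t n) {
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i + k * n] * B[k + j * n];
            C[i + j * n] = acc;
        }
}
```

Since an n×n multiply performs 2n³ floating-point operations, the reported GFLOPS figure is just 2n³ divided by the elapsed time.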

EDIT: Fixed names following @themos’ suggestion.

8 Likes

I rather object to you labelling the results from one compiler as the “Fortran” results. I also suspect that your work in producing fast loops would have produced similar results had you worked on a Fortran compiler that you had access to the internals of. Is there any reason that it wouldn’t?

It may well turn out to be the case that Fortran compilers have dropped the ball when it comes to delivering blistering performance in the cases that matter (thousands of communicating processes solving engineering and science problems) but it will take significantly more work to demonstrate that.

2 Likes

Unlike Julia’s, Fortran compiler development efforts are not centralized. From what I have learned over the past years, some compilers are very good at debugging (like NAG and GFortran) and others are historically known for better optimizations (like Intel, Cray, IBM, PGI, …). But this could change at any time. Having and maintaining multiple high-quality compilers takes a lot of effort. For benchmarking purposes, it makes sense to compile the Fortran code with all available compilers and report the best result. I think this issue highlights the importance of activities such as Google Summer of Code 2021

3 Likes

Julia correctly sums 10^8 32-bit uniform deviates, with the program

``````n = 10^8
x = Float32.(rand(n))
println(sum(x)/n)
``````

giving output such as `0.49995995` but gfortran for

``````implicit none
integer, parameter :: n = 10**8
real :: x(n)
call random_number(x)
print*,sum(x)/n
end
``````

gives `0.167772159`, g95 gives `0.16777216`, but Intel Fortran gives `0.4999704`. The first two results come from naive left-to-right summation in single precision: once the accumulator reaches 2^24 = 16777216, adding values below 1 no longer changes it, and 16777216/10^8 = 0.16777216.
Compensated summation, which gets this right, may eventually be in stdlib. Ideally a compiler would have an option to compute SUM carefully, and the user could verify that toggling the option does not change the program results.
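Compensated (Kahan) summation fixes this by carrying the rounding error of each addition in a correction term. A minimal C sketch of the idea (a stdlib implementation would presumably look different, and note that `-ffast-math`-style flags can optimize the correction away):

```c
/* Kahan compensated summation in single precision: c accumulates the
   low-order bits that each addition to s rounds away and feeds them
   back into the next step. */
float kahan_sum(const float *x, long n) {
    float s = 0.0f, c = 0.0f;
    for (long i = 0; i < n; i++) {
        float y = x[i] - c;   /* re-inject the previously lost bits   */
        float t = s + y;      /* big + small: low bits of y are lost  */
        c = (t - s) - y;      /* algebraically 0; numerically, the lost part */
        s = t;
    }
    return s;
}
```

For example, starting from 2^24 and adding 1.0f a thousand times, a plain float accumulator never moves, while `kahan_sum` recovers the full total.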

By default, Julia uses pairwise summation, which explains the accurate result. On modern architectures, `sum` will usually be memory-bound, so the (very minor) extra cost of summing out of order ends up not mattering.
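Pairwise summation is easy to sketch: split the array in half, sum each half recursively, and add the two partial sums, so rounding error grows like O(log n) instead of O(n). A hedged C illustration (Julia’s actual `sum` uses a blocked, SIMD-friendly variant rather than this literal recursion, and the base-case size here is arbitrary):

```c
#include <stddef.h>

/* Pairwise summation: partial sums stay small relative to the total,
   so single-precision rounding error grows as O(log n), not O(n). */
float pairwise_sum(const float *x, size_t n) {
    if (n <= 16) {                    /* base case: short plain loop */
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) s += x[i];
        return s;
    }
    size_t half = n / 2;
    return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}
```

Summing 2×10^7 ones, a left-to-right float accumulator stalls at 2^24 = 16777216, while the pairwise version returns 20000000 exactly.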

1 Like

Yes, indeed. I would like to keep it that way in Fortran. I also personally do not like using macros like `@avx` and would rather just use built-in general parallel features such as `do concurrent`.

3 Likes

Thanks @chriselrod, @StefanKarpinski, @antoine-levitt, @Royi and others from the Julia community for providing feedback on this benchmark, and welcome to the forum!

I like this benchmark because it exercises a loop with a custom function and a sum. The custom function here is a `sin`, but we could just as well benchmark any other custom function (and, as pointed out above, Julia treats `sin` like any other function the user could write). To achieve the best performance, the loop has to be unrolled, and one perhaps has to inline the function implementation and rearrange operations.

It seems that Julia indeed is getting very good performance and that some Fortran compilers might be lagging behind.

I actually think being able to optimize code like this is very important, as many production Fortran codes have inner loops like this; I have written many myself. Of course, as @themos said, this is not the only consideration, and a large production code runs in parallel and is far more complex than this. But Julia is a young language, so it does not yet have as many large production codes. It will, though, and those codes will also beat Fortran if Julia can beat Fortran on simpler benchmarks.

Fortran has large codes and a good track record there, but unfortunately it has been my personal experience that Fortran compilers have not been delivering on the full potential the language allows. This benchmark may show one such example, and Julia shows how to get the performance. As a user I would like the Fortran compilers to do better. Once LFortran can compile most codes, I would like to start looking into this and implementing such optimizations.

9 Likes

I really don’t think that scientists/engineers SHOULD try to use their working language for optimising SUM(SIN(RANDOM_NUMBER())) on the supercomputer they got last month. They should be talking to their support people about installed libraries for random distributions and feeding back any performance concerns. Then all codes could benefit from the work of a small team. If it turns out that Julia or C has the best code generator on that platform, then that is what the implementor of the library uses, and the user of the library can call it from whatever language they are currently working in. This is the model that took us from megaflops to near-exaflops. The alternative, where everyone rolls their own using a single-source code generator with ad-hoc extensions, seems a very large risk.

3 Likes

Consider the fact that (as far as I know) LoopVectorization was developed entirely by @chriselrod (a PhD student in an unrelated area) as a hobby project, starting one year ago, with very little interaction with the Julia language developers themselves. Of course this has more to do with him being a genius than anything else, but still, I’m not sure that would have happened had he decided to work in Fortran. The model you argue for (a strict separation between “scientists” and “support people”) makes sense where you need a team of highly trained engineers working in an environment with high barriers to entry (either because of closed source, or just because the barrier to entry into any compiler is huge) to make progress. When you break those barriers, you get pretty interesting results.

5 Likes

There haven’t been such barriers, and my evidence is FFTW and GotoBLAS. Both were created by the tiniest of brilliant teams, addressed a specific, well-understood need, and became overnight successes, universally found in supercomputer libraries.

The point of the support team is that they act as librarians: they will tell the researcher (IF they are asked) which bits of performant software are around ready to use, which is the best fit for their workloads, any tradeoffs, etc. Ideally, they will feed user experiences back to the library developers.

There was never any suggestion, for very good reasons, that people should change their programming language to whatever FFTW or GotoBLAS used to develop those codes.

Another example I am aware of is Halide. It is specifically designed to produce the fastest kernels for image processing and uses its own specification language. Once the kernel is produced, it can be consumed from any programming language.