# Simple summation 8x slower than in Julia

Is there a way to tell Fortran compilers to maximally evaluate these sorts of things at compile time? Since all the values are known at compile time, the whole program could be reduced to a print of a known constant.

1 Like

It is up to each Fortran compiler to document such things.

@certik and everyone,

Notwithstanding the excellent point by @themos the specific case at hand is not a true performance benchmark on account of the mathematical identity, should we continue to use the same case but modify it as suggested upthread to view the `sin` calculation as some representative function computation for which a common library implementation is to be consumed in any programming paradigm of interest, whether it be based in Julia or Fortran, and we “up” the game a little bit to achieve some “scale” to the calculations i.e., raise the number N to be 800,000,000 rather than 100,000,000, a speedup close to 6x relative to Julia is viable with Fortran.

So with John Burkardt’s C code for the CORDIC algorithm (link upthread) as the basis, the Julia code and the result is as follows (as noted by @uwe, `avx*` macro fails to compile, so it’s removed):

``````function cordic_sine(a)

cos = Ref{Float64}(0.0)
sin = Ref{Float64}(0.0)

# Ref: Excellent work by John Burkardt
# https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c
# void cossin_cordic ( double beta, int n, double *c, double *s )
ccall((:cossin_cordic, "cordic"), Cvoid, (Float64, Cint, Ref{Float64}, Ref{Float64}), a, 40, cos, sin)

return sin[]

end

function f_avx(N)
s = 0.0
for i in 1:N
s += cordic_sine(convert(Float64, i))
end
s
end

@time r = f_avx(800000000)
println(r)
``````

Execution result with Julia:

C:\temp>julia avx.jl
172.551030 seconds (1 allocation: 16 bytes)
1.9006631241568408

With Fortran, say one pursues rather simple-minded `coarray` approach as shown in the below fold - but while noting `coarray` is part of the base language:

Click to see Fortran code

#### Simple coarray example to divvy up the series summation across images

``````module cordic_m
use, intrinsic :: iso_c_binding, only : c_int, c_double
interface
pure subroutine cossin_cordic( beta, n, c, s) bind(C, name="cossin_cordic")
! Ref: Excellent work by John Burkardt
! https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c
! void cossin_cordic ( double beta, int n, double *c, double *s )
import :: c_int, c_double
! Argument list
real(c_double), intent(in), value :: beta
integer(c_int), intent(in), value :: n
real(c_double), intent(inout)     :: c
real(c_double), intent(inout)     :: s
end subroutine
end interface
contains
pure function cordic_sine( a ) result(sin_a)
! Argument list
real(c_double), intent(in) :: a
! Function result
real(c_double) :: sin_a
real(c_double) :: cos_a
call cossin_cordic( a, 40, cos_a, sin_a )
end function
end module
use cordic_m
integer, parameter :: M = 100000000
real(c_double) :: r[*]
real(c_double) :: t1, t2
real(c_double) :: sn
integer :: i

if ( this_image() == 1 ) then
call cpu_time(t1)
end if
sync all

r = f(M)
sync all

if ( this_image() == 1 ) then
sn = 0.0_c_double
do i = 1, num_images()
sn = sn + r[i]
end do
call cpu_time(t2)
print *, "Time", t2-t1
print *, sn
end if

contains

pure function f(N) result(r)
integer, intent(in) :: N
real(c_double) :: r
integer :: i
r = 0.0_c_double
do i = (this_image()-1)*M+1, this_image()*M
r = r + cordic_sine(real(i,c_double))
end do
end function

end program
``````

the timing result with IFORT on the same CPU as the above Julia case is:

C:\temp>ifort /c /O3 /Qcoarray:shared /Qcoarray-num-images=8 /QxHost /traceback p.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.2.0 Build 20210228_000000

Microsoft (R) Incremental Linker Version 14.27.29112.0

C:\temp>p.exe
Time 30.2500000000000
1.90066312415625

Note the series summation result matches with that of Julia to the precision of `c_double` / `Float64`.

Perhaps one or more of the Julia enthusiasts can show here the parallel execution option(s) with Julia for the same exact case i.e., their native option equivalent to Fortran `coarray` - now that will be a good learning exercise for readers here!

3 Likes

As far as I understand, Julia is quite strict about floating point semantics, so it will not allow transformations that violate the semantics (eg loop reordering). It does autovectorize (through LLVM), but only as far as it can prove that they produce the exact same results. I think Fortran does the same, so differences there are probably due to compilers more than languages. Julia annotations such as `@simd` or `@fastmath` relax the semantics, allowing the compiler to optimize more, so compared to fortran it’s more granular than compiler switches. I sometimes wish julia would allow more aggressive transforms automagically, but it’s not easy to find an API that allows both automatic transformations in the cases where it doesn’t matter and forbids them in the cases where it does. In any case adding `@simd` in front of tight loops is not that big of a deal.

None of the languages will vectorize this without permission because vectorization requires reassociating the order of the summation in violation of the IEEE 754 floating point standard. That permission is given in C and Fortran by passing the `-ffast-math` flag for an entire compilation unit. This flag also gives permission to do other transformations that may speed up your code but may also change the answer for better or worse, but definitely less predictable since the scope of the transformations is unspecified and unbounded. (In bad cases arbitrarily wrong answers can result from the careless application of the `-ffast-math` option.)

In Julia, a much more limited permission to change associativity is given by using the built-in `@simd` macro. This is a much more surgical approach since it only gives permission to reassociate operations as necessary to vectorize the annotated loop, and doesn’t allow any of the other potentially more problematic transformations that the `-ffast-math` option can cause. The `@avx` macro provided by the LoopVectorization package does a bit more than this: it explicitly unrolls the loop, and replaces calls to certain math functions (including `sin`) with vectorized versions that may give slightly different results. As @antoine-levitt mentioned, there is also a built-in `@fastmath` macro that gives permissions similar to what `-ffast-math` does (but locally).

7 Likes

I can’t get this `cordic` program to compile on my machine so I can’t run my own comparisons. If you want to do multithreading, the simplest way would be to just write

``````function f_certainly_not_avx(N)
s = 0.0
s += cordic_sine(convert(Float64, i))
end
s
end
``````

Not really sure why or how this is relevant to the original question that @certik asked though.

If one is willing to roll up their sleeves, they could make this `cordic_sine` function work with `LoopVectorization.@avx`, they just have to add dispatches that act correctly and vectorize across a `VectorizationBase.AbstractSIMD` type, but it wouldn’t be super simple.

Chris explained a bit on how to do so here: https://julialang.zulipchat.com/#narrow/stream/137791-general/topic/LoopVectorization.20vectorize.20an.20arbitrary.20function

No, “multithreading” is not of interest; the curiosity given the original post is how to do parallel programming in Julia. Regardless, your suggestion of multithreading decreases the performance with the Julia case:

C:\temp>type avx.jl
function cordic_sine(a::Float64)

cos = Ref{Float64}(0.0)
sin = Ref{Float64}(0.0)

ccall((:cossin_cordic, “cordic”), Cvoid, (Float64, Cint, Ref{Float64}, Ref{Float64}), a, 40, cos, sin)

return sin[]

end

function f_avx(N)
s = 0.0
s += cordic_sine(convert(Float64, i))
end
s
end

@time r = f_avx(800000000)
println(r)

C:\temp>julia avx.jl
201.834736 seconds (1.60 G allocations: 23.847 GiB, 0.80% gc time, 0.02% compilation time)
1.9006631241572693

C:\temp>

You probably started julia with only 1 thread. On my 6 core machine this is what I see:

``````julia> function cordic_sine(a)

cos = Ref{Float64}(0.0)
sin = Ref{Float64}(0.0)

# Ref: Excellent work by John Burkardt
# https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c
# void cossin_cordic ( double beta, int n, double *c, double *s )
ccall((:cossin_cordic, "/home/mason/cordic/libcordic.so"), Cvoid, (Float64, Cint, Ref{Float64}, Ref{Float64}), a, 40, cos, sin)

return sin[]

end
cordic_sine (generic function with 1 method)

julia> function f_definitely_not_avx(N)
s = 0.0
s += cordic_sine(convert(Float64, i))
end
s
end
f_definitely_not_avx (generic function with 1 method)
``````
``````julia> @time r = f_definitely_not_avx(800000000)
57.295734 seconds (1.60 G allocations: 23.846 GiB, 9.56% gc time, 0.04% compilation time)
-0.8053191287737147

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb* (2021-03-24 12:55 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 5 2600 Six-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, znver1)
Environment:
``````
1 Like

@Mason, here’re the steps I followed on Windows OS with the `cordic.c` and `cordic.h` from the FSU site I linked upthread: note on Windows, Julia seems to want a shared library (DLL) from what I could see. Here a Windows definition (.def) file of exported symbols from the library make things clearer in my view:

C:\temp>dir cordic.*
Volume in drive C is OS
Volume Serial Number is C4CA-BA12

Directory of C:\temp

05/08/2021 11:15 AM 63,494 cordic.c
05/08/2021 01:09 PM 50 cordic.def
05/08/2021 11:15 AM 1,196 cordic.h
3 File(s) 64,740 bytes
0 Dir(s) 23,240,077,312 bytes free

C:\temp>type cordic.def
LIBRARY cordic
EXPORTS
cossin_cordic @1

C:\temp>cl /c /O2 cordic.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.27.29112 for x64

cordic.c

Microsoft (R) Incremental Linker Version 14.27.29112.0

Creating library cordic.lib and object cordic.exp

C:\temp>

What exactly do you mean by parallel here? Maybe you mean what I’d recognize as distributed / multiprocess? In that case, you’d want to use the `Distributed` standard library or something like MPI.jl.

Here’s an example on my 6 core machine:

``````julia> using Distributed; addprocs(5); nprocs()
6

julia> @everywhere function cordic_sine(a)
cos = Ref{Float64}() # don't waste time zeroing data that'll get thrown out
sin = Ref{Float64}()
# Ref: Excellent work by John Burkardt
# https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c
# void cossin_cordic ( double beta, int n, double *c, double *s )
ccall((:cossin_cordic, "/home/mason/cordic/libcordic.so"), Cvoid, (Float64, Cint, Ref{Float64}, Ref{Float64}), a, 40, cos, sin)
return sin[]
end

julia> function f_distributed(N)
@distributed (+) for i in 1:N
cordic_sine(convert(Float64, i))
end
end
f_distributed (generic function with 1 method)

julia> @time r = f_distributed(800000000)
45.842816 seconds (933.50 k allocations: 54.914 MiB, 0.03% gc time, 0.67% compilation time)
1.9006631241562875
``````
``````julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb* (2021-03-24 12:55 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 5 2600 Six-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, znver1)
Environment:
``````

Here is the time for the serial version:

``````julia> function f_serial(N)
s = 0.0
for i in 1:N
s += cordic_sine(convert(Float64, i))
end
s
end
f_avx (generic function with 1 method)

julia> @time r = f_serial(800000000)
216.155306 seconds (1 allocation: 16 bytes)
1.9006631241568408
``````
1 Like

Again though, I really don’t understand that point of this benchmark you’re proposing @FortranFan. Why are we interested in a benchmark that just times Fortran / Julia’s foreign function interfaces in parallel?

As far as I understood, this thread was about @certik wanting to understand how LoopVectorization.jl was generating such fast code in julia and how it could be replicated in Fortran.

2 Likes

On a more general note, people might like to time their Fortran compiler’s SUM intrinsic on the largest array their system will handle and report the nanoseconds per item. When I do this, I find large variations, with the Cray compiler on Intel hardware clearly ahead. That merely demonstrates that vendors prioritise different aspects of performance. You will always find some examples of sub-optimal performance if that is what you set out to look for.
The Fortran community is well aware that GPUs are a challenge for modern Fortran but I don’t think that it has been demonstrated that Fortran cannot adapt to the challenge because of fundamental, baked-in language design issues. I don’t think that fast trig approximations is a battle worth fighting, any more than trying to beat, say, FFTW or MKL DGEMM in Fortran.

I’m not really proposing this benchmark, as shown upthread it’s not realistic. The point is only there are 2 parts to this “micro” benchmark: one is some form of non-serial execution of “+” reduction (whether AVX or otherwise) and the next is function evaluation.

If placed on the same basis, there are hardly much of any differences between modern languages with calculations that have any heft to them.

What’s left is simple, convenient way of introducing aspects natively into the language whilst abstracting away the paradigms and separating the hardware-related considerations such as AVX.

For practitioners needing to use Fortran, the above is just a simple example using `coarray`s.

About "really don’t understand the point " is a question that can be raised of the ones by Jula-lang here.

1 Like

By default, Julia can use more memory than a program compiled by gfortran, so that on my Windows 10 PC, Julia runs

``````n = 10^9
x = rand(n)
println(sum(x)/n)
``````

in 2.8 s (wall time) while gfortran -O2 cannot do the calculation for n = 3e8 and for 2e8

``````implicit none
integer, parameter :: n = 2*10**8
real(kind=kind(1.0d0)) :: x(n)
call random_number(x)
print*,n,sum(x)/n
end
``````

takes 1.7 s. Is there a gfortran compiler option that permits larger arrays? I don’t think Julia “sees” that it need not store the entire array x, since for n = 5e9 the Julia program hangs my machine. Gfortran does compile a program with n = 1e9 using the options gfortran -fstack-arrays -fmax-stack-var-size=3000000000 -O2 , but running the program produces no output.

On Windows OS, better to use `ALLOCATABLE` arrays for the default stack-size is rather low and there are OS restrictions if you’re using 32-bit.

64-bit.

It wouldn’t be able to transform it anyway, so removing the `using LoopVectorization` and just using `for i in` would be fine.

`LoopVectorization.jl` uses a fork of a port of an old version of SLEEF for its `sin` function, so this is the fairest comparison. But as has been pointed out before, in Julia `sin` will be inlined, and some things like constant loads will get hoisted out of the loop.

I think it’s more likely that better language integration will go the other way: Julia’s compiler team wants to “rip open” the compiler, better exposing it to library authors to build their own.

LoopVectorization.jl is experimental, and will go through at least one major overhaul before I am comfortable tagging a “1.0” version.

It does. Others have confirmed it works on Power8 and Nvidea Jetson (ARM).
I bought an M1 Mac specifically so I could test and make sure performance isn’t accidentally bad, and can confirm it works there as well (you just need Julia master, and – until we get a Hwloc_jll version compatible with Apple ARM – you’d need to `] dev Hwloc` to replace `Hwloc_jll` with a path to the library installed some other way; I used Homebrew).

Given that LoopVectorization is still at version “0”, now (or, “sooner rather than later”) would be the time to deprecate `@avx`, and choose a new name for the macro.
`@avx` seems to consistently cause confusion over whether platforms other than x86 are supported. The reaction to the name seems generally pretty negative.

I’m open to name suggestions here, but I’d probably ask on Julia’s Slack/Zulip.

Why did I pick the name `@avx`?

1. Because it is short.
2. Because `@simd` was taken.
3. I didn’t choose `@vectorize` because there was already a vastly inferior macro with that name in LoopVectorization at the time that many of my packages were using, and I didn’t want to break them. It’s long since been removed. I don’t want the name anymore either, because aside from length, it seems like it’d cause similar confusion if/when matrix-extension support is added. The library name itself makes that mistake, but c’est la vie.

Perhaps `@fast`, to be fully open ended about how it gets there, and being short?

Depends on the loop.
If autovectorization requires violating IEEE, then it will require annotation of some kind, e.g. `@fastmath` or `@simd` from Base Julia.
`LoopVectorization.@avx` is a macro implemented by a Julia package that implements its own (experimental) autovectorizer.
It does more than just use special functions.

LoopVectorization.jl’s approach does not have that limitation, although I should cleanup `erf` support, which currently requires specifying `verf`.

LoopVectorization is also pretty good at nested loops. This is most relevant whenever anyone’s doing things not already easily/well expressed by BLAS, e.g. convolutions and tensor contractions (where expressing them as BLAS has some overhead).

gfortran code for the benchmark
``````module matrixmul
use ISO_C_BINDING

contains
! gfortran -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC matmul.f90 -o libmatmul.so

subroutine AmulB(C, A, B, M, K, N) BIND(C, name="AmulB")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(M, K), intent(in) :: A
real(C_double), dimension(K, N), intent(in) :: B
C = matmul(A, B)
end subroutine AmulB
subroutine AmulB7x7x7(C, A, B) BIND(C, name="AmulB7x7x7")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: A
real(C_double), dimension(7, 7), intent(in) :: B
C = matmul(A, B)
end subroutine AmulB7x7x7
subroutine AmulB8x8x8(C, A, B) BIND(C, name="AmulB8x8x8")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: A
real(C_double), dimension(8, 8), intent(in) :: B
C = matmul(A, B)
end subroutine AmulB8x8x8

subroutine AmulBt(C, A, Bt, M, K, N) BIND(C, name="AmulBt")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(M, K), intent(in) :: A
real(C_double), dimension(N, K), intent(in) :: Bt
C = matmul(A, transpose(Bt))
end subroutine AmulBt
subroutine AmulBt7x7x7(C, A, Bt) BIND(C, name="AmulBt7x7x7")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: A
real(C_double), dimension(7, 7), intent(in) :: Bt
C = matmul(A, transpose(Bt))
end subroutine AmulBt7x7x7
subroutine AmulBt8x8x8(C, A, Bt) BIND(C, name="AmulBt8x8x8")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: A
real(C_double), dimension(8, 8), intent(in) :: Bt
C = matmul(A, transpose(Bt))
end subroutine AmulBt8x8x8

subroutine AtmulB(C, At, B, M, K, N) BIND(C, name="AtmulB")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(K, M), intent(in) :: At
real(C_double), dimension(K, N), intent(in) :: B
C = matmul(transpose(At), B)
end subroutine AtmulB
subroutine AtmulB7x7x7(C, At, B) BIND(C, name="AtmulB7x7x7")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: At
real(C_double), dimension(7, 7), intent(in) :: B
C = matmul(transpose(At), B)
end subroutine AtmulB7x7x7
subroutine AtmulB8x8x8(C, At, B) BIND(C, name="AtmulB8x8x8")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: At
real(C_double), dimension(8, 8), intent(in) :: B
C = matmul(transpose(At), B)
end subroutine AtmulB8x8x8

subroutine AtmulBt(C, At, Bt, M, K, N) BIND(C, name="AtmulBt")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(K, M), intent(in) :: At
real(C_double), dimension(N, K), intent(in) :: Bt
C = transpose(matmul(Bt, At))
end subroutine AtmulBt
subroutine AtmulBt7x7x7(C, At, Bt) BIND(C, name="AtmulBt7x7x7")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: At
real(C_double), dimension(7, 7), intent(in) :: Bt
C = transpose(matmul(Bt, At))
end subroutine AtmulBt7x7x7
subroutine AtmulBt8x8x8(C, At, Bt) BIND(C, name="AtmulBt8x8x8")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: At
real(C_double), dimension(8, 8), intent(in) :: Bt
C = transpose(matmul(Bt, At))
end subroutine AtmulBt8x8x8

subroutine AmulB_loop(C, A, B, M, K, N) BIND(C, name="AmulB_loop")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(M, K), intent(in) :: A
real(C_double), dimension(K, N), intent(in) :: B
real(C_double) :: Cmn
do concurrent(nn = 1:N, mm = 1:M)
Cmn = 0.0
do kk = 1,K
Cmn = Cmn + A(mm,kk) * B(kk,nn)
end do
C(mm,nn) = Cmn
end do
end subroutine AmulB_loop
subroutine AmulB7x7x7_loop(C, A, B) BIND(C, name="AmulB7x7x7_loop")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: A
real(C_double), dimension(7, 7), intent(in) :: B
real(C_double) :: Cmn
do concurrent(nn = 1:7, mm = 1:7)
Cmn = 0.0
do kk = 1,7
Cmn = Cmn + A(mm,kk) * B(kk,nn)
end do
C(mm,nn) = Cmn
end do
end subroutine AmulB7x7x7_loop
subroutine AmulB8x8x8_loop(C, A, B) BIND(C, name="AmulB8x8x8_loop")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: A
real(C_double), dimension(8, 8), intent(in) :: B
real(C_double) :: Cmn
do concurrent(nn = 1:8, mm = 1:8)
Cmn = 0.0
do kk = 1,8
Cmn = Cmn + A(mm,kk) * B(kk,nn)
end do
C(mm,nn) = Cmn
end do
end subroutine AmulB8x8x8_loop

subroutine AmulBt_loop(C, A, Bt, M, K, N) BIND(C, name="AmulBt_loop")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(M, K), intent(in) :: A
real(C_double), dimension(N, K), intent(in) :: Bt
real(C_double) :: Cmn
do concurrent(nn = 1:N, mm = 1:M)
Cmn = 0.0
do kk = 1,K
Cmn = Cmn + A(mm,kk) * Bt(nn,kk)
end do
C(mm,nn) = Cmn
end do
end subroutine AmulBt_loop
subroutine AmulBt7x7x7_loop(C, A, Bt) BIND(C, name="AmulBt7x7x7_loop")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: A
real(C_double), dimension(7, 7), intent(in) :: Bt
real(C_double) :: Cmn
do concurrent(nn = 1:7, mm = 1:7)
Cmn = 0.0
do kk = 1,7
Cmn = Cmn + A(mm,kk) * Bt(nn,kk)
end do
C(mm,nn) = Cmn
end do
end subroutine AmulBt7x7x7_loop
subroutine AmulBt8x8x8_loop(C, A, Bt) BIND(C, name="AmulBt8x8x8_loop")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: A
real(C_double), dimension(8, 8), intent(in) :: Bt
real(C_double) :: Cmn
do concurrent(nn = 1:8, mm = 1:8)
Cmn = 0.0
do kk = 1,8
Cmn = Cmn + A(mm,kk) * Bt(nn,kk)
end do
C(mm,nn) = Cmn
end do
end subroutine AmulBt8x8x8_loop

subroutine AtmulB_loop(C, At, B, M, K, N) BIND(C, name="AtmulB_loop")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(K, M), intent(in) :: At
real(C_double), dimension(K, N), intent(in) :: B
real(C_double) :: Cmn
do concurrent(nn = 1:N, mm = 1:M)
Cmn = 0.0
do kk = 1,K
Cmn = Cmn + At(kk,mm) * B(kk,nn)
end do
C(mm,nn) = Cmn
end do
end subroutine AtmulB_loop
subroutine AtmulB7x7x7_loop(C, At, B) BIND(C, name="AtmulB7x7x7_loop")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: At
real(C_double), dimension(7, 7), intent(in) :: B
real(C_double) :: Cmn
do concurrent(nn = 1:7, mm = 1:7)
Cmn = 0.0
do kk = 1,7
Cmn = Cmn + At(kk,mm) * B(kk,nn)
end do
C(mm,nn) = Cmn
end do
end subroutine AtmulB7x7x7_loop
subroutine AtmulB8x8x8_loop(C, At, B) BIND(C, name="AtmulB8x8x8_loop")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: At
real(C_double), dimension(8, 8), intent(in) :: B
real(C_double) :: Cmn
do concurrent(nn = 1:8, mm = 1:8)
Cmn = 0.0
do kk = 1,8
Cmn = Cmn + At(kk,mm) * B(kk,nn)
end do
C(mm,nn) = Cmn
end do
end subroutine AtmulB8x8x8_loop

subroutine AtmulBt_loop(C, At, Bt, M, K, N) BIND(C, name="AtmulBt_loop")
integer(C_long), intent(in) :: M, K, N
real(C_double), dimension(M, N), intent(out) :: C
real(C_double), dimension(K, M), intent(in) :: At
real(C_double), dimension(N, K), intent(in) :: Bt
real(C_double) :: Cmn
do concurrent(nn = 1:N, mm = 1:M)
Cmn = 0.0
do kk = 1,K
Cmn = Cmn + At(kk,mm) * Bt(nn,kk)
end do
C(mm,nn) = Cmn
end do
end subroutine AtmulBt_loop
subroutine AtmulBt7x7x7_loop(C, At, Bt) BIND(C, name="AtmulBt7x7x7_loop")
real(C_double), dimension(7, 7), intent(out) :: C
real(C_double), dimension(7, 7), intent(in) :: At
real(C_double), dimension(7, 7), intent(in) :: Bt
real(C_double) :: Cmn
do concurrent(nn = 1:7, mm = 1:7)
Cmn = 0.0
do kk = 1,7
Cmn = Cmn + At(kk,mm) * Bt(nn,kk)
end do
C(mm,nn) = Cmn
end do
end subroutine AtmulBt7x7x7_loop
subroutine AtmulBt8x8x8_loop(C, At, Bt) BIND(C, name="AtmulBt8x8x8_loop")
real(C_double), dimension(8, 8), intent(out) :: C
real(C_double), dimension(8, 8), intent(in) :: At
real(C_double), dimension(8, 8), intent(in) :: Bt
real(C_double) :: Cmn
do concurrent(nn = 1:8, mm = 1:8)
Cmn = 0.0
do kk = 1,8
Cmn = Cmn + At(kk,mm) * Bt(nn,kk)
end do
C(mm,nn) = Cmn
end do
end subroutine AtmulBt8x8x8_loop

real(C_double) function dot3v2(x, A, y, M, N) BIND(C, name="dot3v2")
integer(C_long), intent(in) :: M, N
real(C_double), intent(in) :: x(M), A(M,N), y(N)
real(C_double) :: t
integer(C_long) :: mm, nn
dot3v2 = 0.0d0
do concurrent(nn = 1:N, mm = 1:M)
dot3v2 = dot3v2 + x(mm) * A(mm, nn) * y(nn)
end do
end function dot3v2
real(C_double) function dot3(x, A, y, M, N) BIND(C, name="dot3")
integer(C_long), intent(in) :: M, N
real(C_double), intent(in) :: x(M), A(M,N), y(N)
real(C_double) :: t
integer(C_long) :: mm, nn
dot3 = 0.0d0
do concurrent(nn = 1:N)
t = 0.0d0
do concurrent(mm = 1:M)
t = t + x(mm) * A(mm, nn)
end do
dot3 = dot3 + t * y(nn)
end do
end function dot3
real(C_double) function dot3builtin(x, A, y, M, N) BIND(C, name="dot3builtin")
integer(C_long), intent(in) :: M, N
real(C_double), intent(in) :: x(M), A(M,N), y(N)
dot3builtin = dot_product(x, matmul(A, y))
end function dot3builtin

end module matrixmul
``````
Julia code and script for running benchmarks
``````using LoopVectorization, ArrayInterface, Static, LinearAlgebra

function mul_serial!(C, A, B)
@avx for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
Cmn = zero(eltype(C))
for k ∈ indices((A,B), (2,1))
Cmn += A[m,k] * B[k,n]
end
C[m,n] = Cmn
end
end
@avxt for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
Cmn = zero(eltype(C))
for k ∈ indices((A,B), (2,1))
Cmn += A[m,k] * B[k,n]
end
C[m,n] = Cmn
end
end

const LIBFORTMUL = joinpath(pwd(), "libmatmul.so")
if !isfile(LIBFORTMUL)
# run(`gfortran -Ofast -march=native -fdisable-tree-cunrolli -floop-nest-optimize -funroll-loops -mprefer-vector-width=\$(LoopVectorization.pick_vector_width(Float64)*64) -shared -fPIC matmul.f90 -o \$LIBFORTMUL`)
run(`gfortran -Ofast -march=native -mprefer-vector-width=\$(LoopVectorization.pick_vector_width(Float64)*64) -shared -fPIC matmul.f90 -o \$LIBFORTMUL`)
end

struct SizedMatrix{M,N,T} <: DenseMatrix{T}
data::Matrix{T}
end
@inline SizedMatrix{M,N}(data::Matrix{T}) where {M,N,T} = SizedMatrix{M,N,T}(data)
@inline Base.unsafe_convert(::Type{Ptr{T}}, A::SizedMatrix{M,N,T}) where {M,N,T} = Base.unsafe_convert(Ptr{T}, A.data)
@inline Base.size(A::SizedMatrix{M,N}) where {M,N} = (M,N)
@inline Base.elsize(A::SizedMatrix{M,N,T}) where {M,N,T} = sizeof(T)
@inline Base.length(A::SizedMatrix{M,N}) where {M,N} = M*N
@inline ArrayInterface.size(A::SizedMatrix{M,N}) where {M,N} = (static(M),static(N))
@inline ArrayInterface.axes(A::SizedMatrix{M,N}, ::StaticInt{1}) where {M,N} = static(1):static(M)
@inline ArrayInterface.axes(A::SizedMatrix{M,N}, ::StaticInt{2}) where {M,N} = static(1):static(N)
@inline ArrayInterface.axes(A::SizedMatrix{M,N}) where {M,N} = (static(1):static(M), static(1):static(N))
@inline Base.parent(A::SizedMatrix) = A.data
@inline ArrayInterface.parent_type(::Type{<:SizedMatrix{M,N,T}}) where {M,N,T} = Matrix{T}
@inline function Base.getindex(A::SizedMatrix, i...)
@boundscheck checkbounds(A, i...)
@inbounds A.data[i...]
end
@inline function Base.setindex!(A::SizedMatrix, v, i...)
@boundscheck checkbounds(A, i...)
@inbounds A.data[i...] = v
end

for At ∈ (false,true), Bt ∈ (false,true)
fname = Symbol(At ? "At" : "A", "mul", Bt ? "Bt" : "B")
for static_size ∈ (0,7,8)
if static_size == 7
Ctype = :(SizedMatrix{7,7,Float64})
fname_sized = Symbol(fname, "7x7x7")
elseif static_size == 8
Ctype = :(SizedMatrix{8,8,Float64})
fname_sized = Symbol(fname, "8x8x8")
else
Ctype = :(Matrix{Float64})
fname_sized = fname
end
Atype = At ? :(Adjoint{Float64,\$Ctype}) : Ctype
Btype = Bt ? :(Adjoint{Float64,\$Ctype}) : Ctype
A = At ? :(parent(A)) : :A
B = Bt ? :(parent(B)) : :B
for loop ∈ (false, true)
funcname! = loop ? :fortmul! : :fortmul_loop!
cfname = QuoteNode(loop ? Symbol(fname_sized, "_loop") : fname_sized)
matmul_quote = if static_size == 0
quote
function \$funcname!(C::\$Ctype, A::\$Atype, B::\$Btype)
M, N = size(C); K = size(B,1)
ccall((\$cfname, LIBFORTMUL), Cvoid, (Ref{Float64},Ref{Float64},Ref{Float64},Ref{Clong},Ref{Clong},Ref{Clong}), C, \$A, \$B, M, K, N)
end
end
else
quote
function \$funcname!(C::\$Ctype, A::\$Atype, B::\$Btype)
ccall((\$cfname, LIBFORTMUL), Cvoid, (Ref{Float64},Ref{Float64},Ref{Float64}), C, \$A, \$B)
end
end
end
@eval \$matmul_quote
end
end
end

const Matrix7x7 = SizedMatrix{7,7}
const Matrix8x8 = SizedMatrix{8,8}
A7x7 = rand(7,7); B7x7 = rand(7,7); C7x7 = rand(7,7);
A8x8 = rand(8,8); B8x8 = rand(8,8); C8x8 = rand(8,8);

function print_matmul_summary(C, A, B)
M, K = size(A); N = size(B,2);
A_str = A isa Adjoint ? "A'" : "A"
B_str = B isa Adjoint ? "B'" : "B"
static_str = C isa SizedMatrix ? "compile" : "runtime"
println("Running \$(static_str)-sized \$(M)×\$(N) = \$(M)×\$(K) * \$(K)×\$(N) benchmarks for C = \$A_str * \$B_str")
end
function run_benchmark(C,A,B)
print_matmul_summary(C,A,B)
M, K = size(A); N = size(B,2)
println("gfortran Builtin:")
t = @belapsed fortmul!(\$C, \$A, \$B)
println("GFLOPS: \$(2e-9M*K*N/t)")
Ccopy = copy(C)
println("gfortran Loops:")
t = @belapsed fortmul_loop!(\$C, \$A, \$B)
println("GFLOPS: \$(2e-9M*K*N/t)")
@assert C ≈ Ccopy
t = @belapsed mul_serial!(\$C, \$A, \$B)
println("GFLOPS: \$(2e-9M*K*N/t)")
@assert C ≈ Ccopy
if length(C) > 32^2
t = @belapsed mul_threads!(\$C, \$A, \$B)
println("GFLOPS: \$(2e-9M*K*N/t)")
@assert C ≈ Ccopy
end
println()
end
for (C,A,B) ∈ ((C7x7,A7x7,B7x7),(C8x8,A8x8,B8x8)), At ∈ (false,true), Bt ∈ (false,true), static_size ∈ (false,true)
M,K = size(A); N = size(B,2)
T = static_size ? (M == 7 ? Matrix7x7 : Matrix8x8) : identity
Atemp = T(A); Btemp = T(B)
Atemp = At ? Atemp' : Atemp
Btemp = Bt ? Btemp' : Btemp
run_benchmark(T(C), Atemp, Btemp)
end

for M ∈ (71, 72, 144, 216)
K = N = M
A = rand(M,K); B = rand(K,N);
C = Matrix{Float64}(undef, M, N);

run_benchmark(C, A, B)
run_benchmark(C, A, B')
run_benchmark(C, A', B)
run_benchmark(C, A', B')
end
``````
Results
``````Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 4.290648913738776
gfortran Loops:
GFLOPS: 3.8254906148406542
GFLOPS: 25.796296296296298

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 26.027454718779794
gfortran Loops:
GFLOPS: 15.402794157670328
GFLOPS: 65.19349315068494

Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.376153217330375
gfortran Loops:
GFLOPS: 3.8443792201472116
GFLOPS: 34.47618088427838

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 26.010593704748118
gfortran Loops:
GFLOPS: 13.077782718818277
GFLOPS: 65.18729192428422

Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 4.0271049910194305
gfortran Loops:
GFLOPS: 3.9683312655086853
GFLOPS: 10.530866711571298

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 18.178010471204193
gfortran Loops:
GFLOPS: 10.238181624159642
GFLOPS: 28.06786345150557

Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 4.315322113055289
gfortran Loops:
GFLOPS: 2.8338118977575006
GFLOPS: 24.073995697104365

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 16.877507447864946
gfortran Loops:
GFLOPS: 13.098295894655307
GFLOPS: 42.206275815301154

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 4.654805292161444
gfortran Loops:
GFLOPS: 7.049481636524185
GFLOPS: 34.24933544197315

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 75.37630919014605
gfortran Loops:
GFLOPS: 13.412733778948079
GFLOPS: 81.84462757020562

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.4824207638528932
gfortran Loops:
GFLOPS: 6.503686130838877
GFLOPS: 49.12558945241074

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 75.28191528545119
gfortran Loops:
GFLOPS: 47.59792997342533
GFLOPS: 81.79227632525786

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 8.2165948853584
gfortran Loops:
GFLOPS: 5.622450042883365
GFLOPS: 13.54091119775336

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 56.06721950683728
gfortran Loops:
GFLOPS: 47.56793060025185
GFLOPS: 49.458773374672994

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 4.89827710282352
gfortran Loops:
GFLOPS: 5.558826599853847
GFLOPS: 31.88760662318113

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 56.06721950683728
gfortran Loops:
GFLOPS: 11.955531821933105
GFLOPS: 48.73397298200392

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 5.865806789966648
gfortran Loops:
GFLOPS: 22.828868478122207
GFLOPS: 109.64066903565738
GFLOPS: 326.3791478798318

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.2668930328504797
gfortran Loops:
GFLOPS: 6.8133941234140165
GFLOPS: 107.9801484342002
GFLOPS: 317.0627491510409

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 14.328762735952917
gfortran Loops:
GFLOPS: 14.112931527375249
GFLOPS: 52.568260262906655
GFLOPS: 191.04598139393016

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 5.875484273425701
gfortran Loops:
GFLOPS: 20.93720202404282
GFLOPS: 96.3000033632664
GFLOPS: 153.8464734749436

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 5.997782455689288
gfortran Loops:
GFLOPS: 26.56380328802221
GFLOPS: 123.29809725158565
GFLOPS: 385.96556537924624

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.3204458717718337
gfortran Loops:
GFLOPS: 7.33131021478448
GFLOPS: 123.28995177379932
GFLOPS: 388.96206752813674

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 25.192224622030242
gfortran Loops:
GFLOPS: 24.992333188255387
GFLOPS: 68.62437948152235
GFLOPS: 200.95457298606908

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 5.9635236505108775
gfortran Loops:
GFLOPS: 24.258148376823844
GFLOPS: 110.31417171567905
GFLOPS: 146.8943299693588

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 5.601758954232673
gfortran Loops:
GFLOPS: 39.86547665934595
GFLOPS: 126.24657534246577
GFLOPS: 1125.972032806461

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 2.578293861451797
gfortran Loops:
GFLOPS: 8.234176150543389
GFLOPS: 108.0644916128332
GFLOPS: 975.3018029788348

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 25.676496764623693
gfortran Loops:
GFLOPS: 25.937223937771183
GFLOPS: 89.84456145629608
GFLOPS: 867.8420088935393

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 5.639534367631242
gfortran Loops:
GFLOPS: 37.20829153712439
GFLOPS: 114.82566478878657
GFLOPS: 661.0546823112686

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 4.885176232546832
gfortran Loops:
GFLOPS: 46.77228108770416
GFLOPS: 123.50268998394588
GFLOPS: 1846.7465640461794

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 2.501163629630972
gfortran Loops:
GFLOPS: 8.238294838975227
GFLOPS: 98.05113835376532
GFLOPS: 1477.1265665078784

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 25.663006899783166
gfortran Loops:
GFLOPS: 25.26621713606637
GFLOPS: 96.12499105775973
GFLOPS: 1474.9646542261253

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 4.846530403782313
gfortran Loops:
GFLOPS: 52.11586018586035
GFLOPS: 105.0252305768329
GFLOPS: 1462.9739420773753
``````

versioninfo:

``````julia> versioninfo()
Julia Version 1.7.0-DEV.1052
Commit cc59947c03* (2021-05-02 02:36 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
Environment:
``````

The benchmarks are a bunch of `matmul`s via the Fotran intrinsic as well as loops.
The Julia code contrains the compilation command. I’m open to suggestions for better loops and compiler flags.
The benchmark contains benchmarks for runtime and compile-time sized 7x7 and 8x8 matrices for all permutations of tranposing `A` and `B` in `C = A*B`, and then runtime sized versions for `71x71`, `72x72`, `144x144`, and `216x216`.
The larger sizes also include multithreaded Julia matrix multiply (implemented as `@avxt` on 3 loops). Note that Julia must be started with multiple threads to support multithreading.
Also, `@avx(t)` is currently limited to simple rectangular loop nests; i.e. triangular loops or multiple loops at the same level in a nest don’t work at the moment, and neither does complex control flow (you can remove `@avx` if you want to benchmark that).

Also note that there’s only two versions of the matrix multiply function in Julia: 1 single threaded, another multithreaded. The same function is used for all transposed permutations, without us having to write and optimize a different version for each.

EDIT: Fixed names following @themos’ suggestion.

8 Likes

I rather object to you labelling the results from one compiler as the “Fortran” results. I also suspect that your work in producing fast loops would have produced similar results had you worked on a Fortran compiler that you had access to the internals of. Is there any reason that it wouldn’t?

It may well turn out to be the case that Fortran compilers have dropped the ball when it comes to delivering blistering performance in the cases that matter (thousands of communicating processes solving engineering and science problems) but it will take significantly more work to demonstrate that.

2 Likes

Unlike Julia, the Fortran compiler development efforts are not centralized. From what I have learned over the past years, some are very good at debugging (like NAG, GFortran) and some others are historically known for better optimizations (like Intel, Cray, IBM, PGI, …). But these could change at any time. Having and maintaining multiple high-quality compilers takes a lot of effort. For benchmarking purposes, it makes sense to compile Fortran code with all available compilers and report the best result. I think this issue highlights the importance of such activities as Google Summer of Code 2021

3 Likes

Julia correctly sums 10^8 32-bit uniform deviates, with the program

``````n = 10^8
x = Float32.(rand(n))
println(sum(x)/n)
``````

giving output such as `0.49995995` but gfortran for

``````implicit none
integer, parameter :: n = 10**8
real :: x(n)
call random_number(x)
print*,sum(x)/n
end
``````

gives `0.167772159` and g95 gives `0.16777216` but Intel Fortran gives `0.4999704`
Compensated summation, which gets this correct, may eventually be in stdlib. Ideally a compiler would have an option to compute SUM carefully and the user could verify that toggling the option does not influence the program results.

By default, Julia uses pairwise summation, which explains this. On modern architectures, `sum` will usually be memory-bound, so the extra (very minor) cost of summing out of order ends up not mattering.

1 Like