Simple summation 8x slower than in Julia

Is there a way to tell Fortran compilers to maximally evaluate these sorts of things at compile time? Since all the values are known at compile time, the whole program could be reduced to a print of a known constant.

1 Like

It is up to each Fortran compiler to document such things.

@certik and everyone,

Notwithstanding the excellent point by @themos that the specific case at hand is not a true performance benchmark on account of the mathematical identity, suppose we keep the same case but modify it as suggested upthread: treat the sin calculation as a representative function computation whose common library implementation is consumed from whichever programming paradigm is of interest, whether Julia or Fortran, and “up” the game a little to give the calculation some scale, i.e., raise N to 800,000,000 rather than 100,000,000. On that basis, a speedup close to 6x relative to Julia is achievable with Fortran.

So with John Burkardt’s C code for the CORDIC algorithm (link upthread) as the basis, the Julia code and the result are as follows (as noted by @uwe, the @avx macro fails to compile here, so it’s removed):

function cordic_sine(a)

   cos = Ref{Float64}(0.0)
   sin = Ref{Float64}(0.0)

   # Ref: Excellent work by John Burkardt
   # https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c
   # void cossin_cordic ( double beta, int n, double *c, double *s )
   ccall((:cossin_cordic, "cordic"), Cvoid, (Float64, Cint, Ref{Float64}, Ref{Float64}), a, 40, cos, sin)

   return sin[]

end

function f_avx(N)
   s = 0.0
   for i in 1:N
       s += cordic_sine(convert(Float64, i))
   end
   s
end

@time r = f_avx(800000000)
println(r)

Execution result with Julia:

C:\temp>julia avx.jl
172.551030 seconds (1 allocation: 16 bytes)
1.9006631241568408

With Fortran, say one pursues a rather simple-minded coarray approach, as shown in the fold below - while noting that coarrays are part of the base language:

Click to see Fortran code

Simple coarray example to divvy up the series summation across images

module cordic_m
   use, intrinsic :: iso_c_binding, only : c_int, c_double
   interface
      pure subroutine cossin_cordic( beta, n, c, s) bind(C, name="cossin_cordic")
      ! Ref: Excellent work by John Burkardt
      ! https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c
      ! void cossin_cordic ( double beta, int n, double *c, double *s )
         import :: c_int, c_double
         ! Argument list
         real(c_double), intent(in), value :: beta
         integer(c_int), intent(in), value :: n
         real(c_double), intent(inout)     :: c 
         real(c_double), intent(inout)     :: s 
      end subroutine
   end interface
contains
   pure function cordic_sine( a ) result(sin_a)
      ! Argument list
      real(c_double), intent(in) :: a
      ! Function result
      real(c_double) :: sin_a
      real(c_double) :: cos_a
      call cossin_cordic( a, 40, cos_a, sin_a )
   end function
end module
   use cordic_m
   integer, parameter :: M = 100000000
   real(c_double) :: r[*]
   real(c_double) :: t1, t2
   real(c_double) :: sn
   integer :: i

   if ( this_image() == 1 ) then
      call cpu_time(t1)
   end if
   sync all
       
   r = f(M)
   sync all

   if ( this_image() == 1 ) then
      sn = 0.0_c_double
      do i = 1, num_images()
         sn = sn + r[i]
      end do
      call cpu_time(t2)
      print *, "Time", t2-t1
      print *, sn
   end if

contains

   pure function f(N) result(r)
      integer, intent(in) :: N
      real(c_double) :: r
      integer :: i
      r = 0.0_c_double
      do i = (this_image()-1)*M+1, this_image()*M
         r = r + cordic_sine(real(i,c_double))
      end do
   end function

end program 

the timing result with IFORT on the same CPU as the above Julia case is:

C:\temp>ifort /c /O3 /Qcoarray:shared /Qcoarray-num-images=8 /QxHost /traceback p.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.2.0 Build 20210228_000000
Copyright (C) 1985-2021 Intel Corporation. All rights reserved.

C:\temp>link p.obj cordic.obj /subsystem:console /out:p.exe
Microsoft (R) Incremental Linker Version 14.27.29112.0
Copyright (C) Microsoft Corporation. All rights reserved.

C:\temp>p.exe
Time 30.2500000000000
1.90066312415625

Note the series summation result agrees with that of Julia to about 12 significant digits in c_double / Float64.

Perhaps one or more of the Julia enthusiasts can show here the parallel execution option(s) with Julia for the exact same case, i.e., their native equivalent of Fortran coarrays - now that would be a good learning exercise for readers here!

3 Likes

As far as I understand, Julia is quite strict about floating point semantics, so it will not allow transformations that violate those semantics (e.g. loop reordering). It does autovectorize (through LLVM), but only as far as it can prove the result is exactly the same. I think Fortran does the same, so differences there are probably due to compilers more than languages. Julia annotations such as @simd or @fastmath relax the semantics, allowing the compiler to optimize more, so compared to Fortran it’s more granular than compiler switches. I sometimes wish Julia would allow more aggressive transforms automagically, but it’s not easy to find an API that allows automatic transformations in the cases where they don’t matter and forbids them in the cases where they do. In any case, adding @simd in front of tight loops is not that big of a deal.

None of the languages will vectorize this without permission, because vectorization requires reassociating the order of the summation in violation of the IEEE 754 floating point standard. That permission is given in C and Fortran by passing the -ffast-math flag for an entire compilation unit. This flag also gives permission for other transformations that may speed up your code but may also change the answer for better or worse - and definitely less predictably, since the scope of the transformations is unspecified and unbounded. (In bad cases, arbitrarily wrong answers can result from careless application of the -ffast-math option.)
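
For concreteness, here is a minimal Fortran sketch of the reduction in question (the file name and N are arbitrary). With gfortran, this summation loop is typically only vectorized when reassociation is permitted, e.g. via -ffast-math or -Ofast; the exact behavior is compiler- and version-dependent:

! sum_sin.f90 - minimal reduction loop; the additions stay in strict
! IEEE order unless the compiler is allowed to reassociate them, e.g.
!    gfortran -O3 -ffast-math -march=native sum_sin.f90
program sum_sin
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, parameter :: n = 100000000
   real(dp) :: s
   integer :: i
   s = 0.0_dp
   do i = 1, n
      s = s + sin(real(i, dp))
   end do
   print *, s
end program sum_sin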

In Julia, a much more limited permission to change associativity is given by using the built-in @simd macro. This is a much more surgical approach since it only gives permission to reassociate operations as necessary to vectorize the annotated loop, and doesn’t allow any of the other potentially more problematic transformations that the -ffast-math option can cause. The @avx macro provided by the LoopVectorization package does a bit more than this: it explicitly unrolls the loop, and replaces calls to certain math functions (including sin) with vectorized versions that may give slightly different results. As @antoine-levitt mentioned, there is also a built-in @fastmath macro that gives permissions similar to what -ffast-math does (but locally).

7 Likes

I can’t get this cordic program to compile on my machine so I can’t run my own comparisons. If you want to do multithreading, the simplest way would be to just write

function f_certainly_not_avx(N)
   s = 0.0
   Threads.@threads for i in 1:N
       s += cordic_sine(convert(Float64, i))
   end
   s
end

Not really sure why or how this is relevant to the original question that @certik asked though.

If one is willing to roll up their sleeves, they could make this cordic_sine function work with LoopVectorization.@avx; they would just have to add dispatches that act correctly and vectorize across a VectorizationBase.AbstractSIMD type, but it wouldn’t be super simple.

Chris explained a bit on how to do so here: https://julialang.zulipchat.com/#narrow/stream/137791-general/topic/LoopVectorization.20vectorize.20an.20arbitrary.20function

@Mason,

No, “multithreading” is not of interest; the curiosity, given the original post, is how to do parallel programming in Julia. Regardless, your suggestion of multithreading decreases the performance of the Julia case:

C:\temp>type avx.jl
function cordic_sine(a::Float64)

   cos = Ref{Float64}(0.0)
   sin = Ref{Float64}(0.0)

   ccall((:cossin_cordic, "cordic"), Cvoid, (Float64, Cint, Ref{Float64}, Ref{Float64}), a, 40, cos, sin)

   return sin[]

end

function f_avx(N)
   s = 0.0
   Threads.@threads for i in (Threads.threadid()-1)*(N/Threads.nthreads()):Threads.threadid()*(N/Threads.nthreads())
       s += cordic_sine(convert(Float64, i))
   end
   s
end

@time r = f_avx(800000000)
println(r)

C:\temp>julia avx.jl
201.834736 seconds (1.60 G allocations: 23.847 GiB, 0.80% gc time, 0.02% compilation time)
1.9006631241572693

C:\temp>

You probably started julia with only 1 thread. On my 6 core machine this is what I see:

julia> function cordic_sine(a)

          cos = Ref{Float64}(0.0)
          sin = Ref{Float64}(0.0)

          # Ref: Excellent work by John Burkardt
          # https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c
          # void cossin_cordic ( double beta, int n, double *c, double *s )
          ccall((:cossin_cordic, "/home/mason/cordic/libcordic.so"), Cvoid, (Float64, Cint, Ref{Float64}, Ref{Float64}), a, 40, cos, sin)

          return sin[]

       end
cordic_sine (generic function with 1 method)

julia> function f_definitely_not_avx(N)
           s = 0.0
           Threads.@threads for i in 1:N
               s += cordic_sine(convert(Float64, i))
           end
           s
       end
f_definitely_not_avx (generic function with 1 method)
julia> @time r = f_definitely_not_avx(800000000)
 57.295734 seconds (1.60 G allocations: 23.846 GiB, 9.56% gc time, 0.04% compilation time)
-0.8053191287737147

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb* (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 6
1 Like

@Mason, here are the steps I followed on Windows OS with the cordic.c and cordic.h from the FSU site I linked upthread. Note that on Windows, Julia seems to want a shared library (DLL) from what I could see. Here, a Windows module-definition (.def) file of the symbols exported from the library makes things clearer in my view:

C:\temp>dir cordic.*
Volume in drive C is OS
Volume Serial Number is C4CA-BA12

Directory of C:\temp

05/08/2021 11:15 AM 63,494 cordic.c
05/08/2021 01:09 PM 50 cordic.def
05/08/2021 11:15 AM 1,196 cordic.h
3 File(s) 64,740 bytes
0 Dir(s) 23,240,077,312 bytes free

C:\temp>type cordic.def
LIBRARY cordic
EXPORTS
cossin_cordic @1

C:\temp>cl /c /O2 cordic.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.27.29112 for x64
Copyright (C) Microsoft Corporation. All rights reserved.

cordic.c

C:\temp>link cordic.obj /dll /def:cordic.def /out:cordic.dll
Microsoft (R) Incremental Linker Version 14.27.29112.0
Copyright (C) Microsoft Corporation. All rights reserved.

Creating library cordic.lib and object cordic.exp

C:\temp>

What exactly do you mean by parallel here? Maybe you mean what I’d recognize as distributed / multiprocess? In that case, you’d want to use the Distributed standard library or something like MPI.jl.

Here’s an example on my 6 core machine:

julia> using Distributed; addprocs(5); nprocs()
6

julia> @everywhere function cordic_sine(a)
           cos = Ref{Float64}() # don't waste time zeroing data that'll get thrown out
           sin = Ref{Float64}() 
           # Ref: Excellent work by John Burkardt
           # https://people.sc.fsu.edu/~jburkardt/c_src/cordic/cordic.c
           # void cossin_cordic ( double beta, int n, double *c, double *s )
           ccall((:cossin_cordic, "/home/mason/cordic/libcordic.so"), Cvoid, (Float64, Cint, Ref{Float64}, Ref{Float64}), a, 40, cos, sin)
           return sin[]
       end

julia> function f_distributed(N)
           @distributed (+) for i in 1:N
               cordic_sine(convert(Float64, i))
           end
       end
f_distributed (generic function with 1 method)

julia> @time r = f_distributed(800000000)
 45.842816 seconds (933.50 k allocations: 54.914 MiB, 0.03% gc time, 0.67% compilation time)
1.9006631241562875
julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb* (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 6

Here is the time for the serial version:

julia> function f_serial(N)
          s = 0.0
          for i in 1:N
              s += cordic_sine(convert(Float64, i))
          end
          s
       end
f_serial (generic function with 1 method)

julia> @time r = f_serial(800000000)
216.155306 seconds (1 allocation: 16 bytes)
1.9006631241568408
1 Like

Again though, I really don’t understand the point of this benchmark you’re proposing @FortranFan. Why are we interested in a benchmark that just times Fortran’s / Julia’s foreign function interfaces in parallel?

As far as I understood, this thread was about @certik wanting to understand how LoopVectorization.jl was generating such fast code in julia and how it could be replicated in Fortran.

2 Likes

On a more general note, people might like to time their Fortran compiler’s SUM intrinsic on the largest array their system will handle and report the nanoseconds per item. When I do this, I find large variations, with the Cray compiler on Intel hardware clearly ahead. That merely demonstrates that vendors prioritise different aspects of performance. You will always find some examples of sub-optimal performance if that is what you set out to look for.
The Fortran community is well aware that GPUs are a challenge for modern Fortran, but I don’t think it has been demonstrated that Fortran cannot adapt to the challenge because of fundamental, baked-in language design issues. I don’t think that fast trig approximations are a battle worth fighting, any more than trying to beat, say, FFTW or MKL DGEMM in Fortran.
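
A minimal sketch of such a SUM timing (array size, timer, and reporting are placeholders; adjust n to the largest array your system will hold):

program time_sum
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, parameter :: n = 200000000      ! placeholder size
   real(dp), allocatable :: x(:)
   real(dp) :: s
   integer(8) :: t0, t1, rate
   allocate(x(n))
   call random_number(x)
   call system_clock(t0, rate)
   s = sum(x)
   call system_clock(t1)
   print *, "sum =", s
   print *, "ns per element:", 1.0d9 * real(t1 - t0, dp) / (real(rate, dp) * real(n, dp))
end program time_sum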

I’m not really proposing this benchmark; as shown upthread, it’s not realistic. The point is only that there are two parts to this “micro” benchmark: one is some form of non-serial execution of the “+” reduction (whether via AVX or otherwise), and the other is the function evaluation.

If placed on the same basis, there are hardly any differences between modern languages once the calculations have any heft to them.

What’s left is a simple, convenient way of introducing these aspects natively into the language while abstracting away the paradigms and separating out hardware-related considerations such as AVX.

For practitioners needing to use Fortran, the above is just a simple example using coarrays.

About "really don’t understand the point " is a question that can be raised of the ones by Jula-lang here.

1 Like

By default, Julia can use more memory than a program compiled by gfortran: on my Windows 10 PC, Julia runs

n = 10^9
x = rand(n)
println(sum(x)/n)

in 2.8 s (wall time), while gfortran -O2 cannot do the calculation for n = 3e8, and for n = 2e8

implicit none
integer, parameter :: n = 2*10**8
real(kind=kind(1.0d0)) :: x(n)
call random_number(x)
print*,n,sum(x)/n
end

takes 1.7 s. Is there a gfortran compiler option that permits larger arrays? I don’t think Julia “sees” that it need not store the entire array x, since for n = 5e9 the Julia program hangs my machine. Gfortran does compile a program with n = 1e9 using the options gfortran -fstack-arrays -fmax-stack-var-size=3000000000 -O2, but running the program produces no output.

On Windows OS, it’s better to use ALLOCATABLE arrays, since the default stack size is rather low, and there are further OS restrictions if you’re using 32-bit.

64-bit.
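
For illustration, a minimal allocatable-array sketch of the program above (heap-allocated, so the stack-size limit does not apply; n is whatever your memory allows):

program mean_random
   implicit none
   integer, parameter :: n = 10**9
   real(kind=kind(1.0d0)), allocatable :: x(:)
   allocate(x(n))                ! heap allocation, not limited by the stack size
   call random_number(x)
   print *, n, sum(x)/n
end program mean_random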

It wouldn’t be able to transform it anyway, so removing the using LoopVectorization and just using for i in would be fine.

LoopVectorization.jl uses a fork of a port of an old version of SLEEF for its sin function, so this is the fairest comparison. But as has been pointed out before, in Julia sin will be inlined, and some things like constant loads will get hoisted out of the loop.

I think it’s more likely that better language integration will go the other way: Julia’s compiler team wants to “rip open” the compiler, better exposing it to library authors to build their own.

LoopVectorization.jl is experimental, and will go through at least one major overhaul before I am comfortable tagging a “1.0” version.

It does. Others have confirmed it works on Power8 and Nvidia Jetson (ARM).
I bought an M1 Mac specifically so I could test and make sure performance isn’t accidentally bad, and can confirm it works there as well (you just need Julia master, and – until we get a Hwloc_jll version compatible with Apple ARM – you’d need to ] dev Hwloc to replace Hwloc_jll with a path to the library installed some other way; I used Homebrew).

Given that LoopVectorization is still at version “0”, now (or, “sooner rather than later”) would be the time to deprecate @avx, and choose a new name for the macro.
@avx seems to consistently cause confusion over whether platforms other than x86 are supported. The reaction to the name seems generally pretty negative.

I’m open to name suggestions here, but I’d probably ask on Julia’s Slack/Zulip.

Why did I pick the name @avx?

  1. Because it is short.
  2. Because @simd was taken.
  3. I didn’t choose @vectorize because there was already a vastly inferior macro with that name in LoopVectorization at the time that many of my packages were using, and I didn’t want to break them. It’s long since been removed. I don’t want the name anymore either, because aside from length, it seems like it’d cause similar confusion if/when matrix-extension support is added. The library name itself makes that mistake, but c’est la vie.

Perhaps @fast, to be fully open ended about how it gets there, and being short?

Depends on the loop.
If autovectorization requires violating IEEE, then it will require annotation of some kind, e.g. @fastmath or @simd from Base Julia.
LoopVectorization.@avx is a macro implemented by a Julia package that implements its own (experimental) autovectorizer.
It does more than just use special functions.

LoopVectorization.jl’s approach does not have that limitation, although I should clean up erf support, which currently requires specifying verf.

LoopVectorization is also pretty good at nested loops. This is most relevant whenever anyone’s doing things not already easily/well expressed by BLAS, e.g. convolutions and tensor contractions (where expressing them as BLAS has some overhead).

gfortran code for the benchmark
module matrixmul
  use ISO_C_BINDING

contains
! gfortran -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC matmul.f90 -o libmatmul.so

  subroutine AmulB(C, A, B, M, K, N) BIND(C, name="AmulB")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(M, K), intent(in) :: A
    real(C_double), dimension(K, N), intent(in) :: B
    C = matmul(A, B)
  end subroutine AmulB
  subroutine AmulB7x7x7(C, A, B) BIND(C, name="AmulB7x7x7")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: A
    real(C_double), dimension(7, 7), intent(in) :: B
    C = matmul(A, B)
  end subroutine AmulB7x7x7
  subroutine AmulB8x8x8(C, A, B) BIND(C, name="AmulB8x8x8")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: A
    real(C_double), dimension(8, 8), intent(in) :: B
    C = matmul(A, B)
  end subroutine AmulB8x8x8

  subroutine AmulBt(C, A, Bt, M, K, N) BIND(C, name="AmulBt")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(M, K), intent(in) :: A
    real(C_double), dimension(N, K), intent(in) :: Bt
    C = matmul(A, transpose(Bt))
  end subroutine AmulBt
  subroutine AmulBt7x7x7(C, A, Bt) BIND(C, name="AmulBt7x7x7")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: A
    real(C_double), dimension(7, 7), intent(in) :: Bt
    C = matmul(A, transpose(Bt))
  end subroutine AmulBt7x7x7
  subroutine AmulBt8x8x8(C, A, Bt) BIND(C, name="AmulBt8x8x8")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: A
    real(C_double), dimension(8, 8), intent(in) :: Bt
    C = matmul(A, transpose(Bt))
  end subroutine AmulBt8x8x8

  subroutine AtmulB(C, At, B, M, K, N) BIND(C, name="AtmulB")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(K, M), intent(in) :: At
    real(C_double), dimension(K, N), intent(in) :: B
    C = matmul(transpose(At), B)
  end subroutine AtmulB
  subroutine AtmulB7x7x7(C, At, B) BIND(C, name="AtmulB7x7x7")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: At
    real(C_double), dimension(7, 7), intent(in) :: B
    C = matmul(transpose(At), B)
  end subroutine AtmulB7x7x7
  subroutine AtmulB8x8x8(C, At, B) BIND(C, name="AtmulB8x8x8")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: At
    real(C_double), dimension(8, 8), intent(in) :: B
    C = matmul(transpose(At), B)
  end subroutine AtmulB8x8x8

  subroutine AtmulBt(C, At, Bt, M, K, N) BIND(C, name="AtmulBt")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(K, M), intent(in) :: At
    real(C_double), dimension(N, K), intent(in) :: Bt
    C = transpose(matmul(Bt, At))
  end subroutine AtmulBt
  subroutine AtmulBt7x7x7(C, At, Bt) BIND(C, name="AtmulBt7x7x7")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: At
    real(C_double), dimension(7, 7), intent(in) :: Bt
    C = transpose(matmul(Bt, At))
  end subroutine AtmulBt7x7x7
  subroutine AtmulBt8x8x8(C, At, Bt) BIND(C, name="AtmulBt8x8x8")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: At
    real(C_double), dimension(8, 8), intent(in) :: Bt
    C = transpose(matmul(Bt, At))
  end subroutine AtmulBt8x8x8

  subroutine AmulB_loop(C, A, B, M, K, N) BIND(C, name="AmulB_loop")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(M, K), intent(in) :: A
    real(C_double), dimension(K, N), intent(in) :: B
    real(C_double) :: Cmn
    do concurrent(nn = 1:N, mm = 1:M)
       Cmn = 0.0
       do kk = 1,K
          Cmn = Cmn + A(mm,kk) * B(kk,nn)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AmulB_loop
  subroutine AmulB7x7x7_loop(C, A, B) BIND(C, name="AmulB7x7x7_loop")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: A
    real(C_double), dimension(7, 7), intent(in) :: B
    real(C_double) :: Cmn
    do concurrent(nn = 1:7, mm = 1:7)
       Cmn = 0.0
       do kk = 1,7
          Cmn = Cmn + A(mm,kk) * B(kk,nn)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AmulB7x7x7_loop
  subroutine AmulB8x8x8_loop(C, A, B) BIND(C, name="AmulB8x8x8_loop")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: A
    real(C_double), dimension(8, 8), intent(in) :: B
    real(C_double) :: Cmn
    do concurrent(nn = 1:8, mm = 1:8)
       Cmn = 0.0
       do kk = 1,8
          Cmn = Cmn + A(mm,kk) * B(kk,nn)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AmulB8x8x8_loop

  subroutine AmulBt_loop(C, A, Bt, M, K, N) BIND(C, name="AmulBt_loop")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(M, K), intent(in) :: A
    real(C_double), dimension(N, K), intent(in) :: Bt
    real(C_double) :: Cmn
    do concurrent(nn = 1:N, mm = 1:M)
       Cmn = 0.0
       do kk = 1,K
          Cmn = Cmn + A(mm,kk) * Bt(nn,kk)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AmulBt_loop
  subroutine AmulBt7x7x7_loop(C, A, Bt) BIND(C, name="AmulBt7x7x7_loop")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: A
    real(C_double), dimension(7, 7), intent(in) :: Bt
    real(C_double) :: Cmn
    do concurrent(nn = 1:7, mm = 1:7)
       Cmn = 0.0
       do kk = 1,7
          Cmn = Cmn + A(mm,kk) * Bt(nn,kk)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AmulBt7x7x7_loop
  subroutine AmulBt8x8x8_loop(C, A, Bt) BIND(C, name="AmulBt8x8x8_loop")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: A
    real(C_double), dimension(8, 8), intent(in) :: Bt
    real(C_double) :: Cmn
    do concurrent(nn = 1:8, mm = 1:8)
       Cmn = 0.0
       do kk = 1,8
          Cmn = Cmn + A(mm,kk) * Bt(nn,kk)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AmulBt8x8x8_loop

  subroutine AtmulB_loop(C, At, B, M, K, N) BIND(C, name="AtmulB_loop")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(K, M), intent(in) :: At
    real(C_double), dimension(K, N), intent(in) :: B
    real(C_double) :: Cmn
    do concurrent(nn = 1:N, mm = 1:M)
       Cmn = 0.0
       do kk = 1,K
          Cmn = Cmn + At(kk,mm) * B(kk,nn)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AtmulB_loop
  subroutine AtmulB7x7x7_loop(C, At, B) BIND(C, name="AtmulB7x7x7_loop")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: At
    real(C_double), dimension(7, 7), intent(in) :: B
    real(C_double) :: Cmn
    do concurrent(nn = 1:7, mm = 1:7)
       Cmn = 0.0
       do kk = 1,7
          Cmn = Cmn + At(kk,mm) * B(kk,nn)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AtmulB7x7x7_loop
  subroutine AtmulB8x8x8_loop(C, At, B) BIND(C, name="AtmulB8x8x8_loop")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: At
    real(C_double), dimension(8, 8), intent(in) :: B
    real(C_double) :: Cmn
    do concurrent(nn = 1:8, mm = 1:8)
       Cmn = 0.0
       do kk = 1,8
          Cmn = Cmn + At(kk,mm) * B(kk,nn)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AtmulB8x8x8_loop

  subroutine AtmulBt_loop(C, At, Bt, M, K, N) BIND(C, name="AtmulBt_loop")
    integer(C_long), intent(in) :: M, K, N
    real(C_double), dimension(M, N), intent(out) :: C
    real(C_double), dimension(K, M), intent(in) :: At
    real(C_double), dimension(N, K), intent(in) :: Bt
    real(C_double) :: Cmn
    do concurrent(nn = 1:N, mm = 1:M)
       Cmn = 0.0
       do kk = 1,K
          Cmn = Cmn + At(kk,mm) * Bt(nn,kk)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AtmulBt_loop
  subroutine AtmulBt7x7x7_loop(C, At, Bt) BIND(C, name="AtmulBt7x7x7_loop")
    real(C_double), dimension(7, 7), intent(out) :: C
    real(C_double), dimension(7, 7), intent(in) :: At
    real(C_double), dimension(7, 7), intent(in) :: Bt
    real(C_double) :: Cmn
    do concurrent(nn = 1:7, mm = 1:7)
       Cmn = 0.0
       do kk = 1,7
          Cmn = Cmn + At(kk,mm) * Bt(nn,kk)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AtmulBt7x7x7_loop
  subroutine AtmulBt8x8x8_loop(C, At, Bt) BIND(C, name="AtmulBt8x8x8_loop")
    real(C_double), dimension(8, 8), intent(out) :: C
    real(C_double), dimension(8, 8), intent(in) :: At
    real(C_double), dimension(8, 8), intent(in) :: Bt
    real(C_double) :: Cmn
    do concurrent(nn = 1:8, mm = 1:8)
       Cmn = 0.0
       do kk = 1,8
          Cmn = Cmn + At(kk,mm) * Bt(nn,kk)
       end do
       C(mm,nn) = Cmn
    end do
  end subroutine AtmulBt8x8x8_loop

  
  real(C_double) function dot3v2(x, A, y, M, N) BIND(C, name="dot3v2")
    integer(C_long), intent(in) :: M, N
    real(C_double), intent(in) :: x(M), A(M,N), y(N)
    real(C_double) :: t
    integer(C_long) :: mm, nn
    dot3v2 = 0.0d0
    do concurrent(nn = 1:N, mm = 1:M)
       dot3v2 = dot3v2 + x(mm) * A(mm, nn) * y(nn)
    end do
  end function dot3v2
  real(C_double) function dot3(x, A, y, M, N) BIND(C, name="dot3")
    integer(C_long), intent(in) :: M, N
    real(C_double), intent(in) :: x(M), A(M,N), y(N)
    real(C_double) :: t
    integer(C_long) :: mm, nn
    dot3 = 0.0d0
    do concurrent(nn = 1:N)
       t = 0.0d0
       do concurrent(mm = 1:M)
          t = t + x(mm) * A(mm, nn)
       end do
       dot3 = dot3 + t * y(nn)
    end do
  end function dot3
  real(C_double) function dot3builtin(x, A, y, M, N) BIND(C, name="dot3builtin")
    integer(C_long), intent(in) :: M, N
    real(C_double), intent(in) :: x(M), A(M,N), y(N)
    dot3builtin = dot_product(x, matmul(A, y))
  end function dot3builtin


end module matrixmul
Julia code and script for running benchmarks
using LoopVectorization, ArrayInterface, Static, LinearAlgebra

function mul_serial!(C, A, B)
  @avx for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
    Cmn = zero(eltype(C))
    for k ∈ indices((A,B), (2,1))
      Cmn += A[m,k] * B[k,n]
    end
    C[m,n] = Cmn
  end
end
function mul_threads!(C, A, B)
  @avxt for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
    Cmn = zero(eltype(C))
    for k ∈ indices((A,B), (2,1))
      Cmn += A[m,k] * B[k,n]
    end
    C[m,n] = Cmn
  end
end

const LIBFORTMUL = joinpath(pwd(), "libmatmul.so")
if !isfile(LIBFORTMUL)
  # run(`gfortran -Ofast -march=native -fdisable-tree-cunrolli -floop-nest-optimize -funroll-loops -mprefer-vector-width=$(LoopVectorization.pick_vector_width(Float64)*64) -shared -fPIC matmul.f90 -o $LIBFORTMUL`)
  run(`gfortran -Ofast -march=native -mprefer-vector-width=$(LoopVectorization.pick_vector_width(Float64)*64) -shared -fPIC matmul.f90 -o $LIBFORTMUL`)
end

struct SizedMatrix{M,N,T} <: DenseMatrix{T}
  data::Matrix{T}
end
@inline SizedMatrix{M,N}(data::Matrix{T}) where {M,N,T} = SizedMatrix{M,N,T}(data)
@inline Base.unsafe_convert(::Type{Ptr{T}}, A::SizedMatrix{M,N,T}) where {M,N,T} = Base.unsafe_convert(Ptr{T}, A.data)
@inline Base.size(A::SizedMatrix{M,N}) where {M,N} = (M,N)
@inline Base.elsize(A::SizedMatrix{M,N,T}) where {M,N,T} = sizeof(T)
@inline Base.length(A::SizedMatrix{M,N}) where {M,N} = M*N
@inline ArrayInterface.size(A::SizedMatrix{M,N}) where {M,N} = (static(M),static(N))
@inline ArrayInterface.axes(A::SizedMatrix{M,N}, ::StaticInt{1}) where {M,N} = static(1):static(M)
@inline ArrayInterface.axes(A::SizedMatrix{M,N}, ::StaticInt{2}) where {M,N} = static(1):static(N)
@inline ArrayInterface.axes(A::SizedMatrix{M,N}) where {M,N} = (static(1):static(M), static(1):static(N))
@inline Base.parent(A::SizedMatrix) = A.data
@inline ArrayInterface.parent_type(::Type{<:SizedMatrix{M,N,T}}) where {M,N,T} = Matrix{T}
@inline function Base.getindex(A::SizedMatrix, i...)
  @boundscheck checkbounds(A, i...)
  @inbounds A.data[i...]
end
@inline function Base.setindex!(A::SizedMatrix, v, i...)
  @boundscheck checkbounds(A, i...)
  @inbounds A.data[i...] = v
end

for At ∈ (false,true), Bt ∈ (false,true)
  fname = Symbol(At ? "At" : "A", "mul", Bt ? "Bt" : "B")
  for static_size ∈ (0,7,8)
    if static_size == 7
      Ctype = :(SizedMatrix{7,7,Float64})
      fname_sized = Symbol(fname, "7x7x7")
    elseif static_size == 8
      Ctype = :(SizedMatrix{8,8,Float64})
      fname_sized = Symbol(fname, "8x8x8")
    else
      Ctype = :(Matrix{Float64})
      fname_sized = fname
    end
    Atype = At ? :(Adjoint{Float64,$Ctype}) : Ctype
    Btype = Bt ? :(Adjoint{Float64,$Ctype}) : Ctype
    A = At ? :(parent(A)) : :A
    B = Bt ? :(parent(B)) : :B
    for loop ∈ (false, true)
      funcname! = loop ? :fortmul_loop! : :fortmul!
      cfname = QuoteNode(loop ? Symbol(fname_sized, "_loop") : fname_sized)
      matmul_quote = if static_size == 0
        quote
          function $funcname!(C::$Ctype, A::$Atype, B::$Btype)
            M, N = size(C); K = size(B,1)
            ccall(($cfname, LIBFORTMUL), Cvoid, (Ref{Float64},Ref{Float64},Ref{Float64},Ref{Clong},Ref{Clong},Ref{Clong}), C, $A, $B, M, K, N)
          end
        end
      else
        quote
          function $funcname!(C::$Ctype, A::$Atype, B::$Btype)
            ccall(($cfname, LIBFORTMUL), Cvoid, (Ref{Float64},Ref{Float64},Ref{Float64}), C, $A, $B)
          end
        end
      end
      @eval $matmul_quote
    end
  end
end

const Matrix7x7 = SizedMatrix{7,7}
const Matrix8x8 = SizedMatrix{8,8}
A7x7 = rand(7,7); B7x7 = rand(7,7); C7x7 = rand(7,7);
A8x8 = rand(8,8); B8x8 = rand(8,8); C8x8 = rand(8,8);

function print_matmul_summary(C, A, B)
  M, K = size(A); N = size(B,2);
  A_str = A isa Adjoint ? "A'" : "A"
  B_str = B isa Adjoint ? "B'" : "B"
  static_str = C isa SizedMatrix ? "compile" : "runtime"
  println("Running $(static_str)-sized $(M)×$(N) = $(M)×$(K) * $(K)×$(N) benchmarks for C = $A_str * $B_str")
end
function run_benchmark(C,A,B)
  print_matmul_summary(C,A,B)
  M, K = size(A); N = size(B,2)
  println("gfortran Builtin:")
  t = @belapsed fortmul!($C, $A, $B)
  println("GFLOPS: $(2e-9M*K*N/t)")
  Ccopy = copy(C)
  println("gfortran Loops:")
  t = @belapsed fortmul_loop!($C, $A, $B)
  println("GFLOPS: $(2e-9M*K*N/t)")
  @assert C ≈ Ccopy
  println("LoopVectorization.jl Single-Thread:")
  t = @belapsed mul_serial!($C, $A, $B)
  println("GFLOPS: $(2e-9M*K*N/t)")
  @assert C ≈ Ccopy
  if length(C) > 32^2
    println("LoopVectorization.jl Multiple-Threads:")
    t = @belapsed mul_threads!($C, $A, $B)
    println("GFLOPS: $(2e-9M*K*N/t)")
    @assert C ≈ Ccopy
  end
  println()
end
for (C,A,B) ∈ ((C7x7,A7x7,B7x7),(C8x8,A8x8,B8x8)), At ∈ (false,true), Bt ∈ (false,true), static_size ∈ (false,true)
  M,K = size(A); N = size(B,2)
  T = static_size ? (M == 7 ? Matrix7x7 : Matrix8x8) : identity
  Atemp = T(A); Btemp = T(B)
  Atemp = At ? Atemp' : Atemp
  Btemp = Bt ? Btemp' : Btemp
  run_benchmark(T(C), Atemp, Btemp)
end

for M ∈ (71, 72, 144, 216)
  K = N = M
  A = rand(M,K); B = rand(K,N);
  C = Matrix{Float64}(undef, M, N);

  run_benchmark(C, A, B)
  run_benchmark(C, A, B')
  run_benchmark(C, A', B)
  run_benchmark(C, A', B')
end
Results
Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 4.290648913738776
gfortran Loops:
GFLOPS: 3.8254906148406542
LoopVectorization.jl Single-Thread:
GFLOPS: 25.796296296296298

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 26.027454718779794
gfortran Loops:
GFLOPS: 15.402794157670328
LoopVectorization.jl Single-Thread:
GFLOPS: 65.19349315068494

Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.376153217330375
gfortran Loops:
GFLOPS: 3.8443792201472116
LoopVectorization.jl Single-Thread:
GFLOPS: 34.47618088427838

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 26.010593704748118
gfortran Loops:
GFLOPS: 13.077782718818277
LoopVectorization.jl Single-Thread:
GFLOPS: 65.18729192428422

Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 4.0271049910194305
gfortran Loops:
GFLOPS: 3.9683312655086853
LoopVectorization.jl Single-Thread:
GFLOPS: 10.530866711571298

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 18.178010471204193
gfortran Loops:
GFLOPS: 10.238181624159642
LoopVectorization.jl Single-Thread:
GFLOPS: 28.06786345150557

Running runtime-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 4.315322113055289
gfortran Loops:
GFLOPS: 2.8338118977575006
LoopVectorization.jl Single-Thread:
GFLOPS: 24.073995697104365

Running compile-sized 7×7 = 7×7 * 7×7 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 16.877507447864946
gfortran Loops:
GFLOPS: 13.098295894655307
LoopVectorization.jl Single-Thread:
GFLOPS: 42.206275815301154

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 4.654805292161444
gfortran Loops:
GFLOPS: 7.049481636524185
LoopVectorization.jl Single-Thread:
GFLOPS: 34.24933544197315

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 75.37630919014605
gfortran Loops:
GFLOPS: 13.412733778948079
LoopVectorization.jl Single-Thread:
GFLOPS: 81.84462757020562

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.4824207638528932
gfortran Loops:
GFLOPS: 6.503686130838877
LoopVectorization.jl Single-Thread:
GFLOPS: 49.12558945241074

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 75.28191528545119
gfortran Loops:
GFLOPS: 47.59792997342533
LoopVectorization.jl Single-Thread:
GFLOPS: 81.79227632525786

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 8.2165948853584
gfortran Loops:
GFLOPS: 5.622450042883365
LoopVectorization.jl Single-Thread:
GFLOPS: 13.54091119775336

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 56.06721950683728
gfortran Loops:
GFLOPS: 47.56793060025185
LoopVectorization.jl Single-Thread:
GFLOPS: 49.458773374672994

Running runtime-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 4.89827710282352
gfortran Loops:
GFLOPS: 5.558826599853847
LoopVectorization.jl Single-Thread:
GFLOPS: 31.88760662318113

Running compile-sized 8×8 = 8×8 * 8×8 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 56.06721950683728
gfortran Loops:
GFLOPS: 11.955531821933105
LoopVectorization.jl Single-Thread:
GFLOPS: 48.73397298200392

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 5.865806789966648
gfortran Loops:
GFLOPS: 22.828868478122207
LoopVectorization.jl Single-Thread:
GFLOPS: 109.64066903565738
LoopVectorization.jl Multiple-Threads:
GFLOPS: 326.3791478798318

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.2668930328504797
gfortran Loops:
GFLOPS: 6.8133941234140165
LoopVectorization.jl Single-Thread:
GFLOPS: 107.9801484342002
LoopVectorization.jl Multiple-Threads:
GFLOPS: 317.0627491510409

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 14.328762735952917
gfortran Loops:
GFLOPS: 14.112931527375249
LoopVectorization.jl Single-Thread:
GFLOPS: 52.568260262906655
LoopVectorization.jl Multiple-Threads:
GFLOPS: 191.04598139393016

Running runtime-sized 71×71 = 71×71 * 71×71 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 5.875484273425701
gfortran Loops:
GFLOPS: 20.93720202404282
LoopVectorization.jl Single-Thread:
GFLOPS: 96.3000033632664
LoopVectorization.jl Multiple-Threads:
GFLOPS: 153.8464734749436

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 5.997782455689288
gfortran Loops:
GFLOPS: 26.56380328802221
LoopVectorization.jl Single-Thread:
GFLOPS: 123.29809725158565
LoopVectorization.jl Multiple-Threads:
GFLOPS: 385.96556537924624

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 3.3204458717718337
gfortran Loops:
GFLOPS: 7.33131021478448
LoopVectorization.jl Single-Thread:
GFLOPS: 123.28995177379932
LoopVectorization.jl Multiple-Threads:
GFLOPS: 388.96206752813674

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 25.192224622030242
gfortran Loops:
GFLOPS: 24.992333188255387
LoopVectorization.jl Single-Thread:
GFLOPS: 68.62437948152235
LoopVectorization.jl Multiple-Threads:
GFLOPS: 200.95457298606908

Running runtime-sized 72×72 = 72×72 * 72×72 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 5.9635236505108775
gfortran Loops:
GFLOPS: 24.258148376823844
LoopVectorization.jl Single-Thread:
GFLOPS: 110.31417171567905
LoopVectorization.jl Multiple-Threads:
GFLOPS: 146.8943299693588

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 5.601758954232673
gfortran Loops:
GFLOPS: 39.86547665934595
LoopVectorization.jl Single-Thread:
GFLOPS: 126.24657534246577
LoopVectorization.jl Multiple-Threads:
GFLOPS: 1125.972032806461

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 2.578293861451797
gfortran Loops:
GFLOPS: 8.234176150543389
LoopVectorization.jl Single-Thread:
GFLOPS: 108.0644916128332
LoopVectorization.jl Multiple-Threads:
GFLOPS: 975.3018029788348

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 25.676496764623693
gfortran Loops:
GFLOPS: 25.937223937771183
LoopVectorization.jl Single-Thread:
GFLOPS: 89.84456145629608
LoopVectorization.jl Multiple-Threads:
GFLOPS: 867.8420088935393

Running runtime-sized 144×144 = 144×144 * 144×144 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 5.639534367631242
gfortran Loops:
GFLOPS: 37.20829153712439
LoopVectorization.jl Single-Thread:
GFLOPS: 114.82566478878657
LoopVectorization.jl Multiple-Threads:
GFLOPS: 661.0546823112686

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A * B
gfortran Builtin:
GFLOPS: 4.885176232546832
gfortran Loops:
GFLOPS: 46.77228108770416
LoopVectorization.jl Single-Thread:
GFLOPS: 123.50268998394588
LoopVectorization.jl Multiple-Threads:
GFLOPS: 1846.7465640461794

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A * B'
gfortran Builtin:
GFLOPS: 2.501163629630972
gfortran Loops:
GFLOPS: 8.238294838975227
LoopVectorization.jl Single-Thread:
GFLOPS: 98.05113835376532
LoopVectorization.jl Multiple-Threads:
GFLOPS: 1477.1265665078784

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A' * B
gfortran Builtin:
GFLOPS: 25.663006899783166
gfortran Loops:
GFLOPS: 25.26621713606637
LoopVectorization.jl Single-Thread:
GFLOPS: 96.12499105775973
LoopVectorization.jl Multiple-Threads:
GFLOPS: 1474.9646542261253

Running runtime-sized 216×216 = 216×216 * 216×216 benchmarks for C = A' * B'
gfortran Builtin:
GFLOPS: 4.846530403782313
gfortran Loops:
GFLOPS: 52.11586018586035
LoopVectorization.jl Single-Thread:
GFLOPS: 105.0252305768329
LoopVectorization.jl Multiple-Threads:
GFLOPS: 1462.9739420773753

versioninfo:

julia> versioninfo()
Julia Version 1.7.0-DEV.1052
Commit cc59947c03* (2021-05-02 02:36 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

The benchmarks are a bunch of matmuls via the Fortran intrinsic as well as loops.
The Julia code contains the compilation command. I’m open to suggestions for better loops and compiler flags.
The benchmark covers runtime- and compile-time-sized 7x7 and 8x8 matrices for all permutations of transposing A and B in C = A*B, and then runtime-sized versions for 71x71, 72x72, 144x144, and 216x216.
The larger sizes also include multithreaded Julia matrix multiply (implemented as @avxt on 3 loops). Note that Julia must be started with multiple threads to support multithreading.
Also, @avx(t) is currently limited to simple rectangular loop nests; i.e. triangular loops or multiple loops at the same level in a nest don’t work at the moment, and neither does complex control flow (you can remove @avx if you want to benchmark that).

Also note that there are only two versions of the matrix multiply function in Julia: one single-threaded, another multithreaded. The same function is used for all transpose permutations, without us having to write and optimize a different version for each.

EDIT: Fixed names following @themos’ suggestion.

8 Likes

I rather object to you labelling the results from one compiler as the “Fortran” results. I also suspect that your work in producing fast loops would have produced similar results had you worked on a Fortran compiler that you had access to the internals of. Is there any reason that it wouldn’t?

It may well turn out to be the case that Fortran compilers have dropped the ball when it comes to delivering blistering performance in the cases that matter (thousands of communicating processes solving engineering and science problems) but it will take significantly more work to demonstrate that.

2 Likes

Unlike Julia, the Fortran compiler development efforts are not centralized. From what I have learned over the past years, some are very good at debugging (like NAG, GFortran) and some others are historically known for better optimizations (like Intel, Cray, IBM, PGI, …). But these could change at any time. Having and maintaining multiple high-quality compilers takes a lot of effort. For benchmarking purposes, it makes sense to compile Fortran code with all available compilers and report the best result. I think this issue highlights the importance of such activities as Google Summer of Code 2021.

3 Likes

Julia correctly sums 10^8 32-bit uniform deviates, with the program

n = 10^8
x = Float32.(rand(n))
println(sum(x)/n)

giving output such as 0.49995995 but gfortran for

implicit none
integer, parameter :: n = 10**8
real :: x(n)
call random_number(x)
print*,sum(x)/n
end

gives 0.167772159, and g95 gives 0.16777216, but Intel Fortran gives 0.4999704. (The naive single-precision running sum stalls once it reaches 2^24 = 16777216, because adding a value near 0.5 to a float of that magnitude no longer changes it; hence 16777216/10^8 ≈ 0.168.)
Compensated summation, which gets this correct, may eventually be in stdlib. Ideally a compiler would have an option to compute SUM carefully, and the user could verify that toggling the option does not influence the program results.
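
For illustration, a minimal sketch of compensated (Kahan) summation in Fortran; the function name is just a placeholder, and fast-math options must be left off so the compensation step is not optimized away:

pure function kahan_sum(x) result(s)
   real, intent(in) :: x(:)
   real :: s, c, y, t
   integer :: i
   s = 0.0
   c = 0.0                    ! running compensation for lost low-order bits
   do i = 1, size(x)
      y = x(i) - c
      t = s + y
      c = (t - s) - y         ! recovers the part of y that was rounded away
      s = t
   end do
end function kahan_sum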

By default, Julia uses pairwise summation, which explains this. On modern architectures, sum will usually be memory-bound, so the extra (very minor) cost of summing out of order ends up not mattering.
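
For comparison, a minimal sketch of pairwise (recursive, blocked) summation in Fortran, which is roughly what Julia’s sum does; the block size of 128 is an arbitrary placeholder:

recursive function pairwise_sum(x) result(s)
   real, intent(in) :: x(:)
   real :: s
   integer :: n, mid
   n = size(x)
   if (n <= 128) then
      s = sum(x)                                   ! small blocks: plain summation
   else
      mid = n / 2
      s = pairwise_sum(x(:mid)) + pairwise_sum(x(mid+1:))
   end if
end function pairwise_sum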

1 Like