Keep linear algebra in fortran

Looking at some linear algebra project I realized that many of them are written in C and in an parts in assembly (blis, openblas, gotoblas).
To keep linear algebra in fortran one would have to able to write code with a similar performance in fortran!?

Here is the task: the code below implements a home brew gem routine with cache-blocking. According to Intel Advisor the loops reach 84 to 89 % efficiency. But compared to MKL it needs about 1.8x as much time (single core and threaded).

The compiler flags are “-O3 -march=coffeelake”, the compiler is ifort 19.1.3, the mkl is from parallel studio 2020.4.304, using the 64 bit integer interface. The os is linux, kernel version 5. The current benchmark for single threaded is 3.3 seconds for mkl and 5.6 seconds for homebrew. Anybody out there who can close the gap and therefore prove that it is worth staying with fortran for linear algebra?

program test
 use blas95
 use, intrinsic :: iso_fortran_env, only: int64, real64
 real(real64), allocatable :: r21(:,:),r22(:,:),r24(:,:),r23(:,:)
 real(real64) :: t0,t1
 integer(int64) :: nr=7000,nc=4000
 allocate(r21(nr,nc),r22(nc,nc),r23(nr,nc),r24(nr,nc))
 r21(:,:)=1._real64;r22(:,:)=2._real64;r23(:,:)=0._real64;r24(:,:)=0._real64
 call cpu_time(t0)
 call gemm(a=r21,b=r22,c=r23)
 call cpu_time(t1)
 write(*,*) t1-t0
 call cpu_time(t0)
 call homebrew(a=r21,b=r22,c=r24)
 call cpu_time(t1)
 write(*,*) t1-t0
 if(maxval(abs(r23-r24))>10e-10) then
   write(*,*) maxval(abs(r23-r24))
   write(*,*) "error";call exit(1)
 end if
contains
  subroutine homebrew(a,b,c)
    real(real64), intent(in) :: a(:,:),b(:,:)
    real(real64), intent(inout) :: c(:,:)
    integer(int64) :: a1,a2,a3,a4,a5,a6,n=64
    do a1=1,size(b,2),n
      do a2=1,size(b,1),n
        do a3=1,size(a,1),n
          do a4=a1,min(a1+n-1,size(b,2)),1
            do a5=a2,min(a2+n-1,size(b,1)),1
              do a6=a3,min(a3+n-1,size(a,1)),1
                c(a6,a4)=c(a6,a4)+a(a6,a5)*b(a5,a4)
              end do
            end do
          end do
        end do
      end do
    end do
  end subroutine homebrew
end program test

This is matrix matrix multiplication? That is bound by the multiplication cost. For large enough matrices, you can get 100% of the theoretical performance peak in Fortran, I have done that about 5 years ago and measured it. I believe the same speed as OpenBLAS. For small matrices OpenBLAS is faster, because then you have to hide the latency of memory read/write and it gets complicated. However, if you have many small matrices to multiply, you can hide this cost. I have done that for matrix-vector multiply, if you have many vectors to multiply with the same matrix, you can vectorize efficiently and get very close to the theoretical peak performance, in Fortran. But if you have just one matrix and one vector to multiply, it is a very complicated assembly code that you have to write to carefully balance latency of reads and multiplies/additions, you can look into OpenBLAS, that’s not easy. You can’t do it from Fortran or C, unfortunately.

Can you post a C code that is faster? It’s the same issue there, I don’t think there is any advantage there. Typically one has to go into assembly to hide the latency cost if that is the issue. I have done that, I can show how that is done if there is interest.

The GEMM kernel (from Level 3 BLAS) for one vendor library I was involved with ended up a Megabyte of assembly source. It is not anything that anyone has been writing in a high level language for a long time. You either write it in assembly or use a specialised code generator (same with FFTs). Linear algebra is all the stuff you do with BLAS kernels.

2 Likes

I can’t… I have only mild knowledge about C.

Certainly understood … but if everything which is efficient must be written in assembly and is then called from python … where is fortran :astonished:

It is!

Would using DO CONCURRENT with the -qopenmp option for Intel Fortran help? Here is the homebrew subroutine with DO CONCURRENT.

  subroutine homebrew(a,b,c)
    use, intrinsic :: iso_fortran_env, only: int64, real64
    real(real64), intent(in) :: a(:,:),b(:,:)
    real(real64), intent(inout) :: c(:,:)
    integer(int64) :: a1,a2,a3,a4,a5,a6,n=64
    do concurrent (a1=1:size(b,2):n)
      do concurrent (a2=1:size(b,1):n)
        do concurrent (a3=1:size(a,1):n)
          do concurrent(a4=a1:min(a1+n-1,size(b,2)):1)
            do concurrent (a5=a2:min(a2+n-1,size(b,1)):1)
              do concurrent (a6=a3:min(a3+n-1,size(a,1)):1)
                c(a6,a4)=c(a6,a4)+a(a6,a5)*b(a5,a4)
              end do
            end do
          end do
        end do
      end do
    end do
  end subroutine homebrew

Have you tried it?

No, not everything. But Fortran codes can spend 80% or more of their time doing DGEMMs and that’s where you use a specialised code generator to hit the machine peak. It doesn’t affect the choice of language for the rest of your code. You could use python if you don’t mind the other 20% taking 10-20x as long and becoming the actual bottleneck.

1 Like

No. I use Windows and have not used MKL. How should I compile a program with use blas95?

Ok. Maybe it’s worth trying mkl. I think it is free, even for windows.

Yes, as @themos said.

This is a bit orthogonal issue, but something that has been bothering me a lot: the fact that you have to write, say, OpenBLAS in assembly in order to get closer to the theoretical peak performance seems not optimal. It would be great if (say) Fortran compilers could do a better job. We have done some benchmarking with Loopy and on at least some of the benchmarks it was able to get close to the peak performance. So integrating the algorithms from there in a Fortran compiler might get us pretty far.

From what I have seen in the assembly code it is tailored to chip architecture (skylake, haswell etc). I hoped that "-march " would do the trick … currently it doesn’t look like that.

If compiler flags like that would bring fortran to assembly performance … would be a huge selling argument.

Different architectures have different SIMD vector lengths and slightly different latencies and throughputs of vector instructions, but that does not seem to be the main issue.

The way Loopy works is that you tell it what transformations you want to get done on the kernel. Here is a list you can choose from to give you an idea: Installation - loopy 2022.1 documentation and knowing which ones to use and in what order is the key. I don’t know if it can be just a simple compiler option, at first I was thinking more like “pragmas” where you tell the compiler what you want to get done with your loop.

LAPACK is Fortran. But not the actual high-performance BLAS implementations. Those are either manually or automatically tuned assembly/C. Only the reference BLAS is Fortran, as a fallback.

For what it’s worth, Octavian.jl is beating everything I can put it against on my machine and it’s written in just julia, using LoopVectorization.jl and CheapThreads.jl:

It needs more work before it’s fully BLAS compliant and such, but it’s a very cool project coming along nicely.

I think this is a nice existence proof that hand-written assembly is not required for peak linear algebra performance. Rather, having powerful, flexible platforms for metaprogramming and for plugging into the compiler can let high level languages generate all the necessary code.

There’s a good chance that if this work progresses far enough, we’ll be able to drop our dependency on OpenBLAS.

2 Likes

I look forward to the day when gfortran, say, recognizes a simple dgemm loop and calls out to a code generator like Julia’s to actually produce the object code.

1 Like

The freely available 2019 paper PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP by Dongarra et al. says in the Performance Evaluation section that PLASMA, written in C, is much faster than Lapack. Are there possible improvements on the Fortran side that would reverse this conclusion? PLASMA is Fortran-callable

PLASMA is written in C, with interfaces for Fortran provided via automatic code generation during compilation of the library. In particular, PLASMA is shipped with a Python script, which parses the C header files, and generates a Fortran module with the interface. The bindings are based on features provided by the Fortran 2003 standard, most importantly the iso_c_binding intrinsic module. PLASMA includes several Fortran examples of using the top-level as well as the second-level functions.

so the saying “If you can’t beat them, join them” may apply.

1 Like