Keep linear algebra in fortran

rcs · May 14, 2021, 3:39pm

Looking at some linear algebra project I realized that many of them are written in C and in an parts in assembly (blis, openblas, gotoblas).
To keep linear algebra in fortran one would have to able to write code with a similar performance in fortran!?

Here is the task: the code below implements a home brew gem routine with cache-blocking. According to Intel Advisor the loops reach 84 to 89 % efficiency. But compared to MKL it needs about 1.8x as much time (single core and threaded).

The compiler flags are “-O3 -march=coffeelake”, the compiler is ifort 19.1.3, the mkl is from parallel studio 2020.4.304, using the 64 bit integer interface. The os is linux, kernel version 5. The current benchmark for single threaded is 3.3 seconds for mkl and 5.6 seconds for homebrew. Anybody out there who can close the gap and therefore prove that it is worth staying with fortran for linear algebra?

program test
 use blas95
 use, intrinsic :: iso_fortran_env, only: int64, real64
 real(real64), allocatable :: r21(:,:),r22(:,:),r24(:,:),r23(:,:)
 real(real64) :: t0,t1
 integer(int64) :: nr=7000,nc=4000
 allocate(r21(nr,nc),r22(nc,nc),r23(nr,nc),r24(nr,nc))
 r21(:,:)=1._real64;r22(:,:)=2._real64;r23(:,:)=0._real64;r24(:,:)=0._real64
 call cpu_time(t0)
 call gemm(a=r21,b=r22,c=r23)
 call cpu_time(t1)
 write(*,*) t1-t0
 call cpu_time(t0)
 call homebrew(a=r21,b=r22,c=r24)
 call cpu_time(t1)
 write(*,*) t1-t0
 if(maxval(abs(r23-r24))>10e-10) then
   write(*,*) maxval(abs(r23-r24))
   write(*,*) "error";call exit(1)
 end if
contains
  subroutine homebrew(a,b,c)
    real(real64), intent(in) :: a(:,:),b(:,:)
    real(real64), intent(inout) :: c(:,:)
    integer(int64) :: a1,a2,a3,a4,a5,a6,n=64
    do a1=1,size(b,2),n
      do a2=1,size(b,1),n
        do a3=1,size(a,1),n
          do a4=a1,min(a1+n-1,size(b,2)),1
            do a5=a2,min(a2+n-1,size(b,1)),1
              do a6=a3,min(a3+n-1,size(a,1)),1
                c(a6,a4)=c(a6,a4)+a(a6,a5)*b(a5,a4)
              end do
            end do
          end do
        end do
      end do
    end do
  end subroutine homebrew
end program test

certik · May 14, 2021, 4:16pm

This is matrix matrix multiplication? That is bound by the multiplication cost. For large enough matrices, you can get 100% of the theoretical performance peak in Fortran, I have done that about 5 years ago and measured it. I believe the same speed as OpenBLAS. For small matrices OpenBLAS is faster, because then you have to hide the latency of memory read/write and it gets complicated. However, if you have many small matrices to multiply, you can hide this cost. I have done that for matrix-vector multiply, if you have many vectors to multiply with the same matrix, you can vectorize efficiently and get very close to the theoretical peak performance, in Fortran. But if you have just one matrix and one vector to multiply, it is a very complicated assembly code that you have to write to carefully balance latency of reads and multiplies/additions, you can look into OpenBLAS, that’s not easy. You can’t do it from Fortran or C, unfortunately.

Can you post a C code that is faster? It’s the same issue there, I don’t think there is any advantage there. Typically one has to go into assembly to hide the latency cost if that is the issue. I have done that, I can show how that is done if there is interest.

themos · May 14, 2021, 4:19pm

The GEMM kernel (from Level 3 BLAS) for one vendor library I was involved with ended up a Megabyte of assembly source. It is not anything that anyone has been writing in a high level language for a long time. You either write it in assembly or use a specialised code generator (same with FFTs). Linear algebra is all the stuff you do with BLAS kernels.

rcs · May 14, 2021, 4:23pm

I can’t… I have only mild knowledge about C.

rcs · May 14, 2021, 4:25pm

Certainly understood … but if everything which is efficient must be written in assembly and is then called from python … where is fortran

rcs · May 14, 2021, 4:27pm

It is!

Beliavsky · May 14, 2021, 4:40pm

Would using DO CONCURRENT with the -qopenmp option for Intel Fortran help? Here is the homebrew subroutine with DO CONCURRENT.

  subroutine homebrew(a,b,c)
    use, intrinsic :: iso_fortran_env, only: int64, real64
    real(real64), intent(in) :: a(:,:),b(:,:)
    real(real64), intent(inout) :: c(:,:)
    integer(int64) :: a1,a2,a3,a4,a5,a6,n=64
    do concurrent (a1=1:size(b,2):n)
      do concurrent (a2=1:size(b,1):n)
        do concurrent (a3=1:size(a,1):n)
          do concurrent(a4=a1:min(a1+n-1,size(b,2)):1)
            do concurrent (a5=a2:min(a2+n-1,size(b,1)):1)
              do concurrent (a6=a3:min(a3+n-1,size(a,1)):1)
                c(a6,a4)=c(a6,a4)+a(a6,a5)*b(a5,a4)
              end do
            end do
          end do
        end do
      end do
    end do
  end subroutine homebrew

rcs · May 14, 2021, 4:40pm

Have you tried it?

themos · May 14, 2021, 4:48pm

No, not everything. But Fortran codes can spend 80% or more of their time doing DGEMMs and that’s where you use a specialised code generator to hit the machine peak. It doesn’t affect the choice of language for the rest of your code. You could use python if you don’t mind the other 20% taking 10-20x as long and becoming the actual bottleneck.

Beliavsky · May 14, 2021, 4:52pm

No. I use Windows and have not used MKL. How should I compile a program with use blas95?

rcs · May 14, 2021, 4:55pm

Ok. Maybe it’s worth trying mkl. I think it is free, even for windows.

certik · May 14, 2021, 5:37pm

Yes, as @themos said.

This is a bit orthogonal issue, but something that has been bothering me a lot: the fact that you have to write, say, OpenBLAS in assembly in order to get closer to the theoretical peak performance seems not optimal. It would be great if (say) Fortran compilers could do a better job. We have done some benchmarking with Loopy and on at least some of the benchmarks it was able to get close to the peak performance. So integrating the algorithms from there in a Fortran compiler might get us pretty far.

rcs · May 14, 2021, 5:42pm

From what I have seen in the assembly code it is tailored to chip architecture (skylake, haswell etc). I hoped that "-march " would do the trick … currently it doesn’t look like that.

If compiler flags like that would bring fortran to assembly performance … would be a huge selling argument.

certik · May 14, 2021, 5:50pm

Different architectures have different SIMD vector lengths and slightly different latencies and throughputs of vector instructions, but that does not seem to be the main issue.

The way Loopy works is that you tell it what transformations you want to get done on the kernel. Here is a list you can choose from to give you an idea: Installation - loopy 2022.1 documentation and knowing which ones to use and in what order is the key. I don’t know if it can be just a simple compiler option, at first I was thinking more like “pragmas” where you tell the compiler what you want to get done with your loop.

VladimirF · May 14, 2021, 6:16pm

LAPACK is Fortran. But not the actual high-performance BLAS implementations. Those are either manually or automatically tuned assembly/C. Only the reference BLAS is Fortran, as a fallback.

Mason · May 14, 2021, 8:04pm

For what it’s worth, Octavian.jl is beating everything I can put it against on my machine and it’s written in just julia, using LoopVectorization.jl and CheapThreads.jl:

It needs more work before it’s fully BLAS compliant and such, but it’s a very cool project coming along nicely.

I think this is a nice existence proof that hand-written assembly is not required for peak linear algebra performance. Rather, having powerful, flexible platforms for metaprogramming and for plugging into the compiler can let high level languages generate all the necessary code.

There’s a good chance that if this work progresses far enough, we’ll be able to drop our dependency on OpenBLAS.

themos · May 15, 2021, 8:32am

I look forward to the day when gfortran, say, recognizes a simple dgemm loop and calls out to a code generator like Julia’s to actually produce the object code.

Beliavsky · May 20, 2021, 3:08am

The freely available 2019 paper PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP by Dongarra et al. says in the Performance Evaluation section that PLASMA, written in C, is much faster than Lapack. Are there possible improvements on the Fortran side that would reverse this conclusion? PLASMA is Fortran-callable

PLASMA is written in C, with interfaces for Fortran provided via automatic code generation during compilation of the library. In particular, PLASMA is shipped with a Python script, which parses the C header files, and generates a Fortran module with the interface. The bindings are based on features provided by the Fortran 2003 standard, most importantly the iso_c_binding intrinsic module. PLASMA includes several Fortran examples of using the top-level as well as the second-level functions.

so the saying “If you can’t beat them, join them” may apply.

Topic		Replies	Views
C++ Standard Library dense linear algebra interface	21	2329	August 11, 2023
Does LAPACK/BLAS automatically use multi cores or threads?	35	6259	August 3, 2022
Unbeatable micro benchmark Help	11	929	August 13, 2020
Mapping matrix & vector arithmetic to BLAS calls	8	1352	July 20, 2022
Fortran has the fastest matmul Advocacy	2	571	January 12, 2024

Keep linear algebra in fortran

Related topics