Gauss Algorithm written in Fortran slower than written in Python for large matrices?

I think the comments by @JohnCampbell and @RonShepard point out the root cause of the issue. The implemented algorithm performs full-row operations in matrix a when swapping rows (a[[i, row_ind]] = a[[row_ind, i]] in Python and the equivalent of a(k, :) = a(i, :) in Fortran). Since Fortran uses column major memory layout (columns are contiguous in memory), this becomes an inefficient operation. However, by default, numpy arrays are row major (order='C', so that rows are contiguous in memory). As n becomes large, the Fortran becomes very cache inefficient, while the Python routine does not suffer from the same issue.

It is, however, very easy to change the memory layout of a numpy array, and you can do this to illustrate the problem, i.e. how fast is the Python routine, if we change the memory layout of a to Fortran order?:

def f_order():
  a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)], order = 'F')
  b = np.array([1 / (i + 1) for i in range(1, n + 1)])
  xf = gauss(a, b)
  return a, b, xf

def c_order():
  a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)])
  b = np.array([1 / (i + 1) for i in range(1, n + 1)])
  xc = gauss(a, b)
  return a, b, xc

print("order='F': ", timeit.timeit(f_order, number=3) / 3)
print("order='C': ", timeit.timeit(c_order, number=3) / 3)

With n=1024 I get:

order='F':  9.011245
order='C':  4.227592

and with n=2048:

order='F':  84.46075615
order='C':  25.5663385

The algorithm you implemented is exactly the same in Python and Fortran, and the algorithm performs full-row operations. If memory layout is row-major, the algorithm is efficient (Python with default or order='C'), however, if the memory layout is column-major, the algorithm is inefficient (Python with order='F', or the default in Fortran).

4 Likes

For row-intensive operations, like Gauss elimination etc., one could possibly prepare a hybrid matrix, constructed as an array of pointers (embedded in a derived type, as Fortran, lamentably, does not provide regular array of pointers) to the matrix rows. That way, the row data are contiguous and all operations are done on rank-1 arrays.
Due to the aforementioned deficiency, the syntax is a bit clumsy, but it works, one can swap whole or partial rows as seen in the following super simple sample.

program main
  implicit none
  type rows
    real, pointer :: row(:)
  end type rows
  type(rows), allocatable :: array(:)
  integer :: n=8, i, j
  real, allocatable :: swap(:)
  real :: x,y

  allocate(array(n),swap(n))
  do i=1,n
    allocate(array(i)%row(n))
    do j=1,n
      array(i)%row(j) = 100.0*i+j
    end do
  end do
  print 10,array(1)%row  !  101.0 102.0 103.0 104.0 105.0 106.0 107.0 108.0
  print 10,array(5)%row  !  501.0 502.0 503.0 504.0 505.0 506.0 507.0 508.0
  swap = array(1)%row
  array(1)%row = array(5)%row
  array(5)%row = swap
  print 10,array(1)%row  !  501.0 502.0 503.0 504.0 505.0 506.0 507.0 508.0
  print 10,array(5)%row  !  101.0 102.0 103.0 104.0 105.0 106.0 107.0 108.0
  swap(1:n/2) = array(1)%row(1:n/2)
  array(1)%row(1:n/2) = array(5)%row(1:n/2)
  array(5)%row(1:n/2) = swap(1:n/2)
  print 10,array(1)%row  !  101.0 102.0 103.0 104.0 505.0 506.0 507.0 508.0
  print 10,array(5)%row  !  501.0 502.0 503.0 504.0 105.0 106.0 107.0 108.0
10 format(*(f6.1))
end program main

BTW, anybody knowing why actually array of pointers cannot exist in Modern Fortran?

1 Like

I think this might be because it not feasible with the current syntax:
If you define a pointer, the dimension attribute defines the shape of the array it can point to. Therefore the dimension attribute is already occupied. Maybe it would be possible to solve it with:

real, pointer(:) :: ptr(3, 3) ! 3x3 array with pointers to 1D arrays
print*, ptr(1, 1)(:) ! print first 1D array

But this would break old code.
And what would happen, if you type this:

print*, ptr(2, :)(1)

?

Another reason could be, that this syntax might foster slow code constructs.

1 Like

I think the main reason for the poor performance is what @jbdv said. Changing the memory access to be column major in the Fortran implementation yields a notable difference

 mat size:          500 timigs (col-major, row-major)   6.0472999999999999E-002   8.3461999999999995E-002
 mat size:         1000 timigs (col-major, row-major)  0.17684999999999998        2.1223450000000001     
 mat size:         1500 timigs (col-major, row-major)  0.77685899999999997        11.889548000000000     
 mat size:         2000 timigs (col-major, row-major)   2.0247399999999995        31.665683999999999     
 mat size:         2500 timigs (col-major, row-major)   4.1311760000000035        70.841338999999991     
 mat size:         3000 timigs (col-major, row-major)   7.3490720000000067        121.72819000000001     
 mat size:         3500 timigs (col-major, row-major)   12.796781000000010        210.18912300000000     
 mat size:         4000 timigs (col-major, row-major)   17.554618000000005        337.45606700000002  

python_vs_fortran_gauss

Source code
module gauss_m

   private
   public :: gauss, gauss_rm
   integer, parameter :: dp = kind(1.0d0)


contains

   subroutine gauss(a, b, x)
      real(dp), intent(inout)  :: a(:, :), b(:), x(:)

      integer :: i, k, n
      real(dp), allocatable :: tmp_a(:)
      real(dp)              :: tmp_b

      n = size(a, 1)

      do i=1, n
         ! Check if has to swap
         if (a(i, i) == 0.0) then
            do k=i+1, n
               ! Swap values and exit loop
               if (a(i, k) /= 0.0) then
                  tmp_a = a(:, k);   tmp_b = b(k)
                  a(:, k) = a(:, i); b(k) = b(i)
                  a(:, i) = tmp_a;   b(i) = tmp_b
                  exit
               end if
            end do
         end if
         ! Eliminate items in row below
         do k=i+1, n
            b(k) = b(k) - (b(i) / a(i, i) * a(k, i))
            a(:, k) = a(:, k) - a(:, i) / a(i, i) * a(i, k)
         end do
      end do
      ! Solve
      x = 0.0
      do i=n, 1, -1
         b(i) = b(i) - dot_product(a(:, i), x)
         x(i) = b(i) / a(i, i)
      end do

   end subroutine gauss

   subroutine gauss_rm(a, b, x)
      !! row major implementation, do not use for actual solves
      real(dp), intent(inout)  :: a(:, :), b(:), x(:)

      integer :: i, k, n
      real(dp), allocatable :: tmp_a(:)
      real(dp)              :: tmp_b

      n = size(a, 1)

      do i=1, n
         ! Check if has to swap
         if (a(i, i) == 0.0) then
            do k=i+1, n
               ! Swap values and exit loop
               if (a(k, i) /= 0.0) then
                  tmp_a = a(k, :);   tmp_b = b(k)
                  a(k, :) = a(i, :); b(k) = b(i)
                  a(i, :) = tmp_a;   b(i) = tmp_b
                  exit
               end if
            end do
         end if
         ! Eliminate items in row below
         do k=i+1, n
            b(k) = b(k) - (b(i) / a(i, i) * a(k, i))
            a(k, :) = a(k, :) - a(i, :) / a(i, i) * a(k, i)
         end do
      end do
      ! Solve
      x = 0.0
      do i=n, 1, -1
         b(i) = b(i) - dot_product(a(i, :), x)
         x(i) = b(i) / a(i, i)
      end do

   end subroutine gauss_rm

end module gauss_m


program main

   use gauss_m

   implicit none

   integer, parameter :: dp = kind(1.0d0)
   real(dp), allocatable :: a(:,:), b(:), x(:), a_rm(:,:), b_rm(:), x_rm(:)
   real(dp), allocatable :: timings(:,:)
   integer :: n, n_min, n_max, n_int

   n_min = 500; n_max = 5000; n_int = 500;
   
   allocate(timings(((n_max - n_min)/ n_int) + 1, 2))

   do n = n_min, n_max, n_int

      allocate(a(n, n), b(n), x(n))
      call fill(a, b)
      allocate(a_rm, source=a); allocate(b_rm, source=b); allocate(x_rm, source=x)

      timings(n/n_int, 1) = timeit(a, b, x, gauss)
      timings(n/n_int, 2) = timeit(a_rm, b_rm, x_rm, gauss_rm)
      print*, "mat size:", n, "timigs (col-major, row-major):", timings(n/n_int, :), "rel L2 norm: ", norm2(x_rm- x)/ norm2(x)

      deallocate(a, b, x, a_rm, b_rm, x_rm)
   end do

   print*, "          TIMINGS         "
   print*, repeat("-", 80)
   print*, "column major: ", timings(:, 1)
   print*, "row    major: ", timings(:, 2)
   print*, repeat("-", 80)


contains

   subroutine fill(a, b)
      real(dp), intent(out) :: a(:,:), b(:)
      integer :: i, j

      do j = 1, size(a,2)
         do i = 1, size(a,1)
            a(i,j) = 1.0_dp/real(i + j - 1,dp)
         end do
      end do

      do i = 1, size(b)
         b(i) = 1.0_dp/real(i + 1,dp)
      end do

   end subroutine

   function timeit(a, b, x, sub) result(time)
      real(dp), intent(inout) :: a(:,:), b(:), x(:)
      interface
         subroutine sub(a, b, x)
            integer, parameter :: dp = kind(1.0d0)
            real(dp), intent(inout) :: a(:,:), b(:), x(:)
         end subroutine sub
      end interface

      real(dp) :: time
      real(dp) :: starttime

      call cpu_time(starttime)
      call sub(a, b, x)
      call cpu_time(time)
      time = time - starttime
   end function timeit

end program main

PS thanks @ivanpribec for the fill routine I was too lazy to write one

4 Likes

A simple alternative is to use/supply the At transpose matrix.
As a simple test, the gauss could be programmed for At, and also ensure that only At(i+1:,k) is used for each step.
Based on the code provided:

       do k=i+1, n
          b(k) = b(k) - (b(i) / a(i, i) * a(k, i))
          a(k, :) = a(k, :) - a(i, :) / a(i, i) * a(k, i)
       end do

This could become of the form

       do k=i+1, n
          b(k)        = b(k)        - b(i)        * ( at(i, k) / at(i, i) )
          at(i+1:, k) = at(i+1:, k) - at(i+1:, i) * ( at(i, k) / at(i, i) )
       end do
...
    ! Solve
    x = 0.0
    do i=n, 1, -1
       b(i) = b(i) - dot_product( at(i+1:, i), x(i+1:) )
       x(i) = b(i) / at(i, i)
    end do

This change would be more cache efficient and promote AVX from L1 cache.

2 Likes

The code shown by @gnikit creates symmetric arrays a, a_rm and it gives consistent results. I tried with randomly generated a and a_rm=transpose(a). Unfortunately this produces different results, both with b_rm=b and b_rm=b(n:1:-1). Any thoughts?

  n = 10

  allocate(a(n, n), b(n), x(n))
  !    call fill(a, b)
  call random_number(a)
  call random_number(b)
  allocate(x_rm, source=x)
  a_rm = transpose(a)
  b_rm = b(n:1:-1)

  call gauss(a, b, x)
  call gauss_rm(a_rm, b_rm, x_rm)
  print '(10f10.6)', x
  print '(10f10.6)', x_rm
  print*, "mat size:", n, "rel L2 norm: ", norm2(x_rm- x)/ norm2(x)
  deallocate(a, b, x, a_rm, b_rm, x_rm)
1 Like

@msz59 The original gauss routine posted has a “race” condition in the following line.
a(k, :) = a(k, :) - a(i, :) / a(i, i) * a(k, i)

This could be overcome by correct range or using ( daxpy )

  factor = a(k, i) / a(i, i)
  a(k, i+1:) = a(k, i+1:) - a(i, i+1:) * factor

This may resolve the problem, if not already corrected.

2 Likes

In fortran, the code executes “as if” the rhs is fully evaluated before the lhs is altered, so there is no “race” condition. Sometimes, this requires that temporary arrays be allocated in order to do that computation. This is one of the traps of fortran array notation. In simple cases, you can expect the compiler to recognize when only a simple loop is required, but this case might be a little too complicated for that, so it might be better to rewrite it just to help the compiler nonetheless. Your rewrite above of the code to access the elements my columns instead of rows, and to only access the i+1:n elements during the swap and elimination steps should be near optimal for this particular algorithm.

1 Like

In fortran, you create an array of pointers by putting the pointer inside a derived type, and creating an array of that derived type. In this case however, it would be simpler to just do the gaussian elimination by columns on the transpose of the array, as others have suggested. All of the lapack routines have an argument that specifies whether to operate on the array or on the transpose of the array. All languages show a performance preference for one those cases over the other. This example just happened to pick a matrix storage order that was optimal for python but not for fortran. If the other layout had been chosen, then fortran would have had the natural advantage.

2 Likes

This is exactly what I did in my sample above

Actually it might be worth some testing regarding performance. The array of pointers to the rows is potentially faster way to access the array than the standard rank-2 access. For years the C analog (with true array of pointers, as allowed by C) was the only standard way to pass a matrix (2D array) to another function dynamically, requiring some pre-initialization but giving performance gain at the same time. Whether the array of derived types elements containing the pointers in Fortran can also give any gain, is to be checked.

1 Like

I would say that it is never more efficient to access arrays through row (or column) pointers rather than the underlying storage order. However, if the choice is between accessing elements in the wrong order with one and the correct order with the other, then there will of course be a difference. Storage of ragged arrays is an example where the row (or column) pointers are sometimes appropriate, whether it is done by the language (with actual memory pointers) or with integer array indices. That is a case where some care is necessary in the algorithm choices to avoid the pointer references as much as possible. The packed storage of symmetrical matrices is something that can be done in three different ways, pointers, integer array offsets, and by storage order expression, (I*(I-1))/2 + J. Perhaps not surprisingly, that last option is often the best, especially if the first part of that expression can be reused and moved out of the innermost loops. The integer offset array is simply the precomputed values of the first part of that expression, OFF(I)+J. This approach has the advantage that a single offset array can be used to address many packed-storage arrays. The pointer approach has the advantage that it is closer to the hardware, but its disadvantage is that a new pointer array is required for every packed array, it cannot be reused like the integer offset array. And these days, those hardware addresses are likely to be 64-bits, while the integer addressing may only require 32-bit integer storage.

3 Likes

Wow - that’s a lot of input.
This is my first topic created here and I hadn’t thought that I’d get this quantitiy of (useful) answers.

Thank you for this.

I forgot that Fortran saves arrays column based and oviously this is a leading edge for the Python implementation.

I’ll try to consider all these tips and create a new version of my algorithm that will beat the Python code (in speed and precision ^^).

You’ll find the code here in the next days.
Thank you for these amazing tips.

5 Likes

If you want to get the top performance, I also recommend doing everything including the driver program in Fortran. Use the tips above to optimize. Make sure you try a few compilers, at least GFortran and Intel and make sure you enable all optimizations options correctly for your platform. If you post the full Fortran code, people here can help you get the best performance.

Only then add the Python wrappers and check the speed. It must be the same. If it is not, then there is also some overhead, either in the wrappers, or perhaps not all optimization options were used. So you can then focus on just the wrappers.

2 Likes

By the way, just curious, has anyone tried run the same code on Mac M1?

Even the regular M1 chip’s memory bandwidth is around 70GB/s, which is 2X my DDR4 2666’s 35GB/s.
The high-end M1 Max has 400GB/s bandwidth.

If this algorithm’s bottleneck is memory operations, I guess M1 chip may be 2 times faster than PC like mine with DDR4 2666 even without any optimization.

1 Like

I used an M1 Chip for computing.
Using my posted code and this program to compare:

def benchmark_func_py():
    a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)])
    b = np.array([1 / (i + 1) for i in range(1, n + 1)])
    return gauss.gauss(a, b)


def benchmark_func_fortran():
    a = numpy.asfortranarray(np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)]))
    b = numpy.asfortranarray(np.array([1 / (i + 1) for i in range(1, n + 1)]))
    x = numpy.asfortranarray(np.array([0.0 for _i in range(n)]))
    gauss_m.gauss(a, b, x)
    return x


def benchmark_func_numpy():
    a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)])
    b = np.array([1 / (i + 1) for i in range(1, n + 1)])
    return np.linalg.solve(a, b)


def benchmark_main():
    py_time = timeit.timeit(benchmark_func_py, globals=globals(), number=1)
    f_time = timeit.timeit(benchmark_func_fortran, globals=globals(), number=1)
    np_time = timeit.timeit(benchmark_func_numpy, globals=globals(), number=1)
    print(f"Duration Python:  {py_time}s")
    print(f"Duration Fortran: {f_time}s")
    print(f"Duration Numpy:   {np_time}s")


if __name__ == "__main__":
    n = 2001
    benchmark_main()

I got this output:

Duration Python:  8.193421584s
Duration Fortran: 6.235588208999999s
Duration Numpy:   0.4742914580000015s

(Only 1 run per try… so there can be significant differences if you try it once again)

I don’t know if this output helps you. Obviously it’s dfficult to compare computers like this (without considering other hardware and software aspects).

Please note that the code has not been optimized since my post.

1 Like

Thank you @rnoffke .
Sorry for a lazy question, would you mind posting the whole Jupyter Notebook here?
I tried the link of the Jupyter notebook book you posted,

and I also

pip install gauss

in my anaconda 3.

But when I run the Juypyter notebook I got an error message below,

I am using anaconda 3 on windows 10.

By the way, I just bought a Macbook Air with 16GB ram.
Here is what I found, for a memory heavy Fortran code I wrote, (it frequently operates big arrays and matrices, consumes 6 - 12 GB memory),
M1 chip with gfortran with -O3 -march=native, took 30 seconds,
while my Thinkpad P72 with Intel xeon 2186M with 64 GB DDR4 2666 ECC, using intel OneAPI, -O3 -xHost, took 70 seconds.

Air’s M1 has memory bandwidth 68.5GB/s, while my Thinkpad has like 35 GB/s. So it is a factor of 2 difference. I believe speedup of my code is due to the memory speed difference between M1 and my Thinkpad’s DDR4 2666.

For smaller matrices and arrays, the speed difference become smaller.

2 Likes

@CRquantum Cant’t you a access all the files of the notebook?

I haven’t shared a Notebook like this before.
And honestly I don’t know if my Jetbrains license allows sharing the “everytime up to date” Notebook including the files here.

But I’ll send you the files.

1 Like

Thanks Rene, received the code you send, and I did

f2py -c gauss.f90 -m gauss_pkg

I actually did

f2py -c gauss.f90 -m gauss_pkg --f90flags='-O3 -march=native'

On Mac Air M1,
I installed the conda in miniforge because the miniforge version seems support native M1 chip.
Here is the result from M1,

Eh, however the time in seconds seems not exactly right. It actually took M1 Air like 1 to 1.5 hour to finish the run.

Ron,

I am trying to understand your statement in regard to the efficiency of Fortran array operations.
If I was to approach this, I would make a copy of the diagonal value only, not make a copy of the whole row.
Are you implying some inefficency in Fortran array operations if temporary arrays are created and then results updated at the end of the operation?
Would you apply the approach “the rhs is fully evaluated before the lhs is altered” to any daxpy : y = y + a*x vector operation ?

It is a quality of implementation issue. The result must be “as if” the rhs is fully evaluated, but if the compiler can figure out how to do the operation more efficiently, it can do so. The only way to know what a compiler does is to look at the code that it generates (or in the case of gfortran, you can look at the intermediate code, which might be easier to understand). If the compiler can’t figure out any good optimizations, then yes, work arrays must be allocated, used, and then deallocated, which makes the operation expensive. If the programmer can help the compiler, then it is more likely to result in efficient code.