Gauss Algorithm written in Fortran slower than written in Python for large matrices?

rnoffke · April 4, 2022, 7:55pm

I had to write the gauss algorithm in Python last week and has to use Numpy arrays to get (a little bit) speed.

After doing so, I asked my self how much faster the same code would be in Fortran.
So I wrote it again, this time in Fortran.

I should explain that n means the number of rows (and obviously columns) of the matrix a that includes the coefficients.

I add the written code here.

def gauss(a: np.array, b: np.array) -> np.array:
    for i in range(len(a)):
        # Check if row order must be changed
        if a[i, i] == 0:
            # Search for first field != 0
            for row_ind in range(i + 1, len(a)):
                if a[row_ind, i] != 0:
                    # Swap
                    a[[i, row_ind]] = a[[row_ind, i]]
                    b[[i, row_ind]] = b[[row_ind, i]]
                    break
        # Eliminate fields below
        for row_ind in range(i + 1, len(a)):
            b[row_ind] -= b[i] / a[i, i] * a[row_ind, i]
            a[row_ind, i:] -= (a[i, i:] / a[i, i] * a[row_ind, i])
            # a[row_ind] -= (a[i] / a[i, i] * a[row_ind, i])
    # Figure out x
    x = np.array([0 for _i in range(len(a))])
    for i in range(len(a) - 1, -1, -1):
        b[i] -= np.dot(a[i], x)
        x[i] = b[i] / a[i, i]
    return x

module gauss_m

contains

  subroutine gauss(a, b, x)
    real(8), intent(inout)  :: a(:, :), b(:), x(:)

    integer :: i, k, n
    real(8), allocatable :: tmp_a(:)
    real(8)              :: tmp_b

    n = size(a, 1)

    do i=1, n
       ! Check if has to swap
       if (a(i, i) == 0.0) then
          do k=i+1, n
             ! Swap values and exit loop
             if (a(k, i) /= 0.0) then
                tmp_a = a(k, :);   tmp_b = b(k)
                a(k, :) = a(i, :); b(k) = b(i)
                a(i, :) = tmp_a;   b(i) = tmp_b
                exit
             end if
          end do
       end if
       ! Eliminate items in row below
       do k=i+1, n
          b(k) = b(k) - (b(i) / a(i, i) * a(k, i))
          a(k, :) = a(k, :) - a(i, :) / a(i, i) * a(k, i)
       end do
    end do
    ! Solve
    x = 0.0
    do i=n, 1, -1
       b(i) = b(i) - dot_product(a(i, :), x)
       x(i) = b(i) / a(i, i)
    end do

  end subroutine gauss
  
end module gauss_m

To test how many time is needed to solve the problem I created a jupyter Notebook and created a cpython module using f2py to being able to import it into Python.

When I started to test how many time is needed, I was suprised how fast Fortran was able to solve the linear equations system.
It was up to 6 times faster than the Python code (although the conversion was written in Python and the code was wrapped using cpython).

But everytime I’m trying “big” n it looks like Python is getting faster than Fortran.

In this graphs you can see time needed to solve the n x n system,
“Python” means my selfwritten Python program, “Numpy” the function numpy.linalg.solve and “Fortran” my selfewritten Fortran code.

The whole Jupyter Notebook can be found here: JetBrains Datalore: A powerful environment for Jupyter notebooks.

It seems like Python is faster solving this system than Fortran if you choose a “big” n .

I think that’s not right but I cannot find a reason for. I tried a fiew things like using compiler flags for optimization, but it’s not getting much better.

If you’ve got any suggestions about how to solve this issue or why it is getting that slow please let me know.

ivanpribec · April 4, 2022, 8:53pm

Hopefully, Numpy is using a Fortran implementation of gesv internally, so Fortran is still the fastest .

One problem with your Fortran version is that you seem to be re-allocating tmp_a each time when swapping rows. Edit: I was wrong, see @Beliavsky’s comment below.

Beliavsky · April 4, 2022, 9:11pm

tmp_a has the same size throughout the subroutine. If x is an allocatable 1D array and
size(x) == size(y), I don’t think assigning y to x causes reallocation.

ivanpribec · April 4, 2022, 9:25pm

Another comment would be in your Jupyter notebook. Instead of using the explicit conversion function asfortranarray, try using the order argument:

def benchmark_func_fortran():
    a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)],order='F')
    b = np.array([1 / (i + 1) for i in range(1, n + 1)],order='F')
    x = np.array([0.0 for _i in range(n)],order='F')
    gauss_m.gauss(a, b, x)
    return x

I don’t think this will cure the problem entirely, but at least you won’t be counting the transpose operation in your timings.

ivanpribec · April 4, 2022, 11:51pm

Here’s a small driver to experiment with a few solvers, including:

The one introduced in this post, gauss
The elimination routine solve from Metcalf, Reid, & Cohen, Modern Fortran Explained, 2018, pg. 146
The LAPACK routine dgesv
The routines decomp and solve from Forsythe, Malcolm, and Moler, Computer Methods for Mathematical Computations, 1977

The driver can be compiled with:

$  gfortran -O3 -march=native gauss_m.f90 linear.f90 decomp.f solve.f fmm.f90 elim.f90 -o elim -llapack

MRC Linear module

linear.f90:

module linear

  implicit none
  private

  public :: solve
  
  integer, parameter :: dp = kind(1.0d0)

contains

  subroutine solve(a,b,piv_tol,ok)
    real(dp), intent(inout) :: a(:,:)
                    ! The matrix a
    real(dp), intent(inout) :: b(:)
                    ! The right-hand side vector on entry. 
                    ! Overwritten by the solution
    real(dp), intent(in) :: piv_tol
                    ! Smallest acceptable pivot.
    logical, intent(out) :: ok
                    ! True after a succesful entry
                    ! and false otherwise

! Local variables
    integer :: i    ! Row index
    integer :: j    ! Column index
    integer :: n    ! Matrix order
    real(dp), dimension(size(b)) :: row
                    ! Automatic array needed for workspace;
    real(dp) :: element   
                    ! Workspace variable
    
    n = size(b)
    ok = size(a,1) == n .and. size(a,2) == n
    if (.not. ok) then
      return
    end if

    do j = 1, n 

!   Update elements in column j
      do i = 1, j-1
        a(i+1:n, j) = a(i+1:n, j) - a(i,j) * a(i+1:n, i)
      end do

!   Find pivot and ceck its size
      i = maxloc(abs(a(j:n, j)), dim=1) + j - 1
      if (abs(a(i,j)) < piv_tol) then
        ok = .false.
        return
      end if

!   If necessary, apply row interchange
      if (i /= j) then
        row = a(j,:); a(j,:) = a(i,:); a(i,:) = row
        element = b(j); b(j) = b(i); b(i) = element
      end if

!   Compute elements j+1:n of j-th column
      a(j+1:n, j) = a(j+1:n, j)/a(j,j)
    end do

! Forward substitution
    do i = 1, n-1
      b(i+1:n) = b(i+1:n) - b(i)*a(i+1:n, i)
    end do

! Back substitution
    do j = n, 1, -1
      b(j) = b(j)/a(j,j)
      b(1:j-1) = b(1:j-1) - b(j)*a(1:j-1,j)
    end do

  end subroutine solve

end module linear

FMM module

fmm.f90 (the other two routines can be downloaded directly from netlib):

module fmm

  implicit none
  private

  public :: fmm_solve 

  integer, parameter :: dp = kind(1.0d0)

  interface
    subroutine decomp(ndim,n,a,cond,ipvt,work)
      import dp
      integer, intent(in) :: ndim, n
      real(dp), intent(inout) :: a(ndim,n)
      real(dp), intent(out) :: cond
      integer, intent(out) :: ipvt(n)
      real(dp), intent(inout) :: work(n) 
    end subroutine
    subroutine solve(ndim,n,a,b,ipvt)
      import dp
      integer, intent(in) :: ndim, n
      real(dp), intent(in) :: a(ndim,n)
      real(dp), intent(inout) :: b(n)
      integer, intent(in) :: ipvt(n)
    end subroutine
  end interface

contains

  subroutine fmm_solve(a,b,ok)
    real(dp), intent(inout) :: a(:,:)
    real(dp), intent(inout) :: b(:)
    logical, intent(out) :: ok

    integer :: ipvt(size(b))
    real(dp) :: work(size(b)), cond
    integer :: n, ndim

    n = size(b)
    ndim = size(a,1)

    ok = ndim >= n .and. size(a,2) == n
    if (.not. ok) then
      return
    end if

    call decomp(ndim,n,a,cond,ipvt,work)
    if (cond == 1.0d32) then
      ok = .false.
      return
    end if

    call solve(ndim,n,a,b,ipvt)

  end subroutine fmm_solve

end module fmm

The driver elim.f90:

program elim

  use gauss_m
  use linear, only: solve
  use fmm, only: fmm_solve

  implicit none

  integer :: n, j
 
  real(dp), allocatable :: a(:,:), b(:,:), x(:)
 
  ! MCR solve
  real(dp), parameter :: ptol = 1.e-14_dp
  logical :: okay
 
  ! LAPACK related variables
  external :: dgesv
  integer , allocatable :: ipiv(:)
  integer :: info

  integer :: ncmd
  character(len=256) :: cmd
  logical :: print_solution
  real(dp) :: t(5)

  ncmd = command_argument_count()
  if (ncmd /= 2) then
    write(*,*) "Usage: elim <n> <t/f>"
    write(*,*) ""
    write(*,*) "  n   - integer problem size"
    write(*,*) "  t/f - print solution vector, true or false"
    write(*,*) ""
    stop
  end if

  ! Get problem size
  call get_command_argument(1,cmd)
  read(cmd,*) n 

  ! Whether to print the solution vectors for inspection
  call get_command_argument(2,cmd)
  read(cmd,*) print_solution

  allocate(a(n,n),b(n,4),x(n),ipiv(n))

  !
  ! Python to Fortran port from Discourse
  !

  call cpu_time(t(1))

  call fill(a,b(:,1))
  call gauss(a,b(:,1),x)
  b(:,1) = x   ! Copy solution into b for output

  !
  ! Metcalf, Reid, Cohen
  ! 

  call cpu_time(t(2))

  call fill(a,b(:,2))
  call solve(a,b(:,2),ptol,okay)
  if (.not. okay) then
    write(*,*) "MRC failed"
  end if

  !
  ! LAPACK: dgesv
  !

  call cpu_time(t(3))

  call fill(a,b(:,3))
  call dgesv(n,1,a,n,ipiv,b(:,3),n,info)

  !
  ! Forsythe, Malcolm, Moler
  !

  call cpu_time(t(4))

  call fill(a,b(:,4))
  call fmm_solve(a,b(:,4),okay)
  if (.not. okay) then
    write(*,*) "FMM failed"
  end if

  call cpu_time(t(5))


  !
  ! Print results
  !

  write(*,'(4A16)') (repeat('-',14),j=1,4)
  write(*,'(4A16)') "gauss  ", "MRC    ", "dgesv  ", "FMM    "

  if (print_solution) then
    write(*,'(4A16)') (repeat('-',14),j=1,4)
    write(*,'(/,A,/)') "Solutions:"
    write(*,'(4E16.8)') (b(j,:),j=1,n)
  end if

  write(*,'(4A16)') (repeat('-',14),j=1,4)
  write(*,'(/,A,/)') "Time (s): "
  write(*,'(4E16.8)') t(2:5) - t(1:4)
  write(*,*)

contains

  subroutine fill(a,b)
    real(dp), intent(out) :: a(:,:), b(:)
    integer :: i, j

    do j = 1, size(a,2)
      do i = 1, size(a,1)
        a(i,j) = 1.0_dp/real(i + j - 1,dp)
      end do
    end do

    do i = 1, size(b)
      b(i) = 1.0_dp/real(i + 1,dp)
    end do

  end subroutine

end program

Here’s some sample output for N=2000:

$ ./elim 2000 f
 MRC failed
  --------------  --------------  --------------  --------------
         gauss           MRC             dgesv           FMM    
  --------------  --------------  --------------  --------------

Time (s): 

  0.21772618E+02  0.20940000E-02  0.96932900E+00  0.72896100E+00

And for N = 5 with output:

$ ./elim 5 t
  --------------  --------------  --------------  --------------
         gauss           MRC             dgesv           FMM    
  --------------  --------------  --------------  --------------

Solutions:

  0.00000000E+00 -0.11102230E-15  0.00000000E+00 -0.11102230E-15
  0.10000000E+01  0.10000000E+01  0.10000000E+01  0.10000000E+01
  0.00000000E+00 -0.38857806E-14  0.00000000E+00 -0.38857806E-14
  0.00000000E+00  0.25905204E-14  0.00000000E+00  0.25905204E-14
  0.00000000E+00  0.48865080E-26  0.00000000E+00  0.48865080E-26
  --------------  --------------  --------------  --------------

Time (s): 

  0.10000000E-05  0.10000000E-05  0.23000000E-04  0.40000000E-05

The accuracy of the routines starts to diverge noticeably as N is increased. Also the solution of the gauss routine seems to be “fixed” to static value somehow.

JohnCampbell · April 5, 2022, 5:58am

A problem with this inner loop is it is memory hungry for large n, has poor cache efficiency and could also minimise AVX usage.

Strategies/algorithms to improve AVX instruction use and cache utilisation are very important for large N.

A much better solution, where A is symmetric is to use “Crout”, rather than “Gauss”, which is column based and much more efficient for memory addressing and AVX.

That said, if you are programming the same algorithm using Python and Fortran, you may have to check the compile options. I always use -O3 -march=native -ffast-math to improve gFortran performance, but don’t know what are the options you have used or how these compare to Python options. (-ffast-math is ok for my large sets of linear equations and provides significant speed-up, but may not be suitable for other cases)

RonShepard · April 5, 2022, 6:41am

a(k, : ) = a(i, : )

In the swap step in the fortran code, I see these full-row operations. I do not see full-row operations in the python code, it seems to have only partial-row operations. The same thing in the elimination step, partial-row operations in the python code and full-row operations in the fortran code. It looks like the fortran code might be doing twice as much work as the python code.

jbdv · April 5, 2022, 7:19am

I think the comments by @JohnCampbell and @RonShepard point out the root cause of the issue. The implemented algorithm performs full-row operations in matrix a when swapping rows (a[[i, row_ind]] = a[[row_ind, i]] in Python and the equivalent of a(k, :) = a(i, :) in Fortran). Since Fortran uses column major memory layout (columns are contiguous in memory), this becomes an inefficient operation. However, by default, numpy arrays are row major (order='C', so that rows are contiguous in memory). As n becomes large, the Fortran becomes very cache inefficient, while the Python routine does not suffer from the same issue.

It is, however, very easy to change the memory layout of a numpy array, and you can do this to illustrate the problem, i.e. how fast is the Python routine, if we change the memory layout of a to Fortran order?:

def f_order():
  a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)], order = 'F')
  b = np.array([1 / (i + 1) for i in range(1, n + 1)])
  xf = gauss(a, b)
  return a, b, xf

def c_order():
  a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)])
  b = np.array([1 / (i + 1) for i in range(1, n + 1)])
  xc = gauss(a, b)
  return a, b, xc

print("order='F': ", timeit.timeit(f_order, number=3) / 3)
print("order='C': ", timeit.timeit(c_order, number=3) / 3)

With n=1024 I get:

order='F':  9.011245
order='C':  4.227592

and with n=2048:

order='F':  84.46075615
order='C':  25.5663385

The algorithm you implemented is exactly the same in Python and Fortran, and the algorithm performs full-row operations. If memory layout is row-major, the algorithm is efficient (Python with default or order='C'), however, if the memory layout is column-major, the algorithm is inefficient (Python with order='F', or the default in Fortran).

msz59 · April 5, 2022, 8:08am

For row-intensive operations, like Gauss elimination etc., one could possibly prepare a hybrid matrix, constructed as an array of pointers (embedded in a derived type, as Fortran, lamentably, does not provide regular array of pointers) to the matrix rows. That way, the row data are contiguous and all operations are done on rank-1 arrays.
Due to the aforementioned deficiency, the syntax is a bit clumsy, but it works, one can swap whole or partial rows as seen in the following super simple sample.

program main
  implicit none
  type rows
    real, pointer :: row(:)
  end type rows
  type(rows), allocatable :: array(:)
  integer :: n=8, i, j
  real, allocatable :: swap(:)
  real :: x,y

  allocate(array(n),swap(n))
  do i=1,n
    allocate(array(i)%row(n))
    do j=1,n
      array(i)%row(j) = 100.0*i+j
    end do
  end do
  print 10,array(1)%row  !  101.0 102.0 103.0 104.0 105.0 106.0 107.0 108.0
  print 10,array(5)%row  !  501.0 502.0 503.0 504.0 505.0 506.0 507.0 508.0
  swap = array(1)%row
  array(1)%row = array(5)%row
  array(5)%row = swap
  print 10,array(1)%row  !  501.0 502.0 503.0 504.0 505.0 506.0 507.0 508.0
  print 10,array(5)%row  !  101.0 102.0 103.0 104.0 105.0 106.0 107.0 108.0
  swap(1:n/2) = array(1)%row(1:n/2)
  array(1)%row(1:n/2) = array(5)%row(1:n/2)
  array(5)%row(1:n/2) = swap(1:n/2)
  print 10,array(1)%row  !  101.0 102.0 103.0 104.0 505.0 506.0 507.0 508.0
  print 10,array(5)%row  !  501.0 502.0 503.0 504.0 105.0 106.0 107.0 108.0
10 format(*(f6.1))
end program main

BTW, anybody knowing why actually array of pointers cannot exist in Modern Fortran?

Carltoffel · April 5, 2022, 9:38am

I think this might be because it not feasible with the current syntax:
If you define a pointer, the dimension attribute defines the shape of the array it can point to. Therefore the dimension attribute is already occupied. Maybe it would be possible to solve it with:

real, pointer(:) :: ptr(3, 3) ! 3x3 array with pointers to 1D arrays
print*, ptr(1, 1)(:) ! print first 1D array

But this would break old code.
And what would happen, if you type this:

print*, ptr(2, :)(1)

?

Another reason could be, that this syntax might foster slow code constructs.

gnikit · April 5, 2022, 10:22am

I think the main reason for the poor performance is what @jbdv said. Changing the memory access to be column major in the Fortran implementation yields a notable difference

 mat size:          500 timigs (col-major, row-major)   6.0472999999999999E-002   8.3461999999999995E-002
 mat size:         1000 timigs (col-major, row-major)  0.17684999999999998        2.1223450000000001     
 mat size:         1500 timigs (col-major, row-major)  0.77685899999999997        11.889548000000000     
 mat size:         2000 timigs (col-major, row-major)   2.0247399999999995        31.665683999999999     
 mat size:         2500 timigs (col-major, row-major)   4.1311760000000035        70.841338999999991     
 mat size:         3000 timigs (col-major, row-major)   7.3490720000000067        121.72819000000001     
 mat size:         3500 timigs (col-major, row-major)   12.796781000000010        210.18912300000000     
 mat size:         4000 timigs (col-major, row-major)   17.554618000000005        337.45606700000002

python_vs_fortran_gauss

Source code

module gauss_m

   private
   public :: gauss, gauss_rm
   integer, parameter :: dp = kind(1.0d0)


contains

   subroutine gauss(a, b, x)
      real(dp), intent(inout)  :: a(:, :), b(:), x(:)

      integer :: i, k, n
      real(dp), allocatable :: tmp_a(:)
      real(dp)              :: tmp_b

      n = size(a, 1)

      do i=1, n
         ! Check if has to swap
         if (a(i, i) == 0.0) then
            do k=i+1, n
               ! Swap values and exit loop
               if (a(i, k) /= 0.0) then
                  tmp_a = a(:, k);   tmp_b = b(k)
                  a(:, k) = a(:, i); b(k) = b(i)
                  a(:, i) = tmp_a;   b(i) = tmp_b
                  exit
               end if
            end do
         end if
         ! Eliminate items in row below
         do k=i+1, n
            b(k) = b(k) - (b(i) / a(i, i) * a(k, i))
            a(:, k) = a(:, k) - a(:, i) / a(i, i) * a(i, k)
         end do
      end do
      ! Solve
      x = 0.0
      do i=n, 1, -1
         b(i) = b(i) - dot_product(a(:, i), x)
         x(i) = b(i) / a(i, i)
      end do

   end subroutine gauss

   subroutine gauss_rm(a, b, x)
      !! row major implementation, do not use for actual solves
      real(dp), intent(inout)  :: a(:, :), b(:), x(:)

      integer :: i, k, n
      real(dp), allocatable :: tmp_a(:)
      real(dp)              :: tmp_b

      n = size(a, 1)

      do i=1, n
         ! Check if has to swap
         if (a(i, i) == 0.0) then
            do k=i+1, n
               ! Swap values and exit loop
               if (a(k, i) /= 0.0) then
                  tmp_a = a(k, :);   tmp_b = b(k)
                  a(k, :) = a(i, :); b(k) = b(i)
                  a(i, :) = tmp_a;   b(i) = tmp_b
                  exit
               end if
            end do
         end if
         ! Eliminate items in row below
         do k=i+1, n
            b(k) = b(k) - (b(i) / a(i, i) * a(k, i))
            a(k, :) = a(k, :) - a(i, :) / a(i, i) * a(k, i)
         end do
      end do
      ! Solve
      x = 0.0
      do i=n, 1, -1
         b(i) = b(i) - dot_product(a(i, :), x)
         x(i) = b(i) / a(i, i)
      end do

   end subroutine gauss_rm

end module gauss_m


program main

   use gauss_m

   implicit none

   integer, parameter :: dp = kind(1.0d0)
   real(dp), allocatable :: a(:,:), b(:), x(:), a_rm(:,:), b_rm(:), x_rm(:)
   real(dp), allocatable :: timings(:,:)
   integer :: n, n_min, n_max, n_int

   n_min = 500; n_max = 5000; n_int = 500;
   
   allocate(timings(((n_max - n_min)/ n_int) + 1, 2))

   do n = n_min, n_max, n_int

      allocate(a(n, n), b(n), x(n))
      call fill(a, b)
      allocate(a_rm, source=a); allocate(b_rm, source=b); allocate(x_rm, source=x)

      timings(n/n_int, 1) = timeit(a, b, x, gauss)
      timings(n/n_int, 2) = timeit(a_rm, b_rm, x_rm, gauss_rm)
      print*, "mat size:", n, "timigs (col-major, row-major):", timings(n/n_int, :), "rel L2 norm: ", norm2(x_rm- x)/ norm2(x)

      deallocate(a, b, x, a_rm, b_rm, x_rm)
   end do

   print*, "          TIMINGS         "
   print*, repeat("-", 80)
   print*, "column major: ", timings(:, 1)
   print*, "row    major: ", timings(:, 2)
   print*, repeat("-", 80)


contains

   subroutine fill(a, b)
      real(dp), intent(out) :: a(:,:), b(:)
      integer :: i, j

      do j = 1, size(a,2)
         do i = 1, size(a,1)
            a(i,j) = 1.0_dp/real(i + j - 1,dp)
         end do
      end do

      do i = 1, size(b)
         b(i) = 1.0_dp/real(i + 1,dp)
      end do

   end subroutine

   function timeit(a, b, x, sub) result(time)
      real(dp), intent(inout) :: a(:,:), b(:), x(:)
      interface
         subroutine sub(a, b, x)
            integer, parameter :: dp = kind(1.0d0)
            real(dp), intent(inout) :: a(:,:), b(:), x(:)
         end subroutine sub
      end interface

      real(dp) :: time
      real(dp) :: starttime

      call cpu_time(starttime)
      call sub(a, b, x)
      call cpu_time(time)
      time = time - starttime
   end function timeit

end program main

PS thanks @ivanpribec for the fill routine I was too lazy to write one

JohnCampbell · April 5, 2022, 12:25pm

A simple alternative is to use/supply the At transpose matrix.
As a simple test, the gauss could be programmed for At, and also ensure that only At(i+1:,k) is used for each step.
Based on the code provided:

       do k=i+1, n
          b(k) = b(k) - (b(i) / a(i, i) * a(k, i))
          a(k, :) = a(k, :) - a(i, :) / a(i, i) * a(k, i)
       end do

This could become of the form

       do k=i+1, n
          b(k)        = b(k)        - b(i)        * ( at(i, k) / at(i, i) )
          at(i+1:, k) = at(i+1:, k) - at(i+1:, i) * ( at(i, k) / at(i, i) )
       end do
...
    ! Solve
    x = 0.0
    do i=n, 1, -1
       b(i) = b(i) - dot_product( at(i+1:, i), x(i+1:) )
       x(i) = b(i) / at(i, i)
    end do

This change would be more cache efficient and promote AVX from L1 cache.

msz59 · April 5, 2022, 12:53pm

The code shown by @gnikit creates symmetric arrays a, a_rm and it gives consistent results. I tried with randomly generated a and a_rm=transpose(a). Unfortunately this produces different results, both with b_rm=b and b_rm=b(n:1:-1). Any thoughts?

  n = 10

  allocate(a(n, n), b(n), x(n))
  !    call fill(a, b)
  call random_number(a)
  call random_number(b)
  allocate(x_rm, source=x)
  a_rm = transpose(a)
  b_rm = b(n:1:-1)

  call gauss(a, b, x)
  call gauss_rm(a_rm, b_rm, x_rm)
  print '(10f10.6)', x
  print '(10f10.6)', x_rm
  print*, "mat size:", n, "rel L2 norm: ", norm2(x_rm- x)/ norm2(x)
  deallocate(a, b, x, a_rm, b_rm, x_rm)

JohnCampbell · April 5, 2022, 1:08pm

@msz59 The original gauss routine posted has a “race” condition in the following line.
a(k, :) = a(k, :) - a(i, :) / a(i, i) * a(k, i)

This could be overcome by correct range or using ( daxpy )

  factor = a(k, i) / a(i, i)
  a(k, i+1:) = a(k, i+1:) - a(i, i+1:) * factor

This may resolve the problem, if not already corrected.

RonShepard · April 5, 2022, 4:32pm

In fortran, the code executes “as if” the rhs is fully evaluated before the lhs is altered, so there is no “race” condition. Sometimes, this requires that temporary arrays be allocated in order to do that computation. This is one of the traps of fortran array notation. In simple cases, you can expect the compiler to recognize when only a simple loop is required, but this case might be a little too complicated for that, so it might be better to rewrite it just to help the compiler nonetheless. Your rewrite above of the code to access the elements my columns instead of rows, and to only access the i+1:n elements during the swap and elimination steps should be near optimal for this particular algorithm.

RonShepard · April 5, 2022, 4:41pm

In fortran, you create an array of pointers by putting the pointer inside a derived type, and creating an array of that derived type. In this case however, it would be simpler to just do the gaussian elimination by columns on the transpose of the array, as others have suggested. All of the lapack routines have an argument that specifies whether to operate on the array or on the transpose of the array. All languages show a performance preference for one those cases over the other. This example just happened to pick a matrix storage order that was optimal for python but not for fortran. If the other layout had been chosen, then fortran would have had the natural advantage.

msz59 · April 5, 2022, 4:51pm

This is exactly what I did in my sample above

Actually it might be worth some testing regarding performance. The array of pointers to the rows is potentially faster way to access the array than the standard rank-2 access. For years the C analog (with true array of pointers, as allowed by C) was the only standard way to pass a matrix (2D array) to another function dynamically, requiring some pre-initialization but giving performance gain at the same time. Whether the array of derived types elements containing the pointers in Fortran can also give any gain, is to be checked.

RonShepard · April 5, 2022, 5:25pm

I would say that it is never more efficient to access arrays through row (or column) pointers rather than the underlying storage order. However, if the choice is between accessing elements in the wrong order with one and the correct order with the other, then there will of course be a difference. Storage of ragged arrays is an example where the row (or column) pointers are sometimes appropriate, whether it is done by the language (with actual memory pointers) or with integer array indices. That is a case where some care is necessary in the algorithm choices to avoid the pointer references as much as possible. The packed storage of symmetrical matrices is something that can be done in three different ways, pointers, integer array offsets, and by storage order expression, (I*(I-1))/2 + J. Perhaps not surprisingly, that last option is often the best, especially if the first part of that expression can be reused and moved out of the innermost loops. The integer offset array is simply the precomputed values of the first part of that expression, OFF(I)+J. This approach has the advantage that a single offset array can be used to address many packed-storage arrays. The pointer approach has the advantage that it is closer to the hardware, but its disadvantage is that a new pointer array is required for every packed array, it cannot be reused like the integer offset array. And these days, those hardware addresses are likely to be 64-bits, while the integer addressing may only require 32-bit integer storage.

rnoffke · April 5, 2022, 6:13pm

Wow - that’s a lot of input.
This is my first topic created here and I hadn’t thought that I’d get this quantitiy of (useful) answers.

Thank you for this.

I forgot that Fortran saves arrays column based and oviously this is a leading edge for the Python implementation.

I’ll try to consider all these tips and create a new version of my algorithm that will beat the Python code (in speed and precision ^^).

You’ll find the code here in the next days.
Thank you for these amazing tips.

certik · April 5, 2022, 11:06pm

If you want to get the top performance, I also recommend doing everything including the driver program in Fortran. Use the tips above to optimize. Make sure you try a few compilers, at least GFortran and Intel and make sure you enable all optimizations options correctly for your platform. If you post the full Fortran code, people here can help you get the best performance.

Only then add the Python wrappers and check the speed. It must be the same. If it is not, then there is also some overhead, either in the wrappers, or perhaps not all optimization options were used. So you can then focus on just the wrappers.

Topic		Replies	Views
Does the speed (bandwidth) of memory matter?	11	1624	April 11, 2022
Gaussian processes in fortran	0	335	November 14, 2023
Algorithm Archive -- add Fortran	0	341	November 7, 2021
Speed of array intrinsics	22	1900	April 14, 2025
Optimizing vectorized array operations Help	78	1310	April 2, 2025

Gauss Algorithm written in Fortran slower than written in Python for large matrices?

Related topics