# Gauss Algorithm written in Fortran slower than written in Python for large matrices?

I had to write the gauss algorithm in Python last week and has to use Numpy arrays to get (a little bit) speed.

After doing so, I asked my self how much faster the same code would be in Fortran.
So I wrote it again, this time in Fortran.

I should explain that `n` means the number of rows (and obviously columns) of the matrix `a` that includes the coefficients.

I add the written code here.

``````def gauss(a: np.array, b: np.array) -> np.array:
for i in range(len(a)):
# Check if row order must be changed
if a[i, i] == 0:
# Search for first field != 0
for row_ind in range(i + 1, len(a)):
if a[row_ind, i] != 0:
# Swap
a[[i, row_ind]] = a[[row_ind, i]]
b[[i, row_ind]] = b[[row_ind, i]]
break
# Eliminate fields below
for row_ind in range(i + 1, len(a)):
b[row_ind] -= b[i] / a[i, i] * a[row_ind, i]
a[row_ind, i:] -= (a[i, i:] / a[i, i] * a[row_ind, i])
# a[row_ind] -= (a[i] / a[i, i] * a[row_ind, i])
# Figure out x
x = np.array([0 for _i in range(len(a))])
for i in range(len(a) - 1, -1, -1):
b[i] -= np.dot(a[i], x)
x[i] = b[i] / a[i, i]
return x
``````
``````module gauss_m

contains

subroutine gauss(a, b, x)
real(8), intent(inout)  :: a(:, :), b(:), x(:)

integer :: i, k, n
real(8), allocatable :: tmp_a(:)
real(8)              :: tmp_b

n = size(a, 1)

do i=1, n
! Check if has to swap
if (a(i, i) == 0.0) then
do k=i+1, n
! Swap values and exit loop
if (a(k, i) /= 0.0) then
tmp_a = a(k, :);   tmp_b = b(k)
a(k, :) = a(i, :); b(k) = b(i)
a(i, :) = tmp_a;   b(i) = tmp_b
exit
end if
end do
end if
! Eliminate items in row below
do k=i+1, n
b(k) = b(k) - (b(i) / a(i, i) * a(k, i))
a(k, :) = a(k, :) - a(i, :) / a(i, i) * a(k, i)
end do
end do
! Solve
x = 0.0
do i=n, 1, -1
b(i) = b(i) - dot_product(a(i, :), x)
x(i) = b(i) / a(i, i)
end do

end subroutine gauss

end module gauss_m
``````

To test how many time is needed to solve the problem I created a jupyter Notebook and created a `cpython` module using `f2py` to being able to import it into Python.

When I started to test how many time is needed, I was suprised how fast Fortran was able to solve the linear equations system.
It was up to 6 times faster than the Python code (although the conversion was written in Python and the code was wrapped using cpython).

But everytime I’m trying “big” `n` it looks like Python is getting faster than Fortran.

In this graphs you can see time needed to solve the `n` x `n` system,
“Python” means my selfwritten Python program, “Numpy” the function `numpy.linalg.solve` and “Fortran” my selfewritten Fortran code.

The whole Jupyter Notebook can be found here: JetBrains Datalore: A powerful environment for Jupyter notebooks.

It seems like Python is faster solving this system than Fortran if you choose a “big” `n` .

I think that’s not right but I cannot find a reason for. I tried a fiew things like using compiler flags for optimization, but it’s not getting much better.

If you’ve got any suggestions about how to solve this issue or why it is getting that slow please let me know.

4 Likes

Hopefully, Numpy is using a Fortran implementation of `gesv` internally, so Fortran is still the fastest .

One problem with your Fortran version is that you seem to be re-allocating `tmp_a` each time when swapping rows. Edit: I was wrong, see @Beliavsky’s comment below.

3 Likes

`tmp_a` has the same size throughout the subroutine. If `x` is an allocatable 1D array and
`size(x) == size(y)`, I don’t think assigning y to x causes reallocation.

4 Likes

Another comment would be in your Jupyter notebook. Instead of using the explicit conversion function `asfortranarray`, try using the `order` argument:

``````def benchmark_func_fortran():
a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)],order='F')
b = np.array([1 / (i + 1) for i in range(1, n + 1)],order='F')
x = np.array([0.0 for _i in range(n)],order='F')
gauss_m.gauss(a, b, x)
return x
``````

I don’t think this will cure the problem entirely, but at least you won’t be counting the transpose operation in your timings.

1 Like

Here’s a small driver to experiment with a few solvers, including:

1. The one introduced in this post, `gauss`
2. The elimination routine `solve` from Metcalf, Reid, & Cohen, Modern Fortran Explained, 2018, pg. 146
3. The LAPACK routine `dgesv`
4. The routines `decomp` and `solve` from Forsythe, Malcolm, and Moler, Computer Methods for Mathematical Computations, 1977

The driver can be compiled with:

``````\$  gfortran -O3 -march=native gauss_m.f90 linear.f90 decomp.f solve.f fmm.f90 elim.f90 -o elim -llapack
``````
MRC Linear module

`linear.f90`:

``````module linear

implicit none
private

public :: solve

integer, parameter :: dp = kind(1.0d0)

contains

subroutine solve(a,b,piv_tol,ok)
real(dp), intent(inout) :: a(:,:)
! The matrix a
real(dp), intent(inout) :: b(:)
! The right-hand side vector on entry.
! Overwritten by the solution
real(dp), intent(in) :: piv_tol
! Smallest acceptable pivot.
logical, intent(out) :: ok
! True after a succesful entry
! and false otherwise

! Local variables
integer :: i    ! Row index
integer :: j    ! Column index
integer :: n    ! Matrix order
real(dp), dimension(size(b)) :: row
! Automatic array needed for workspace;
real(dp) :: element
! Workspace variable

n = size(b)
ok = size(a,1) == n .and. size(a,2) == n
if (.not. ok) then
return
end if

do j = 1, n

!   Update elements in column j
do i = 1, j-1
a(i+1:n, j) = a(i+1:n, j) - a(i,j) * a(i+1:n, i)
end do

!   Find pivot and ceck its size
i = maxloc(abs(a(j:n, j)), dim=1) + j - 1
if (abs(a(i,j)) < piv_tol) then
ok = .false.
return
end if

!   If necessary, apply row interchange
if (i /= j) then
row = a(j,:); a(j,:) = a(i,:); a(i,:) = row
element = b(j); b(j) = b(i); b(i) = element
end if

!   Compute elements j+1:n of j-th column
a(j+1:n, j) = a(j+1:n, j)/a(j,j)
end do

! Forward substitution
do i = 1, n-1
b(i+1:n) = b(i+1:n) - b(i)*a(i+1:n, i)
end do

! Back substitution
do j = n, 1, -1
b(j) = b(j)/a(j,j)
b(1:j-1) = b(1:j-1) - b(j)*a(1:j-1,j)
end do

end subroutine solve

end module linear
``````
FMM module

`fmm.f90` (the other two routines can be downloaded directly from netlib):

``````module fmm

implicit none
private

public :: fmm_solve

integer, parameter :: dp = kind(1.0d0)

interface
subroutine decomp(ndim,n,a,cond,ipvt,work)
import dp
integer, intent(in) :: ndim, n
real(dp), intent(inout) :: a(ndim,n)
real(dp), intent(out) :: cond
integer, intent(out) :: ipvt(n)
real(dp), intent(inout) :: work(n)
end subroutine
subroutine solve(ndim,n,a,b,ipvt)
import dp
integer, intent(in) :: ndim, n
real(dp), intent(in) :: a(ndim,n)
real(dp), intent(inout) :: b(n)
integer, intent(in) :: ipvt(n)
end subroutine
end interface

contains

subroutine fmm_solve(a,b,ok)
real(dp), intent(inout) :: a(:,:)
real(dp), intent(inout) :: b(:)
logical, intent(out) :: ok

integer :: ipvt(size(b))
real(dp) :: work(size(b)), cond
integer :: n, ndim

n = size(b)
ndim = size(a,1)

ok = ndim >= n .and. size(a,2) == n
if (.not. ok) then
return
end if

call decomp(ndim,n,a,cond,ipvt,work)
if (cond == 1.0d32) then
ok = .false.
return
end if

call solve(ndim,n,a,b,ipvt)

end subroutine fmm_solve

end module fmm
``````

The driver `elim.f90`:

``````program elim

use gauss_m
use linear, only: solve
use fmm, only: fmm_solve

implicit none

integer :: n, j

real(dp), allocatable :: a(:,:), b(:,:), x(:)

! MCR solve
real(dp), parameter :: ptol = 1.e-14_dp
logical :: okay

! LAPACK related variables
external :: dgesv
integer , allocatable :: ipiv(:)
integer :: info

integer :: ncmd
character(len=256) :: cmd
logical :: print_solution
real(dp) :: t(5)

ncmd = command_argument_count()
if (ncmd /= 2) then
write(*,*) "Usage: elim <n> <t/f>"
write(*,*) ""
write(*,*) "  n   - integer problem size"
write(*,*) "  t/f - print solution vector, true or false"
write(*,*) ""
stop
end if

! Get problem size
call get_command_argument(1,cmd)

! Whether to print the solution vectors for inspection
call get_command_argument(2,cmd)

allocate(a(n,n),b(n,4),x(n),ipiv(n))

!
! Python to Fortran port from Discourse
!

call cpu_time(t(1))

call fill(a,b(:,1))
call gauss(a,b(:,1),x)
b(:,1) = x   ! Copy solution into b for output

!
! Metcalf, Reid, Cohen
!

call cpu_time(t(2))

call fill(a,b(:,2))
call solve(a,b(:,2),ptol,okay)
if (.not. okay) then
write(*,*) "MRC failed"
end if

!
! LAPACK: dgesv
!

call cpu_time(t(3))

call fill(a,b(:,3))
call dgesv(n,1,a,n,ipiv,b(:,3),n,info)

!
! Forsythe, Malcolm, Moler
!

call cpu_time(t(4))

call fill(a,b(:,4))
call fmm_solve(a,b(:,4),okay)
if (.not. okay) then
write(*,*) "FMM failed"
end if

call cpu_time(t(5))

!
! Print results
!

write(*,'(4A16)') (repeat('-',14),j=1,4)
write(*,'(4A16)') "gauss  ", "MRC    ", "dgesv  ", "FMM    "

if (print_solution) then
write(*,'(4A16)') (repeat('-',14),j=1,4)
write(*,'(/,A,/)') "Solutions:"
write(*,'(4E16.8)') (b(j,:),j=1,n)
end if

write(*,'(4A16)') (repeat('-',14),j=1,4)
write(*,'(/,A,/)') "Time (s): "
write(*,'(4E16.8)') t(2:5) - t(1:4)
write(*,*)

contains

subroutine fill(a,b)
real(dp), intent(out) :: a(:,:), b(:)
integer :: i, j

do j = 1, size(a,2)
do i = 1, size(a,1)
a(i,j) = 1.0_dp/real(i + j - 1,dp)
end do
end do

do i = 1, size(b)
b(i) = 1.0_dp/real(i + 1,dp)
end do

end subroutine

end program
``````

Here’s some sample output for N=2000:

``````\$ ./elim 2000 f
MRC failed
--------------  --------------  --------------  --------------
gauss           MRC             dgesv           FMM
--------------  --------------  --------------  --------------

Time (s):

0.21772618E+02  0.20940000E-02  0.96932900E+00  0.72896100E+00

``````

And for N = 5 with output:

``````\$ ./elim 5 t
--------------  --------------  --------------  --------------
gauss           MRC             dgesv           FMM
--------------  --------------  --------------  --------------

Solutions:

0.00000000E+00 -0.11102230E-15  0.00000000E+00 -0.11102230E-15
0.10000000E+01  0.10000000E+01  0.10000000E+01  0.10000000E+01
0.00000000E+00 -0.38857806E-14  0.00000000E+00 -0.38857806E-14
0.00000000E+00  0.25905204E-14  0.00000000E+00  0.25905204E-14
0.00000000E+00  0.48865080E-26  0.00000000E+00  0.48865080E-26
--------------  --------------  --------------  --------------

Time (s):

0.10000000E-05  0.10000000E-05  0.23000000E-04  0.40000000E-05

``````

The accuracy of the routines starts to diverge noticeably as N is increased. Also the solution of the `gauss` routine seems to be “fixed” to static value somehow.

3 Likes

A problem with this inner loop is it is memory hungry for large n, has poor cache efficiency and could also minimise AVX usage.

Strategies/algorithms to improve AVX instruction use and cache utilisation are very important for large N.

A much better solution, where A is symmetric is to use “Crout”, rather than “Gauss”, which is column based and much more efficient for memory addressing and AVX.

That said, if you are programming the same algorithm using Python and Fortran, you may have to check the compile options. I always use -O3 -march=native -ffast-math to improve gFortran performance, but don’t know what are the options you have used or how these compare to Python options. (-ffast-math is ok for my large sets of linear equations and provides significant speed-up, but may not be suitable for other cases)

2 Likes

a(k, : ) = a(i, : )

In the swap step in the fortran code, I see these full-row operations. I do not see full-row operations in the python code, it seems to have only partial-row operations. The same thing in the elimination step, partial-row operations in the python code and full-row operations in the fortran code. It looks like the fortran code might be doing twice as much work as the python code.

1 Like

I think the comments by @JohnCampbell and @RonShepard point out the root cause of the issue. The implemented algorithm performs full-row operations in matrix `a` when swapping rows (`a[[i, row_ind]] = a[[row_ind, i]]` in Python and the equivalent of `a(k, :) = a(i, :)` in Fortran). Since Fortran uses column major memory layout (columns are contiguous in memory), this becomes an inefficient operation. However, by default, numpy arrays are row major (`order='C'`, so that rows are contiguous in memory). As `n` becomes large, the Fortran becomes very cache inefficient, while the Python routine does not suffer from the same issue.

It is, however, very easy to change the memory layout of a numpy array, and you can do this to illustrate the problem, i.e. how fast is the Python routine, if we change the memory layout of `a` to Fortran order?:

``````def f_order():
a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)], order = 'F')
b = np.array([1 / (i + 1) for i in range(1, n + 1)])
xf = gauss(a, b)
return a, b, xf

def c_order():
a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)])
b = np.array([1 / (i + 1) for i in range(1, n + 1)])
xc = gauss(a, b)
return a, b, xc

print("order='F': ", timeit.timeit(f_order, number=3) / 3)
print("order='C': ", timeit.timeit(c_order, number=3) / 3)
``````

With `n=1024` I get:

``````order='F':  9.011245
order='C':  4.227592
``````

and with `n=2048`:

``````order='F':  84.46075615
order='C':  25.5663385
``````

The algorithm you implemented is exactly the same in Python and Fortran, and the algorithm performs full-row operations. If memory layout is row-major, the algorithm is efficient (Python with default or `order='C'`), however, if the memory layout is column-major, the algorithm is inefficient (Python with `order='F'`, or the default in Fortran).

4 Likes

For row-intensive operations, like Gauss elimination etc., one could possibly prepare a hybrid matrix, constructed as an array of pointers (embedded in a derived type, as Fortran, lamentably, does not provide regular array of pointers) to the matrix rows. That way, the row data are contiguous and all operations are done on rank-1 arrays.
Due to the aforementioned deficiency, the syntax is a bit clumsy, but it works, one can swap whole or partial rows as seen in the following super simple sample.

``````program main
implicit none
type rows
real, pointer :: row(:)
end type rows
type(rows), allocatable :: array(:)
integer :: n=8, i, j
real, allocatable :: swap(:)
real :: x,y

allocate(array(n),swap(n))
do i=1,n
allocate(array(i)%row(n))
do j=1,n
array(i)%row(j) = 100.0*i+j
end do
end do
print 10,array(1)%row  !  101.0 102.0 103.0 104.0 105.0 106.0 107.0 108.0
print 10,array(5)%row  !  501.0 502.0 503.0 504.0 505.0 506.0 507.0 508.0
swap = array(1)%row
array(1)%row = array(5)%row
array(5)%row = swap
print 10,array(1)%row  !  501.0 502.0 503.0 504.0 505.0 506.0 507.0 508.0
print 10,array(5)%row  !  101.0 102.0 103.0 104.0 105.0 106.0 107.0 108.0
swap(1:n/2) = array(1)%row(1:n/2)
array(1)%row(1:n/2) = array(5)%row(1:n/2)
array(5)%row(1:n/2) = swap(1:n/2)
print 10,array(1)%row  !  101.0 102.0 103.0 104.0 505.0 506.0 507.0 508.0
print 10,array(5)%row  !  501.0 502.0 503.0 504.0 105.0 106.0 107.0 108.0
10 format(*(f6.1))
end program main
``````

BTW, anybody knowing why actually array of pointers cannot exist in Modern Fortran?

1 Like

I think this might be because it not feasible with the current syntax:
If you define a pointer, the dimension attribute defines the shape of the array it can point to. Therefore the dimension attribute is already occupied. Maybe it would be possible to solve it with:

``````real, pointer(:) :: ptr(3, 3) ! 3x3 array with pointers to 1D arrays
print*, ptr(1, 1)(:) ! print first 1D array
``````

But this would break old code.
And what would happen, if you type this:

``````print*, ptr(2, :)(1)
``````

?

Another reason could be, that this syntax might foster slow code constructs.

1 Like

I think the main reason for the poor performance is what @jbdv said. Changing the memory access to be column major in the Fortran implementation yields a notable difference

`````` mat size:          500 timigs (col-major, row-major)   6.0472999999999999E-002   8.3461999999999995E-002
mat size:         1000 timigs (col-major, row-major)  0.17684999999999998        2.1223450000000001
mat size:         1500 timigs (col-major, row-major)  0.77685899999999997        11.889548000000000
mat size:         2000 timigs (col-major, row-major)   2.0247399999999995        31.665683999999999
mat size:         2500 timigs (col-major, row-major)   4.1311760000000035        70.841338999999991
mat size:         3000 timigs (col-major, row-major)   7.3490720000000067        121.72819000000001
mat size:         3500 timigs (col-major, row-major)   12.796781000000010        210.18912300000000
mat size:         4000 timigs (col-major, row-major)   17.554618000000005        337.45606700000002
`````` Source code
``````module gauss_m

private
public :: gauss, gauss_rm
integer, parameter :: dp = kind(1.0d0)

contains

subroutine gauss(a, b, x)
real(dp), intent(inout)  :: a(:, :), b(:), x(:)

integer :: i, k, n
real(dp), allocatable :: tmp_a(:)
real(dp)              :: tmp_b

n = size(a, 1)

do i=1, n
! Check if has to swap
if (a(i, i) == 0.0) then
do k=i+1, n
! Swap values and exit loop
if (a(i, k) /= 0.0) then
tmp_a = a(:, k);   tmp_b = b(k)
a(:, k) = a(:, i); b(k) = b(i)
a(:, i) = tmp_a;   b(i) = tmp_b
exit
end if
end do
end if
! Eliminate items in row below
do k=i+1, n
b(k) = b(k) - (b(i) / a(i, i) * a(k, i))
a(:, k) = a(:, k) - a(:, i) / a(i, i) * a(i, k)
end do
end do
! Solve
x = 0.0
do i=n, 1, -1
b(i) = b(i) - dot_product(a(:, i), x)
x(i) = b(i) / a(i, i)
end do

end subroutine gauss

subroutine gauss_rm(a, b, x)
!! row major implementation, do not use for actual solves
real(dp), intent(inout)  :: a(:, :), b(:), x(:)

integer :: i, k, n
real(dp), allocatable :: tmp_a(:)
real(dp)              :: tmp_b

n = size(a, 1)

do i=1, n
! Check if has to swap
if (a(i, i) == 0.0) then
do k=i+1, n
! Swap values and exit loop
if (a(k, i) /= 0.0) then
tmp_a = a(k, :);   tmp_b = b(k)
a(k, :) = a(i, :); b(k) = b(i)
a(i, :) = tmp_a;   b(i) = tmp_b
exit
end if
end do
end if
! Eliminate items in row below
do k=i+1, n
b(k) = b(k) - (b(i) / a(i, i) * a(k, i))
a(k, :) = a(k, :) - a(i, :) / a(i, i) * a(k, i)
end do
end do
! Solve
x = 0.0
do i=n, 1, -1
b(i) = b(i) - dot_product(a(i, :), x)
x(i) = b(i) / a(i, i)
end do

end subroutine gauss_rm

end module gauss_m

program main

use gauss_m

implicit none

integer, parameter :: dp = kind(1.0d0)
real(dp), allocatable :: a(:,:), b(:), x(:), a_rm(:,:), b_rm(:), x_rm(:)
real(dp), allocatable :: timings(:,:)
integer :: n, n_min, n_max, n_int

n_min = 500; n_max = 5000; n_int = 500;

allocate(timings(((n_max - n_min)/ n_int) + 1, 2))

do n = n_min, n_max, n_int

allocate(a(n, n), b(n), x(n))
call fill(a, b)
allocate(a_rm, source=a); allocate(b_rm, source=b); allocate(x_rm, source=x)

timings(n/n_int, 1) = timeit(a, b, x, gauss)
timings(n/n_int, 2) = timeit(a_rm, b_rm, x_rm, gauss_rm)
print*, "mat size:", n, "timigs (col-major, row-major):", timings(n/n_int, :), "rel L2 norm: ", norm2(x_rm- x)/ norm2(x)

deallocate(a, b, x, a_rm, b_rm, x_rm)
end do

print*, "          TIMINGS         "
print*, repeat("-", 80)
print*, "column major: ", timings(:, 1)
print*, "row    major: ", timings(:, 2)
print*, repeat("-", 80)

contains

subroutine fill(a, b)
real(dp), intent(out) :: a(:,:), b(:)
integer :: i, j

do j = 1, size(a,2)
do i = 1, size(a,1)
a(i,j) = 1.0_dp/real(i + j - 1,dp)
end do
end do

do i = 1, size(b)
b(i) = 1.0_dp/real(i + 1,dp)
end do

end subroutine

function timeit(a, b, x, sub) result(time)
real(dp), intent(inout) :: a(:,:), b(:), x(:)
interface
subroutine sub(a, b, x)
integer, parameter :: dp = kind(1.0d0)
real(dp), intent(inout) :: a(:,:), b(:), x(:)
end subroutine sub
end interface

real(dp) :: time
real(dp) :: starttime

call cpu_time(starttime)
call sub(a, b, x)
call cpu_time(time)
time = time - starttime
end function timeit

end program main
``````

PS thanks @ivanpribec for the fill routine I was too lazy to write one

4 Likes

A simple alternative is to use/supply the At transpose matrix.
As a simple test, the gauss could be programmed for At, and also ensure that only At(i+1:,k) is used for each step.
Based on the code provided:

``````       do k=i+1, n
b(k) = b(k) - (b(i) / a(i, i) * a(k, i))
a(k, :) = a(k, :) - a(i, :) / a(i, i) * a(k, i)
end do
``````

This could become of the form

``````       do k=i+1, n
b(k)        = b(k)        - b(i)        * ( at(i, k) / at(i, i) )
at(i+1:, k) = at(i+1:, k) - at(i+1:, i) * ( at(i, k) / at(i, i) )
end do
...
! Solve
x = 0.0
do i=n, 1, -1
b(i) = b(i) - dot_product( at(i+1:, i), x(i+1:) )
x(i) = b(i) / at(i, i)
end do
``````

This change would be more cache efficient and promote AVX from L1 cache.

2 Likes

The code shown by @gnikit creates symmetric arrays `a`, `a_rm` and it gives consistent results. I tried with randomly generated `a` and `a_rm=transpose(a)`. Unfortunately this produces different results, both with `b_rm=b` and `b_rm=b(n:1:-1)`. Any thoughts?

``````  n = 10

allocate(a(n, n), b(n), x(n))
!    call fill(a, b)
call random_number(a)
call random_number(b)
allocate(x_rm, source=x)
a_rm = transpose(a)
b_rm = b(n:1:-1)

call gauss(a, b, x)
call gauss_rm(a_rm, b_rm, x_rm)
print '(10f10.6)', x
print '(10f10.6)', x_rm
print*, "mat size:", n, "rel L2 norm: ", norm2(x_rm- x)/ norm2(x)
deallocate(a, b, x, a_rm, b_rm, x_rm)
``````
1 Like

@msz59 The original gauss routine posted has a “race” condition in the following line.
` a(k, :) = a(k, :) - a(i, :) / a(i, i) * a(k, i)`

This could be overcome by correct range or using ( daxpy )

``````  factor = a(k, i) / a(i, i)
a(k, i+1:) = a(k, i+1:) - a(i, i+1:) * factor
``````

This may resolve the problem, if not already corrected.

2 Likes

In fortran, the code executes “as if” the rhs is fully evaluated before the lhs is altered, so there is no “race” condition. Sometimes, this requires that temporary arrays be allocated in order to do that computation. This is one of the traps of fortran array notation. In simple cases, you can expect the compiler to recognize when only a simple loop is required, but this case might be a little too complicated for that, so it might be better to rewrite it just to help the compiler nonetheless. Your rewrite above of the code to access the elements my columns instead of rows, and to only access the i+1:n elements during the swap and elimination steps should be near optimal for this particular algorithm.

1 Like

In fortran, you create an array of pointers by putting the pointer inside a derived type, and creating an array of that derived type. In this case however, it would be simpler to just do the gaussian elimination by columns on the transpose of the array, as others have suggested. All of the lapack routines have an argument that specifies whether to operate on the array or on the transpose of the array. All languages show a performance preference for one those cases over the other. This example just happened to pick a matrix storage order that was optimal for python but not for fortran. If the other layout had been chosen, then fortran would have had the natural advantage.

2 Likes

This is exactly what I did in my sample above

Actually it might be worth some testing regarding performance. The array of pointers to the rows is potentially faster way to access the array than the standard rank-2 access. For years the C analog (with true array of pointers, as allowed by C) was the only standard way to pass a matrix (2D array) to another function dynamically, requiring some pre-initialization but giving performance gain at the same time. Whether the array of derived types elements containing the pointers in Fortran can also give any gain, is to be checked.

1 Like

I would say that it is never more efficient to access arrays through row (or column) pointers rather than the underlying storage order. However, if the choice is between accessing elements in the wrong order with one and the correct order with the other, then there will of course be a difference. Storage of ragged arrays is an example where the row (or column) pointers are sometimes appropriate, whether it is done by the language (with actual memory pointers) or with integer array indices. That is a case where some care is necessary in the algorithm choices to avoid the pointer references as much as possible. The packed storage of symmetrical matrices is something that can be done in three different ways, pointers, integer array offsets, and by storage order expression, (I*(I-1))/2 + J. Perhaps not surprisingly, that last option is often the best, especially if the first part of that expression can be reused and moved out of the innermost loops. The integer offset array is simply the precomputed values of the first part of that expression, OFF(I)+J. This approach has the advantage that a single offset array can be used to address many packed-storage arrays. The pointer approach has the advantage that it is closer to the hardware, but its disadvantage is that a new pointer array is required for every packed array, it cannot be reused like the integer offset array. And these days, those hardware addresses are likely to be 64-bits, while the integer addressing may only require 32-bit integer storage.

3 Likes

Wow - that’s a lot of input.
This is my first topic created here and I hadn’t thought that I’d get this quantitiy of (useful) answers.

Thank you for this.

I forgot that Fortran saves arrays column based and oviously this is a leading edge for the Python implementation.

I’ll try to consider all these tips and create a new version of my algorithm that will beat the Python code (in speed and precision ^^).

You’ll find the code here in the next days.
Thank you for these amazing tips.

5 Likes

If you want to get the top performance, I also recommend doing everything including the driver program in Fortran. Use the tips above to optimize. Make sure you try a few compilers, at least GFortran and Intel and make sure you enable all optimizations options correctly for your platform. If you post the full Fortran code, people here can help you get the best performance.

Only then add the Python wrappers and check the speed. It must be the same. If it is not, then there is also some overhead, either in the wrappers, or perhaps not all optimization options were used. So you can then focus on just the wrappers.

2 Likes