I think the comments by @JohnCampbell and @RonShepard point out the root cause of the issue. The implemented algorithm performs full-row operations on matrix `a` when swapping rows (`a[[i, row_ind]] = a[[row_ind, i]]` in Python and the equivalent of `a(k, :) = a(i, :)` in Fortran). Since Fortran uses a column-major memory layout (columns are contiguous in memory), this is an inefficient operation. By default, however, NumPy arrays are row-major (`order='C'`, so rows are contiguous in memory). As `n` becomes large, the Fortran routine becomes very cache-inefficient, while the Python routine does not suffer from the same issue.
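To see the layout difference directly: a row of a C-ordered array is a single contiguous block of memory, while the same row of an F-ordered array is strided (one element per column), so a full-row swap has to gather and scatter across the whole array. A small illustration (the `swap_rows` helper and the sizes are my own, not from the code under discussion):

```python
import numpy as np
import timeit

n = 1024
rng = np.random.default_rng(0)
a_c = rng.random((n, n))          # row-major (default, order='C'): rows contiguous
a_f = np.asfortranarray(a_c)      # column-major (order='F'): columns contiguous

# A row of the C-ordered array is contiguous; the same row of the
# F-ordered array is strided, touching one element per column.
print(a_c[0].flags['C_CONTIGUOUS'])   # True
print(a_f[0].flags['C_CONTIGUOUS'])   # False

def swap_rows(a):
    # the same kind of full-row swap used in the solver
    a[[0, n - 1]] = a[[n - 1, 0]]

t_c = timeit.timeit(lambda: swap_rows(a_c), number=100)
t_f = timeit.timeit(lambda: swap_rows(a_f), number=100)
print(f"row swap, order='C': {t_c:.4f}s   order='F': {t_f:.4f}s")
```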

It is, however, very easy to change the memory layout of a NumPy array, and you can use this to illustrate the problem: how fast is the Python routine if we change the memory layout of `a` to Fortran order?

```
def f_order():
    a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)], order='F')
    b = np.array([1 / (i + 1) for i in range(1, n + 1)])
    xf = gauss(a, b)
    return a, b, xf

def c_order():
    a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)])
    b = np.array([1 / (i + 1) for i in range(1, n + 1)])
    xc = gauss(a, b)
    return a, b, xc

print("order='F': ", timeit.timeit(f_order, number=3) / 3)
print("order='C': ", timeit.timeit(c_order, number=3) / 3)
```
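For context, the `gauss` routine timed here is assumed to look roughly like this — a minimal sketch of Gaussian elimination with partial pivoting and the full-row swap discussed above, operating in place; the actual implementation under discussion may differ in details:

```python
import numpy as np

def gauss(a, b):
    """Gaussian elimination with partial pivoting (in place).

    A minimal stand-in for the routine under discussion; note the
    full-row swap, which touches an entire row of `a` at once.
    """
    n = len(b)
    for k in range(n - 1):
        # partial pivoting: bring the largest pivot into row k
        p = k + np.argmax(np.abs(a[k:, k]))
        if p != k:
            a[[k, p]] = a[[p, k]]   # full-row swap: the layout-sensitive step
            b[[k, p]] = b[[p, k]]
        # eliminate below the pivot
        for i in range(k + 1, n):
            m = a[i, k] / a[k, k]
            a[i, k:] -= m * a[k, k:]
            b[i] -= m * b[k]
    # back substitution
    x = np.empty(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - a[i, i + 1:] @ x[i + 1:]) / a[i, i]
    return x
```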

With `n=1024` I get:

```
order='F': 9.011245
order='C': 4.227592
```

and with `n=2048`:

```
order='F': 84.46075615
order='C': 25.5663385
```

The algorithm you implemented is exactly the same in Python and Fortran, and it performs full-row operations. If the memory layout is row-major, the algorithm is efficient (Python with the default `order='C'`); if the memory layout is column-major, it is inefficient (Python with `order='F'`, or the default in Fortran).
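One layout-agnostic way to remove the swap cost (in either language) is to never move rows at all: keep a permutation index and pivot through it, so pivoting exchanges two integers instead of two full rows. This is my own illustrative sketch, not the original code — and note the elimination updates are still row-wise, so a Fortran version would additionally want a column-oriented inner loop:

```python
import numpy as np

def gauss_perm(a, b):
    """Gaussian elimination that pivots via an index vector.

    Instead of physically swapping rows (a memory-layout-sensitive
    operation), the row order is tracked in `perm`.
    """
    n = len(b)
    perm = np.arange(n)   # perm[k] = physical row holding logical row k
    for k in range(n - 1):
        # choose the pivot among the not-yet-eliminated logical rows
        p = k + np.argmax(np.abs(a[perm[k:], k]))
        perm[[k, p]] = perm[[p, k]]   # swap two integers, not two rows
        rk = perm[k]
        for i in perm[k + 1:]:
            m = a[i, k] / a[rk, k]
            a[i, k:] -= m * a[rk, k:]
            b[i] -= m * b[rk]
    # back substitution through the permutation
    x = np.empty(n)
    for k in range(n - 1, -1, -1):
        r = perm[k]
        x[k] = (b[r] - a[r, k + 1:] @ x[k + 1:]) / a[r, k]
    return x
```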