I think the comments by @JohnCampbell and @RonShepard point out the root cause of the issue. The implemented algorithm performs full-row operations on the matrix a when swapping rows (a[[i, row_ind]] = a[[row_ind, i]] in Python and the equivalent of a(k, :) = a(i, :) in Fortran). Since Fortran uses column-major memory layout (columns are contiguous in memory), each such row operation strides across memory and is therefore inefficient. By default, numpy arrays are row-major (order='C', so rows are contiguous in memory). As n grows, the Fortran routine becomes very cache-inefficient, while the Python routine does not suffer from the same issue.
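You can measure the cost of a single such row swap in isolation. Here is a minimal sketch (the array size, contents and repeat count are chosen only for illustration) that times one full-row swap on a C-ordered versus an F-ordered numpy array:
import timeit
import numpy as np

n = 2048
a_c = np.zeros((n, n), order='C')  # rows contiguous in memory
a_f = np.zeros((n, n), order='F')  # columns contiguous in memory

def swap_rows(a):
    # one full-row swap, as the pivoting step of the elimination does
    a[[0, n - 1]] = a[[n - 1, 0]]

print("row swap, order='C':", timeit.timeit(lambda: swap_rows(a_c), number=1000))
print("row swap, order='F':", timeit.timeit(lambda: swap_rows(a_f), number=1000))
The C-ordered swap touches two contiguous blocks of memory, while the F-ordered swap touches n strided elements per row, which is what makes it so much slower for large n.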
It is, however, very easy to change the memory layout of a numpy array, so we can use this to illustrate the problem: how fast is the Python routine if we change the memory layout of a to Fortran order?
import timeit
import numpy as np

# n and gauss() are as defined in the question

def f_order():
    # same matrix, but stored column-major (Fortran order)
    a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)], order='F')
    b = np.array([1 / (i + 1) for i in range(1, n + 1)])
    xf = gauss(a, b)
    return a, b, xf

def c_order():
    # default numpy layout: row-major (C order)
    a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)])
    b = np.array([1 / (i + 1) for i in range(1, n + 1)])
    xc = gauss(a, b)
    return a, b, xc

print("order='F': ", timeit.timeit(f_order, number=3) / 3)
print("order='C': ", timeit.timeit(c_order, number=3) / 3)
With n=1024 I get:
order='F': 9.011245
order='C': 4.227592
and with n=2048:
order='F': 84.46075615
order='C': 25.5663385
You implemented exactly the same algorithm in Python and Fortran, and that algorithm performs full-row operations. If the memory layout is row-major, those operations are efficient (Python with the default order='C'); if the memory layout is column-major, they are inefficient (Python with order='F', or the default in Fortran).
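If you are ever handed an F-ordered array on the Python side but still want to run the row-oriented routine efficiently, one option is to copy it to C order first. A small sketch, assuming the gauss routine from the question:
import numpy as np

def gauss_row_major(a, b):
    # Copy a to row-major (C-contiguous) memory if it is not already,
    # so the full-row operations inside gauss() hit contiguous memory.
    a = np.ascontiguousarray(a)
    return gauss(a, b)
The one-time O(n^2) copy is negligible next to the O(n^3) elimination for large n. The analogous fix on the Fortran side is to restructure the algorithm to operate on columns instead of rows, so that it matches Fortran's native layout.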