Gauss Algorithm written in Fortran slower than written in Python for large matrices?

By the way, just curious, has anyone tried running the same code on a Mac M1?

Even the regular M1 chip’s memory bandwidth is around 70 GB/s, which is 2x my DDR4 2666’s 35 GB/s.
The high-end M1 Max has 400 GB/s of bandwidth.

If this algorithm’s bottleneck is memory operations, I guess the M1 chip may be about 2 times faster than a PC like mine with DDR4 2666, even without any optimization.
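Whether memory traffic really is the limiter can be sanity-checked with a simple copy microbenchmark. A rough NumPy sketch (the helper name is my own, and a plain copy is only a ballpark proxy for a proper STREAM-style measurement):

```python
import time
import numpy as np

def estimate_copy_bandwidth_gbs(n_bytes=400_000_000, repeats=5):
    """Rough effective memory bandwidth from a large array copy.

    A copy streams each byte once in and once out, so one pass moves
    about 2 * n_bytes of traffic.  This is only a ballpark figure,
    not a proper STREAM benchmark.
    """
    src = np.ones(n_bytes // 8)      # float64, so ~n_bytes of data
    dst = np.empty_like(src)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.copyto(dst, src)          # pure read + write, no arithmetic
        best = min(best, time.perf_counter() - t0)
    return 2 * src.nbytes / best / 1e9

print(f"approximate copy bandwidth: {estimate_copy_bandwidth_gbs():.1f} GB/s")
```

If the number this reports on the M1 is roughly double what it reports on a DDR4 2666 machine, that would support the bandwidth explanation.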


I used an M1 chip for computing, comparing my posted code with this program:

import timeit

import numpy as np

import gauss    # the pure-Python implementation posted earlier
import gauss_m  # the f2py-compiled Fortran module


def benchmark_func_py():
    a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)])
    b = np.array([1 / (i + 1) for i in range(1, n + 1)])
    return gauss.gauss(a, b)


def benchmark_func_fortran():
    a = np.asfortranarray([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)])
    b = np.asfortranarray([1 / (i + 1) for i in range(1, n + 1)])
    x = np.zeros(n)
    gauss_m.gauss(a, b, x)
    return x


def benchmark_func_numpy():
    a = np.array([[1 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)])
    b = np.array([1 / (i + 1) for i in range(1, n + 1)])
    return np.linalg.solve(a, b)


def benchmark_main():
    py_time = timeit.timeit(benchmark_func_py, globals=globals(), number=1)
    f_time = timeit.timeit(benchmark_func_fortran, globals=globals(), number=1)
    np_time = timeit.timeit(benchmark_func_numpy, globals=globals(), number=1)
    print(f"Duration Python:  {py_time}s")
    print(f"Duration Fortran: {f_time}s")
    print(f"Duration Numpy:   {np_time}s")


if __name__ == "__main__":
    n = 2001
    benchmark_main()

I got this output:

Duration Python:  8.193421584s
Duration Fortran: 6.235588208999999s
Duration Numpy:   0.4742914580000015s

(Only 1 run per timing, so there can be significant differences if you try it again.)
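For context, the `gauss` routine being timed is not shown in this post. A minimal NumPy Gaussian elimination with partial pivoting along these lines could stand in for it (this is a sketch with a name of my own choosing, not the exact code benchmarked above):

```python
import numpy as np

def gauss_solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    a = a.astype(float)  # work on copies so the inputs survive
    b = b.astype(float)
    n = b.size
    for k in range(n - 1):
        # partial pivoting: bring the largest entry of column k to the diagonal
        p = k + np.argmax(np.abs(a[k:, k]))
        if p != k:
            a[[k, p]] = a[[p, k]]
            b[[k, p]] = b[[p, k]]
        # eliminate everything below the pivot with one vectorized row update
        m = a[k + 1:, k] / a[k, k]
        a[k + 1:, k:] -= np.outer(m, a[k, k:])
        b[k + 1:] -= m * b[k]
    # back substitution
    x = np.empty(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - a[i, i + 1:] @ x[i + 1:]) / a[i, i]
    return x
```

The gap to `np.linalg.solve` in the timings above is expected: LAPACK uses blocked, cache-friendly kernels that a straightforward elimination loop does not.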

I don’t know if this output helps you. Obviously it’s difficult to compare computers like this (without considering other hardware and software aspects).

Please note that the code has not been optimized since my post.


Thank you @rnoffke .
Sorry for a lazy question, but would you mind posting the whole Jupyter Notebook here?
I tried the link to the Jupyter notebook you posted,

and I also

pip install gauss

in my anaconda 3.

But when I run the Jupyter notebook I get the error message below,

I am using anaconda 3 on windows 10.

By the way, I just bought a MacBook Air with 16 GB of RAM.
Here is what I found for a memory-heavy Fortran code I wrote (it frequently operates on big arrays and matrices and consumes 6 - 12 GB of memory):
the M1 chip with gfortran and -O3 -march=native took 30 seconds,
while my Thinkpad P72 with an Intel Xeon 2186M and 64 GB of DDR4 2666 ECC, using Intel oneAPI with -O3 -xHost, took 70 seconds.

The Air’s M1 has a memory bandwidth of 68.5 GB/s, while my Thinkpad has about 35 GB/s, so it is roughly a factor of 2 difference. I believe the speedup of my code is due to the memory speed difference between the M1 and my Thinkpad’s DDR4 2666.

For smaller matrices and arrays, the speed difference becomes smaller.


@CRquantum Can’t you access all the files of the notebook?

I haven’t shared a Notebook like this before.
And honestly I don’t know if my JetBrains license allows sharing the always-up-to-date Notebook, including the files, here.

But I’ll send you the files.


Thanks Rene, I received the code you sent, and I did

f2py -c gauss.f90 -m gauss_pkg

I actually did

f2py -c gauss.f90 -m gauss_pkg --f90flags='-O3 -march=native'

On Mac Air M1,
I installed conda via Miniforge because the Miniforge version seems to support the native M1 chip.
Here is the result from M1,

Hmm, the reported times in seconds do not seem right, though. It actually took the M1 Air about 1 to 1.5 hours to finish the run.

Ron,

I am trying to understand your statement regarding the efficiency of Fortran array operations.
If I were to approach this, I would make a copy of the diagonal value only, not a copy of the whole row.
Are you implying some inefficiency in Fortran array operations if temporary arrays are created and the results updated at the end of the operation?
Would you apply the approach “the rhs is fully evaluated before the lhs is altered” to any daxpy (y = y + a*x) vector operation?

It is a quality of implementation issue. The result must be “as if” the rhs is fully evaluated, but if the compiler can figure out how to do the operation more efficiently, it can do so. The only way to know what a compiler does is to look at the code that it generates (or in the case of gfortran, you can look at the intermediate code, which might be easier to understand). If the compiler can’t figure out any good optimizations, then yes, work arrays must be allocated, used, and then deallocated, which makes the operation expensive. If the programmer can help the compiler, then it is more likely to result in efficient code.