Writing slower Go programs — Bitfield Consulting

The following article is not about Fortran, but it is worth reading: it is a good reference when choosing between, say, “hand-crafted optimized loops” and “compact, expressive array operations”.

With generative AI like ChatGPT, formulating and describing algorithms in the most natural and readable way, whether in human languages or in programming languages, is becoming far more important than writing manually optimized code, which I suspect will soon be worthless (sorry to put it so bluntly). This is my personal view, and I hope I am entirely mistaken.

Of course, Go is a completely different animal from Fortran. But again, with the rise of AI, which language the code is written in matters less and less. I expect that translation between languages will become trivial in a few years (not now, not yet).

Thanks.

The article is here: Writing slower Go programs — Bitfield Consulting. Below is a quote of several paragraphs.

5 Likes

This is a wonderful post, and I think it applies to lots of contexts.

2 Likes

Indeed, good post. Applied to Fortran, here is what I think. For computational code, you absolutely want your code to run at the maximum performance the hardware allows. At the same time, you do not want to be hand-optimizing in assembly, or writing complicated low-level loop unrolling, vectorization, and other rearrangements by hand. To give a specific example using Mojo, the new language announced yesterday: you want to write and maintain (!) your code as something like this (Modular Docs - Matrix multiplication in Mojo):

fn matmul_naive(C: Matrix, A: Matrix, B: Matrix):
    for m in range(C.rows):
        for n in range(C.cols):
            for k in range(A.cols):
                C[m, n] += A[m, k] * B[k, n]
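For comparison, the same naive algorithm reads almost identically in Go, the language the quoted article is about. This is my own sketch, using a hypothetical flat, row-major `Matrix` type (not from the article or the Mojo docs):

```go
package main

import "fmt"

// Matrix is a hypothetical dense row-major matrix, just for illustration.
type Matrix struct {
	rows, cols int
	data       []float64
}

func (m *Matrix) At(i, j int) float64      { return m.data[i*m.cols+j] }
func (m *Matrix) Add(i, j int, v float64)  { m.data[i*m.cols+j] += v }

// matmulNaive is a direct transcription of the triple loop above:
// clean, readable, and independent of any hardware details.
func matmulNaive(C, A, B *Matrix) {
	for m := 0; m < C.rows; m++ {
		for n := 0; n < C.cols; n++ {
			for k := 0; k < A.cols; k++ {
				C.Add(m, n, A.At(m, k)*B.At(k, n))
			}
		}
	}
}

func main() {
	A := &Matrix{2, 2, []float64{1, 2, 3, 4}}
	B := &Matrix{2, 2, []float64{5, 6, 7, 8}}
	C := &Matrix{2, 2, make([]float64, 4)}
	matmulNaive(C, A, B)
	fmt.Println(C.data) // [19 22 43 50]
}
```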

not like this:

@adaptive
fn matmul_autotune_impl(C: Matrix, A: Matrix, B: Matrix):
    @parameter
    fn calc_row(m: Int):
        @parameter
        fn calc_tile[tile_x: Int, tile_y: Int](x: Int, y: Int):
            for n in range(y, y + tile_y):
                @parameter
                fn dot[nelts : Int](k : Int):
                    C[m,n] += (A.load[nelts](m,k+x) * B.load_tr[nelts](k+x,y)).reduce_add()
                vectorize_unroll[nelts, tile_x // nelts, dot](tile_x)

        # Instead of hardcoding to tile_size = 4, search for the fastest 
        # tile size by evaluating this function as tile size varies.
        alias tile_size = autotune(1, 2, 4, 8, 16, 32, 64)
        tile[calc_tile, nelts * tile_size, nelts*tile_size](A.cols, C.cols)
      
    parallelize[calc_row](C.rows)

However, you absolutely want the first version to run as fast as the second. Exactly the same applies to Fortran.

What is the solution?

The best way I have found so far combines three things: a top-down approach, a bottom-up approach, and squeezing the middle, as follows:

  • Bottom-up approach: you build primitives that run at maximum speed on your specific hardware, something like the second version above. If the language allows it, you do this directly in the language; if not, you do it in whatever language lets you get it done. These primitives are hardware-specific and may even be a collection of inconsistent “hacks”; whatever you can figure out.
  • Top-down approach: that is the first version above. Clean, high-level code that looks like the math, derived from first principles, independent of the hardware.
  • Squeeze the middle: you then try to connect your top-down code with your bottom-up code. Ideally you have a very strong compiler that can do many optimizations, so that all you need to do is give it hints. This may guide the choice of your high-level data structures and types. You can pollute your code with more directives if you need maximum performance today (and simplify later as your compiler gets better). It is not a “clean” approach, but it is one that works in practice.
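A toy Go illustration of the two ends (my own sketch, not from the article): the top-down version is the one you want to read and maintain; the bottom-up version is the kind of hardware-minded primitive you hope the compiler, or a library, eventually provides for you.

```go
package main

import "fmt"

// sumTopDown is the clean, "looks like the math" version.
func sumTopDown(xs []float64) float64 {
	total := 0.0
	for _, x := range xs {
		total += x
	}
	return total
}

// sumBottomUp manually unrolls the loop into four accumulators, a typical
// hand optimization that breaks the dependency chain so the CPU can
// overlap the additions. Note that with floating point this changes the
// order of summation, which is exactly the kind of detail you do not want
// scattered through maintained code.
func sumBottomUp(xs []float64) float64 {
	var s0, s1, s2, s3 float64
	i := 0
	for ; i+4 <= len(xs); i += 4 {
		s0 += xs[i]
		s1 += xs[i+1]
		s2 += xs[i+2]
		s3 += xs[i+3]
	}
	for ; i < len(xs); i++ {
		s0 += xs[i]
	}
	return s0 + s1 + s2 + s3
}

func main() {
	xs := []float64{1, 2, 3, 4, 5, 6, 7}
	fmt.Println(sumTopDown(xs), sumBottomUp(xs)) // 28 28
}
```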

If you need performance, in my experience you cannot just write the clean top-down code and hope that compilers will eventually optimize it. That would be like writing Python and hoping that future compilers will somehow magically make it fast. They will not. You can always do the robust bottom-up work, but it is typically neither maintainable across hardware nor simple and clean. If you do both, however, you get a clean top-down layer that you can optimize more and more as your tooling improves, and a robust bottom-up layer that you try to make as high-level and clean as you can. The fragility in the middle gets squeezed: there will always be some, and it is the “messy” part of this, but it is well defined and shrinks over time (hence “squeezed”).

Applied to Fortran: as a user, I think you want to write and maintain high-level “clean” code that looks like the math, and use the compiler to squeeze the middle. However, compilers should do a much better job of giving you access to the fast primitives for a given piece of hardware, showing you how your high-level code connects to them (or whether it connects at all), and giving you ways to guide that connection. You can guide the compiler by choosing high-level data structures and types, and, if you need more, with pragmas, compiler options, or other such features. As a user you have to work with the compiler; you cannot treat it as a black box if you need performance.
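One concrete example of guiding the compiler through data structure choice, sketched in Go (my own illustration, not from the article): storing a matrix as one contiguous flat slice instead of a slice of slices gives the compiler and hardware a predictable, cache-friendly access pattern, without changing the shape of the high-level loop at all.

```go
package main

import "fmt"

// rowSumsNested walks a slice-of-slices: each row is a separate heap
// allocation, so rows may be scattered in memory and each access chases
// an extra pointer.
func rowSumsNested(m [][]float64) []float64 {
	sums := make([]float64, len(m))
	for i, row := range m {
		for _, v := range row {
			sums[i] += v
		}
	}
	return sums
}

// rowSumsFlat walks one contiguous buffer: the same high-level loop, but
// the memory layout is sequential and cache-friendly, and the slicing
// pattern makes the bounds predictable.
func rowSumsFlat(data []float64, rows, cols int) []float64 {
	sums := make([]float64, rows)
	for i := 0; i < rows; i++ {
		for _, v := range data[i*cols : (i+1)*cols] {
			sums[i] += v
		}
	}
	return sums
}

func main() {
	nested := [][]float64{{1, 2}, {3, 4}}
	flat := []float64{1, 2, 3, 4}
	fmt.Println(rowSumsNested(nested))   // [3 7]
	fmt.Println(rowSumsFlat(flat, 2, 2)) // [3 7]
}
```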

Let me know your thoughts on this.

1 Like

Actually, whatever the language, even a high-level one with compact and abstract notation, uncommented code that you have not written yourself is almost always hard to read.

2 Likes

Absolutely, no doubt about that. One should endeavor to improve the readability and maintainability of one's code by all means. Carefully documenting the code and writing “slower”/“non-optimized” code (in the sense elaborated in the aforementioned article, or as illustrated above by @certik in his excellent example) are two means that reinforce each other.

1 Like