Writing slower Go programs — Bitfield Consulting

zaikunzhang · May 3, 2023, 8:47am

The following article is not about Fortran, but it is worth reading. It is a good reference when making a choice between, e.g., “hand-crafted optimized loops” and “compact expressive array operations”.

With generative AI like ChatGPT, formulating and describing algorithms in the most natural and readable way, either in human languages or in programming languages, is becoming increasingly and overwhelmingly more important than writing manually optimized code, the latter of which will soon become completely worthless (sorry to say it in this gross way). This is my personal view and I hope I am entirely mistaken.

Surely, GO language is an animal that is completely different from Fortran. But again, in the dawn of the rise of AI, it is becoming less and less important which language is used for the code. I expect that the translation between languages will become trivial in a few years (not now, not yet).

Thanks.

The article is here: Writing slower Go programs — Bitfield Consulting . Below is a quote of several paragraphs.

Readability beats speed

Here’s the thing. Most programs spend a good deal more time being read than they do being executed. And programmer time is a lot more expensive than CPU time (good luck finding a developer for five cents an hour). So doesn’t it make sense to optimise programs for readability instead of raw speed?

Of course, we’d like our programs to be readable as well as fast, and that’s often possible. But when our engineer’s spidey-sense starts twitching, and telling us “Hey, I could make this a few nanoseconds faster per call by using an array instead of a map, and computing the index using the low-order bits of the key”, it’s important to think about what we’re giving up in exchange for the nanoseconds.

The cost of everything and the value of nothing

Every optimisation is a trade-off. What we’re losing when we trade off against speed and memory is usually:

Brevity. There’s more code, so it’s harder to read.

Clarity. It’s no longer obvious how this works.

Simplicity. More functions, more abstractions, more indirection, more data structures to read and understand.

Simplicity is part of the Tao of Go; it’s built into the language design, and our programs should inherit the same values. Clear is better than clever.

A great way to teach yourself the value of readability is to try to read someone else’s code. Often as developers we don’t do a lot of this. We maybe either wrote the code ourselves, or were in the room while it happened, so we have a pretty good idea what it’s all for and how it all fits together. We may even have a sense of how it’s changed over time due to shifting requirements and feature updates.

JeffH · May 3, 2023, 9:49am

This is a wonderful post, and I think it applies to lots of contexts.

certik · May 3, 2023, 2:30pm

Indeed, good post. Applied to Fortran, here is what I think. For computational code, you absolutely want your code to run at maximum performance that the hardware allows. At the same time, you do not want to be hand optimizing in assembly or complicated low level hand written loop unrolling, vectorization and other rearrangement. To give a specific example using the new language Mojo, announced yesterday, you want to write and maintain (!) your code as something like this (Modular Docs - Matrix multiplication in Mojo):

fn matmul_naive(C: Matrix, A: Matrix, B: Matrix):
    for m in range(C.rows):
        for n in range(C.cols):
            for k in range(A.cols):
                C[m, n] += A[m, k] * B[k, n]

not like this:

@adaptive
fn matmul_autotune_impl(C: Matrix, A: Matrix, B: Matrix):
    @parameter
    fn calc_row(m: Int):
        @parameter
        fn calc_tile[tile_x: Int, tile_y: Int](x: Int, y: Int):
            for n in range(y, y + tile_y):
                @parameter
                fn dot[nelts : Int](k : Int):
                    C[m,n] += (A.load[nelts](m,k+x) * B.load_tr[nelts](k+x,y)).reduce_add()
                vectorize_unroll[nelts, tile_x // nelts, dot](tile_x)

        # Instead of hardcoding to tile_size = 4, search for the fastest 
        # tile size by evaluting this function as tile size varies.
        alias tile_size = autotune(1, 2, 4, 8, 16, 32, 64)
        tile[calc_tile, nelts * tile_size, nelts*tile_size](A.cols, C.cols)
      
    parallelize[calc_row](C.rows)

However, you absolutely want the first version to run as fast as the second. Exactly the same applies to Fortran.

What is the solution?

I think the best way I found so far is to do three things: top-down approach, bottom-up approach and squeeze the middle, as follows:

Bottom-up approach: you build primitives that run at maximum speed at your specific hardware, something like the second approach. If the language allows it, you do this directly in the language; if it doesn’t allow it, you do it in whatever language that allows you to get it done. These primitives are hardware specific and could even be a bunch of inconsistent “hacks”, it’s whatever you can figure out.
Top-down approach: that’s the first approach. Clean nice high level code, that looks like the math, derived from first principles, independent of hardware.
Squeeze the middle: You then try to connect your top-down code with your bottom-up code. Ideally you have a very strong compiler that can do many optimizations and all you need to do is to hint it. It might guide the choice of your high level data structures and types. You can pollute your code with more directives if you need maximum performance today (and simplify later as your compiler gets better). It’s not a “clean” approach, but it’s an approach that works in practice.

If you need performance, in my experience you can’t just do the clean “top-down” and hope compilers will eventually optimize it. That would be like writing Python code and hope that future compilers will somehow magically make it fast. They won’t. You can always do the robust “bottom-up”, but it’s not typically maintainable (across hardware) and simple/clean. But if you do both, you get a clean top-down approach, and as your tooling gets better, you can optimize it more and more. You do the robust bottom-up and try to get it as high level and as clean as you can get. And the fragility in the middle is squeezed: there will always be some, and it’s the “messy” part of this, but it’s well defined and getting smaller over time (squeezed).

Applied to Fortran, as a user I think you want to write and maintain high level “clean” code that looks like the math, and use the compiler to squeeze the middle. However I think compilers should do a much better job to give you access to the fast primitives for a given hardware and show you in some way how your high level code connects to them (or if it connects at all) and then give you ways to guide the compiler to connect. The way you can guide the compiler is by choosing high level data structures, types, and then if you need more, perhaps some pragmas, or compiler options, or other such features. I think as a user you have to work with the compiler, you can’t treat it as a black box if you need performance.

Let me know your thoughts on this.

PierU · May 5, 2023, 10:38pm

Actually, whatever the language, even a high level one with compact and abstract notations, an uncommented code that you have not written is almost always hard to read.

zaikunzhang · May 6, 2023, 2:36am

Absolutely. No doubt about that. One should endeavor to improve the readability and maintainability of his/her code by all means. Carefully documenting the code and writing “slower”/“non-optimized” code (in the sense as elaborated in the aforementioned article or as illustrated above by @certik by his excellent example) are two means that facilitate each other.

Topic		Replies	Views
Rewriting Fortran software in Rust	14	3196	July 22, 2020
Zen of Fortran	13	1951	December 8, 2021
79 Languages speed competition: Can we make Fortran win?	43	4287	January 15, 2023
Julia: Fast as Fortran, Beautiful as Python	184	11805	November 13, 2022
Resistance to modernization	108	5444	September 7, 2022

Writing slower Go programs — Bitfield Consulting

Related topics