Indeed, good post. Applied to Fortran, here is what I think. For computational code, you absolutely want your code to run at maximum performance that the hardware allows. At the same time, you do not want to be hand optimizing in assembly or complicated low level hand written loop unrolling, vectorization and other rearrangement. To give a specific example using the new language Mojo, announced yesterday, you want to write and maintain (!) your code as something like this (Modular Docs - Matrix multiplication in Mojo):
fn matmul_naive(C: Matrix, A: Matrix, B: Matrix):
for m in range(C.rows):
for n in range(C.cols):
for k in range(A.cols):
C[m, n] += A[m, k] * B[k, n]
not like this:
@adaptive
fn matmul_autotune_impl(C: Matrix, A: Matrix, B: Matrix):
@parameter
fn calc_row(m: Int):
@parameter
fn calc_tile[tile_x: Int, tile_y: Int](x: Int, y: Int):
for n in range(y, y + tile_y):
@parameter
fn dot[nelts : Int](k : Int):
C[m,n] += (A.load[nelts](m,k+x) * B.load_tr[nelts](k+x,y)).reduce_add()
vectorize_unroll[nelts, tile_x // nelts, dot](tile_x)
# Instead of hardcoding to tile_size = 4, search for the fastest
# tile size by evaluting this function as tile size varies.
alias tile_size = autotune(1, 2, 4, 8, 16, 32, 64)
tile[calc_tile, nelts * tile_size, nelts*tile_size](A.cols, C.cols)
parallelize[calc_row](C.rows)
However, you absolutely want the first version to run as fast as the second. Exactly the same applies to Fortran.
What is the solution?
I think the best way I found so far is to do three things: top-down approach, bottom-up approach and squeeze the middle, as follows:
- Bottom-up approach: you build primitives that run at maximum speed at your specific hardware, something like the second approach. If the language allows it, you do this directly in the language; if it doesn’t allow it, you do it in whatever language that allows you to get it done. These primitives are hardware specific and could even be a bunch of inconsistent “hacks”, it’s whatever you can figure out.
- Top-down approach: that’s the first approach. Clean nice high level code, that looks like the math, derived from first principles, independent of hardware.
- Squeeze the middle: You then try to connect your top-down code with your bottom-up code. Ideally you have a very strong compiler that can do many optimizations and all you need to do is to hint it. It might guide the choice of your high level data structures and types. You can pollute your code with more directives if you need maximum performance today (and simplify later as your compiler gets better). It’s not a “clean” approach, but it’s an approach that works in practice.
If you need performance, in my experience you can’t just do the clean “top-down” and hope compilers will eventually optimize it. That would be like writing Python code and hope that future compilers will somehow magically make it fast. They won’t. You can always do the robust “bottom-up”, but it’s not typically maintainable (across hardware) and simple/clean. But if you do both, you get a clean top-down approach, and as your tooling gets better, you can optimize it more and more. You do the robust bottom-up and try to get it as high level and as clean as you can get. And the fragility in the middle is squeezed: there will always be some, and it’s the “messy” part of this, but it’s well defined and getting smaller over time (squeezed).
Applied to Fortran, as a user I think you want to write and maintain high level “clean” code that looks like the math, and use the compiler to squeeze the middle. However I think compilers should do a much better job to give you access to the fast primitives for a given hardware and show you in some way how your high level code connects to them (or if it connects at all) and then give you ways to guide the compiler to connect. The way you can guide the compiler is by choosing high level data structures, types, and then if you need more, perhaps some pragmas, or compiler options, or other such features. I think as a user you have to work with the compiler, you can’t treat it as a black box if you need performance.
Let me know your thoughts on this.