Machine learning compilers and optimizers

PyTorch, TensorFlow, and other machine learning frameworks are built around multi-dimensional arrays, which are a built-in feature of Fortran. Reading this blog post, I wonder whether these frameworks can be optimized better than Fortran for the calculations they do, and if so, why.

A friendly introduction to machine learning compilers and optimizers
Sep 7, 2021 • Chip Huyen

For example, the compute primitive of CPUs used to be a number (scalar), the compute primitive of GPUs used to be a one-dimensional vector, whereas the compute primitive of TPUs is a two-dimensional vector (tensor). However, many CPUs these days have vector instructions and some GPUs have tensor cores, which are 2-dimensional. Performing a convolution operator on a batch of 256 images x 3 channels x 224 W x 224 H will be very different with 1-dimensional vectors compared to 2-dimensional vectors. Similarly, you’d need to take into account different L1, L2, and L3 layouts and buffer sizes to use them efficiently.

There are two ways to optimize your ML models: locally and globally. Locally is when you optimize an operator or a set of operators of your model. Globally is when you optimize the entire computation graph end-to-end.

There are standard local optimization techniques that are known to speed up your model, most of them making things run in parallel or reducing memory access on chips. Here are four of the common techniques.

  • vectorization : given a loop or nested loop, instead of executing it one item at a time, use hardware primitives to operate on multiple elements that are contiguous in memory.
  • parallelization : given an input array (or n-dimensional array), divide it into different, independent work chunks, and do the operation on each chunk individually.
  • loop tiling : change the data accessing order in a loop to leverage hardware’s memory layout and cache. This kind of optimization is hardware dependent. A good access pattern on CPUs is not a good access pattern on GPUs. See visualization below by Colfax Research.
  • operator fusion : fuse multiple operators into one to avoid redundant memory access. For example, two operations on the same array require two loops over that array; in the fused case, it is just a single loop. See an example below by Matthias Boehm.
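To make the first technique concrete, here is a small sketch of vectorization in Python, using NumPy only as a stand-in for hardware vector instructions (the names `scale_add_scalar` and `scale_add_vectorized` are illustrative, not from any compiler):

```python
import numpy as np

# Scalar version: one element at a time, like a naive compiled loop.
def scale_add_scalar(x, y, a):
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

# Vectorized version: the whole-array expression is dispatched to loops
# that operate on multiple contiguous elements at once (SIMD on CPUs).
def scale_add_vectorized(x, y, a):
    return a * x + y

x = np.arange(8, dtype=np.float64)
y = np.ones(8)
assert np.allclose(scale_add_scalar(x, y, 2.0), scale_add_vectorized(x, y, 2.0))
```

Both versions compute the same `a*x + y`; the vectorized one differs only in how the loop is executed.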
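And a minimal sketch of operator fusion, with illustrative function names (in a real compiler this rewrite happens on the intermediate representation, not in source):

```python
# Unfused: two operators mean two passes over the array, and the
# intermediate `tmp` is written to and read back from memory.
def unfused(xs):
    tmp = [x + 1.0 for x in xs]      # operator 1: add
    return [t * 2.0 for t in tmp]    # operator 2: multiply

# Fused: one operator, a single pass; the intermediate value stays
# in a register and never touches memory.
def fused(xs):
    return [(x + 1.0) * 2.0 for x in xs]

data = [0.0, 1.0, 2.0, 3.0]
assert fused(data) == unfused(data)  # same result, one loop instead of two
```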

Yes, deciding which optimization passes to apply, with what parameters, and in what order is an optimization problem in itself. So machine learning and other optimization techniques should work.

For LFortran I was hoping to write a pass that isolates a given subroutine and automatically creates a test program/benchmark for it, filling in the input and output arrays (possibly from an actual run, to ensure things work). Or the user can fill them in. Once the benchmark runs, the compiler can then try to figure out the “optimal” optimizations to apply to get the fastest results.
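A minimal sketch of the search such a benchmark enables. Everything here is hypothetical: the two `variant_*` functions stand in for the same subroutine compiled with different pass/flag sets, and the timing loop stands in for the generated benchmark program:

```python
import timeit

# Hypothetical "optimization variants" of one subroutine (sum of squares);
# in the real setting each would be the subroutine compiled differently.
def variant_naive(xs):
    total = 0.0
    for x in xs:
        total += x * x
    return total

def variant_fused(xs):
    return sum(x * x for x in xs)

def pick_fastest(variants, data, repeats=3):
    """Time each variant on the benchmark input and return the name of
    the fastest one, checking that all variants agree on the result."""
    reference = variants[0](data)
    best_name, best_time = None, float("inf")
    for v in variants:
        assert abs(v(data) - reference) <= 1e-6 * (abs(reference) + 1.0)
        t = min(timeit.timeit(lambda: v(data), number=50) for _ in range(repeats))
        if t < best_time:
            best_name, best_time = v.__name__, t
    return best_name

winner = pick_fastest([variant_naive, variant_fused], [float(i) for i in range(1000)])
```

The correctness check against a reference run mirrors the idea of filling the arrays from an actual run: an "optimization" that changes the answer is rejected outright.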

If anybody is interested in exploring this and writing a prototype, let me know. As a start, we would write the benchmark by hand and then start applying optimizations, beginning with LLVM's. Once we have more optimizations in LFortran itself, we can optimize those too.

I might try a slightly different approach. Compile in multiple differently optimized versions of various procedures and have the ML pick which versions to use at runtime. Once the user has run their program enough times with a sufficient sampling of real inputs, they can recompile their code with only the versions determined to be fastest by the ML. (Actually you don’t really need ML at that point, just a sampling algorithm).
