PyTorch, TensorFlow, and other machine learning frameworks feature multi-dimensional arrays, which are built into Fortran. Reading this blog post, I wonder if these frameworks can be optimized better than Fortran for the calculations they do, and if so, why.
A friendly introduction to machine learning compilers and optimizers
Sep 7, 2021 • Chip Huyen
For example, the compute primitive of CPUs used to be a number (scalar), the compute primitive of GPUs used to be a one-dimensional vector, whereas the compute primitive of TPUs is a two-dimensional vector (tensor). However, many CPUs these days have vector instructions and some GPUs have tensor cores, which are 2-dimensional. Performing a convolution operator on a batch of 256 images x 3 channels x 224 W x 224 H will be very different with 1-dimensional vectors compared to 2-dimensional vectors. Similarly, you’d need to take into account different L1, L2, and L3 layouts and buffer sizes to use them efficiently.
There are two ways to optimize your ML models: locally and globally. Locally is when you optimize an operator or a set of operators of your model. Globally is when you optimize the entire computation graph end-to-end.
There are standard local optimization techniques that are known to speed up your model, most of them making things run in parallel or reducing memory access on chips. Here are three of the common techniques.
- vectorization : given a loop or a nested loop, and instead of executing it one item at a time, use hardware primitives to operate on multiple elements that are contiguous in memory.
- parallelization : given an input array (or n-dimensional array), divide it into different, independent work chunks, and do the operation on each chunk individually.
- loop tiling : change the data accessing order in a loop to leverage hardware’s memory layout and cache. This kind of optimization is hardware dependent. A good access pattern on CPUs is not a good access pattern on GPUs. See visualization below by Colfax Research.
- operator fusion : fuse multiple operators into one to avoid redundant memory access. For example, two operations on the same array require two loops over that array; in a fused case – it is just a single loop. See an example below by Matthias Boehm.