Does anyone have memory bandwidth numbers handy for RAM, level-2 cache, and level-1 cache for some of the current Intel, AMD, and ARM CPUs?
Typically, if there is only one floating-point operation per memory fetch, RAM bandwidth determines the performance. For operations like matrix-matrix products, however, each fetched value is reused on the order of N times, where N is the matrix dimension, so cache bandwidth determines the performance instead.
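To make the reuse argument concrete, here is a rough sketch in Python of the flops-per-byte ratio for the two cases. The function names, the 8-byte (double precision) element size, and the "ideal traffic" assumption that each matrix crosses the memory bus only once are my own simplifications, not measurements from any particular CPU. The last function is a simple roofline-style bound and expects whatever peak and bandwidth figures you have for your own machine.

```python
def vector_add_intensity(n, bytes_per_elem=8):
    """Flops per byte for z[i] = x[i] + y[i] on vectors of length n.

    Assumes nothing is reused: 2n values are read and n are written.
    """
    flops = n
    bytes_moved = 3 * n * bytes_per_elem
    return flops / bytes_moved


def matmul_intensity(n, bytes_per_elem=8):
    """Flops per byte for C = A * B with n-by-n matrices.

    Assumes ideal reuse: A and B are each read once and C is written
    once, so main-memory traffic is 3*n*n elements for 2*n**3 flops.
    """
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * bytes_per_elem
    return flops / bytes_moved


def attainable_gflops(intensity, peak_gflops, bandwidth_gb_per_s):
    """Roofline-style upper bound on performance.

    Performance is limited either by the arithmetic peak or by how
    fast the memory level in question can deliver operands.
    """
    return min(peak_gflops, intensity * bandwidth_gb_per_s)
```

Under these assumptions the vector add stays at 1/24 flop per byte no matter how long the vectors are, while the matrix product gives n/12 flops per byte, so it grows linearly with N. That is why the first case is pinned to RAM bandwidth and the second depends on how quickly the caches can feed the floating-point units.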