I was thinking of the thread "Does LAPACK/BLAS automatically use multi cores or threads?", but they both covered some of the same ground.
Yes, synthetic benchmarks can be unrealistic. I've seen compilers remove entire do loops from benchmarks because they recognize the loops as dead code. But in general I've found that what one learns from tuning and optimizing benchmarks also applies to optimizing production codes.
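To make the dead-code point concrete, here is a minimal sketch in C (the same idea applies to a Fortran do loop). The function names `sum_of_squares` and `timed_sum_of_squares` are made up for illustration. If a benchmark computes a value and never uses it, an optimizing compiler is allowed to drop the whole loop, so the timing measures nothing. Storing the result in a `volatile` variable and handing it back to the caller is one common way to keep the result live. It does not stop the compiler from simplifying the loop in other ways, so treat it as a guard against elimination only, not a guarantee of a realistic measurement.

```c
#include <stddef.h>
#include <time.h>

/* Kernel under test (hypothetical): sum of i*i for i = 1..n. */
double sum_of_squares(size_t n)
{
    double s = 0.0;
    for (size_t i = 1; i <= n; ++i) {
        s += (double)i * (double)i;
    }
    return s;
}

/*
 * Time one call of the kernel with clock() (CPU time) and return the
 * elapsed seconds. The kernel's value is written to a volatile variable,
 * which the compiler must actually store, and is then passed back through
 * *result so the caller can print or check it.
 */
double timed_sum_of_squares(size_t n, double *result)
{
    clock_t start = clock();
    volatile double sink = sum_of_squares(n);
    clock_t stop = clock();

    *result = sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

In a real benchmark I would call `timed_sum_of_squares` with a size read at run time (for example from the command line) and print both the elapsed time and the result, so that neither the loop nor its input can be resolved at compile time.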