How many GFLOPS do you usually get relative to your CPU's peak performance?

Dear all,

A quick question: how many GFLOPS do you usually get relative to your CPU's peak performance?

I have a single-threaded code, and I profiled it with Intel Advisor (screenshot below).

In particular, my CPU's peak performance is 62.825 GFLOPS, but my code only reaches 1.908 GFLOPS, which is pretty small.

I guess that if I use all 12 threads with MPI, I could perhaps reach about 1.9 × 12 ≈ 23 GFLOPS. Even so, 23 GFLOPS is still pretty far from the theoretical peak of 62 GFLOPS.
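To make that estimate concrete, here is a minimal sketch of the arithmetic in Python. It uses only the figures quoted above and assumes perfect linear scaling across the 12 threads, which is an optimistic assumption rather than a measurement. It also compares against the same 62.825 GFLOPS figure throughout, without knowing whether Advisor reports that peak per core or for the whole CPU.

```python
# Figures quoted in the post above.
PEAK_GFLOPS = 62.825      # peak reported by Intel Advisor
MEASURED_GFLOPS = 1.908   # achieved by the single-threaded run
THREADS = 12

def fraction_of_peak(achieved_gflops, peak_gflops):
    """Return achieved performance as a fraction of the peak."""
    return achieved_gflops / peak_gflops

single_fraction = fraction_of_peak(MEASURED_GFLOPS, PEAK_GFLOPS)

# ASSUMPTION: perfect linear scaling over all threads (an upper bound).
scaled_gflops = MEASURED_GFLOPS * THREADS
scaled_fraction = fraction_of_peak(scaled_gflops, PEAK_GFLOPS)

print(f"1 thread:   {MEASURED_GFLOPS:.3f} GFLOPS ({single_fraction:.1%} of peak)")
print(f"{THREADS} threads: {scaled_gflops:.3f} GFLOPS ({scaled_fraction:.1%} of peak)")
```

So the single-threaded run sits at roughly 3% of the quoted peak, and even ideal scaling would only reach a bit over a third of it.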

I am curious: how many GFLOPS do you get relative to your CPU's peak performance?
How can I get as close to the CPU's peak performance as possible?

If the code frequently operates on big arrays (several GB in size), the performance will be limited by the memory bandwidth, right?

Thanks much in advance!


I just realized that Apple's M1 Macs seem to have quite high memory bandwidth, around 200 to 400 GB/s, while my laptop uses DDR4-2666, which only gives about 35 GB/s, far less than the Mac's. I guess the M1 Macs benefit a lot from their high-bandwidth memory as well.
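For reference, the theoretical peak of a DDR4 configuration can be estimated from the transfer rate, the bus width, and the number of channels. The sketch below assumes a 64-bit bus per channel and a dual-channel laptop; the channel count is my assumption, not something I have checked. The function name `peak_bandwidth_gb_s` is just for illustration.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, bus_width_bits=64, channels=2):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    ASSUMPTION: 64-bit data bus per channel, as on standard DDR4 DIMMs.
    """
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9

# DDR4-2666 means 2666 mega-transfers per second per channel.
single_channel = peak_bandwidth_gb_s(2666, channels=1)
dual_channel = peak_bandwidth_gb_s(2666, channels=2)  # ASSUMPTION: dual channel

print(f"single channel: {single_channel:.1f} GB/s")
print(f"dual channel:   {dual_channel:.1f} GB/s")
```

Under these assumptions the dual-channel peak is a bit under 43 GB/s, so a figure of about 35 GB/s is plausible as a sustained rate, since real code never reaches the full theoretical number.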

It depends on many factors, mainly:

  • Bandwidth utilisation
  • Vectorisation ratio
  • Cache hits/misses

For example, if your code isn’t vectorised, you lose much of the FLOPS available in the cores. If your code does few calculations per byte loaded from memory, then data transfer is the bottleneck. And if you have a lot of branching or access data in an inconvenient order, you’ll get many cache misses, which again ends in a data transfer bottleneck.
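The data transfer point can be illustrated with a simple roofline-style estimate: the attainable performance is the smaller of the compute peak and the arithmetic intensity (FLOPs per byte moved) times the memory bandwidth. The sketch below plugs in the figures quoted earlier in this thread (62.825 GFLOPS peak, about 35 GB/s). The example kernel `y[i] = a*x[i] + y[i]` is hypothetical and not taken from the original code, and the byte count ignores effects such as write-allocate traffic.

```python
PEAK_GFLOPS = 62.825     # compute peak quoted in the thread
BANDWIDTH_GB_S = 35.0    # approximate memory bandwidth quoted in the thread

def attainable_gflops(flops_per_byte, peak=PEAK_GFLOPS, bandwidth=BANDWIDTH_GB_S):
    """Roofline estimate: min(compute peak, intensity * memory bandwidth)."""
    return min(peak, flops_per_byte * bandwidth)

# HYPOTHETICAL kernel: y[i] = a * x[i] + y[i] in double precision.
# Per element: 2 FLOPs (multiply + add), 24 bytes (load x, load y, store y).
axpy_intensity = 2 / 24
axpy_bound = attainable_gflops(axpy_intensity)

# Intensity above which the code stops being limited by memory bandwidth.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GB_S

print(f"axpy-like kernel: at most {axpy_bound:.2f} GFLOPS")
print(f"ridge point:      {ridge_point:.2f} FLOPs per byte")
```

Under these assumptions a streaming kernel of that kind tops out at roughly 3 GFLOPS regardless of how well it is vectorised, which is in the same range as the 1.908 GFLOPS measured. Whether the original code is really of this type is something only the profiler can tell.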

If you want to get the most out of the CPU, you have to identify the bottlenecks. I like to use LIKWID for this. It is a tool which utilizes the CPU’s hardware counters and is therefore very “lightweight”.

However, getting the code to reach the best possible FLOPS and bandwidth probably takes a lot of experience and practice. To start with, I recommend this course from the FAU.

PS: Usually your goal shouldn’t be to maximise the FLOPS, because you could reach that simply by adding nonsense calculations to your code. Instead, you want to minimise the execution time of your code.