This is precisely why Hager advocates showing boxplots: they capture the magnitude of the fluctuations. Here is an example of how things can go wrong:

Each boxplot summarizes 10 repeated measurements. The thin orange line is the median. In this particular run, 3 out of 10 measurements of the executable compiled with -O2 took around 100 ms for no obvious reason, while the rest were below 50 ms.
I repeated the measurements with the minimum time set to 1.0 second instead of 0.2 seconds, but the graph did not change.
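
To make concrete what a boxplot reports for a run like this, here is a minimal sketch in Python. It is not the benchmark harness used for the measurements above: the helper names `time_repeated` and `box_summary` are my own, and the numbers in `example_ms` are made up to mimic the pattern described (seven runs below 50 ms, three around 100 ms).

```python
import statistics
import time


def time_repeated(fn, repeats=10):
    """Call fn `repeats` times and return the wall-clock time of each call in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def box_summary(samples):
    """Return the five numbers a boxplot is built from."""
    # "inclusive" treats the samples as the whole population, so the
    # quartiles always lie between the smallest and largest sample.
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return {
        "min": min(samples),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(samples),
    }


# Hypothetical timings in ms, NOT measured data: 7 fast runs, 3 slow ones.
example_ms = [42, 44, 45, 46, 47, 48, 49, 98, 101, 103]
summary = box_summary(example_ms)

if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name:>6}: {value:.2f} ms")
```

With these made-up numbers the median stays below 50 ms, but the upper quartile and the maximum are pulled far up by the three slow runs. A bare median or mean would hide that, while the box and whiskers show it at a glance. To summarize real timings, pass the result of `time_repeated(your_function)` to `box_summary`.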