Jun 1, 2012

Effective Execution Statistics

As designer of a highly scalable system I'm always trying to get good execution statistics data. Even from the frequent short test runs I'm doing during daily development. For instance, how long time have the simulation threads spent executing each simulation tick? How much time is spent persisting/loading data? What is the client IO volume?

I'm not talking about actual code profiling. What I'm after is rather a summary of how the essential processes and tasks have performed after each run. Was something unexpectedly slow or fast? Did some data set grow or shrink?

The behavior and performance can change when something apparently unrelated is changed. Indirect caching and memory consumption changes are two potential culprits. But it's also easy to make mistakes with more direct dependencies, for instance being sloppy when adding some attribute to some game entity type which happens to cause twice the data amount to be sent to clients! An appropriate execution statistics summary after each run enables a simple sanity check to catch such things.

Execution stats are not about fixed time/data limits - in this system there are no "hard" such requirements. If a simulation tick takes 3 ms that's good - but if it takes 50 ms for 1000 simultaneous players that can be ok too. Often I'm not looking at the measurement values by themselves as much as how much they change depending on code I modify. How long does it take on average? Is that average less or greater after I attempted to cache something? (This is very hard to see from series of individual time samples/time stamps.)

And it's important to not only examine an average value, nor the maximum (or minimum) value. For any given metric, I find that all these values are important:
  • count (number of executions/packets/whatnot)
  • average / median
  • standard deviation
  • minimum and maximum.

The count of course tells how many times something has occurred. This info must be there to put the other measurements in context. If a DB load has happened only once but took 900 ms, it's probably slow because nothing had been cached and the JVM hadn't optimized the code yet.

The average and median indicate the general performance to expect over time. For different metrics the average or the median can be a more relevant measure. It's best to produce both! Combined with standard deviation these values provide a good indication of the "typical span" of a metric, for example that it usually takes 10-40 ms to execute.

The extreme cases must be examined as well however. If a simulation tick has a good average performance but occasionally takes 1000 ms (when, say 10 or 100 is expected) this has a serious effect on game play and must be investigated.

Collecting and Printing Statistics

For a long time I was manually adding statistics measurements to the code in the way lots of people do - calling System.currentTimeMillis() or nanoTime() at some execution points and logging the time spent. But this was cumbersome, unclean, ad-hoc and the results (lots of logged values) were difficult to interpret.

I finally ended up writing a relatively simple but effective and efficient statistics collector and printer. My objective was to emulate the logger in simplicity: Put the measurements to the collector in a single line just as you would put a string to the logger:

public class IOPacker {
  public static final Logger LOG = Logger.getLogger(IOPacker.class);
  public static final StatCompiler STAT = StatCompiler.getStatCompiler(IOPacker.class);
  public void pack() {
    long startTime = System.nanoTime();
    STAT.putSample("Pack time [ms]", (System.nanoTime()-startTime)/1e6);

To output the compiled statistics, I execute this line before the program exits:
    StatCompiler.logAllStats(LOG, Level.INFO);

Efficient Representation

Just like for a logger, the implementation needs to be as fast as possible. This is not terribly hard. But then there is the concern for memory: in a long application run, the amount of held data shouldn't grow indefinitely, even though we want the information to encompass the full run!

The solution lies in managing the precision with which all the collected data samples are represented. The implementation is really a bucket map where each bucket represents a certain value interval. Within each decimal order of magnitude (e.g. between 1.0 and 9.9, between 10 and 99 and so on) up to 90 intervals (buckets) are stored. A bucket is simply a counter that records the number of samples that has hit its interval.

This means that the memory size grows with the order of magnitude between the smallest and the largest data point, but not with the number of data points. As long as there is a lower boundary on the smallest data point precision we care about, and since we only create buckets when a sample actually hits them, there rarely needs to be more than a few dozen buckets. We will still safely reflect any unusually small or large values (far outside the "dense bucket range" in the middle), and we still get good results with sufficient precision:

Pack time [ms]: sum: 3 326; avg: 11,9 (sd +- 48,6); count: 279; min<med<max: 5,5 < 8,2 < 823,2


The code is in a public Gist, feel free to use it! (without any warranty of course ;-) )


No comments:

Post a Comment