Stop trying to re-invent statistics. Use a box and whisker plot of latency. You ...

scott_s · on Oct 11, 2018

I've recently grown to like violin plots for latency (https://en.wikipedia.org/wiki/Violin_plot). I've also added 99th-percentile tick marks, which, together with the median mark that's already there, give a relatively full picture of latency that is easy to digest.

vlovich123 · on Oct 11, 2018

Mathematics is not something handed down by the gods. It's possible to run into not just completely new problems, but also limitations of the existing methods for solving a known problem.

In this particular case the challenge is aggregating statistics from a very large fleet & driving automated alarms from them. Visualization tools don't help with any of that. More specifically, the reporting tools out there apparently share a very common & persistent flaw: they report an average of percentiles across agents, which is a statistically meaningless metric, because each agent's percentile says nothing about how many requests that agent actually served. It makes no difference how you visualize it - the data is bunk.
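To make that concrete, here's a minimal sketch with made-up numbers (the two agents, their latencies, and the nearest-rank `percentile` helper are all my own illustration, not anything from the article). One busy agent serves 1000 fast requests, one nearly idle agent serves 10 slow ones:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds.
busy_agent = [10] * 1000   # 1000 requests, all fast
idle_agent = [1000] * 10   # 10 requests, all slow

# What the flawed tools do: take each agent's p99, then average those.
per_agent_p99 = [percentile(busy_agent, 99), percentile(idle_agent, 99)]
averaged_p99 = sum(per_agent_p99) / len(per_agent_p99)

# What you actually want: the p99 over every request in the fleet.
pooled_p99 = percentile(busy_agent + idle_agent, 99)

print(averaged_p99, pooled_p99)
```

The averaged figure comes out at 505 ms while the p99 over the pooled requests is 10 ms, because the average gives the idle agent's 10 requests the same weight as the busy agent's 1000.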

This article flips it so that agents simply report how many requests they got & how many exceeded the required threshold. Counts, unlike percentiles, can be summed across agents, so the fleet as a whole can report the percentage of users having a worse experience than the desired SLA. You can also build reliable tools, such as alarms, on top of this metric. It's not a universal solution, but it's a neat trick: you keep the performance benefit of not having to pull full logs from every agent & still get a meaningful representation of the latency your users see.
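A rough sketch of that counting approach, as I understand it. The `AgentReport` type, the 500 ms SLA, and the 0.5% alarm threshold are placeholders I made up for illustration, not values from the article:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentReport:
    total: int     # requests the agent handled in the reporting window
    over_sla: int  # how many of those were slower than the SLA threshold

def summarize(latencies_ms, sla_ms):
    """Runs on each agent: reduce raw latencies to two counters."""
    return AgentReport(
        total=len(latencies_ms),
        over_sla=sum(1 for latency in latencies_ms if latency > sla_ms),
    )

def merge(reports):
    """Runs centrally: counters from different agents simply add up."""
    return AgentReport(
        total=sum(r.total for r in reports),
        over_sla=sum(r.over_sla for r in reports),
    )

def fraction_over(report):
    """Share of requests that missed the SLA (0.0 when there was no traffic)."""
    return report.over_sla / report.total if report.total else 0.0

SLA_MS = 500            # assumed latency target
ALARM_FRACTION = 0.005  # assumed alarm threshold: more than 0.5% of requests too slow

reports = [
    summarize([10] * 1000, SLA_MS),  # busy agent, all fast
    summarize([1000] * 10, SLA_MS),  # idle agent, all slow
]
fleet = merge(reports)
fleet_fraction = fraction_over(fleet)
alarm = fleet_fraction > ALARM_FRACTION

print(fleet, round(fleet_fraction * 100, 2), alarm)
```

Each agent ships two integers per window instead of its logs, and the merged result is weighted by traffic automatically, so a nearly idle agent can't skew the fleet-wide number the way it does when percentiles are averaged.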