Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Stop trying to re-invent statistics. Use a box and whisker plot of latency. You quickly get to see the mean, the quartiles, and all the outliers and you get it in a format which is familiar and easy to understand. You can even plot box and whisker plots next to each other for quick meaningful comparisons between different things.


I've recently grown to like violin plots for latency (https://en.wikipedia.org/wiki/Violin_plot). I've also added 99%ile tick marks, which with the already present median mark, gives a relatively full picture of latency that is easily digestible.


Mathematics is not something handed down by the gods. It's possible to encounter not just completely new problems, but also limitations to existing methods for solving a known problem.

In this particular case the challenge is aggregating statistics from a very large fleet & having automated alarms. Visualization tools don't help with any of that. More specifically, the reporting tools out there apparently have a very common & persistent flaw of reporting an average of percentiles across agents which is a statistically meaningless metric. It makes no difference how you visualize it - the data is bunk.

This article flips it so that agents simply report how many requests they got & how many exceeded the required threshold. This lets them report the percentage of users having a worse experience than the desired SLA. You can also build reliable tools on top of this metric. It's not a universal solution but it's a neat trick to maintain the performance properties of not needing to pull full logs from all agents & still have a meaningful representation of the latency of your users.




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: