In a previous life as a full-stack Engineer at a startup, this was my white whal...

peteradio · on April 11, 2023

You know all that work was worth it when you get a good lauding.

dgunay · on April 12, 2023

Going through something like this as a SWE at a startup. Lots of noise in our alerts and logging, so alert fatigue is a real problem. Do you have any advice on navigating this scenario (esp. negotiating with product to get monitoring and ops in a usable state)

raldi · on April 12, 2023

Sure, just give your manager a copy of the bible: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa...

dgunay · on April 12, 2023

Thanks, this was a very enlightening read. Getting product on board with the labor involved in implementing this is going to be a different story though.

raldi · on April 12, 2023

Another good piece about negotiating with Product is written up here:

https://sre.google/sre-book/introduction/#:~:text=Pursuing%2...

Ultimately, it's Product's job to decide how they want to balance reliability and feature-shipping speed. Work with them to define an SLO (like, in 99.995% of five-minute timeslices of any given month, 99% of all queries will complete within 250msec) and then graph how well you're doing when it comes to hitting it.

If you're failing to keep things above that line, Product either needs to accept lower reliability standards or invest engineering time in improving reliability. Again, it's Product's call to make. If they do want to invest in reliability, though, that's when you get to present your wish list, work out an agreement on its ranking, and find time to get the work done, even if it means slowing down the rate at which new features are shipped.

hallway_monitor · on April 12, 2023

You may have luck if you frame it in terms of an investment. Spend the time now to fix your alerts, add playbooks, improve process - because you immediately start enjoying the benefits. Less time spent on support means higher velocity. The longer you wait the more engineering time you've wasted It just takes a little patience up front as well as product and engineering collaborating.