Sunday, July 28, 2024

Show HN: I built an open-source tool to make on-call suck less
18 points by aray07 | 2 comments on Hacker News.
Hey HN, I am building an open-source platform to make on-call better and less stressful for engineers. It can silence noisy alerts and help with debugging and root cause analysis, and we also want to automate the tedious parts of being on-call (manually running runbooks, answering repeated questions on Slack, wrangling PagerDuty). Here is a quick video of how it works: https://youtu.be/m_K9Dq1kZDw

I hated being on-call for a couple of reasons:

* Alert volume: The number of alerts kept increasing over time, and maintaining existing alerts was hard. This led to a lot of noisy, unactionable alerts. I have lost count of the number of times I was woken up by an alert that auto-resolved five minutes later.

* Debugging: Debugging an alert or a customer support ticket meant gaining context on a service I might never have worked on before, spread across many observability tools, and there was always time pressure to resolve issues quickly.

Some more tangential issues also used to take up a lot of on-call time:

* Support: Answering questions from other teams. A lot of these questions were repetitive and had been answered before.

* Dealing with PagerDuty: These tools are hard to use. For example, scheduling an override or a holiday rotation in PagerDuty was a pain.

I am building an on-call tool that is Slack-native, since Slack has become the de facto tool for on-call engineers. We heard from a lot of engineers that maintaining good alert hygiene is a challenge. To start, Opslane integrates with Datadog and can classify alerts as actionable or noisy. We analyze your alert history across several signals:

1. Alert frequency
2. How quickly the alerts have resolved in the past
3. Alert priority
4. Alert response history

Our classification is conservative, and it can be tuned as teams gain confidence in the predictions; we want to make sure you aren't accidentally missing a critical alert. (A rough sketch of this kind of scoring is below.) Additionally, we generate a weekly report from all your alerts to give you a picture of your overall alert hygiene.

What's next?

1. More integrations (Prometheus, Splunk, Sentry, PagerDuty) to keep improving on-call quality of life
2. Making debugging and root cause analysis easier
3. Runbook automation

We're still pretty early in development, and we want to make on-call quality of life better. Any feedback would be much appreciated!
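The post doesn't show any code, but a minimal Python sketch of the kind of conservative scoring it describes might look like the following. Everything here (the AlertEvent shape, the weights, the thresholds) is hypothetical and only illustrates the four signals; it is not Opslane's actual implementation.

    from dataclasses import dataclass
    from datetime import timedelta

    @dataclass
    class AlertEvent:
        """One firing of a monitor, e.g. pulled from alert history."""
        priority: int               # 1 = highest priority, 5 = lowest
        time_to_resolve: timedelta  # how long the alert stayed open
        acknowledged: bool          # did a human respond before it resolved?

    def noisy_score(history: list[AlertEvent], window_days: int = 30) -> float:
        """Higher score = more likely the monitor is noise.

        Combines the four signals from the post: frequency, resolution
        speed, priority, and response history. The weights are made up.
        """
        if not history:
            return 0.0

        # Signal 1: alert frequency -- monitors that fire constantly are suspect.
        frequency = min(len(history) / window_days, 1.0)

        # Signal 2: how quickly alerts resolved -- fast auto-resolves rarely
        # needed a human (the "woke me up, fixed itself" case).
        fast = sum(e.time_to_resolve < timedelta(minutes=10) for e in history)
        fast_resolve_rate = fast / len(history)

        # Signal 3: priority -- map average priority P1..P5 onto 0..1 so that
        # low-priority monitors lean toward "noisy".
        avg_priority = sum(e.priority for e in history) / len(history)
        low_priority = (avg_priority - 1) / 4

        # Signal 4: response history -- alerts nobody acknowledges are a strong
        # hint that on-call has learned to ignore them.
        unacked_rate = sum(not e.acknowledged for e in history) / len(history)

        return (0.2 * frequency + 0.35 * fast_resolve_rate
                + 0.15 * low_priority + 0.3 * unacked_rate)

    def classify(history: list[AlertEvent], threshold: float = 0.8) -> str:
        """Conservative cut: only call a monitor noisy at a high score.

        Lowering the threshold as a team gains confidence matches the
        tunability the post describes.
        """
        return "noisy" if noisy_score(history) >= threshold else "actionable"

The high default threshold is what makes this sketch "conservative": a monitor is only flagged as noise when several signals agree, so a critical alert that merely fires often still lands on the actionable side.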
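Likewise, the post doesn't show what "Slack-native" silencing looks like in code. As a purely illustrative sketch using the slack_bolt library (the action ID, handler, and mute call are all hypothetical, not Opslane's real integration), a "Mute" button handler could look roughly like this:

    from slack_bolt import App

    # Tokens would come from the environment in a real deployment.
    app = App(token="xoxb-...", signing_secret="...")

    @app.action("mute_alert")  # fires when someone clicks a "Mute" button
    def handle_mute(ack, body, client):
        ack()  # Slack requires an acknowledgement within 3 seconds
        monitor_id = body["actions"][0]["value"]
        # ...call the monitoring backend (e.g. Datadog) to mute monitor_id...
        client.chat_postMessage(
            channel=body["channel"]["id"],
            text=f"Muted monitor {monitor_id} for this on-call shift.",
        )

    if __name__ == "__main__":
        app.start(port=3000)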
