On-Call Rotation for a Startup from Sandglass: practical guidance for assigning responsibility for production alerts without burning out a tiny engineering team.
The goal is to settle who responds to which alert, and when, before a stressful incident forces the team to improvise.
Group alerts by severity, rotate primary ownership, document known fixes, and keep non-urgent alerts out of overnight paging. Sandglass supports the continuous side of this work with checks, incidents, alert routing, and public status visibility.
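Rotating primary ownership can be as simple as counting weeks. The following is a minimal sketch, assuming a weekly handoff on Mondays; the roster names, the `primary_on_call` function, and the anchor date are all hypothetical and are not part of Sandglass.

```python
from datetime import date

# Hypothetical roster; replace with your own team. Order defines the rotation.
ROSTER = ["ana", "ben", "chen"]

# Arbitrary anchor date. 2024-01-01 was a Monday, so handoffs land on Mondays.
ROTATION_START = date(2024, 1, 1)


def primary_on_call(day: date, roster: list[str] = ROSTER) -> str:
    """Return the primary owner for the week that contains `day`."""
    # Whole weeks elapsed since the anchor; floor division also handles
    # dates before the anchor without special cases.
    weeks_elapsed = (day - ROTATION_START).days // 7
    return roster[weeks_elapsed % len(roster)]


if __name__ == "__main__":
    print(primary_on_call(date.today()))
```

A fixed anchor date keeps the schedule deterministic, so anyone can work out who is primary on a given day without consulting a shared calendar.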
The fastest way to ruin on-call is routing every minor staging failure to the same person who handles production outages.
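One way to keep staging noise and non-urgent alerts away from the pager is an explicit routing rule. This sketch assumes three severity levels, two environments, and quiet hours from 22:00 to 07:00; the `Alert` class, the `route` function, and every threshold here are illustrative assumptions rather than Sandglass behavior.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str      # assumed levels: "critical", "warning", "info"
    environment: str   # assumed values: "production", "staging"


# Assumed quiet hours; adjust to the team's actual schedule.
QUIET_START = time(22, 0)
QUIET_END = time(7, 0)


def in_quiet_hours(now: time) -> bool:
    """True between QUIET_START and QUIET_END, wrapping past midnight."""
    return now >= QUIET_START or now < QUIET_END


def route(alert: Alert, now: time) -> str:
    """Return "page", "notify", or "queue" for an alert at local time `now`."""
    # Only critical production failures may wake someone up.
    if alert.environment == "production" and alert.severity == "critical":
        return "page"
    # Everything else waits until morning if it fires overnight.
    if in_quiet_hours(now):
        return "queue"
    return "notify"
```

With these rules, a critical staging failure at 03:00 is queued for the morning, while a critical production outage at the same hour pages the primary.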
Decide which failures actually reach customers before adding any monitoring.
Match each risk to a single HTTP, content, TCP, SSL certificate, or heartbeat check instead of stacking duplicates.
Give each alert one owner and one destination: email, a Slack webhook, or a generic webhook.
Revisit intervals, thresholds, and ownership once a real incident shows what was missing.
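The steps above can be checked mechanically if the team keeps its check inventory in a file or script. The sketch below uses a hypothetical `Check` record and `lint` function; it is not Sandglass's configuration format, and the type and destination names are assumed labels for the options mentioned above.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed labels for the check types and destinations named in this guide.
CHECK_TYPES = {"http", "content", "tcp", "ssl", "heartbeat"}
DESTINATIONS = {"email", "slack_webhook", "webhook"}


@dataclass(frozen=True)
class Check:
    risk: str          # the customer-facing failure this check covers
    type: str          # one of CHECK_TYPES
    target: str        # URL, host:port, or job name
    owner: str         # exactly one person or role
    destination: str   # one of DESTINATIONS


def lint(checks: list[Check]) -> list[str]:
    """Return a list of human-readable problems; empty means the inventory is clean."""
    problems = []
    for c in checks:
        if c.type not in CHECK_TYPES:
            problems.append(f"{c.risk}: unknown check type {c.type!r}")
        if c.destination not in DESTINATIONS:
            problems.append(f"{c.risk}: unknown destination {c.destination!r}")
        if not c.owner.strip():
            problems.append(f"{c.risk}: no owner assigned")
    # Flag stacked duplicates: the same check type pointed at the same target.
    counts = Counter((c.type, c.target) for c in checks)
    for (ctype, target), n in counts.items():
        if n > 1:
            problems.append(f"{ctype} {target}: {n} duplicate checks")
    return problems
```

Running `lint` after each incident review is one lightweight way to revisit ownership and catch checks that were added in a hurry.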