Incident Response Guide from Sandglass: practical guidance for moving from ad-hoc outage reactions to a repeatable owner, channel, timeline, and recovery process.
The goal is to make operating decisions clear before a stressful incident forces the team to improvise.
Route production incidents to the team channel, assign one incident lead, record decisions as they happen, and review the alert after recovery. Sandglass supports the continuous side of this work with checks, incidents, alert routing, and public status visibility.
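As a rough illustration of the "one lead, one channel, decisions recorded as they happen" idea, the sketch below models an incident as a small record with a timestamped timeline. The `Incident` class, its field names, and the example values are hypothetical and are not part of any Sandglass API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    """Hypothetical incident record; not a Sandglass data model."""
    title: str
    lead: str     # exactly one incident lead
    channel: str  # the team channel where the incident is handled
    timeline: list[tuple[datetime, str]] = field(default_factory=list)

    def record(self, note: str) -> None:
        # Timestamp each decision when it is made, not from memory afterwards.
        self.timeline.append((datetime.now(timezone.utc), note))


# Example values are made up for illustration.
incident = Incident(
    title="Checkout API returning 502",
    lead="dana",
    channel="#team-payments",
)
incident.record("Rolled back the latest deploy")
incident.record("Error rate back to baseline; alert review scheduled")
```

Keeping the timeline as an append-only list of (time, note) pairs means the post-recovery review can start from what was actually decided, in order.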
An incident process fails when every alert becomes a meeting. Define severity levels, ownership, and recovery criteria so the response matches the real customer impact.
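One way to keep the response proportional is to write the severity policy down as data. The sketch below is only an assumption about how a team might do that: the severity names, the 50% threshold, and the `RESPONSE` table are illustrative choices, not values from Sandglass.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "most customers cannot use the product"
    SEV2 = "degraded for some customers"
    SEV3 = "no customer impact yet"


# Hypothetical policy table: what each severity triggers.
RESPONSE = {
    Severity.SEV1: {"page_lead": True, "open_call": True},
    Severity.SEV2: {"page_lead": True, "open_call": False},
    Severity.SEV3: {"page_lead": False, "open_call": False},
}


def classify(customers_affected: int, total_customers: int) -> Severity:
    """Pick a severity from the share of customers affected.

    The 50% cut-off is an illustrative threshold, not a recommendation.
    """
    if customers_affected == 0:
        return Severity.SEV3
    share = customers_affected / total_customers
    return Severity.SEV1 if share >= 0.5 else Severity.SEV2


severity = classify(customers_affected=10, total_customers=100)
plan = RESPONSE[severity]  # page the lead, but do not open a call
```

With a table like this, only the top severity opens a call; everything else stays in the channel.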
Decide which failures actually reach customers before adding any monitoring.
Match each risk to a single HTTP, content, TCP, SSL certificate, or heartbeat check instead of stacking duplicates.
Give each alert one owner and one destination: email, a Slack webhook, or a generic webhook.
Revisit intervals, thresholds, and ownership once a real incident shows what was missing.
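The steps above can be checked mechanically. The following sketch assumes checks are described as plain dictionaries; the dictionary keys, the type and destination labels, and the `validate` helper are hypothetical and do not reflect Sandglass's actual configuration format.

```python
from collections import Counter

# Labels are assumptions that mirror the check types and destinations named above.
CHECK_TYPES = {"http", "content", "tcp", "ssl", "heartbeat"}
DESTINATIONS = {"email", "slack_webhook", "webhook"}

# Example inventory; risks, owners, and destinations are made up.
checks = [
    {"risk": "checkout API down", "type": "http",
     "owner": "payments", "destination": "slack_webhook"},
    {"risk": "certificate expiry", "type": "ssl",
     "owner": "platform", "destination": "email"},
    {"risk": "nightly export stalled", "type": "heartbeat",
     "owner": "data", "destination": "webhook"},
]


def validate(checks: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the inventory is clean."""
    problems = []
    # One check per risk: flag stacked duplicates.
    for risk, count in Counter(c["risk"] for c in checks).items():
        if count > 1:
            problems.append(f"{risk}: {count} checks cover the same risk")
    for c in checks:
        if c.get("type") not in CHECK_TYPES:
            problems.append(f"{c['risk']}: unknown check type {c.get('type')!r}")
        if not c.get("owner"):
            problems.append(f"{c['risk']}: no owner")
        if c.get("destination") not in DESTINATIONS:
            problems.append(f"{c['risk']}: no single valid destination")
    return problems


problems = validate(checks)
```

Running a check like this after each incident review is one way to revisit ownership and coverage without relying on memory.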