Tracking incidents is more than logging failures. It is about building a knowledge base that helps your team respond faster, learn from patterns, and prevent future outages. This guide covers the essentials of effective incident tracking.
Every incident is a learning opportunity. Without proper tracking, your team loses valuable context about what went wrong, how long it took to resolve, and which systems are most fragile.
Good incident tracking helps you spot patterns, measure your response times, understand your actual uptime, and justify infrastructure investments with concrete data.
Essential incident data includes when the incident started, when it was resolved, which services or checks were affected, the failure count, and the status at each point in time.
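As a rough illustration, the fields listed above could be held in a record like the one below. This is a minimal sketch, not the schema of any particular tool: the `Incident` class, its field names, and the `"open"`/`"resolved"` status labels are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; names are illustrative only."""
    started_at: datetime
    affected_checks: list[str]
    failure_count: int = 0
    resolved_at: Optional[datetime] = None
    # (timestamp, status) pairs, oldest first
    status_history: list[tuple[datetime, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Every incident starts life in the "open" state.
        if not self.status_history:
            self.status_history.append((self.started_at, "open"))

    def resolve(self, at: datetime) -> None:
        """Record the resolution time and append it to the status history."""
        self.resolved_at = at
        self.status_history.append((at, "resolved"))

    @property
    def duration(self) -> Optional[timedelta]:
        """Time from start to resolution, or None while still open."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.started_at


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
incident = Incident(started_at=start, affected_checks=["api-http", "api-tcp"], failure_count=3)
incident.resolve(start + timedelta(minutes=12))
print(incident.duration)  # 0:12:00
```

Keeping the status history as timestamped pairs means the state of the incident at any moment can be reconstructed later, which is what makes after-the-fact review possible.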
The best systems automatically create incidents when monitoring detects failures—no human intervention needed. This ensures nothing slips through the cracks and provides accurate timestamps.
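Manual incident tracking introduces delays and human error. If someone has to remember to log an incident, details get lost, and because the incident is recorded some time after the failure actually began, the measured time to resolve looks shorter than it really was.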
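One way to picture automatic creation is a small tracker fed directly by check results, so that the incident's timestamps come from the monitor rather than from whenever a person got around to logging it. The sketch below assumes a single check and a hypothetical `CheckIncidentTracker` class; a real monitoring system would have its own result format and storage.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class CheckIncidentTracker:
    """Hypothetical tracker: opens an incident on the first failed check
    result and resolves it on the next successful one."""

    def __init__(self) -> None:
        self.current: Optional[dict] = None
        self.history: list[dict] = []

    def record(self, ok: bool, at: datetime) -> None:
        if not ok:
            if self.current is None:
                # First failure: the start time is the check's own timestamp.
                self.current = {"started_at": at, "resolved_at": None, "failure_count": 0}
            self.current["failure_count"] += 1
        elif self.current is not None:
            # First success after failures closes the incident.
            self.current["resolved_at"] = at
            self.history.append(self.current)
            self.current = None


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
tracker = CheckIncidentTracker()
for minute, ok in enumerate([True, False, False, True]):
    tracker.record(ok, t0 + timedelta(minutes=minute))

print(len(tracker.history))                 # 1
print(tracker.history[0]["failure_count"])  # 2
```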
Many modern monitoring systems track incidents at the group or service level rather than per individual check. When multiple checks fail simultaneously, one incident captures the entire event instead of fragmenting it across dozens of individual failures.
Group-level incidents give you a clearer picture of actual outages and make it easier to communicate status to stakeholders.
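The grouping idea can be sketched as follows: a group is considered down while any of its checks is failing, and only the transitions of the group as a whole produce incident events. The `GroupIncidentTracker` class and the `"opened"`/`"resolved"` event names are assumptions for illustration, not the behavior of a specific product.

```python
class GroupIncidentTracker:
    """Hypothetical tracker that emits one incident per group outage,
    however many checks in the group fail."""

    def __init__(self) -> None:
        self.failing: dict[str, set[str]] = {}   # group -> currently failing checks
        self.events: list[tuple[str, str]] = []  # (event, group), oldest first

    def record(self, group: str, check: str, ok: bool) -> None:
        failing = self.failing.setdefault(group, set())
        was_down = bool(failing)
        if ok:
            failing.discard(check)
        else:
            failing.add(check)
        is_down = bool(failing)
        # Only group-level state changes create events.
        if is_down and not was_down:
            self.events.append(("opened", group))
        elif was_down and not is_down:
            self.events.append(("resolved", group))


tracker = GroupIncidentTracker()
for check in ["http", "tcp", "dns"]:
    tracker.record("api", check, ok=False)   # three checks fail together
for check in ["http", "tcp", "dns"]:
    tracker.record("api", check, ok=True)    # and later recover

print(tracker.events)  # [('opened', 'api'), ('resolved', 'api')]
```

Three failing checks produce a single opened/resolved pair, which is the record a stakeholder actually wants to see.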
Review your incident history regularly to identify recurring issues, measure your mean time to recovery (MTTR), and understand which services need architectural improvements.
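For a concrete sense of what such a review computes, MTTR is the average of (resolved time minus start time) over resolved incidents, and recurring issues show up as the services that appear most often in the history. The sketch below uses made-up sample data and hypothetical helper names (`mean_time_to_recovery`, `most_affected`).

```python
from collections import Counter
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - started) over a list of resolved incidents."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((resolved - started for started, resolved in incidents), timedelta())
    return total / len(incidents)


def most_affected(services: list[str], top: int = 3) -> list[tuple[str, int]]:
    """Services ranked by how many incidents they appear in."""
    return Counter(services).most_common(top)


# Illustrative data only: three incidents lasting 10, 20 and 30 minutes.
base = datetime(2024, 1, 1, 9, 0)
history = [
    (base, base + timedelta(minutes=10)),
    (base + timedelta(days=1), base + timedelta(days=1, minutes=20)),
    (base + timedelta(days=2), base + timedelta(days=2, minutes=30)),
]
mttr = mean_time_to_recovery(history)
print(mttr)  # 0:20:00

ranking = most_affected(["checkout", "search", "checkout"])
print(ranking)  # [('checkout', 2), ('search', 1)]
```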
Your incident tracking system should integrate directly with your alerting infrastructure. When an incident opens, alerts go out. When it resolves, recovery notifications are sent automatically.
This tight integration ensures your team sees the full incident lifecycle and can correlate alerts with actual outages instead of drowning in disconnected notifications.
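A minimal sketch of this coupling, assuming a hypothetical `IncidentLifecycle` class and a notifier that is just a callable taking a message string: notifications are sent only when the incident state changes, so repeated failures during one outage do not generate repeated alerts.

```python
from typing import Callable

# Assumption for this example: a notifier is any callable that accepts a message.
Notifier = Callable[[str], None]


class IncidentLifecycle:
    """Hypothetical glue between incident state and alerting."""

    def __init__(self, notify: Notifier) -> None:
        self.notify = notify
        self.is_open = False

    def update(self, service: str, healthy: bool) -> None:
        if not healthy and not self.is_open:
            self.is_open = True
            self.notify(f"ALERT: incident opened for {service}")
        elif healthy and self.is_open:
            self.is_open = False
            self.notify(f"RECOVERY: incident resolved for {service}")
        # No state change means no notification.


sent: list[str] = []
lifecycle = IncidentLifecycle(sent.append)
for healthy in [True, False, False, False, True]:
    lifecycle.update("checkout", healthy)

print(sent)
# ['ALERT: incident opened for checkout', 'RECOVERY: incident resolved for checkout']
```

In practice the notifier would wrap whatever channel the team uses; passing it in as a callable keeps the lifecycle logic independent of that choice.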
Use your incident tracking system to populate public status pages automatically. When an incident is detected and logged, your status page should reflect it immediately—no manual updates required.
This transparency builds trust with users and reduces support ticket volume during outages.
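To illustrate how a status page might be derived from incident data rather than edited by hand, the sketch below builds a JSON document from a list of services and the currently open incidents. The `build_status` function, the JSON shape, and the `"operational"`/`"outage"` labels are assumptions for the example, not a standard format.

```python
import json


def build_status(services: list[str], open_incidents: dict[str, str]) -> str:
    """Return a JSON status document.

    open_incidents maps a service name to the ISO timestamp at which its
    current incident started (hypothetical input shape).
    """
    components = []
    for name in services:
        started = open_incidents.get(name)
        components.append({
            "service": name,
            "status": "outage" if started else "operational",
            "since": started,
        })
    any_outage = any(c["status"] == "outage" for c in components)
    overall = "outage" if any_outage else "operational"
    return json.dumps({"overall": overall, "components": components}, indent=2)


page = build_status(
    ["api", "dashboard"],
    {"api": "2024-01-01T12:01:00+00:00"},
)
print(page)
```

Because the page is regenerated from the incident data, it changes as soon as an incident opens or resolves, with no separate manual update step.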
Sandglass automatically creates incidents when group-level alert policies are triggered. Each incident tracks which checks are failing, how many consecutive failures occurred, and when the incident started.
When checks recover, the incident is automatically marked as resolved with a timestamp. You can view the last 10 incidents for any group directly in the dashboard, giving you immediate visibility into your reliability history.
This automated approach means every outage is logged with accurate data, and your team can focus on fixing problems instead of documenting them.