Incident Tracking Best Practices for Reliability Teams

Tracking incidents is more than logging failures. It builds a knowledge base that helps your team respond faster, learn from patterns, and prevent future outages. This guide covers the essentials of effective incident tracking.

Why incident tracking matters

Every incident is a learning opportunity. Without proper tracking, your team loses valuable context about what went wrong, how long it took to resolve, and which systems are most fragile.

Good incident tracking helps you spot patterns, measure your response times, understand your actual uptime, and justify infrastructure investments with concrete data.

What to capture in every incident

At a minimum, every incident record should capture when it started, when it was resolved, which services or checks were affected, how many failures occurred, and how its status changed over time:

  • Start and end timestamps for accurate duration calculation
  • Affected services or checks to understand scope
  • Failure count to gauge severity
  • Status transitions from detection to resolution
  • Related alert notifications and acknowledgements
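
As a rough illustration, the fields above could be modelled as a small record type. This is a minimal sketch, not any particular tool's schema: the names `Incident`, `StatusChange`, `affected_checks`, and `alert_ids` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class StatusChange:
    """One status transition, e.g. "detected", "acknowledged", "resolved"."""
    status: str
    at: datetime


@dataclass
class Incident:
    """Hypothetical incident record covering the fields listed above."""
    started_at: datetime
    affected_checks: list[str]
    failure_count: int = 0
    resolved_at: Optional[datetime] = None
    transitions: list[StatusChange] = field(default_factory=list)
    alert_ids: list[str] = field(default_factory=list)

    def duration(self, now: Optional[datetime] = None) -> timedelta:
        """Time from start to resolution, or to `now` while still open."""
        end = self.resolved_at or now or datetime.now(timezone.utc)
        return end - self.started_at
```

Storing both timestamps, rather than a precomputed duration, keeps the record usable while the incident is still open.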

Automatic vs manual incident creation

The best systems create incidents automatically when monitoring detects failures, with no human intervention. This ensures every failure is recorded and that start times come from the monitoring data rather than from someone's memory.

Manual incident tracking introduces delays and human error. If someone has to remember to log an incident, details get lost, and the recorded start time tends to be later than the real one, which makes response times look artificially fast.
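
One way automatic creation might work is to open an incident once a check has failed several times in a row, and to backdate the start to the first failure in that run. The sketch below assumes a threshold of three consecutive failures; `IncidentTracker` and its fields are hypothetical names, not a real product's API.

```python
from datetime import datetime
from typing import Optional

FAILURE_THRESHOLD = 3  # assumed: consecutive failures before an incident opens


class IncidentTracker:
    """Opens and resolves incidents from a stream of check results."""

    def __init__(self, threshold: int = FAILURE_THRESHOLD) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0
        self.first_failure_at: Optional[datetime] = None
        self.open_incident: Optional[dict] = None
        self.history: list[dict] = []

    def record(self, ok: bool, at: datetime) -> None:
        if ok:
            # A passing result resolves any open incident and resets the run.
            if self.open_incident is not None:
                self.open_incident["resolved_at"] = at
                self.history.append(self.open_incident)
                self.open_incident = None
            self.consecutive_failures = 0
            self.first_failure_at = None
            return

        if self.consecutive_failures == 0:
            self.first_failure_at = at  # remember when the run of failures began
        self.consecutive_failures += 1

        if self.open_incident is None:
            if self.consecutive_failures >= self.threshold:
                # Backdate the start to the first failure, not the threshold hit.
                self.open_incident = {
                    "started_at": self.first_failure_at,
                    "resolved_at": None,
                    "failure_count": self.consecutive_failures,
                }
        else:
            self.open_incident["failure_count"] = self.consecutive_failures
```

Because the timestamps come from the check results themselves, the recorded duration does not depend on when a person noticed the problem.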

Group-level vs check-level incidents

Many modern monitoring systems track incidents at the group or service level rather than for individual checks. When multiple checks in a group fail at the same time, one incident captures the entire event instead of fragmenting it across dozens of individual failures.

Group-level incidents give you a clearer picture of actual outages and make it easier to communicate status to stakeholders.
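
The grouping idea can be sketched in a few lines: collect failing checks by the group they belong to, then treat each group with at least one failure as a single incident. The function name and the group and check names in the usage example are made up for illustration.

```python
from collections import defaultdict


def group_failures(failing_checks: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each group name to the sorted list of its failing checks.

    `failing_checks` is a list of (group, check) pairs.
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for group, check in failing_checks:
        grouped[group].append(check)
    return {group: sorted(checks) for group, checks in grouped.items()}


failing = [
    ("checkout", "api-health"),
    ("checkout", "db-ping"),
    ("search", "index-latency"),
    ("checkout", "payment-gateway"),
]
incidents = group_failures(failing)
# Four failing checks collapse into two incidents, one per group.
```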

Using incident history to improve reliability

Review your incident history regularly to identify recurring issues, measure your mean time to recovery (MTTR), and understand which services need architectural improvements.

  • Look for patterns in timing: do incidents cluster around deployments or specific times?
  • Identify your most failure-prone services and prioritize improvements
  • Track your MTTR over time to see whether incident response is actually improving
  • Use incident data to set realistic SLA targets
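
MTTR is the average of incident durations, so it can be computed directly from the start and end timestamps of resolved incidents. A minimal sketch, assuming each incident is represented as a `(started_at, resolved_at)` pair and that open incidents are excluded beforehand:

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average duration of resolved incidents given (start, end) pairs."""
    if not incidents:
        raise ValueError("MTTR is undefined without at least one resolved incident")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


day = datetime(2024, 1, 1)
resolved = [
    (day.replace(hour=9), day.replace(hour=9, minute=10)),   # 10 minutes
    (day.replace(hour=13), day.replace(hour=13, minute=20)), # 20 minutes
    (day.replace(hour=18), day.replace(hour=19)),            # 60 minutes
]
mttr = mean_time_to_recovery(resolved)  # 30 minutes
```

A single long outage can dominate the mean, so it may be worth tracking the median alongside it.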

Integrating incident tracking with alerting

Your incident tracking system should integrate directly with your alerting infrastructure. When an incident opens, alerts go out. When it resolves, recovery notifications are sent automatically.

This tight integration ensures your team sees the full incident lifecycle and can correlate alerts with actual outages instead of drowning in disconnected notifications.
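
The sketch below shows one possible shape for that integration: notifications are sent only on state changes, so a group that keeps failing produces one alert and one recovery message rather than a stream of repeats. `IncidentLifecycle` and the message formats are assumptions for the example, and `notify` stands in for whatever channel your alerting uses.

```python
from typing import Callable

Notifier = Callable[[str], None]


class IncidentLifecycle:
    """Sends one alert when an incident opens and one when it resolves."""

    def __init__(self, notify: Notifier) -> None:
        self.notify = notify
        self.open_groups: set[str] = set()

    def update(self, group: str, healthy: bool) -> None:
        if not healthy and group not in self.open_groups:
            self.open_groups.add(group)
            self.notify(f"ALERT: incident opened for {group}")
        elif healthy and group in self.open_groups:
            self.open_groups.remove(group)
            self.notify(f"RECOVERED: incident resolved for {group}")


sent: list[str] = []
lifecycle = IncidentLifecycle(sent.append)
lifecycle.update("api", healthy=False)  # opens the incident, sends an alert
lifecycle.update("api", healthy=False)  # still failing, no duplicate alert
lifecycle.update("api", healthy=True)   # resolves, sends a recovery notice
```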

Communicating incidents to users

Use your incident tracking system to populate public status pages automatically. When an incident is detected and logged, your status page should reflect it immediately, with no manual updates required.

This transparency builds trust with users and reduces support ticket volume during outages.
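
As a simple illustration, a status page can be derived entirely from the set of open incidents, so it never needs a manual edit. The function name and the "operational" and "degraded" labels here are placeholders, not a standard.

```python
def status_page_summary(
    all_groups: list[str], open_incidents: dict[str, list[str]]
) -> dict[str, str]:
    """Return a per-group status derived from currently open incidents.

    `open_incidents` maps a group name to its failing checks.
    """
    return {
        group: "degraded" if open_incidents.get(group) else "operational"
        for group in all_groups
    }


summary = status_page_summary(
    ["checkout", "search", "accounts"],
    {"checkout": ["api-health", "db-ping"]},
)
# checkout is shown as degraded; search and accounts as operational
```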

How Sandglass handles incident tracking

Sandglass automatically creates incidents when group-level alert policies are triggered. Each incident tracks which checks are failing, how many consecutive failures occurred, and when the incident started.

When checks recover, the incident is automatically marked as resolved with a timestamp. You can view the last 10 incidents for any group directly in the dashboard, giving you immediate visibility into your reliability history.

This automated approach means every outage is logged with accurate data, and your team can focus on fixing problems instead of documenting them.