MTTR, MTBF, and MTTA Explained: practical guidance from Sandglass on using reliability metrics to improve detection and recovery without turning them into vanity numbers.
MTTA (mean time to acknowledge), MTTR (mean time to recovery), and MTBF (mean time between failures) earn their place only when they change how a team detects and recovers from failures. The goal is to make the operating decision clear before a stressful incident forces the team to improvise.
Track acknowledge time, recovery time, and recurring failure patterns from real incidents, then review outliers after recovery. Sandglass supports the continuous side of this work with checks, incidents, alert routing, and public status visibility.
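As a minimal sketch of how the three metrics can be derived from incident records, the Python below averages acknowledge and recovery times and the gaps between incidents. The Incident record shape and its field names are assumptions made for illustration, not a Sandglass export format, and MTBF is computed under one common convention: the time from one incident's recovery to the next incident's start.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; adapt the fields to your own incident data.
    started: datetime       # when the failure began or was first detected
    acknowledged: datetime  # when a person acknowledged the alert
    recovered: datetime     # when service was restored


def _mean_delta(deltas):
    """Average a list of timedeltas."""
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mtta(incidents):
    """Mean time from start to acknowledgement."""
    return _mean_delta([i.acknowledged - i.started for i in incidents])


def mttr(incidents):
    """Mean time from start to recovery."""
    return _mean_delta([i.recovered - i.started for i in incidents])


def mtbf(incidents):
    """Mean gap between one recovery and the next incident's start."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.recovered for prev, nxt in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("MTBF needs at least two incidents")
    return _mean_delta(gaps)


# Illustrative sample data, not real incidents.
incidents = [
    Incident(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 0, 5), datetime(2024, 1, 1, 0, 35)),
    Incident(datetime(2024, 1, 11, 0, 0), datetime(2024, 1, 11, 0, 15), datetime(2024, 1, 11, 1, 25)),
]

print("MTTA:", mtta(incidents))
print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents))
```

Teams that measure from detection rather than from the start of the failure would change the subtraction in mtta and mttr; whichever convention is chosen, it should stay the same from month to month so the numbers remain comparable.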
Averages hide bad incidents. Look at the worst incidents and repeated causes before celebrating a lower monthly average.
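To illustrate how an average can hide a bad incident, the sketch below puts the mean recovery time next to the median, the worst case, and any cause that appears more than once. The sample data and the summarize helper are hypothetical and exist only for this example.

```python
from collections import Counter
from statistics import mean, median

# Illustrative sample data: (minutes to recover, cause).
recoveries = [
    (12, "expired certificate"),
    (9, "failed deploy"),
    (11, "failed deploy"),
    (240, "database failover"),
    (8, "failed deploy"),
]


def summarize(recoveries):
    """Report the average alongside the figures it tends to hide."""
    minutes = [m for m, _ in recoveries]
    causes = Counter(cause for _, cause in recoveries)
    return {
        "mean": mean(minutes),
        "median": median(minutes),
        "worst": max(minutes),
        # Causes seen more than once are candidates for a lasting fix.
        "repeated_causes": {c: n for c, n in causes.items() if n > 1},
    }


summary = summarize(recoveries)
print(summary)
```

In this sample the mean is 56 minutes even though four of the five incidents recovered in 12 minutes or less. The 240-minute outlier and the three failed deploys are the items worth reviewing, and neither is visible in the average alone.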
Decide which failures actually reach customers before adding any monitoring.
Match each risk to a single HTTP, content, TCP, SSL certificate, or heartbeat check instead of stacking duplicates.
Give each alert one owner and one destination: email, a Slack webhook, or a generic webhook.
Revisit intervals, thresholds, and ownership once a real incident shows what was missing.
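One way to keep the steps above honest is to write the monitoring plan down as data and check it for duplicates and gaps before configuring anything. The sketch below is a hypothetical planning aid in Python, not a Sandglass configuration format or API; the PlannedCheck shape, the identifier spellings, and the validate_plan helper are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

# Check types and destinations named in this guide; the spellings are assumed.
CHECK_TYPES = {"http", "content", "tcp", "ssl_certificate", "heartbeat"}
DESTINATIONS = {"email", "slack_webhook", "generic_webhook"}


@dataclass(frozen=True)
class PlannedCheck:
    risk: str         # the customer-facing failure this check covers
    check_type: str   # one of CHECK_TYPES
    owner: str        # the single person or team that responds
    destination: str  # one of DESTINATIONS


def validate_plan(plan):
    """Return a list of problems; an empty list means the plan passes."""
    problems = []
    # Each risk should map to exactly one check.
    for risk, count in Counter(c.risk for c in plan).items():
        if count > 1:
            problems.append(f"{risk}: {count} checks cover the same risk")
    for c in plan:
        if c.check_type not in CHECK_TYPES:
            problems.append(f"{c.risk}: unknown check type {c.check_type!r}")
        if c.destination not in DESTINATIONS:
            problems.append(f"{c.risk}: unknown destination {c.destination!r}")
        if not c.owner.strip():
            problems.append(f"{c.risk}: no owner assigned")
    return problems


# Illustrative plan with two deliberate mistakes.
plan = [
    PlannedCheck("checkout API returns errors", "http", "payments on-call", "slack_webhook"),
    PlannedCheck("TLS certificate expires", "ssl_certificate", "platform team", "email"),
    PlannedCheck("nightly export stops running", "heartbeat", "", "generic_webhook"),
    PlannedCheck("checkout API returns errors", "tcp", "payments on-call", "email"),
]

for problem in validate_plan(plan):
    print(problem)
```

Running this against the sample plan flags the duplicated checkout risk and the heartbeat check with no owner. Rerunning the same validation after each incident review is a cheap way to confirm that changes to intervals, thresholds, or ownership did not reintroduce stacked checks or unowned alerts.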