Uptime Monitoring Guide: What to Monitor and How

Q: How often should I check uptime?

One minute is a good default for customer-facing endpoints. Use 30-second checks for revenue-critical paths and longer intervals (5–15 minutes) for low-stakes internal surfaces. Pair the interval with a retry threshold so a single dropped request does not trigger an alert.

Q: What uptime percentage is good?

99.9% ("three nines") — about 43 minutes of downtime per month — is a realistic target for most small SaaS products. 99.99% is achievable but requires redundancy and fast detection, not just a stricter monitor. Set a target you can actually defend to customers.
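The downtime figure above is simple arithmetic, and it helps to run it for your own target before promising it. A minimal sketch, assuming a 30-day month; the function name is illustrative and not part of any tool:

```python
def downtime_budget_minutes(target_percent: float, period_days: float = 30.0) -> float:
    """Minutes of downtime allowed in the period at the given uptime target."""
    if not 0.0 <= target_percent <= 100.0:
        raise ValueError("target_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - target_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% uptime allows {minutes:.1f} minutes down per 30 days")
```

At 99.99% the same calculation leaves roughly 4.3 minutes per 30-day month, which is why that target depends on redundancy and fast detection rather than a stricter monitor.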

Q: Which endpoints should I monitor?

Start with the customer-facing endpoints that prove a critical workflow works: sign-in, the main API, checkout, and any health route your customers depend on. Give each check one clear failure mode and owner.

Q: Does a 200 status code mean the site is healthy?

Not always. Applications can return HTTP 200 while rendering an error page or serving stale content. Add a separate content check for a string you expect in the response body to catch these "soft" failures.
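As a rough illustration of a status check paired with a content check, the sketch below uses only the Python standard library. The function names, the example URL, and the expected string are assumptions made for the example, not any product's API:

```python
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class CheckResult:
    healthy: bool
    reason: str


def evaluate_response(status: int, body: str, expected_status: int = 200,
                      expected_text: str = "") -> CheckResult:
    """Fail on a wrong status code, or on a right one with the wrong body."""
    if status != expected_status:
        return CheckResult(False, f"unexpected status {status}")
    if expected_text and expected_text not in body:
        return CheckResult(False, f"expected text {expected_text!r} missing from body")
    return CheckResult(True, "ok")


def check_url(url: str, expected_text: str, timeout: float = 10.0) -> CheckResult:
    """Fetch the URL from the outside and evaluate status and body together."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return evaluate_response(response.status, body, expected_text=expected_text)
    except urllib.error.HTTPError as exc:
        # Non-2xx responses arrive as exceptions; report the status code.
        return CheckResult(False, f"unexpected status {exc.code}")
    except OSError as exc:
        # Covers DNS failures, refused connections, and timeouts.
        return CheckResult(False, f"request failed: {exc}")


if __name__ == "__main__":
    # Hypothetical URL and marker string; replace with your own.
    print(check_url("https://example.com/", expected_text="Example Domain"))
```

Choose a marker string that only appears when the page rendered correctly, such as a label inside the sign-in form, rather than text that also shows up on the error page.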

Q: How does Sandglass fit the advice in this guide?

Sandglass handles the continuous side — checks, incidents, alert routing, and a public status page — so the decisions in this guide turn into monitoring you can rely on.

Choose the smallest set of checks that proves your service works from the outside.

A practical uptime monitoring guide: what to check, from where, how often, and how to set alerts that catch real downtime without drowning your team in noise.

By H. Marcell, Freelance Software Developer

Updated July 17, 2026

H. Marcell is a freelance software developer who builds and runs web services and APIs, and writes about uptime monitoring, incident response, and status-page communication.

What this guide covers

Uptime monitoring answers one question for the people who depend on you: is the service working right now? This guide covers what to monitor, where to monitor it from, how often to check, and how to turn a failed check into an alert someone can act on. The goal is not the most checks — it is the right checks, each proving something a real user would notice.

Monitor from the outside, the way a real user reaches the service.
Give every check one job — a distinct failure it proves, not a duplicate.
An alert is only useful if it names an owner and a next action.

What to actually monitor

Work outward from customer impact. Start with the endpoints a user or an integration hits directly — the marketing site, the login flow, the API base URL, the checkout or payment callback. Then add the invisible dependencies whose failure surfaces later: TLS certificates, DNS resolution, scheduled jobs, and webhook receivers. A useful rule: if it can fail in a way a customer would eventually notice, it deserves one check. If two checks would fail for the exact same reason, keep one.

Public web pages and marketing site (HTTP status check + separate content check).
API endpoints and health routes your customers integrate against.
Login, checkout, and other revenue- or trust-critical flows.
TLS certificate expiry on every HTTPS host.
Scheduled/background jobs via heartbeats.
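Certificate expiry is one of the easier items on this list to check yourself. The sketch below is a minimal example using Python's standard ssl module; the 14-day warning threshold and the host name are assumptions chosen for illustration:

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 14  # assumed threshold; pick one that matches your renewal process


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter string,
    for example 'Jan  5 09:34:43 2018 GMT'."""
    expires_at = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires_at - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    host = "example.com"  # hypothetical host
    remaining = days_until_expiry(fetch_not_after(host), datetime.now(timezone.utc))
    status = "WARN" if remaining < WARN_DAYS else "ok"
    print(f"{host}: {remaining:.1f} days until certificate expiry ({status})")
```

An already expired or otherwise invalid certificate makes the handshake itself fail with an ssl.SSLError, which a real check would also treat as a failure.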

Check interval, timeout, and retries

Interval is how often you check; timeout is how long you wait before calling a check failed; the retry rule decides how many failed attempts it takes to trigger an alert. A one-minute interval with a 2-of-3 retry rule (alert when two of the last three attempts fail) catches genuine outages within a few minutes while absorbing the occasional dropped request that would otherwise page someone at 3am for nothing. Tighter intervals detect outages faster but generate more noise; loosen them for low-stakes surfaces and tighten them for revenue-critical paths. Pick a timeout that reflects real user patience: a request that takes 30 seconds has effectively failed even if it eventually returns.
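One way to read a 2-of-3 retry rule is "alert when at least two of the last three attempts failed". The class below is a small sketch of that reading; the name RetryRule and its parameters are illustrative, and other tools may define the rule differently (for example, as consecutive failures):

```python
from collections import deque


class RetryRule:
    """Alert when at least `failures_needed` of the last `window` attempts failed."""

    def __init__(self, failures_needed: int = 2, window: int = 3) -> None:
        if not 1 <= failures_needed <= window:
            raise ValueError("failures_needed must be between 1 and window")
        self.failures_needed = failures_needed
        self._recent = deque(maxlen=window)  # oldest result drops off automatically

    def record(self, succeeded: bool) -> bool:
        """Record one attempt and return True if the rule says to alert."""
        self._recent.append(succeeded)
        failures = sum(1 for ok in self._recent if not ok)
        return failures >= self.failures_needed


if __name__ == "__main__":
    rule = RetryRule()
    for attempt, ok in enumerate([True, False, True, False, False, True, True], start=1):
        print(f"attempt {attempt}: ok={ok} alert={rule.record(ok)}")
```

With a one-minute interval, a single dropped request never alerts under this rule, while a real outage alerts on the second failed attempt, about two minutes after it starts.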

Turning a failed check into a useful alert

Detection is only half the job. Route each alert to a destination with a clear owner — email for low-urgency checks, a Slack channel for the team, or a generic webhook into whatever you already use for on-call. Group checks by service and environment so a staging failure never pages the person handling a production outage, and so one alert points unambiguously at one thing to fix.
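The grouping idea can be made concrete with a routing table keyed by service and environment. This is a sketch under stated assumptions: the services, owners, and destinations below are placeholders, and the table format is invented for the example rather than taken from any tool:

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Route:
    owner: str        # the person or rota accountable for the alert
    channel: str      # "email", "slack", or "webhook"
    destination: str  # address or URL (placeholders here)


# Hypothetical routing table keyed by (service, environment).
ROUTES: Dict[Tuple[str, str], Route] = {
    ("checkout", "production"): Route("payments-oncall", "webhook", "https://oncall.example.invalid/hook"),
    ("api", "production"): Route("platform-team", "slack", "https://chat.example.invalid/hook"),
    ("api", "staging"): Route("platform-team", "email", "platform@example.invalid"),
}


def route_alert(service: str, environment: str) -> Route:
    """Return the single route for a failed check, or fail loudly if none exists."""
    try:
        return ROUTES[(service, environment)]
    except KeyError:
        raise LookupError(f"no owner configured for {service}/{environment}") from None


if __name__ == "__main__":
    print(route_alert("api", "staging"))
    print(route_alert("checkout", "production"))
```

Raising an error for an unrouted check is a deliberate choice in this sketch: a check without an owner is caught when it is configured, not during an outage.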

How Sandglass supports the practice

Start with an HTTP status check on each customer-facing URL and set the expected status code. Where the response body also matters, add a separate content check for a string you expect so a 200-that-serves-an-error-page still fails. Add an SSL certificate check on every HTTPS endpoint so an expiring certificate pages you days ahead, not at the moment browsers start refusing connections. Use TCP/port checks for non-HTTP services like databases or mail, and heartbeat checks for scheduled jobs no user watches directly.

Add a content check alongside the HTTP status check to catch soft failures in the response body.
Watch certificate expiry with a dedicated SSL check so renewals never surprise you.
Set a sensible interval and a retry threshold so one dropped packet does not page anyone.
Back the practices here with HTTP, ping, TCP, content, SSL certificate, and heartbeat checks.
Route incidents to email, Slack webhook channels, and generic webhooks so the right people respond fast.
Use a public status page to keep customers informed while the team works the incident.
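Heartbeat checks invert the usual direction: the job reports in, and silence is the failure. The function below sketches the general idea only; it is not Sandglass's implementation or API, and the grace period is an assumed parameter:

```python
from datetime import datetime, timedelta, timezone


def heartbeat_overdue(last_ping: datetime, expected_every: timedelta,
                      grace: timedelta, now: datetime) -> bool:
    """True when the job has been silent longer than its schedule plus a grace period."""
    return now > last_ping + expected_every + grace


if __name__ == "__main__":
    # Hypothetical hourly job that last reported 90 minutes ago.
    now = datetime.now(timezone.utc)
    last_ping = now - timedelta(minutes=90)
    overdue = heartbeat_overdue(last_ping, timedelta(hours=1), timedelta(minutes=5), now)
    print(f"hourly job overdue: {overdue}")
```

The grace period plays the same role for heartbeats that the retry threshold plays for HTTP checks: it absorbs a job that runs a little late without alerting anyone.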

Common mistakes to avoid

More checks do not mean better monitoring. Duplicating the same endpoint across five checks multiplies alerts without adding signal, and makes ownership murky during a real incident. Monitoring only the homepage misses the API your customers actually integrate with.
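If your checks live in a list or an export, duplicates are easy to spot mechanically. A minimal sketch, assuming each check is described by a name, a kind, and a target; that tuple shape is invented for the example:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_checks(checks: Iterable[Tuple[str, str, str]]) -> Dict[Tuple[str, str], List[str]]:
    """Group check names by (kind, target) and keep only groups with more than one."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for name, kind, target in checks:
        groups[(kind, target)].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical inventory of checks.
    inventory = [
        ("homepage", "http", "https://example.com/"),
        ("homepage-backup", "http", "https://example.com/"),
        ("api-health", "http", "https://api.example.com/health"),
        ("homepage-cert", "ssl", "https://example.com/"),
    ]
    for (kind, target), names in find_duplicate_checks(inventory).items():
        print(f"{kind} {target}: {', '.join(names)}")
```

Checks of different kinds on the same target, such as an HTTP check and an SSL check, are not flagged, because they fail for different reasons.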

Implementation checklist

Step 1: Start from customer impact

Decide which failures actually reach customers before adding any monitoring.

Step 2: Choose one signal per risk

Match each risk to a single HTTP, content, TCP, SSL certificate, or heartbeat check instead of stacking duplicates.

Step 3: Assign an owner and a channel

Give each alert one owner and one destination — email, a Slack webhook, or a generic webhook.

Step 4: Review after real incidents

Revisit intervals, thresholds, and ownership once a real incident shows what was missing.

