On-Call Rotation for a Startup

Q: How small a team can run an on-call rotation?

Even two or three engineers can, though the rotation is demanding. Keep the paging bar high so off-hours alerts are rare, and be generous with compensation or recovery time. Below that size, "best-effort during waking hours" plus a high-severity-only pager is often more honest.

Q: How do I stop on-call from causing burnout?

Reduce the number of pages, not just spread them out. Route only genuine customer-affecting issues to the pager, document known fixes, rotate fairly, and compensate on-call time. Most burnout comes from noisy, unactionable alerts rather than from the rotation itself.

Q: How does Sandglass fit the advice in this guide?

Sandglass handles the continuous side — checks, incidents, alert routing, and a public status page — so the decisions in this guide turn into monitoring you can rely on.

Assign responsibility for production alerts without burning out a tiny team.

How to set up an on-call rotation at a startup without burning out a small team: what should page overnight, how to rotate fairly, and how to keep alert noise down.

By H. Marcell, Freelance Software Developer

Updated July 17, 2026

H. Marcell is a freelance software developer who builds and runs web services and APIs, and writes about uptime monitoring, incident response, and status-page communication.

What this guide covers

On-call is how a small team makes sure someone responds to production problems outside working hours — without it landing on the same person every night. This guide covers what should and should not page overnight, how to structure a fair rotation on a small team, and how to keep the alert volume low enough that on-call stays sustainable.

Only true production issues should page overnight.
Documented known fixes shorten every shift.
Rotating primary ownership prevents one person from burning out.

What should page overnight

The bar for waking someone is high: the issue is affecting customers now and cannot wait until morning. A down checkout flow qualifies; a failed nightly report that can rerun does not. Everything else — degraded-but-working services, staging failures, low-priority background jobs — should route to a channel or email for daytime triage. If you would not personally want to be woken for it, do not route it to the pager.

Page: customer-facing outages, data loss risk, security events.
Do not page: staging failures, non-urgent job retries, informational alerts.
Add a retry threshold so one transient blip never pages anyone.
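
As a rough illustration of the retry threshold above, the sketch below pages only after several consecutive failed checks, so a single transient blip is ignored. The class name FailureGate and the default of three failures are assumptions made for this example, not settings from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class FailureGate:
    """Pages only after `threshold` consecutive failed checks (hypothetical helper)."""

    threshold: int = 3  # assumed default; tune per check
    consecutive_failures: int = 0

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True when a page should be sent."""
        if check_ok:
            self.consecutive_failures = 0  # any success resets the streak
            return False
        self.consecutive_failures += 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self.consecutive_failures == self.threshold


gate = FailureGate(threshold=3)
results = [True, False, True, False, False, False, False]
pages = [gate.record(ok) for ok in results]
print(pages)
```

In this run the isolated failure never pages; only the third failure in a row does. A higher threshold trades slower detection for fewer false pages, so it is worth setting per check rather than globally.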

Structuring a fair rotation on a small team

With only a few engineers, keep it simple: one primary per week (or per few days), rotating through everyone, with an optional secondary as backup for when the primary cannot respond. Publish the schedule so everyone knows when they are up. Compensate on-call time in whatever way fits your culture — time off, pay, or reduced daytime load — because unpaid, unacknowledged on-call is how burnout and resentment start.
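
A minimal sketch of the weekly rotation described above, assuming a fixed list of engineers and one-week shifts. The function name build_rotation, the engineer names, and the choice to make next week's primary this week's secondary are all illustrative assumptions.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return (week_start, primary, secondary) tuples, one per week.

    The secondary is simply next week's primary, so the backup is the person
    about to take over. That pairing is an assumption of this sketch.
    """
    if len(engineers) < 2:
        raise ValueError("a primary plus a secondary needs at least two engineers")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


schedule = build_rotation(["ana", "ben", "chen"], date(2026, 7, 20), weeks=4)
for week_start, primary, secondary in schedule:
    print(week_start.isoformat(), "primary:", primary, "secondary:", secondary)
```

Generating the schedule several weeks ahead makes it easy to publish and to spot swaps that need arranging around holidays.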

Reducing the load, not just distributing it

The best on-call improvement is fewer pages. After each incident, ask whether the alert was actionable and whether the fix can be documented or automated. A runbook entry that turns a 2am investigation into a two-minute known fix pays for itself immediately. Over time, tuning thresholds and eliminating noisy checks matters more than any scheduling tweak.
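
One way to make that post-incident question concrete is to log, for each page, whether it was actionable, then list the alerts that rarely are. The sketch below assumes such a log exists as (alert name, was actionable) pairs; the function name, the alert names, and both cutoffs are hypothetical.

```python
from collections import defaultdict


def noisy_alerts(page_log, min_pages=3, max_actionable_ratio=0.5):
    """Return alert names whose pages were mostly not actionable.

    page_log is an iterable of (alert_name, was_actionable) pairs recorded
    during post-incident review. Both cutoffs are assumed values.
    """
    totals = defaultdict(int)
    actionable = defaultdict(int)
    for name, was_actionable in page_log:
        totals[name] += 1
        if was_actionable:
            actionable[name] += 1
    return sorted(
        name
        for name, total in totals.items()
        if total >= min_pages and actionable[name] / total < max_actionable_ratio
    )


log = [
    ("checkout-down", True),
    ("checkout-down", True),
    ("checkout-down", True),
    ("disk-80-percent", False),
    ("disk-80-percent", False),
    ("disk-80-percent", True),
    ("disk-80-percent", False),
    ("staging-deploy", False),
]
candidates = noisy_alerts(log)
print(candidates)
```

Alerts that show up in this list are candidates for a higher threshold, a runbook entry, or removal from the pager. Alerts with only a handful of pages are skipped because the sample is too small to judge.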

How Sandglass supports the practice

Route alerts by severity in Sandglass to separate what pages from what waits: production-down checks go to whichever destination pages your on-call person, while low-urgency and staging checks go to email or a channel for daytime review. Group checks by environment so the split is clean and nobody is woken by a staging deploy.

Back the practices here with HTTP, ping, TCP, content, SSL certificate, and heartbeat checks.
Route incidents to email, Slack webhook channels, and generic webhooks so the right people respond fast.
Use a public status page to keep customers informed while the team works the incident.
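
The severity split described in this section can be expressed as a small decision rule. The sketch below is generic Python and not Sandglass configuration or API; the function name route_alert and the destination labels are assumptions chosen for illustration.

```python
def route_alert(environment: str, customer_impact: bool) -> str:
    """Pick a destination for an alert (labels are illustrative only)."""
    if environment == "production" and customer_impact:
        return "pager"  # affecting customers now: wake someone
    if environment == "production":
        return "channel"  # degraded but working: daytime triage
    return "email"  # staging and everything else waits until morning


for env, impact in [("production", True), ("production", False), ("staging", True)]:
    print(env, impact, "->", route_alert(env, impact))
```

Keeping the rule this small makes it easy to audit: only one branch can ever reach the pager.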

Common mistakes to avoid

The fastest way to ruin on-call is routing every minor staging failure to the same person who handles production outages. When most overnight pages are noise, people stop trusting the pager and miss the one that matters. Ruthlessly separate "wake someone up" alerts from "look at this tomorrow" alerts.

Implementation checklist

Step 1: Start from customer impact

Decide which failures actually reach customers before adding any monitoring.

Step 2: Choose one signal per risk

Match each risk to a single HTTP, content, TCP, SSL certificate, or heartbeat check instead of stacking duplicates.

Step 3: Assign an owner and a channel

Give each alert one owner and one destination — email, a Slack webhook, or a generic webhook.

Step 4: Review after real incidents

Revisit intervals, thresholds, and ownership once a real incident shows what was missing.
