
The 3-pager that fixed on-call

Problem

  • Ambiguous scope: no one agreed on what on-call "owned."
  • Severity levels meant different things to different teams.
  • Excessive noise → false pages → responder fatigue.
  • Runbooks existed but weren't authoritative or consistently used.

The solution (3 pages, one owner)

  1. Scope. Systems and services explicitly in scope; a defined escalation path for everything else.
  2. Severity. SEV-1 (customer/business critical), SEV-2 (degraded/functional), SEV-3 (nuisance/ops toil), each with concrete examples.
  3. Paging rules. Pages only for SEV-1 and high-confidence SEV-2 signals. Everything else → ticket + business hours.
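
The three pages above boil down to one small routing decision per alert. Here is a minimal sketch of that decision in Python. The service names in `IN_SCOPE`, the `confidence` field, and the 0.9 threshold are all hypothetical stand-ins for illustration; the real scope list and whatever counts as "high-confidence" come from the document itself.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # customer/business critical
    SEV2 = 2  # degraded/functional
    SEV3 = 3  # nuisance/ops toil


class Action(Enum):
    PAGE = "page"          # wake someone up
    TICKET = "ticket"      # handled during business hours
    ESCALATE = "escalate"  # out of scope for on-call


# Hypothetical values, assumed for this sketch only.
IN_SCOPE = {"checkout-api", "auth"}
SEV2_CONFIDENCE_THRESHOLD = 0.9


@dataclass(frozen=True)
class Alert:
    service: str
    severity: Severity
    confidence: float  # 0.0 to 1.0: how sure we are the signal is real


def route(alert: Alert) -> Action:
    """Apply the scope page first, then the severity and paging pages."""
    if alert.service not in IN_SCOPE:
        return Action.ESCALATE
    if alert.severity is Severity.SEV1:
        return Action.PAGE
    if (alert.severity is Severity.SEV2
            and alert.confidence >= SEV2_CONFIDENCE_THRESHOLD):
        return Action.PAGE
    # Low-confidence SEV-2 and all SEV-3 become tickets.
    return Action.TICKET
```

Checking scope before severity is a deliberate choice in this sketch: an out-of-scope alert never pages, no matter how loud it is.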

Guardrails we added

  • Golden signals. Latency, traffic, errors, saturation — per service.
  • Runbook links inline for each common alert.
  • Comms script. Who says what, where, and when (Slack / status page / email).
  • Single owner for updates to prevent drift.
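
The first two guardrails are easy to check mechanically. The sketch below is one way to lint alert definitions, assuming a hypothetical `AlertRule` record with a `signal` and a `runbook_url` field; it does not mirror any particular monitoring tool's schema, and the URL is a placeholder.

```python
from dataclasses import dataclass

GOLDEN_SIGNALS = {"latency", "traffic", "errors", "saturation"}


@dataclass(frozen=True)
class AlertRule:
    # Hypothetical shape for an alert definition.
    name: str
    service: str
    signal: str       # expected to be one of GOLDEN_SIGNALS
    runbook_url: str  # inline link shown with the alert


def validate(rule: AlertRule) -> list[str]:
    """Return a list of problems; an empty list means the rule passes."""
    problems = []
    if rule.signal not in GOLDEN_SIGNALS:
        problems.append(f"{rule.name}: '{rule.signal}' is not a golden signal")
    if not rule.runbook_url.startswith("https://"):
        problems.append(f"{rule.name}: missing runbook link")
    return problems


def missing_signals(rules: list[AlertRule], service: str) -> set[str]:
    """Golden signals that no rule covers yet for the given service."""
    covered = {r.signal for r in rules if r.service == service}
    return GOLDEN_SIGNALS - covered


rules = [
    AlertRule("HighLatency", "checkout-api", "latency",
              "https://runbooks.example.com/checkout-api/high-latency"),
]
```

Here `missing_signals(rules, "checkout-api")` reports the three signals that still need coverage, which the single owner can use as a review checklist.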

Results

  • ~40% fewer pages (noise removed), better sleep, better focus.
  • Meaningful pages → faster time-to-mitigation and clearer handoffs.
  • Post-incident reviews improved because severity was consistent org-wide.

Short, living, and owned. That's what made it work.
