On-Call Practices - Site Reliability Engineering

The Weight of the Pager

Being on-call is perhaps the most stressful part of an SRE's job. It means you are the designated primary responder for any production incident that occurs during your shift, at any hour of the day or night.

In many companies, on-call is a source of burnout and turnover. In a mature SRE organization, on-call is managed with explicit, quantitative limits so that it stays sustainable.

The Rules of Sustainable On-Call

Google's SRE practices set out several firm rules for healthy on-call rotations:

1. The Minimum Team Size

A single-site rotation covering 24/7 should have a minimum of 8 engineers. When coverage is split with an international team in a "Follow the Sun" model, Google's guidance is a minimum of about 6 engineers per site.

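The Math: If your team has only 2 people, each of you is on-call every other week. This is a recipe for burnout. With 8 engineers, week-long shifts, and one person on-call at a time, each engineer carries the pager one week in eight. You need enough people that on-call comes around only once every 1-2 months.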
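The arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming week-long shifts shared evenly across the team; the function names and the people_per_shift parameter (for example, a primary plus a secondary) are hypothetical, not part of any published tooling.

```python
from fractions import Fraction


def on_call_fraction(team_size: int, people_per_shift: int = 1) -> Fraction:
    """Share of all shifts that each engineer carries, assuming an even split."""
    if team_size < 1 or people_per_shift < 1:
        raise ValueError("team_size and people_per_shift must be at least 1")
    if people_per_shift > team_size:
        raise ValueError("cannot staff a shift with more people than the team has")
    return Fraction(people_per_shift, team_size)


def weeks_between_shifts(team_size: int, people_per_shift: int = 1) -> Fraction:
    """With week-long shifts, the number of weeks from one shift start to the next."""
    return 1 / on_call_fraction(team_size, people_per_shift)


# Two people, one on-call at a time: each is on-call every other week.
two_person_team = weeks_between_shifts(2)

# Eight people, one on-call at a time: each is on-call one week in eight.
eight_person_team = weeks_between_shifts(8)

# Eight people with a primary and a secondary: one week in four, i.e. 25% of weeks.
eight_with_secondary = weeks_between_shifts(8, people_per_shift=2)
```

Under these assumptions, an 8-person team that staffs both a primary and a secondary lands each engineer on-call for a quarter of all weeks.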

2. Operational Load Limit

No SRE should spend more than 25% of their total time on on-call duties (the shift itself plus the follow-up work). If the pager is always ringing, the team has no time to fix the underlying system, and reliability spirals downward.
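A team could track this limit with a simple calculation. The sketch below assumes you record the hours actually spent on on-call work; the names and the sample numbers are hypothetical.

```python
ON_CALL_BUDGET = 0.25  # the 25% limit described above


def on_call_share(shift_hours: float, follow_up_hours: float, total_hours: float) -> float:
    """Fraction of total working time spent on on-call work, follow-up included."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return (shift_hours + follow_up_hours) / total_hours


def over_budget(shift_hours: float, follow_up_hours: float, total_hours: float) -> bool:
    """True when on-call work exceeds the 25% budget."""
    return on_call_share(shift_hours, follow_up_hours, total_hours) > ON_CALL_BUDGET


# Hypothetical month of 160 working hours:
# 30 hours handling pages plus 10 hours of follow-up is exactly at the limit.
at_limit = over_budget(30, 10, 160)

# 40 hours handling pages plus 20 hours of follow-up is over the limit.
too_much = over_budget(40, 20, 160)
```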

3. Maximum 2 Incidents per Shift

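A healthy on-call shift should average two or fewer serious incidents per 12-hour shift. Google's reasoning is that a single incident, including root-cause analysis, remediation, and follow-up such as the postmortem, takes roughly 6 hours to handle properly. If an engineer is being paged 20 times a night, they are suffering from alert fatigue: they stop being effective and start making mistakes.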
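Because the rule is stated as an average, it is checked over a run of shifts rather than a single bad night. A minimal sketch, with hypothetical names and sample counts:

```python
from statistics import mean

MAX_AVG_INCIDENTS_PER_SHIFT = 2  # threshold taken from the rule above


def rotation_is_overloaded(incidents_per_shift: list[int]) -> bool:
    """True when the average number of serious incidents per shift exceeds the limit."""
    if not incidents_per_shift:
        raise ValueError("need at least one shift to compute an average")
    return mean(incidents_per_shift) > MAX_AVG_INCIDENTS_PER_SHIFT


# One rough shift does not break the rule if the average stays at or below two.
healthy = rotation_is_overloaded([1, 2, 0, 3])

# Being paged around 20 times a night clearly does.
unhealthy = rotation_is_overloaded([20, 18, 25])
```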

Actionable Alerts vs. Noise

The most important way to make on-call sustainable is to ruthlessly delete "Noisy" alerts.

An alert should page someone only if both of the following are true:

There is a clear, immediate threat to the SLO.
A human must take immediate action to fix it.

If an alert fires saying "CPU is at 90%" but the website is still fast and healthy, that alert is noise. It should be a warning on a dashboard, not a page that wakes someone up at 3 AM.
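The two conditions amount to a simple routing decision. The sketch below is illustrative only; Route and route_alert are hypothetical names, and real alerting systems express this in their own rule syntax.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"            # wake a human up
    DASHBOARD = "dashboard"  # record it as a warning for later review


def route_alert(threatens_slo: bool, needs_human_now: bool) -> Route:
    """Page only when the SLO is at risk AND a human must act immediately."""
    if threatens_slo and needs_human_now:
        return Route.PAGE
    return Route.DASHBOARD


# "CPU is at 90%" while the site is fast and healthy: no SLO threat, so no page.
cpu_alert = route_alert(threatens_slo=False, needs_human_now=False)

# Error rate burning through the SLO with no automatic recovery: page.
error_alert = route_alert(threatens_slo=True, needs_human_now=True)
```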

The Power of the Runbook

Every alert that triggers a page should include a link to a Runbook.

A Runbook is a step-by-step guide written by the engineers who built the system. It should answer:

What is this? (A brief description of the service).
How do I verify it? (Which dashboard should I look at?).
How do I fix it? (Step 1: Restart X. Step 2: Flush Y).
Who do I call next? (Escalation paths for specialists).

A great runbook can turn a terrifying 3 AM outage into a 5-minute routine procedure.
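One way to keep runbooks complete is to treat the four questions as required fields. The following is a sketch under that assumption; the Runbook class, its field names, and the sample content are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    description: str = ""                                 # What is this?
    verification: str = ""                                # How do I verify it?
    fix_steps: list[str] = field(default_factory=list)    # How do I fix it?
    escalation: list[str] = field(default_factory=list)   # Who do I call next?

    def missing_sections(self) -> list[str]:
        """Names of the sections that are still empty."""
        missing = []
        if not self.description.strip():
            missing.append("description")
        if not self.verification.strip():
            missing.append("verification")
        if not self.fix_steps:
            missing.append("fix_steps")
        if not self.escalation:
            missing.append("escalation")
        return missing


# Hypothetical runbook with every section filled in.
complete = Runbook(
    description="Checkout API: handles payment requests.",
    verification="Check the checkout latency dashboard.",
    fix_steps=["Restart X", "Flush Y"],
    escalation=["Payments specialist on-call"],
)

# A draft that only has a description so far.
draft = Runbook(description="Search indexer.")
```

A check like missing_sections() could run before an alert is allowed to page, so that no page ships without a usable runbook.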

Following the Sun

To achieve 24/7 coverage without waking people up in the middle of the night, global companies use a Follow the Sun rotation, in which the pager moves between offices in different time zones.

Team A in London handles the pager from 9 AM to 5 PM GMT.
Team B in San Francisco takes over as London finishes their day.
Team C in Sydney takes over as San Francisco finishes.

No one ever gets paged while they are sleeping. This is the gold standard for SRE sustainability.
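The rotation can be expressed as a lookup from the current UTC hour to the team holding the pager. The text gives only London's window (9 AM to 5 PM GMT); the sketch below assumes three equal 8-hour blocks for the other two teams and ignores daylight saving time. SCHEDULE and team_on_call are hypothetical names.

```python
from datetime import datetime, timezone

# (start_hour_utc, end_hour_utc, team); end is exclusive.
# Only the London window comes from the text; the other two are assumed.
SCHEDULE = [
    (9, 17, "London"),
    (17, 1, "San Francisco"),  # wraps past midnight UTC
    (1, 9, "Sydney"),
]


def team_on_call(moment: datetime) -> str:
    """Return the team holding the pager at a timezone-aware moment."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    hour = moment.astimezone(timezone.utc).hour
    for start, end, team in SCHEDULE:
        if start < end:
            if start <= hour < end:
                return team
        elif hour >= start or hour < end:  # window wraps past midnight
            return team
    raise LookupError(f"no team covers hour {hour} UTC")


# Hypothetical check: who holds the pager at 10:00 UTC on an arbitrary date?
mid_morning = team_on_call(datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc))
```

Because the three windows tile the full 24 hours, every hour maps to exactly one team; a gap in SCHEDULE would surface as a LookupError rather than an unanswered page.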

Post-Call: The Handoff

When your shift ends, the work isn't done. You must perform a Handoff.

Document every incident that occurred.
Highlight any "near misses."
Ensure the next person knows about any ongoing issues.
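A handoff is easier to do consistently when it follows a fixed template. The sketch below models the three items above as a small record that renders to plain text; HandoffNote and the sample entries are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffNote:
    incidents: list[str] = field(default_factory=list)
    near_misses: list[str] = field(default_factory=list)
    ongoing_issues: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Plain-text summary; empty sections are shown explicitly as (none)."""
        sections = [
            ("Incidents", self.incidents),
            ("Near misses", self.near_misses),
            ("Ongoing issues", self.ongoing_issues),
        ]
        lines: list[str] = []
        for title, items in sections:
            lines.append(f"{title}:")
            if items:
                lines.extend(f"  - {item}" for item in items)
            else:
                lines.append("  (none)")
        return "\n".join(lines)


# Hypothetical end-of-shift note.
note = HandoffNote(
    incidents=["Checkout latency page, mitigated by rollback"],
    ongoing_issues=["Disk usage on one database host still climbing"],
)
summary = note.render()
```

Printing empty sections as "(none)" makes it clear that the outgoing engineer checked for near misses rather than forgetting to list them.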

By sharing the burden and focusing on sustainability, SREs ensure that the "human" part of the system is just as reliable as the "software" part.