The Reliability Stack
Every SRE team operates on three levels of commitment. To understand them, think of an airline:
- SLI: "What fraction of our planes took off on time today?" (The raw data).
- SLO: "Our target is that 98% of flights take off within 15 minutes of schedule." (The goal).
- SLA: "If we are more than 4 hours late, we will give you a $200 voucher." (The legal penalty).
1. SLI: Service Level Indicator
An SLI is a quantitative measure of some aspect of the level of service that is provided.
The Formula for a Good SLI
A good SLI is always expressed as a ratio:
SLI = (Good Events / Total Events) * 100
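The ratio above can be sketched in a few lines of Python. The function name `sli_percent` and the sample counts are illustrative assumptions, as is the choice to report 100% when there is no traffic.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Return the SLI as the percentage of events that were good."""
    if total_events == 0:
        # Assumption: with no traffic there is nothing to count as bad,
        # so this sketch reports 100%. Other conventions are possible.
        return 100.0
    return good_events / total_events * 100


# Hypothetical counts: 9,985 good requests out of 10,000.
print(f"{sli_percent(9_985, 10_000):.2f}%")
```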
The Four Golden Signals
Google's SRE book identifies four metrics to focus on if you can measure only four things about a user-facing system:
- Latency: The time it takes to service a request. (Crucial: track the latency of failed requests separately from successful ones. A fast error can make overall latency look better than it is, and a slow error is worse than a fast one.)
- Traffic: A measure of how much demand is being placed on your system. (e.g., HTTP requests per second).
- Errors: The rate of requests that fail, either explicitly (HTTP 500 errors), implicitly (200 OK but with wrong data), or by policy (for example, if you committed to 1-second responses, any response slower than 1 second counts as an error).
- Saturation: How "full" your service is. (e.g., CPU, Memory, or database connection pool limits).
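As a rough sketch of how the four signals might be derived from one monitoring window, the Python below summarizes a list of request records. The `Request` type, the `golden_signals` function, the connection-pool numbers, and the 1-second latency policy are all assumptions made for illustration. Implicit errors (200 OK with wrong data) cannot be seen from status codes alone, so they are not counted here.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    status: int        # HTTP status code returned
    latency_ms: float  # time taken to serve the request


def golden_signals(requests, window_seconds, pool_in_use, pool_size,
                   latency_policy_ms=1000):
    """Summarize one monitoring window into the four signals."""
    # Assumption: only 5xx responses count as explicit failures.
    failed = [r for r in requests if r.status >= 500]
    succeeded = [r for r in requests if r.status < 500]
    # Errors by policy: successful responses slower than the committed limit.
    too_slow = [r for r in succeeded if r.latency_ms > latency_policy_ms]
    total = len(requests)
    ok_latencies = [r.latency_ms for r in succeeded]
    failed_latencies = [r.latency_ms for r in failed]
    return {
        # Latency is reported separately for successes and failures.
        "latency_ok_ms": median(ok_latencies) if ok_latencies else None,
        "latency_failed_ms": median(failed_latencies) if failed_latencies else None,
        "traffic_rps": total / window_seconds,
        "error_rate": (len(failed) + len(too_slow)) / total if total else 0.0,
        "saturation": pool_in_use / pool_size,
    }


sample = [Request(200, 100), Request(200, 1500), Request(500, 5), Request(200, 200)]
signals = golden_signals(sample, window_seconds=2, pool_in_use=8, pool_size=10)
print(signals)
```

Note how the failed request returns in 5ms: mixed into a single latency figure, it would make the service look faster than it is.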
2. SLO: Service Level Objective
An SLO is a target value or range of values for a service level that is measured by an SLI.
Choosing a Good SLO
Setting an SLO is a business decision as much as a technical one. You must balance reliability against cost: a 99.99% target requires more redundant servers, a more complex architecture, and slower, more cautious deployments than a 99.0% target.
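To make the cost of each extra "nine" concrete, the sketch below converts an availability target into the downtime it permits. The 30-day window and the function name `allowed_downtime_minutes` are assumptions for illustration; real SLOs may use a different window or count failed requests rather than minutes.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime a given availability target permits."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per 30 days")
```

Over 30 days, 99.0% allows 432 minutes of downtime, 99.9% allows about 43 minutes, and 99.99% allows just over 4 minutes.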
Rules for Setting SLOs:
- Don't target 100%: You will never hit it, and you'll go bankrupt trying.
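- Stay ahead of the SLA: Set the internal SLO stricter than the legal SLA so there is a safety margin. For example, if your SLA is 99.5%, an internal SLO of 99.9% gives you room to react before penalties apply.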
- Measure from the User's Perspective: A 99.9% "Database Uptime" means nothing if the frontend API is broken. Measure the API endpoint the user actually hits.
3. SLA: Service Level Agreement
An SLA is an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.
SREs generally do not write SLAs. Lawyers and business development teams write them. SREs simply ensure that the system stays above the SLO so the business never has to pay out the SLA penalties.
Example: The "Profile Service"
Imagine an API that serves user profile data.
| Component | Target / Value |
|---|---|
| SLI (Availability) | Success Ratio = (2xx Responses / Total Requests) |
| SLI (Latency) | 90th Percentile Latency (p90) |
| SLO (Internal Target) | 99.95% Availability; 90% of requests < 300ms. |
| SLA (Legal Contract) | >99.0% Uptime; >50% credit if missed. |
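A minimal sketch of checking the internal targets from the table against a batch of observations follows. The function `check_profile_slo` and the sample data are hypothetical; the thresholds (99.95% availability, 90% of requests under 300ms) come from the table.

```python
def check_profile_slo(statuses, latencies_ms):
    """Compare one batch of requests against the Profile Service SLO."""
    # Availability SLI: share of responses with a 2xx status.
    availability = sum(200 <= s < 300 for s in statuses) / len(statuses)
    # Latency SLI: share of requests faster than 300ms.
    fast_fraction = sum(ms < 300 for ms in latencies_ms) / len(latencies_ms)
    return {
        "availability": availability,
        "fast_fraction": fast_fraction,
        "availability_ok": availability >= 0.9995,
        "latency_ok": fast_fraction >= 0.90,
    }


# Hypothetical batch: 10,000 requests, 4 failures, 8% slower than 300ms.
statuses = [200] * 9_996 + [500] * 4
latencies_ms = [120] * 9_200 + [450] * 800
print(check_profile_slo(statuses, latencies_ms))
```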
Why p99 matters more than Average
In SRE, never use Averages for latency.
If 99 people have a 10ms experience and 1 person has a 10,000ms experience, the "Average" shows a healthy-looking 109.9ms (about 110ms). But for that 1 person, the site is completely unusable.
SREs use percentiles (p95, p99, p99.9). If your p99 is 500ms, 99% of requests completed in 500ms or less, and the slowest 1% took longer. This is a much truer representation of user happiness.
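The example numbers can be checked with a short sketch. The `percentile` helper below uses the nearest-rank definition, which is one of several common definitions and is an assumption here; monitoring systems often interpolate or estimate percentiles from histograms instead.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least
    p percent of all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# The scenario from the text: 99 fast requests and 1 very slow one.
latencies_ms = [10] * 99 + [10_000]

print("mean :", mean(latencies_ms))
print("p50  :", percentile(latencies_ms, 50))
print("p99  :", percentile(latencies_ms, 99))
print("p99.9:", percentile(latencies_ms, 99.9))
```

With exactly one slow request in 100, the nearest-rank p99 is still 10ms and only p99.9 surfaces the 10,000ms request. This is one reason to watch several percentiles rather than a single one, and to make sure the sample is large enough for the percentile you report.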
Summary
- SLI: Use the Four Golden Signals to measure raw performance.
- SLO: Set an internal target that keeps the team focused.
- SLA: Keep the legal promise as the absolute floor.
- Budgets: Use the error budget, the gap between 100% and the SLO, to manage risk.