The SRE Observability Stack
The Three Pillars
┌─ Observability ──────────────────────┐
│                                      │
│  Metrics (Quantitative)              │
│    What we measure numerically       │
│    Prometheus, Datadog, etc.         │
│                                      │
│  Logs (Discrete Events)              │
│    What happened at specific times   │
│    ELK, Datadog, Splunk, etc.        │
│                                      │
│  Traces (Request Paths)              │
│    How requests flow through system  │
│    Jaeger, Datadog, New Relic, etc.  │
└──────────────────────────────────────┘
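To make the split concrete, here is how a single failed request might surface in each pillar. This is an illustrative sketch only; the field names and values are made up for the example, not any vendor's schema:

```python
import json

# One failed request, seen through each of the three pillars
request = {"path": "/api/users", "status": 500, "duration_ms": 150}

# Metric: an aggregated counter -- cheap to store, but no per-request detail
metrics = {'http_requests_total{status="500"}': 0}
metrics['http_requests_total{status="500"}'] += 1

# Log: the full event as a structured record, searchable later
log_line = json.dumps({"level": "error", **request})

# Trace: the request's position in a call graph across services
span = {
    "trace_id": "abc123",          # hypothetical ID for illustration
    "name": "GET /api/users",
    "parent": "api-gateway",
    "duration_ms": request["duration_ms"],
}
```

The same event appears three times, at three levels of detail: the metric answers "how often?", the log answers "what exactly happened?", and the trace answers "where in the system?".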
Metrics: The Foundation
What is a Metric?
A metric is a numerical measurement of system behavior at a point in time.
Examples:
- CPU usage: 65%
- Memory usage: 4.2 GB
- Requests per second: 1,250
- Error rate: 0.5%
- Response latency: 150ms (p99)
- Database connections: 380/500
Time-series: How these change over time
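The time-series aspect is what makes raw counters useful: most of the numbers above are derived from them. A simplified sketch of how a counter sample stream becomes a per-second rate (real Prometheus `rate()` also extrapolates to the window edges and handles counter resets, which this omits):

```python
# Simplified core of a PromQL-style rate(): delta over elapsed time.
# samples: list of (timestamp_seconds, counter_value) tuples, oldest first.
def simple_rate(samples):
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)

# A counter scraped every 15s at a steady 1,250 requests/second:
samples = [(0, 0), (15, 18750), (30, 37500)]
print(simple_rate(samples))  # 1250.0
```

This is why dashboards graph `rate(counter[5m])` rather than the counter itself: the raw value only ever goes up, while the rate shows current load.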
Prometheus: De Facto Standard
Prometheus is the most widely used metrics collection system:
# prometheus.yml - Prometheus server configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'kubernetes'
    kubernetes_sd_configs:
      - role: pod
    metrics_path: '/metrics'
Example Prometheus Queries
# CPU usage for API service (last 5 minutes)
rate(process_cpu_seconds_total{job="api"}[5m]) * 100
# Current memory usage in GB
node_memory_MemAvailable_bytes / 1e9
# Error rate (errors per second)
rate(http_requests_total{status=~"5.."}[5m])
# Latency - 99th percentile
histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m]))
Graphing Metrics with Grafana
Grafana visualizes Prometheus data:
# Grafana dashboard definition (simplified, shown as YAML)
dashboard:
  title: "API Service Health"
  panels:
    - type: graph
      title: "CPU Usage"
      targets:
        - expr: "rate(process_cpu_seconds_total{job='api'}[5m]) * 100"
      threshold: 80  # Red line at 80%
    - type: graph
      title: "Error Rate"
      targets:
        - expr: "rate(http_requests_total{status=~'5..', job='api'}[5m])"
      alertThreshold: 0.01  # Alert at 1%
Alerting: Notification System
Alert Rules
Alerting rules define when to notify teams:
# prometheus/alerts.yml
groups:
  - name: api_alerts
    rules:
      - alert: HighCPU
        expr: rate(process_cpu_seconds_total{job="api"}[5m]) * 100 > 80
        for: 5m  # Fire only after 5 minutes sustained
        annotations:
          summary: "API CPU usage > 80%"
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5..", job="api"}[5m]) > 0.01
        for: 1m
        annotations:
          summary: "API error rate > 1%"
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 30s  # Fire almost immediately
        annotations:
          summary: "API service is down"
Alert Routing with Alertmanager
Route alerts to the right people:
# alertmanager.yml
route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  receiver: 'slack-general'  # Default receiver
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: high
      receiver: 'slack-oncall'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-general'
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_SERVICE_KEY'
  - name: 'slack-oncall'
    slack_configs:
      - channel: '#oncall'
  - name: 'slack-general'
    slack_configs:
      - channel: '#general'
Logs: Event Details
ELK Stack (Elasticsearch, Logstash, Kibana)
The industry standard for log aggregation:
# Filebeat configuration (log shipper)
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    fields:
      service: api
      env: prod
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "api-prod-%{+yyyy.MM.dd}"
Searching Logs with Kibana
# Find all 500 errors in the last hour
status:500 AND @timestamp:[now-1h TO now]
# Find slow requests
response_time_ms > 1000 AND service:api
# Find errors from specific deployment
deployment:v2.3.1 AND level:error
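The same searches can be run programmatically against Elasticsearch's `_search` API using the query DSL. A sketch of the DSL equivalent of the first Kibana search above (field names follow the examples here and would need to match your actual index mapping):

```python
# Query-DSL equivalent of the Kibana search:
#   status:500 AND @timestamp:[now-1h TO now]
def errors_last_hour_query(status=500):
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"status": status}},
                    {"range": {"@timestamp": {"gte": "now-1h", "lte": "now"}}},
                ]
            }
        }
    }

# POST this body as JSON to the index's _search endpoint,
# e.g. http://elasticsearch:9200/api-prod-*/_search
q = errors_last_hour_query()
```

Programmatic queries like this are what power automated log-based checks and custom dashboards on top of the ELK stack.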
Tracing: Request Paths
Distributed Tracing with Jaeger
Traces show how a request flows through your system:
User Request
↓
[API Gateway] → 5ms
↓
[Auth Service] → 10ms
↓
[User Service] → 25ms
↓
[Database Query] → 150ms
↓
Total: 190ms
If slow, you see exactly where time is spent
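The per-hop breakdown above is exactly what a trace gives you: a set of spans with durations. A toy version of the "where is the time going?" analysis, using the numbers from the diagram:

```python
# Span durations from the trace above (milliseconds)
spans = {
    "api-gateway": 5,
    "auth-service": 10,
    "user-service": 25,
    "database-query": 150,
}

total_ms = sum(spans.values())
slowest, slowest_ms = max(spans.items(), key=lambda kv: kv[1])
print(f"total={total_ms}ms, slowest hop: {slowest} "
      f"({slowest_ms}ms, {slowest_ms / total_ms:.0%} of the request)")
# → total=190ms, slowest hop: database-query (150ms, 79% of the request)
```

Tracing backends like Jaeger do this aggregation across thousands of requests, so you can see not just one slow request but which hop is slow on average.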
Jaeger Configuration
# OpenTelemetry Collector configuration (exporting traces to Jaeger)
processors:
  batch:
    send_batch_size: 512
    timeout: 5s
exporters:
  jaeger:
    endpoint: "jaeger-collector:14250"
    tls:
      insecure: true
Incident Management Platforms
PagerDuty
# PagerDuty integration
- Receives alerts from Prometheus
- Notifies on-call engineer
- Escalates if no response
- Tracks incident details
- Schedules on-call rotations
Example workflow:
1. Alert fires in Prometheus
2. Alertmanager sends to PagerDuty
3. PagerDuty triggers alert to on-call engineer
4. Engineer acknowledges within 5 minutes
5. If no ack, escalates to manager
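Step 2 of the workflow is an HTTP call: Alertmanager (or any client) sends a "trigger" event to the PagerDuty Events API v2. A minimal sketch of that payload; the routing key and field values are placeholders:

```python
# Build a PagerDuty Events API v2 "trigger" event (sketch).
def pagerduty_trigger_event(alertname, summary, severity="critical"):
    return {
        "routing_key": "YOUR_INTEGRATION_KEY",  # from the PagerDuty service
        "event_action": "trigger",
        "dedup_key": alertname,  # lets PagerDuty deduplicate repeated firings
        "payload": {
            "summary": summary,
            "source": "prometheus",
            "severity": severity,
        },
    }

event = pagerduty_trigger_event("HighErrorRate", "API error rate > 1%")
# POST as JSON to https://events.pagerduty.com/v2/enqueue
```

The `dedup_key` is what keeps a flapping alert from opening a new incident every minute; resolve events with the same key close the incident.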
Other Options
- Opsgenie: Similar to PagerDuty
- VictorOps (now Splunk On-Call): Incident workflow automation
- xMatters: Enterprise incident management
- Custom webhook: Direct integration
Health Checking
Built-in Health Checks
Every service should expose /health endpoint:
# Python Flask example
from datetime import datetime

from flask import Flask

app = Flask(__name__)

@app.route('/health')
def health_check():
    # Each check_* helper (defined elsewhere) returns True or False
    checks = {
        'database': check_database_connection(),
        'cache': check_cache_connection(),
        'disk': check_disk_space(),
        'memory': check_memory_usage(),
    }
    overall_status = 'healthy' if all(checks.values()) else 'unhealthy'
    return {
        'status': overall_status,
        'checks': checks,
        'version': '2.3.1',
        'timestamp': datetime.now().isoformat()
    }
Kubernetes Probes
# Kubernetes health check configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
        - name: api
          image: api:v2.3.1
          # Startup probe (initialization phase)
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            failureThreshold: 30
            periodSeconds: 10
          # Liveness probe (is it alive?)
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          # Readiness probe (can it take traffic?)
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
Recommended Stack for Different Scales
Small Team (< 10 engineers)
Metrics: Prometheus (open source)
Dashboards: Grafana (open source)
Logs: Basic app logging to file + grep
Tracing: Optional
Alerting: Simple email/Slack
On-call: Basic rotation, no dedicated tool
Total cost: ~$500/month (mostly cloud infrastructure)
Growing Company (10-100 engineers)
Metrics: Prometheus or Datadog
Dashboards: Grafana or built-in (Datadog)
Logs: ELK Stack or Datadog
Tracing: Jaeger or Datadog
Alerting: Alertmanager + PagerDuty
On-call: PagerDuty or Opsgenie
Total cost: $5-20k/month
Enterprise (100+ engineers)
Metrics: Datadog or Prometheus + Thanos
Dashboards: Grafana or Datadog
Logs: Splunk or Datadog
Tracing: Datadog or proprietary
Alerting: Enterprise-grade system
On-call: VictorOps or xMatters
Total cost: $50k-500k+/month
Plus: Custom integrations and support
Essential SRE Tools Checklist
Monitoring & Metrics:
☐ Time-series database (Prometheus, etc.)
☐ Dashboarding tool (Grafana, etc.)
☐ Infrastructure monitoring
☐ Application performance monitoring (APM)
Logging:
☐ Log aggregation (ELK, Datadog, etc.)
☐ Log parsing and search
☐ Retention policies
☐ Access controls
Alerting:
☐ Alert rules engine
☐ Alert routing/deduplication
☐ On-call escalation
☐ Notification channels (email, SMS, Slack, etc.)
Incident Management:
☐ On-call scheduling
☐ Incident tracking
☐ War room collaboration
☐ Post-incident analysis
Infrastructure:
☐ Infrastructure as Code
☐ Deployment automation
☐ Configuration management
☐ Secrets management
Observability:
☐ Distributed tracing
☐ APM tools
☐ Real User Monitoring (RUM)
☐ Synthetic monitoring
Open Source vs Managed Services
Open Source Benefits
✅ Full control
✅ No vendor lock-in
✅ Custom modifications
✅ Lower long-term cost
❌ Must maintain
❌ Requires expertise
❌ Higher initial setup
Managed Services Benefits
✅ No maintenance
✅ Included support
✅ Auto-scaling
✅ SLA guarantees
❌ Vendor lock-in
❌ Higher ongoing cost
❌ Less customization
Hybrid Approach (Recommended)
Use open source for:
- Metrics (Prometheus)
- Dashboards (Grafana)
- Tracing (Jaeger)
Use managed for:
- Log aggregation (Datadog)
- PagerDuty integration
- Custom analysis needs
Benefits:
- Open source flexibility where needed
- Managed convenience for non-differentiating services
- Balance cost and capability
Key Takeaways
✓ Observability = Metrics + Logs + Traces
✓ Prometheus is the metrics standard
✓ Grafana for dashboards
✓ ELK or Datadog for logs
✓ Distributed tracing matters for microservices
✓ Alerting must be smart (not noisy)
✓ Health checks enable automation
✓ Start open source, graduate to managed
✓ Invest gradually as you scale
✓ Right tools enable effective SRE