Top Pitfalls and Prevention
Pitfall 1: Running Chaos Without Proper Monitoring
The Mistake:
Team: "Let's inject a database failure and see what happens"
What Actually Happens:
- Database fails
- Application... does something
- No monitoring → Can't see what happened
- Experiment results: \"Unclear\"
- Learning: Nothing
- Investment: Wasted
Prevention:
✅ Before any chaos experiment, verify:
Monitoring Checklist:
✓ Prometheus/Datadog dashboard running
✓ Key metrics being collected
- Request throughput
- Error rate
- Latency (p50, p95, p99)
- Resource usage (CPU, memory)
- Application-specific metrics
✓ Logs being aggregated
✓ Traces being collected (if distributed system)
✓ Team can access and query data
✓ Dashboard setup with baseline and thresholds
Example Safe Experiment Start:
# Step 1: Verify monitoring is working
kubectl logs -n monitoring loki-0 | tail -20
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Manually verify dashboard shows data
# Step 2: Establish baseline (5-10 minutes with no chaos)
# Record metrics in baseline.txt
# Step 3: Now inject failure
# (knowing you can see what happens)
kubectl delete pod -n production database-0
# Step 4: Observe and record metrics
# (compare to baseline)
Pitfall 2: Testing in Production Without Safeguards
The Mistake:
Team: "Let's kill 10% of our pods to see what happens"
Reality:
- 10% of pods killed
- Traffic reroutes to remaining pods
- Load increases on remaining pods
- Something unexpected happens
- Cascading failure
- Production outage affecting all customers
- Customer support flooded with complaints
- Engineers scrambling to fix
Prevention:
✅ Use Progressive Escalation:
Week 1-2: Test in staging only
(Complete local environment, zero production impact)
Week 3-4: Test in production, off-peak hours, 1% of traffic
(5am-6am on Sunday, least traffic)
Week 5-6: Test in production, off-peak, 5% of traffic
(Still off-peak, wider scope)
Week 7-8: Test in production, low-traffic hours, 10% of traffic
(6am-8am on Sunday, early morning, still low traffic)
Week 9+: Test in production, business hours, with guardrails
(Full production environment, but with auto-rollback)
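The escalation schedule above can also be captured as data so tooling can enforce it. A minimal sketch; the stage structure, `stage_for_week` helper, and field names are illustrative assumptions, not a standard API:

```python
# Hypothetical escalation schedule mirroring the plan above:
# (program weeks, environment, traffic window, blast radius as % of traffic).
ESCALATION_PLAN = [
    {"weeks": range(1, 3), "env": "staging", "window": "any", "traffic_pct": 100},
    {"weeks": range(3, 5), "env": "production", "window": "off-peak", "traffic_pct": 1},
    {"weeks": range(5, 7), "env": "production", "window": "off-peak", "traffic_pct": 5},
    {"weeks": range(7, 9), "env": "production", "window": "low-traffic", "traffic_pct": 10},
]

def stage_for_week(week: int) -> dict:
    """Return the escalation stage allowed for a given program week."""
    for stage in ESCALATION_PLAN:
        if week in stage["weeks"]:
            return stage
    # Week 9+: full production during business hours, guardrails required
    return {"env": "production", "window": "business-hours", "traffic_pct": None}
```

An experiment runner can call `stage_for_week` before injecting anything and refuse to exceed the stage's blast radius.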
✅ Implement Auto-Rollback:
# Guardrails - Stop experiment if damage detected
Stop Experiment If:
- Error rate > 5%
- Latency p99 > 2x baseline
- Pod restarts > 1 per minute
- Database connection pool > 90% full
- Memory usage > 85%✅ Use Circuit Breakers:
# If the experiment is causing problems, automatically stop it
import time

class SafeExperiment:
    def __init__(self, duration, error_threshold=0.05):
        self.duration = duration
        self.error_threshold = error_threshold
        self.start_time = time.time()

    def should_continue(self, current_error_rate):
        elapsed = time.time() - self.start_time
        # Stop if duration exceeded
        if elapsed > self.duration:
            return False
        # Stop if error rate too high
        if current_error_rate > self.error_threshold:
            print(f"ERROR RATE {current_error_rate} > {self.error_threshold}")
            print("Stopping experiment")
            return False
        return True

# get_error_rate(), inject_chaos(), stop_chaos() are tool-specific helpers
experiment = SafeExperiment(duration=300, error_threshold=0.05)
while experiment.should_continue(get_error_rate()):
    inject_chaos()
    time.sleep(5)
stop_chaos()
Pitfall 3: Not Involving Product/Business Team
The Mistake:
SRE Team: "We're running chaos experiments this afternoon"
Business Team: (learns via customer impact)
Customer: "Your service is down, what's happening?"
Support: "We don't know, engineering is investigating"
CEO: (calls CTO) "Why was nobody monitoring this risk?"
Prevention:
✅ Communicate Early:
1 Week Before Experiment:
Email to: Engineering, Product, Support, Executive stakeholder
Subject: "Scheduled Reliability Test - [Date/Time]"
Content:
"We're running a controlled experiment to test our system's
resilience. Target service: [SERVICE]. Time: [TIME].
Expected impact: [Describe graceful degradation or no impact].
If you notice unusual behavior, note the time and email [contact].
This is intentional and controlled."
1 Day Before:
Slack reminder: "Chaos experiment starting tomorrow at [TIME]"
Include: "No customer impact expected" or "Possible 2s delay"
During Experiment:
Status channel: "Chaos experiment in progress. Error rate: 0.1% (normal)"
Every 5-10 minutes: Update status
After Experiment:
Summary: "Experiment complete. Results: [brief summary].
All systems normal. Report at [link]"
✅ Get Stakeholder Buy-In:
Pre-Experiment Alignment:
✓ CTO/VP Engineering: "This aligns with 2024 reliability goals"
✓ VP Product: "Customers will benefit from improved reliability"
✓ CFO: "ROI positive: saves $XXX/year in downtime cost"
✓ CEO: "Improves brand reputation and customer retention"
✓ VP Support: "Reduces our after-hours firefighting"
Pitfall 4: Testing Only Infrastructure, Not Application Code
The Mistake:
Experiment 1: Kill a pod
Result: Kubernetes restarts it, no issue
Learning: (None - Kubernetes handles this automatically)
Experiment 2: Kill a database replica
Result: Traffic reroutes to primary, no issue
Learning: Good, but expected
Experiment 3: Add network latency to API call
Result: Application times out, circuit breaker opens
Learning: MAJOR - timeout configuration was wrong!
This would have failed with customer impact
But chaos caught it first
Prevention:
✅ Test All Layers:
Infrastructure Layer:
✓ Pod/instance death
✓ Node failure
✓ Network partition
└─ Examples: Chaos Monkey, pod-delete
Dependency Layer:
✓ Database unavailable
✓ Cache timeout
✓ Message queue down
✓ External API latency
└─ Examples: Network latency, dependency kill
Application Layer:
✓ Timeout handling
✓ Retry logic
✓ Circuit breaker activation
✓ Error handling and logging
└─ Examples: Application-level chaos injection
Combination Layer:
✓ Multiple simultaneous failures
✓ Cascading failures
✓ Uncommon combinations
└─ Examples: Pod death + network latency + low memory
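Application-layer chaos often needs no infrastructure tooling at all. A hedged sketch of a fault-injection decorator; the `chaos` helper and its parameters are hypothetical, shown only to illustrate injecting latency and errors at the code level:

```python
import functools
import random
import time

def chaos(latency_s=0.0, error_rate=0.0, enabled=True):
    """Wrap a dependency call to simulate a slow or failing downstream service."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if enabled:
                if latency_s:
                    time.sleep(latency_s)  # simulate a slow dependency
                if random.random() < error_rate:
                    # simulate a downstream failure
                    raise TimeoutError(f"chaos: injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@chaos(latency_s=0.05, error_rate=0.0)
def fetch_user(user_id):
    return {"id": user_id}
```

Flipping `error_rate` to 1.0 (or `latency_s` past the caller's timeout) exercises the timeout, retry, and circuit-breaker paths that infrastructure-level chaos never reaches.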
✅ Create Experiment Matrix:
Experiment Plan for Payment Service:
By Layer:
Infrastructure:
- Pod delete (1 of 3 pods)
- Node drain
- Network latency to database
Dependency:
- Database offline (primary)
- Cache offline (Redis)
- Message broker offline
Application:
- Timeout simulation (dependency slow)
- Error injection (downstream fails)
- Resource constraints (memory)
Combination:
- Pod delete + network latency
- Database offline + memory pressure
- Cache offline + timeout
By Criticality:
Tier 1 (Must handle): All above
Tier 2 (Should handle): Graceful degradation
Tier 3 (Nice to have): User messaging
Pitfall 5: Not Documenting and Sharing Learnings
The Mistake:
Team A: Runs experiment, discovers timeout issue
Fixes it, moves on
No documentation
3 months later:
Team B: Runs same experiment, discovers same issue
"Why are timeouts failing?"
Wasted time rediscovering problem
Team C: Never runs this experiment
Gets surprised by timeout issue in production
Customer impact
Learning: Repeated across teams multiple times
Prevention:
✅ Document Everything:
# Chaos Experiment Report
## Metadata
- Date: 2024-01-15
- Team: Payment Service
- Experimenter: @alice
- Service: payment-processor
## Experiment
- Name: Database Primary Failure
- Duration: 10 minutes
- Blast Radius: 5% of traffic
- Tool: Litmus
## Hypothesis
"If database primary becomes unavailable, read replicas
should provide full functionality with no customer impact."
## Results
✅ PASS - System recovered automatically
### Metrics
- Error rate: 0.5% (acceptable)
- Latency: 250ms → 450ms (acceptable)
- Database connections: Peaked at 85/100 (ok)
- Customer impact: None detected
### Timeline
- T+0s: Primary killed
- T+3s: Application detected failure (health check)
- T+5s: Traffic rerouted to replica
- T+5m: Primary restarted
- T+7m: Replication caught up
- T+10m: Full recovery
## Learnings
1. ✅ Read replica failover works as expected
2. ✅ Connection pool handles spike
3. ⚠️ Detecting the primary failure took 3 seconds
- Health check timeout is currently configured at 5 seconds
- Could reduce to 2 seconds for faster failover
## Action Items
- [ ] Reduce database health check timeout to 2s (Alice, due 1/22)
- [ ] Test with multiple replicas offline (Bob, due 2/5)
- [ ] Document replica failover in runbook (Carol, due 1/29)
## Related Incidents
- [INC-1234] Database primary failure (2024-01-10) - NOT caught by chaos
- Root cause: Health check timeout too long
- This experiment would have caught it if run weekly
## Links
- Review: [Grafana Dashboard](link)
- Runbook: [Replica Failover Runbook](link)
- Slack Thread: [@chaos-engineering channel](link)
✅ Share Widely:
After Each Experiment:
1. Write report (template above)
2. Save to shared repository (Git/Wiki)
3. Post summary to team Slack channel
4. Monthly: Present findings in "Failure Friday" meeting
5. Quarterly: Compile into organizational metrics report
Pitfall 6: Over-Automating Too Early
The Mistake:
Team: "Chaos is too manual, let's automate immediately"
Result:
- Automated experiments running constantly
- Nobody checking results
- Issues discovered but not fixed
- Confusion about which failures are intentional vs real incidents
- "Chaos Alarm Fatigue" - ignoring all chaos-related alerts
Prevention:
✅ Manual Before Automation:
Phase 1: Manual Experiments (Weeks 1-8)
- Team runs chaos consciously
- Reviews results manually
- Learns what breaks
- Fixes issues deliberately
Phase 2: Semi-Automated (Weeks 9-16)
- Experiments trigger on-demand via dashboard
- Automated results collection
- Manual review of findings
- Automated alerting (optional)
Phase 3: Automated in CI/CD (Weeks 17+)
- Lightweight experiments on deployment
- Automated pass/fail verdict
- Auto-rollback on failure
- Comprehensive observability
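The automated pass/fail verdict in Phase 3 can be a small comparison of observed metrics against the pre-experiment baseline, using the guardrail thresholds listed earlier. A sketch; `verdict`, `THRESHOLDS`, and the metric names are assumptions rather than any specific tool's API:

```python
# Guardrails mirroring the auto-rollback thresholds above:
# absolute error-rate ceiling and p99 latency relative to baseline.
THRESHOLDS = {"error_rate": 0.05, "latency_p99_ratio": 2.0}

def verdict(baseline: dict, observed: dict) -> bool:
    """Return True (pass) when observed metrics stay within guardrails."""
    if observed["error_rate"] > THRESHOLDS["error_rate"]:
        return False
    if observed["latency_p99"] > THRESHOLDS["latency_p99_ratio"] * baseline["latency_p99"]:
        return False
    return True
```

A CI/CD step would capture the baseline, run the lightweight experiment, then exit nonzero when `verdict` returns False, triggering the pipeline's auto-rollback.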
✅ Automation Checklist:
Before automating, verify:
- Manual experiments have run successfully 10+ times
- Safeguards and auto-rollback proven reliable
- Team trained on how to interpret results
- Alerts configured (not causing fatigue)
- Process for fixing discovered issues established
- Documentation complete
Pitfall 7: Treating Chaos as a One-Time Initiative Instead of Ongoing Practice
The Mistake:
Month 1: "We're implementing Chaos Engineering!"
5 experiments run, 3 issues fixed, great energy
Month 2-3: Momentum continues, some experiments
Month 6: "When was the last chaos experiment?"
Last experiment: 2 months ago
New issues: Not discovered early
Team trained: Skills atrophying
1 Year Later: Abandoned
Lesson Learned: One-time initiatives never stick
Prevention:
✅ Make Chaos Engineering Permanent:
Institutionalization:
Policy:
- Every service must run chaos experiments monthly
- At least 1 experiment before each major deployment
- Issues found must be fixed within sprint
Structure:
- Dedicated Chaos Engineering team
- Chaos champion in each service team
- Monthly "Failure Friday" reviews (mandatory)
- Quarterly chaos summit
Incentives:
- Team OKRs include reliability metrics
- Bonuses tied to availability improvements
- Public recognition for reliability improvements
- Include in hiring/promotion criteria
Measurement:
- Weekly: Report experiments run and issues found
- Monthly: Share findings publicly
- Quarterly: Executive review of metrics
- Annually: Document impact and progress
✅ Build Rituals:
Weekly:
- Team Standup: "Any chaos findings over the past week?"
Monthly:
- Failure Friday (60 min): Review of the month's discoveries
- Chaos Experiment Planning: What to test next month
Quarterly:
- Chaos Summit (half day): Teams share learnings
- Strategy Session: Alignment on next quarter focus
- Training Bootcamp: New skills and technologies
Annually:
- Reliability Report: Year in review
- Reliability Planning: Year ahead goals
- Team Celebration: Recognize contributions
Pitfall 8: Chaos Experiments Cause Real Incidents
The Mistake:
Experiment: "Let's test database replica failure"
Reality:
- Application doesn't recognize replica as down
- Continues sending queries to dead replica
- Queries timeout
- Connection pool exhausted
- All database connections hung
- Service hangs for customers
- REAL INCIDENT because of experiment
Prevention:
✅ Start Conservatively:
First 10 Experiments:
- Staging environment only
- No production impact whatsoever
- Kill obviously redundant components
Experiments 11-20:
- Production, but non-business hours
- Off-peak traffic only
- 1-2 redundant components
- Strong safeguards
Experiments 21+:
- Gradually increase blast radius
- Always with auto-rollback
- Team on standby
✅ Design for Safety:
# Experiment design pattern
import time

class SafeExperiment:
    def __init__(self, target_component, duration_seconds=60):
        self.target = target_component
        self.duration = duration_seconds
        self.baseline = self.capture_metrics()  # Snapshot before injecting
        self.thresholds = {
            'error_rate': 0.05,
            'latency_p99': 5.0,
            'pod_restarts': 1,
        }

    def pre_experiment_checks(self):
        """Verify system is healthy before starting"""
        assert self.baseline['error_rate'] < 0.01, "High baseline errors"
        assert self.baseline['pods_ready'] == self.baseline['pods_total']
        print("✓ Pre-experiment checks passed")

    def inject_failure(self):
        """Inject failure with safeguards"""
        self.pre_experiment_checks()
        self.inject()   # Kill pod, add latency, etc.
        self.monitor()  # Check metrics every 5 seconds

    def monitor(self):
        """Check if experiment should stop early"""
        while self.elapsed_seconds() < self.duration:
            metrics = self.get_current_metrics()
            # Auto-rollback if thresholds breached
            if metrics['error_rate'] > self.thresholds['error_rate']:
                print(f"ERROR RATE {metrics['error_rate']} TOO HIGH")
                self.stop_experiment()
                return False
            if metrics['latency_p99'] > self.thresholds['latency_p99']:
                print(f"LATENCY {metrics['latency_p99']}s TOO HIGH")
                self.stop_experiment()
                return False
            time.sleep(5)
        return True

    def stop_experiment(self):
        """Clean up and restore to normal"""
        self.rollback()
        self.verify_recovery()

    # capture_metrics(), get_current_metrics(), elapsed_seconds(), inject(),
    # rollback(), and verify_recovery() are tool-specific hooks to implement.
Lessons from Failed Implementations
Case Study 1: "Chaos Experiment Broke Production"
What Happened:
- Team 1 ran: "Kill all pods in service" experiment
- Did not test cascade impact on downstream services
- Downstream service got overwhelmed
- Cascading failure
- Real production outage for 2 hours
- Company-wide incident review
- Chaos program paused 3 months
Lessons:
✗ Did not understand blast radius
✗ Did not test impact on dependencies
✗ Did not have sufficient safeguards
✓ Should have tested in staging first
✓ Should have started smaller (kill 1 pod, not all)
✓ Should have had auto-rollback
Prevention:
1. Mandate staging environment first
2. Start with 1% blast radius
3. Progressive escalation plan
4. Multiple layers of safeguards
Case Study 2: "Chaos Program Abandoned"
What Happened:
- Initial 2-month sprint: High energy, 50 experiments
- Teams excited, learning a lot
- Issues fixed, reliability improved
- End of sprint: Momentum lost
- 6 months later: No chaos experiments run
- Skills atrophied, interest gone
Lesson:
✗ Treated as sprint initiative, not ongoing practice
✗ No permanent team assigned
✗ No processes in place
✓ Should have had permanent structure
✓ Should have built into standard practices
✓ Should have created ongoing metrics/tracking
Prevention:
1. Institutionalize from day 1
2. Assign permanent team members
3. Build into deployment process
4. Monthly reviews mandatory
5. Quarterly planning for next set
Case Study 3: "Running Chaos on Critical Services Too Early"
What Happened:
- Team eager to test everything
- Started with "payment processing" service (critical)
- Experiment revealed: Payment service doesn't handle latency well
- While running experiment: Unexpected message queue backup
- Payment processing failed
- Orders lost
- Customers couldn't complete purchases
Lesson:
✗ Tested critical service before understanding it
✗ No staged approach
✗ Insufficient monitoring of message queue
Prevention:
1. Non-critical services first (analytics, notifications, etc.)
2. Only move to critical services after experience
3. Double safeguards for critical services
4. Comprehensive monitoring in place first
Checklist: Happy Path Implementation
Before Running Any Experiments:
☐ Monitoring visible and working
☐ Alerting configured (not noisy)
☐ Team trained on basics
☐ Staging environment tested
☐ Safeguards in place (auto-rollback)
☐ Stakeholders informed
☐ Process documented
During Experiments:
☐ Start very conservatively
☐ Progressive escalation planned
☐ Team on standby (first few times)
☐ Real-time monitoring dashboard visible
☐ Documented findings saved immediately
☐ Results shared within 24 hours
After Experiments:
☐ Issues documented
☐ Root causes identified
☐ Fixes designed
☐ Fixes verified with re-test
☐ Runbooks updated if needed
☐ Learnings shared organization-wide
☐ Metrics updated
Ongoing:
☐ Monthly reviews of incidents vs chaos findings
☐ Quarterly planning and strategy alignment
☐ Continuous learning culture
☐ Regular communication to stakeholders
Key Takeaways
- Monitor First: Can't improve what you can't measure
- Gradual: Don't jump to aggressive experiments
- Communicate: Keep stakeholders informed
- Test All Layers: Application, not just infrastructure
- Document Learnings: Share knowledge widely
- Automate Late: Master manual first
- Permanent: Make it part of standard operations
- Safe: Start conservatively, scale gradually