Top Pitfalls and Prevention
Pitfall 1: Running Chaos Without Proper Monitoring
The Mistake:
Team: "Let's inject a database failure and see what happens"
What Actually Happens:
- Database fails
- Application... does something
- No monitoring → Can't see what happened
- Experiment results: \"Unclear\"
- Learning: Nothing
- Investment: Wasted
Prevention:
✅ Before any chaos experiment, verify:
Monitoring Checklist:
✓ Prometheus/Datadog dashboard running
✓ Key metrics being collected
- Request throughput
- Error rate
- Latency (p50, p95, p99)
- Resource usage (CPU, memory)
- Application-specific metrics
✓ Logs being aggregated
✓ Traces being collected (if distributed system)
✓ Team can access and query data
✓ Dashboard setup with baseline and thresholds
Example Safe Experiment Start:
# Step 1: Verify monitoring is working
kubectl logs -n monitoring loki-0 | tail -20
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Manually verify dashboard shows data
# Step 2: Establish baseline (5-10 minutes with no chaos)
# Record metrics in baseline.txt
# Step 3: Now inject failure
# (knowing you can see what happens)
kubectl delete pod -n production database-0
# Step 4: Observe and record metrics
# (compare to baseline)
Pitfall 2: Testing in Production Without Safeguards
The Mistake:
Team: "Let's kill 10% of our pods to see what happens"
Reality:
- 10% of pods killed
- Traffic reroutes to remaining pods
- Load increases on remaining pods
- Something unexpected happens
- Cascading failure
- Production outage affecting all customers
- Customer support flooded with complaints
- Engineers scrambling to fix
Prevention:
✅ Use Progressive Escalation:
Week 1-2: Test in staging only
(Complete local environment, zero production impact)
Week 3-4: Test in production, off-peak hours, 1% of traffic
(5am-6am on Sunday, least traffic)
Week 5-6: Test in production, off-peak, 5% of traffic
(Still off-peak, wider scope)
Week 7-8: Test in production, low-traffic hours, 10% of traffic
(6am-8am on Sunday, early morning, still low traffic)
Week 9+: Test in production, business hours, with guardrails
(Full production environment, but with auto-rollback)
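The escalation schedule above can also be captured as data so tooling can enforce it. A minimal sketch; the stage structure, `stage_for_week` helper, and field names are illustrative assumptions, not a standard API:

```python
# Hypothetical escalation schedule mirroring the plan above:
# (program weeks, environment, traffic window, blast radius as % of traffic).
ESCALATION_PLAN = [
    {"weeks": range(1, 3), "env": "staging", "window": "any", "traffic_pct": 100},
    {"weeks": range(3, 5), "env": "production", "window": "off-peak", "traffic_pct": 1},
    {"weeks": range(5, 7), "env": "production", "window": "off-peak", "traffic_pct": 5},
    {"weeks": range(7, 9), "env": "production", "window": "low-traffic", "traffic_pct": 10},
]

def stage_for_week(week: int) -> dict:
    """Return the escalation stage allowed for a given program week."""
    for stage in ESCALATION_PLAN:
        if week in stage["weeks"]:
            return stage
    # Week 9+: full production during business hours, guardrails required
    return {"env": "production", "window": "business-hours", "traffic_pct": None}
```

An experiment runner can call `stage_for_week` before injecting anything and refuse to exceed the stage's blast radius.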
✅ Implement Auto-Rollback:
# Guardrails - Stop experiment if damage detected
Stop Experiment If:
- Error rate > 5%
- Latency p99 > 2x baseline
- Pod restarts > 1 per minute
- Database connection pool > 90% full
- Memory usage > 85%✅ Use Circuit Breakers:
# If the experiment is causing problems, automatically stop it
import time

class SafeExperiment:
    def __init__(self, duration, error_threshold=0.05):
        self.duration = duration
        self.error_threshold = error_threshold
        self.start_time = time.time()

    def should_continue(self, current_error_rate):
        elapsed = time.time() - self.start_time
        # Stop if duration exceeded
        if elapsed > self.duration:
            return False
        # Stop if error rate too high
        if current_error_rate > self.error_threshold:
            print(f"ERROR RATE {current_error_rate} > {self.error_threshold}")
            print("Stopping experiment")
            return False
        return True

# get_error_rate(), inject_chaos(), stop_chaos() are tool-specific helpers
experiment = SafeExperiment(duration=300, error_threshold=0.05)
while experiment.should_continue(get_error_rate()):
    inject_chaos()
    time.sleep(5)
stop_chaos()
Pitfall 3: Not Involving Product/Business Team
The Mistake:
SRE Team: "We're running chaos experiments this afternoon"
Business Team: (learns via customer impact)
Customer: "Your service is down, what's happening?"
Support: "We don't know, engineering is investigating"
CEO: (calls CTO) "Why was nobody monitoring this risk?"
Prevention:
✅ Communicate Early:
1 Week Before Experiment:
Email to: Engineering, Product, Support, Executive stakeholder
Subject: "Scheduled Reliability Test - [Date/Time]"
Content:
"We're running a controlled experiment to test our system's
resilience. Target service: [SERVICE]. Time: [TIME].
Expected impact: [Describe graceful degradation or no impact].
If you notice unusual behavior, note the time and email [contact].
This is intentional and controlled."
1 Day Before:
Slack reminder: "Chaos experiment starting tomorrow at [TIME]"
Include: "No customer impact expected" or "Possible 2s delay"
During Experiment:
Status channel: "Chaos experiment in progress. Error rate: 0.1% (normal)"
Every 5-10 minutes: Update status
After Experiment:
Summary: "Experiment complete. Results: [brief summary].
All systems normal. Report at [link]"
✅ Get Stakeholder Buy-In:
Pre-Experiment Alignment:
✓ CTO/VP Engineering: "This aligns with 2024 reliability goals"
✓ VP Product: "Customers will benefit from improved reliability"
✓ CFO: "ROI positive: saves $XXX/year in downtime cost"
✓ CEO: "Improves brand reputation and customer retention"
✓ VP Support: "Reduces our after-hours firefighting"
Pitfall 4: Testing Only Infrastructure, Not Application Code
The Mistake:
Experiment 1: Kill a pod
Result: Kubernetes restarts it, no issue
Learning: (None - Kubernetes handles this automatically)
Experiment 2: Kill a database replica
Result: Traffic reroutes to primary, no issue
Learning: Good, but expected
Experiment 3: Add network latency to API call
Result: Application times out, circuit breaker opens
Learning: MAJOR - timeout configuration was wrong!
This would have failed with customer impact
But chaos caught it first
Prevention:
✅ Test All Layers:
Infrastructure Layer:
✓ Pod/instance death
✓ Node failure
✓ Network partition
└─ Examples: Chaos Monkey, pod-delete
Dependency Layer:
✓ Database unavailable
✓ Cache timeout
✓ Message queue down
✓ External API latency
└─ Examples: Network latency, dependency kill
Application Layer:
✓ Timeout handling
✓ Retry logic
✓ Circuit breaker activation
✓ Error handling and logging
└─ Examples: Application-level chaos injection
Combination Layer:
✓ Multiple simultaneous failures
✓ Cascading failures
✓ Uncommon combinations
└─ Examples: Pod death + network latency + low memory
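Application-layer chaos often needs no infrastructure tooling at all. A hedged sketch of a fault-injection decorator; the `chaos` helper and its parameters are hypothetical, shown only to illustrate injecting latency and errors at the code level:

```python
import functools
import random
import time

def chaos(latency_s=0.0, error_rate=0.0, enabled=True):
    """Wrap a dependency call to simulate a slow or failing downstream service."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if enabled:
                if latency_s:
                    time.sleep(latency_s)  # simulate a slow dependency
                if random.random() < error_rate:
                    # simulate a downstream failure
                    raise TimeoutError(f"chaos: injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@chaos(latency_s=0.05, error_rate=0.0)
def fetch_user(user_id):
    return {"id": user_id}
```

Flipping `error_rate` to 1.0 (or `latency_s` past the caller's timeout) exercises the timeout, retry, and circuit-breaker paths that infrastructure-level chaos never reaches.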
✅ Create Experiment Matrix:
Experiment Plan for Payment Service:
By Layer:
Infrastructure:
- Pod delete (1 of 3 pods)
- Node drain
- Network latency to database
Dependency:
- Database offline (primary)
- Cache offline (Redis)
- Message broker offline
Application:
- Timeout simulation (dependency slow)
- Error injection (downstream fails)
- Resource constraints (memory)
Combination:
- Pod delete + network latency
- Database offline + memory pressure
- Cache offline + timeout
By Criticality:
Tier 1 (Must handle): All above
Tier 2 (Should handle): Graceful degradation
Tier 3 (Nice to have): User messaging
Pitfall 5: Not Documenting and Sharing Learnings
The Mistake:
Team A: Runs experiment, discovers timeout issue
Fixes it, moves on
No documentation
3 months later:
Team B: Runs same experiment, discovers same issue
"Why are timeouts failing?"
Wasted time rediscovering problem
Team C: Never runs this experiment
Gets surprised by timeout issue in production
Customer impact
Learning: Repeated across teams multiple times
Prevention:
✅ Document Everything:
# Chaos Experiment Report
## Metadata
- Date: 2024-01-15
- Team: Payment Service
- Experimenter: @alice
- Service: payment-processor
## Experiment
- Name: Database Primary Failure
- Duration: 10 minutes
- Blast Radius: 5% of traffic
- Tool: Litmus
## Hypothesis
"If database primary becomes unavailable, read replicas
should provide full functionality with no customer impact."
## Results
✅ PASS - System recovered automatically
### Metrics
- Error rate: 0.5% (acceptable)
- Latency: 250ms → 450ms (acceptable)
- Database connections: Peaked at 85/100 (ok)
- Customer impact: None detected
### Timeline
- T+0s: Primary killed
- T+3s: Application detected failure (health check)
- T+5s: Traffic rerouted to replica
- T+5m: Primary restarted
- T+7m: Replication caught up
- T+10m: Full recovery
## Learnings
1. ✅ Read replica failover works as expected
2. ✅ Connection pool handles spike
3. ⚠️ Detecting the primary failure took 3 seconds
- Health check timeout is currently configured at 5 seconds
- Could reduce to 2 seconds for faster failover
## Action Items
- [ ] Reduce database health check timeout to 2s (Alice, due 1/22)
- [ ] Test with multiple replicas offline (Bob, due 2/5)
- [ ] Document replica failover in runbook (Carol, due 1/29)
## Related Incidents
- [INC-1234] Database primary failure (2024-01-10) - NOT caught by chaos
- Root cause: Health check timeout too long
- This experiment would have caught it if run weekly
## Links
- Review: [Grafana Dashboard](link)
- Runbook: [Replica Failover Runbook](link)
- Slack Thread: [@chaos-engineering channel](link)
✅ Share Widely:
After Each Experiment:
1. Write report (template above)
2. Save to shared repository (Git/Wiki)
3. Post summary to team Slack channel
4. Monthly: Present findings in "Failure Friday" meeting
5. Quarterly: Compile into organizational metrics report
Pitfall 6: Over-Automating Too Early
The Mistake:
Team: "Chaos is too manual, let's automate immediately"
Result:
- Automated experiments running constantly
- Nobody checking results
- Issues discovered but not fixed
- Confusion about which failures are intentional vs real incidents
- "Chaos Alarm Fatigue" - ignoring all chaos-related alerts
Prevention:
✅ Manual Before Automation:
Phase 1: Manual Experiments (Weeks 1-8)
- Team runs chaos consciously
- Reviews results manually
- Learns what breaks
- Fixes issues deliberately
Phase 2: Semi-Automated (Weeks 9-16)
- Experiments trigger on-demand via dashboard
- Automated results collection
- Manual review of findings
- Automated alerting (optional)
Phase 3: Automated in CI/CD (Weeks 17+)
- Lightweight experiments on deployment
- Automated pass/fail verdict
- Auto-rollback on failure
- Comprehensive observability
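The automated pass/fail verdict in Phase 3 can be a small comparison of observed metrics against the pre-experiment baseline, using the guardrail thresholds listed earlier. A sketch; `verdict`, `THRESHOLDS`, and the metric names are assumptions rather than any specific tool's API:

```python
# Guardrails mirroring the auto-rollback thresholds above:
# absolute error-rate ceiling and p99 latency relative to baseline.
THRESHOLDS = {"error_rate": 0.05, "latency_p99_ratio": 2.0}

def verdict(baseline: dict, observed: dict) -> bool:
    """Return True (pass) when observed metrics stay within guardrails."""
    if observed["error_rate"] > THRESHOLDS["error_rate"]:
        return False
    if observed["latency_p99"] > THRESHOLDS["latency_p99_ratio"] * baseline["latency_p99"]:
        return False
    return True
```

A CI/CD step would capture the baseline, run the lightweight experiment, then exit nonzero when `verdict` returns False, triggering the pipeline's auto-rollback.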
✅ Automation Checklist:
Before automating, verify:
- Manual experiments have run successfully 10+ times
- Safeguards and auto-rollback proven reliable
- Team trained on how to interpret results
- Alerts configured (not causing fatigue)
- Process for fixing discovered issues established
- Documentation complete
Pitfall 7: Treating Chaos as a One-Time Initiative Instead of Ongoing Practice
The Mistake:
Month 1: "We're implementing Chaos Engineering!"
5 experiments run, 3 issues fixed, great energy
Month 2-3: Momentum continues, some experiments
Month 6: "When was the last chaos experiment?"
Last experiment: 2 months ago
New issues: Not discovered early
Team trained: Skills atrophying
1 Year Later: Abandoned
Lesson Learned: One-time initiatives never stick
Prevention:
✅ Make Chaos Engineering Permanent:
Institutionalization:
Policy:
- Every service must run chaos experiments monthly
- At least 1 experiment before each major deployment
- Issues found must be fixed within sprint
Structure:
- Dedicated Chaos Engineering team
- Chaos champion in each service team
- Monthly "Failure Friday" reviews (mandatory)
- Quarterly chaos summit
Incentives:
- Team OKRs include reliability metrics
- Bonuses tied to availability improvements
- Public recognition for reliability improvements
- Include in hiring/promotion criteria
Measurement:
- Weekly: Report experiments run and issues found
- Monthly: Share findings publicly
- Quarterly: Executive review of metrics
- Annually: Document impact and progress
✅ Build Rituals:
Weekly:
- Team Standup: "Any chaos findings over the past week?"
Monthly:
- Failure Friday (60 min): Review of the month's discoveries
- Chaos Experiment Planning: What to test next month
Quarterly:
- Chaos Summit (half day): Teams share learnings
- Strategy Session: Alignment on next quarter focus
- Training Bootcamp: New skills and technologies
Annually:
- Reliability Report: Year in review
- Reliability Planning: Year ahead goals
- Team Celebration: Recognize contributions
Pitfall 8: Chaos Experiments Cause Real Incidents
The Mistake:
Experiment: "Let's test database replica failure"
Reality:
- Application doesn't recognize replica as down
- Continues sending queries to dead replica
- Queries timeout
- Connection pool exhausted
- All database connections hung
- Service hangs for customers
- REAL INCIDENT because of experiment
Prevention:
✅ Start Conservatively:
First 10 Experiments:
- Staging environment only
- No production impact whatsoever
- Kill obviously redundant components
Experiments 11-20:
- Production, but non-business hours
- Off-peak traffic only
- 1-2 redundant components
- Strong safeguards
Experiments 21+:
- Gradually increase blast radius
- Always with auto-rollback
- Team on standby
✅ Design for Safety:
# Experiment design pattern
import time

class SafeExperiment:
    def __init__(self, target_component, duration_seconds=60):
        self.target = target_component
        self.duration = duration_seconds
        self.baseline = self.capture_metrics()  # Snapshot before injecting
        self.thresholds = {
            'error_rate': 0.05,
            'latency_p99': 5.0,
            'pod_restarts': 1,
        }

    def pre_experiment_checks(self):
        """Verify system is healthy before starting"""
        assert self.baseline['error_rate'] < 0.01, "High baseline errors"
        assert self.baseline['pods_ready'] == self.baseline['pods_total']
        print("✓ Pre-experiment checks passed")

    def inject_failure(self):
        """Inject failure with safeguards"""
        self.pre_experiment_checks()
        self.inject()   # Kill pod, add latency, etc.
        self.monitor()  # Check metrics every 5 seconds

    def monitor(self):
        """Check if experiment should stop early"""
        while self.elapsed_seconds() < self.duration:
            metrics = self.get_current_metrics()
            # Auto-rollback if thresholds breached
            if metrics['error_rate'] > self.thresholds['error_rate']:
                print(f"ERROR RATE {metrics['error_rate']} TOO HIGH")
                self.stop_experiment()
                return False
            if metrics['latency_p99'] > self.thresholds['latency_p99']:
                print(f"LATENCY {metrics['latency_p99']}s TOO HIGH")
                self.stop_experiment()
                return False
            time.sleep(5)
        return True

    def stop_experiment(self):
        """Clean up and restore to normal"""
        self.rollback()
        self.verify_recovery()

    # capture_metrics(), get_current_metrics(), elapsed_seconds(), inject(),
    # rollback(), and verify_recovery() are tool-specific hooks to implement.
Lessons from Failed Implementations
Case Study 1: "Chaos Experiment Broke Production"
What Happened:
- Team 1 ran: "Kill all pods in service" experiment
- Did not test cascade impact on downstream services
- Downstream service got overwhelmed
- Cascading failure
- Real production outage for 2 hours
- Company-wide incident review
- Chaos program paused 3 months
Lessons:
✗ Did not understand blast radius
✗ Did not test impact on dependencies
✗ Did not have sufficient safeguards
✓ Should have tested in staging first
✓ Should have started smaller (kill 1 pod, not all)
✓ Should have had auto-rollback
Prevention:
1. Mandate staging environment first
2. Start with 1% blast radius
3. Progressive escalation plan
4. Multiple layers of safeguards
Case Study 2: "Chaos Program Abandoned"
What Happened:
- Initial 2-month sprint: High energy, 50 experiments
- Teams excited, learning a lot
- Issues fixed, reliability improved
- End of sprint: Momentum lost
- 6 months later: No chaos experiments run
- Skills atrophied, interest gone
Lesson:
✗ Treated as sprint initiative, not ongoing practice
✗ No permanent team assigned
✗ No processes in place
✓ Should have had permanent structure
✓ Should have built into standard practices
✓ Should have created ongoing metrics/tracking
Prevention:
1. Institutionalize from day 1
2. Assign permanent team members
3. Build into deployment process
4. Monthly reviews mandatory
5. Quarterly planning for next set
Case Study 3: "Running Chaos on Critical Services Too Early"
What Happened:
- Team eager to test everything
- Started with "payment processing" service (critical)
- Experiment revealed: Payment service doesn't handle latency well
- While running experiment: Unexpected message queue backup
- Payment processing failed
- Orders lost
- Customers couldn't complete purchases
Lesson:
✗ Tested critical service before understanding it
✗ No staged approach
✗ Insufficient monitoring of message queue
Prevention:
1. Non-critical services first (analytics, notifications, etc.)
2. Only move to critical services after experience
3. Double safeguards for critical services
4. Comprehensive monitoring in place first
Checklist: Happy Path Implementation
Before Running Any Experiments:
☐ Monitoring visible and working
☐ Alerting configured (not noisy)
☐ Team trained on basics
☐ Staging environment tested
☐ Safeguards in place (auto-rollback)
☐ Stakeholders informed
☐ Process documented
During Experiments:
☐ Start very conservatively
☐ Progressive escalation planned
☐ Team on standby (first few times)
☐ Real-time monitoring dashboard visible
☐ Documented findings saved immediately
☐ Results shared within 24 hours
After Experiments:
☐ Issues documented
☐ Root causes identified
☐ Fixes designed
☐ Fixes verified with re-test
☐ Runbooks updated if needed
☐ Learnings shared organization-wide
☐ Metrics updated
Ongoing:
☐ Monthly reviews of incidents vs chaos findings
☐ Quarterly planning and strategy alignment
☐ Continuous learning culture
☐ Regular communication to stakeholders
Key Takeaways
- Monitor First: Can't improve what you can't measure
- Gradual: Don't jump to aggressive experiments
- Communicate: Keep stakeholders informed
- Test All Layers: Application, not just infrastructure
- Document Learnings: Share knowledge widely
- Automate Late: Master manual first
- Permanent: Make it part of standard operations
- Safe: Start conservatively, scale gradually