Lesson 13 of 14

Measuring ROI

Part of the Chaos Engineering tutorial series.

Why Measurement Matters

Without measurement, you cannot:

  • Prove chaos engineering provides value
  • Justify continued investment and team time
  • Make informed decisions about scaling
  • Identify which experiments are most valuable
  • Track improvement over time

Key Metrics to Track

Tier 1: Reliability Metrics (Core)

1. System Availability

Definition: Percentage of time the system is operational

Formula: Uptime / Total Time * 100

Example:
  Month 1 (before chaos): 99.0% = ~7.2 hours downtime
  Month 6 (after chaos): 99.9% = ~43 minutes downtime
  Month 12 (after chaos): 99.95% = ~22 minutes downtime

Target: should increase by 0.5-1 percentage points within 12 months (as in the example above)

Implementation:

# Calculate from error logs
monitored_period = 30 * 24 * 60  # 30 days in minutes
downtime_incidents = [
    {'start': 14532, 'duration': 45},  # Minute and duration
    {'start': 18921, 'duration': 12},
    {'start': 22145, 'duration': 8},
]
 
total_downtime = sum(i['duration'] for i in downtime_incidents)
availability = ((monitored_period - total_downtime) / monitored_period) * 100
 
print(f"Availability: {availability:.3f}%")

2. Mean Time Between Failures (MTBF)

Definition: Average time between system failures

Formula: Total Uptime / Number of Failures

Example:
  Before:  700 hours / 10 failures = 70 hours MTBF
  After:   6900 hours / 5 failures = 1380 hours MTBF
  Improvement: 20x longer between failures

Tracking:

Failure Log:
  - Time: 2024-01-05 14:32:00
    Duration: 45 minutes
    Cause: Database connection pool exhausted
    
  - Time: 2024-01-07 09:15:00
    Duration: 12 minutes
    Cause: Cache service unavailable
    
  - Time: 2024-01-12 18:45:00
    Duration: 8 minutes
    Cause: Network latency spike
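MTBF can be computed directly from a failure log like the one above; a minimal sketch (the 30-day monitoring window is an assumption, and the durations mirror the log entries):

```python
# Failure log mirroring the entries above (durations in minutes)
failures = [
    {"time": "2024-01-05 14:32:00", "duration_min": 45},
    {"time": "2024-01-07 09:15:00", "duration_min": 12},
    {"time": "2024-01-12 18:45:00", "duration_min": 8},
]

monitored_hours = 30 * 24  # assumed 30-day monitoring window
total_downtime_hours = sum(f["duration_min"] for f in failures) / 60

# MTBF = total uptime / number of failures
mtbf_hours = (monitored_hours - total_downtime_hours) / len(failures)
print(f"MTBF: {mtbf_hours:.1f} hours")
```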

3. Mean Time To Recovery (MTTR)

Definition: Average time from failure detection to full recovery

Before:
  Detection: 10 min (alert ignored)
  Diagnosis: 15 min (root cause unclear)
  Response: 10 min (ownership unclear)
  Fix: 20 min (apply fix)
  Total: 55 minutes average

After:
  Detection: 2 min (team knows pattern)
  Diagnosis: 3 min (practiced this)
  Response: Automatic (failover works)
  Fix: 5 min (permanent fix)
  Total: 10 minutes average

Target: Reduce by 50-80%

Calculation:

import statistics
 
incidents = [
    {'name': 'Database failure 1', 'minutes': 55},
    {'name': 'Cache timeout', 'minutes': 43},
    {'name': 'Network partition', 'minutes': 38},
    {'name': 'Memory leak', 'minutes': 67},
    {'name': 'Disk full', 'minutes': 51},
]
 
mttr = statistics.mean([i['minutes'] for i in incidents])
print(f"MTTR: {mttr:.0f} minutes")

# Track the trend against the baseline
mttr_before = 55
mttr_after = 10
improvement = ((mttr_before - mttr_after) / mttr_before) * 100
print(f"MTTR improved by {improvement:.0f}%")

Tier 2: Operational Metrics

1. Issues Found by Chaos vs Production

Metric: "Chaos-Found" vs "Production-Found" issues

Ideal: Almost everything found by chaos, nothing by production

Example:
  Month 1: 5 issues found by chaos, 2 by production = 71% found by chaos
  Month 6: 20 issues found by chaos, 1 by production = 95% found by chaos
  Month 12: 40 issues found by chaos, 0 by production = 100% found by chaos

Target: >95% of issues caught before reaching customers

Tracking:

Issues Matrix:
  
  Chaos-Found (Good):
    - Circuit breaker threshold too aggressive (fixed)
    - Retry logic causes thundering herd (fixed)
    - Timeout too short for database (fixed)
    - PVC attachment fails silently (fixed)
    
  Production-Found (Bad):
    - DNS cache stale after zone change (1 incident, 30 min)
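Tracked over time, the ratio is a one-liner; a minimal sketch using the monthly counts from the example above:

```python
def chaos_found_pct(chaos_found: int, production_found: int) -> float:
    """Percentage of issues caught by chaos experiments rather than in production."""
    total = chaos_found + production_found
    return 100.0 * chaos_found / total if total else 100.0

# Monthly counts from the example above
for month, chaos, prod in [(1, 5, 2), (6, 20, 1), (12, 40, 0)]:
    print(f"Month {month}: {chaos_found_pct(chaos, prod):.0f}% found by chaos")
```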

2. Chaos Tests Created

Metric: Volume of experiments and coverage

Example:
  Sprint 1: 2 experiments (basic)
  Sprint 2: 5 experiments (moderate coverage)
  Sprint 3: 12 experiments (good coverage)
  Sprint 4: 20 experiments (comprehensive)
  Target: 1 experiment per critical system + 5+ combinations

3. Remediation Time

Metric: Time from discovering issue via chaos to permanent fix

Definition: Issue found → Root cause identified → Fix deployed

Good: < 1 week
Excellent: < 3 days
Outstanding: Same day

Example:
  Issue: "Circuit breaker threshold too low"
  Found: Wednesday morning
  Fixed: Wednesday afternoon
  Remediation Time: 4 hours
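Remediation time is simplest to track as the gap between two timestamps; a minimal sketch (the concrete times are hypothetical, chosen to match the 4-hour example):

```python
from datetime import datetime

def remediation_hours(found: str, fixed: str, fmt: str = "%Y-%m-%d %H:%M") -> float:
    """Hours from chaos discovery to deployed fix."""
    delta = datetime.strptime(fixed, fmt) - datetime.strptime(found, fmt)
    return delta.total_seconds() / 3600

# Hypothetical timestamps: found Wednesday morning, fixed that afternoon
hours = remediation_hours("2024-03-06 09:00", "2024-03-06 13:00")
print(f"Remediation time: {hours:.0f} hours")  # 4 hours
```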

Tier 3: Business Metrics

1. Revenue Impact of Downtime

Formula: Downtime Hours × Revenue Per Hour

Example (E-commerce):
  Before:  5 outages/year × 1 hour avg = 5 hours downtime
           5 hours × $100k/hour = $500k lost revenue
  
  After:   1 outage/year × 0.3 hours avg = 0.3 hours downtime
           0.3 hours × $100k/hour = $30k lost revenue
  
  Savings: $470k/year

Example (SaaS):
  Customer churn due to outages:
    Before: 2% additional churn after major outage × $1M MRR = $20k lost MRR
    After: Only 0.2% additional churn = $2k lost MRR
    Savings: $18k MRR or $216k/year
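The churn arithmetic above generalizes to any MRR and churn delta; a minimal sketch (the rates and MRR are the example's illustrative figures):

```python
def churn_loss(mrr: float, extra_churn_rate: float) -> float:
    """MRR lost to outage-driven churn."""
    return mrr * extra_churn_rate

mrr = 1_000_000
before = churn_loss(mrr, 0.02)   # 2% extra churn after a major outage
after = churn_loss(mrr, 0.002)   # 0.2% extra churn
monthly_savings = before - after
print(f"Savings: ${monthly_savings:,.0f} MRR (${monthly_savings * 12:,.0f}/year)")
```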

Calculation Tool:

class DowntimeROI:
    def __init__(self, revenue_per_hour, base_annual_outages, base_outage_duration):
        self.revenue_per_hour = revenue_per_hour
        self.base_annual_outages = base_annual_outages
        self.base_outage_duration = base_outage_duration
    
    def calculate_before(self):
        total_hours = self.base_annual_outages * self.base_outage_duration
        return total_hours * self.revenue_per_hour
    
    def calculate_after(self, improved_outages, improved_duration):
        total_hours = improved_outages * improved_duration
        return total_hours * self.revenue_per_hour
    
    def roi_savings(self, before_cost, after_cost):
        return before_cost - after_cost
 
# Example usage
roi = DowntimeROI(
    revenue_per_hour=100_000,
    base_annual_outages=5,
    base_outage_duration=1.0  # hours
)
 
before = roi.calculate_before()  # $500,000
after = roi.calculate_after(improved_outages=1, improved_duration=0.3)  # $30,000
savings = roi.roi_savings(before, after)  # $470,000
 
print(f"Annual downtime cost before: ${before:,.0f}")
print(f"Annual downtime cost after: ${after:,.0f}")
print(f"Annual savings: ${savings:,.0f}")

2. Customer Satisfaction

Metric: Impact on NPS, CSAT, or customer sentiment

Before:
  NPS: 35 (acceptable)
  CSAT: 72%
  Complaints about "frequent downtime": 15% of feedback

After (12 months):
  NPS: 52 (good improvement)
  CSAT: 87%
  Complaints about "frequent downtime": 2% of feedback
  Improvement: +17 NPS points, +15% CSAT

3. Engineer Productivity

Metric: How much time engineers spend firefighting vs building

Before:
  Firefighting/on-call: 40% of time
  Feature development: 35% of time
  Technical debt/improvement: 25% of time

After:
  Firefighting/on-call: 10% of time
  Feature development: 55% of time
  Technical debt/improvement: 35% of time

Productivity gain: feature-development time up 20 percentage points × engineering team cost
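To attach a dollar figure, multiply the percentage-point shift in feature-development time by fully loaded engineering payroll; a sketch with assumed team size and cost:

```python
def reclaimed_value(team_size: int, cost_per_engineer: float,
                    feature_share_before: float, feature_share_after: float) -> float:
    """Annual value of engineering time shifted from firefighting to feature work."""
    payroll = team_size * cost_per_engineer
    return payroll * (feature_share_after - feature_share_before)

# Assumed: 20 engineers at $180k fully loaded; feature time 35% -> 55%
value = reclaimed_value(20, 180_000, 0.35, 0.55)
print(f"Reclaimed engineering value: ${value:,.0f}/year")  # ~$720,000/year
```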

Calculating ROI

Investment Required

Year 1 Costs:
  Tool (Gremlin): $60,000/year
  Dedicated engineering time: $120,000/year (0.5 FTE)
  Training and resources: $20,000
  Infrastructure for testing: $10,000
  
  Total Year 1 Investment: $210,000

Expected Return (Conservative)

Cost of a Major Outage:
  Direct downtime: $500,000 (5 hours @ $100k/hr)
  Infrastructure recovery: $50,000
  Customer support overtime: $10,000
  Potential churn: $200,000
  Total per major outage: $760,000

With Chaos Engineering:
  Prevent 4 of 5 annual major outages
  Reduce remaining outage severity by 70%

  Return = ($760,000 × 4) + ($760,000 × 0.7)
  Return = $3,040,000 + $532,000
  Return = $3,572,000/year

ROI Calculation

ROI = (Total Return - Investment) / Investment × 100

ROI = ($3,572,000 - $210,000) / $210,000 × 100
ROI = $3,362,000 / $210,000 × 100
ROI = 1,601%

Payback Period:
  $210,000 / ($3,572,000 / 365) = 21 days
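The same arithmetic as a reusable helper; a minimal sketch (note that a 70% severity reduction is counted as 0.7 × the outage cost avoided on the remaining incident):

```python
def chaos_roi(outage_cost: float, outages_prevented: int,
              severity_reduction: float, investment: float):
    """Annual return, ROI %, and payback period for a chaos engineering program."""
    annual_return = (outage_cost * outages_prevented      # outages prevented outright
                     + outage_cost * severity_reduction)  # savings on the remaining outage
    roi_pct = (annual_return - investment) / investment * 100
    payback_days = investment / (annual_return / 365)
    return annual_return, roi_pct, payback_days

ret, roi, payback = chaos_roi(760_000, 4, 0.7, 210_000)
print(f"Return: ${ret:,.0f}  ROI: {roi:.0f}%  Payback: {payback:.0f} days")
```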

Creating a Metrics Dashboard

Prometheus Queries

# Availability (0-1; multiply by 100 for a percentage)
(rate(requests_total[1h]) - rate(requests_errors_total[1h])) / rate(requests_total[1h])

# MTTR - p95 time from failure detection to recovery
histogram_quantile(0.95, rate(recovery_time_seconds_bucket[1d]))

# Incident frequency
increase(incidents_total[24h])

# Issues found by chaos vs. by production (compare the two series)
increase(issues_found_by_chaos_total[7d])
increase(issues_found_by_production_total[7d])

# P99 latency trend over time
histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))

Sample Dashboard

┌──────────────────────────────────────────────┐
│  Chaos Engineering Impact Dashboard          │
├──────────────────────────────────────────────┤
│                                              │
│  Availability      99.95%  ↑ from 99.0%      │
│  MTTR              10 min  ↓ from 55 min     │
│  Annual Downtime   22 min  ↓ from 7.2 hours  │
│  Issues Prevented  95 this year              │
│                                              │
├──────────────────────────────────────────────┤
│                                              │
│  Incidents/Month (┆ chaos start):            │
│  8 │ █   █                                   │
│  6 │ █   █   █ ┆ █                           │
│  4 │ █   █   █ ┆ █   █   █                   │
│  2 │ █   █   █ ┆ █   █   █   █   █   █       │
│  0 └───────────┴──────────────────────────── │
│     Jan Feb Mar Apr May Jun Jul Aug Sep Oct  │
│                                              │
└──────────────────────────────────────────────┘

## Executive Reporting

### Monthly Report Template

CHAOS ENGINEERING - MONTHLY IMPACT REPORT

Period: [Month/Year]

RELIABILITY METRICS
  ✓ System Availability: 99.95% (Target: 99.9%)
  ✓ Average MTTR: 8 minutes (Target: <15 min)
  ✓ Incidents This Month: 1 (Target: <2)
  ✓ None caused by failure modes already covered by chaos testing

EXPERIMENT ACTIVITY
  ✓ Experiments Run: 15

  • 3 discovered issues (all fixed before reaching customers)
  • 0 caused production impact
  • Average blast radius: 2% of traffic

✓ Issues Found: 3

  • Timeout configuration too aggressive
  • Retry logic needed jitter
  • PVC attachment timeout too short

BUSINESS IMPACT
  ✓ Downtime Cost This Month: $8,000 (1 minor incident)
  ✓ Cost Without Chaos Engineering: $450,000 (estimated)
  ✓ Cost Avoided: $442,000
  ✓ Investment This Month: $17,500
  ✓ ROI This Month: 25x

CUSTOMER SATISFACTION
  ✓ Complaints about "downtime": Down 50% YoY
  ✓ NPS Score: +52 (up from +35 at start)
  ✓ CSAT: 87% (up from 72%)

NEXT MONTH FOCUS

  • Implement graceful degradation for cache layer
  • Test multi-region failover
  • Automate chaos tests in CI/CD pipeline
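A template like this is easy to generate from a metrics dictionary; a minimal sketch (the field names are illustrative, not from any particular tool):

```python
def render_report(m: dict) -> str:
    """Render the headline figures of the monthly impact report."""
    return "\n".join([
        "CHAOS ENGINEERING - MONTHLY IMPACT REPORT",
        f"Period: {m['period']}",
        f"  Availability: {m['availability']:.2f}% (Target: {m['target']}%)",
        f"  Average MTTR: {m['mttr_min']} minutes",
        f"  Cost Avoided: ${m['cost_avoided']:,.0f}",
        f"  ROI This Month: {m['cost_avoided'] / m['investment']:.0f}x",
    ])

print(render_report({
    "period": "June 2024", "availability": 99.95, "target": 99.9,
    "mttr_min": 8, "cost_avoided": 442_000, "investment": 17_500,
}))
```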

## Key Takeaways

1. **Measure Multiple Dimensions**: Reliability + Operations + Business
2. **Track Trends**: Month-over-month and year-over-year changes matter most
3. **Communicate Value**: Translate metrics to business language ($$)
4. **Conservative Estimates**: Better to under-promise and over-deliver
5. **Continuous Improvement**: Use metrics to guide investment decisions

---