Overview
While chaos engineering tests resilience, knowing how to build resilient systems is equally important. Advanced resilience patterns help systems withstand failure gracefully.
Pattern 1: Circuit Breaker
Purpose: Prevent cascading failures by stopping requests to failing services
The Problem It Solves
Without Circuit Breaker:
Service A calls Service B repeatedly
Service B is down
Service A keeps trying, wasting resources
Requests pile up, eventually Service A crashes too (cascading failure)
With Circuit Breaker:
Service B is down
First few calls fail
Circuit breaker opens (STOP making calls)
Requests immediately fail fast (users can try again)
When Service B recovers, circuit breaker closes
States
CLOSED (Normal Operation)
│
├─ Requests succeed? → Stay CLOSED
│
└─ Error threshold exceeded? → Go to OPEN
OPEN (Failing, Stop Trying)
│
├─ Timeout elapsed? → Go to HALF-OPEN
│
└─ Continue failing immediately (fail-fast)
HALF-OPEN (Testing if service recovered)
│
├─ Test requests succeed? → Go to CLOSED
│
└─ Test requests fail? → Go to OPEN
Implementation Example
from datetime import datetime, timedelta
class CircuitBreaker:
def __init__(self, failure_threshold=5, timeout=60):
self.failure_threshold = failure_threshold
self.timeout = timeout
self.state = 'CLOSED'
self.failure_count = 0
self.last_failure_time = None
def call(self, func, *args, **kwargs):
if self.state == 'OPEN':
if self._should_attempt_reset():
self.state = 'HALF_OPEN'
else:
raise Exception('Circuit breaker is OPEN')
try:
result = func(*args, **kwargs)
self._on_success()
return result
except Exception as e:
self._on_failure()
raise
def _on_success(self):
self.failure_count = 0
self.state = 'CLOSED'
def _on_failure(self):
self.failure_count += 1
self.last_failure_time = datetime.now()
if self.failure_count >= self.failure_threshold:
self.state = 'OPEN'
def _should_attempt_reset(self):
return (datetime.now() - self.last_failure_time) > timedelta(seconds=self.timeout)
# Usage
import requests

breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def call_external_service():
    try:
        return breaker.call(requests.get, 'https://api.example.com/data')
    except Exception:
        return {'error': 'Service unavailable', 'status': 503}

Chaos Test for Circuit Breaker
# Test that circuit breaker activates when service fails
Hypothesis: "When downstream service errors exceed threshold,
circuit breaker activates and prevents cascading failure"
Test Steps:
1. Make 10 requests to downstream service (should succeed)
2. Shut down downstream service
3. Make 10 requests to downstream service (should fail, circuit breaker opens)
4. Verify no further requests are made for 60 seconds (fail-fast)
5. Restart downstream service
6. Circuit breaker eventually closes after timeout
7. Requests resume succeeding
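The steps above can be simulated in-process with a condensed copy of the CircuitBreaker class from the previous section (a short 0.1s timeout is used so the recovery step runs quickly; `downstream` and `service_up` are illustrative stand-ins for a real service):

```python
import time
from datetime import datetime, timedelta

class CircuitBreaker:
    # Condensed copy of the class shown earlier, with a 0.1s timeout
    def __init__(self, failure_threshold=3, timeout=0.1):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.state = 'CLOSED'
        self.failure_count = 0
        self.last_failure_time = None

    def call(self, func):
        if self.state == 'OPEN':
            elapsed = datetime.now() - self.last_failure_time
            if elapsed > timedelta(seconds=self.timeout):
                self.state = 'HALF_OPEN'
            else:
                raise RuntimeError('Circuit breaker is OPEN')
        try:
            result = func()
        except Exception:
            self.failure_count += 1
            self.last_failure_time = datetime.now()
            if self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'
            raise
        self.failure_count = 0
        self.state = 'CLOSED'
        return result

service_up = True

def downstream():
    if not service_up:
        raise ConnectionError('service down')
    return 'ok'

breaker = CircuitBreaker()

assert breaker.call(downstream) == 'ok'   # Step 1: healthy requests succeed

service_up = False                        # Step 2: shut the service down
for _ in range(3):                        # Step 3: failures trip the breaker
    try:
        breaker.call(downstream)
    except ConnectionError:
        pass
assert breaker.state == 'OPEN'

try:                                      # Step 4: fail-fast while OPEN,
    breaker.call(downstream)              # without touching the service
except RuntimeError as exc:
    print(exc)                            # Circuit breaker is OPEN

service_up = True                         # Step 5: restore the service
time.sleep(0.2)                           # Step 6: wait out the timeout
assert breaker.call(downstream) == 'ok'   # Step 7: requests succeed again
assert breaker.state == 'CLOSED'
```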
Expected: System handles failure gracefully, not cascading

Pattern 2: Bulkhead (Thread Pool Isolation)
Purpose: Isolate failures to prevent resource exhaustion from spreading
The Problem
Without Bulkheads (Shared Thread Pool):
10 threads total, all shared
Payment service uses 8 threads, gets slow
Analytics service needs threads but all 8 are stuck waiting
Analytics fails, entire system grinds to halt
With Bulkheads (Separate Thread Pools):
Payment service: 5 threads
Analytics service: 3 threads
Notifications service: 2 threads
If the payment service gets slow, analytics and notifications still work
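The same isolation can be sketched in Python with separate `concurrent.futures` pools before looking at the fuller Java implementation below (pool sizes, task names, and the 0.5s deadline are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
import time

# One pool per service: a stall in payments cannot starve analytics
payment_pool = ThreadPoolExecutor(max_workers=2)
analytics_pool = ThreadPoolExecutor(max_workers=2)

def slow_payment():
    time.sleep(1)  # Simulates a hung downstream dependency
    return 'paid'

def quick_event():
    return 'tracked'

# Saturate the payment pool: four slow tasks on two workers
payment_futures = [payment_pool.submit(slow_payment) for _ in range(4)]

# Analytics still answers immediately despite payments being stuck
start = time.time()
result = analytics_pool.submit(quick_event).result(timeout=0.5)
elapsed = time.time() - start
print(result)  # tracked, well inside the 0.5s deadline
```

With a single shared pool, the four stuck payment tasks would occupy every worker and the analytics call would time out instead.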
Implementation
import java.util.concurrent.*;
class BulkheadExecutor {
private final ExecutorService paymentThreadPool;
private final ExecutorService analyticsThreadPool;
private final ExecutorService notificationsThreadPool;
public BulkheadExecutor() {
// Separate thread pools with defined sizes
paymentThreadPool = Executors.newFixedThreadPool(5);
analyticsThreadPool = Executors.newFixedThreadPool(3);
notificationsThreadPool = Executors.newFixedThreadPool(2);
}
public Future<PaymentResult> processPayment(Payment payment) {
// If payment pool is full, requests queue or fail fast
return paymentThreadPool.submit(() -> {
// Process payment with isolated resources
return paymentService.process(payment);
});
}
public Future<AnalyticsEvent> trackEvent(Event event) {
// Analytics failure won't affect payment processing
return analyticsThreadPool.submit(() -> {
return analyticsService.track(event);
});
}
// If thread pool is exhausted, fail fast
public Future<NotificationResult> sendNotification(Notification notif) {
try {
return notificationsThreadPool.submit(() -> {
return notificationService.send(notif);
});
} catch (RejectedExecutionException e) {
// Pool full - return immediate failure instead of queueing
CompletableFuture<NotificationResult> future = new CompletableFuture<>();
future.completeExceptionally(
                new Exception("Notification service overloaded")
);
return future;
}
}
}

Kubernetes Pod Disruption Budgets (PDB)
PDBs are the Kubernetes analogue of bulkheads: they bound how many pods can be disrupted at once, so maintenance on one service cannot wipe out its capacity:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: critical-service-pdb
spec:
minAvailable: 2 # Always keep at least 2 pods running
selector:
matchLabels:
app: payment-service
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: analytics-service-pdb
spec:
maxUnavailable: 1 # Allow only 1 pod to be disrupted at a time
selector:
matchLabels:
      app: analytics-service

Pattern 3: Retry with Exponential Backoff and Jitter
Purpose: Handle transient failures automatically without overwhelming the system
The Problem
Without Exponential Backoff:
Request fails
Immediately retry
Immediately retry
Immediately retry (thundering herd - all clients retry simultaneously)
System still slow, more failures
With Exponential Backoff + Jitter:
Request fails (first retry: wait 1s with random jitter)
Still fails (second retry: wait 2s with random jitter)
Still fails (third retry: wait 4s with random jitter)
Clients retry at different times (jitter spreads load)
System recovers gradually
Implementation
import random
import time
def retry_with_backoff(func, max_retries=3, base_delay=1):
    """
    Retry a function with exponential backoff and jitter
    """
    retries = 0
    while retries < max_retries:
        try:
            return func()
        except Exception:
            if retries >= max_retries - 1:
                raise  # Give up after max retries
            # Delay doubles each attempt: base_delay * 2^retries, plus jitter
            delay = base_delay * (2 ** retries) + random.uniform(0, 0.1)
            print(f"Attempt {retries + 1} failed. Retrying in {delay:.2f}s...")
            time.sleep(delay)
            retries += 1
# Usage
def call_database():
# This might fail temporarily
return db.execute_query()
result = retry_with_backoff(call_database, max_retries=3, base_delay=1)

Decorator Pattern
import functools
import random
import time
def retry_decorator(max_retries=3, base_delay=1, exceptions=(Exception,)):
def decorator(func):
@functools.wraps(func)
def wrapper(*args, **kwargs):
retries = 0
while retries < max_retries:
try:
return func(*args, **kwargs)
except exceptions as e:
if retries >= max_retries - 1:
raise
                    delay = base_delay * (2 ** retries) + random.uniform(0, 0.1)
time.sleep(delay)
retries += 1
return wrapper
return decorator
# Usage
@retry_decorator(max_retries=3, base_delay=1)
def send_email(email):
# This might fail transiently
    return email_service.send(email)

Pattern 4: Timeout
Purpose: Prevent hung requests from consuming resources indefinitely
The Problem
Without Timeout:
Request sent to slow service
Request takes 30 seconds (connection hung)
Client waits forever
Thread occupied
More requests... more hung threads
Thread pool exhausted
Entire system becomes unresponsive
With Timeout:
Request sent to slow service
After 5 seconds with no response, timeout
Request fails fast, resource released
Client can retry or show error
Implementation
import requests
def call_with_timeout(url, timeout=5):
try:
# Requests library supports timeout
response = requests.get(url, timeout=timeout)
return response.json()
    except requests.Timeout:
        print(f"Request timed out after {timeout}s")
        raise
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        raise
# Different timeout for different operations
# Connect timeout: 2s (if connection takes >2s, fail)
# Read timeout: 5s (if no data for >5s, fail)
response = requests.get(
'https://api.example.com/data',
timeout=(2, 5) # (connect_timeout, read_timeout)
)

Kubernetes Timeout Configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: backend-service
spec:
hosts:
- backend-service
http:
- match:
- uri:
prefix: /api/
route:
- destination:
host: backend-service
port:
number: 8080
timeout: 5s # Maximum time for request
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: backend-service
spec:
host: backend-service
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 50
http2MaxRequests: 100
        maxRequestsPerConnection: 2

Pattern 5: Graceful Degradation
Purpose: Provide reduced functionality instead of complete failure
The Problem
Without Graceful Degradation:
Cache service (Redis) fails
Application throws error
Users see \"500 Internal Server Error\"
Users angry, leave your site
With Graceful Degradation:
Cache service (Redis) fails
Application detects circuit breaker OPEN
Queries database directly (slower but works)
Users see same page, slightly slower (2s instead of 200ms)
Users don't notice, system recovers
Once cache recovers, performance returns to normal
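The fallback chain above can be exercised end to end with in-memory stand-ins (`FlakyCache` and `FakeDB` are illustrative test doubles, not real clients):

```python
class FlakyCache:
    # Toggleable in-memory cache that simulates a Redis outage
    def __init__(self):
        self.up = True
        self.store = {}
    def get(self, key):
        if not self.up:
            raise ConnectionError('cache down')
        return self.store.get(key)
    def set(self, key, value):
        if not self.up:
            raise ConnectionError('cache down')
        self.store[key] = value

class FakeDB:
    def query_user(self, user_id):
        return {'id': user_id, 'name': 'Ada'}

def get_user(cache, db, user_id):
    # Fast path: cache; on any cache error, degrade to the database
    try:
        cached = cache.get(f'user:{user_id}')
        if cached:
            return cached
    except Exception:
        pass  # Cache failure is not fatal
    user = db.query_user(user_id)
    try:
        cache.set(f'user:{user_id}', user)
    except Exception:
        pass  # Best-effort write-back
    return user

cache, db = FlakyCache(), FakeDB()
cache.up = False  # Simulate the Redis outage
print(get_user(cache, db, 123))  # {'id': 123, 'name': 'Ada'}
```

The request still succeeds during the outage; it just takes the slower database path, which is exactly the degradation described above.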
Implementation
class UserService:
def get_user(self, user_id):
try:
# Try cache first (fast path)
user = self.cache.get(f'user:{user_id}')
if user:
return user
# Cache miss, get from database
user = self.db.query_user(user_id)
# Try to cache for next time
try:
self.cache.set(f'user:{user_id}', user, ttl=3600)
except Exception as cache_error:
# Cache failure is not critical
logger.warning(f'Cache set failed: {cache_error}')
# Continue without caching
return user
except Exception as db_error:
# Last resort: return stale cached data
logger.error(f'Database query failed: {db_error}')
stale_user = self.cache_backup.get(f'user:{user_id}')
if stale_user:
# Mark as stale so client knows this might be old
stale_user['_stale'] = True
return stale_user
else:
# No stale data available, must fail
                raise Exception('User service unavailable')
# Usage
service = UserService()
user = service.get_user(123)
print(user)  # Works even if cache or database temporarily fails

Pattern 6: Rate Limiting / Throttling
Purpose: Prevent overwhelming dependent systems during failure
Token Bucket Algorithm
import time
from threading import Lock
class TokenBucket:
def __init__(self, capacity, refill_rate):
self.capacity = capacity # Max tokens
self.tokens = capacity # Current tokens
self.refill_rate = refill_rate # Tokens per second
self.last_refill = time.time()
self.lock = Lock()
def allow_request(self):
        """
        Check if request is allowed (has token available)
        """
with self.lock:
self._refill()
            if self.tokens >= 1:
self.tokens -= 1
return True
return False
def _refill(self):
now = time.time()
elapsed = now - self.last_refill
tokens_to_add = elapsed * self.refill_rate
self.tokens = min(self.capacity, self.tokens + tokens_to_add)
self.last_refill = now
# Usage: 100 requests per second, burst to 200
limiter = TokenBucket(capacity=200, refill_rate=100)
def api_endpoint():
if not limiter.allow_request():
return {'error': 'Rate limit exceeded'}, 429
    return process_request()

Kubernetes Rate Limiting
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: api-service
spec:
hosts:
- api-service
http:
  - fault:
      delay:
        percentage:
          value: 0.1          # Delay 0.1% of requests
        fixedDelay: 5s        # By 5 seconds (for testing)
      abort:
        percentage:
          value: 0.001        # Abort 0.001% of requests
        grpcStatus: UNAVAILABLE  # Return UNAVAILABLE
route:
- destination:
host: api-service
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: api-service
spec:
host: api-service
trafficPolicy:
connectionPool:
tcp:
maxConnections: 1000
http:
http1MaxPendingRequests: 100
http2MaxRequests: 1000
        maxRequestsPerConnection: 1

Testing These Patterns with Chaos Engineering
Experiment 1: Verify Circuit Breaker
# Make requests that work
for i in {1..10}; do curl http://service/health; done
# Kill service
kubectl delete pod -n production service-pod-xyz
# Make requests (should see circuit breaker activate)
for i in {1..20}; do curl http://service/api; done
# Restore service
kubectl get pods -n production # Verify pod restarted
# Requests should eventually succeed
for i in {1..10}; do curl http://service/api; done

Experiment 2: Verify Timeout
# Add 10-second latency to all requests
gremlin attack latency --latency 10000 --length 60
# Verify requests timeout at 5 seconds instead of hanging
time curl http://service/api  # Should fail after ~5s, not hang

Key Takeaways
- Circuit Breaker: Stop cascading by failing fast
- Bulkhead: Isolate failures with resource separation
- Retry + Backoff: Handle transient failures automatically
- Timeout: Prevent resource exhaustion from hung requests
- Graceful Degradation: Provide reduced functionality instead of complete failure
- Rate Limiting: Protect against being overwhelmed