Overview
Chaos experiments are useless if you cannot observe the results. Monitoring detects the impact, while observability helps you understand why the system behaved the way it did during the chaos.
Key Observability Metrics
- System Health: CPU, Memory, Disk I/O.
- Application Performance: Request rate, error rate, and duration (latency), known as the RED method.
- Control Plane: Kubernetes API server latency, etcd health.
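The application-level metrics in the list above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular monitoring tool computes them: the `Request` record, the `red_metrics` function, and the choice of a nearest-rank 95th percentile for duration are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    """One observed request (hypothetical record shape for illustration)."""
    duration_s: float  # how long the request took, in seconds
    status: int        # HTTP status code returned


def red_metrics(requests: Sequence[Request], window_s: float) -> dict:
    """Compute request rate, error ratio, and p95 duration for one time window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    total = len(requests)
    if total == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "p95_duration_s": 0.0}

    # Count server-side errors (5xx), matching the status filter used in this section.
    errors = sum(1 for r in requests if 500 <= r.status <= 599)

    # Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
    durations = sorted(r.duration_s for r in requests)
    rank = (95 * total + 99) // 100  # integer ceil(0.95 * total)

    return {
        "rate": total / window_s,          # requests per second
        "error_ratio": errors / total,     # fraction of requests that failed
        "p95_duration_s": durations[rank - 1],
    }
```

In practice these values usually come from the monitoring system itself rather than from raw request records; the sketch only shows what each of the three numbers means.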
Example: Monitoring During a Chaos Attack
If you are injecting latency into an API, your monitoring tool should show a clear spike in latency, and in the error rate if requests start timing out, at the time of the attack.
# Query to check error rate spike during experiment
rate(http_requests_total{status=~"5.."}[1m])
Expected Result (Dashboard Graph): You should see a sharp increase in the graph precisely when the chaos experiment starts and a return to normal once the experiment is finished.
[Time] [Error Rate]
10:00 0.01
10:01 0.05 <-- Chaos experiment starts
10:02 0.08
10:03 0.01 <-- Chaos experiment stops
If the error rate doesn't return to normal, or spikes significantly higher than expected, your hypothesis may be incorrect.
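The check described above can also be automated. The following is a rough sketch under stated assumptions: the function name `evaluate_experiment`, the `max_during` limit, and the 50% `recovery_tolerance` are hypothetical choices for illustration, and the samples are assumed to arrive at a fixed interval with at least one reading before and after the experiment.

```python
from typing import Sequence


def evaluate_experiment(
    samples: Sequence[float],
    start: int,
    stop: int,
    max_during: float,
    recovery_tolerance: float = 0.5,
) -> dict:
    """Compare error-rate samples against the pre-experiment baseline.

    samples: error-rate values at a fixed interval (for example, one per minute)
    start:   index of the first sample taken during the experiment
    stop:    index of the first sample taken after the experiment ended
    """
    if not 0 < start < stop < len(samples):
        raise ValueError("need samples before, during, and after the experiment")

    # Baseline is the mean of everything observed before the experiment began.
    baseline = sum(samples[:start]) / start
    peak = max(samples[start:stop])

    # Recovered if every post-experiment sample is within tolerance of the baseline.
    limit = baseline * (1 + recovery_tolerance)
    recovered = all(s <= limit for s in samples[stop:])

    return {
        "baseline": baseline,
        "peak": peak,
        "spike_within_limit": peak <= max_during,
        "recovered": recovered,
    }


# Values from the table above: 10:00 is the baseline, 10:01 and 10:02 fall
# inside the experiment, and 10:03 is treated as the first reading after it.
result = evaluate_experiment([0.01, 0.05, 0.08, 0.01], start=1, stop=3, max_during=0.10)
print(result)
```

With the sample values from the table, both `spike_within_limit` and `recovered` come out true. If either is false, that is the signal to revisit the hypothesis. Note that a baseline of exactly zero makes the recovery check very strict, so a small absolute floor may be worth adding in real use.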