Lesson 5 of 14

Designing Chaos Experiments

Part of the Chaos Engineering tutorial series.

Overview

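A chaos experiment is a structured test that introduces a controlled failure into a system to verify its resilience. Rather than breaking things at random, follow the scientific method: state a prediction, limit the scope, run the test, and compare the outcome with the prediction.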

The Experiment Lifecycle

  1. Hypothesis: "If I kill the primary database pod, the service will automatically fail over without users seeing an error."
  2. Blast Radius: Define which parts of the system the experiment may affect, and keep that scope as small as the hypothesis allows.
  3. Execution: Run the experiment in a controlled environment: staging, or production with monitoring in place.
  4. Analysis: Compare the observed behavior with the hypothesis. Did the system behave as predicted?
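The four steps above can be captured as a small record that is filled in before the experiment runs. The following is a minimal sketch, not the schema of any chaos tool: the class name `ChaosExperiment`, its fields, and the `error_rate` observation key are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ChaosExperiment:
    """One experiment record. Field names are illustrative, not a tool's schema."""
    hypothesis: str
    blast_radius: list[str]           # services or pods the fault may touch
    environment: str                  # e.g. "staging" or "production"
    monitoring_enabled: bool = False
    observations: dict[str, float] = field(default_factory=dict)

    def ready_to_run(self) -> bool:
        # Require a stated hypothesis and a bounded blast radius,
        # and require monitoring before running in production.
        if not self.hypothesis.strip() or not self.blast_radius:
            return False
        if self.environment == "production" and not self.monitoring_enabled:
            return False
        return True

    def hypothesis_holds(self, max_error_rate: float = 0.0) -> bool:
        # Analysis step: a missing measurement counts as a failure.
        return self.observations.get("error_rate", 1.0) <= max_error_rate


experiment = ChaosExperiment(
    hypothesis="Killing the primary database pod triggers failover with no user-facing errors.",
    blast_radius=["primary-db-pod"],
    environment="staging",
)
```

Writing the hypothesis and blast radius down first makes the analysis step a comparison against a prediction instead of an after-the-fact interpretation.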

Example: Network Latency Injection

Inject 500 ms of latency into every request to the "auth-service".

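# Illustrative command for a fault-injection tool such as Gremlin or Litmus.
# Subcommands and flags differ by tool and version; check your tool's documentation.
gremlin attack run latency --percentage 100 --delay 500 --target auth-service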
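To see what this kind of fault looks like from the caller's side, the delay can be imitated in-process. This is a rough sketch only: real tools inject latency at the network level, while this wrapper simply sleeps before each call. The names `with_injected_latency`, `fake_auth_check`, and `timed_call`, and the 2-second client timeout budget, are assumptions and do not come from any tool.

```python
import time


def with_injected_latency(func, delay_s):
    """Wrap func so every call waits delay_s seconds before running."""
    def delayed(*args, **kwargs):
        time.sleep(delay_s)
        return func(*args, **kwargs)
    return delayed


def fake_auth_check(token):
    # Hypothetical stand-in for a request to auth-service.
    return 200 if token == "valid" else 401


def timed_call(func, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Add 500 ms to every call, matching the experiment above.
slow_auth = with_injected_latency(fake_auth_check, delay_s=0.5)
status, elapsed = timed_call(slow_auth, "valid")

# Assumed client timeout budget of 2 s: the call is slow but still succeeds.
within_budget = elapsed < 2.0
```

If the caller's timeout were shorter than the injected delay, the same experiment would surface failed requests rather than slow ones, which is the kind of finding the hypothesis is meant to test.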

Expected Result: The system stays stable. Monitoring alerts report the increased latency, and no requests fail.

Experiment Status: COMPLETED
Hypothesis Verified: True
Auth service latency increased, but 200 OK rate remained at 100%.
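A verdict like the one above can be derived from request samples collected before and during the experiment. The sketch below assumes each sample is a `(status_code, latency_ms)` pair. The function name `evaluate` and the sample values are made up for illustration and are not measurements from this lesson.

```python
from statistics import median


def evaluate(baseline, during):
    """Compare samples taken before and during the fault.

    Each sample is a (status_code, latency_ms) tuple.
    """
    if not baseline or not during:
        raise ValueError("need at least one sample in each window")
    ok = sum(1 for status, _ in during if status == 200)
    success_rate = ok / len(during)
    added_latency_ms = (
        median(latency for _, latency in during)
        - median(latency for _, latency in baseline)
    )
    return {
        "success_rate": success_rate,
        "added_latency_ms": added_latency_ms,
        # Verified only if every request succeeded AND latency actually rose;
        # a flat latency suggests the fault was never applied.
        "hypothesis_verified": success_rate == 1.0 and added_latency_ms > 0,
    }


# Illustrative samples only.
baseline = [(200, 40.0), (200, 55.0), (200, 45.0)]
during = [(200, 540.0), (200, 560.0), (200, 548.0)]
report = evaluate(baseline, during)
```

Checking that latency rose guards against a false pass: if the fault never took effect, a 100% success rate says nothing about resilience.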