Lesson 9 of 14

Gremlin

Part of the Chaos Engineering tutorial series.

What is Gremlin?

Gremlin is a commercial chaos engineering platform that provides infrastructure, platform, and application-level chaos experiments. Unlike Litmus (Kubernetes-focused), Gremlin works across:

  • Cloud infrastructure (AWS, Azure, GCP)
  • Kubernetes clusters
  • VMs and bare metal
  • Application code
  • Network infrastructure

Why Choose Gremlin?

Advantages:

  • Multi-platform support: Single pane for VM, container, and cloud failures
  • Enterprise features: API-first design, SSO, audit logging
  • No application changes for infrastructure attacks: The agent uses system tools (iptables, tc, etc.) under the hood
  • User-friendly UI: Dashboard for running, monitoring, and analyzing experiments

Considerations:

  • Commercial product (free tier available)
  • Requires agent installation on all target systems

Gremlin Architecture

Core Components

┌─────────────────────────────────────────┐
│      Gremlin SaaS Control Plane         │
│  - Experiment scheduling and reporting  │
│  - Results aggregation                  │
│  - Team management and RBAC             │
└────────────────┬────────────────────────┘
                 │ API
      ┌──────────┼──────────┐
      │          │          │
      ▼          ▼          ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Gremlin  │ │ Gremlin  │ │ Gremlin  │
│ Agent    │ │ Agent    │ │ Agent    │
│ (Linux)  │ │ (K8s)    │ │ (Windows)│
└──────────┘ └──────────┘ └──────────┘
     │            │            │
     └────────────┴────────────┘
        Injects failure on systems

Installing Gremlin Agent

Linux Installation

# 1. Download and install (RPM-based distributions)
curl -O https://downloads.gremlin.com/gremlin/downloads/client/latest/linux/gremlin-latest.linux_amd64.rpm
sudo rpm -i gremlin-latest.linux_amd64.rpm
 
# Or for Debian/Ubuntu (requires Gremlin's apt repository to be configured first):
sudo apt install gremlin
 
# 2. Authenticate
# Option A: Team ID + Private Key
sudo gremlin config set -c <TEAM_ID> -p <PRIVATE_KEY>
 
# Option B: OAuth token
sudo gremlin config set -c <TEAM_ID> -a <AUTH_TOKEN>
 
# 3. Start Gremlin
sudo systemctl enable gremlin
sudo systemctl start gremlin
 
# 4. Verify
gremlin check

Kubernetes Installation

# Add Gremlin Helm repository
helm repo add gremlin https://helm.gremlin.com
helm repo update
 
# Install Gremlin agent
helm install gremlin gremlin/gremlin \
  --namespace gremlin \
  --create-namespace \
  --set gremlin.teamID=<TEAM_ID> \
  --set gremlin.privKey=<PRIVATE_KEY>
 
# Verify
kubectl get pods -n gremlin

Gremlin Experiments by Layer

1. Infrastructure Attacks

Target the physical/virtual infrastructure layer.

CPU Attack

# Consume CPU cores
gremlin attack cpu \
  --cores 4 \
  --length 300

What happens:

  • Specified number of CPU cores maxed out
  • Application performance degrades
  • Tests if system can handle resource constraints
  • Tests autoscaling triggers

Memory Attack

# Consume RAM
gremlin attack memory \
  --megabytes 4096 \
  --length 300 \
  --percent-consumed 80  # Stop when 80% of memory is consumed

What happens:

  • Memory pressure increases
  • Swap usage increases
  • Application may be OOM-killed
  • Tests memory limits and caching strategies

Disk Attack

# Fill disk space
gremlin attack disk \
  --size 50GB \
  --path /tmp \
  --length 300

What happens:

  • Target path fills with temporary files
  • Applications fail to write logs (critical!)
  • Tests error handling for disk-full scenarios
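
Because a full disk can break logging and other services on the same filesystem, it can help to check headroom before filling a path. The following is a minimal sketch of such a pre-flight check, not a Gremlin feature: `free_kb` and `fill_is_safe` are hypothetical helper names, and the 1 GB reserve is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight guard (not part of Gremlin): check that a planned
# disk fill still leaves a reserve of free space on the target filesystem.

# free_kb PATH: prints the available kilobytes on the filesystem holding PATH.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# fill_is_safe FILL_KB AVAILABLE_KB RESERVE_KB: succeeds if at least
# RESERVE_KB would remain free after writing FILL_KB.
fill_is_safe() {
  local fill_kb=$1 available_kb=$2 reserve_kb=$3
  [ $(( available_kb - fill_kb )) -ge "$reserve_kb" ]
}

avail=$(free_kb /tmp)
# 50 GB planned fill, 1 GB reserve (both expressed in KB).
if fill_is_safe $(( 50 * 1024 * 1024 )) "$avail" $(( 1024 * 1024 )); then
  echo "ok: a 50 GB fill leaves at least 1 GB free on /tmp"
else
  echo "abort: a 50 GB fill would exhaust /tmp (${avail} KB available)"
fi
```

If the goal of the experiment is to hit a true disk-full condition, the reserve can be set to 0 so the check only confirms that the requested size fits the volume.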

Process Kill

# Kill a specific process
gremlin attack process-kill \
  --process-name nginx \
  --interval 30  # Kill every 30 seconds

What happens:

  • Process is terminated
  • If managed by a supervisor (for example systemd), the process respawns
  • Tests process restart mechanisms

2. Network Attacks

Target network communication and latency.

Latency Attack

# Add network latency
gremlin attack latency \
  --latency 1000 \
  --target-host 10.0.1.50 \
  --length 300

What happens:

  • All packets to 10.0.1.50 are delayed by 1000 ms (1 second)
  • Requests see significant latency increase
  • Tests timeout configurations
  • Tests circuit breaker behavior
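
To see whether a client-side timeout actually fires while latency is injected, a simple probe run from the affected host can help. This is a rough sketch using curl: `probe` is a hypothetical helper, and the `/health` path in the usage comment is an assumed endpoint, not something Gremlin provides.

```shell
#!/usr/bin/env bash
# Hypothetical probe: time one request and enforce a client-side timeout.

# probe URL TIMEOUT_SECONDS: prints "ok <seconds>" when the request completes
# within the timeout, or "failed" on a timeout or any other curl error.
probe() {
  local url=$1 timeout=$2 elapsed
  if elapsed=$(curl -s -o /dev/null -w '%{time_total}' --max-time "$timeout" "$url"); then
    echo "ok $elapsed"
  else
    echo "failed"
  fi
}

# Example (run while the latency attack is active; the path is an assumption):
#   probe http://10.0.1.50/health 0.5
```

With 1000 ms of injected latency, a 0.5 second budget should report `failed`, while a budget well above one second should report `ok` with an elapsed time above one second.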

Packet Loss Attack

# Lose network packets
gremlin attack packet-loss \
  --percentage 50 \
  --target-host database.example.com \
  --length 300

What happens:

  • 50% of packets to target are dropped
  • TCP retransmissions kick in
  • Significant latency and potential timeouts
  • Tests resilience to poor network conditions

Blackhole Attack

# Drop all packets to/from a target
gremlin attack blackhole \
  --target-host 10.0.2.0/24 \
  --length 300

What happens:

  • Complete network isolation
  • Similar to an availability zone partition
  • Tests failover to backup systems

DNS Attack

# Corrupt DNS responses
gremlin attack dns \
  --target-host api.example.com \
  --corrupt-response true \
  --length 300

What happens:

  • DNS queries return corrupted data
  • Services fail to resolve names
  • Tests DNS failover and retry logic

3. Application Attacks

Target application-level behavior (requires code instrumentation).

Exception Throwing

# Gremlin for Java with exception injection
gremlin attack exception \
  --service payment-service \
  --exception NullPointerException \
  --percent-affected 10  # Affect 10% of requests

Latency Injection

# Add latency to specific methods
gremlin attack latency \
  --service order-service \
  --method calculateTotal \
  --latency 2000 \
  --percent-affected 20

Running Experiments Through the UI

Step 1: Log Into Gremlin

Access the Gremlin web dashboard at https://app.gremlin.com

Step 2: Create an Experiment

  1. Click Scenarios → Create Scenario
  2. Choose Infrastructure Attack or Application Attack
  3. Select experiment type (CPU, Memory, Latency, etc.)
  4. Configure parameters
  5. Select target hosts (by tag, region, or host name)

Step 3: Set Blast Radius

Target Selection Options:
- By tag:     app=payment-service
- By region:  us-east-1
- By type:    Database servers
- Percentage: Randomly select 20% of matching hosts
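
The percentage option can be pictured as random sampling over the hosts that match a filter. The sketch below only illustrates that idea and is not how Gremlin implements selection: `pick_targets` is a hypothetical helper, the rounding-up rule is an assumption, and `shuf` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of percentage-based target selection.

# pick_targets PERCENT: reads host names on stdin (one per line) and prints
# a random PERCENT of them, rounded up, so a non-empty input yields >= 1 host.
pick_targets() {
  local percent=$1 hosts count n
  hosts=$(cat)
  [ -n "$hosts" ] || return 0
  count=$(printf '%s\n' "$hosts" | wc -l)
  n=$(( (count * percent + 99) / 100 ))
  [ "$n" -ge 1 ] || n=1
  printf '%s\n' "$hosts" | shuf -n "$n"
}

# Ten matching hosts at 20% selects two of them at random.
printf 'payment-%d\n' 1 2 3 4 5 6 7 8 9 10 | pick_targets 20
```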

Step 4: Monitor Execution

  • Watch real-time metrics during the experiment
  • See which hosts are affected
  • Monitor application-level impact (through integrations with Datadog, New Relic, etc.)

Step 5: View Results

Gremlin provides:

  • Timeline of events
  • Affected hosts and their metrics
  • Application impact summary
  • Pass/fail verdict based on monitoring
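
A pass/fail verdict is easiest to reason about when the hypothesis is written as a threshold before the experiment starts. The sketch below shows one way to express that rule; it is not Gremlin's own verdict logic, and `verdict`, the error counts, and the 1% threshold are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical verdict rule: the experiment passes if the error rate
# observed during the run stays at or below an agreed threshold.

# verdict ERRORS TOTAL MAX_ERROR_PERCENT: prints PASS or FAIL.
verdict() {
  local errors=$1 total=$2 max_pct=$3
  # No traffic observed means the hypothesis was not exercised: treat as FAIL.
  if [ "$total" -eq 0 ]; then
    echo "FAIL"
    return
  fi
  # errors/total <= max_pct/100, rearranged to stay in integer arithmetic.
  if [ $(( errors * 100 )) -le $(( max_pct * total )) ]; then
    echo "PASS"
  else
    echo "FAIL"
  fi
}

verdict 5 1000 1    # 0.5% errors against a 1% budget
verdict 20 1000 1   # 2% errors against a 1% budget
```

The request and error counts would come from whatever monitoring tool is integrated, for example a query against the metrics recorded during the experiment window.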

Integration with Monitoring Tools

Datadog Integration

# Configure Gremlin to report to Datadog
gremlin config set --datadog-api-key <YOUR_API_KEY> \
                   --datadog-app-key <YOUR_APP_KEY>

Datadog will then:

  • Show Gremlin events on metric graphs
  • Correlate with application errors
  • Show experiment timeline

Prometheus Integration

# Gremlin exposes metrics via Prometheus
scrape_configs:
  - job_name: 'gremlin'
    static_configs:
      - targets: ['localhost:9000']

API Usage

List Available Agents

curl -H "Authorization: Bearer <API_KEY>" \
  https://api.gremlin.com/v1/agents

Trigger Experiment via API

curl -X POST https://api.gremlin.com/v1/scenarios \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "CPU Attack",
    "description": "Test high CPU",
    "definition": {
      "attacks": [{
        "type": "cpu",
        "parameters": {
          "cores": 4,
          "length": 300
        }
      }],
      "targets": {
        "filters": {
          "names": ["prod-server-1"]
        }
      }
    }
  }'
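
For repeated use, the request above can be wrapped in a small script that takes the host, core count, and duration as parameters and defaults to a dry run. This is a sketch that reuses the payload shape shown above: `build_payload`, `DRY_RUN`, and `GREMLIN_API_KEY` are hypothetical names introduced for the example.

```shell
#!/usr/bin/env bash
# Sketch: parameterize the scenario request shown above.
# DRY_RUN and GREMLIN_API_KEY are hypothetical variable names.

# build_payload HOST CORES LENGTH_SECONDS: prints the JSON request body.
# Assumes HOST contains no characters that need JSON escaping.
build_payload() {
  local host=$1 cores=$2 length=$3
  cat <<EOF
{
  "name": "CPU Attack",
  "description": "Test high CPU",
  "definition": {
    "attacks": [{
      "type": "cpu",
      "parameters": { "cores": ${cores}, "length": ${length} }
    }],
    "targets": { "filters": { "names": ["${host}"] } }
  }
}
EOF
}

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Default: only print what would be sent.
  build_payload prod-server-1 4 300
else
  build_payload prod-server-1 4 300 |
    curl --fail -sS -X POST https://api.gremlin.com/v1/scenarios \
      -H "Authorization: Bearer ${GREMLIN_API_KEY:?set GREMLIN_API_KEY}" \
      -H "Content-Type: application/json" \
      -d @-
fi
```

Running with `DRY_RUN=0` sends the request. Keeping the dry run as the default makes it harder to trigger an experiment by accident.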

Best Practices

  1. Start with Staging: Never experiment on production first
  2. Use Tags: Organize infrastructure with meaningful tags
  3. Document Hypotheses: Record what you expect before running experiments
  4. Gradual Rollout: Start with 1-2 hosts, then expand to 10%, then 25%
  5. Team Notifications: Alert team before running experiments
  6. Automated Testing: Integrate Gremlin into CD pipelines
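
The gradual rollout in item 4 can be turned into concrete host counts before an experiment is scheduled. The sketch below is one possible reading of that guidance: `stage_sizes` is a hypothetical helper, and rounding up and never shrinking between stages are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical rollout planner: pilot on up to 2 hosts, then 10%, then 25%.

# stage_sizes TOTAL_HOSTS: prints the number of hosts to target per stage.
stage_sizes() {
  local total=$1 pct n prev
  prev=$(( total < 2 ? total : 2 ))
  echo "pilot: ${prev} hosts"
  for pct in 10 25; do
    n=$(( (total * pct + 99) / 100 ))   # round up
    [ "$n" -ge "$prev" ] || n=$prev     # a later stage never targets fewer hosts
    echo "${pct}%: ${n} hosts"
    prev=$n
  done
}

stage_sizes 40
# pilot: 2 hosts
# 10%: 4 hosts
# 25%: 10 hosts
```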

Gremlin vs Litmus vs Chaos Monkey

Feature              Gremlin             Litmus            Chaos Monkey
Platform             Multi-cloud/VM/K8s  Kubernetes        AWS only
Agent                Required            Optional          Built-in
Cost                 Commercial          Open-source       Open-source
Ease of Use          Dashboard-first     CRD-based         Imperative
API                  Yes                 Yes               Limited
Enterprise Features  RBAC, SSO, Audit    Community-driven  Basic

Key Takeaways

  1. Gremlin covers all layers: Infrastructure, network, and application
  2. Enterprise-ready: Designed for large organizations
  3. Easy to use: Dashboard and API for different preferences
  4. Multi-platform: Works wherever your infrastructure is
  5. Integrates with monitoring: Fire experiments and correlate with metrics