GuideDevOps
Lesson 28 of 28

Network Performance & Optimization

Part of the Networking Basics tutorial series.

Network performance determines user experience and application efficiency. Understanding performance characteristics and how to optimize them is critical for DevOps engineers.

Key Performance Metrics

Latency (Delay) Time for data to travel from source to destination:

Round Trip Time (RTT): 50ms
One Way Latency: 25ms

Measured by: ping, traceroute, synthetic monitoring

Bandwidth (Capacity) Maximum data rate a link can carry:

Gigabit Ethernet: 1 Gbps = 125 MB/s
10 Gbps = 1.25 GB/s

Lowest-capacity link in the path = bottleneck

Throughput (Actual Rate) Actual data rate achieved:

Theoretical max: 1 Gbps link
Actual throughput: 950 Mbps (95%)
Overhead: TCP headers, IP headers, retransmissions, etc.
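The gap between bandwidth and throughput can be estimated from header overhead alone. A quick sketch, assuming a 1500-byte MTU with 20-byte IP and 20-byte TCP headers and no retransmissions:

```shell
# Of each 1500-byte packet, only 1460 bytes are TCP payload, so goodput
# on a 1 Gbps link tops out around 973 Mbps before retransmissions and
# other overhead push it lower still
awk 'BEGIN { printf "max goodput: %.0f Mbps\n", 1000 * 1460 / 1500 }'
```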

Packet Loss Percentage of packets that don't arrive:

Sent: 1000 packets
Received: 998 packets
Lost: 2 packets = 0.2% loss

Causes: Congestion, line errors, buffer overflow

Jitter Variance in latency:

Normal: 50ms ± 5ms (jitter = 5ms)
Bad: 50ms ± 50ms (jitter = 50ms)

Affects: VoIP quality, streaming smoothness
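One simple way to estimate jitter is the mean absolute difference between consecutive RTT samples. A minimal sketch (the RTT values here are made up; in practice you would parse them from ping output):

```shell
# Feed RTT samples (ms) to awk and average the absolute deltas
# between consecutive samples
printf '25.3\n25.2\n27.1\n25.4\n' | awk '
  NR > 1 { d = $1 - prev; sum += (d < 0 ? -d : d); n++ }
  { prev = $1 }
  END { printf "jitter: %.2f ms\n", sum / n }'
```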

Latency Sources

Propagation Delay Speed of light through medium:

Speed: ~200,000 km/s in fiber (200,000,000 m/s)
Distance: 100 km = 100,000 m
Delay: 100,000 m ÷ 200,000,000 m/s = 0.5 ms

Minimum latency based on geography
Cannot improve below this

Processing Delay Time to examine and forward packets:

Router: Read header, lookup route: ~1ms
Switch: Learn MAC, forward: ~0.1ms
Firewall: Stateful inspection: ~5ms

Sum of all hops

Queuing Delay Wait in buffer if link busy:

Link utilization: 90%
Packets queued: High
Queuing delay: +20ms

Caused by congestion
Can spike dramatically under load

Serialization Delay Time to transmit packet bits:

1500-byte packet on 1 Gbps link:
1500 bytes = 12000 bits
12000 bits / 1000000000 bps = 12 microseconds

High bandwidth = lower serialization delay

Total Latency = Propagation + Processing + Queuing + Serialization
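Plugging the figures above into that formula for a hypothetical path: 100 km of fiber, one router hop, an idle link, and a 1500-byte packet at 1 Gbps:

```shell
# Sum the four delay components (all converted to milliseconds)
awk 'BEGIN {
  prop   = 100000 / 200000000 * 1000     # propagation: 0.5 ms over 100 km
  proc   = 1.0                           # processing: ~1 ms at one router
  queue  = 0.0                           # queuing: idle link, no wait
  serial = 1500 * 8 / 1000000000 * 1000  # serialization: 0.012 ms at 1 Gbps
  printf "total latency: %.3f ms\n", prop + proc + queue + serial
}'
```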

Bandwidth Utilization

Link Capacity vs Actual Use

┌─────────────────────────────────────┐
│ 1 Gbps Ethernet link available      │
├─────────────────────────────────────┤
│ Used:           600 Mbps (60%)      │
│ Available:      400 Mbps (40%)      │
├─────────────────────────────────────┤
│ Check: How much can we add before   │
│ congestion? Answer: ~400 Mbps more  │
└─────────────────────────────────────┘

Rule of thumb: Keep under 70% for headroom
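Utilization itself is just counter deltas over time. A sketch with hypothetical byte counters (on Linux you would sample them from /proc/net/dev one second apart):

```shell
# 75,000,000 bytes received in one second on a 1 Gbps link
awk -v rx1=1000000000 -v rx2=1075000000 -v cap=1000000000 \
  'BEGIN { printf "RX utilization: %.0f%%\n", (rx2 - rx1) * 8 / cap * 100 }'
```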

Measuring Network Performance

Ping Measure RTT:

ping 8.8.8.8
 
# Output:
# PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
# 64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=25.3 ms
# 64 bytes from 8.8.8.8: icmp_seq=2 ttl=119 time=25.2 ms
# 64 bytes from 8.8.8.8: icmp_seq=3 ttl=119 time=27.1 ms
# --- 8.8.8.8 statistics ---
# min/avg/max/stddev = 25.2/25.9/27.1/0.8 ms

Traceroute Show path and latency at each hop:

traceroute google.com
 
# Output shows:
# Hop 1: 192.168.1.1 1.2 ms
# Hop 2: 203.0.113.1 5.3 ms
# Hop 3: 203.0.113.100 15.2 ms
# Hop 4: 8.8.8.1 25.3 ms

MTR (My Traceroute) Combines ping and traceroute, continuous monitoring:

mtr -c 100 google.com
 
Shows:
- Packet loss % at each hop
- Latency statistics (min/avg/max)
- Continuously updated

iperf/iperf3 Measure TCP/UDP throughput:

# Server
iperf3 -s
 
# Client
iperf3 -c server.example.com -t 30
 
# Output:
# [ ID] Interval     Transfer    Bitrate
# [  5] 0.00-30.00   3.62 GBytes 1.04 Gbps

netperf Network performance benchmarking:

netperf -H server.example.com -t TCP_RR
 
# Request/Response latency test

Performance Optimization Techniques

1. Link Aggregation (Bonding)

Multiple links → Single logical link:

┌─ Link 1 (1 Gbps) ─┐
├─ Link 2 (1 Gbps) ─┤ → Bonded: 2 Gbps
├─ Link 3 (1 Gbps) ─┤
└─ Link 4 (1 Gbps) ─┘

Benefits:
- Higher throughput (sum of links)
- Failover if one link fails
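On Linux, a bond can be sketched with iproute2. The interface names and the 802.3ad (LACP) mode are assumptions; the switch side must be configured to match, and persistent setup belongs in your distro's network config:

```shell
# Create the bond device in LACP (802.3ad) mode -- requires root
sudo ip link add bond0 type bond mode 802.3ad

# Member links must be down before enslaving them
sudo ip link set eth1 down
sudo ip link set eth2 down
sudo ip link set eth1 master bond0
sudo ip link set eth2 master bond0

# Bring everything up and address the bond, not the members
sudo ip link set bond0 up
sudo ip addr add 192.0.2.10/24 dev bond0
```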

2. Compression

Reduce data volume:

Uncompressed HTTP: 1 MB
Compressed (gzip): 200 KB (80% reduction)

Benefits:
- Less bandwidth needed
- Faster transfer
- Less congestion

Tradeoff: CPU time for compression/decompression
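The tradeoff is easy to see locally. A sketch that gzips a repetitive text payload (highly repetitive text compresses far better than the 80% figure above, while binary media like video often barely compresses at all):

```shell
# Build a 10,000-line text file and compress a copy of it
yes 'GET /index.html HTTP/1.1' | head -n 10000 > payload.txt
gzip -kf payload.txt

# Compare sizes: the .gz file should be a small fraction of the original
wc -c payload.txt payload.txt.gz
```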

3. Protocol Optimization

TCP Tuning:

# Increase TCP window size (more in-flight data)
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
 
# Longer TCP backlog (more concurrent connections)
sysctl -w net.ipv4.tcp_max_syn_backlog=5120
 
# Allow reuse of TIME-WAIT sockets for new outbound connections
sysctl -w net.ipv4.tcp_tw_reuse=1

UDP for Real-Time:

TCP: Reliable but retransmits (adds latency)
UDP: Unreliable but fast (good for gaming, VoIP)

Choice depends on tolerance for loss vs latency

4. QoS (Quality of Service)

Prioritize traffic:

Traffic Classes:
├─ Voice: Highest priority (must be ≤150ms)
├─ Video: High priority (must be ≤300ms)
├─ Web: Medium priority
└─ Best Effort: Low priority

When congested:
- Drop Best Effort traffic first
- Keep Voice traffic flowing

Result: Critical apps get consistent experience
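On Linux, this kind of prioritization is implemented with tc. A minimal sketch using the prio qdisc (the interface name and band assignment are assumptions; real deployments usually use HTB or fq_codel with explicit bandwidth guarantees):

```shell
# Replace eth0's root qdisc with a 3-band priority scheduler -- requires root
sudo tc qdisc add dev eth0 root handle 1: prio bands 3

# Steer DSCP EF-marked (voice) traffic into the highest-priority band
# (EF = DSCP 46, which is 0xb8 in the TOS byte)
sudo tc filter add dev eth0 parent 1: protocol ip prio 1 \
  u32 match ip dsfield 0xb8 0xfc flowid 1:1

# Everything unmatched falls through to the lower bands
tc qdisc show dev eth0
```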

5. Caching and CDN

Direct (no cache):
Client → Origin Server (30ms latency)

With CDN/Cache:
Client → Edge Server (5ms latency, cached copy)

Benefits:
- Lower latency
- Less origin server load
- Reduced bandwidth

Examples: Cloudflare, Akamai, AWS CloudFront

6. Connection Pooling

Reuse connections:

Without pooling:
Request 1: TCP handshake (30ms) + request (10ms) = 40ms
Request 2: TCP handshake (30ms) + request (10ms) = 40ms
Total: 80ms

With pooling:
Request 1: TCP handshake (30ms) + request (10ms) = 40ms
Request 2: Reuse same connection (10ms) = 10ms
Total: 50ms

Biggest improvement for many small requests
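Generalizing the arithmetic above: for n requests over one pooled connection you pay the handshake once instead of n times, so the saving is (n − 1) handshakes. A sketch with the same 30 ms / 10 ms figures:

```shell
# 10 small requests: pooling turns 10 handshakes into 1
awk -v hs=30 -v req=10 -v n=10 'BEGIN {
  printf "without pooling: %d ms\n", n * (hs + req)
  printf "with pooling:    %d ms\n", hs + n * req
}'
```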

7. MTU (Maximum Transmission Unit) Tuning

Standard: 1500 bytes (Ethernet payload; the frame itself is larger)
Jumbo Frames: 9000 bytes

Larger MTU:
├─ Fewer packets for same data
├─ Lower per-packet overhead
├─ Improves throughput
└─ Lower CPU usage

Tradeoff: Not supported everywhere
Requirement: Every device in the path must support the larger MTU
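The per-packet saving compounds over a transfer. A sketch counting the packets needed to move 1 MB of payload, assuming 40 bytes of TCP/IP headers per packet (so the payload per packet is MTU − 40):

```shell
# Packets = ceil(payload / MSS); MSS is 1460 at MTU 1500, 8960 at MTU 9000
awk 'BEGIN {
  data = 1000000
  printf "MTU 1500: %d packets\n", int((data + 1459) / 1460)
  printf "MTU 9000: %d packets\n", int((data + 8959) / 8960)
}'
```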

Set MTU:

# View current
ip link show eth0
 
# Change (temporary)
sudo ip link set eth0 mtu 9000
 
# Persistent (varies by distro)
# In netplan or network config

Network Bottleneck Identification

Step 1: Measure

ping server.example.com → RTT = 100ms (high?)
iperf3 -c server.example.com → 100 Mbps (low?)
traceroute → Where is latency? Which hop?

Step 2: Analyze

High latency:
✓ Is it propagation? (geography, can't improve)
✓ Is it processing? (router CPU high?)
✓ Is it queuing? (link utilization high?)
✓ Is it congestion? (packet loss detected?)

Step 3: Locate Bottleneck

Throughput test shows 100 Mbps on 1 Gbps link:

Run: ip -s link show eth0 (interface counters)
Look for:
- High TX/RX errors? Driver or cabling issue
- High collisions? Half-duplex link
- Interface down? Connection problem
- Duplex mismatch? Check negotiated speed/duplex with ethtool eth0

Run: netstat -i (per-interface packet counts)
Look for: errors (RX-ERR/TX-ERR), drops (RX-DRP/TX-DRP)

Step 4: Fix

Common fixes:
├─ Clear congestion (add capacity, reroute traffic)
├─ Fix duplex mismatch (force full-duplex)
├─ Update drivers (newer = often better)
├─ Physically move server (reduce latency)
├─ Add link aggregation (increase capacity)
├─ Optimize routes (fewer hops)
└─ Enable QoS (prioritize critical traffic)

Performance by Application Type

OLTP (OnLine Transaction Processing)

  • Sensitive to: Latency
  • Goal: ≤50ms response
  • Focus: Minimize RTT
  • Example: Online banking

Batch Processing

  • Sensitive to: Throughput
  • Goal: Complete in time window
  • Focus: Maximize total data moved
  • Example: Nightly reports

Streaming

  • Sensitive to: Jitter, latency
  • Goal: Consistent bitrate, ≤300ms latency
  • Focus: QoS, bandwidth reservation
  • Example: Video services

VoIP

  • Sensitive to: Latency, packet loss, jitter
  • Goal: ≤150ms, ≤1% loss, ≤10ms jitter
  • Focus: QoS, dedicated bandwidth
  • Example: Video conferencing

Performance Monitoring

Continuous Monitoring:

# Watch link utilization
watch -n 1 'ifstat -i eth0'
 
# Monitor connection count
watch 'netstat -tan | grep ESTABLISHED | wc -l'
 
# Track packet loss
ping -c 100 server.example.com | grep loss
 
# Bandwidth monitoring
nethogs (shows per-process bandwidth)
iftop (shows per-connection bandwidth)
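The loss figure from that ping pipeline can be pulled out with grep. A sketch against a canned summary line, mirroring Linux ping's output format:

```shell
# Extract just the percentage from ping's statistics line
echo '1000 packets transmitted, 998 received, 0.2% packet loss, time 9991ms' \
  | grep -o '[0-9.]*% packet loss'
```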

Best Practices

✓ Establish performance baseline first
✓ Monitor continuously (don't wait for problems)
✓ Test with realistic load (lab vs production differ)
✓ Consider latency AND throughput (not just speed)
✓ Document what "good performance" means
✓ Test failover scenarios
✓ Use layered caching (multiple levels)
✓ Compress what makes sense (text yes, video no)
✓ Set realistic QoS policies
✓ Account for full round-trip (app → network → app)

Key Concepts

  • Latency = Delay (milliseconds)
  • Bandwidth = Capacity (megabits/second)
  • Throughput = Actual rate (megabits/second, usually less than bandwidth)
  • Packet loss = % of packets not arriving
  • Jitter = Variance in latency
  • Queuing delay = Wait due to congestion
  • Propagation delay = Speed of light through medium (can't improve)
  • Bottleneck = slowest link in path
  • QoS = Prioritize traffic by importance
  • Trade-off = low latency (UDP) vs reliability (TCP)