Lesson 12 of 15

Error Handling & Logging

Part of the Python for DevOps tutorial series.

In production DevOps, things will go wrong: network timeouts, permission errors, or missing files. Your scripts must handle these gracefully and log what happened for troubleshooting.

1. Exception Handling

Use try...except blocks to catch the errors you can anticipate, so your script can report the problem and recover or exit cleanly instead of dying with an unhandled traceback.

Basic Try-Except

Action:

try:
    # Attempting to read a missing file
    with open("config.yaml", "r") as f:
        config = f.read()
except FileNotFoundError:
    print("Error: The configuration file was not found.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Result:

Error: The configuration file was not found.
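
The same pattern can be wrapped in a small helper so callers get a usable value back instead of a crash. This is only a sketch: the function name read_config, the logger name, and the idea of falling back to a default are assumptions made for illustration, not part of the lesson's script.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("config_loader")  # hypothetical logger name


def read_config(path, default=""):
    """Return the file's text, or `default` if it is missing or unreadable."""
    try:
        text = Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        logger.warning("Config file %s not found; using default", path)
        return default
    except PermissionError:
        logger.error("No permission to read %s; using default", path)
        return default
    else:
        # The else branch runs only when the try block raised nothing.
        logger.info("Loaded %d characters from %s", len(text), path)
        return text


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    with tempfile.TemporaryDirectory() as tmp:
        existing = Path(tmp) / "config.yaml"
        existing.write_text("env: staging\n", encoding="utf-8")
        print(read_config(existing))
        print(read_config(Path(tmp) / "missing.yaml", default="env: dev\n"))
```

Each except clause names one failure you know how to handle. Anything else, such as a bug in your own code, still surfaces as a traceback, which is usually what you want.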

Catching Specific API Errors

Action:

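import requests

try:
    # A timeout stops the call from hanging forever on a dead connection
    response = requests.get("https://api.github.com/invalid-url", timeout=10)
    response.raise_for_status()  # Raises HTTPError for 4xx/5xx responses
except requests.exceptions.HTTPError as err:
    print(f"HTTP error occurred: {err}")
except requests.exceptions.RequestException as err:
    # Base class for connection errors, timeouts, and other requests failures
    print(f"Other error occurred: {err}")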

Result:

HTTP error occurred: 404 Client Error: Not Found for url: https://api.github.com/invalid-url

2. Logging

The logging module is Python's standard way to record events. Unlike print, each log message carries a severity level, and handlers can route messages to the console, files, or external systems.

Basic Configuration

Action:

import logging
 
# Configure logging to show time and severity
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
 
logging.info("Starting the deployment script...")
logging.warning("Disk space is low (15% remaining)")
logging.error("Failed to connect to the database.")

Result:

2026-04-10 14:00:00,123 - INFO - Starting the deployment script...
2026-04-10 14:00:00,125 - WARNING - Disk space is low (15% remaining)
2026-04-10 14:00:00,127 - ERROR - Failed to connect to the database.
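
To send logs to a file as well as the console, you can attach more than one handler to a logger. The sketch below is one possible setup, not the only one: the helper name build_logger, the logger name "deploy", and the file name deploy.log are all assumptions chosen for the example.

```python
import logging
import os
import tempfile


def build_logger(name, logfile):
    """Create a logger that writes INFO+ to the console and DEBUG+ to a file."""
    logger = logging.getLogger(name)
    if logger.handlers:
        return logger  # already configured; avoid adding duplicate handlers

    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # do not also emit through the root logger

    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")

    console = logging.StreamHandler()  # writes to stderr by default
    console.setLevel(logging.INFO)
    console.setFormatter(formatter)

    file_handler = logging.FileHandler(logfile, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(formatter)

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger


if __name__ == "__main__":
    logfile = os.path.join(tempfile.mkdtemp(), "deploy.log")
    log = build_logger("deploy", logfile)
    log.debug("Resolved target hosts")             # file only
    log.info("Starting the deployment script...")  # console and file
    print(f"Full log written to {logfile}")
```

Keeping the console at INFO and the file at DEBUG gives you a quiet terminal during a run and a detailed record to read afterwards.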

3. Advanced: Retry Logic

DevOps scripts often talk to services over unreliable networks. Retrying a failed call a few times, with a pause between attempts, lets your automation recover from transient failures on its own.

Action:

import time
import random

def unreliable_task():
    if random.random() < 0.7:  # 70% chance of failure
        raise TimeoutError("Temporary Network Timeout")
    return "Success!"

max_retries = 3
for attempt in range(max_retries):
    try:
        print(f"Attempt {attempt + 1}...")
        result = unreliable_task()
        print(result)
        break
    except TimeoutError as e:  # Retry only the transient error we expect
        print(f"Failed: {e}")
        if attempt < max_retries - 1:
            time.sleep(1)  # Wait before retrying
else:
    # Runs only if the loop never hit break, i.e. every attempt failed
    print(f"Giving up after {max_retries} attempts.")

Result (Example):

Attempt 1...
Failed: Temporary Network Timeout
Attempt 2...
Failed: Temporary Network Timeout
Attempt 3...
Success!
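
A common refinement is to wait longer after each failure (exponential backoff) and to move the loop into a reusable function. The following is a minimal sketch under a few assumptions: the function name retry, its parameters, and the choice of TimeoutError and ConnectionError as the retryable errors are illustrative, and the sleep argument exists only so the wait can be replaced in tests.

```python
import time


def retry(func, max_retries=3, base_delay=1.0,
          exceptions=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call func(); on a listed exception, wait and retry with doubling delays."""
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except exceptions as exc:
            if attempt == max_retries:
                raise  # out of attempts: let the caller see the real error
            delay = base_delay * 2 ** (attempt - 1)  # base, 2x base, 4x base, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay}s")
            sleep(delay)


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Hypothetical task that fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise TimeoutError("Temporary Network Timeout")
        return "Success!"

    print(retry(flaky, base_delay=0.1))
```

Exceptions that are not in the exceptions tuple are not retried, so a programming error such as a ValueError fails immediately instead of being repeated three times.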

Summary

  • Never use a bare except: block; catch specific exceptions.
  • Use logging instead of print for production scripts.
  • Use log levels: DEBUG (noisy), INFO (normal), WARNING (caution), ERROR (failure), CRITICAL (the script cannot continue).
  • Implement retries for network-dependent tasks.
  • Send errors to stderr and keep stdout for normal output; logging's default handler already writes to stderr.
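
If you want the logging module itself to split output between the two streams, one way is to use two handlers and a filter. Treat this as a sketch: MaxLevelFilter and build_split_logger are hypothetical helpers written for this example, and the out and err parameters exist so the streams can be swapped out.

```python
import logging
import sys


class MaxLevelFilter(logging.Filter):
    """Allow only records at or below a given level (hypothetical helper)."""

    def __init__(self, max_level):
        super().__init__()
        self.max_level = max_level

    def filter(self, record):
        return record.levelno <= self.max_level


def build_split_logger(name, out=None, err=None):
    """DEBUG and INFO go to `out` (stdout); WARNING and above go to `err` (stderr)."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # do not also emit through the root logger

    formatter = logging.Formatter("%(levelname)s - %(message)s")

    out_handler = logging.StreamHandler(out)
    out_handler.setLevel(logging.DEBUG)
    out_handler.addFilter(MaxLevelFilter(logging.INFO))
    out_handler.setFormatter(formatter)

    err_handler = logging.StreamHandler(err)
    err_handler.setLevel(logging.WARNING)
    err_handler.setFormatter(formatter)

    logger.addHandler(out_handler)
    logger.addHandler(err_handler)
    return logger


if __name__ == "__main__":
    log = build_split_logger("deploy_split")
    log.info("Starting the deployment script...")  # stdout
    try:
        int("not-a-number")
    except ValueError:
        # logging's exception() logs at ERROR level and appends the traceback
        log.exception("Could not parse replica count")  # stderr
```

With this split, redirecting stdout to a file captures only the normal progress messages, while warnings and errors still reach the terminal or your scheduler's error log.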