Building SRE Culture - Site Reliability Engineering

Why Culture Matters in SRE

SRE success depends on culture as much as on tools:

Good Tools + Bad Culture = Failure
  - Team ignores the tools
  - Blame-oriented incident response
  - People don't trust automation
  - Engineers leave

Good Tools + Good Culture = Success
  - Team embraces tools and practices
  - Blameless, learning-focused
  - People trust systems
  - Engineers stay and innovate

Organizational Models for SRE

Model 1: Centralized SRE Team

Company Structure:
├── Product Teams (Own feature development)
├── SRE Team
│   ├── SRE Lead
│   ├── Senior SRE
│   └── SRE Engineers
└── Infrastructure Team (Own hardware/cloud)

Characteristics:
- Dedicated SRE team
- Supports all product teams
- Single point of authority on reliability
- Cost-effective when there is a single main service

When to use: Small companies, single large application

Model 2: Embedded SRE Teams

Company Structure:
├── Product Team A
│   ├── Feature engineers
│   └── Embedded SRE (1-2 people)
├── Product Team B
│   ├── Feature engineers
│   └── Embedded SRE (1-2 people)
└── Platform SRE Team (shared infrastructure)

Characteristics:
- SRE embedded in each product team
- A separate platform team covers shared services
- Closer collaboration
- Expensive but scalable

When to use: Larger companies with multiple distinct services

Model 3: DevOps-First with SRE on Demand

Company Structure:
├── Product Teams (Own feature + operations)
│   ├── Feature engineers
│   └── DevOps engineer
└── SRE Center of Excellence
    └── Available to consult on reliability

Characteristics:
- Product teams own everything
- SRE available as consultants
- Lower ops overhead
- Often used at startups/growth stage

When to use: Cloud-native organizations with mature deployment

Hiring SREs

What Makes a Good SRE?

Core Skills:
✅ Software engineering ability (must write code)
✅ Systems thinking (understand complex systems)
✅ Operations experience (know what fails)
✅ Problem-solving (debug under pressure)
✅ Communication (work across teams)

Experience Path (common):
1. Software engineer (2-3 years)
2. Operations/DevOps engineer (2-3 years)
3. SRE (long-term career)

OR

1. SRE at another company (if hiring externally)

Hiring Mistakes to Avoid

❌ "Pure operations person without coding skills"
   - Won't be able to write automation
   - SRE is an engineering discipline, not just operations

❌ "Theoretical computer scientist with no ops experience"
   - Won't understand operational constraints
   - Will be surprised by real-world failures

❌ "Hiring for seniority only"
   - Burns out your senior engineers
   - You also need junior SREs, with mentoring
   - Career progression matters

✅ "Mix of experience levels"
   - Ratio: 1 senior : 2-3 mid-level : 1-2 junior
   - Allows mentoring and knowledge sharing

Interview Process

Good SRE interviews assess:

1. Coding ability
   - Design a monitoring system
   - Write deployment automation
   - Optimize a database query

2. Operations knowledge
   - "The system is slow. Walk me through your investigation."
   - "We need 99.99% uptime. How would you get there?"
   - "A deployment failed. What do you do now?"

3. Communication
   - "Explain a complex system to a non-technical lead"
   - "You disagree with product on a deadline. How do you handle it?"

4. Problem-solving under pressure
   - "You're on-call and an alert fires. Describe your response."

5. Culture fit
   - "Tell me about an incident you learned from"
   - "How do you handle blame in your current org?"
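
For the "99.99% uptime" question above, it helps to know how much downtime a target actually leaves. The sketch below is illustrative arithmetic only; the function name and the 365-day year are assumptions, not something the guide prescribes.

```python
# Allowed downtime for an availability target (illustrative arithmetic only).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; ignores leap years


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime a target permits over the period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.9, 99.95, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} min/year")
```

A 99.99% target leaves roughly 52.6 minutes per year, which is why a strong answer usually covers redundancy, fast rollback, and automated recovery rather than manual response alone.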

Training and Development

On-Boarding Path

Week 1: Orientation

- Meet team
- Understand services
- Get laptop
- Set up development environment

Week 2-4: Knowledge

- Pair programming with experienced SRE
- Read documentation
- Attend design reviews
- Review recent incidents

Month 2-3: Hands-On

- Part of on-call (with mentor)
- Implement small reliability improvements
- Participate in incident response
- Present learnings

Month 4+: Autonomy

- Independent on-call
- Lead reliability projects
- Mentor new teammates
- Expand responsibilities

Continuous Learning

Reading List for SREs:
- "Site Reliability Engineering" (Google SRE Book)
- "The Phoenix Project" (DevOps culture)
- "Release It!" (production patterns)
- "Designing Data-Intensive Applications" (systems design)

Conferences:
- SREcon (regional and annual)
- DevOps Enterprise Summit
- Cloud Native Computing Foundation conferences

Certifications (optional):
- CKA (Certified Kubernetes Administrator)
- AWS Certified SysOps Administrator
- Various vendor-specific certs

Building Blameless Culture

Foundation: Psychological Safety

Psychological safety is a prerequisite for blameless culture.

Definition: People feel safe to:
- Take intelligent risks
- Admit mistakes
- Ask for help
- Disagree with authority

Without it:
- People hide problems
- Blame others to protect themselves
- Innovation stops
- People quietly resign

With it:
- Problems surface quickly
- Team learns together
- Innovation happens
- People stay

Creating Psychological Safety

Leadership behaviors that build safety:

✅ DO:
- Admit when you don't know something
- Ask "What happened?" not "Who caused this?"
- Thank people for surfacing bugs in low-pressure scenarios
- Celebrate learning from incidents
- Don't shoot the messenger

❌ DON'T:
- Use incidents as input to performance evaluations
- Blame individuals in meetings
- Punish people for honest mistakes
- Make heroes of people who "save the day"
- Hide failures from team

Explicit Blameless Commitments

# Team Agreement on Blameless Culture
 
Our commitment:
- No one will be punished for an incident they contributed to
- We focus on improving systems, not blaming individuals
- Mistakes are learning opportunities
- We discuss failures openly and honestly
 
Postmortem principles:
- We document what happened, not who caused it
- We ask "What can we improve?" not "Who messed up?"
- Action items focus on system improvements
- Participants feel safe sharing honestly
 
If we break these principles:
- Any team member can call it out immediately
- We discuss how to do better
- We repair relationships
- We continuously improve our culture

Scaling SRE Teams

5 Engineers → 10 Engineers

Additions needed:
- Hire 2-3 more SREs
- SRE lead now spends 50% on management
- Define clear ownership of services
- Start formal training program

10 Engineers → 20 Engineers

Additions needed:
- Split into sub-teams (by service or domain)
- Each sub-team has lead
- Create SRE manager role
- Define career progression
- Codify practices and standards

20+ Engineers

Additions needed:
- SRE director or VP
- Separate teams for different domains:
  - Production reliability
  - Infrastructure/platform
  - Observability/monitoring
  - Chaos/resilience
- Hiring manager for each team
- Cross-team coordination meetings

Compensation and Career Growth

Fair SRE Compensation

SRE careers should pay as well as software engineering careers:

Industry ranges (vary by location and company):
- Junior SRE: $100-150k + bonus
- Mid-level SRE: $150-220k + bonus  
- Senior SRE: $220-300k + bonus
- SRE Manager: $250-350k + bonus
- SRE Director: $300-400k + bonus

On-call compensation:
- Base: Included in salary
- Stipend: $500-2000/month while on-call
- Holiday/weekend on-call: 1.5x - 2x base rate
- After-hours incident: Comp time or OT pay
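
The on-call figures above can be combined in more than one way, and the guide does not define "base rate". As a minimal sketch, assume it means a daily on-call stipend rate, with holiday and weekend days paid at a multiple of it. The function name and the $50 daily rate are hypothetical.

```python
def on_call_pay(daily_rate: float, regular_days: int, premium_days: int,
                premium_multiplier: float = 1.5) -> float:
    """Pay for one rotation.

    Regular days are paid at daily_rate; premium (holiday/weekend) days are
    paid at daily_rate * premium_multiplier. All inputs are assumptions made
    for illustration, not a recommended policy.
    """
    if daily_rate < 0 or regular_days < 0 or premium_days < 0:
        raise ValueError("rate and day counts must be non-negative")
    if premium_multiplier < 1:
        raise ValueError("premium_multiplier should be at least 1")
    return daily_rate * (regular_days + premium_multiplier * premium_days)


if __name__ == "__main__":
    # One week on call: 5 weekdays plus a weekend, at a hypothetical $50/day.
    print(on_call_pay(50, regular_days=5, premium_days=2, premium_multiplier=1.5))
    print(on_call_pay(50, regular_days=5, premium_days=2, premium_multiplier=2.0))
```

Whatever formula a team picks, writing it down as something this explicit makes the policy easy to audit and explain.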

Career Progression

Paths for SRE growth:

Staff/Principal Track (Technical):
- Senior SRE → Staff SRE → Principal SRE
- Focus on architecture and technical leadership
- Authority without management responsibility

Management Track:
- Senior SRE → SRE Lead → SRE Manager → Director
- Responsible for team, hiring, development

Both tracks should be equally valued and compensated

Skills Development Ladder

Junior SRE:
- Writing automation scripts
- Responding to incidents (guided)
- Basic monitoring and alerting
- Learning SRE culture and practices

Mid-level SRE:
- Designing reliable systems
- Independent incident response
- Mentoring junior SREs
- Driving reliability improvements

Senior SRE:
- Architectural decisions
- Cross-team collaboration
- Mentoring and hiring
- Organizational reliability strategy

Measuring Team Health

Team Metrics Dashboard

SRE Team Health (Monthly)
 
On-call Metrics:
  Incidents per person: 2.4 avg
  Night disruptions: 1.2 per person
  MTTR (mean time to recovery): 15 min
  On-call satisfaction: 8/10 avg
 
Capability Metrics:
  Runbook coverage: 95% of services
  Monitoring coverage: 92% of services
  Automation coverage: 80% of services
  
Team Health:
  Turnover (annual): 5% (healthy)
  Training hours: 40 hours per person/year
  Promotion rate: 20% per year
  New hire success rate: 90%
  
Project Progress:
  Toil reduction: 10% this quarter
  New services taken on: 2
  Major outages: 0
  SLO achievement: 99.95% (target met)
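
The on-call numbers in a dashboard like this one can be derived from incident records. The following is a minimal sketch, assuming each incident stores a responder, a detection time, and a resolution time. The `Incident` type, the field names, and the 22:00-06:00 definition of "night" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    responder: str          # hypothetical field: who was paged
    detected_at: datetime   # when the alert fired
    resolved_at: datetime   # when service was restored


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to recovery, in minutes, across all incidents."""
    if not incidents:
        return 0.0
    total = sum((i.resolved_at - i.detected_at for i in incidents), timedelta())
    return total.total_seconds() / 60 / len(incidents)


def incidents_per_person(incidents: list[Incident], team_size: int) -> float:
    """Average number of incidents per team member."""
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    return len(incidents) / team_size


def night_disruptions(incidents: list[Incident],
                      start_hour: int = 22, end_hour: int = 6) -> int:
    """Count incidents detected during the assumed night window."""
    return sum(
        1 for i in incidents
        if i.detected_at.hour >= start_hour or i.detected_at.hour < end_hour
    )


if __name__ == "__main__":
    sample = [
        Incident("ana", datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 10)),
        Incident("ben", datetime(2024, 3, 4, 2, 30), datetime(2024, 3, 4, 2, 50)),
        Incident("ana", datetime(2024, 3, 9, 15, 0), datetime(2024, 3, 9, 15, 15)),
    ]
    print("MTTR (min):", mttr_minutes(sample))
    print("Incidents per person:", incidents_per_person(sample, team_size=5))
    print("Night disruptions:", night_disruptions(sample))
```

Computing these from raw records rather than typing them in by hand keeps the dashboard honest and makes month-over-month trends comparable.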

Employee Satisfaction Survey

Annual SRE Survey
 
Questions:
  "I feel safe admitting mistakes": 4.2/5 ✅
  "I have good work-life balance": 3.8/5 ⚠️
  "I'm growing technically": 4.5/5 ✅
  "I understand my career path": 3.5/5 ⚠️
  "I'd recommend this company": 4.3/5 ✅
  "On-call is sustainable": 3.2/5 ⚠️
  
Insights:
- Team feels psychologically safe (good)
- Work-life balance needs improvement (hiring?)
- On-call workload too high (need more people?)
- Career progression unclear (need mentoring?)
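
Turning survey scores into the insights above can be automated. The sketch below uses the scores from the sample survey and an assumed cutoff of 4.0, which happens to match the ✅/⚠️ markers shown but is not stated in the guide. The function name is hypothetical.

```python
ATTENTION_THRESHOLD = 4.0  # assumed cutoff: scores below this get a warning

survey_scores = {
    "I feel safe admitting mistakes": 4.2,
    "I have good work-life balance": 3.8,
    "I'm growing technically": 4.5,
    "I understand my career path": 3.5,
    "I'd recommend this company": 4.3,
    "On-call is sustainable": 3.2,
}


def flag_for_attention(scores: dict[str, float],
                       threshold: float = ATTENTION_THRESHOLD) -> list[tuple[str, float]]:
    """Return (question, score) pairs below the threshold, lowest score first."""
    low = [(question, score) for question, score in scores.items() if score < threshold]
    return sorted(low, key=lambda pair: pair[1])


if __name__ == "__main__":
    for question, score in flag_for_attention(survey_scores):
        print(f"{score:.1f}/5  {question}")
```

Sorting lowest first puts on-call sustainability at the top of the list, which is where follow-up effort is most needed in this sample.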

Building Community

Internal

Regular practices:
- Weekly SRE sync (15 min, quick updates)
- Monthly SRE brown bags (learning and presentations)
- Quarterly SRE retreat (planning and culture)
- Incident retrospectives (continuous learning)
- Mentoring pairs (senior + junior)

External

Representation:
- SRE members speak at conferences
- Write blog posts about practices
- Contribute to open source
- Organize local SRE meetup

Benefits:
- Recruits see your culture
- Learn from community
- Share knowledge
- Improve your brand

Common Team Culture Mistakes

❌ "Heroic on-call"
   Problem: The team celebrates people who work 24/7
   Fix: Celebrate good architecture instead

❌ "Blame individuals"
   Problem: Incidents are used to punish people
   Fix: Explicit blameless commitment

❌ "Ignore on-call load"
   Problem: On-call engineers burn out
   Fix: Track and respond to health metrics

❌ "No career growth"
   Problem: Senior engineers leave
   Fix: Clear paths forward

❌ "Only hire senior people"
   Problem: Expensive and no future pipeline
   Fix: Mix of experience levels

Aim for these instead:
✅ "Psychological safety"
✅ "Fair compensation"
✅ "Clear growth paths"
✅ "Reasonable on-call"
✅ "Continuous learning"

Key Takeaways

✓ Culture is as important as tools
✓ Psychological safety enables learning
✓ Mix of experience levels in team
✓ Clear career progression
✓ Fair compensation on par with software engineers
✓ Explicit blameless commitments
✓ Regular training and development
✓ Track team health metrics
✓ Scale teams thoughtfully
✓ Build community internally and externally