SREDoctor - Site Reliability Engineering Knowledge Hub

Core SRE Principles

Foundational concepts from Google's Site Reliability Engineering practice

Embrace Risk

100% reliability is the wrong target. Users can't tell the difference between 99.99% and 99.999% availability, but the engineering cost differs by orders of magnitude. Accept appropriate levels of risk based on user expectations.

Service Level Objectives

Define clear SLOs that represent what users actually care about. Use SLIs to measure, SLOs to set targets, and SLAs for business commitments. Let error budgets drive decision-making.

Eliminate Toil

Toil is work that is manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows. SRE teams should spend no more than 50% of their time on toil and the rest on engineering.
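One way to check a team against the 50% ceiling is to tag logged work as toil or engineering and compute the ratio. The sketch below assumes a hypothetical WorkEntry record and invented sample hours; neither comes from the SRE book.

```python
from dataclasses import dataclass

TOIL_CAP = 0.50  # the 50% ceiling described above


@dataclass
class WorkEntry:
    """Hypothetical time-log record; the fields are illustrative assumptions."""
    description: str
    hours: float
    is_toil: bool


def toil_fraction(entries: list[WorkEntry]) -> float:
    """Return the share of logged hours that were spent on toil."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    return sum(e.hours for e in entries if e.is_toil) / total


# Invented sample week for one engineer.
week = [
    WorkEntry("manual certificate rotation", 6, True),
    WorkEntry("ticket-driven quota bumps", 10, True),
    WorkEntry("build rotation automation", 16, False),
    WorkEntry("design review", 8, False),
]

fraction = toil_fraction(week)
print(f"toil: {fraction:.0%}, over cap: {fraction > TOIL_CAP}")
```

Here 16 of 40 logged hours are toil, so the sample week stays under the cap.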

Monitoring & Alerting

Monitor for symptoms, not causes. Alert on user-facing issues. Every page should be actionable. If a human doesn't need to take action, it shouldn't be an alert.
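The routing rule above can be sketched as a small decision function. The Signal fields and the three-way split are assumptions made for illustration; in particular, sending actionable but non-user-facing signals to a ticket queue is one possible policy, not something the text prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"      # wake a human now
    TICKET = "ticket"  # handle during working hours (assumed middle tier)
    LOG = "log"        # record only, no alert


@dataclass
class Signal:
    """Hypothetical monitoring signal; field names are illustrative."""
    name: str
    user_facing: bool         # is this a symptom users can observe?
    needs_human_action: bool  # can and must a person do something about it?


def route(signal: Signal) -> Route:
    """Page only for actionable, user-facing symptoms."""
    if not signal.needs_human_action:
        return Route.LOG  # no action needed, so it should not be an alert
    if signal.user_facing:
        return Route.PAGE
    return Route.TICKET


print(route(Signal("checkout error rate high", True, True)))   # Route.PAGE
print(route(Signal("one replica restarted", False, False)))    # Route.LOG
print(route(Signal("disk 80% full on batch host", False, True)))  # Route.TICKET
```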

Automation

Automate yourself out of a job—then find a harder job. Automation provides consistency, a platform for extending, and a mechanism for sharing knowledge.

Release Engineering

Releases should be boring. Build automated, repeatable, reliable release processes. Use canary deployments, feature flags, and progressive rollouts to reduce risk.
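A progressive rollout can be sketched as a loop over traffic stages that rolls back on the first failed health check. The stage percentages and the two callbacks (set_traffic_percent, is_healthy) are hypothetical stand-ins for whatever a real deployment system provides.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[int],
    set_traffic_percent: Callable[[int], None],
    is_healthy: Callable[[], bool],
) -> bool:
    """Shift traffic stage by stage; roll back to 0% on the first unhealthy check.

    Returns True if every stage passed, False if the release was rolled back.
    """
    for percent in stages:
        set_traffic_percent(percent)
        if not is_healthy():
            set_traffic_percent(0)  # rollback is the first response
            return False
    return True


# In-memory stand-ins: record each traffic setting, and pretend the new
# version becomes unhealthy once it serves 50% of traffic or more.
history: list[int] = []
ok = progressive_rollout([1, 10, 50, 100], history.append, lambda: history[-1] < 50)
print(ok, history)  # False [1, 10, 50, 0]
```

The first stage acts as the canary: a small slice of traffic exposes problems before most users see them.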

Simplicity

Every line of code is a liability. Simplicity is a prerequisite for reliability. Remove unnecessary complexity. A system with fewer moving parts has fewer opportunities to fail.

Blameless Postmortems

When incidents occur, focus on what happened, not who did it. Create a culture where people feel safe to report issues. Learn from failures without fear of punishment.

The Error Budget Model

Error budgets align development and SRE teams on a shared objective: reliable innovation

An error budget is derived from the SLO. If your SLO is 99.9% availability, your error budget is 0.1% (about 43 minutes per month). This budget can be "spent" on:

Planned downtime for maintenance
Risky feature launches
Experiments and testing
Inevitable system failures

When the budget is exhausted: Focus shifts entirely to reliability. No new features until the service is back within SLO.
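The freeze rule above can be sketched as a simple gate on the remaining budget. The function names and the sample downtime figures are illustrative assumptions; real policies often add thresholds and exceptions that are not modeled here.

```python
def budget_remaining_minutes(slo: float, period_minutes: float, downtime_minutes: float) -> float:
    """Error budget left in the period, in minutes (negative when overspent)."""
    return (1.0 - slo) * period_minutes - downtime_minutes


def releases_allowed(slo: float, period_minutes: float, downtime_minutes: float) -> bool:
    """Feature launches proceed only while some error budget remains."""
    return budget_remaining_minutes(slo, period_minutes, downtime_minutes) > 0


PERIOD = 30 * 24 * 60  # a 30-day month in minutes

print(releases_allowed(0.999, PERIOD, downtime_minutes=20.0))  # True
print(releases_allowed(0.999, PERIOD, downtime_minutes=50.0))  # False
```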

Error Budget = 1 - SLO

Example: 99.9% SLO

Error Budget = 1 - 0.999 = 0.001 (0.1%)

Monthly budget:
30 days × 24 hours × 60 min × 0.001 = 43.2 minutes

Quarterly budget:
90 days × 24 hours × 60 min × 0.001 = 129.6 minutes
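The same arithmetic can be written as a small helper, which makes it easy to compare targets. This is a minimal sketch; the function name is an assumption, and the 30-day and 90-day windows follow the example above.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, days: int) -> float:
    """Error budget = (1 - SLO) applied to the length of the window, in minutes."""
    return (1.0 - slo) * days * MINUTES_PER_DAY


for slo in (0.999, 0.9999, 0.99999):
    monthly = error_budget_minutes(slo, 30)
    quarterly = error_budget_minutes(slo, 90)
    print(f"SLO {slo:.3%}: {monthly:.2f} min/month, {quarterly:.2f} min/quarter")
```

The first row reproduces the 43.2 and 129.6 minute figures; each additional nine shrinks the budget tenfold.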

SLIs, SLOs, and SLAs

The hierarchy of service level terminology

SLI

Service Level Indicator

A carefully defined quantitative measure of some aspect of the level of service that is provided.

request_latency_ms < 200
successful_requests / total_requests
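Both example SLIs can be computed from raw request records as a ratio of good events to total events. The Request type and the sample data below are hypothetical; the 200 ms threshold is taken from the example above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative."""
    latency_ms: float
    success: bool


def availability_sli(requests: list[Request]) -> float:
    """successful_requests / total_requests"""
    if not requests:
        raise ValueError("no requests in the measurement window")
    return sum(r.success for r in requests) / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 200.0) -> float:
    """Fraction of requests served faster than threshold_ms."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


# Invented sample window.
window = [
    Request(120, True),
    Request(180, True),
    Request(250, True),
    Request(300, False),
    Request(150, True),
]

print(availability_sli(window))  # 0.8
print(latency_sli(window))       # 0.6
```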

SLO

Service Level Objective

A target value or range for a service level measured by an SLI. Choosing the right SLO is complex—too high wastes resources, too low frustrates users.

99.9% of requests < 200ms
99.95% availability per month
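An SLO can be modeled as a named target that a measured SLI is compared against. In this sketch the SLO class and the measured values are assumptions for illustration; the targets mirror the two examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A target for an SLI expressed as a required fraction of good events."""
    name: str
    target: float  # e.g. 0.999 for 99.9%

    def is_met(self, sli: float) -> bool:
        """True when the measured SLI is at or above the target."""
        return sli >= self.target


latency_slo = SLO("requests under 200 ms", 0.999)
availability_slo = SLO("monthly availability", 0.9995)

# Invented measurements for one month.
measured = {latency_slo: 0.9993, availability_slo: 0.9991}

for slo, sli in measured.items():
    status = "met" if slo.is_met(sli) else "MISSED"
    print(f"{slo.name}: target {slo.target:.2%}, measured {sli:.2%} -> {status}")
```

With these sample numbers the latency objective is met while the availability objective is missed.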

SLA

Service Level Agreement

An explicit or implicit contract with users that includes consequences of meeting (or missing) the SLOs contained within.

If availability < 99.9%:
Customer receives 10% credit
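The example SLA clause above translates directly into code. The function names and the monthly fee are hypothetical; real agreements usually define tiers and measurement rules that this sketch leaves out.

```python
def sla_credit_percent(availability: float, threshold: float = 0.999, credit: float = 10.0) -> float:
    """Return the credit percentage owed: 10% when availability falls below 99.9%."""
    return credit if availability < threshold else 0.0


def credit_amount(monthly_fee: float, availability: float) -> float:
    """Credit in currency units for a given monthly fee (fee is an assumed input)."""
    return monthly_fee * sla_credit_percent(availability) / 100.0


print(credit_amount(2000.0, availability=0.9985))  # 200.0
print(credit_amount(2000.0, availability=0.9995))  # 0.0
```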

Key SRE Practices

Practical approaches for building reliable systems

On-Call

Engineers carry the pager and respond to incidents. Each engineer should spend at most 25% of their time on-call. Each incident should generate tickets for follow-up work. Rotations should be sustainable and well-compensated.

Incident Management

Assign clear roles during incidents: Incident Commander, Communications Lead, Operations Lead. Use shared documents for real-time collaboration. Focus on mitigation first, root cause later.

Postmortem Culture

Write blameless postmortems for significant incidents. Include timeline, root cause, impact, lessons learned, and action items. Share widely to spread knowledge.

Capacity Planning

Plan for organic growth, inorganic growth (launches), and failure domains. Maintain N+2 redundancy where practical. Regular load testing validates capacity models.

Change Management

Roughly 70% of outages are caused by changes to a live system. Use progressive rollouts, canary deployments, and feature flags. Make rollback the first response, not the last resort.

Chaos Engineering

Proactively inject failures to discover weaknesses. Start small, in non-production. Build confidence through controlled experiments. Make failure a regular, expected event.
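The N+2 guideline under Capacity Planning can be sketched as follows: size the fleet for peak load (N), then add two spare instances so that one planned and one unplanned outage can overlap. The load and per-instance capacity numbers are invented for illustration.

```python
import math


def instances_required(peak_load: float, per_instance_capacity: float, spare: int = 2) -> int:
    """Return N + spare, where N instances are just enough to serve peak_load."""
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    n = math.ceil(peak_load / per_instance_capacity)
    return n + spare


# Example: 4,500 requests/s at peak, 1,000 requests/s per instance.
print(instances_required(4500, 1000))  # 7
```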

Understanding Toil

Not all work is toil. Learn to identify and eliminate it.

Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

Manual

Running a script manually that could be triggered automatically

Repetitive

Work you do over and over, not solving new problems

Automatable

If a machine could do it, a human shouldn't have to

Tactical

Interrupt-driven, reactive work without strategy

No Enduring Value

After you're done, the service is in the same state

Scales with Service

O(n) with service size, not O(1) or O(log n)

Essential Reading

The foundational texts of Site Reliability Engineering

Site Reliability Engineering

Edited by Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Murphy

The original "SRE Book" from Google. Covers principles, practices, and management of SRE teams.

Read free online →

The Site Reliability Workbook

Edited by Betsy Beyer, Niall Murphy, David Rensin, Kent Kawahara, Stephen Thorne

Practical companion to the SRE book with actionable examples and case studies.

Read free online →

Building Secure & Reliable Systems

By Heather Adkins, Betsy Beyer, Paul Blankinship, et al.

Best practices for designing systems that are both secure and reliable from the ground up.

Read free online →

