SRE is what happens when you ask a software engineer to design an operations team. This is a community-driven collection of SRE knowledge, practices, and open source tools.
Hope is not a strategy. Engineering is.
Foundational concepts from Google's Site Reliability Engineering practice
100% reliability is the wrong target. Users can't tell the difference between 99.99% and 99.999% availability, but the engineering cost differs by orders of magnitude. Accept appropriate levels of risk based on user expectations.
Define clear SLOs that represent what users actually care about. Use SLIs to measure, SLOs to set targets, and SLAs for business commitments. Let error budgets drive decision-making.
Toil is work that is manual, repetitive, automatable, tactical, and devoid of enduring value. SRE teams should spend no more than 50% of their time on toil—the rest on engineering.
Monitor for symptoms, not causes. Alert on user-facing issues. Every page should be actionable. If a human doesn't need to take action, it shouldn't be an alert.
Automate yourself out of a job—then find a harder job. Automation provides consistency, a platform for extending, and a mechanism for sharing knowledge.
Releases should be boring. Build automated, repeatable, reliable release processes. Use canary deployments, feature flags, and progressive rollouts to reduce risk.
Every line of code is a liability. Simplicity is a prerequisite for reliability. Remove unnecessary complexity. A system with fewer moving parts has fewer opportunities to fail.
When incidents occur, focus on what happened, not who did it. Create a culture where people feel safe to report issues. Learn from failures without fear of punishment.
Error budgets align development and SRE teams on a shared objective: reliable innovation
An error budget is derived from the SLO. If your SLO is 99.9% availability, your error budget is 0.1% (about 43 minutes per month). This budget can be "spent" on:
When the budget is exhausted: Focus shifts entirely to reliability. No new features until the service is back within SLO.
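The arithmetic behind the 43-minute figure can be sketched in a few lines. This is a minimal illustration, assuming a 30-day window and availability measured in minutes of downtime; the function names `error_budget_minutes` and `budget_remaining` are hypothetical and not part of any SRE tooling.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    `slo` is a fraction, e.g. 0.999 for 99.9%. The 30-day default is an
    assumption; use whatever window your SLO is defined over.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Minutes of error budget left; a negative value means it is exhausted."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"budget: {budget:.1f} min")  # prints: budget: 43.2 min

    # 50 minutes of downtime this window overspends a 99.9% budget.
    remaining = budget_remaining(0.999, downtime_minutes=50)
    print("freeze feature releases" if remaining < 0 else "ok to ship")
```

A negative remainder corresponds to the "budget is exhausted" case above, where work shifts to reliability until the service is back within SLO.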
The hierarchy of service level terminology
SLI (Service Level Indicator): A carefully defined quantitative measure of some aspect of the level of service that is provided.
SLO (Service Level Objective): A target value or range for a service level that is measured by an SLI. Choosing the right SLO is hard: set it too high and you waste resources; set it too low and you frustrate users.
SLA (Service Level Agreement): An explicit or implicit contract with users that includes consequences of meeting (or missing) the SLOs it contains.
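To make the relationship between an SLI and an SLO concrete, here is a small sketch that computes a request-based availability SLI and compares it to a target. The choice of "good requests / total requests" as the indicator, the `RequestCounts` type, and both function names are assumptions made for illustration; real services may pick a different SLI such as latency or freshness.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Hypothetical counters for one measurement window."""
    good: int   # requests that succeeded from the user's point of view
    total: int  # all requests received


def availability_sli(counts: RequestCounts) -> float:
    """SLI: the fraction of requests that were good."""
    if counts.total == 0:
        # Assumption: a window with no traffic is treated as fully available.
        return 1.0
    return counts.good / counts.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: is the measured SLI at or above the target?"""
    return sli >= slo_target


if __name__ == "__main__":
    window = RequestCounts(good=999_500, total=1_000_000)
    sli = availability_sli(window)
    print(meets_slo(sli, 0.999))   # 99.95% measured against a 99.9% target
    print(meets_slo(sli, 0.9999))  # the same measurement against 99.99%
```

The same measurement passes one target and fails the other, which is why choosing the SLO is the hard part: the SLI only reports what happened.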
Practical approaches for building reliable systems
Not all work is toil. Learn to identify and eliminate it.
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
Manual: running a script by hand that could be triggered automatically.
Repetitive: work you do over and over, not solving new problems.
Automatable: if a machine could do it, a human shouldn't have to.
Tactical: interrupt-driven, reactive work without strategy.
No enduring value: after you're done, the service is in the same state.
Scales linearly: O(n) with service size, not O(1) or O(log n).
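One way to keep the 50% guideline honest is to tag logged work as toil or engineering and compute the ratio. The sketch below assumes a team records hours per work item and labels each item by hand; `WorkItem`, `toil_fraction`, and the sample data are hypothetical.

```python
from dataclasses import dataclass

TOIL_CAP = 0.5  # the "no more than 50% of time on toil" guideline


@dataclass
class WorkItem:
    """Hypothetical time-tracking record."""
    name: str
    hours: float
    is_toil: bool  # judged against the attributes listed above


def toil_fraction(items: list[WorkItem]) -> float:
    """Share of logged hours spent on toil (0.0 if nothing was logged)."""
    total = sum(item.hours for item in items)
    if total == 0:
        return 0.0
    toil = sum(item.hours for item in items if item.is_toil)
    return toil / total


if __name__ == "__main__":
    week = [
        WorkItem("manual service restarts", 12, is_toil=True),
        WorkItem("routine access tickets", 10, is_toil=True),
        WorkItem("build restart automation", 18, is_toil=False),
    ]
    fraction = toil_fraction(week)
    status = "over cap" if fraction > TOIL_CAP else "within cap"
    print(f"toil: {fraction:.0%} ({status})")  # prints: toil: 55% (over cap)
```

The labeling step is the subjective part; the six attributes above are the checklist for deciding whether an item counts as toil.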
The foundational texts of Site Reliability Engineering
The original "SRE Book" from Google. Covers principles, practices, and management of SRE teams. Free to read online.
Practical companion to the SRE book with actionable examples and case studies. Free to read online.
Best practices for designing systems that are both secure and reliable from the ground up. Free to read online.
All content is open source. Contribute your knowledge, tools, and experiences.