Site Reliability Engineering

How Google Runs Production Systems

First edition.
  • 3.88 ·
  • 8 Ratings
  • 30 Want to read
  • 3 Currently reading
  • 12 Have read

Members of the SRE team explain how their engagement with the entire software lifecycle has enabled Google to build, deploy, monitor, and maintain some of the largest software systems in the world.

Publish Date
2016
Language
English
Pages
524

Edition Availability
Site Reliability Engineering: How Google Runs Production Systems
2016, O'Reilly Media, Inc.
Paperback in English - First edition.

Book Details


Table of Contents

Introduction. The production environment at Google, from the viewpoint of an SRE
Principles. Embracing risk
Service level objectives
Eliminating toil
Monitoring distributed systems
The evolution of automation at Google
Release engineering
Simplicity
Practices. Practical alerting from time-series data
Being on-call
Effective troubleshooting
Emergency response
Managing incidents
Postmortem culture: learning from failure
Tracking outages
Testing for reliability
Software engineering in SRE
Load balancing at the frontend
Load balancing in the datacenter
Handling overload
Addressing cascading failures
Managing critical state: distributed consensus for reliability
Distributed periodic scheduling with Cron
Data processing pipelines
Data integrity: what you read is what you wrote
Reliable product launches at scale
Management. Accelerating SREs to on-call and beyond
Dealing with interrupts
Embedding an SRE to recover from operational overload
Communication and collaboration in SRE
The evolving SRE engagement model
Conclusions. Lessons learned from other industries.

Edition Notes

Includes bibliographical references (pages 501-512) and index.

Classifications

Dewey Decimal Class
620.00452 SIT
Library of Congress
HD9696.8.U64 G6666 2016, QA76.77

The Physical Object

Format
Paperback
Pagination
xxiv, 524 pages
Number of pages
524

ID Numbers

Open Library
OL27208603M
Internet Archive
sitereliabilitye0000unse
ISBN 10
149192912X
ISBN 13
9781491929124
OCLC/WorldCat
950479609, 930683030
Goodreads
27968891

History

December 20, 2023 Edited by ImportBot import existing book
August 24, 2020 Edited by ImportBot import existing book
December 14, 2019 Edited by l9i Link to the online version
December 14, 2019 Edited by l9i Added new cover
July 19, 2019 Created by MARC Bot import new book