Summary of "Reliability Engineering Mindset • Alex Ewerlöf & Charity Majors • GOTO 2025"
Core themes
SRE and organizational fit
- Google’s SRE materials are influential but describe a high‑resource, Google‑specific recipe.
- Most companies need “fit practices” rather than one-size-fits-all “best practices.”
- In many organizations, platform engineering precedes full SRE adoption.
Ownership and feedback loops
- Modern reliable software depends on engineers owning code in production and keeping feedback loops short.
- Handovers and ambiguous ownership are frequent incident hotspots.
Trade-offs and negotiation
- Reliability costs real money and effort: higher availability (more nines) increases cost and operational complexity.
- SLOs/SLIs make those trade-offs explicit and let teams push back on management with data rather than emotion.
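To make the cost of each extra nine concrete, here is a minimal sketch that converts an availability target into the downtime it allows. The 30-day window and the function name `allowed_downtime_minutes` are assumptions chosen for illustration; the talk does not prescribe a window.

```python
# Allowed downtime for a given availability target over a window.
# The 30-day window and the list of targets are illustrative assumptions.

WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def allowed_downtime_minutes(availability_percent: float,
                             window_minutes: float = WINDOW_MINUTES) -> float:
    """Return the minutes of downtime permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return window_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes of downtime per 30 days")
```

Each additional nine cuts the allowed downtime by a factor of ten (about 43 minutes per 30 days at 99.9%, under half a minute at 99.999%), which is why the cost of meeting the target tends to rise sharply.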
SLI / SLO guidance, pitfalls, and uses
Biggest wins
- The SLI/SLO concepts (introduced by Google) are powerful and broadly applicable.
- They normalize failure and enable the use of error budgets.
Common pitfalls
- Not measuring SLIs at all.
- Measuring the wrong thing (for example, using the same metric for every service).
- Applying cookie‑cutter availability targets that can be weaponized or misapplied by management.
Practical advice
- Define a “service” as the business problem you solve (not just a microservice or database).
- Tailor SLOs to the service class and business impact — treat them as fit practices.
- Treat SLOs as negotiated contracts: surface the costs and trade-offs when higher reliability is requested.
- Use SLOs to depoliticize roadmap decisions and to prioritize work when error budgets are exhausted.
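The error-budget idea behind this advice can be sketched as follows. This is an illustrative calculation for an event-based SLI (good events divided by total events), not a method taken from the talk; the names `SloStatus` and `error_budget_status` and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SloStatus:
    sli: float              # fraction of good events observed in the window
    allowed_bad: float      # bad events the SLO target permits in the window
    budget_consumed: float  # share of the error budget already used (1.0 = all)
    exhausted: bool         # True once the budget is fully spent


def error_budget_status(good_events: int, total_events: int,
                        slo_target: float) -> SloStatus:
    """Compare an observed SLI against an SLO target such as 0.999."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    sli = good_events / total_events
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    consumed = actual_bad / allowed_bad
    return SloStatus(sli, allowed_bad, consumed, consumed >= 1.0)


if __name__ == "__main__":
    # Hypothetical window: one million requests, 800 of them bad.
    status = error_budget_status(999_200, 1_000_000, slo_target=0.999)
    print(f"SLI={status.sli:.4%}, budget used={status.budget_consumed:.0%}, "
          f"exhausted={status.exhausted}")
```

A team could use the `exhausted` flag as the agreed trigger for shifting roadmap time from features to reliability work, which is the depoliticizing effect described above.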
Tools, features, and workflows discussed
Alex’s open-source visual tool
- Front-end (SVG) visualization mapping provider → consumer dependencies.
- Helps list and prioritize failures by business impact and derive meaningful SLIs.
- Used at Volvo to scale SLO rollout.
Honeycomb features (advocated by Charity)
- SLOs as first-class citizens, computed from the same incoming event data used for investigation.
- Seamless exploration from SLO violations to raw events (bubble-up/diff workflows) so teams can pivot from metric to root cause without cross-tool data mismatches.
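The idea of computing an SLO from the same events used for investigation can be illustrated with a small sketch. This is not Honeycomb's API or its BubbleUp algorithm; it is a simplified stand-in that assumes a hypothetical event schema (`endpoint`, `status`, `duration_ms`, `region`) and a made-up latency threshold.

```python
from collections import Counter

# Hypothetical wide events; the field names are illustrative only.
EVENTS = [
    {"endpoint": "/play", "status": 200, "duration_ms": 120, "region": "eu"},
    {"endpoint": "/play", "status": 200, "duration_ms": 95, "region": "us"},
    {"endpoint": "/play", "status": 500, "duration_ms": 30, "region": "eu"},
    {"endpoint": "/play", "status": 200, "duration_ms": 2400, "region": "eu"},
    {"endpoint": "/play", "status": 200, "duration_ms": 110, "region": "us"},
]


def is_good(event: dict, max_duration_ms: int = 1000) -> bool:
    """SLI predicate: a request is good if it succeeded and was fast enough."""
    return event["status"] < 500 and event["duration_ms"] <= max_duration_ms


def sli(events: list[dict]) -> float:
    """Fraction of good events, computed directly from the raw events."""
    if not events:
        raise ValueError("no events to evaluate")
    return sum(1 for e in events if is_good(e)) / len(events)


def bad_events_by(events: list[dict], field: str) -> Counter:
    """Count SLO-violating events per value of one field."""
    return Counter(e[field] for e in events if not is_good(e))


if __name__ == "__main__":
    print(f"SLI: {sli(EVENTS):.0%}")
    print("Bad events by region:", dict(bad_events_by(EVENTS, "region")))
```

Because the SLI and the breakdown read the same list of events, the numbers cannot disagree with each other, which is the mismatch the single-data-source approach is meant to avoid.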
Observability trend
Observability 1.0
- Splintered stores for metrics, traces, and logs.
Observability 2.0
- A single‑source, columnar approach treating observability as a data problem.
- Enables richer queries across signals and reduces cross-tool mismatches.
- Next-gen tools are expected to treat SLOs as primary UI/entry points into systems.
Organization, scale, and rollout
Scaling SLO adoption
- Reaching hundreds of teams requires repeatable workshops, visual tools, and a common language (a goal of Alex’s book).
Org architecture and culture
- Organizational culture, team identity, and social dynamics strongly influence reliability practices.
- Alex’s book includes diagrams and models for org design and communication.
Role of engineering managers
- Managers should quantify and communicate costs, say “no” or propose alternatives, and move discussions from emotional asks to measurable trade-offs.
Concrete examples and anecdotes
Media company example
- Leadership requested 99.999% availability for a streaming app.
- Research showed users tolerated up to roughly 2 hours of downtime (~99.7% availability if measured over a month), demonstrating that ultra-high nines were unnecessary and costly.
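The arithmetic behind this example can be checked with a short sketch. The summary does not state the measurement window, so the 30-day window below is an assumption, and both function names are hypothetical.

```python
# Assumed 30-day measurement window; the source does not specify one.
WINDOW_HOURS = 30 * 24  # 720 hours


def availability_from_downtime(downtime_hours: float,
                               window_hours: float = WINDOW_HOURS) -> float:
    """Availability percentage implied by a tolerated downtime per window."""
    if not 0 <= downtime_hours <= window_hours:
        raise ValueError("downtime must fit inside the window")
    return 100 * (1 - downtime_hours / window_hours)


def downtime_seconds(availability_percent: float,
                     window_hours: float = WINDOW_HOURS) -> float:
    """Seconds of downtime an availability target allows per window."""
    return window_hours * 3600 * (100 - availability_percent) / 100


if __name__ == "__main__":
    tolerated = availability_from_downtime(2)
    requested = downtime_seconds(99.999)
    print(f"2 hours of tolerated downtime -> {tolerated:.2f}% availability")
    print(f"99.999% requested -> {requested:.1f} seconds of downtime allowed")
```

Under this assumption, 2 hours of tolerated downtime corresponds to about 99.72% availability, while the requested 99.999% would leave roughly 26 seconds per 30 days.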
Career path and onboarding tip
- Alex uses TPOP (Tech, People, Operation, Product) as a personal onboarding framework.
Recommended readings
- Fluke: Chance, Chaos, and Why Everything We Do Matters — recommended by Alex
- The Elephant in the Brain — recommended
Main speakers and sources
- Charity Majors — co‑founder & CTO, Honeycomb
- Alex Ewerlöf — senior software engineer at Volvo Cars; author of the SRE/SLO book discussed
- Referenced works and vendors: Google SRE books, Alex Hidalgo (on SLOs), Honeycomb, Volvo Cars, and the Observability 2.0 movement