Summary of "Platform Engineering: From Theory to Practice • Liz Fong-Jones & Lesley Cordero • GOTO 2025"
High-level themes
- Platform engineering and SRE are related, overlapping applications of DevOps. They share technical practices and sociotechnical thinking, but platform engineering can act as a unifying domain that brings together reliability, security, UX, accessibility, documentation, and tooling.
- Kubernetes is a substrate for building developer platforms — it is not the developer platform itself. Platform teams should build internal developer platforms (IDPs) that developers consume, hiding low-level Kubernetes details.
- Observability is central: it provides the data and feedback loops that form accurate mental models, guide decisions, and reveal mismatches between design docs and runtime behavior.
- Open source is essential to modern platforms but faces sustainability, governance, and funding challenges. Companies that benefit from OSS should contribute maintenance, funding, and upstream work.
Technical concepts, product features, and analysis
Observability / telemetry
- Tracing and telemetry validate assumptions (for example, that service A calls service B) and are invaluable for onboarding and debugging.
- OpenTelemetry (OTel) succeeded because vendors, library authors, and users aligned on a common telemetry format and shared SDKs, which removed the need for each vendor to maintain its own.
- Linking runtime telemetry (trace/span names, endpoints) back to source code and commits speeds debugging and onboarding. This linkage is easier in some language/runtime stacks than others.
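As a rough illustration of linking telemetry back to source code, the sketch below records a span-like dictionary that carries the file, line, and commit of the function that produced it. It uses only the standard library. `RECORDED_SPANS`, `BUILD_COMMIT`, the `traced` decorator, and `charge_card` are hypothetical names, and the `code.*` attribute keys are illustrative rather than a statement of any particular telemetry convention.

```python
from functools import wraps

# Hypothetical in-memory sink standing in for a real telemetry exporter.
RECORDED_SPANS = []

# Hypothetical placeholder; a real build would inject the deployed commit hash.
BUILD_COMMIT = "unknown"


def traced(func):
    """Wrap func so each call records a span-like dict with its source location."""
    code = func.__code__
    location = {
        "code.function": func.__name__,
        "code.filepath": code.co_filename,
        "code.lineno": code.co_firstlineno,
        "build.commit": BUILD_COMMIT,
    }

    @wraps(func)
    def wrapper(*args, **kwargs):
        span = {"name": func.__name__, **location}
        try:
            return func(*args, **kwargs)
        finally:
            # Record the span whether the call succeeded or raised.
            RECORDED_SPANS.append(span)

    return wrapper


@traced
def charge_card(amount_cents):
    """Toy handler used only to demonstrate the decorator."""
    return amount_cents > 0


charge_card(500)
print(RECORDED_SPANS[-1])
```

With attributes like these on every span, someone looking at a slow or failing trace can jump to the file and line that emitted it instead of searching the codebase by span name.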
Kubernetes and developer platforms
- Distinguish between Kubernetes as the substrate that platform teams build on and the internal developer platform (IDP) that developers actually consume.
- Platform teams should provide tooling and UX so developers don’t need to author or understand low-level Kubernetes manifests.
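One way to picture this separation is a small platform-owned function that expands a few developer-facing fields into a full Kubernetes Deployment manifest. This is a minimal sketch, not a description of any speaker's platform: `render_deployment`, its parameters, the default values, and the `registry.example.com` image are all hypothetical.

```python
import json


def render_deployment(name, image, replicas=2, port=8080):
    """Expand a small developer-facing spec into an apps/v1 Deployment manifest.

    The defaults (2 replicas, port 8080) are assumed platform defaults.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": dict(labels)},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


# The developer supplies two values; the platform fills in the rest.
manifest = render_deployment("checkout", "registry.example.com/checkout:1.4.2")
print(json.dumps(manifest, indent=2))
```

The developer never edits the manifest directly, so the platform team can change defaults or add fields in one place.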
Documentation and developer experience
- Documentation is part of platform engineering; auto-generated docs from code and comments reduce maintenance burden.
- Misleading documentation is worse than no documentation — observability can detect incorrect mental models created by bad docs.
- Survey developer experience with qualitative inputs (e.g., “how was your last on‑call?”) to inform platform prioritization.
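The point about observability catching misleading documentation can be sketched as a comparison between the service dependencies a design doc claims and the ones that traces actually show. The span records, service names, and `documented` edge set below are made-up sample data, and a real system would pull spans from its tracing backend rather than a hard-coded list.

```python
# Hypothetical span records: each span names its service and its parent span.
spans = [
    {"span_id": "1", "parent_id": None, "service": "frontend"},
    {"span_id": "2", "parent_id": "1", "service": "checkout"},
    {"span_id": "3", "parent_id": "2", "service": "payments"},
    {"span_id": "4", "parent_id": "2", "service": "legacy-billing"},
]

# Hypothetical caller -> callee edges transcribed from a design doc.
documented = {
    ("frontend", "checkout"),
    ("checkout", "payments"),
    ("checkout", "inventory"),
}


def observed_edges(spans):
    """Derive caller -> callee service edges from parent/child span links."""
    by_id = {s["span_id"]: s for s in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span["parent_id"])
        # Ignore root spans and calls that stay within one service.
        if parent is not None and parent["service"] != span["service"]:
            edges.add((parent["service"], span["service"]))
    return edges


observed = observed_edges(spans)
undocumented = observed - documented  # seen at runtime, missing from the doc
unobserved = documented - observed    # claimed by the doc, never seen

print("undocumented:", sorted(undocumented))
print("unobserved:", sorted(unobserved))
```

An edge that is only in the doc may be stale documentation or simply a path the sampled traces did not exercise, so either list is a prompt for investigation rather than proof of an error.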
Prioritization, toil, and lifecycle
- Apply SRE concepts: measure and pay down toil (repetitive, automatable work). The organization’s stage dictates how much technical debt is tolerable.
- Prioritize using evidence: telemetry, developer surveys, and growth indicators (for example, more frequent pushes). Move from reactive → proactive → preventive modes.
- Practical example: reducing CI/build times (e.g., from ~14 minutes to ~6.5 minutes) is high-impact platform work even if not on a formal roadmap.
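To make the build-time example concrete, the sketch below turns the before and after figures into time saved. Only the 14 and 6.5 minute figures come from the summary; the `build_time_savings` helper and the figure of 200 builds per day are assumptions chosen purely for illustration.

```python
def build_time_savings(before_min, after_min, builds_per_day):
    """Estimate the time recovered by a faster build.

    builds_per_day is an assumed input; substitute a measured value.
    """
    saved_per_build = before_min - after_min
    return {
        "saved_per_build_min": saved_per_build,
        "reduction_pct": round(100 * saved_per_build / before_min, 1),
        "saved_hours_per_day": saved_per_build * builds_per_day / 60,
    }


# 14 and 6.5 minutes are from the summary; 200 builds per day is hypothetical.
result = build_time_savings(14, 6.5, builds_per_day=200)
print(result)
```

Under these assumptions each build finishes 7.5 minutes sooner, a reduction of about 53.6%, which adds up to 25 hours of build time per day. That figure is wait time summed across builds, not necessarily engineer hours recovered, but it gives a platform team an evidence-based number to weigh against other work.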
Open source dynamics and sustainability
- Challenges include maintainer burnout, corporate-dominant contributor bases, cross-ecosystem security/patching demands, and contributor licensing/relicensing issues.
- Mitigations:
  - Corporate sponsorship and paid time for employees to contribute upstream.
  - Paid internships or programs (Google Summer of Code style) to teach contribution.
  - Rotating engineers into upstream work, writing good bug reports, and learning maintainer norms to reduce friction.
  - Treat OSS funding and contributor time as corporate responsibility for projects the company relies on.
People, process, and culture
- Psychological safety and the ability to experiment are crucial. Teams succeed when they can try things, fail safely, and iterate — this requires organizational support.
- Platform teams must provide value and enable developers. Coercive “you must use our platform” approaches drive shadow ops; instead, make migration attractive.
- Onboarding practices:
  - Use living oral histories where recent hires explain system architecture.
  - Show live telemetry to new engineers and encourage them to flag surprising observations; this surfaces false assumptions.
Guides, tutorials, and example resources
- Google SRE book — a conversation starter and guide to the experimentation mindset; not a one‑to‑one recipe for every org.
- Mikey Dickerson’s course “Creating and Running Reliable Systems” — lab-style assignments (scale, zone failures) for teaching production practices.
- Google Summer of Code / corporate-paid internship models — ways to teach and fund open source contribution.
- O’Reilly course (briefly mentioned) — “How to Become an Open Source Contributor” for practical onboarding.
Practical takeaways / recommendations
- Treat platform engineering as a unifying product: combine reliability, security, UX, docs, and tooling into sensible defaults and consultative services.
- Invest in observability and link telemetry to code to speed onboarding and improve mental models.
- Prioritize platform work using evidence (surveys, telemetry, growth signals) and balance reactive versus preventive investments.
- Encourage rotation into upstream open source and allocate corporate funding/time to sustain projects the company depends on.
- Avoid imposing monolithic rules; build platforms that teams actually want to adopt, so they have no reason to resort to shadow ops.
Main speakers / sources
- Liz Fong‑Jones — Honeycomb (field CTO / co‑CTO; formerly Google)
- Lesley Cordero — Staff Engineer, The New York Times