Summary of "I can't believe nobody's done this before..."
High-level summary
- OpenAI changed its Responses API to support persistent WebSocket connections — a significant architectural shift with practical performance gains for agent-style workloads.
- Core benefit: WebSockets let the API server keep in-memory session state and guarantee routing to the same server box, so clients only need to send new inputs (for example, tool-call outputs) instead of re-sending the whole chat history every time.
- Measured/claimed impacts: large bandwidth reductions (quoted as "90%+" in some cases) and latency/throughput gains for multi-tool-call agent runs (OpenAI and the host cite roughly 20–40% speedups when many tool calls are involved).
- Core takeaway: persistent connections reduce repeated history transmission and orchestration overhead, making multi-step, tool-heavy agent runs much more efficient.
Technical problem explained
How agents typically work:
- User prompt → model generates → model decides to call tools (for example, ls, read file) → external tool runs → tool output must be fed back into the model to continue generation.
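The loop above can be sketched in Python. Everything here is a stand-in: `call_model` and `run_tool` are hypothetical functions, not real OpenAI API calls — the point is only the control flow (generate → tool call → feed output back → continue).

```python
# Minimal sketch of the agent loop described above. `call_model` and
# `run_tool` are hypothetical stand-ins, not a real model API or tool runner.

def call_model(history):
    """Pretend model: asks for a tool until a tool result appears, then answers."""
    if any(msg["role"] == "tool" for msg in history):
        return {"type": "final", "text": "done"}
    return {"type": "tool_call", "tool": "ls", "args": {"path": "."}}

def run_tool(name, args):
    """Pretend tool runner for the hypothetical `ls` tool."""
    return f"ran {name} with {args}"

def agent_run(prompt):
    history = [{"role": "user", "content": prompt}]
    while True:
        step = call_model(history)          # model generates
        if step["type"] == "final":
            return step["text"], history
        # Model decided to call a tool; run it and feed the output back in.
        output = run_tool(step["tool"], step["args"])
        history.append({"role": "tool", "content": output})

text, history = agent_run("list my files")
```

Note that each pass through the loop needs the full `history` — which is exactly where the stateless-API cost described below comes from.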
Problems with current stateless REST/SSE approach:
- Every tool-call result and every continuation requires sending the full conversation/context back to the API.
- The model is effectively stateless between requests, so each continuation requires the entire history.
- Consequences: repeated transmission of a growing history wastes bandwidth and compute. Real-world examples include sending huge histories (100k tokens / megabytes) to get a very short reply (a few tokens).
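A rough illustration of that resend cost: when every continuation uploads the whole history so far, total upload grows quadratically with the number of turns. The numbers below are made up for illustration.

```python
def stateless_upload_bytes(turn_sizes):
    """Total bytes uploaded when every turn resends the whole history so far."""
    total, history = 0, 0
    for size in turn_sizes:
        history += size        # history grows by this turn's new content
        total += history       # ...and the client uploads all of it again
    return total

# Ten turns of 1 kB each: the client uploads 55 kB in total, not 10 kB.
print(stateless_upload_bytes([1000] * 10))  # 55000
```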
Notes on mitigations:
- Caching: reduces compute (less work to generate next tokens) but does not reduce upload size — the client still must send the entire history. The cache is keyed by a hash of that history.
- Compaction (summarization): shortens history and reduces tokens sent, but breaks reuse of caches keyed to the original history. It’s a tradeoff between upload size and cacheability.
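The cache/compaction tension can be seen in a toy model: if the cache key is a hash of the exact history, summarizing that history produces a different key and misses the cache. The hashing scheme below is illustrative, not OpenAI's actual implementation.

```python
import hashlib

cache = {}

def cache_key(history):
    """Toy cache key: hash of the exact history text."""
    return hashlib.sha256("".join(history).encode()).hexdigest()

def lookup(history):
    return cache_key(history) in cache

history = ["user: hi", "assistant: hello", "user: run ls"]
cache[cache_key(history)] = "precomputed state"  # server cached this exact prefix

assert lookup(history)        # identical history → cache hit

compacted = ["summary: user greeted and asked to run ls"]
assert not lookup(compacted)  # summarized history → different hash → miss
```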
Why WebSockets help
- Persistent connection guarantees you stay connected to the same API server instance, allowing that box to maintain in-memory session context without reloading from an external cache on every turn.
- Clients can send only deltas / new inputs (for example, tool outputs), dramatically reducing bandwidth and orchestration overhead.
- Most valuable for agent runs that perform many sequential tool calls (tens of calls or more). For simple chat apps with infrequent follow-ups, the benefit may not justify maintaining a persistent connection.
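The delta-only saving can be sketched by comparing a stateless client, which resends everything each turn, with a toy session object that keeps history server-side and accepts only new input:

```python
class StatefulSession:
    """Toy model of a server that keeps session state in memory,
    so the client sends only the new input each turn."""
    def __init__(self):
        self.history = []
        self.bytes_received = 0

    def send_delta(self, new_input):
        self.bytes_received += len(new_input)
        self.history.append(new_input)

turns = ["x" * 1000 for _ in range(10)]

# Stateless: turn n re-uploads all n chunks.
stateless = sum(len("".join(turns[: i + 1])) for i in range(len(turns)))

session = StatefulSession()
for t in turns:
    session.send_delta(t)

print(stateless, session.bytes_received)  # 55000 vs 10000
```

The gap widens with every additional tool call, which is why the benefit concentrates in long agent runs rather than occasional chat follow-ups.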
Infrastructure complexities discussed
- OpenAI’s previous orchestration used front-end API boxes that route requests to thousands of GPUs; stateless routing required clients to resend history so any box could reconstitute state.
- A global persistent cache shared by every API box would be complex, add latency, and be impractical at scale.
- WebSockets provide a simpler guarantee: keep the session on the same box for the duration instead of building massive cross-box state services.
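Session affinity amounts to a simple rule: assign a connection to some box once, then pin every later message on that connection to the same box. A toy sticky router (the routing scheme is illustrative, not OpenAI's):

```python
import itertools

class StickyRouter:
    """Toy sticky router: a connection is assigned a box once, and every
    later message on that connection is routed to the same box."""
    def __init__(self, boxes):
        self._boxes = itertools.cycle(boxes)   # naive round-robin assignment
        self._assignments = {}                 # connection_id -> box

    def route(self, connection_id):
        if connection_id not in self._assignments:
            self._assignments[connection_id] = next(self._boxes)
        return self._assignments[connection_id]

router = StickyRouter(["box-a", "box-b", "box-c"])
first = router.route("conn-1")
# Every subsequent message on conn-1 lands on the same box:
assert all(router.route("conn-1") == first for _ in range(5))
```

Because the box never changes mid-session, its in-memory state stays valid for the whole run — no shared cross-box cache required.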
Standards and ecosystem
- Open Responses: an open standard inspired by OpenAI’s Responses API for shared request/response shapes, streaming semantics, and tool-invocation patterns. The standard is open-sourced and meant to be provider-agnostic (Anthropic, Gemini, etc.).
- The WebSocket capability is not yet part of the Open Responses standard but is expected to be added soon. Because OpenAI’s approach often becomes a de facto standard, this likely accelerates broader adoption.
Practical takeaways / recommendations for developers
- Use WebSockets for agentic workloads that involve many tool calls to reduce bandwidth and improve speed.
- For ordinary chat apps with infrequent follow-ups, WebSockets are less likely to provide large benefits.
- Monitor network traffic for long agent runs to understand payload sizes and identify where savings occur.
- Understand tradeoffs between caching and compaction:
- Caching = compute savings, does not reduce upload size.
- Compaction = smaller histories, breaks caching, and is an explicit design decision.
- Expect architectural shifts across the stack: networking, request formats, caching, tool integrations, and CI/build flows may all evolve.
- Consider adopting Open Responses–compatible interfaces for portability across providers.
Sponsor note (product briefly reviewed)
- Blacksmith (CI provider): sponsor claims major CI speedups (build times cut in half for TypeScript, Docker builds up to 40× faster via NVMe layer caching), lower costs, and better observability/debugging compared to native GitHub Actions. The host reported using it for their organization’s CI.
Speakers / sources referenced
- Video host / narrator (unnamed in transcript) — primary explainer and tester of the new API behavior.
- OpenAI — implemented the Responses API WebSocket feature and provided performance figures.
- Open Responses — open standard inspired by OpenAI’s Responses API.
- Blacksmith — sponsor and CI product mentioned and briefly reviewed.
- Other providers mentioned for context: Anthropic, Google Gemini.