Summary of "I can't believe nobody's done this before..."
High-level summary
- OpenAI changed its Responses API to support persistent WebSocket connections — a significant architectural shift with practical performance gains for agent-style workloads.
- Core benefit: WebSockets let the API server keep in-memory session state and guarantee routing to the same server box, so clients only need to send new inputs (for example, tool-call outputs) instead of re-sending the whole chat history every time.
- Measured/claimed impacts: large bandwidth reductions (quoted as "90%+" in some cases) and latency/throughput gains for multi-tool-call agent runs (OpenAI and the host cite roughly 20–40% speedups when many tool calls are involved).
- Core takeaway: persistent connections reduce repeated history transmission and orchestration overhead, making multi-step, tool-heavy agent runs much more efficient.
Technical problem explained
How agents typically work:
- User prompt → model generates → model decides to call tools (for example, ls, read file) → external tool runs → tool output must be fed back into the model to continue generation.
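The loop above can be sketched in Python. Everything here is a stand-in: `call_model` and `run_tool` are hypothetical functions, not real OpenAI API calls — the point is only the control flow (generate → tool call → feed output back → continue).

```python
# Minimal sketch of the agent loop described above. `call_model` and
# `run_tool` are hypothetical stand-ins, not a real model API or tool runner.

def call_model(history):
    """Pretend model: asks for a tool until a tool result appears, then answers."""
    if any(msg["role"] == "tool" for msg in history):
        return {"type": "final", "text": "done"}
    return {"type": "tool_call", "tool": "ls", "args": {"path": "."}}

def run_tool(name, args):
    """Pretend tool runner for the hypothetical `ls` tool."""
    return f"ran {name} with {args}"

def agent_run(prompt):
    history = [{"role": "user", "content": prompt}]
    while True:
        step = call_model(history)          # model generates
        if step["type"] == "final":
            return step["text"], history
        # Model decided to call a tool; run it and feed the output back in.
        output = run_tool(step["tool"], step["args"])
        history.append({"role": "tool", "content": output})

text, history = agent_run("list my files")
```

Note that each pass through the loop needs the full `history` — which is exactly where the stateless-API cost described below comes from.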
Problems with current stateless REST/SSE approach:
- Every tool-call result and every continuation requires sending the full conversation/context back to the API.
- The model is effectively stateless between requests, so each continuation requires the entire history.
- Consequences: repeated transmission of a growing history wastes bandwidth and compute. Real-world examples include sending huge histories (100k tokens / megabytes) to get a very short reply (a few tokens).
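A rough illustration of that resend cost: when every continuation uploads the whole history so far, total upload grows quadratically with the number of turns. The numbers below are made up for illustration.

```python
def stateless_upload_bytes(turn_sizes):
    """Total bytes uploaded when every turn resends the whole history so far."""
    total, history = 0, 0
    for size in turn_sizes:
        history += size        # history grows by this turn's new content
        total += history       # ...and the client uploads all of it again
    return total

# Ten turns of 1 kB each: the client uploads 55 kB in total, not 10 kB.
print(stateless_upload_bytes([1000] * 10))  # 55000
```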
Notes on mitigations:
- Caching: reduces compute (less work to generate next tokens) but does not reduce upload size — the client still must send the entire history. The cache is keyed by a hash of that history.
- Compaction (summarization): shortens history and reduces tokens sent, but breaks reuse of caches keyed to the original history. It’s a tradeoff between upload size and cacheability.
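The cache/compaction tension can be seen in a toy model: if the cache key is a hash of the exact history, summarizing that history produces a different key and misses the cache. The hashing scheme below is illustrative, not OpenAI's actual implementation.

```python
import hashlib

cache = {}

def cache_key(history):
    """Toy cache key: hash of the exact history text."""
    return hashlib.sha256("".join(history).encode()).hexdigest()

def lookup(history):
    return cache_key(history) in cache

history = ["user: hi", "assistant: hello", "user: run ls"]
cache[cache_key(history)] = "precomputed state"  # server cached this exact prefix

assert lookup(history)        # identical history → cache hit

compacted = ["summary: user greeted and asked to run ls"]
assert not lookup(compacted)  # summarized history → different hash → miss
```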
Why WebSockets help
- Persistent connection guarantees you stay connected to the same API server instance, allowing that box to maintain in-memory session context without reloading from an external cache on every turn.
- Clients can send only deltas / new inputs (for example, tool outputs), dramatically reducing bandwidth and orchestration overhead.
- Most valuable for agent runs that perform many sequential tool calls (tens of calls or more). For simple chat apps with infrequent follow-ups, the benefit may not justify maintaining a persistent connection.
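The delta-only saving can be sketched by comparing a stateless client, which resends everything each turn, with a toy session object that keeps history server-side and accepts only new input:

```python
class StatefulSession:
    """Toy model of a server that keeps session state in memory,
    so the client sends only the new input each turn."""
    def __init__(self):
        self.history = []
        self.bytes_received = 0

    def send_delta(self, new_input):
        self.bytes_received += len(new_input)
        self.history.append(new_input)

turns = ["x" * 1000 for _ in range(10)]

# Stateless: turn n re-uploads all n chunks.
stateless = sum(len("".join(turns[: i + 1])) for i in range(len(turns)))

session = StatefulSession()
for t in turns:
    session.send_delta(t)

print(stateless, session.bytes_received)  # 55000 vs 10000
```

The gap widens with every additional tool call, which is why the benefit concentrates in long agent runs rather than occasional chat follow-ups.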
Infrastructure complexities discussed
- OpenAI’s previous orchestration used front-end API boxes that route requests to thousands of GPUs; stateless routing required clients to resend history so any box could reconstitute state.
- A global persistent cache shared by every API box would be complex, add latency, and be impractical at scale.
- WebSockets provide a simpler guarantee: keep the session on the same box for the duration instead of building massive cross-box state services.
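Session affinity amounts to a simple rule: assign a connection to some box once, then pin every later message on that connection to the same box. A toy sticky router (the routing scheme is illustrative, not OpenAI's):

```python
import itertools

class StickyRouter:
    """Toy sticky router: a connection is assigned a box once, and every
    later message on that connection is routed to the same box."""
    def __init__(self, boxes):
        self._boxes = itertools.cycle(boxes)   # naive round-robin assignment
        self._assignments = {}                 # connection_id -> box

    def route(self, connection_id):
        if connection_id not in self._assignments:
            self._assignments[connection_id] = next(self._boxes)
        return self._assignments[connection_id]

router = StickyRouter(["box-a", "box-b", "box-c"])
first = router.route("conn-1")
# Every subsequent message on conn-1 lands on the same box:
assert all(router.route("conn-1") == first for _ in range(5))
```

Because the box never changes mid-session, its in-memory state stays valid for the whole run — no shared cross-box cache required.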
Standards and ecosystem
- Open Responses: an open standard inspired by OpenAI’s Responses API for shared request/response shapes, streaming semantics, and tool-invocation patterns. The standard is open-sourced and meant to be provider-agnostic (Anthropic, Gemini, etc.).
- The WebSocket capability is not yet part of the Open Responses standard but is expected to be added soon. Because OpenAI’s approach often becomes a de facto standard, this likely accelerates broader adoption.
Practical takeaways / recommendations for developers
- Use WebSockets for agentic workloads that involve many tool calls to reduce bandwidth and improve speed.
- For ordinary chat apps with infrequent follow-ups, WebSockets are less likely to provide large benefits.
- Monitor network traffic for long agent runs to understand payload sizes and identify where savings occur.
- Understand tradeoffs between caching and compaction:
- Caching = compute savings, does not reduce upload size.
- Compaction = smaller histories, breaks caching, and is an explicit design decision.
- Expect architectural shifts across the stack: networking, request formats, caching, tool integrations, and CI/build flows may all evolve.
- Consider adopting Open Responses–compatible interfaces for portability across providers.
Sponsor note (product briefly reviewed)
- Blacksmith (CI provider): sponsor claims major CI speedups (build times cut in half for TypeScript, Docker builds up to 40× faster via NVMe layer caching), lower costs, and better observability/debugging compared to native GitHub Actions. The host reported using it for their organization’s CI.
Speakers / sources referenced
- Video host / narrator (unnamed in transcript) — primary explainer and tester of the new API behavior.
- OpenAI — implemented the Responses API WebSocket feature and provided performance figures.
- Open Responses — open standard inspired by OpenAI’s Responses API.
- Blacksmith — sponsor and CI product mentioned and briefly reviewed.
- Other providers mentioned for context: Anthropic, Google Gemini.