Streaming LLM Responses: SSE, WebSocket, JSON (2026)

Q: When should I choose SSE versus WebSockets versus HTTP/2 chunked for LLM streaming?

Pick SSE for unidirectional text streaming (chat surfaces, code generation, structured output) because it has built-in reconnect via Last-Event-ID, every major browser and backend language supports it well, and the protocol overhead is negligible. Pick WebSockets when the channel must be bidirectional (voice agents that stream audio both ways, chat with mid-turn user corrections, interactive cancellation) because the duplex channel removes the round-trip cost of opening a new request for every user message. Pick HTTP/2 chunked (raw fetch streams; strictly, HTTP/2 carries the body as DATA frames rather than HTTP/1.1 chunked transfer encoding) when the client is a modern browser and the response is structured-output JSON being progressively built, because raw byte streams without SSE’s framing are the cleanest fit. The cost tradeoffs are real: SSE is the cheapest to operate (pure HTTP, no sticky sessions), WebSockets need sticky load balancing and stateful connection accounting, and HTTP/2 chunked needs HTTP/2 end-to-end with reconnect logic in the application. Many production estates run all three: SSE for the main chat, WebSockets for voice, HTTP/2 chunked for structured-output APIs consumed by the company’s own apps. The architectural rule is to pick per-surface, not per-platform: a single product can ship all three transports with one shared observability and cancellation contract underneath.
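To make the "negligible protocol overhead" point concrete, here is a minimal sketch of how one SSE frame is serialised on the wire. The function name `format_sse_event` is illustrative, not part of any SDK; the `id:` field is what the browser echoes back as Last-Event-ID when it reconnects.

```python
from typing import Optional


def format_sse_event(data: str,
                     event_id: Optional[str] = None,
                     event: Optional[str] = None) -> str:
    """Serialise one Server-Sent Events frame (illustrative helper)."""
    lines = []
    if event_id is not None:
        # The client sends this value back as Last-Event-ID on reconnect.
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # A payload containing newlines needs one "data:" line per row.
    for row in data.split("\n"):
        lines.append(f"data: {row}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"
```

The per-token overhead is a handful of bytes of field names plus the terminating blank line, which is why SSE stays cheap even at one frame per token.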

Q: What is the realistic TTFT budget for a production chat surface in 2026?

Under 800 ms p50 and under 1.8 s p99 for chat; under 250 ms p50 for voice agents, where the user perceives anything over a third of a second as a stalled turn. The budget decomposes into seven stages, each worth its own optimisation (figures are p50/p99 in ms): client-to-edge network (30/120), tuned by CDN POP placement; edge-to-provider-region (50/200), tuned by selecting a provider region on the same continent as your users; auth and rate-limit gateway (20/100), tuned by caching auth lookups; application preprocessing such as RAG retrieval, classification, and routing (100/500); provider queue time (30/800), influenced by your provider tier and traffic priority; model prefill on KV-cache miss (200/1500); first decode token (30/150). The sum is roughly 460 ms p50 and 3.4 s p99 in baseline shape (adding per-stage p99s is a pessimistic estimate, since stages rarely hit their tails together), which means the only realistic way to approach the 1.8 s p99 target is aggressive prefix caching to collapse the prefill term. The single biggest lever is prompt-cache hit rate: the same 4 000-token system prompt costs 800–1500 ms cold and 50–150 ms warm, so putting stable system prompts and few-shot examples at the front of every request (where they cache) and dynamic content at the back is the architectural pattern that pays off the most. Speculative decoding helps throughput but does nothing for TTFT; these are different metrics with different levers, and optimising the wrong one is a common mistake.
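The arithmetic behind the budget can be checked with a short sketch. The stage figures are the ones quoted above; the dictionary keys and function names are illustrative labels, not standard metric names.

```python
# Per-stage latency budget in milliseconds: (p50, p99), as quoted in the text.
STAGES_MS = {
    "client_to_edge": (30, 120),
    "edge_to_provider_region": (50, 200),
    "auth_and_rate_limit": (20, 100),
    "app_preprocessing": (100, 500),
    "provider_queue": (30, 800),
    "prefill_cache_miss": (200, 1500),
    "first_decode_token": (30, 150),
}


def budget_totals(stages):
    """Sum the per-stage p50 and p99 figures.

    Adding p99s gives a pessimistic bound, not a true end-to-end p99.
    """
    p50 = sum(p50 for p50, _ in stages.values())
    p99 = sum(p99 for _, p99 in stages.values())
    return p50, p99


def largest_p99_stage(stages):
    """Return the stage name that contributes most to the summed p99."""
    return max(stages, key=lambda name: stages[name][1])
```

`budget_totals(STAGES_MS)` gives 460 ms and 3370 ms, and `largest_p99_stage(STAGES_MS)` picks out the cache-miss prefill term, which is why prefix caching is the first lever to pull.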

Q: How do I make sure client disconnects actually cancel the provider call and stop the bill?

By chaining an AbortController end-to-end from the writable response stream through the SDK to the underlying HTTPS connection, and verifying the cancellation path in load tests so you discover broken propagation in staging rather than in the first production incident. The Node.js idiom is to attach an abort listener to the response’s "close" event (which fires when the client disconnects, before the application would otherwise notice), pass the abort signal as the SDK call’s signal option (every 2026 SDK, including OpenAI, Anthropic, Bedrock async runtime, Mistral, and Gemini, supports AbortController.signal in Node or async-context cancellation in Python), and check signal.aborted inside the for-await loop so you stop reading even if the SDK does not immediately notice the abort. The Python idiom is the same idea expressed through async cancellation: cancel the task (or cancel scope) that wraps the streaming call, so the SDK’s async context manager closes the connection on the way out. The provider stops billing at connection close, not at user disconnect, so every hop that holds the connection open after the user is gone costs real money. Beyond the SDK contract, emit a metric on every cancellation so the cancellation rate is visible on dashboards (a rate above 5% indicates TTFT regression or UX issues), and treat orphaned-generation cost-per-stream as a first-class metric. The architectural rule is that cancellation is not a feature; it is a billing-correctness requirement that must be load-tested into the system.
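The Python side of this can be sketched with plain asyncio. `FakeProviderStream` below is a hypothetical stand-in for an SDK stream, not a real SDK class; the assumption is that a real SDK closes its HTTPS connection when its async context manager exits.

```python
import asyncio


class FakeProviderStream:
    """Hypothetical stand-in for an SDK streaming response."""

    def __init__(self):
        self.closed = False
        self.tokens_generated = 0

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Assumption: a real SDK closes the provider connection here.
        self.closed = True
        return False  # never swallow CancelledError

    def __aiter__(self):
        return self

    async def __anext__(self):
        await asyncio.sleep(0.01)  # simulated inter-token latency
        self.tokens_generated += 1
        return f"tok{self.tokens_generated}"


async def relay(stream, send, disconnected):
    """Forward tokens to the client until it disconnects or we are cancelled."""
    async with stream:
        async for token in stream:
            if disconnected.is_set():  # belt and braces, like signal.aborted
                break
            await send(token)


async def demo():
    stream = FakeProviderStream()
    disconnected = asyncio.Event()
    sent = []

    async def send(token):
        sent.append(token)

    task = asyncio.create_task(relay(stream, send, disconnected))
    await asyncio.sleep(0.05)   # the client reads a few tokens...
    disconnected.set()          # ...then goes away
    task.cancel()               # interrupt the in-flight read as well
    try:
        await task
    except asyncio.CancelledError:
        pass
    return stream.closed, len(sent)
```

Running `asyncio.run(demo())` should report the stream as closed after only a handful of tokens. A load test can make the same check against the real SDK by counting provider-side tokens after forced disconnects.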

Q: How do I parse JSON that arrives token by token without my UI flickering?

Use the SDK’s typed events (OpenAI Responses API emits response.output_text.delta and response.function_call_arguments.delta; Anthropic Messages streaming emits content_block_start, content_block_delta, content_block_stop; Bedrock Converse has equivalent typed events), a tested partial-JSON library (partial-json-parser for JS, json-stream for Python, streamparser-json for streaming-first parsing in JS), or schema-guided decoding that guarantees every prefix of the output is a valid prefix of a valid JSON object. Naively calling JSON.parse on the accumulated buffer throws on every chunk except the last, which is why hand-rolled streaming JSON parsers are the source of so much production pain. The 2026 production pattern is to use the SDK’s typed events when the SDK provides them (it does, for every major provider), and only fall back to a partial-JSON library when the SDK does not expose the typed shape. For high-stakes structured output (financial extraction, medical coding, legal document generation), schema-guided decoding via Outlines or llguidance is worth the 10–20% throughput hit because every chunk is guaranteed by construction to be a valid prefix; Microsoft TypeChat is a related option, though it works by validating and repairing output against a schema rather than constraining the decoder. For list-shaped responses, JSONL line-delimited streaming (one JSON object per line, newline delimited) is the cleanest pattern because each line parses independently. Avoid rolling your own state machine across tool-call-interleaved-with-text boundaries; that is the path of pain.
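The JSONL pattern is simple enough to sketch with the standard library alone. The class name `JsonlStreamParser` is illustrative; the sketch assumes the model emits one compact JSON object per line, so a raw newline never appears inside an object.

```python
import json


class JsonlStreamParser:
    """Incrementally parse newline-delimited JSON arriving in arbitrary chunks."""

    def __init__(self):
        self._buffer = ""

    def feed(self, chunk):
        """Add a chunk and return every object completed by it."""
        self._buffer += chunk
        # Everything before the last newline is complete; the tail is kept.
        *complete, self._buffer = self._buffer.split("\n")
        return [json.loads(line) for line in complete if line.strip()]
```

A usage example: feeding `'{"a": 1}\n{"b"'` yields only the first object, and feeding `': 2}\n'` afterwards yields the second. The UI only ever renders whole objects, so nothing flickers.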

Q: Why does my SSE stream work on localhost but break in production?

Almost always because some L7 hop between the application and the user is buffering the text/event-stream response, holding the first token until either the buffer fills or the connection times out. Corporate proxies, AWS ALB with a short idle timeout, NGINX with default proxy_buffering on, CloudFront with the wrong cache-behavior, Azure Front Door with default response-buffering, and GCP Load Balancer without HTTP/2 streaming enabled are the most common culprits. The fix is to set explicit no-buffer headers at the application (Cache-Control: no-cache, X-Accel-Buffering: no, Content-Type: text/event-stream), to configure every L7 layer between your application and the user to pass text/event-stream through without buffering (proxy_buffering off in NGINX, a longer idle timeout on AWS ALB, equivalent settings on every other edge), and to test from the actual user network (a corporate user behind a proxy, a mobile user on a flaky carrier) rather than from localhost. The diagnostic is to time the first byte of the response on the wire: if curl -N against your production endpoint sees the first event in 200 ms but the same request from a user’s browser sees it in 30 seconds, you have a buffering hop. The architectural rule is that the SSE configuration runs through every layer of your edge stack, and any new hop (CDN, WAF, security inspection appliance) is a buffering risk until proven otherwise.
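One cheap regression guard is to audit the response headers that actually reach the client. The sketch below checks only the three application-level headers named above; the function name and the idea of running it from a CI probe are assumptions, and a clean result does not prove that no intermediate hop is buffering.

```python
# The three application-level headers named in the text.
REQUIRED_SSE_HEADERS = {
    "content-type": "text/event-stream",
    "cache-control": "no-cache",
    "x-accel-buffering": "no",
}


def sse_header_problems(headers):
    """Return human-readable problems found in a response's headers."""
    normalised = {k.lower(): str(v).lower() for k, v in headers.items()}
    problems = []
    for name, expected in REQUIRED_SSE_HEADERS.items():
        actual = normalised.get(name)
        if actual is None:
            problems.append(f"missing {name}")
        elif expected not in actual:
            # Substring match tolerates e.g. "text/event-stream; charset=utf-8".
            problems.append(f"{name} is {actual!r}, expected {expected!r}")
    return problems
```

Timing the first byte from a real user network remains the decisive test; this check only catches the case where a deploy or a new edge layer silently strips or rewrites the headers.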

Q: How do reasoning-model thinking events interact with streaming UI?

They require a visible UI affordance during the thinking phase, because at typical reasoning-token-to-output ratios of 3–10x, the user will otherwise sit through a 14-second blank period and disconnect. OpenAI o-series, DeepSeek-R1, Claude extended-thinking, and Gemini deep-think all run a distinct "thinking" phase before the final answer; the streaming SDK exposes it via a separate event type (reasoning-summary deltas such as response.reasoning_summary_text.delta for OpenAI, thinking content blocks for Claude, equivalents for others) so your UI can surface a "thinking..." spinner or, for power-user surfaces, a collapsible reasoning trace. The architectural pattern is to render a thinking affordance immediately when the thinking event starts, to optionally show a token count or a brief excerpt for transparency, and to render the final answer with a clear visual transition when the thinking phase ends. The cost angle matters too: thinking tokens are billed at full rate even though they are not displayed, so the cost-per-stream metric for reasoning models is meaningfully different from non-reasoning models. The pairing with [the reasoning LLM models production pattern](/en/blog/reasoning-llm-models-production-architecture-o-series-deepseek-r1-claude-thinking-2026) covers the deeper economics; the streaming-specific rule is that the thinking phase needs UI affordance or the user will abandon.

Q: What is the right backpressure pattern when the consumer is slower than the producer?

Let the transport’s natural backpressure (TCP window, writable-stream drain event, the equivalent on each transport) propagate to the SDK reader, and never use unbounded in-memory queues between the SDK and the transport. A mobile client on poor network may read SSE at 5 KB/s while the model generates at 25 KB/s; naive code buffers the difference unbounded, memory climbs, and the application eventually OOMs. The correct pattern uses the language’s native async iteration semantics that already respect backpressure if the underlying writable applies it correctly — for-await over the SDK stream in Node.js when the response writable is propagating drain events, async-for in Python with an async-iterator-aware transport. The architectural rule is to verify backpressure behaviour explicitly in load tests by throttling the consumer side and watching memory: if memory grows monotonically with the production-consumption mismatch, the backpressure chain is broken somewhere. The fix is almost always to remove an unbounded queue — a Promise.all collection, a setImmediate fan-out, a worker pool with no bounded inbox — and let the natural propagation work. For systems that absolutely need a buffer (smoothing token-emit-time jitter for voice TTS, for example), use a bounded queue with explicit drop semantics rather than an unbounded one.
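A bounded queue makes the propagation visible in a few lines of asyncio. This is a sketch under simplifying assumptions: the producer stands in for the SDK reader, the consumer for a slow client write, and the names are illustrative.

```python
import asyncio


async def produce(queue, n_tokens):
    """Stand-in for the SDK reader: emits tokens as fast as it is allowed to."""
    for i in range(n_tokens):
        await queue.put(f"tok{i}")  # suspends while the queue is full
    await queue.put(None)           # sentinel: end of stream


async def consume(queue, delay, high_water):
    """Stand-in for a slow client: records the deepest queue it ever saw."""
    received = 0
    while True:
        high_water[0] = max(high_water[0], queue.qsize())
        token = await queue.get()
        if token is None:
            return received
        received += 1
        await asyncio.sleep(delay)  # simulated slow network write


async def run_backpressure_demo(n_tokens=50, maxsize=8):
    queue = asyncio.Queue(maxsize=maxsize)  # bounded: this is the whole point
    high_water = [0]
    producer = asyncio.create_task(produce(queue, n_tokens))
    received = await consume(queue, 0.001, high_water)
    await producer
    return received, high_water[0]
```

Every token is delivered, yet the queue depth never exceeds `maxsize`, because the producer is suspended inside `put` whenever the consumer falls behind. Replacing the bounded queue with `asyncio.Queue()` (unbounded) reproduces the memory-growth failure described above.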

Q: How does streaming compose with tool calls and structured output simultaneously?

Through the SDK’s typed event stream, which interleaves text deltas, tool-call argument deltas, and structured-output deltas in one ordered sequence that the application must demultiplex per event type. A typical agentic interaction emits text deltas while the model is generating a preamble, then argument-delta events (response.function_call_arguments.delta in the OpenAI Responses API) as the model decides on a tool call, then a completion event (response.function_call_arguments.done) when the arguments are complete; the application executes the tool, posts the result back, and the model resumes with more text deltas or another tool call or the final answer. The UI must render each event type as a distinct affordance (a tool call as a "calling tool: get_weather(location=Mumbai)" pill, a structured-output field as a typed input being filled, plain text as the chat bubble), and the application state machine must handle the resumption cleanly when the model continues after a tool result. The hardest shape is structured output with tool calls interleaved (the model is producing a JSON object, calls a tool partway through to fill one field, resumes the JSON), where the partial-JSON parser must hold its state across the tool-call boundary and the schema must be flexible enough to accept fields out of order. The architectural rule is to use the SDK’s typed events end-to-end and never roll your own state machine across these boundaries: the SDK has already solved it correctly and your version will not.
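Demultiplexing per event type can be sketched as a small reducer. The event shapes and type names below (`text.delta`, `tool_call.arguments.delta`, `tool_call.arguments.done`) are hypothetical normalised names chosen for illustration, not any provider's actual schema; in practice they would be mapped from the SDK's typed events.

```python
import json


def demultiplex(events):
    """Fold an ordered event stream into text plus completed tool calls."""
    state = {"text": "", "pending_args": {}, "completed_calls": []}
    for event in events:
        kind = event["type"]
        if kind == "text.delta":
            state["text"] += event["delta"]
        elif kind == "tool_call.arguments.delta":
            call_id = event["call_id"]
            state["pending_args"][call_id] = (
                state["pending_args"].get(call_id, "") + event["delta"]
            )
        elif kind == "tool_call.arguments.done":
            call_id = event["call_id"]
            raw = state["pending_args"].pop(call_id, "")
            # Arguments are parsed only once the SDK says they are complete.
            state["completed_calls"].append(
                {"call_id": call_id, "name": event["name"],
                 "arguments": json.loads(raw)}
            )
        # Unknown event types are ignored so new SDK events do not crash the UI.
    return state
```

The design choice worth noting is that argument fragments are only buffered, never parsed, until the done event arrives; the UI can still show the raw fragment in a tool-call pill while it streams.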

Q: How do I prevent EventSource auto-reconnect from doubling my provider bill?

By deduplicating at the application server using a request ID that the client generates and reuses across reconnects, and by returning a cached partial response on reconnect rather than starting a new LLM call. EventSource’s automatic reconnect re-issues the HTTP request to the application server when the connection drops, and if the application server naively starts a new provider call, the user is billed twice (or more if the reconnect happens repeatedly). The pattern is for the client to send a stable request ID on the initial request and on every reconnect. Because the browser’s native EventSource cannot set custom request headers, carry the ID in the URL (a query parameter, for example) or use a fetch-based SSE client that can send an X-Request-ID header. The application server caches the in-flight stream (or its accumulated chunks) keyed by request ID, with a TTL of a few minutes; on reconnect, it detects the duplicate request ID and either resumes streaming from where it left off (sending only chunks the client has not received, identified by Last-Event-ID) or refuses the duplicate with a clear status. The deeper architectural answer is that LLM streaming is poorly suited to mid-stream resumption because the provider has the model state in memory and cannot be asked to resume from a token offset, so "resumption" in practice means "return the accumulated chunks so far and let the client know the stream is complete (or has failed)". Full mid-stream resumption is generally not worth implementing for chat; voice agents handle it via WebRTC’s connection-recovery primitives rather than SSE reconnect.
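The server-side bookkeeping can be sketched as an in-memory cache keyed by request ID. `StreamCache` and its methods are illustrative names; the sketch assumes event IDs are 1-based chunk counters and that a single process holds the cache (a shared store would be needed behind a load balancer).

```python
import time


class StreamCache:
    """Remember streamed chunks per request ID so reconnects do not re-bill."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._streams = {}  # request_id -> {"chunks", "done", "expires"}

    def start(self, request_id):
        """Return True only if the caller should start a new provider call."""
        self._evict_expired()
        if request_id in self._streams:
            return False
        self._streams[request_id] = {
            "chunks": [], "done": False, "expires": self._clock() + self._ttl,
        }
        return True

    def append(self, request_id, chunk):
        """Store a chunk and return its 1-based event ID."""
        chunks = self._streams[request_id]["chunks"]
        chunks.append(chunk)
        return len(chunks)

    def finish(self, request_id):
        self._streams[request_id]["done"] = True

    def replay(self, request_id, last_event_id=0):
        """Return (event_id, chunk) pairs the client has not seen, plus done flag."""
        entry = self._streams[request_id]
        missed = entry["chunks"][last_event_id:]
        pairs = [(last_event_id + i + 1, c) for i, c in enumerate(missed)]
        return pairs, entry["done"]

    def _evict_expired(self):
        now = self._clock()
        expired = [r for r, e in self._streams.items() if e["expires"] <= now]
        for request_id in expired:
            del self._streams[request_id]
```

On a reconnect, the handler calls `start` with the client's request ID; a `False` result means it should call `replay` with the Last-Event-ID value instead of opening a second provider stream.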

Q: What is the five-stage maturity ladder for production LLM streaming, and where do most teams stall?

Stage 0 is non-streaming: every response is a full POST/response cycle, and the user stares at a spinner for 14 seconds. It is typical of MVPs and internal tools, and unacceptable for customer-facing surfaces above 100 tokens.

Stage 1 is SSE that works on localhost but breaks in production because proxies buffer, with no cancellation propagation (so orphaned generations bill in full) and no TTFT or ITL metrics (so regressions are invisible). It is the typical first-iteration deployment, with predictable production incidents.

Stage 2 is SSE with end-to-end AbortController cancellation, no-buffer headers everywhere, p50/p99 TTFT and ITL on dashboards, and partial-JSON parsing via SDK typed events. The streaming pipeline works and the team can see when it does not.

Stage 3 is structured-output streaming with schema-guided decoding for high-stakes outputs, tool-call UI affordances, reasoning thinking-phase UI, prefix-caching for TTFT optimisation, and multi-transport (SSE for chat, WebSocket for voice, HTTP/2 chunked for structured APIs). The streaming layer is a first-class architectural component.

Stage 4 is multi-provider streaming abstraction with normalised events, a full egress pipeline (PII redaction with chunk buffering, guardrails per N tokens, transform layer), cost-per-stream and cancellation-rate dashboards driving optimisation, an adversarial harness in CI for proxy-buffering and orphaned-generation regressions, and prompt-caching hit-rate driving prefix discipline.

Most teams stall at Stage 2 because Stage 3 looks like a UI project rather than an architecture project, but the cost of not getting there is paid in cancellation rate, in provider bills for orphaned streams, and in user abandonment that the dashboard does not show until it is reflected in retention.

Satyam Kumar

Streaming LLM Response Pattern — SSE, WebSockets, Structured Output, and Backpressure (2026 Architecture)

May 29, 2026 · 21 min read

