Streaming Responses for Frontend Engineers

When the API takes 5 seconds to answer, streaming makes it feel like 200ms. Here's how it actually works under the hood.

If you've used ChatGPT, you've felt the difference between a streaming reply and a non-streaming one. The bot starts answering within 200ms, even though the full answer takes five seconds to finish. That gap — between "first character" and "last character" — is where streaming lives. It's the single biggest UX win you can ship as a frontend engineer at an AI company.

But streaming is also where the most subtle bugs hide. People ship a working chat in an afternoon, then spend the next month fixing edge cases: tabs that go to sleep, users who scroll up, multibyte characters that corrupt mid-stream, cancellation that doesn't actually cancel. Let me walk you through the pieces in the order I wish someone had taught me.

01 What streaming actually is

Streaming isn't magic. It's just an HTTP response that doesn't end immediately. The server keeps the connection open and writes bytes to it as it has them. Your fetch call resolves with a Response object whose body is a ReadableStream — and you read from that stream chunk by chunk until the server says it's done.

Most LLM APIs put a specific format on top of this called Server-Sent Events (SSE). Each "frame" is a data: <json> line followed by a blank line. Your job: read bytes, decode them to characters, split on the blank-line delimiter, parse each frame as JSON, and update your UI.

// res is the Response from a fetch() call; onDelta appends each text delta to the UI.
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const frames = buffer.split('\n\n');
  buffer = frames.pop() ?? '';  // last is partial — save for next iteration
  for (const frame of frames) {
    const data = frame.replace(/^data: /, '');
    if (data === '[DONE]') return;
    onDelta(JSON.parse(data).delta);
  }
}

That loop is the heart of every streaming chat in the world. Memorize it.
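
To put the loop in context, here's one way the surrounding call can look. readStream (the function I'm imagining holds the loop above), sendPrompt, and the /api/chat endpoint are assumptions for illustration, not any particular API:

async function sendPrompt(prompt, onDelta) {
  // POST the prompt; the response body is the stream the loop above reads.
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });
  if (!res.ok || !res.body) throw new Error(`Chat request failed: ${res.status}`);
  await readStream(res, onDelta);   // the reader/decoder loop from above
}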

02 Cancellation, the right way

Streaming UIs need a Stop button, and the Stop button has to actually stop. The wrong way is to set a cancelled boolean in state and check it inside your loop. Your async loop closes over the state value from the render in which it was created, so by iteration ten it's still reading a stale snapshot. The user thinks they cancelled; the stream keeps going.

The right primitive is AbortController. Create one, pass controller.signal to fetch, and call controller.abort() to cancel. The pending fetch (or the in-progress read) rejects with an AbortError, the connection closes, and your loop exits.

const ctrl = new AbortController();
const res = await fetch('/api/chat', { signal: ctrl.signal });
// later, when user hits Stop:
ctrl.abort();

⚠ The trap that catches juniors every time

Don't reach for useState for the cancel flag. An async loop only ever sees the state value from the render it was created in, so it will never observe your update. If you ever feel the urge to write const [cancelled, setCancelled] = useState(false) for cancellation, stop and use a ref or an AbortController instead. State is for rendering. Refs are for control flow.
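
Putting the controller and the ref together, here's a minimal sketch of how this can sit inside a component. sendMessage is a hypothetical handler, and readStream and onDelta are the assumed helpers from the earlier sketch; the parts that matter are the ref holding the controller and the AbortError check.

const ctrlRef = useRef(null);

const sendMessage = async (prompt) => {
  ctrlRef.current?.abort();           // cancel any stream already in flight
  const ctrl = new AbortController();
  ctrlRef.current = ctrl;
  try {
    const res = await fetch('/api/chat', {
      method: 'POST',
      body: JSON.stringify({ prompt }),
      signal: ctrl.signal,
    });
    await readStream(res, onDelta);   // the reader/decoder loop from earlier
  } catch (err) {
    if (err.name !== 'AbortError') throw err;   // aborts are expected, not failures
  }
};

const stop = () => ctrlRef.current?.abort();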

03 The multibyte character trap (the bug nobody warns you about)

Here's a bug that only shows up after you launch internationally. You read a chunk of bytes, convert to a string, render it. Works perfectly in your tests. But every now and then a UTF-8 character — an emoji, a Chinese character, a Hindi vowel — splits across two chunks. Your decode produces the Unicode replacement character (�, U+FFFD). The user sees garbage. You can't reproduce it locally.

The fix is one parameter: always use TextDecoder with { stream: true }. That flag tells the decoder to hold back any partial multibyte sequence until the next chunk arrives.

const decoder = new TextDecoder();
buffer += decoder.decode(value, { stream: true });
//                                ^^^^^^^^^^^^ this prevents the bug

Without the flag, the decoder finalizes whatever bytes you give it, even if they're incomplete. With it, you get correct text every time. Cost is zero. Add it as muscle memory.
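
You can reproduce the failure in isolation by splitting the UTF-8 bytes of an emoji across two fake chunks. The boundary here is contrived, but it's exactly what a network read can hand you:

const bytes = new TextEncoder().encode('🎉');   // four UTF-8 bytes: f0 9f 8e 89
const first = bytes.slice(0, 2);
const second = bytes.slice(2);

// Without the flag, each call finalizes its input: replacement characters, not the emoji.
const broken = new TextDecoder();
console.log(broken.decode(first) + broken.decode(second));

// With { stream: true }, the decoder holds the partial sequence until it completes.
const correct = new TextDecoder();
console.log(
  correct.decode(first, { stream: true }) + correct.decode(second, { stream: true })
);   // "🎉"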

04 When tokens arrive faster than React can render

For most LLM APIs this isn't a problem — tokens arrive at maybe 30 per second and React handles that comfortably. But if you connect to a fast inference endpoint, or render multiple streams concurrently, you can hit a regime where setState-per-token drops frames. The typing indicator stutters, scroll feels sticky, things get janky.

The mitigation is to batch token writes onto requestAnimationFrame. Accumulate incoming chunks in a ref, flush to state once per frame. Your effective update rate caps at 60fps no matter how fast tokens arrive.

const pending = useRef('');   // text that has arrived but not yet been rendered
const rafId = useRef(null);   // id of the currently scheduled animation frame, if any

const onChunk = (chunk) => {
  pending.current += chunk;
  if (rafId.current !== null) return;   // a flush is already scheduled for this frame
  rafId.current = requestAnimationFrame(() => {
    setMessage((m) => m + pending.current);
    pending.current = '';
    rafId.current = null;
  });
};

One thing — don't add this on day one. Measure first. The fastest way to look junior in a senior interview is to optimize before you have a problem.
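
If you do adopt it, two details the snippet above leaves out are worth adding. In a backgrounded tab requestAnimationFrame may not fire, so flush synchronously when the stream finishes, and cancel any scheduled frame if the component unmounts mid-stream. A minimal sketch, using the same pending and rafId refs:

// Call this when the stream completes so the tail of the message isn't left unrendered.
const flush = () => {
  if (rafId.current !== null) {
    cancelAnimationFrame(rafId.current);
    rafId.current = null;
  }
  if (pending.current) {
    setMessage((m) => m + pending.current);
    pending.current = '';
  }
};

// Clean up a scheduled frame if the component unmounts mid-stream.
useEffect(() => {
  return () => {
    if (rafId.current !== null) cancelAnimationFrame(rafId.current);
  };
}, []);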

05 What bites you in production

Things to plan for, in rough order of how often they actually hurt:

  • Network blip mid-stream — the connection drops, you have half a message. Mark it as interrupted in the UI (italic, a small badge) instead of silently leaving partial text. Offer "Retry from here."
  • Backgrounded tabs — browsers throttle timers when a tab loses focus. Your stream may stall and resume. Listen for visibilitychange and show a subtle indicator that the response is paused.
  • Memory growth — at 50KB per message and 200 messages, you're at 10MB of React-tracked strings. Virtualize the message list (react-virtuoso) once you cross ~50 turns.
  • Screen readers — without aria-live="polite" on the assistant message wrapper, blind users get no announcement of the streamed reply.
  • Race on rapid sends — user hits Stop, sends again before cleanup runs. Tag each stream with an id; only apply chunks if their id matches the latest. Otherwise late chunks from the cancelled stream append to the new message. A minimal sketch of that guard follows this list.
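
Here's that id guard, reusing the hypothetical readStream and onChunk helpers from the earlier sketches:

const streamIdRef = useRef(0);

const sendMessage = async (prompt) => {
  const id = ++streamIdRef.current;             // this send owns stream number `id`
  const res = await fetch('/api/chat', {
    method: 'POST',
    body: JSON.stringify({ prompt }),
  });
  await readStream(res, (delta) => {
    if (id !== streamIdRef.current) return;     // a newer send started; drop late chunks
    onChunk(delta);                             // the rAF-batched writer from earlier
  });
};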

Key Takeaways

  • 01 Streaming responses cut perceived latency from seconds to milliseconds — usually the highest-leverage UX work you can ship.
  • 02 The whole pattern is fetch → ReadableStream → TextDecoder → split on \n\n → JSON.parse → setState. Memorize that loop.
  • 03 AbortController is the standard cancellation primitive. Refs (or AbortController) for control flow; state for rendering.
  • 04 Always pass { stream: true } to TextDecoder — without it, multibyte characters get silently corrupted.
  • 05 Profile before adding rAF batching; premature optimization is a senior interview anti-signal.
  • 06 Plan for network blips, backgrounded tabs, memory growth, screen readers, and rapid-send races before they ship.