Curriculum Series

TextDecoder({ stream: true }): why UTF-8 boundaries break naïve streaming

This is the bug that ships to production, sits in your app for months, and gets reported by a Japanese or Hindi user one day saying "the responses look corrupted." Then you spend a frustrating afternoon learning more about UTF-8 than you ever wanted to. Let me save you that afternoon.

The setup: bytes are not characters

When the server streams data back, what arrives at the browser is bytes — Uint8Array chunks. Your code converts those bytes into a string so you can render text. The naive way:

JS
const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  const text = decoder.decode(value); // ← naive
  setMessage(prev => prev + text);
}

For ASCII English, this works. For anything else, it's a time bomb.

Why: how UTF-8 encodes non-ASCII characters

UTF-8 is a variable-width encoding. ASCII characters (a-z, 0-9, basic punctuation) are 1 byte each. Most European accents (é, ñ, ö) are 2 bytes. Most Asian characters (你, 안, あ) are 3 bytes. Emoji and many supplementary-plane characters are 4 bytes.

The byte sequence for "你好" (Chinese for "hello") is:

TEXT
0xE4 0xBD 0xA0 ← 你 (3 bytes)
0xE5 0xA5 0xBD ← 好 (3 bytes)

The high bits of the first byte tell you how many bytes the character takes. The browser knows how to interpret this — if it sees the whole character at once.
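As a sketch of what "the high bits tell you the length" means in code (utf8SequenceLength is an illustrative helper, not a real API):

```javascript
// How many bytes a UTF-8 sequence takes, judged from its lead byte.
// Returns 0 for a continuation byte (10xxxxxx), which can't start a character.
function utf8SequenceLength(byte) {
  if ((byte & 0b10000000) === 0) return 1;          // 0xxxxxxx → ASCII, 1 byte
  if ((byte & 0b11100000) === 0b11000000) return 2; // 110xxxxx → 2 bytes
  if ((byte & 0b11110000) === 0b11100000) return 3; // 1110xxxx → 3 bytes
  if ((byte & 0b11111000) === 0b11110000) return 4; // 11110xxx → 4 bytes
  return 0;                                         // 10xxxxxx → continuation
}

console.log(utf8SequenceLength(0xE4)); // 3 (the lead byte of 你)
```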

The bug: chunk boundaries cut characters in half

The server streams chunks. The TCP layer doesn't know or care about character boundaries. So a chunk might end mid-character:

TEXT
Chunk 1: 0xE4 0xBD 0xA0 0xE5 0xA5 ← 5 bytes: full "你" + 2/3 of "好"
Chunk 2: 0xBD ← 1 byte: the last 1/3 of "好"

When you call decoder.decode(value) on chunk 1 without { stream: true }, the decoder treats the input as complete. It sees the truncated "好" sequence (0xE5 0xA5 followed by end-of-input), can't form a valid character, and replaces it with U+FFFD — the replacement character, which renders as �.

Your user sees: 你�, and on the next chunk a second stray � appears (the lone continuation byte 0xBD is just as invalid on its own). The � is permanently in the message. Refresh the page and it's still there if you saved it to the database. Bug.
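You can reproduce the corruption in a few lines, no server involved:

```javascript
// Split "你好" mid-character and decode each chunk statelessly
const bytes = new TextEncoder().encode('你好'); // E4 BD A0 E5 A5 BD
const decoder = new TextDecoder();
const naive = decoder.decode(bytes.slice(0, 5)) // "你" + replacement character
            + decoder.decode(bytes.slice(5));   // another replacement character
console.log(naive); // "你" followed by U+FFFD characters; "好" never appears
```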

The fix: { stream: true }

Pass { stream: true } to TextDecoder.decode() and the bug disappears:

JS
const decoder = new TextDecoder('utf-8');

while (true) {
  const { value, done } = await reader.read();
  if (done) {
    setMessage(prev => prev + decoder.decode()); // flush the final state
    break;
  }
  setMessage(prev => prev + decoder.decode(value, { stream: true }));
}

What stream: true does internally: when the decoder hits an incomplete byte sequence at the end of the input, instead of inserting a replacement character, it holds the bytes in an internal buffer. On the next call to decode(), those held bytes are prepended to the new chunk and decoding resumes. The decoder is now stateful across calls.

The final decoder.decode() (no argument) is a flush. If anything is left in the buffer when the stream ends, it gets emitted (as replacement characters if it's still invalid). For a well-formed UTF-8 stream the buffer is empty at the end, so forgetting the flush is usually harmless — but writing it is good hygiene.
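The same mid-character split from earlier, decoded statefully, comes out clean:

```javascript
const bytes = new TextEncoder().encode('你好'); // E4 BD A0 E5 A5 BD
const decoder = new TextDecoder();
const text = decoder.decode(bytes.slice(0, 5), { stream: true }) // "你", buffers E5 A5
           + decoder.decode(bytes.slice(5), { stream: true })    // "好"
           + decoder.decode();                                   // flush: ""
console.log(text); // "你好"
```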

Why this is an AI-era problem specifically

Three forces converge:

  1. LLMs are tokenized at the model level, not the byte level. A token is roughly 4 characters of English, but for languages with multi-byte characters, one token can be a single character that's 3 bytes wide. The server streams one token per SSE event or one token per chunk — and each token write is whatever bytes the tokenizer emits. The server has no incentive to align writes to byte boundaries because, from its perspective, "send the token" is the unit of work.

  2. Modern LLMs are trained on multilingual corpora. Even an English-only product gets users asking for translations, code with Unicode strings, emoji in responses (✅ ❌ 📊 are common in chat bots). The bug is latent until those characters show up.

  3. The streams are long. A 2000-token response has 2000 chunk boundaries. The probability of at least one falling mid-character is high.

The other places this bug shows up

TextDecoder isn't the only stateful decoder you'll meet:

  • NDJSON parsers — newline-delimited JSON has the same issue at JSON-object boundaries. A chunk might end mid-object: {"role":"assistant","content":"hel. Naive code does JSON.parse(chunk) and throws. Streaming JSON parsers buffer until the next newline, then parse. Most AI SDKs ship one (@anthropic-ai/sdk, openai's streaming helper, Vercel AI SDK) — but if you're rolling your own, this is the same class of bug as the UTF-8 one.

  • SSE parsing — SSE messages are separated by blank lines (\n\n). A chunk might split a message in half. The browser's built-in EventSource handles this for you. If you're parsing SSE manually because you want more control over headers or want POST-with-SSE (which EventSource doesn't support), you write the buffer-until-\n\n logic yourself.

  • Server-side rendering of streams — Next.js's streaming SSR has the same issue at component boundaries. React's solution is the suspense boundary; the framework handles the byte-level batching for you.
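All three cases above share one shape: buffer until a delimiter, emit complete frames, keep the remainder. A minimal sketch of that pattern (createSplitter is an illustrative helper, not an API from any of the SDKs mentioned):

```javascript
// Generic "buffer until delimiter" splitter: works for NDJSON ('\n')
// and for SSE message framing ('\n\n')
function createSplitter(delimiter, onFrame) {
  let buffer = '';
  return (textChunk) => {
    buffer += textChunk;
    let idx;
    while ((idx = buffer.indexOf(delimiter)) !== -1) {
      onFrame(buffer.slice(0, idx)); // a complete frame
      buffer = buffer.slice(idx + delimiter.length); // keep the remainder
    }
  };
}

// NDJSON usage: one chunk ends mid-object, the next completes it
const objects = [];
const feed = createSplitter('\n', (line) => {
  if (line.trim()) objects.push(JSON.parse(line));
});
feed('{"role":"assistant","content":"hel');
feed('lo"}\n');
console.log(objects); // [ { role: 'assistant', content: 'hello' } ]
```

Note this splitter assumes it's fed strings; in a real pipeline it sits downstream of a TextDecoder running with { stream: true }, so the byte-level and frame-level buffering compose.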

The senior-level mental model

Think of decoders as stateful filters with a buffer. stream: true tells them "I'll be calling you again, save what you can't decode yet." The naive default — synchronous, stateless decoding — is fine for "decode this whole file at once." It is wrong for any streaming scenario.

The pattern generalizes: any time you're processing a byte stream at a higher semantic layer (characters, JSON objects, SSE events, framed messages), you need a stateful parser that can hold bytes across chunks. The choice is between using a library that does this for you (the SDKs, EventSource, Node's readline) or rolling your own with a small ring buffer.

What interviewers probe

  1. "What's the difference between stream: true and not?" — stream: false (the default) treats the input as the complete message. stream: true keeps an internal buffer of trailing incomplete byte sequences for the next call. Forgetting stream: true causes silent corruption on multi-byte characters.

  2. "What if the server uses a different encoding?" — new TextDecoder('utf-16le'), etc. The stream flag works for any encoding. Modern servers should always be UTF-8 — reaching for UTF-16 in an interview answer is usually the wrong instinct.

  3. "Why doesn't the browser's EventSource have this problem?" — because the browser implements SSE parsing in C++, with byte-correct buffering built in. The bug only appears when you read response.body directly. Which is exactly what every modern AI streaming pattern does, because EventSource doesn't support POST or custom headers.

If you can articulate "the bytes-vs-characters distinction is encoding-aware buffering, and stream: true enables that buffering," you're ahead of 90% of candidates on this question.
