
AI Frontend Glossary

32 terms every frontend engineer at an AI company is expected to know. Streaming primitives, LLM API concepts, agent loops, retrieval pipelines, UI patterns, and the SDKs that wire it all together.

Use this as a reference before interviews, when reading job descriptions, or whenever a teammate drops an acronym you do not want to ask about.

01 · 6 terms

Streaming & Network

The plumbing under every "AI types responses live" experience. If you cannot reason about these primitives end-to-end, you cannot debug a streaming bug at 11pm.

Server-Sent Events (SSE)#

One-way HTTP streaming used by every major LLM API.

A simple HTTP streaming protocol where the server responds with content-type text/event-stream and emits frames terminated by a blank line, each of the form data: <payload>\n\n. Used by OpenAI, Anthropic, and most LLM providers because it works through proxies and CDNs that often break WebSockets, and because the EventSource browser API can resume after disconnect via Last-Event-ID. (EventSource only supports GET, so chat clients typically POST with fetch and parse the stream by hand, as below.) The frontend reads chunks, splits on \n\n, parses each JSON payload, and updates state.

// Server emits:
// data: {"delta":"Hel"}\n\n
// data: {"delta":"lo"}\n\n
// data: [DONE]\n\n

const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buf = '';
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buf += decoder.decode(value, { stream: true });
  const frames = buf.split('\n\n');
  buf = frames.pop()!; // keep the trailing partial frame
  for (const frame of frames) {
    if (!frame || frame.startsWith(':')) continue; // skip keep-alive comments
    const data = frame.replace(/^data: /, '');
    if (data === '[DONE]') return;
    onDelta(JSON.parse(data).delta);
  }
}

ReadableStream#

The web platform primitive under fetch().body.

A Web API for byte streams. fetch().body is a ReadableStream. Calling .getReader() returns a reader whose read() resolves with { value, done } chunks. ReadableStream is the base layer underneath every "streaming response" implementation in modern JavaScript. Pair it with TextDecoder to turn bytes into characters, and an AbortController to cancel cleanly.
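Putting those pieces together — fetch, reader loop, decoder, cancellation. A minimal sketch; /api/chat and onChunk are placeholders for your endpoint and handler:

// Sketch: fetch + ReadableStream + TextDecoder + AbortController.
const ctrl = new AbortController();
const res = await fetch('/api/chat', { signal: ctrl.signal });
const reader = res.body!.getReader();
const decoder = new TextDecoder();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  onChunk(decoder.decode(value, { stream: true })); // onChunk: your handler
}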

TextDecoder#

Bytes → strings, with multibyte safety.

A built-in decoder for byte-to-string conversion. Always pass { stream: true } when decoding a chunk inside a streaming loop — it holds back any partial multibyte character (UTF-8 emoji, non-Latin scripts) until the next chunk arrives. Without { stream: true }, you will silently render garbage characters when bytes split mid-codepoint. This is one of the most common silent bugs in hand-rolled SSE clients.

const decoder = new TextDecoder();
buf += decoder.decode(value, { stream: true });
//                           ^^^^^^^^^^^^^^^^ critical

AbortController / AbortSignal#

The standard cancellation primitive for fetch and any signal-aware async work.

A pair of objects that let you cancel an in-flight operation cooperatively. Create a controller, pass controller.signal to fetch (or any API that accepts AbortSignal), and call controller.abort() to cancel. The pending fetch rejects with a DOMException of name AbortError. Use it anywhere you previously would have written a "cancelled" boolean — it propagates correctly through async chains and is recognized by the platform.

const ctrl = new AbortController();
fetch('/api/chat', { signal: ctrl.signal });
// later:
ctrl.abort(); // pending fetch rejects with AbortError

NDJSON / JSON Lines#

Streaming JSON, one object per line.

A simpler alternative to SSE. The server emits a stream where each line is a complete JSON object, separated by \n. Lighter framing than SSE, no event/data prefixes. Used by Anthropic batch APIs and many internal AI pipelines. The parsing trick is the same as SSE: buffer chunks, split on \n, JSON.parse each complete line, keep the trailing partial.
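The consumption loop is the SSE loop with \n as the delimiter. A sketch of the body of the same read() loop; onRow is a placeholder for your per-object handler:

buf += decoder.decode(value, { stream: true });
const lines = buf.split('\n');
buf = lines.pop()!; // keep the trailing partial line
for (const line of lines) {
  if (line.trim()) onRow(JSON.parse(line)); // one complete object per line
}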

Backpressure#

When the producer outpaces the consumer.

In streaming UIs, backpressure happens when tokens arrive faster than React can render them. Symptoms: dropped frames, a stuttery typing indicator, scroll jank. Mitigations: batch token writes onto requestAnimationFrame so state updates are capped at the display's refresh rate (typically 60fps) regardless of arrival rate, or throttle the source at the network layer. Real LLM APIs rarely hit this regime, but fast custom inference (Groq, on-prem GPUs) or many concurrent streams can.
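The requestAnimationFrame batching looks like this (a sketch; setText stands in for your React state setter):

let pending = '';
let scheduled = false;

function onDelta(token: string) {
  pending += token;
  if (scheduled) return;
  scheduled = true;
  requestAnimationFrame(() => {
    scheduled = false;
    const batch = pending;
    pending = '';
    setText((t: string) => t + batch); // one state update per frame
  });
}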

02 · 6 terms

LLM Concepts (frontend-relevant)

You will not be training models, but you will be reasoning about them every day. These are the parts of the model API that shape your UI decisions.

Token#

The unit of LLM input and output.

Roughly four characters of English text, though it varies by language and tokenizer. Pricing, context windows, and rate limits are all denominated in tokens. Important on the frontend because cost dashboards, context-window warnings, and "you are approaching your usage" UIs all need to surface token counts (which the backend usually returns alongside responses).

Streaming vs. non-streaming#

Incremental delivery vs. wait-for-full-response.

A streaming response renders tokens as they arrive — first token typically in 200–600ms, full response in 4–8 seconds. A non-streaming response blocks until done, then arrives all at once. Total time is often identical, but perceived latency for streaming is dramatically lower because the user sees activity immediately. For any user-facing chat or completion, streaming is the default.

Function calling / Tool use#

Model returns structured intent instead of text.

When the model judges that a tool would help, it emits a structured payload — a tool name plus typed arguments — instead of free text. The frontend's job is to display the tool call (often in the agent trace), execute it via your backend, and feed the result back into the conversation. OpenAI calls this "function calling," Anthropic calls it "tool use" — same idea, slightly different schemas.
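The payload the frontend sees is roughly this shape (illustrative — field names differ between providers):

type ToolCall = {
  id: string;                          // correlates the call with its result
  name: string;                        // e.g. 'web_search'
  arguments: Record<string, unknown>;  // parsed from the model's JSON string
};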

Structured output (JSON mode)#

Constrained generation against a JSON schema.

A mode where the model is forced to emit valid JSON matching a schema you provide. OpenAI exposes it as response_format: { type: "json_schema" }; Anthropic uses tool-use coercion. Eliminates the brittle "extract JSON from a free-text response" prompts that fail 5% of the time. Critical for any feature where the LLM output drives downstream UI (cards, forms, list filters).
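OpenAI's request shape, for example (the schema itself is illustrative):

const body = {
  model: 'gpt-4o',
  messages,
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'list_filters',
      strict: true,
      schema: {
        type: 'object',
        properties: { status: { type: 'string' }, limit: { type: 'number' } },
        required: ['status', 'limit'],
        additionalProperties: false,
      },
    },
  },
};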

Context window#

Max tokens the model can attend to per call.

The hard limit on how much text — system prompt + tools + history + retrieved docs + user input + planned output — fits in one model call. Claude 4.5 is 200K tokens, Gemini 1M+, GPT-4o 128K. When you exceed it, the call fails. Frontend implications: warn users approaching the limit; truncate or summarize old messages; stop showing every tool result in full.
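A common mitigation, sketched — drop the oldest messages until an estimated count fits the budget (countTokens here uses the rough 4-chars-per-token heuristic; swap in a real tokenizer):

// Assumption: chars / 4 is a crude estimate; use a real tokenizer in production.
const countTokens = (s: string) => Math.ceil(s.length / 4);

function fitToBudget(messages: { content: string }[], budget: number) {
  const kept = [...messages];
  let total = kept.reduce((n, m) => n + countTokens(m.content), 0);
  while (total > budget && kept.length > 1) {
    total -= countTokens(kept.shift()!.content); // drop the oldest first
  }
  return kept;
}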

Prompt caching#

Server-side caching of repeated prompt prefixes.

Anthropic and OpenAI both let you cache stable prefixes of your prompts (system message, tool definitions, large reference documents). Subsequent requests with the same prefix skip recomputation, reducing latency and cost. The frontend lever: keep the early parts of your prompt stable across calls (do not put dynamic data first), and tag the cacheable portions per the API's convention.
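Anthropic's convention, for example, is a cache_control marker on the last stable block (a sketch; the model name and content are illustrative):

const body = {
  model: 'claude-sonnet-4-5',
  system: [
    { type: 'text', text: SYSTEM_PROMPT },    // stable, first
    {
      type: 'text',
      text: REFERENCE_DOC,                    // large, stable
      cache_control: { type: 'ephemeral' },   // cache everything up to here
    },
  ],
  messages,                                   // dynamic content goes last
};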

03 · 5 terms

Agents

An agent is just a chat that talks to itself. Once you understand the loop, the rest is plumbing.

Agent#

An LLM in a loop that calls tools and decides when it is done.

An LLM wired into a control loop where it can call tools, observe their results, and decide what to do next. Different from a chat: a chat ends when the user replies; an agent ends when the model decides it has finished the task. The trace UI you build is the user's window into this loop.

ReAct loop#

Reason → Act → Observe, repeat.

The standard agent control flow, named for its three steps: Reason (think about what to do), Act (call a tool with arguments), Observe (read the tool's output and update mental state). Loop until the model emits a "done" signal. Almost every production agent — Claude, ChatGPT agents, Devin, Cursor — runs some variant of this pattern.
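The loop stripped to its skeleton (task, callModel, and runTool are hypothetical stand-ins for your input, model API, and tool executor):

type Msg = { role: 'user' | 'assistant' | 'tool'; content: unknown };
const messages: Msg[] = [{ role: 'user', content: task }];

while (true) {
  const reply = await callModel(messages);               // Reason
  if (reply.type === 'done') break;                      // model says finished
  const result = await runTool(reply.tool, reply.args);  // Act
  messages.push(
    { role: 'assistant', content: reply },               // record the call
    { role: 'tool', content: result },                   // Observe
  );
}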

Tool / Function call#

A discrete capability the agent can invoke.

A typed capability exposed to the agent: web_search, code_run, db_query, send_email. Each tool has an input schema (what arguments it accepts) and an output type (what it returns). Frontend implication: the trace UI should render tool inputs and outputs differently depending on which tool — search results as cards, code as a syntax-highlighted block, errors as red callouts. A discriminated union is the right TypeScript shape.
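The discriminated union the paragraph calls for, sketched in TSX (the tool payloads and render components are illustrative):

type ToolResult =
  | { tool: 'web_search'; results: { title: string; url: string }[] }
  | { tool: 'code_run'; stdout: string; exitCode: number }
  | { tool: 'db_query'; rows: Record<string, unknown>[] };

function renderResult(r: ToolResult) {
  switch (r.tool) {
    case 'web_search': return <SearchCards results={r.results} />; // cards
    case 'code_run':   return <CodeBlock text={r.stdout} />;       // code
    case 'db_query':   return <DataTable rows={r.rows} />;         // table
  }
}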

Sub-agent#

An agent spawned by another agent for a sub-task.

Production agents (Devin, Claude Code, Manus) spawn sub-agents to handle delegated work — research a sub-question, refactor a single file, browse a website. Each sub-agent has its own ReAct loop and its own trace. UI implication: traces are recursive — a tree of agent runs, not a flat list. Plan your data shape accordingly: type Step = { ..., children?: Step[] }.
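Filled out, that recursive shape might look like this (fields are illustrative):

type Step = {
  id: string;
  status: 'pending' | 'running' | 'done' | 'error';
  tool?: string;       // which capability this step invoked
  input?: unknown;
  output?: unknown;
  children?: Step[];   // sub-agent runs nest here, recursively
};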

MCP (Model Context Protocol)#

Anthropic's open standard for connecting LLMs to tools and data.

An open protocol with a growing ecosystem of MCP servers — small adapters that expose databases, file systems, APIs, or services to any MCP-compatible client. It reduces the N×M problem of every model integrating with every tool. Frontend implication: MCP-based products often surface a "connected tools" UI where users add and configure servers; expect this surface to grow as more clients (Claude Desktop, Cursor, Zed) adopt it.

04 · 6 terms

Retrieval (RAG)

How AI products talk about data the model never saw during training. Frontend rarely owns the retrieval pipeline, but is on the hook for surfacing the result legibly.

RAG (Retrieval-Augmented Generation)#

Retrieve relevant documents, stuff them in the prompt, then generate.

The standard pattern for "chat with your docs," "search with cited answers," and any product that needs the model to know about content outside its training set. Pipeline: query → retrieve top-K relevant chunks via embedding similarity → optionally rerank → assemble prompt with retrieved context → generate. The Citations Panel UI is the most visible artifact of a RAG pipeline.
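The pipeline as code (embed, vectorSearch, rerank, and generate are hypothetical stand-ins for whatever your stack uses at each stage):

async function answer(query: string) {
  const qVec = await embed(query);                  // query -> vector
  const candidates = await vectorSearch(qVec, 50);  // top-K by similarity
  const top = await rerank(query, candidates, 5);   // optional second pass
  const context = top.map((c, i) => `[${i + 1}] ${c.text}`).join('\n');
  return generate(`Answer using these sources:\n${context}\n\nQ: ${query}`);
}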

Embedding#

A vector representation of text where similar meaning → similar vector.

A fixed-size array (typically 768 to 3072 dimensions) of floating-point numbers that encodes the semantic meaning of a text snippet. Computed by embedding models (OpenAI text-embedding-3-large, Cohere, etc.). Two snippets with similar meaning produce vectors with high cosine similarity. The foundation of every retrieval system. Frontend rarely generates embeddings but may surface similarity scores in admin tooling.
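Cosine similarity, the comparison the whole stack rests on:

// Cosine similarity: 1 = same direction (similar meaning), ~0 = unrelated.
function cosineSimilarity(a: number[], b: number[]) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}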

Vector database#

Storage optimized for nearest-neighbor search over embeddings.

Databases purpose-built for fast approximate-nearest-neighbor search over millions of embeddings — Pinecone, Weaviate, Qdrant, Postgres + pgvector. The retrieval step in RAG hits the vector DB. Frontend implication: search latency budgets are dominated by this lookup (and the embedding step), not by the LLM call. If your "AI search" feels slow, profile retrieval before tuning the model.

Chunking#

Splitting source documents before embedding.

Long documents must be split into smaller pieces (chunks) before embedding, because embedding models have their own context limits and because retrieval at the chunk level is more precise. Strategies: fixed-size (500 tokens), semantic (split on paragraphs or headings), recursive (try paragraph, fall back to sentence). Bad chunking is the #1 cause of bad RAG quality — chunks too small lose context, too large dilute relevance.
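The fixed-size strategy in miniature (character-based for simplicity; real chunkers count tokens and respect semantic boundaries):

// Naive fixed-size chunker with overlap, so context straddles boundaries.
function chunk(text: string, size = 2000, overlap = 200): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size - overlap) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}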

Reranking#

Second-pass scoring of retrieved candidates.

A retriever returns 50 candidate chunks based on cheap embedding similarity; a reranker (typically a cross-encoder model like Cohere Rerank, or a small LLM) scores those 50 more carefully and picks the best 5 to put in the prompt. Adds latency but dramatically improves quality — which is why "AI search" often shows up as multiple sequential steps in your latency budget.

Citation / Grounding#

Linking each generated claim back to its source document.

The practice of attributing every model-generated statement to the source that supported it. Without grounding, the user has no way to verify a claim. Without verification, AI products lose trust. The Citations Panel UI is the standard artifact: inline numbered pills [1] [2] tied to a source list. Perplexity built its entire product around this single UX move.

05 · 6 terms

UI Patterns

The conventions that have emerged across AI products. Following them makes your product feel native; ignoring them makes it feel off.

Pinned-to-bottom auto-scroll#

Auto-scroll only when the user is already near the bottom.

The standard pattern for chat and trace UIs. Track distance from bottom on every scroll event; if within ~60px, treat the user as pinned and auto-scroll on new content; if not, leave them alone but show a "Jump to latest" pill. Forcing scroll unconditionally is the single most-hated chat-UI antipattern.
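The core check (a sketch; container and showJumpToLatestPill are placeholders, and the 60px threshold is yours to tune):

function isPinned(el: HTMLElement, threshold = 60) {
  return el.scrollHeight - el.scrollTop - el.clientHeight < threshold;
}

// On new content:
if (isPinned(container)) {
  container.scrollTop = container.scrollHeight; // keep following
} else {
  showJumpToLatestPill();                       // hypothetical UI hook
}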

Typing indicator#

Animated placeholder before the first token arrives.

The three-dot blinking animation (or "thinking..." text) shown while the user waits for the first token. Bridges the dead time between request and first response. Disappears the instant streaming starts. Variants: model-state indicators ("Searching the web", "Reading sources") show what the model is actually doing — Perplexity-style. Always pair with an aria-live region for screen-reader users.
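The accessibility pairing, sketched in JSX (class name illustrative):

function TypingIndicator({ waiting }: { waiting: boolean }) {
  // aria-live="polite" announces changes to screen readers without moving focus.
  return (
    <div aria-live="polite" className="typing-indicator">
      {waiting ? 'Assistant is thinking…' : null}
    </div>
  );
}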

Optimistic UI for AI#

Render the user's message immediately, before the API responds.

When the user hits send, append their message to the conversation instantly and add a placeholder for the bot response. The user sees activity in the same frame as their click. Roll back if the request fails. Reduces perceived latency by 100–200ms (the network round-trip you would otherwise wait through). Standard in every chat UI; a senior signal if you bring it up unprompted.
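The sequence in code (a sketch; setMessages is a React state setter and sendToApi a hypothetical request helper):

// Optimistically append the user message and a placeholder bot message.
const userMsg = { id: crypto.randomUUID(), role: 'user', content: text };
const botMsg = { id: crypto.randomUUID(), role: 'assistant', content: '' };
setMessages((m) => [...m, userMsg, botMsg]);

try {
  await sendToApi(text); // streams tokens into botMsg as they arrive
} catch {
  // Roll back both messages and surface a retry affordance.
  setMessages((m) => m.filter((x) => x !== userMsg && x !== botMsg));
}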

Slash commands#

Trigger a menu of actions by typing /.

A modern AI input pattern from Notion, Linear, Cursor. Typing "/" at the start or after whitespace opens a menu of commands; fuzzy filter as the user types; arrow keys navigate; Enter selects. Looks trivial; the hard part is focus and caret coordination — the textarea must remain focused while the menu accepts keyboard input.

Inline diff / suggestion#

AI-proposed code changes with hunk-level accept/reject.

The defining UI of AI code editors (Cursor, Copilot Chat, Codeium). Renders proposed changes as a diff with per-hunk Accept and Reject buttons. The user reads, accepts the parts they trust, rejects the rest. Mental model: each hunk is a decision; lines are presentational; files are organizational.

Agent trace#

Vertical list of expandable steps showing what the agent is doing.

How agentic products narrate themselves. Each step shows status (pending/running/done/error), tool name, optional thinking, and structured input/output. Currently-running step pulses; finished steps are static; errors halt or retry. The user's only window into a process that may take minutes — get it wrong and your product feels like a black box.

06 · 3 terms

SDKs & Tools

You are not going to write the streaming and tool-call plumbing from scratch in 2026. Know which library to reach for.

Vercel AI SDK#

Provider-agnostic SDK for streaming, tool calls, and structured output in React/Next.

The de facto standard for AI features in Next.js apps. Provides the useChat hook (handles message state, streaming, scroll), generateText / streamText for one-shot calls, and a unified tool-call interface across OpenAI, Anthropic, Google, and dozens of other providers. Skip the hand-rolled fetch + ReadableStream code and use this unless you have a strong reason not to.
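Typical usage of the core hook (AI SDK v4-style; the API shifts between major versions, so check the current docs):

import { useChat } from '@ai-sdk/react';

function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat({
    api: '/api/chat', // your streaming route
  });
  return (
    <form onSubmit={handleSubmit}>
      {messages.map((m) => <p key={m.id}>{m.role}: {m.content}</p>)}
      <input value={input} onChange={handleInputChange} />
    </form>
  );
}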

AI Elements (shadcn)#

Pre-built React components for chat, citation, agent trace UIs.

A growing collection of drop-in shadcn-style components for AI surfaces — chat windows, message lists, citation pills, agent steps, prompt inputs. Owned by Vercel and aligned with the AI SDK. Fastest path from zero to a polished AI UI; copy the source into your repo and customize.

Streaming SSR with Suspense#

React 18+ pattern for progressively rendering server-rendered pages.

Wrap async parts of your tree in <Suspense>; the server streams the page in chunks as each Suspense boundary resolves. Different from token streaming (which is API-level), but composes with it: a Server Component can render a streaming chat client. This is how Next.js App Router renders fast on slow data.
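The pattern in miniature (Comments is a hypothetical async Server Component):

// The shell streams immediately; the Suspense boundary fills in
// when the slow data behind <Comments /> resolves.
import { Suspense } from 'react';

export default function Page() {
  return (
    <main>
      <h1>Post title</h1>
      <Suspense fallback={<p>Loading comments…</p>}>
        <Comments />
      </Suspense>
    </main>
  );
}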

Spot a term that should be here? Browse the AI interview hub or check the AI machine-coding questions for problems that put these primitives to work.