RAG: where does the frontend's responsibility start and end?

Half the candidates I see for AI-frontend roles confidently say "yes, I've worked with RAG" and then describe building an embedding pipeline — which they didn't, because that's a backend job. The other half say "RAG is a backend thing" and miss the half-dozen places where the frontend matters enormously. Let me untangle this so you can scope what you own and what you don't.

The 60-second definition

RAG — Retrieval-Augmented Generation — is the pattern where, before generating a response, the system retrieves relevant documents and stuffs them into the prompt. So the LLM answers based on a knowledge base it didn't see at training time.

A typical RAG flow:

  1. Index time (offline): documents are chunked, each chunk gets an embedding (a high-dimensional vector), and the embeddings are stored in a vector database (Pinecone, Weaviate, pgvector, Chroma, Turbopuffer, etc.).
  2. Query time: the user's question is embedded, the vector DB returns the top-K nearest chunks, those chunks are concatenated into the prompt, and the LLM generates an answer.
  3. Render time: the answer is shown to the user, often with citations linking back to the retrieved chunks.

Step 3 is the part the frontend mostly owns. Steps 1 and 2 are mostly backend.
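
To make the split concrete, here is the whole query-time path as a server-side sketch. Every helper is a hypothetical stand-in (declared, not implemented) for whatever your stack provides; the payload it returns matches the RagAnswer contract at the end of this piece.

TS
// Sketch of the query-time path. All helpers are hypothetical stand-ins,
// declared here only so the sketch type-checks.
type Chunk = { id: string; title: string; url: string; text: string; score: number };

declare function embedQuery(question: string): Promise<number[]>;   // embedding model, server-side
declare function vectorSearch(v: number[], opts: { topK: number }): Promise<Chunk[]>; // vector DB
declare function buildPrompt(question: string, chunks: Chunk[]): string; // prompt template
declare function generateAnswer(prompt: string): Promise<string>;   // LLM call

async function answerQuestion(question: string) {
  const queryVector = await embedQuery(question);              // the question is embedded
  const chunks = await vectorSearch(queryVector, { topK: 8 }); // top-K nearest chunks
  const prompt = buildPrompt(question, chunks);                // chunks concatenated into the prompt
  const answer = await generateAnswer(prompt);                 // LLM generates
  // Render time is the browser's job; the server just ships text plus sources.
  return {
    text: answer,
    sources: chunks.map(c => ({ id: c.id, title: c.title, url: c.url, chunk: c.text, score: c.score })),
  };
}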

What the frontend does NOT own

Let's get the boundary clear, because this is where senior interviews differentiate candidates who actually know what they're talking about from those who watched a YouTube video on RAG.

Chunking — splitting documents into retrievable pieces (typically 256–1024 tokens) is a backend job. The strategies (fixed-size, sentence-aware, recursive, semantic chunking) live in your ingestion pipeline. The frontend never touches a document for chunking.
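To make "chunking" concrete, here is a toy fixed-size chunker. Real ingestion pipelines count tokens rather than characters and respect sentence boundaries; this sketch only shows the shape of the operation.

TS
// Toy fixed-size chunker with overlap. Real pipelines measure tokens,
// not characters, and usually split on sentence or paragraph boundaries.
function chunkDocument(text: string, size = 1000, overlap = 200): string[] {
  const step = Math.max(1, size - overlap); // guard against overlap >= size
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}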

Embeddings — running text through an embedding model (OpenAI's text-embedding-3-small, Cohere, BGE) is backend. Even when the frontend triggers it (e.g., "user asks question, frontend POSTs the question"), the embedding call happens server-side. You don't ship embeddings to the browser.
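For illustration, one possible body for the embedQuery helper from the earlier sketch, written against OpenAI's documented embeddings endpoint (error handling omitted). The API key lives in a server environment variable, which is precisely why this call can't move to the browser.

TS
// Server-side embedding call, sketched against OpenAI's documented REST shape.
// The API key stays on the server; the browser never sees it.
async function embedQuery(question: string): Promise<number[]> {
  const res = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ model: 'text-embedding-3-small', input: question }),
  });
  const json = await res.json();
  return json.data[0].embedding;
}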

Vector search — querying the vector DB with cosine similarity, MMR (maximum marginal relevance), or HNSW indexes is backend. The frontend doesn't run the index.
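The scoring itself is one line of math: cosine similarity is a normalized dot product. What makes it a backend job is scale; a vector DB computes an approximate version of this across millions of vectors using an index like HNSW, which no browser should attempt.

TS
// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
// Shown for reference; the vector DB runs this (approximately) via an index.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}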

Re-ranking and filtering — after the initial vector retrieval, most production RAG systems use a re-ranker (Cohere Rerank, an LLM-based scorer, or hybrid BM25+vector) to refine the top-K. Backend.

Prompt construction — concatenating the retrieved chunks into a system prompt, often with templates, is backend. The frontend should never see the raw retrieved chunks before they're processed.
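For illustration, one possible body for the buildPrompt helper from the earlier sketch. The template wording is invented; what matters is that this assembly happens server-side, so the browser only ever receives the finished answer plus structured source metadata.

TS
// Hypothetical prompt assembly. The template text is illustrative, not a standard.
function buildPrompt(question: string, chunks: { text: string }[]): string {
  const context = chunks.map((c, i) => `[${i + 1}] ${c.text}`).join('\n\n');
  return [
    'Answer the question using only the sources below.',
    'Cite sources inline with their bracketed numbers, e.g. [1].',
    '',
    `Sources:\n\n${context}`,
    '',
    `Question: ${question}`,
  ].join('\n');
}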

If a candidate says they "implemented RAG on the frontend," they almost certainly mean they consumed a RAG endpoint and built UI for it. That's frontend work, but the framing matters in interviews.

What the frontend DOES own — and these are nontrivial

Here's where the senior frontend role shows up.

1. Citation rendering

When the LLM answers based on retrieved sources, the user wants to know which source a given claim came from. Modern AI chat products (Perplexity, ChatGPT with web search, Claude's Projects feature) render inline citations as superscript numbers [1] [2] that link to the source. Hover for a preview, click for the full document.

The frontend job:

  • Parse the LLM's output for citation markers (formats vary: [1], <cite id="1">, JSON metadata sidecar).
  • Resolve each marker to a source object (URL, title, snippet).
  • Render the citation as an interactive element with hover preview and click-through.
  • Handle the case where the LLM hallucinates a citation that doesn't exist in the retrieved set (it happens — you fall back to "source unknown" or hide the marker).

This is what the Source Citations Panel machine-coding question in this track tests. The pattern is well-defined and lives entirely in the frontend.
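Here is a minimal version of that parse-and-resolve logic, assuming the bracketed [1] format and that source ids match the bracketed numbers (the other marker formats change the regex, not the shape of the logic):

TS
// Split the answer into text and citation tokens, resolving each [n] marker
// against the sources array. Markers that don't resolve are the hallucinated case.
type Source = { id: string; title: string; url: string; chunk: string };
type Segment =
  | { kind: 'text'; value: string }
  | { kind: 'citation'; source: Source }
  | { kind: 'unresolved'; marker: string }; // hallucinated: render as plain text or flag it

function parseCitations(text: string, sources: Source[]): Segment[] {
  const byId = new Map(sources.map(s => [s.id, s]));
  const segments: Segment[] = [];
  let last = 0;
  for (const match of text.matchAll(/\[(\d+)\]/g)) {
    if (match.index! > last) segments.push({ kind: 'text', value: text.slice(last, match.index) });
    const source = byId.get(match[1]);
    segments.push(source ? { kind: 'citation', source } : { kind: 'unresolved', marker: match[0] });
    last = match.index! + match[0].length;
  }
  if (last < text.length) segments.push({ kind: 'text', value: text.slice(last) });
  return segments;
}

The rendering layer then maps citation segments to interactive elements and decides whether unresolved markers are hidden or flagged.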

2. Source preview UI

When the user clicks a citation, what do they see? Options:

  • Modal with full document — heavy, breaks flow, but gives complete context.
  • Side panel with the cited chunk highlighted — Perplexity does this. Strong UX.
  • Inline expansion — the citation expands inline. Lightweight but breaks reading flow.
  • Tooltip on hover — fast preview, good for confidence.

The decision is product/design, but the implementation is frontend. The data the backend returns must include enough metadata (the chunk text, the surrounding paragraph, the source URL) for the frontend to render any of these.

3. Latency masking during retrieval

RAG is slow. A naive RAG flow does: embed query (50–200ms) → vector search (50–300ms) → re-rank (100–500ms) → LLM generation (500–2000ms to first token, or TTFT). The user is staring at a blank screen for seconds.

The frontend hides this with state messages:

  • "Searching the web…" (during embedding + retrieval)
  • "Reading 3 sources…" (during re-rank, with the count)
  • "Synthesizing answer…" (just before TTFT)

Each of these requires the backend to send progress events back. The frontend's job is to render them in a way that feels like progress, not stalling. The visual difference between "Loading…" and "Reading [openai-cookbook.md], [fastapi-docs.md], [stripe-api.md]" is enormous.
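Rendering those events is mostly a mapping exercise. A sketch, assuming the backend streams trace events in the shape of the RagAnswer contract shown later (how you wire the stream itself, SSE or WebSocket, is a separate question):

TS
// Map streamed trace events to user-facing status messages.
// The event shape matches the `trace` field of the RagAnswer contract below.
type TraceEvent = {
  type: 'embedding' | 'search' | 'rerank' | 'generate';
  label: string;
  timestamp: number;
};

const fallbackLabels: Record<TraceEvent['type'], string> = {
  embedding: 'Searching…',
  search: 'Searching…',
  rerank: 'Reading sources…',
  generate: 'Synthesizing answer…',
};

function renderStatus(event: TraceEvent, statusEl: HTMLElement): void {
  // Prefer the backend's specific label ("Reading 3 sources…") over the generic one.
  statusEl.textContent = event.label || fallbackLabels[event.type];
}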

The Agent Trace Renderer machine-coding question covers this pattern — you render a stream of step events as they arrive.

4. Optional: client-side re-ranking

This is a contentious one. Some products (Algolia Recommend, parts of Linear's search) ship a small re-ranker to the client to reorder results based on user behavior. The pattern:

  • Backend returns top-50 chunks (cheap to fetch in one go).
  • Frontend has a tiny model or heuristic that re-ranks based on user signals (recently viewed, currently in this thread, etc.).
  • The top 5 are then sent to the LLM as context.

This is experimental and not the default. Most production RAG systems do not do client-side re-ranking. If an interviewer asks, the right answer is "interesting but rare in practice — the re-ranker quality usually justifies a server roundtrip."
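If an interviewer pushes for a sketch anyway, a heuristic version is just a scoring pass over the fetched chunks. The signal names and weights below are invented for illustration:

TS
// Toy client-side re-rank: boost the server's similarity score with local signals.
// Signal names and weights are illustrative, not tuned values.
type RankedChunk = { id: string; score: number };
type UserSignals = { recentlyViewedIds: Set<string>; currentThreadIds: Set<string> };

function rerankOnClient(chunks: RankedChunk[], signals: UserSignals, topN = 5): RankedChunk[] {
  return chunks
    .map(c => ({
      ...c,
      score:
        c.score +
        (signals.recentlyViewedIds.has(c.id) ? 0.1 : 0) +
        (signals.currentThreadIds.has(c.id) ? 0.2 : 0),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
}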

5. Trust and verification UI

The defining failure mode of RAG is the LLM saying something the source doesn't actually support. (This is what "grounding" means: an answer is grounded when the cited sources genuinely back it up, and this failure is grounding breaking down.) The senior-level frontend question is: how do you let the user verify?

Patterns:

  • Selection-based verify — user highlights a sentence in the answer, clicks "verify," and the UI shows which source chunk it came from with the relevant sentence highlighted.
  • Diff-style — for content extracted from a source, show the original alongside the LLM's reformulation.
  • Confidence indicators — visualize how strongly the answer is grounded (some products show a per-sentence confidence score the LLM emits).

These are pure UX problems. They're senior-level because they require thinking about what would make this user trust the system, not just rendering data.
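Of the three, selection-based verify is the most tractable to sketch. A crude version matches the highlighted sentence against each source chunk by word overlap; a real product would more likely use embeddings computed server-side, but the heuristic shows the shape of the feature:

TS
// Crude "verify" heuristic: find the source chunk that shares the most words
// with the sentence the user highlighted. Real systems would use embeddings.
function findSupportingChunk(
  selection: string,
  sources: { id: string; chunk: string }[],
): { id: string; overlap: number } | null {
  const words = new Set(selection.toLowerCase().match(/\w+/g) ?? []);
  let best: { id: string; overlap: number } | null = null;
  for (const source of sources) {
    const chunkWords = new Set(source.chunk.toLowerCase().match(/\w+/g) ?? []);
    let overlap = 0;
    for (const w of words) if (chunkWords.has(w)) overlap++;
    if (!best || overlap > best.overlap) best = { id: source.id, overlap };
  }
  return best;
}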

The clean boundary, stated as a rule

The frontend handles everything user-facing about retrieved sources: surfacing them, citing them, previewing them, letting users verify them, masking their retrieval latency.

The backend handles everything about producing the right sources: chunking, embedding, vector search, re-ranking, prompt assembly.

The API contract between them is something like:

TS
type RagAnswer = {
  text: string; // the LLM output, with citation markers
  sources: Array<{
    id: string; // matches markers in `text`
    title: string;
    url: string;
    chunk: string; // the retrieved chunk
    score?: number; // similarity score, optional
  }>;
  trace?: Array<{ // optional progress events for streaming
    type: 'embedding' | 'search' | 'rerank' | 'generate';
    label: string; // user-facing label like "Reading 3 sources"
    timestamp: number;
  }>;
};

If your backend returns this shape, the frontend can build a great experience. If your frontend tries to do retrieval itself, you're either missing infrastructure or duplicating it badly.

What interviewers probe

  1. "How would you handle a hallucinated citation?" — graceful: detect that the citation ID doesn't resolve to a source, fall back to plain text without the marker, log the event for monitoring, and consider showing a small warning ("source not found") for transparency.

  2. "What if the retrieved chunks contradict each other?" — backend's responsibility to handle (re-ranking, contradiction detection in the prompt). Frontend job: render all the cited sources clearly so the user can see the disagreement and judge.

  3. "How do you make RAG feel faster?" — start by streaming progress events during retrieval, then stream the answer tokens. Don't block on retrieval before showing anything. The user should see "searching" within 100ms, "reading" within 500ms, "answering" before 2 seconds.

  4. "Where does evaluation live?" — RAGAS, TruLens, custom evals are backend. The frontend's evaluation contribution is making it easy for users to flag bad answers — feedback buttons, retry-with-different-context, source quality voting.

The senior-level move on this question is naming the boundary cleanly. Don't claim things you don't own; don't disclaim things you do. Citations, source preview, latency masking, verification — those are yours, and they're nontrivial.
