Frontend AI Guide

RAG: How AI Products Talk About Your Data

Retrieve, then generate. The pattern under every 'chat with your docs' product, taught from the frontend's perspective.

The model wasn't trained on your company's docs, your customer's PDFs, or last week's Slack threads. So how does Perplexity answer questions with current sources? How does Notion AI know about your notes? How does a "chat with this PDF" feature actually work?

The answer is RAG — Retrieval-Augmented Generation. It's a pipeline, not a model. And while the frontend rarely owns the retrieval bits, you'll spend a lot of time rendering the results, surfacing citations, and explaining latency budgets that are dominated by retrieval, not by the LLM. Let me walk you through the pipeline so you can reason about your product end-to-end.

01 The pipeline, in one sentence

Take the user's question, find the most relevant chunks of your documents, stuff those chunks into the prompt, and let the model generate the answer with that context in front of it.

user question
    │
    ▼
┌─────────────┐    ┌────────────┐    ┌──────────┐
│  Embed      │ →  │  Retrieve  │ →  │  Rerank  │
│  (question) │    │  top-K     │    │  best 5  │
└─────────────┘    └────────────┘    └────┬─────┘
                                          ▼
                                  ┌─────────────────┐
                                  │  LLM generates  │
                                  │  with context   │
                                  └─────────────────┘
                                          ▼
                                       answer + citations

That's the whole thing. Each stage is replaceable; each one has tradeoffs. Let's walk through them.
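
In code, the pipeline is a handful of awaits on the server. Here's a minimal TypeScript sketch; the four helpers are assumptions standing in for whatever embedding model, vector database, reranker, and LLM SDK you actually use:

type Chunk = { id: string; text: string };

// Assumed helpers, not a real SDK -- wire these to your providers.
declare function embed(text: string): Promise<number[]>;
declare function vectorSearch(vector: number[], opts: { topK: number }): Promise<Chunk[]>;
declare function rerank(query: string, chunks: Chunk[], opts: { topN: number }): Promise<Chunk[]>;
declare function generate(input: { system: string; context: Chunk[]; question: string }): Promise<string>;

async function answerQuestion(question: string) {
  const queryVector = await embed(question);                        // embed the question
  const candidates = await vectorSearch(queryVector, { topK: 50 }); // retrieve top-K chunks
  const context = await rerank(question, candidates, { topN: 5 });  // keep the best 5
  const answer = await generate({
    system: "Answer using only the provided context. Cite chunk ids.",
    context,
    question,
  });
  return { answer, sources: context }; // sources ride along so the UI can render citations
}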

02 Embeddings — vectors that mean something

An embedding is a vector — typically 768 to 3072 floating-point numbers — that encodes the semantic meaning of a piece of text. It's computed by an embedding model (OpenAI's text-embedding-3-large, Cohere, Voyage, etc.). Two snippets with similar meaning produce vectors that are close together by cosine similarity.

You don't need to understand the math. You need to understand the property: nearby vectors mean nearby meanings. "How do I cancel my subscription?" and "I want to unsubscribe" produce similar vectors even though they share zero words.
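
If you want intuition for that property, cosine similarity is one small function. A plain TypeScript sketch; the vectors themselves would come from whatever embedding model you call:

// Cosine similarity: close to 1 means "same meaning", close to 0 means unrelated.
// Many embedding APIs return unit-length vectors, in which case the dot product alone is enough.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// cosineSimilarity(embed("How do I cancel my subscription?"), embed("I want to unsubscribe"))
// comes out high even though the strings share zero words.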

The frontend almost never generates embeddings directly. But you'll see similarity scores in admin tools, debug panels, and "why was this result chosen?" UIs.

03 Vector databases and why retrieval is your latency bottleneck

You can't loop over a million embeddings on every request. Vector databases — Pinecone, Weaviate, Qdrant, Postgres + pgvector — are purpose-built for fast approximate-nearest-neighbor search over millions of vectors. They give you the top-K closest matches in tens of milliseconds.
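
To make that concrete, here's what the retrieval query can look like with Postgres + pgvector, sketched in TypeScript with node-postgres. The chunks table and its columns are assumptions; <=> is pgvector's cosine-distance operator:

import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the usual PG* environment variables

// Assumed schema: chunks(id, document_id, text, embedding vector(1536))
async function topKChunks(queryVector: number[], k = 50) {
  const { rows } = await pool.query(
    `SELECT id, document_id, text, embedding <=> $1::vector AS distance
       FROM chunks
      ORDER BY embedding <=> $1::vector
      LIMIT $2`,
    [`[${queryVector.join(",")}]`, k] // pgvector accepts a '[x,y,...]' literal
  );
  return rows; // smallest distance = closest meaning
}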

The frontend implication that nobody warns you about: retrieval latency, not the LLM call, is what dominates how fast your "AI search" feels. A 2-second answer might break down as 800ms retrieval + 400ms reranking + 800ms LLM. If your search feels slow, profile retrieval before you tune the model.
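
One low-effort way to see that breakdown is the standard Server-Timing header: the API route reports per-stage durations, and the browser exposes them on the resource timing entry (and in DevTools). A sketch, with the handler shape and stage names as assumptions:

// Server side: attach per-stage durations to the response.
function reportStageTimings(
  res: { setHeader(name: string, value: string): void },
  timings: { retrieve: number; rerank: number; llm: number }
) {
  res.setHeader(
    "Server-Timing",
    `retrieve;dur=${timings.retrieve}, rerank;dur=${timings.rerank}, llm;dur=${timings.llm}`
  );
}

// Client side: read the breakdown back (same-origin, or with Timing-Allow-Origin set).
function logStageTimings(apiUrl: string) {
  const entry = performance
    .getEntriesByType("resource")
    .find((e) => e.name.includes(apiUrl)) as PerformanceResourceTiming | undefined;
  for (const t of entry?.serverTiming ?? []) {
    console.log(`${t.name}: ${t.duration}ms`); // e.g. retrieve: 800ms, rerank: 400ms, llm: 800ms
  }
}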

Show retrieval steps in your UI when they take noticeable time. Perplexity shows "Reading 4 sources" while it retrieves and reranks; that's not just for show, it's an honest progress indicator.
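
The usual way to get that indicator is to stream status events ahead of the answer tokens. A client-side sketch, assuming a made-up wire format of one JSON event per line (statuses first, then tokens):

type StatusEvent = { type: "status"; stage: "retrieving" | "reranking" | "generating"; sources?: number };
type TokenEvent = { type: "token"; text: string };

async function streamAnswer(
  question: string,
  onStatus: (e: StatusEvent) => void, // e.g. render "Reading 4 sources"
  onToken: (text: string) => void     // append answer text as it arrives
) {
  const res = await fetch("/api/ask", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question }),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep any partial line for the next read
    for (const line of lines) {
      if (!line.trim()) continue;
      const event = JSON.parse(line) as StatusEvent | TokenEvent;
      if (event.type === "status") onStatus(event);
      else onToken(event.text);
    }
  }
}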

04 Chunking — the unsexy step that makes or breaks RAG quality

Long documents have to be split into smaller pieces before embedding. Why: embedding models have their own context limits, and retrieval at the chunk level is way more precise than at the document level. (You don't want to retrieve "the entire 50-page handbook" when the answer is one paragraph.)

How you chunk matters more than almost any other RAG decision:

  • Fixed-size (every 500 tokens) — simple, fast, but cuts ideas in half.
  • Semantic (split on paragraph or heading boundaries) — preserves meaning but variable size.
  • Recursive (try paragraphs, fall back to sentences, fall back to fixed-size) — best of both. The default in LangChain and LlamaIndex; a minimal sketch follows at the end of this section.

Bad chunking is the #1 cause of bad RAG. Chunks too small lose context; chunks too large dilute relevance. If your "chat with docs" feature returns mediocre answers, look at chunking before blaming the model.
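
Here's the recursive idea as a minimal TypeScript sketch. Real splitters (LangChain's RecursiveCharacterTextSplitter, for instance) add chunk overlap and token-aware length counting; maxChars here is a stand-in for a token budget:

// Try to split on big boundaries first; only fall back to cruder splits
// when a piece is still too large.
function recursiveChunk(
  text: string,
  maxChars = 1500,
  separators: string[] = ["\n\n", "\n", ". ", " "]
): string[] {
  if (text.length <= maxChars) return [text];

  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // Nothing left to split on: hard-cut into fixed-size pieces.
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += maxChars) pieces.push(text.slice(i, i + maxChars));
    return pieces;
  }

  const chunks: string[] = [];
  let current = "";
  for (const part of text.split(sep)) {
    const candidate = current ? current + sep + part : part;
    if (candidate.length > maxChars && current) {
      // Current chunk is full: emit it, splitting further with finer separators if needed.
      chunks.push(...recursiveChunk(current, maxChars, rest));
      current = part;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(...recursiveChunk(current, maxChars, rest));
  return chunks;
}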

05 Reranking — why retrieval often happens twice

The retriever returns the top 50 candidate chunks based on cheap embedding similarity. Some of those will be relevant; some won't. A reranker — typically a cross-encoder model like Cohere Rerank — scores those 50 more carefully and picks the best 5 to put in the prompt.

Reranking adds latency (a few hundred ms typically) but dramatically improves quality. Your retrieval recall is high (the right answer is somewhere in the top 50); reranking drives precision (the right answer is in the top 5 you actually use).
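
Conceptually the second pass is just "score every candidate against the query with a more expensive model, keep the best few." A sketch with a hypothetical scoreRelevance helper standing in for the cross-encoder call (a hosted reranker such as Cohere Rerank, or a self-hosted model):

type Chunk = { id: string; text: string };

// Hypothetical cross-encoder call: reads the query and the chunk together, returns a relevance score.
declare function scoreRelevance(query: string, chunkText: string): Promise<number>;

async function rerank(query: string, candidates: Chunk[], topN = 5): Promise<Chunk[]> {
  // Scoring all 50 candidates is the expensive part -- this is where the extra latency comes from.
  const scored = await Promise.all(
    candidates.map(async (chunk) => ({ chunk, score: await scoreRelevance(query, chunk.text) }))
  );
  // Only this handful actually goes into the prompt.
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map((s) => s.chunk);
}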

You'll see reranking steps showing up explicitly in agent traces and "thinking" UIs. Now you know why they're there.

06 Citations: how products earn user trust

Without citations, every AI-generated claim is a confident-sounding stranger. The user can't verify it. They can't trust it. Citations close the loop by linking each generated claim back to the source it came from.

Perplexity built its entire product around citations. Bing Copilot, Brave, You.com, Google's AI Overviews — they all converged on roughly the same UI shape: numbered inline pills [1] [2] [3] tied to a side panel of sources. Hover the pill, see a preview. Click it, jump to the source.
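
The payload shape matters more than the pill styling: generation has to return which source backs which claim, or the frontend has nothing to render. A sketch of one possible response shape and the wiring for inline markers; the field names are assumptions, not any particular API:

type Source = { id: number; title: string; url: string; snippet: string };

type AnswerPayload = {
  answer: string;    // answer text with inline markers, e.g. "Refunds take 5 to 7 days [1][2]."
  sources: Source[]; // what the side panel renders
};

// Turn "[1]"-style markers into links against the sources list. A real UI would render
// pills with hover previews; this only shows the wiring between answer and sources.
function renderWithCitations({ answer, sources }: AnswerPayload): string {
  return answer.replace(/\[(\d+)\]/g, (match, n) => {
    const source = sources.find((s) => s.id === Number(n));
    return source ? `<a href="${source.url}" title="${source.title}">[${n}]</a>` : match;
  });
}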

If you're building any RAG UI, design for citations from day one. They're not a polish item; they're the trust mechanism that keeps your product from being dismissed as "just another hallucination machine."

Key Takeaways

  • 01 RAG is a pipeline: embed query → retrieve top-K → rerank → stuff into prompt → generate. Each stage is replaceable.
  • 02 Nearby embedding vectors mean nearby meanings. That property is the entire foundation.
  • 03 Vector DB latency, not the LLM call, is usually the bottleneck in "AI search" features.
  • 04 Chunking strategy makes or breaks RAG quality more than any other knob. Default to recursive chunking.
  • 05 Reranking trades latency for precision — it's why "AI search" latency budgets look multi-step.
  • 06 Citations are the trust mechanism. Design for them from day one, not as a polish item.