What is RAG? A 5-minute mental model with code

RAG (Retrieval-Augmented Generation) is the pattern where you give a language model your private documents at query time, instead of training the model on them. Search your knowledge base, take the most relevant chunks, stuff them into the prompt, return the model’s answer.

That’s it. Everything called “RAG” is variations on that one idea. This is the 5-minute mental model (the version I wish someone had given me before I built three of these) with the actual code you’d need to build the simplest working version in your stack.

The diagram

user question
      ↓
   embed it
      ↓
  vector DB → top-k closest chunks
      ↓
prompt: “answer using only this context: …”
      ↓
     LLM
      ↓
   answer

Two phases, two schedules. Indexing runs once (or whenever documents change); retrieval + generation runs on every query.

The minimum viable RAG, in eight lines of code

Pseudocode that maps 1:1 to whatever stack you use:

// === one-time, when documents change ===
const chunks = chunk(document, { size: 400, overlap: 50 });
const vectors = await embed(chunks);
db.insert(chunks.map((c, i) => ({ text: c, vector: vectors[i] })));

// === every query ===
const qVec = await embed(userQuestion);
const topChunks = db.findClosest(qVec, { k: 5 });
const context = topChunks.map(c => c.text).join('\n\n');
const prompt = `Use only this context:\n${context}\n\nQ: ${userQuestion}`;
const answer = await llm.complete(prompt);

Real implementations add error handling, citations, and re-ranking. The shape never changes.
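
Citations, for instance, are mostly a numbering convention layered onto the same prompt. A minimal sketch, reusing the variables from the pseudocode above:

// Number each retrieved chunk so the model can cite it.
const cited = topChunks.map((c, i) => `[${i + 1}] ${c.text}`).join('\n\n');
const answer = await llm.complete(
  `Use only this context. Cite chunk numbers like [2]. Say "I don't know" if the answer is missing.\n${cited}\n\nQ: ${userQuestion}`
);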

Why bother? Why not stuff everything in the prompt?

Three reasons RAG beats “just put it all in the context window” once you scale past a few documents:

  • Cost. Sending 50k tokens of context per query at $3 per million input tokens costs $0.15 per query. RAG with 1k tokens of relevant context costs $0.003. 50x cheaper.
  • Latency. Long contexts are slower to process. A 1k-token prompt responds in less than a second; a 200k-token one can take 10 to 30 seconds.
  • Accuracy. Models actually get worse with very long contexts. There’s a well-documented “lost in the middle” effect where information buried in a long prompt gets ignored.

Counterintuitively: tighter is more accurate. RAG isn’t just a cost optimization, it’s a quality one.
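
If you want to sanity-check the cost bullet, the arithmetic is one line (the $3-per-million figure is just the illustrative price used above):

const costPerQuery = (inputTokens, pricePerMillion = 3) =>
  (inputTokens / 1_000_000) * pricePerMillion;

costPerQuery(50_000); // $0.15: the whole corpus in the prompt
costPerQuery(1_000);  // $0.003: 1k tokens of retrieved context (50x cheaper)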

The part that actually matters: chunking

Bad chunking ruins RAG. Get this right and a basic system feels brilliant. Get it wrong and even Opus 4.7 will hallucinate.

The four chunking strategies, ordered from simple to fancy:

  1. Fixed-size with overlap (start here). Split into 300 to 500 token chunks with 50 tokens of overlap. Boring, robust, works on 80% of corpora. The default.
  2. Structural / heading-based. If your docs have natural sections (markdown headings, HTML <h2>s, function definitions), split there. Best for technical docs.
  3. Sentence-aware sliding window. Same as fixed-size but only break at sentence boundaries. Avoids cutting mid-thought.
  4. Semantic. Embed sentences, group consecutive sentences whose embeddings stay close, split where they shift. Slow, expensive, marginally better. Don’t start here.

My rule: ship fixed-size first, measure retrieval quality on real queries, only upgrade if you can prove a specific failure mode.
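
For reference, that fixed-size default is only a few lines. A sketch that matches the chunk() call in the pseudocode above, using whitespace-separated words as a rough stand-in for real token counting (a tokenizer like tiktoken would be more precise):

// Fixed-size chunking with overlap, counting "tokens" as words for simplicity.
function chunk(text, { size = 400, overlap = 50 } = {}) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let start = 0; start < words.length; start += size - overlap) {
    chunks.push(words.slice(start, start + size).join(' '));
    if (start + size >= words.length) break; // last window reached the end
  }
  return chunks;
}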

When is RAG the wrong answer?

RAG is the default tool for “answer questions about my private documents.” It’s the wrong tool when:

  • Your entire dataset fits in the context window. If you have 50 pages of docs and a 200k-token model, just put them in the system prompt. Caching makes this nearly free on repeated calls.
  • Your task doesn’t need facts. Creative writing, simple translation, code generation in well-known languages. These don’t need retrieved knowledge.
  • Your data changes faster than you can re-index. If documents update every minute and your reindex takes 10 minutes, RAG will silently serve stale answers.
  • Users ask exact-keyword questions. “Show me the post titled ‘X’” is keyword search, not semantic search. RAG can find it, but a basic full-text query is faster and more accurate (sketched after this list).
  • You don’t have a clear corpus. RAG over “the whole internet” is just web search with extra steps. Use a search API.
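
For the exact-keyword case, plain Postgres full-text search covers it without any embeddings. A sketch, assuming pgClient is an already-connected node-postgres client and a posts table with title and body columns:

// Keyword lookup: no embeddings, no vector index, just full-text search.
const { rows } = await pgClient.query(
  `SELECT id, title
     FROM posts
    WHERE to_tsvector('english', title || ' ' || body)
          @@ plainto_tsquery('english', $1)
    LIMIT 10`,
  [userQuestion]
);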

The five mistakes I keep seeing

  1. Retrieving too many chunks. “k=20” sounds safe; in practice it dilutes relevance and burns tokens. Start at k=3-5 and go up only if you can show recall is the bottleneck.
  2. Not measuring retrieval separately from generation. When the answer is wrong, did the retriever fail (didn’t find the right chunk) or the generator fail (had the chunk but ignored it)? Without measuring both, you’re tuning the wrong thing.
  3. Embedding the wrong thing. A common mistake: embedding section titles when the answer lives in the body. Embed the text a user’s question will actually match.
  4. Skipping re-ranking. Vector search returns “close enough,” not “most relevant.” A second-pass re-ranker (Cohere’s API, or a cross-encoder) on the top 20 chunks, narrowed to the top 5, often beats any chunking improvement (see the sketch after this list).
  5. Forgetting to re-index. Documents change. Your index doesn’t know unless you tell it. Set up the cron job before you launch.
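
The re-ranking step from mistake 4 looks roughly like this, reusing the variables from the pseudocode above. It follows the shape of Cohere’s Node SDK; exact method and field names vary by SDK version, so treat it as illustrative:

import { CohereClient } from 'cohere-ai';
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

// Over-retrieve with the vector store, then let a cross-encoder pick the best.
const candidates = db.findClosest(qVec, { k: 20 });
const reranked = await cohere.rerank({
  model: 'rerank-english-v3.0',
  query: userQuestion,
  documents: candidates.map(c => c.text),
  topN: 5,
});
const topChunks = reranked.results.map(r => candidates[r.index]);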

A 30-second decision tree

Do you have private/recent knowledge the model wasn’t trained on?
  no  → you don’t need RAG. Just prompt better.
  yes ↓

Does that knowledge fit in your model’s context window?
  yes → stuff it in the system prompt with caching.
  no  ↓

Is the answer findable by exact keyword search?
  yes → use full-text search. Cheaper, faster, more accurate.
  no  → RAG. Start with fixed-size chunking, k=5, and measure.

What I’d use today

  • Embeddings: OpenAI text-embedding-3-small ($0.02 per million tokens), or Voyage if you need top-tier quality.
  • Vector store: pgvector on the same Postgres I’m already running. One database. No extra bill.
  • Chunking: fixed-size 400 tokens with 50 overlap, sentence-aware split.
  • Retrieval: top-20 by cosine, re-ranked to top-5 with Cohere’s rerank API.
  • Generation: Claude Sonnet 4.6 (cheap, smart enough), with a system prompt that says “answer using only the provided context, cite the chunk number, say ‘I don’t know’ if missing.”

That stack costs roughly $0.005 per query at moderate volume. It’s the boring answer. It works.
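
Wired together with node-postgres, the retrieval half of that stack is roughly this (table and column names are illustrative, qVec comes from the pseudocode above, and 1536 dimensions matches text-embedding-3-small):

import { Pool } from 'pg';
const pool = new Pool(); // connection details from the usual PG* env vars

// One-time setup: pgvector extension plus a chunks table.
await pool.query(`CREATE EXTENSION IF NOT EXISTS vector`);
await pool.query(`CREATE TABLE IF NOT EXISTS chunks (
  id bigserial PRIMARY KEY,
  text text NOT NULL,
  embedding vector(1536) NOT NULL
)`);

// Per query: '<=>' is pgvector's cosine-distance operator.
const { rows } = await pool.query(
  `SELECT text FROM chunks ORDER BY embedding <=> $1::vector LIMIT 20`,
  [JSON.stringify(qVec)] // pgvector accepts '[0.1, 0.2, ...]' string literals
);

Those 20 rows then go through the re-rank step from the previous section before the top 5 reach the prompt.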

Going deeper

If you want to understand the building blocks of RAG more deeply, the next things to read are about embeddings (how the “meaning” matching actually works) and tokens (the unit you’re paying for, and the constraint that makes chunking a thing in the first place).

RAG is just one pattern in a bigger toolkit. The full mental model for everything you actually need to ship an AI app is in the field guide.
