AI for builders: a no-BS field guide
Most AI explainers are written by people who haven’t shipped an AI app. They lean on metaphors (“the model is like a brain”) that fall apart the moment you try to debug something at 2am. This guide is the opposite. Tight mental models, only the concepts you’ll actually use, with links to deeper posts when you need them.
If you’ve made one HTTP request in your career, you can read this in 15 minutes and know enough to build.
How to use this guide
- Read top-to-bottom if you’re new. The order isn’t accidental.
- If you came here from Google with a specific question, jump to that section.
- Each section is the 80% mental model. The deep-dive cluster post is the 20% you sometimes need.
- Code is pseudocode for clarity, not production-ready snippets.
/* tokens and context windows */
A token is a chunk of text the model treats as one unit. Roughly: 1 token ≈ 4 characters of English, or 0.75 words. “Hello world” is 2 tokens. A typical sentence is 15 to 20.
Why you care:
- Pricing is per-token. Claude Sonnet 4.6: $3 per 1M input tokens, $15 per 1M output. Math is real.
- Context window is how much the model can “see” at once. Sonnet 4.6: 200k tokens. Opus 4.7: 1M.
- Output limits are separate. Most models cap responses at 8k to 32k output tokens. Need more than that and you're splitting the generation across multiple calls.
The trap: huge prompts feel cheap until you scale. A naive RAG system stuffing 50k tokens of context into every query at $3 per million input runs $0.15 per query. Times 10k queries per day = $1,500 per day. Tighter is cheaper, faster, and almost always more accurate.
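The math above, as a back-of-envelope sketch. The 4-characters-per-token rule and the Sonnet prices from earlier are the only inputs; swap in your own numbers:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters of English per token."""
    return max(1, len(text) // 4)

def cost_per_query(input_tokens: int, output_tokens: int,
                   input_price_per_m: float = 3.0,
                   output_price_per_m: float = 15.0) -> float:
    """Dollar cost of one call at per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# The naive-RAG trap from above: 50k input tokens, 10k queries a day.
per_query = cost_per_query(input_tokens=50_000, output_tokens=500)
per_day = per_query * 10_000
```

Note the output side: a 500-token answer adds another $0.0075 per query, which is why per-day figures creep past the input-only estimate.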
/* embeddings */
An embedding is a list of numbers, usually 768 to 3072 of them, that represents the “meaning” of a piece of text. Two pieces of text that mean similar things have embeddings that are mathematically close. That’s it.
You don’t read the numbers. You don’t interpret them. You just compare them.
The use case: take a user’s question, embed it, find the documents in your database whose embeddings are closest, return them. That’s search-by-meaning. It’s the foundation of RAG.
Generate embeddings via an API (OpenAI text-embedding-3, Voyage, Cohere). Cheap, typically $0.02 to $0.13 per 1M tokens. Store them in a vector DB (Pinecone) or, much simpler, a Postgres extension (pgvector).
Reach for embeddings any time keyword search fails because your users say things differently than you wrote them. Which is always.
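The "compare, don't interpret" part in code. A minimal sketch assuming you've already fetched vectors from an embeddings API; real vectors have hundreds of dimensions, the toy ones in the test just show the mechanics:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Closeness of two embedding vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], docs, k: int = 3):
    """docs: list of (doc_id, embedding). Returns the k closest by cosine."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

In production you'd let pgvector or your vector DB do this sort; the point is that "search by meaning" bottoms out in a distance computation this simple.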
/* RAG (retrieval-augmented generation) */
RAG is one pattern: take a user query, search your knowledge base, stuff the relevant chunks into the prompt, and ask the model to answer using them.
- Index: chop your documents into chunks (typically 200 to 500 tokens each), embed each chunk, store them.
- Retrieve: when the user asks something, embed their question, find the closest chunks.
- Generate: put those chunks in the system prompt with “answer using only this context,” call the model.
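The three steps above, sketched end to end. `embed` and `call_llm` stand in for real API clients (an embeddings endpoint and a chat endpoint) and are assumptions here, passed in as functions so the pipeline shape stays visible:

```python
def chunk(document: str, max_chars: int = 1200) -> list[str]:
    """Index step: naive fixed-size chunks (~300 tokens each)."""
    return [document[i:i + max_chars]
            for i in range(0, len(document), max_chars)]

def build_index(documents: list[str], embed):
    """Embed every chunk once, store (text, vector) pairs."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]

def answer(query: str, index, embed, call_llm, k: int = 3) -> str:
    """Retrieve + generate: find the k closest chunks, prompt with them."""
    q = embed(query)

    def dist(vec):  # squared euclidean -- fine for a sketch
        return sum((a - b) ** 2 for a, b in zip(q, vec))

    chunks = [text for text, vec in sorted(index, key=lambda it: dist(it[1]))[:k]]
    context = "\n---\n".join(chunks)
    system = f"Answer using only this context:\n{context}"
    return call_llm(system=system, user=query)
```

Real systems replace the naive fixed-size `chunk` first; that's where the art is.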
Why not just dump all your docs in the prompt? Cost (huge prompts are expensive), latency (long contexts are slow), and accuracy (models genuinely lose the plot in long contexts).
The art is in step 1. Bad chunking equals bad retrieval equals garbage in the prompt. Get chunking right and even GPT-3.5 looks brilliant. Get it wrong and Opus 4.7 hallucinates.
/* tool use / function calling */
Tool use is how you give a model the ability to do things (call your APIs, query your DB, send emails) instead of just generating text.
Mechanically: you define a list of “tools” (functions) with names, descriptions, and parameters. You include the list in the API call. The model decides whether to call one, returns a structured request (“call get_weather with city=Berlin”), and you actually run the function. You send the result back in the next call.
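That round trip, as a runnable sketch. The exact JSON field names vary by provider (the `input_schema` shape below follows the general JSON Schema pattern, an assumption here), and the stub `get_weather` stands in for your real function:

```python
# A tool definition: name, description, parameters. You send this list
# with the API call; the model decides whether to use it.
tools = [{
    "name": "get_weather",
    "description": "Current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def get_weather(city: str) -> str:
    return f"18C and cloudy in {city}"  # stub -- call a real API here

HANDLERS = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> str:
    """The model returned a structured request; you actually run it."""
    return HANDLERS[tool_call["name"]](**tool_call["input"])

# The shape of what the model sends back:
result = dispatch({"name": "get_weather", "input": {"city": "Berlin"}})
```

You then send `result` back as the next message, and the model writes its answer around it.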
The same idea has three names depending on who’s selling it:
- Function calling: OpenAI’s name, oldest.
- Tool use: Anthropic’s name, current standard.
- MCP (Model Context Protocol): Anthropic’s open spec for connecting tools that any app can use.
Functionally identical at the per-call level. Pick MCP if you want your tools usable across providers and apps (this is the future). Otherwise tool use is fine.
/* agents (and why most aren’t) */
Strip away the marketing and an “AI agent” is a while-loop that calls an LLM, parses the response, runs whatever it asked for, feeds the result back, and repeats until done.
history = [user_message]
done = False
while not done:
    response = llm.call(history)
    history.append(response)
    if response.has_tool_call:
        history.append(execute(response.tool_call))
    else:
        done = True
That’s an agent. Everything else is engineering. How do you stop infinite loops? What if the model picks the wrong tool? How do you persist state between turns? How do you handle failures from the tools themselves?
Real “agentic” systems care about three things. Planning (does the model think before acting?), recovery (can it fix its own mistakes?), and stopping conditions (when is the task actually done?). The interesting work is in those constraints, not in the loop.
If a tool sells you “agents,” ask: what’s the loop, what’s the stopping condition, what’s the recovery story? If they can’t answer, it’s marketing.
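Those three constraints, folded into the loop. A sketch, not a framework: `llm_call` and `execute` are passed in as stand-ins for your model client and tool runner:

```python
def run_agent(llm_call, execute, history, max_steps: int = 10):
    """The while-loop from above, plus the guards that make it shippable:
    a step budget (stopping condition) and tool-failure capture (recovery)."""
    for _ in range(max_steps):
        response = llm_call(history)
        history.append(response)
        tool_call = response.get("tool_call")
        if tool_call is None:
            return response  # model answered in plain text: done
        try:
            result = {"role": "tool", "content": execute(tool_call)}
        except Exception as err:
            # Feed the failure back so the model can fix its own mistake.
            result = {"role": "tool", "error": str(err)}
        history.append(result)
    raise RuntimeError("agent exceeded step budget")
```

The step budget is the crudest possible stopping condition; real systems also check for repeated identical tool calls and task-level success criteria.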
/* streaming */
Streaming means the model returns its response as a series of small pieces (usually a few tokens at a time) instead of one big payload at the end. Same total content, much better perceived speed.
Use it when:
- Users wait on a response (chat, completions, anything interactive).
- The output is long enough that total time exceeds about 2 seconds.
- You want to show partial results as they arrive.
Don’t use it when:
- You need the complete response before you can do anything (validation, structured output extraction, tool calling).
- The output is so short that streaming adds more overhead than it saves.
- You’re building a backend pipeline that no human watches.
The trap: streaming a JSON object means users see half-formed JSON before it’s valid. Either render only complete keys, or use a streaming-safe parser (Anthropic ships one). Don’t stream raw partial JSON to the UI.
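One way to dodge the half-formed-JSON problem without a special parser: buffer and parse. A minimal sketch (a real streaming-safe parser yields complete keys incrementally; this simpler version just waits for the whole object):

```python
import json

def render_complete_json(chunks):
    """Accumulate streamed text; yield the parsed object only once the
    buffer is valid JSON. Until then the UI shows nothing (or a spinner)
    instead of half-formed JSON."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        try:
            yield json.loads(buffer)
        except json.JSONDecodeError:
            continue  # not complete yet; keep buffering
```

The tradeoff is obvious: you give up the perceived-speed win for the structured part of the response, which is usually fine because structured outputs tend to be short.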
/* evals */
Evals are how you know if your prompts actually got better. Without them, every change feels like an improvement and you can’t prove anything.
The minimum viable eval setup for an indie builder, in three steps:
- Pick 20 real examples of inputs your app actually gets.
- For each, write down what a good response looks like (or three things it must contain, three things it must not).
- Every time you change the prompt, run all 20 through it. Score each by hand or with a separate LLM-as-judge call.
That’s it. No frameworks needed. Graduate to Promptfoo, Braintrust, or Langfuse when you outgrow a Google Sheet.
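The whole three-step setup fits in a dozen lines. A sketch of the hand-scoring variant, where each case carries its must-contain and must-not-contain phrases and `generate` is your prompt-plus-model call, passed in as a function:

```python
def score(response: str, must_contain: list[str], must_not: list[str]) -> bool:
    """Pass/fail: every required phrase present, every banned one absent."""
    text = response.lower()
    return (all(phrase.lower() in text for phrase in must_contain)
            and not any(phrase.lower() in text for phrase in must_not))

def run_evals(generate, cases) -> float:
    """cases: list of (input, must_contain, must_not). Returns pass rate.
    Run this after every prompt change; watch the number, not your vibes."""
    passed = sum(score(generate(text), must, must_not)
                 for text, must, must_not in cases)
    return passed / len(cases)
```

Swap the substring check for an LLM-as-judge call when phrase matching gets too brittle; the loop stays the same.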
The thing nobody admits: most teams don’t have evals. They ship by vibes. The teams whose AI products feel reliable are the ones with even minimally rigorous evaluation. It’s an unfair advantage you can build in an afternoon.
/* fine-tuning vs prompting vs RAG */
When do you reach for each?
- Prompting: the model doesn’t know how to do this task in the format you want. Fix: write better instructions and examples in the prompt. Cheapest, fastest, the right answer 90% of the time.
- RAG: the model doesn’t know your private or recent information (your docs, your customer’s account, today’s news). Fix: retrieve and inject the relevant context at query time. Right answer when knowledge is the constraint.
- Fine-tuning: the model can do the task in principle but won’t reliably do it in your specific style or format, no matter how you prompt. Fix: train on examples. Right answer when style or format consistency is the constraint, and you have hundreds of high-quality labeled examples.
The trap: people reach for fine-tuning when prompting would have worked. Fine-tuning costs more ($50 to $500 per training run, then ongoing inference markup), takes longer, and locks you to a specific model version. Prompt first. RAG when you’re missing knowledge. Fine-tune almost never.
The field guide in 50 words
Tokens are the unit. Embeddings are meaning-as-numbers. RAG is search-then-prompt. Tool use is letting the model act. Agents are while-loops with tools. Stream when users wait. Eval before you ship. Prompt first, RAG when knowledge is the gap, fine-tune almost never.
That’s the field guide. The deeper dives go below. Click into whichever concept you’re currently trying to ship. ↓