Tokens, context, and why your prompt costs more than you think
A token is the unit your prompts are measured (and billed) in. It’s not a character, not a word, not a sentence. It’s a sub-word chunk the model treats as one piece. Get the mental model right and you’ll stop being surprised by your API bill, hit fewer context limits, and write prompts that get more accurate answers.
The short version
- 1 token ≈ 4 characters of English, or 0.75 of a word.
- You pay per token, both ways. Output is usually 5x the input rate.
- Context windows are limits, not budgets to fill.
- Tighter prompts are cheaper, faster, AND more accurate. Win-win-win.
What a token actually is
Models don’t see characters. They see tokens, the chunks the tokenizer splits your text into. Common words usually get a single token; rare words get split into pieces.
- “Hello world” → 2 tokens
- “The cat sat” → 3 tokens
- “Antidisestablishment” → 5 tokens (split into sub-pieces)
- “👋🏽” → 3-4 tokens (emoji are expensive)
- Code and JSON → more tokens per character than prose
- Non-English → often 2-4x more tokens per character
The rule of thumb is good enough for budgeting: characters ÷ 4 ≈ tokens. If you need an exact count, run your text through Anthropic’s token-counting endpoint (or OpenAI’s tiktoken library) before sending.
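If you want that budgeting number in code, the rule of thumb is one division. This is a rough sketch, not a real tokenizer; the helper name and sample string are invented for illustration:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the chars-per-4 rule of thumb.

    Good enough for budgeting English prose; expect code, JSON, emoji,
    and non-English text to run higher than this estimate.
    """
    return max(1, len(text) // 4)


prompt = "Summarize the attached support ticket in two sentences."
print(estimate_tokens(prompt))  # ~13 estimated tokens for a 55-character string
```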
Why this matters: the math is brutal
Pricing is linear with token count, but usage scales with whatever you forget to optimize. Same task, two implementations:
- Lazy: stuff 50k tokens of context into every query.
- Tight: retrieve 1k tokens of relevant context per query.
At 10k queries per day on Claude Sonnet 4.6 ($3 per M input):
- Lazy: 50k × 10k × $3/M = $1,500 per day = $45,000 per month
- Tight: 1k × 10k × $3/M = $30 per day = $900 per month
- 50x cheaper for the same product.
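The same arithmetic in code, for plugging in your own numbers. Nothing here calls a real API; the figures simply match the scenario above:

```python
PRICE_PER_MILLION_INPUT = 3.00   # $ per 1M input tokens, per the scenario above
QUERIES_PER_DAY = 10_000

def daily_cost(tokens_per_query: int) -> float:
    """Input-token cost per day for one retrieval strategy."""
    return tokens_per_query * QUERIES_PER_DAY * PRICE_PER_MILLION_INPUT / 1_000_000

lazy = daily_cost(50_000)   # $1,500/day -> $45,000/month
tight = daily_cost(1_000)   # $30/day    -> $900/month
print(lazy, tight, lazy / tight)  # 1500.0 30.0 50.0
```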
For real numbers from real projects, see Claude API pricing: real costs from 6 months of use.
Output tokens: the invisible cost
Output tokens cost roughly 5x what input tokens do. A 100-token prompt that generates a 2,000-token essay costs more than a 2,000-token prompt that returns a one-word classification.
Two practical levers:
- Set `max_tokens` explicitly on every call. Without it, the model fills whatever budget the API allows. Saying “at most 200 tokens” in the prompt also helps but is less reliable than a hard cap.
- Tell the model to be terse. “Answer in 1-2 sentences” cuts output by 5x to 10x compared to the model’s default chatty mode.
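A minimal sketch of both levers with the Anthropic Python SDK: a hard `max_tokens` cap plus a terseness instruction in the prompt itself. The model id and prompt text are placeholders; swap in whatever you actually run.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder; use your model of choice
    max_tokens=200,              # hard cap: output billing stops here no matter what
    messages=[{
        "role": "user",
        "content": "Classify this ticket as billing, bug, or other. "
                   "Answer in one word: 'My card was charged twice.'",
    }],
)
print(response.content[0].text)
```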
Context windows: the trap of having more than you need
The context window is the maximum total tokens the model can “see” in one request: prompt, history, retrieved context, and output all count against it.
- Sonnet 4.6: 200k tokens (about 150 pages)
- Opus 4.7: 1M tokens (about 750 pages)
- Haiku 4.5: 200k tokens
Just because you can stuff 1M tokens in doesn’t mean you should. Three reasons:
- Lost in the middle. Information buried in long contexts gets ignored. Models pay attention to the start and end far more than the middle. This is documented across every major model.
- Latency. A 200k-token prompt takes 10 to 30 seconds to process. A 1k-token one returns in under a second. Users notice.
- Cost. See above. The math doesn’t care about your context window; it cares about how many tokens you actually send.
Use RAG to retrieve only the relevant context per query, instead of stuffing everything in. That’s the whole point of RAG.
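A schematic of that retrieve-then-prompt pattern. The `embed` function is a hypothetical stand-in, since any embedding API or vector database slots in here; only the shape of the flow matters.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding call; swap in whichever embedding API you use."""
    raise NotImplementedError

def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k document chunks most similar to the query (cosine similarity)."""
    q = embed(query)

    def score(chunk: str) -> float:
        c = embed(chunk)
        return float(q @ c / (np.linalg.norm(q) * np.linalg.norm(c)))

    return sorted(chunks, key=score, reverse=True)[:k]

# Instead of sending the whole corpus, send only the few most relevant chunks:
# context = "\n\n".join(top_k_chunks(user_question, all_chunks))
```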
The hidden token costs nobody warns you about
- System prompts. Charged on every call. A 5,000-token system prompt is 5,000 tokens billed for every single request. Tighten it, or cache it.
- Few-shot examples. Same problem. Three good examples of 200 tokens each add 600 tokens to every call.
- Tool definitions. Each tool you offer the model adds tokens. 10 tools × 100 tokens of schema = 1,000 tokens, every call.
- Conversation history. Grows linearly. By message 20, history dominates the input. Truncate or summarize (see the sketch after this list).
- Retried failures. A failed call costs the same as a successful one. Validate inputs.
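For the conversation-history item above, a minimal truncation sketch: keep the most recent turns that fit a token budget, using the rough chars-per-4 estimate from earlier. The message format and budget number are illustrative.

```python
def truncate_history(messages: list[dict], budget_tokens: int = 2_000) -> list[dict]:
    """Keep the most recent messages whose estimated tokens fit the budget."""
    kept, used = [], 0
    for msg in reversed(messages):               # walk newest first
        est = max(1, len(msg["content"]) // 4)   # rough chars/4 estimate
        if used + est > budget_tokens:
            break
        kept.append(msg)
        used += est
    return list(reversed(kept))                  # restore chronological order
```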
How to slim a prompt
- Cut filler. Models don’t need “please” or context-setting paragraphs. Be direct.
- Compress instructions. “Format your response as JSON with the keys ‘name’ and ‘email’” becomes “Respond as JSON: {name, email}”.
- Drop redundant examples. Three good few-shot examples beat ten mediocre ones, and cost a third.
- Use prompt caching for any prompt segment that doesn’t change between calls (system prompt, tool defs, fixed examples); a sketch follows this list.
- Use a smaller model for the parts that don’t need a big one. Haiku 4.5 at $0.80 per M is 18x cheaper than Opus.
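A sketch of the caching lever with the Anthropic SDK: mark the stable prefix (here, a long system prompt) with `cache_control` so repeat calls within the cache window read it at a heavily discounted rate (cache writes cost slightly more than normal input; check current docs for minimum cacheable sizes and exact pricing). Model id and prompt text are placeholders.

```python
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "..."  # the big, unchanging instructions you pay for on every call

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=300,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache this stable prefix
    }],
    messages=[{"role": "user", "content": "Today's actual question goes here."}],
)
```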
Count before you scale
The five-minute exercise that prevents bill shock:
- Take a typical input + system prompt + expected output. Count the tokens.
- Multiply by your expected daily query count.
- Multiply by 30. That’s your monthly bill at the model’s rate.
- If it’s ugly, decide what to cut: smaller model, tighter prompts, caching, or RAG to reduce the per-call context.
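The same exercise as a throwaway script. The token counts and rates below are placeholders to swap for your own; the output rate is set at 5x the input rate, matching the ratio discussed earlier.

```python
def monthly_cost(
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    calls_per_day: int,
    input_rate_per_million: float,
    output_rate_per_million: float,
) -> float:
    """Per-call tokens -> daily dollars -> monthly dollars."""
    per_call = (
        input_tokens_per_call * input_rate_per_million
        + output_tokens_per_call * output_rate_per_million
    ) / 1_000_000
    return per_call * calls_per_day * 30

# Example: 3k-token prompt (system + context), 500-token answer, 5k calls/day
print(monthly_cost(3_000, 500, 5_000, 3.00, 15.00))  # ≈ $2,475/month
```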
Tokens are the constraint behind almost every other AI engineering trade-off. Once they’re intuitive, the rest of this stuff stops being magic and starts being math.