Every LLM API charges by the token. Every context window is sized in tokens. Every "why is my bill so high" debugging session ends in tokens. Yet most people use LLMs daily without ever seeing one. Here's what they actually are.
A token is roughly a word piece
When an LLM reads text, the first thing it does is split the text into tokens using a tokenizer. The tokenizer is a fixed lookup table — typically 50,000 to 200,000 entries — built once when the model is trained, and used identically forever.
For English, a token is usually:
- A common word ("the", "computer", "running") — one token each
- A common subword piece ("-ing", "un-", "ization") — one token
- A single character or punctuation mark, used as a fallback for rare or out-of-vocabulary text (an unusual word may cost several tokens)
A rough rule: 1 token ≈ 0.75 English words. So 1,000 tokens is about 750 words, or a page and a half of typical English prose.
For Chinese, Japanese, and Korean (CJK) it's different. Most modern tokenizers (Claude's, GPT's, Gemini's) treat each Chinese character as 1 to 2 tokens. So 1,000 Chinese characters ≈ 1,500-2,000 tokens. Chinese is more expensive per character than English.
For code, tokens run shorter: indentation, brackets, and operators often tokenize separately. A typical line of Python is 5-15 tokens.
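To ground these ratios, here's a minimal sketch using OpenAI's open-source tiktoken library (pip install tiktoken). The counts are specific to this encoding; other vendors' tokenizers will differ somewhat, and the sample strings are arbitrary:

```python
# A sketch with tiktoken; counts depend on the encoding chosen.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era encoding
print(enc.n_vocab)  # vocabulary size, roughly 100K entries

samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "chinese": "今天天气很好，我们去公园散步吧。",
    "python":  "def add(a, b):\n    return a + b",
}
for name, text in samples.items():
    print(f"{name}: {len(text)} chars -> {len(enc.encode(text))} tokens")
```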
You can see the actual tokenization
If you want to ground this in reality, use a tokenizer playground:
- OpenAI: tiktokenizer.vercel.app or platform.openai.com/tokenizer
- Anthropic: count_tokens API
- Hugging Face: any tokenizer can be loaded with `AutoTokenizer.from_pretrained(...)`
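For the programmatic route, loading a tokenizer takes two lines (a sketch; "gpt2" is just an example checkpoint, any model on the Hub works the same way):

```python
# A sketch; "gpt2" is an example checkpoint.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("Tokenization is not the same as splitting on spaces."))
```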
Paste (or print) your text and see exactly which chunks the model sees. Things that surprise people:
- "strawberry" is one token in GPT's vocabulary. That's why GPT historically miscounted the
rs. - Numbers split unpredictably — "1234" might be one token, but "12345" might split into "123" + "45".
- A trailing space matters. "hello" and " hello" (leading space) are different tokens.
- Whitespace before code structure (`{`, `}`, etc.) often gets its own token, which is why code costs more than the character count would suggest.
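Each of these is easy to verify yourself. A sketch, again with tiktoken (the exact splits depend on the encoding you pick):

```python
# Checking the surprises above; outputs vary by encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["strawberry", "hello", " hello", "1234", "12345", "    {"]:
    ids = enc.encode(text)
    print(f"{text!r} -> {len(ids)} token(s): {[enc.decode([i]) for i in ids]}")
```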
Why it matters for cost
API pricing is per million tokens, separately for input and output:
- Input tokens are everything you send: system prompt + conversation history + your message. Cheaper.
- Output tokens are what the model generates back. More expensive (usually 3-5×).
In 2026, typical pricing per million tokens:
- Claude Sonnet: ~$3 input, ~$15 output
- GPT-5: comparable, varies by tier
- Gemini 2.5 Pro: cheaper, ~$1-3 input
- DeepSeek V3: very cheap, often <$1 input
- Open-weight Llama 70B self-hosted: pay per GPU-hour, not per token
This means prompt length is your bill. A 50K-token system prompt on every request can torpedo your unit economics.
Mitigations: prompt caching (5-10× cheaper for repeated prefixes), shorter system prompts, RAG instead of dumping the whole knowledge base, and using cheaper models for easy queries ("LLM routing").
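A back-of-envelope helper makes the stakes concrete. A sketch: the $3/$15 defaults are the Sonnet-class rates above, and the token counts are assumptions, not measurements:

```python
# A sketch: dollars per request at per-million-token prices.
def request_cost(input_tokens: int, output_tokens: int,
                 in_price: float = 3.0, out_price: float = 15.0) -> float:
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# A 50K-token system prompt plus a short question and reply:
c = request_cost(input_tokens=50_000 + 500, output_tokens=800)
print(f"${c:.4f}/request, ${c * 100_000:,.0f} per 100K requests")  # ~$16,350
```

The same arithmetic tells you when a mitigation pays for itself: cut the repeated 50K prefix by 5-10× with prompt caching and recompute.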
Why it matters for context windows
If the model has a 200K context window, the limit is in tokens, not pages. So:
- A 100-page English document (~50K words ≈ 65K tokens) easily fits.
- A 100-page Chinese document (~50K characters ≈ 75K-100K tokens) might just fit.
- A 100-page Python codebase varies wildly depending on density.
When designing an app, plan a token budget. Don't think "the document is 5MB" — count the tokens.
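Concretely, that can be a pre-flight check like this sketch, where count_tokens stands in for whatever tokenizer or counting endpoint you use, and the 4K output reserve is an assumption:

```python
# A sketch of a token-budget check; count_tokens is a stand-in.
from typing import Callable

CONTEXT_WINDOW = 200_000      # the model's limit, in tokens
RESERVED_FOR_OUTPUT = 4_000   # assumed headroom for the reply

def fits(system: str, history: list[str], message: str,
         count_tokens: Callable[[str], int]) -> bool:
    used = sum(count_tokens(t) for t in (system, *history, message))
    return used + RESERVED_FOR_OUTPUT <= CONTEXT_WINDOW
```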
Why it matters for output speed
Models generate text one token at a time. Speed is measured in tokens per second (tps).
- A frontier API like Claude Sonnet does ~50-80 tps for output
- Smaller models (Haiku, Flash) hit 100-300 tps
- Self-hosted with vLLM or TensorRT-LLM you can push 500+ tps for 7B models
If the user wants a 1,000-token response and your model does 50 tps, that's 20 seconds wall-clock time. For chat UIs, this is why streaming matters — show tokens as they arrive, not after the whole response is done.
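The arithmetic generalizes into a one-liner worth keeping around (a sketch; the 0.5 s time-to-first-token is an assumed prefill delay, not a measurement):

```python
# A sketch: wall-clock estimate for a streamed response.
def response_seconds(output_tokens: int, tps: float,
                     time_to_first_token: float = 0.5) -> float:
    return time_to_first_token + output_tokens / tps

print(response_seconds(1_000, tps=50))   # ~20.5 s, frontier model
print(response_seconds(1_000, tps=200))  # ~5.5 s, small/fast model
```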
A quick estimation cheat sheet
Keep these in your head:
- 1 page of English ≈ 500 words ≈ 650 tokens
- 1 page of Chinese ≈ 500 chars ≈ 750-1000 tokens
- 1 typical user chat message ≈ 50-200 tokens
- 1 system prompt for a serious app ≈ 500-3000 tokens
- 1 RAG retrieval (5 chunks of ~300 tokens) ≈ 1500 tokens
- 1 long PDF report ≈ 10K-50K tokens
Multiply by your monthly request volume to get a rough cost forecast before you ship anything serious.
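As a worked example (a sketch: the 500-token reply and the 1M monthly volume are assumptions, the rest comes from the cheat sheet and the Sonnet-class prices above):

```python
# A sketch of a monthly cost forecast from the cheat-sheet numbers.
SYSTEM_PROMPT = 1_500   # tokens, mid-range "serious app" prompt
RAG_CONTEXT   = 1_500   # 5 chunks of ~300 tokens
USER_MESSAGE  = 125     # typical chat message
OUTPUT        = 500     # assumed average reply length

requests = 1_000_000    # assumed monthly volume
inp = SYSTEM_PROMPT + RAG_CONTEXT + USER_MESSAGE
cost = requests * (inp / 1e6 * 3.0 + OUTPUT / 1e6 * 15.0)  # $3 in / $15 out
print(f"~${cost:,.0f}/month")  # about $16,875
```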
When NOT to obsess over token counts
If you're using a chat product (ChatGPT, Claude.ai) on a flat monthly plan, you don't pay per token — you pay $20/month and the product handles fair-use limits. Counting tokens is mostly an API/builder concern.
For casual chat use, the time you'd spend optimizing prompt length is worth more than the token savings.
Further reading
- What is a Large Language Model (LLM)
- What is a context window
- Tokens vs words: how LLM pricing actually works
- Why input tokens cost less than output tokens
- LLM routing: route easy queries to cheap models