LLM pricing is in tokens, not words. The conversion is messy, language-specific, and often surprising. Builders who treat "token" as roughly equivalent to "word" routinely underestimate API bills by 30-100%, especially for non-English content.
What a token actually is
A token is a unit of text the model processes. For English, the rough rule is:
- 1 token ≈ 0.75 English words
- 1 token ≈ 4 characters
- 100 English words ≈ 130-150 tokens
For Chinese, Japanese, Korean, Thai, and other non-Latin scripts, the rules are different and worse:
- 1 Chinese character ≈ 2-3 tokens
- 1 Japanese character ≈ 2-3 tokens
- 1 Korean character ≈ 1.5-2 tokens
A 100-character Chinese article uses 200-300 tokens. The same idea expressed in 100 English words uses ~130 tokens. Same content, ~2× the cost in Chinese.
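You can check the gap yourself. A minimal sketch using OpenAI's tiktoken library with the o200k_base encoding (used by recent OpenAI models); the sentences are illustrative, and other vendors' tokenizers will give different but similarly skewed counts:

```python
# Rough check of the English-vs-Chinese token gap with OpenAI's tiktoken.
# Exact counts depend on the encoding; other vendors' tokenizers differ.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by recent OpenAI models

english = "The strawberry harvest was unusually good this year."
chinese = "今年的草莓收成特别好。"  # roughly the same sentence in Chinese

for label, text in [("English", english), ("Chinese", chinese)]:
    tokens = enc.encode(text)
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens")
```

Expect the Chinese sentence to produce more tokens per unit of meaning, even though it is far shorter in characters.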
Why the disparity exists
LLM tokenizers were trained on data heavily skewed toward English. The Byte-Pair Encoding (BPE) algorithms used by GPT and Claude break English into efficient subwords ("strawberry" → 1-2 tokens) but break Chinese into individual characters or smaller units ("草莓" → 4-6 tokens).
Newer tokenizers (Claude 3.5+, GPT-5, DeepSeek, Qwen) have improved Chinese efficiency, but the gap remains. Models trained primarily on Chinese (Qwen, DeepSeek) tokenize Chinese more efficiently than English-first models do.
The pricing structure
Most APIs charge separately for input tokens (your prompt) and output tokens (the response):
- Input tokens are the cheaper category. They're what you send.
- Output tokens are the expensive category. They're what the model generates. Usually 3-5× more expensive than input.
For Claude 4.5 Sonnet in 2026:
- Input: $3 per million tokens
- Output: $15 per million tokens
For GPT-5:
- Input: $2.50 per million tokens
- Output: $10 per million tokens
For DeepSeek V3:
- Input: $0.14 per million tokens (much cheaper)
- Output: $0.28 per million tokens
The gap between input and output is intentional — it's expensive to generate tokens (compute-intensive), cheap to read them.
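The arithmetic is simple enough to keep in a helper. A minimal sketch using the Claude 4.5 Sonnet prices listed above (swap in your provider's current price sheet; these numbers go stale):

```python
# Per-request cost: tokens / 1,000,000 * price per million,
# with input and output priced separately.
INPUT_PRICE = 3.00    # $ per million input tokens (Claude 4.5 Sonnet, from above)
OUTPUT_PRICE = 15.00  # $ per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE

print(request_cost(500, 300))  # 0.006 -> about six-tenths of a cent per turn
```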
Estimating cost before shipping
A practical mental model for a chatbot processing 1000 conversations per day:
- Average conversation: 5 turns
- Each turn: ~500 input tokens (history) + ~300 output tokens (response)
- Per conversation: ~2500 input + 1500 output tokens
- Per day at 1000 convos: 2.5M input + 1.5M output
- Daily cost on Claude 4.5 Sonnet: $3 × 2.5 + $15 × 1.5 = $7.50 + $22.50 = $30/day = $900/month
Swap in DeepSeek V3 for the same volume: $0.14 × 2.5 + $0.28 × 1.5 ≈ $0.77/day ≈ $23/month. The model choice changes the bill by roughly 40×.
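As a sanity check, here is the same estimate as code, comparing the two price sheets from the previous section (prices hard-coded from the lists above; adjust to whatever your provider charges today):

```python
# 1000 conversations/day, 5 turns each, ~500 input + ~300 output tokens per turn.
PRICES_PER_MILLION = {  # $ per million tokens, from the lists above
    "claude-4.5-sonnet": {"input": 3.00, "output": 15.00},
    "deepseek-v3":       {"input": 0.14, "output": 0.28},
}

CONVOS_PER_DAY, TURNS = 1000, 5
INPUT_PER_TURN, OUTPUT_PER_TURN = 500, 300

daily_input = CONVOS_PER_DAY * TURNS * INPUT_PER_TURN / 1e6    # 2.5M tokens/day
daily_output = CONVOS_PER_DAY * TURNS * OUTPUT_PER_TURN / 1e6  # 1.5M tokens/day

for model, p in PRICES_PER_MILLION.items():
    daily = daily_input * p["input"] + daily_output * p["output"]
    print(f"{model}: ${daily:.2f}/day ≈ ${daily * 30:.0f}/month")
# claude-4.5-sonnet: $30.00/day ≈ $900/month
# deepseek-v3: $0.77/day ≈ $23/month
```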
Hidden cost multipliers
Conversation history. Each request re-sends all prior turns as input, so by turn 10 you're paying for nine earlier turns on top of the new message. Total input cost over a conversation grows quadratically with its length (see the sketch below).
System prompts. A long system prompt is sent on every request. A 1000-token system prompt × 10000 requests = 10M tokens just for the system prompt.
RAG context. Each retrieval-augmented query sends the retrieved chunks as input. 5 chunks × 500 tokens × every query = significant input cost.
Tool use. Tool descriptions and tool call results count as tokens. Complex agents with 20 tool definitions add 2000+ tokens per request.
Multimodal. Images and audio convert to tokens. A 1024×1024 image is ~1500 tokens. Multiple images per request multiply this.
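To see why the history multiplier dominates, here is a back-of-the-envelope sketch, assuming each turn adds ~500 tokens to the history and nothing is ever trimmed:

```python
# Each request re-sends every prior turn, so total input tokens across a
# conversation grow quadratically with its length.
TOKENS_PER_TURN = 500  # assumed tokens added to the history per turn

def total_input_tokens(turns: int) -> int:
    """Input tokens summed over every request in one conversation."""
    # Request t carries t turns of context; 1 + 2 + ... + N = N(N+1)/2 turns total.
    return TOKENS_PER_TURN * turns * (turns + 1) // 2

for n in (5, 10, 20):
    print(f"{n} turns -> {total_input_tokens(n):,} input tokens billed")
# 5 turns -> 7,500   10 turns -> 27,500   20 turns -> 105,000
```

Doubling the conversation length roughly quadruples the input bill, which is why summarizing or truncating old turns matters.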
How to count tokens accurately
Use the official tokenizers:
- OpenAI: the tiktoken Python library
- Anthropic: the count_tokens API or the claude-tokenizer library
- Google: the google-cloud-aiplatform Python SDK
For quick estimates in the browser, OpenAI's web tokenizer and third-party tools like tiktokenizer.com give instant counts. Anthropic has a similar tool.
Never estimate by character count or word count when accuracy matters; use the tokenizer.
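For Anthropic, here is a minimal sketch of server-side counting, assuming the Python SDK's messages.count_tokens endpoint, a placeholder model id, and an ANTHROPIC_API_KEY in the environment:

```python
# Count tokens exactly as the API will bill them, before sending the real request.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

count = client.messages.count_tokens(
    model="claude-sonnet-4-5",  # placeholder model id: use the model you actually call
    system="You are a concise support assistant.",
    messages=[{"role": "user", "content": "草莓多少钱一斤？"}],  # "How much are strawberries per jin?"
)
print(count.input_tokens)  # the prompt's billable input-token count
```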
Cost optimization tactics
Use cheaper models for simple tasks. GPT-5 Mini, Claude Haiku 4.5, and Gemini Flash handle most production work at roughly a tenth of the price of frontier models. Reserve the frontier models for genuinely hard tasks.
Cache aggressively. Anthropic's prompt caching reduces the cost of repeated input by up to 90%. If your system prompt is fixed, cache it (see the sketch below).
Shorten system prompts. Every token of the system prompt is sent on every request. Audit ruthlessly.
Trim context. Don't pass the entire conversation history if only recent context matters. Summarize older turns.
Stream output and cap tokens. Set max_output_tokens to reasonable limits. Long completions are expensive.
Route by complexity. Use a small model first; only escalate to larger when confidence is low.
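Caching is the tactic with the biggest single payoff, so here is a minimal sketch of it, assuming the Anthropic Python SDK's cache_control content-block flag; the model id and prompt are placeholders:

```python
# Mark the large, fixed system prompt as cacheable so repeat requests read it
# from the cache at a heavily discounted rate instead of the full input price.
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "..."  # your fixed, multi-thousand-token system prompt

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache this block across requests
        }
    ],
    messages=[{"role": "user", "content": "What's our refund policy?"}],
)
print(response.usage)  # includes cache-creation and cache-read token counts
```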
The Chinese-builder cost trap
If your audience is Chinese, your prompts and outputs are heavily Chinese. Even though models like Claude 4.5 handle Chinese well, you're paying roughly twice as many tokens for the same amount of content as you would in English. For high-volume products targeting Chinese audiences, consider:
- DeepSeek V3 / Qwen 2.5 (Chinese-optimized tokenization, much cheaper)
- Hybrid routing (Chinese requests to DeepSeek, English to Claude)
- Self-hosted models for high volume
This isn't theoretical. A Chinese-language SaaS doing 10M tokens/day saves thousands of dollars per month by switching to tokenizer-friendly models.
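To make the hybrid-routing option concrete, here is a crude sketch: requests whose text is mostly CJK go to a Chinese-optimized model, everything else to an English-first one. The model names are illustrative placeholders, and a real product would detect language more carefully:

```python
import re

# Han, Hiragana/Katakana, and Hangul ranges; good enough for crude routing.
CJK = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")

def pick_model(text: str) -> str:
    """Send CJK-heavy prompts to the cheaper, CJK-friendly model."""
    cjk_chars = len(CJK.findall(text))
    if cjk_chars > 0.3 * max(len(text), 1):  # more than ~30% CJK characters
        return "deepseek-v3"        # placeholder: Chinese-optimized model
    return "claude-4.5-sonnet"      # placeholder: English-first model

print(pick_model("帮我写一封退货申请邮件"))         # -> deepseek-v3
print(pick_model("Draft a refund request email"))  # -> claude-4.5-sonnet
```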
When NOT to obsess over tokens
For low-volume usage (under 1M tokens/month total), the cost differences are noise. The gap between a $20/month bill and a $5/month bill isn't worth optimizing away. Focus on quality.
For high-stakes user-facing tasks, model quality matters more than cost. A $0.01 cost saving that drops user satisfaction is bad math.
For experimentation, use whichever model is fastest to iterate with. Cost optimization is for production, not prototyping.
Decision framework
- Hobby project: don't think about tokens; pick best model
- Production at scale: count tokens carefully; optimize aggressively
- Chinese-language product: prefer Chinese-optimized models for cost
- Mixed-language product: route by language to optimal tokenizer
- Compliance / privacy bound: self-hosted; tokens still matter for capacity
Next steps
- Use the official tokenizer for whatever model you're building on
- Monitor your daily input/output token volume; surprises hide in unexpected places
- Read about prompt caching specifically; it's the biggest win
- For Chinese content, A/B test DeepSeek vs Claude on quality before committing