Technique

Tokenization

The process of splitting raw text into tokens — the units (sub-words, words, or characters) that an LLM actually processes.

Tokenization splits text into tokens, the small chunks an LLM reads and writes. Modern LLMs use sub-word tokenizers (BPE, SentencePiece, WordPiece) in which common words become a single token ("the", "hello") and rare words split into pieces ("unbelievable" might become "un" + "believ" + "able"). Each token maps to an integer ID, and those IDs are what actually go into the model.

This matters because tokens are the unit of cost, the unit of context, and the unit of what the model actually perceives. Pricing is per-token. Context window limits are per-token. Model performance on a language correlates with how efficiently that language tokenizes: Chinese and Japanese typically take more tokens per character than English on most tokenizers, which is why API calls in CJK languages cost more for the same content.

A concrete example: "hello world" is 2 tokens in GPT-4's tokenizer, while "你好世界" is 4-6 tokens depending on the tokenizer, because each Chinese character can split into multiple byte-level tokens (see the sketch below). This is why Chinese-specialized models like Qwen and DeepSeek invest in better Chinese tokenizers: the same article uses fewer tokens, costs less, and fits into context more easily.

You rarely tokenize manually, but knowing where token boundaries fall explains otherwise weird LLM behavior. Counting the letters in a word is hard for a model because it sees tokens, not characters.

Related: BPE, context window, vocabulary, subword.
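To see these boundaries yourself, here is a minimal sketch using OpenAI's open-source tiktoken library (an assumption; the entry above doesn't name a specific tool), which exposes the byte-pair encodings used by GPT models:

```python
# Minimal sketch: inspect token boundaries with tiktoken
# (pip install tiktoken). Exact counts vary by encoding/model.
import tiktoken

# cl100k_base is the encoding used by GPT-4.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello world", "unbelievable", "你好世界"]:
    ids = enc.encode(text)
    # Recover the raw bytes behind each token; a single CJK
    # character often spans several byte-level tokens.
    pieces = [enc.decode_single_token_bytes(i) for i in ids]
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")
```

Running this shows "hello world" as two tokens and the Chinese string split into byte-level pieces; the exact counts depend on which encoding you load.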

Last updated: 2026-04-29
