Technique

Byte Pair Encoding (BPE)

A sub-word tokenization algorithm that builds a vocabulary by repeatedly merging the most frequent pair of adjacent tokens in the training data.

Byte Pair Encoding starts with a vocabulary of single characters (or bytes) and iteratively merges the most frequent adjacent pair into a new token. After thousands of merges, you end up with a vocabulary where common words are single tokens, rare words split into recognizable sub-words, and any input string can still be encoded, falling back to bytes for emoji, rare scripts, and typos.

It matters because BPE is the tokenizer behind GPT, Llama, Mistral, and most modern open-source LLMs. The algorithm hits a sweet spot: the vocabulary stays manageable (usually 32k-128k tokens), common patterns get efficient single-token encoding, and nothing is unrepresentable. The byte-level variant (used by GPT-2 onwards) means even arbitrary binary data can be tokenized.

A concrete example: with a BPE tokenizer trained on English, "tokenization" might encode as ["token", "ization"], two tokens. A Chinese-centric tokenizer (like Qwen's) handles Chinese text efficiently but might encode "tokenization" as 4-5 pieces. Picking the right tokenizer for your domain matters.

Variants worth knowing: SentencePiece (used by Llama and many multilingual models; treats whitespace as a regular character), WordPiece (used by BERT; the same idea with a different merge-scoring criterion), and Unigram (a probabilistic alternative that prunes a large seed vocabulary instead of growing one).

Related: tokenization, vocabulary, subword.
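To make the merge loop concrete, here is a minimal sketch in Python. It is illustrative rather than production code: the toy word list, the function names, and the first-seen tie-breaking are all assumptions, and real trainers (such as GPT-2's byte-level BPE) add a byte-to-unicode mapping, regex pre-tokenization, and explicit tie-breaking rules on top of this.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent token pairs across all words, weighted by word frequency."""
    counts = Counter()
    for word, freq in corpus.items():
        for pair in zip(word, word[1:]):
            counts[pair] += freq
    return counts

def merge_pair(corpus, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one new token."""
    merged = Counter()
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # fuse the pair into one token
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

def train_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of training words."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(corpus)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges

def encode(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    tokens = tuple(word)
    for pair in merges:
        tokens = next(iter(merge_pair({tokens: 1}, pair)))
    return list(tokens)

merges = train_bpe(["low", "lower", "lowest", "newest", "widest"], num_merges=10)
print(encode("lowest", merges))  # -> ['lowest'] once enough merges cover the word
```

Note that encoding replays the merges in the order they were learned, which is why a BPE tokenizer ships its merge list (or an equivalent ranking) alongside its vocabulary.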

Last updated: 2026-04-29
