Metric

Perplexity

A metric measuring how surprised a language model is by the actual next token: the exponentiated average negative log-likelihood per token. Lower is better.

Perplexity measures how well a language model predicts a held-out text. If the model assigned high probability to the actual next token, perplexity is low; if it was surprised, perplexity is high. Mathematically, it is the exponential of the average negative log-likelihood per token, and lower is better.

It matters because perplexity is the cheapest, most reproducible signal of language modeling quality during pre-training and ablations. You don't need humans, prompts, or task-specific evals, just a held-out corpus and a forward pass. Researchers use perplexity to compare architectures, hyperparameters, and tokenizers before running expensive downstream evaluations.

A concrete example: while training a model like Llama 3, you could plot perplexity on a held-out subset of Wikipedia at every checkpoint. As the model learns, perplexity drops from roughly the vocabulary size at initialization (a uniform guess over tokens) to single digits. When perplexity stops improving, the model has roughly converged or needs more data.

A caveat: perplexity is comparable only across models that share a tokenizer and are evaluated on the same text. A Chinese-tokenized model and an English-tokenized model can't be compared by perplexity alone. Also, low perplexity doesn't guarantee good downstream task performance; some abilities (reasoning, instruction-following) emerge in ways perplexity doesn't capture.

Not to be confused with Perplexity, the AI search company. Related: cross-entropy loss, evaluation, MMLU, scaling laws.
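To make the formula concrete, here is a minimal sketch in Python. It assumes you already have the (natural-log) probability the model assigned to each actual next token in the held-out text; how you obtain those log-probabilities depends on your model and framework, and the function name is just illustrative.

    import math

    def perplexity(token_log_probs):
        """Perplexity = exp of the average negative log-likelihood per token.

        token_log_probs: natural-log probabilities the model assigned to each
        actual next token in the held-out text (one value per token).
        """
        if not token_log_probs:
            raise ValueError("need at least one token")
        avg_nll = -sum(token_log_probs) / len(token_log_probs)
        return math.exp(avg_nll)

    # Toy example: a model that gives probability 0.25 to every actual next
    # token has perplexity 4, i.e. it is "as surprised" as a uniform choice
    # among 4 tokens.
    print(perplexity([math.log(0.25)] * 10))  # 4.0

The toy example also shows why perplexity at initialization sits near the vocabulary size: a uniform guess over V tokens gives each token probability 1/V, so the perplexity is exactly V.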

Last updated: 2026-04-29
