Perplexity

A measure of how "surprised" a language model is by the next token; lower is better. In essence, it is the exponential of the average negative log-likelihood.
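Written out, the definition might look like the following. The symbols are assumptions introduced here for illustration: N is the number of tokens in the held-out text, x_i is the i-th token, and p_θ is the model's predicted distribution given the preceding tokens.

```latex
\mathrm{PPL}(x_{1:N})
  = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\!\left(x_i \mid x_{<i}\right) \right)
```

The term inside the exponential is the average per-token cross-entropy loss in nats, so a training loss curve and a perplexity curve carry the same information on different scales.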

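Perplexity measures how well a language model predicts a held-out text. If the model assigns high probability to the token that actually comes next, perplexity is low; if the model is surprised, perplexity is high. Mathematically, it is the exponential of the average per-token negative log-likelihood, so a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens. Lower is better.

It matters because perplexity is the cheapest and most reproducible quality signal for a language model during pre-training and ablations. It needs no human raters, no prompts, and no task evals: only a held-out corpus and a forward pass. Researchers use it to compare architectures, hyperparameters, and (with per-byte or per-word normalization) tokenizers before deciding whether to run expensive downstream evaluations.

For example, when training a model such as Llama 3, you can plot perplexity on a held-out Wikipedia set at every checkpoint. As the model learns, perplexity falls from roughly the vocabulary size at random initialization (about 128,000 for Llama 3) to single digits. When perplexity stops falling, the model has either converged or needs more data.

Caveats: token-level perplexity is only comparable between models that share a tokenizer, because the average is taken per token and different tokenizers split the same text into different numbers of tokens. A model with a Chinese tokenizer and a model with an English tokenizer cannot be compared directly. Low perplexity also does not guarantee good downstream performance: some capabilities, such as reasoning and instruction following, emerge in ways that perplexity does not capture. Finally, do not confuse the metric with Perplexity, the search engine company.

See also: cross-entropy loss, evaluation, MMLU, scaling laws.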
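As a minimal sketch of the computation described in this entry, the snippet below turns a list of per-token log-probabilities into a perplexity value. The function name `perplexity` and all the numbers are hypothetical and chosen only for illustration; it assumes natural-log probabilities that some model has already produced for the true next tokens of a held-out text.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the mean negative log-likelihood (natural log).

    token_log_probs: one log-probability per token, i.e. the log of the
    probability the model gave to the token that actually came next.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that spreads probability uniformly over 4 tokens has
# perplexity of approximately 4.
uniform = [math.log(1 / 4)] * 10
ppl_uniform = perplexity(uniform)

# A model that gives the true token probability 0.9 every time has
# perplexity of approximately 1 / 0.9.
confident = [math.log(0.9)] * 10
ppl_confident = perplexity(confident)

# Tokenizer caveat (made-up numbers): the same text with the same total
# log-likelihood of -20, split into 10 tokens by one tokenizer and into
# 20 tokens by another, yields different per-token perplexities.
ppl_coarse = perplexity([-20 / 10] * 10)  # fewer, longer tokens
ppl_fine = perplexity([-20 / 20] * 20)    # more, shorter tokens

print(ppl_uniform, ppl_confident, ppl_coarse, ppl_fine)
```

The last two values differ even though the model assigned the same total probability to the text, which is one way to see why per-token perplexity is not comparable across tokenizers without some normalization.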