BLEU score

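An automatic scoring metric for machine translation that measures how much the n-grams in a model's output overlap with those in a reference translation.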

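BLEU (Bilingual Evaluation Understudy) is an automatic scoring metric for machine translation, proposed in 2002. It computes the fraction of the candidate translation's 1-grams, 2-grams, 3-grams, and 4-grams that also appear in the reference translation (with counts clipped to the reference), combines the four precisions as a geometric mean, and multiplies the result by a brevity penalty so that a system cannot game the score by translating only the easy parts.

Why it matters: before BLEU, evaluating MT meant paying people to compare translations. BLEU is cheap, deterministic, and reproducible, and for some 20 years it has been how the field tracks progress. Many SOTA claims in NMT papers still rest on BLEU.

An example: the candidate "the cat sat on mat" against the reference "the cat sat on the mat" matches 5 of 5 unigrams, 3 of 4 bigrams, 2 of 3 trigrams, and 1 of 2 4-grams, and the brevity penalty is mild, so it scores about 58 out of 100. The poor translation "on mat the cat sat" uses exactly the same words, but its n-gram overlap is low (it has no matching 4-gram at all), so it scores much lower.

The big drawback: BLEU does not actually correlate strongly with translation quality. It penalizes reasonable paraphrases ("the feline rested on the rug" against "the cat sat on the mat" scores almost 0), and it is close to meaningless for languages without word boundaries (Chinese, Japanese, Thai), where the score depends on how the text is segmented, and for free-form generation. Modern alternatives include chrF, COMET, BLEURT, and human evaluation. In LLM benchmarking, BLEU is already history.

Further reading: ROUGE, machine translation, evaluation.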
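The cat-and-mat examples above can be checked by hand with a short script. What follows is a minimal sketch, not a reference implementation: it assumes a single reference, whitespace tokenization, and no smoothing, and the function names `ngrams` and `sentence_bleu` are made up for this illustration. Evaluation toolkits typically add smoothing and their own tokenization, so they may report small non-zero scores where this sketch returns exactly 0.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed, single-reference BLEU on whitespace tokens, scaled to 0-100.

    Illustrative helper only (hypothetical name, simplified assumptions).
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as many times
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            # Without smoothing, one n-gram order with no match zeroes the score.
            return 0.0
        log_precision_sum += math.log(matches / total)

    # Brevity penalty: 1 if the candidate is longer than the reference,
    # exp(1 - r/c) otherwise.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, times the brevity penalty.
    return 100 * bp * math.exp(log_precision_sum / max_n)


reference = "the cat sat on the mat"
print(round(sentence_bleu("the cat sat on mat", reference), 1))            # 57.9
print(round(sentence_bleu("on mat the cat sat", reference), 1))            # 0.0
print(round(sentence_bleu("the feline rested on the rug", reference), 1))  # 0.0
```

The first candidate has precisions 5/5, 3/4, 2/3, and 1/2, whose geometric mean is about 0.71, and a brevity penalty of exp(1 - 6/5), about 0.82, which gives roughly 58. The scrambled candidate has no matching 4-gram and the paraphrase has no matching trigram, so both fall to 0 without smoothing.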