
Metric

BLEU

An automatic metric for machine translation quality, comparing n-gram overlap between the model output and one or more reference translations.

BLEU (Bilingual Evaluation Understudy) is a metric introduced in 2002 to score machine translation against human reference translations. It computes clipped (modified) precision for 1-grams through 4-grams, counting each candidate n-gram as a match at most as often as it appears in a reference, combines the four precisions as a geometric mean, and multiplies by a brevity penalty so a system can't game the score by emitting only short, easy fragments. It mattered because before BLEU, evaluating MT meant paying humans for every comparison. BLEU is cheap, deterministic, and reproducible; for two decades it was how the field tracked progress, and many SOTA claims in NMT papers are still reported in BLEU.

A concrete example: the candidate "the cat sat on mat" against the reference "the cat sat on the mat" matches all of its unigrams and half of its 4-grams, takes only a small brevity penalty, and scores around 60 on the 0-100 scale. A scrambled translation like "on mat the cat sat" reuses all the same words but shares far fewer longer n-grams with the reference; without smoothing it actually scores zero, because none of its 4-grams appear in the reference (the sketch below reproduces both numbers).

The big caveat: BLEU correlates with quality only loosely. It punishes valid paraphrases ("the feline rested on the rug" scores near zero against "the cat sat on the mat"), it is nearly meaningless for languages without explicit word boundaries (Chinese, Japanese, Thai) unless the text is segmented first, and it says little about free-form generation. Modern alternatives include chrF, COMET, BLEURT, and human evaluation. For LLM benchmarking, BLEU is mostly historical. Related: ROUGE, machine translation, evaluation.
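To make the mechanics concrete, here is a minimal sentence-level BLEU sketch in Python: single reference, whitespace tokenization, no smoothing. The function names (`bleu`, `ngrams`) are illustrative, not from any library; real tooling such as sacreBLEU also handles multiple references, corpus-level counting, and tokenization.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU (0-100) with one reference and no smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped counts: each candidate n-gram is credited at most as
        # often as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # geometric mean collapses to zero if any precision is zero
    geo_mean = exp(sum(log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else exp(1 - len(ref) / len(cand))
    return 100 * bp * geo_mean

print(bleu("the cat sat on mat", "the cat sat on the mat"))   # ~57.9
print(bleu("on mat the cat sat", "the cat sat on the mat"))   # 0.0 (no 4-gram match)
```

The first call reproduces the "around 60" figure from the example above; the second shows why production scorers apply smoothing before any n-gram precision hits zero.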

Last updated: 2026-04-29
