A family of metrics for scoring summarization quality by n-gram overlap between a generated summary and a human-written reference; ROUGE-1, ROUGE-2, and ROUGE-L are the common variants.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of metrics for evaluating summarization. Where BLEU was designed for translation and emphasizes precision ("how many of my words are in the reference?"), ROUGE emphasizes recall ("how many reference words did I capture?"). Common variants: ROUGE-1 (unigram overlap), ROUGE-2 (bigram overlap), and ROUGE-L (longest common subsequence); most implementations report precision, recall, and F1 for each.
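To make the counting concrete, here is a minimal from-scratch sketch of ROUGE-N recall and ROUGE-L recall. It is not the official implementation (which optionally applies Porter stemming and reports precision/recall/F1 with bootstrap confidence intervals); the regex tokenizer is a crude simplification of our own:

```python
import re
from collections import Counter

def _tokens(text):
    """Crude tokenizer: lowercase, keep words and decimals like '0.25'."""
    return re.findall(r"\w+(?:\.\w+)?", text.lower())

def _ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams also found in the
    candidate, with counts clipped so repeats aren't over-credited."""
    cand = _ngrams(_tokens(candidate), n)
    ref = _ngrams(_tokens(reference), n)
    overlap = sum((cand & ref).values())   # clipped n-gram matches
    return overlap / max(sum(ref.values()), 1)

def rouge_l_recall(candidate, reference):
    """ROUGE-L recall: longest common subsequence length over reference
    length. Rewards in-order matches without requiring contiguity."""
    c, r = _tokens(candidate), _tokens(reference)
    # Standard O(len(c) * len(r)) dynamic-programming LCS table.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c, 1):
        for j, rw in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cw == rw else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1] / max(len(r), 1)
```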
It matters because summarization papers and many text-generation papers report ROUGE as their headline metric. If you compare summarization models in the academic literature, you're comparing ROUGE-1/2/L. Datasets like CNN/DailyMail and XSum are scored this way.
A concrete example: summarizing a news article. Reference summary: "The Federal Reserve raised rates by 0.25 percentage points on Wednesday." Generated: "On Wednesday the Fed hiked rates a quarter point." The two convey the same idea with few exact words in common: ROUGE-1 is middling, ROUGE-2 is near zero, even though a human would judge them equivalent. The numbers below make this concrete.
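Running the sketch above on that pair shows the gap. Exact values depend on tokenization; with Porter stemming, "points"/"point" would also match:

```python
ref = "The Federal Reserve raised rates by 0.25 percentage points on Wednesday."
gen = "On Wednesday the Fed hiked rates a quarter point."

# Only "the", "rates", "on", "wednesday" overlap as unigrams (4 of 11).
print(f"ROUGE-1 recall: {rouge_n_recall(gen, ref, 1):.2f}")  # 0.36
# Only "on wednesday" survives as a bigram (1 of 10).
print(f"ROUGE-2 recall: {rouge_n_recall(gen, ref, 2):.2f}")  # 0.10
# The longest common subsequence is just 2 words (e.g. "the ... rates").
print(f"ROUGE-L recall: {rouge_l_recall(gen, ref):.2f}")     # 0.18
```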
The limitations are similar to BLEU's: ROUGE rewards lexical overlap, not semantic equivalence. Models that copy words from the source can score higher than models that genuinely understand and rephrase. For LLM evaluation, ROUGE is increasingly supplemented or replaced by LLM-as-judge approaches and human evaluation. Related: BLEU, summarization, evaluation, LLM-as-judge.