ROUGE
ROUGE scores a generated summary by how much it overlaps with a human reference summary. It leans on recall: how much of the reference's content the output captured. It's the standard automatic metric for summarization.
Also known as: ROUGE score
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is BLEU’s counterpart for summarization. Where BLEU leans on precision (did the output’s words appear in the reference), ROUGE leans on recall (did the output cover the reference’s content). The common variants are ROUGE-N, which counts overlapping n-grams (ROUGE-1 for single words, ROUGE-2 for word pairs), and ROUGE-L, which uses the longest common subsequence of words shared by the two texts.
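To make the two variants concrete, here is a minimal sketch of their recall sides in plain Python. It assumes lowercased whitespace tokenization and a single reference, and the function names (`rouge_n_recall`, `rouge_l_recall`, `lcs_length`) are illustrative rather than taken from any library. Real implementations typically add stemming and report precision and F-measure as well.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Share of reference n-grams that also occur in the candidate (clipped counts)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length (recall side of ROUGE-L)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not ref:
        return 0.0
    return lcs_length(cand, ref) / len(ref)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"

print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 reference unigrams matched
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 reference bigrams matched
print(rouge_l_recall(candidate, reference))       # LCS "the cat on the mat" has 5 of 6 tokens
```

Dividing by the reference length is what makes these recall scores. Dividing the same overlap by the candidate length would give the precision-style view that BLEU emphasizes.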
Like BLEU, it measures word overlap, not meaning, so it shares the same blind spot: a good abstractive summary that words things differently from the reference can score low, and a copy-paste extract that happens to share the reference’s wording can score high. It’s a fine automatic signal for tracking summarization quality across model changes, but for “is this summary actually good,” you pair it with semantic metrics or human/LLM judgment.
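The blind spot is easy to reproduce with a toy example. The sketch below uses a simplified unigram recall (whitespace tokens, no stemming) and made-up sentences. `unigram_recall` is a hypothetical helper written for this illustration, not a library function.

```python
from collections import Counter


def unigram_recall(candidate, reference):
    """Simplified ROUGE-1 recall: share of reference words found in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum(min(count, cand[word]) for word, count in ref.items()) / total


reference = "the company reported a sharp rise in quarterly profit"

# Same meaning as the reference, almost no shared words.
paraphrase = "earnings for the quarter jumped significantly at the firm"

# Nearly the same words as the reference, but a different claim.
near_copy = "the company reported a sharp rise in costs"

paraphrase_score = unigram_recall(paraphrase, reference)  # 1/9: only "the" matches
near_copy_score = unigram_recall(near_copy, reference)    # 7/9: seven reference words match

print(f"paraphrase: {paraphrase_score:.2f}")
print(f"near copy:  {near_copy_score:.2f}")
```

The faithful paraphrase scores far below the near copy that gets the key fact wrong, which is why an overlap score on its own is a weak answer to whether a summary is good.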