BERTScore

BERTScore compares generated and reference text by the similarity of their embeddings rather than exact word overlap. Because it works in meaning-space, it credits a correct paraphrase that BLEU or ROUGE would mark down.

Jun 17, 2026 · Chain of Thought

BERTScore was a response to the core flaw in BLEU and ROUGE: they match words, not meaning. Instead of counting shared n-grams, BERTScore embeds the tokens of both the candidate and the reference with a pretrained language model such as BERT, pairs each token with its most similar counterpart by cosine similarity, and averages those matches into precision, recall, and F1 scores. A summary that says the same thing in different words now scores well, because the embeddings are close even when the words aren’t.
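To make the token-level comparison concrete, here is a minimal sketch of greedy cosine matching, the scheme the original BERTScore formulation uses to turn token similarities into precision, recall, and F1. It assumes the token embeddings have already been computed; the three-dimensional vectors below are hand-made toy values, not output from a real model, and the function names `cosine` and `greedy_match_scores` are hypothetical helpers chosen for illustration. It also leaves out optional refinements such as IDF weighting and baseline rescaling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from greedy token matching.

    Precision: each candidate token is paired with its most similar
    reference token, and the best similarities are averaged.
    Recall: the same, starting from each reference token.
    """
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    # Guard against a zero denominator when nothing matches at all.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy embeddings (assumed for illustration): one vector per token.
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# The candidate covers both reference tokens and adds one unrelated token.
candidate = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

precision, recall, f1 = greedy_match_scores(candidate, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case recall is perfect because every reference token has an exact match, while precision drops because the extra candidate token matches nothing in the reference. In practice the vectors would come from a contextual embedding model, so the same word can receive different vectors in different sentences.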

That makes it a better fit for open-ended generation, but it isn’t free of issues: scores depend on which embedding model you use, they’re harder to interpret than a simple overlap count, and it still needs a reference text. It’s a solid middle ground between cheap n-gram metrics and full LLM-as-a-judge evaluation.
