Introduction

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of evaluation metrics used to measure the quality of automatic summarization and machine translation outputs.

In this article, we will cover two variations: ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence).

Explanation

ROUGE-1 is a metric that measures the overlap of unigrams (single words) between the generated summary and reference summary. It computes precision, recall, and F1 score based on the count of overlapping unigrams.
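
To make this concrete, here is a minimal sketch of ROUGE-1 in Python, assuming lowercased whitespace tokenization (real implementations may tokenize differently); the helper name rouge_1 is ours, not a library function:

from collections import Counter

def rouge_1(summary, reference):
    # Whitespace tokenization, lowercased (an assumption of this sketch)
    sum_counts = Counter(summary.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as it appears in both texts
    overlap = sum((sum_counts & ref_counts).values())
    precision = overlap / sum(sum_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1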

ROUGE-L measures the longest common subsequence (LCS) between the generated and reference summaries. A subsequence is a sequence of words that appear in the same order, but not necessarily consecutively. The LCS is the longest subsequence that appears in both summaries. This measures the extent to which the generated summary captures the same sequence of concepts as the reference summary.
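
The LCS length itself can be computed with the classic dynamic-programming recurrence. Below is a sketch over token lists (lcs_length is a hypothetical helper, not part of any particular ROUGE library); ROUGE-L then divides the LCS length by the summary length for precision and by the reference length for recall:

def lcs_length(a, b):
    # table[i][j] holds the LCS length of the first i tokens of a and the first j tokens of b
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]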

Example

Let's say we have the following:

  • Machine-generated summary: The quick brown fox jumped over the lazy dog

  • Reference summary: The lazy dog was jumped over by the quick brown fox

Calculate the length (in words) of the summary and the reference

  1. len_summary = len(summary.split()) = 9

  2. len_reference = len(reference.split()) = 11
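
Assuming the two texts are stored as Python strings, these counts come from a simple whitespace split (real ROUGE implementations may tokenize somewhat differently):

summary = "The quick brown fox jumped over the lazy dog"
reference = "The lazy dog was jumped over by the quick brown fox"

print(len(summary.split()))    # 9
print(len(reference.split()))  # 11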

Calculation of ROUGE-1

  1. Precision = (number of overlapping unigrams) / (number of unigrams in the machine-generated summary) = 9 / 9 = 1. Every unigram in the summary, including both occurrences of "the", also appears in the reference.

  2. Recall = (number of overlapping unigrams) / (number of unigrams in the reference summary) = 9 / 11 ≈ 0.818

  3. F1 = 2 × Precision × Recall / (Precision + Recall) = 2 × 1 × 0.818 / 1.818 ≈ 0.9
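
Plugging the example texts into the rouge_1 sketch from the Explanation section reproduces these numbers:

print(rouge_1("The quick brown fox jumped over the lazy dog",
              "The lazy dog was jumped over by the quick brown fox"))
# approximately (1.0, 0.818, 0.9)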

Calculation of ROUGE-L

  1. Calculate the length of the longest common subsequence (LCS) between the summary and the reference. Here, one longest common subsequence is "the quick brown fox", so len_LCS = 4.

  2. Precision = len_LCS / len_summary = 4 / 9 ≈ 0.44

  3. Recall = len_LCS / len_reference = 4 / 11 ≈ 0.36

  4. F1 = 2 × Precision × Recall / (Precision + Recall) = 2 × 0.44 × 0.36 / (0.44 + 0.36) ≈ 0.4
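
Running the lcs_length sketch from the Explanation section on the lowercased token lists confirms the LCS length:

summary_tokens = "The quick brown fox jumped over the lazy dog".lower().split()
reference_tokens = "The lazy dog was jumped over by the quick brown fox".lower().split()

print(lcs_length(summary_tokens, reference_tokens))  # 4, e.g. "the quick brown fox"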

Python Example

from rouge import Rouge

# Initialize ROUGE scorer
rouge = Rouge()

# Machine-generated summary
summary = "The quick brown fox jumped over the lazy dog."

# Reference summary
reference = "The lazy dog was jumped over by the quick brown fox."

# Compute ROUGE scores (the package reports ROUGE-1, ROUGE-2, and ROUGE-L)
scores = rouge.get_scores(summary, reference, avg=True)

# Print ROUGE-1 scores
print("ROUGE-1 precision:", scores["rouge-1"]["p"])
print("ROUGE-1 recall:", scores["rouge-1"]["r"])
print("ROUGE-1 F1 score:", scores["rouge-1"]["f"])

# Print ROUGE-L scores
print("ROUGE-L precision:", scores["rouge-l"]["p"])
print("ROUGE-L recall:", scores["rouge-l"]["r"])
print("ROUGE-L F1 score:", scores["rouge-l"]["f"])