<?xml version="1.0" standalone="yes"?> <Paper uid="I05-5002"> <Title>Automatically Constructing a Corpus of Sentential Paraphrases</Title> <Section position="4" start_page="9" end_page="10" type="metho"> <SectionTitle> 3 Source Data </SectionTitle> <Paragraph position="0"> The Microsoft Research Paraphrase Corpus (MSRP) is distilled from a database of 13,127,938 sentence pairs, extracted from 9,516,684 sentences in 32,408 news clusters collected from the World Wide Web over a 2-year period. The methods and assumptions used in building this initial data set are discussed in Quirk et al. (2004) and Dolan et al. (2004). Two heuristics based on shared lexical properties and sentence position in the document were employed to construct the initial database: word-based Levenshtein edit distance of 1 < e ≤ 20 and a length ratio greater than 66%;</Paragraph> <Paragraph position="2"> or both sentences among the first three sentences of each file and a length ratio greater than 50%.</Paragraph> <Paragraph position="3"> Within this initial dataset we were able to automatically identify the names of both authors and copyright holders of 61,618 articles.1 Limiting ourselves only to sentences found in those articles, we further narrowed the range of candidate pairs using the following criteria: the number of words n in each sentence is 5 ≤ n ≤ 40; the two sentences share at least three words in common; the length of the shorter of the two sentences, in words, is at least 66.6% that of the longer; and the two sentences have a bag-of-words lexical distance of e ≥ 8 edits.</Paragraph> <Paragraph position="4"> This enabled us to extract a set of 49,375 initial candidate sentence pairs whose author was known. The purpose of these heuristics was two-fold: 1) to narrow the search space for subsequent application of the classifier algorithm and human evaluation, and 2) to ensure at least some diversity among the sentences. In particular, we sought to exclude the large number of sentence pairs whose differences might be attributable only to typographical errors, variance between British and American spellings, and minor editorial variations. Lexical distance was computed by constructing an alphabetized list of unique vocabulary items from each of the sentences and measuring the number of insertions and deletions. Note that the number of sentence pairs collected in this first pass was relatively small compared with the overall size of the dataset; the requirement of author identification significantly circumscribed the available data. 1 Author identification was performed on the basis of pattern matching datelines and other textual information. We made a strong effort to ensure correct attribution.</Paragraph> </Section>
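The following is a minimal, illustrative sketch (in Python, not the original pipeline) of how the second-pass filters above could be applied to a candidate pair. It assumes plain whitespace tokenization and treats the bag-of-words lexical distance as the number of insertions plus deletions between the two sentences' unique vocabularies, as described above; the function names are ours, and the thresholds simply mirror the criteria listed in this section.

def lexical_distance(tokens1, tokens2):
    # Bag-of-words lexical distance: insertions plus deletions needed to turn
    # one sentence's unique vocabulary into the other's.
    v1, v2 = set(tokens1), set(tokens2)
    return len(v1 - v2) + len(v2 - v1)

def is_candidate_pair(sent1, sent2):
    # Length, shared-word, length-ratio, and lexical-distance criteria from Section 3.
    t1, t2 = sent1.split(), sent2.split()
    if not (5 <= len(t1) <= 40 and 5 <= len(t2) <= 40):
        return False
    if len(set(t1) & set(t2)) < 3:
        return False
    if min(len(t1), len(t2)) / max(len(t1), len(t2)) < 0.666:
        return False
    return lexical_distance(t1, t2) >= 8   # exclude near-identical variants

Note that whether a borderline pair passes the final test depends on tokenization details (for example, whether punctuation is split off), which the sketch deliberately leaves unspecified.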
<Section position="5" start_page="10" end_page="10" type="metho"> <SectionTitle> 4 Constructing a Classifier </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 4.1 Sequential Minimal Optimization </SectionTitle> <Paragraph position="0"> To extract candidate pairs from this ~49K list, we used a Support Vector Machine (Vapnik, 1995), in this case an implementation of the Sequential Minimal Optimization (SMO) algorithm described in Platt (1999),2 which has been shown to be useful in text classification tasks (Dumais, 1998; Dumais et al., 1998).</Paragraph> </Section> <Section position="2" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 4.2 Training Set </SectionTitle> <Paragraph position="0"> A separate set of 10,000 sentence pairs had previously been extracted from randomly held-out clusters and hand-tagged by two annotators according to whether the sentence pairs constituted paraphrases. This yielded a set of 2968 positive examples and 7032 negative examples.</Paragraph> <Paragraph position="1"> The sentences represented a random mixture of held-out sentences; no attempt was made to match their characteristics to those of the candidate data set.</Paragraph> </Section> <Section position="3" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 4.3 Classifiers </SectionTitle> <Paragraph position="0"> For the classifier we restricted the feature set to a small set of feature classes. The main classes are given below. More details can be found in Brockett and Dolan (2005).</Paragraph> <Paragraph position="1"> String Similarity Features: Absolute and relative length in words, number of shared words, word-based edit distance, and bag-of-words-based lexical distance.</Paragraph> <Paragraph position="2"> Morphological Variants: A morphological variant lexicon consisting of 95,422 word pairs was created using a hand-crafted stemmer. Each pair is then treated as a feature in the classifier.</Paragraph> <Paragraph position="3"> WordNet Lexical Mappings: 314,924 word synonym and hypernym pairs were extracted from WordNet (Fellbaum, 1998; http://www.cogsci.princeton.edu/~wn/).</Paragraph> <Paragraph position="4"> Only pairs identified as occurring in either the training data or the corpus to be classified were included in the final classifier.</Paragraph> <Paragraph position="5"> 2 The pseudocode for SMO may be found in the appendix of Platt (1999). Encarta Thesaurus: 125,054 word synonym pairs were extracted from the Encarta Thesaurus (Rooney, 2001).</Paragraph> <Paragraph position="6"> Composite Features: Additional, more abstract features summarized the frequency with which each feature or class of features occurred in the training data, both independently and in correlation with other features or feature classes.</Paragraph> </Section> <Section position="4" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 4.4 Results of Applying the Classifier </SectionTitle> <Paragraph position="0"> Since our purpose was not to evaluate the potential effectiveness of the classifier itself, but to identify a reasonably large set of both positive and plausible &quot;near-miss&quot; negative examples, the classifier was applied with output probabilities deliberately skewed towards overidentification, i.e., towards Type 1 errors, assuming non-paraphrase (0) as the null hypothesis.</Paragraph> <Paragraph position="1"> This yielded 20,574 pairs out of the initial 49,375-pair data set, from which 5801 pairs were then further randomly selected for human assessment.</Paragraph> </Section> </Section>
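As a rough illustration of Sections 4.1-4.4, the sketch below assembles a pair classifier from a handful of the string-similarity features and an off-the-shelf SVM. It is not the original system: scikit-learn's SVC stands in for the SMO implementation of Platt (1999), the lexicon-based feature classes (morphological variants, WordNet, Encarta) are omitted, and the function names and the low acceptance threshold are our own assumptions.

from sklearn.svm import SVC

def string_similarity_features(sent1, sent2):
    # A subset of the string similarity features of Section 4.3: absolute and
    # relative length in words, shared-word count, and bag-of-words lexical distance.
    t1, t2 = sent1.split(), sent2.split()
    return [len(t1), len(t2),
            min(len(t1), len(t2)) / max(len(t1), len(t2)),
            len(set(t1) & set(t2)),
            len(set(t1) ^ set(t2))]

def train_classifier(train_pairs, train_labels):
    # train_pairs: (sentence1, sentence2) tuples from the hand-tagged set of
    # Section 4.2; train_labels: 1 for paraphrase, 0 for non-paraphrase.
    X = [string_similarity_features(a, b) for a, b in train_pairs]
    clf = SVC(probability=True)   # probability estimates let the decision threshold be skewed
    clf.fit(X, train_labels)
    return clf

def select_candidates(clf, pairs, threshold=0.3):
    # Skewing towards over-identification (Section 4.4): accept any pair whose
    # estimated paraphrase probability exceeds a deliberately low threshold.
    X = [string_similarity_features(a, b) for a, b in pairs]
    positive_probs = clf.predict_proba(X)[:, 1]
    return [pair for pair, p in zip(pairs, positive_probs) if p >= threshold]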
<Section position="6" start_page="10" end_page="12" type="metho"> <SectionTitle> 5 Human Evaluation </SectionTitle> <Paragraph position="0"> The 5801 sentence pairs selected by the classifier as likely paraphrases were examined by two independent human judges. Each judge was asked whether the two sentences could be considered &quot;semantically equivalent&quot;. Disagreements were resolved by a third judge, with the final binary judgment reflecting the majority vote.3 After resolving differences between raters, 3900 (67%) of the original pairs were judged &quot;semantically equivalent&quot;.</Paragraph> <Section position="1" start_page="10" end_page="11" type="sub_section"> <SectionTitle> 5.1 Semantic Divergence </SectionTitle> <Paragraph position="0"> In many instances, the two sentences judged &quot;semantically equivalent&quot; in fact diverge semantically to at least some degree. For instance, both judges considered the following two to be paraphrases: 3 This annotation task was carried out by an independent company, the Butler Hill Group, LLC. Monica Corston-Oliver directed the effort, with Jeff Stevenson, Amy Muia, and David Rojas acting as raters.</Paragraph> <Paragraph position="1"> Charles O. Prince, 53, was named as Mr. Weill's successor.</Paragraph> <Paragraph position="2"> Mr. Weill's longtime confidant, Charles O. Prince, 53, was named as his successor.</Paragraph> <Paragraph position="3"> If a full paraphrase relationship can be described as &quot;bidirectional entailment&quot;, then the majority of the &quot;equivalent&quot; pairs in this dataset exhibit &quot;mostly bidirectional entailments&quot;, with one sentence containing information that differs from or is not contained in the other. Our decision to adopt this relatively loose tagging criterion was ultimately a practical one: insisting on complete sets of bidirectional entailments would have limited the dataset to pairs of sentences that are practically identical at the string level, as in the following examples.</Paragraph> <Paragraph position="4"> The euro rose above US$1.18, the highest price since its January 1999 launch.</Paragraph> <Paragraph position="5"> The euro rose above $1.18 the highest level since its launch in January 1999.</Paragraph> <Paragraph position="6"> However, without a carefully controlled study, there was little clear proof that the operation actually improves people's lives.</Paragraph> <Paragraph position="7"> But without a carefully controlled study, there was little clear proof that the operation improves people's lives.</Paragraph> <Paragraph position="8"> Such pairs are commonplace in the raw data, reflecting the tendency of news agencies to publish and republish the same articles, with editors introducing small and often inexplicable changes (is &quot;however&quot; really better than &quot;but&quot;?) along the way.
The resulting alternations are useful sources of information about synonymy and local syntactic changes, but our goal was to produce a richer type of corpus: one that provides information about the large-scale alternations that typify complex paraphrases.4 4 Recall that in an effort to focus on sentence pairs that are not simply trivial variants of some original single source, we restricted our original dataset by requiring a minimum word-based Levenshtein distance of 8.</Paragraph> </Section> <Section position="2" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 5.2 Complex Alternations </SectionTitle> <Paragraph position="0"> Some sentence pairs in the news data capture complex and full paraphrase alternations: Wynn paid $23.5 million for Renoir's &quot;In the Roses (Madame Leon Clapisson)&quot; at a Sotheby auction taking her to court for %MONEY% million after saying she threw a lamp at him and beat him in drunken rages It quickly became clear that in order to collect significant numbers of sentential paraphrase pairs, our standards for what constitutes &quot;semantic equivalence&quot; would have to be relaxed.</Paragraph> </Section> <Section position="3" start_page="11" end_page="12" type="sub_section"> <SectionTitle> 5.3 Rater Instructions </SectionTitle> <Paragraph position="0"> Raters were told to use their best judgment in deciding whether two sentences, at a high level, &quot;mean the same thing&quot;. Under our relatively loose definition of semantic equivalence, any two of the following sentences would have qualified as &quot;paraphrases&quot;, despite obvious differences in information content: The genome of the fungal pathogen that causes Sudden Oak Death has been sequenced by US scientists Researchers announced Thursday they've completed the genetic blueprint of the blight-causing culprit responsible for sudden oak death Scientists have figured out the complete genetic code of a virulent pathogen that has killed tens of thousands of California native oaks The East Bay-based Joint Genome Institute said Thursday it has unraveled the genetic blueprint for the diseases that cause the sudden death of oak trees Several classes of named entities were replaced by generic tags in sentences presented to the raters, so that &quot;Tuesday&quot; became %%DAY%%, &quot;$10,000&quot; became %%MONEY%%, and so on. In the released version of the dataset, however, these placeholders were replaced by the original strings. After a good deal of trial-and-error, some specific rating criteria were developed and included in a tagging specification. For the most part, though, the degree of mismatch allowed before the pair was judged &quot;non-equivalent&quot; was left to the discretion of the individual rater: did a particular set of asymmetries alter the meanings of the sentences so much that they could not be regarded as paraphrases? The following sentences, for example, were judged &quot;not equivalent&quot; despite some significant content overlap: The Gerontology Research Group said Slough was born on %DATE%, making her %NUMBER% years old at the time of her death.</Paragraph> <Paragraph position="1"> &quot;[Mrs. Slough] is the oldest living American as of the time she died,&quot; L. Stephen Coles, Executive Director of the Gerontology Research Group, said %DATE%.</Paragraph> <Paragraph position="2"> The tagging task was ill-defined enough that we were surprised at how high inter-rater agreement was (averaging 84%). The Kappa score of 0.62 is good, but low enough to be indicative of the difficulty of the rating task. We believe that with more practice and discussion between raters, agreement on the task could be improved.</Paragraph>
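For concreteness, the agreement figures above can be related through the standard chance-corrected statistic. The sketch below is a generic Cohen's kappa implementation, not code from the annotation effort; rater1 and rater2 are assumed to be parallel lists of binary (0/1) paraphrase judgments.

def cohens_kappa(rater1, rater2):
    # Observed agreement: fraction of pairs on which the two raters agree.
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement expected from each rater's marginal label distribution.
    p1, p2 = sum(rater1) / n, sum(rater2) / n
    expected = p1 * p2 + (1 - p1) * (1 - p2)
    return (observed - expected) / (1 - expected)

If each rater's marginal distribution were close to the corpus-level two-to-one positive/negative split, an observed agreement of 84% would translate to a kappa of roughly 0.64, which is broadly consistent with the 0.62 reported above and illustrates why a seemingly high raw agreement can coexist with a moderate kappa.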
<Paragraph position="3"> Interestingly, a series of experiments aimed at making the judging task more concrete resulted in uniformly degraded inter-rater agreement. Providing a checkbox to allow judges to specify that one sentence fully entailed another, for instance, left the raters frustrated, slowed down the tagging, and had a negative impact on agreement. Similarly, efforts to identify classes of syntactic alternations that would not count against an &quot;equivalent&quot; judgment resulted, in most cases, in a collapse in inter-rater agreement. After completing hundreds of judgments, the raters themselves were asked for suggestions as to what checkboxes or instructions might improve tagging speed and accuracy. In the end, few generalizations seemed useful in streamlining the task; each pair is sufficiently idiosyncratic that common sense has to take precedence over formal guidelines.</Paragraph> <Paragraph position="4"> In a few cases, firm tagging guidelines were found to be useful. One example was the treatment of pronominal and NP anaphora. Raters were instructed to treat anaphors and their full forms as equivalent, regardless of how great the disparity in length or lexical content between the two sentences might be. (Often these correspondences are extremely interesting, and in sufficient quantity would provide useful fodder for learning models of anaphora.) SCC argued that Lexmark was trying to shield itself from competition...</Paragraph> <Paragraph position="5"> The company also argued that Lexmark was trying to squash competition... But Secretary of State Colin Powell brushed off this possibility %%day%%.</Paragraph> <Paragraph position="6"> Secretary of State Colin Powell last week ruled out a nonaggression treaty.</Paragraph> <Paragraph position="7"> Note that many of the 33% of sentence pairs judged to be &quot;not equivalent&quot; still overlap significantly in information content and even wording. These pairs reflect a range of relationships, from pairs that are completely unrelated semantically, to those that are partially overlapping, to those that are almost-but-not-quite semantically equivalent.</Paragraph> </Section> </Section> <Section position="7" start_page="12" end_page="13" type="metho"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> Given that MSRP reflects both the initial heuristics and the SVM methodology that was employed to identify paraphrase candidates for human evaluation, it is also limited by that technology. The 67% ratio of positive to negative judgments is a reasonably reliable indicator of the precision of our technique--though it should be recalled that parameters were deliberately distorted to yield imprecise results that included positives and a large number of &quot;near-miss&quot; negatives. Coverage is hard to estimate reliably. We calculate that fewer than 30% of the pairs in a set of matched first-two sentences extracted from clustered news data, after application of simple heuristics, are paraphrases (Dolan et al., 2004). It seems reasonable to assume that the reduction to 10% seen in the initial data set still leaves many valid paraphrase pairs uncaptured in the corpus. The need to limit the corpus to those sentences for which authorship can be verified, and more specifically,
to no more than a single sentence extracted from each article, further constrains the coverage in ways whose consequences are not yet known. In addition, the three-shared-words heuristic guarantees that an entire class of paraphrases in which no words are shared in common has been excluded from the data. It has been observed that the mean lexical overlap in the corpus is a relatively high 0.7 (Weeds et al., 2005), suggesting that more lexically divergent examples will be needed. In these respects, as Wu (2005) points out, the corpus is far from distributionally neutral. This is a matter that we hope to remedy in the future, since in many ways this excluded set of pairs is the most interesting of all.</Paragraph> <Paragraph position="2"> The above limitations, together with its relatively small size, perhaps make the MSRP inappropriate for direct use as a training corpus. We show separately that the results of training a classifier on the present corpus may be inferior to those obtained with other training sets, though better than crude string- or text-based heuristics (Brockett & Dolan, 2005). We expect that the utility of the corpus will stem primarily from its use as a tool for evaluating paraphrase recognition algorithms. It has already been applied in this way by Corley & Mihalcea (2005) and Wu (2005).</Paragraph> </Section> <Section position="8" start_page="13" end_page="13" type="metho"> <SectionTitle> 7 A Virtual Super Corpus? </SectionTitle> <Paragraph position="0"> Although larger than any other non-translation-based labeled paraphrase corpus currently publicly available, MSRP is tiny compared with the huge bilingual parallel corpora publicly available within the Machine Translation community, for example, the Canadian Hansards, the Hong Kong Parliamentary corpus, or the United Nations documents. It is improbable that we will ever encounter a &quot;naturally occurring&quot; paraphrase corpus on the scale of any of these bilingual corpora. Moreover, whatever extraction technique is employed to identify paraphrases in other kinds of data will be apt to reflect the implicit biases of the methodology employed.</Paragraph> <Paragraph position="1"> Here we would like to put forward a proposal.</Paragraph> <Paragraph position="2"> The paraphrase research community might be able to construct a &quot;virtual paraphrase corpus&quot; that would be adequately large for both training and testing purposes and would minimize selectional biases. This could be achieved in something like the following manner. Research groups could compile their own labeled paraphrase corpora, applying whatever learning techniques they choose to select their initial data. If enough interested groups were to release a sufficiently large number of reasonably-sized corpora, it might be possible to achieve some sort of consensus, in a manner analogous to the division of the Penn Treebank into sections, whereby classifiers and other tools are conventionally trained on one subset of corpora and tested against another subset. Though this would present issues of its own, it would obviate many of the problems of extraction bias inherent in automated extraction and allow better comparison across systems.
</Paragraph> </Section> <Section position="9" start_page="13" end_page="14" type="metho"> <SectionTitle> 8 Future Directions </SectionTitle> <Paragraph position="0"> For our part, we plan to expand the MSRP, both by extending the number of sentence pairs and by improving the balance of positive and negative examples. We anticipate using multiple classifiers to reduce inherent biases in candidate corpus selection, and, with better author identification to ensure proper attribution, we expect to be able to draw on a larger dataset for consideration by our judges.</Paragraph> <Paragraph position="1"> In future releases we expect to make available more information about individual evaluator judgments. Burger & Ferro (2005) have suggested that this data may allow researchers greater freedom to construct models based on the judgments of specific judges or combinations of judges, permitting more fine-grained use of the corpus.</Paragraph> <Paragraph position="2"> One further issue that we will be attempting to address is the need to provide a better metric for corpus coverage and quality. Until reliable metrics can be established for end-to-end paraphrase tasks--these will probably need to be application-specific--the Alignment Error Rate strategy that was successfully applied in early development of machine translation systems (Och & Ney, 2000, 2003) offers a useful intermediate representation of the coverage and precision of a corpus and extraction techniques. Though full-scale reliability studies have yet to be performed, the AER technique is already finding application in other fields such as summarization (Daume & Marcu, forthcoming). We expect to be able to provide a reasonably large corpus of word-aligned paraphrase sentences in the near future that we hope will serve as some sort of standard by which corpus extraction techniques can be measured and compared in a uniform fashion.</Paragraph> <Paragraph position="3"> One other path that we are concurrently exploring is collection and validation of paraphrase data by volunteers on the web. Some initial efforts using game formats for elicitation are presented in Chklovski (2005) and Brockett & Dolan (2005). It is our hope that web volunteers will prove a useful source of colloquial paraphrases of written text, and--if paraphrase identification can be effectively embedded in the game--of paraphrase judgments.</Paragraph> </Paper>