<?xml version="1.0" standalone="yes"?> <Paper uid="I05-5001"> <Title>Support Vector Machines for Paraphrase Identification and Corpus Construction</Title> <Section position="3" start_page="0" end_page="2" type="intro"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"> Two broad approaches have dominated the literature on constructing paraphrase corpora. One approach utilizes multiple translations of a single source language text, where the source language text guarantees semantic equivalence in the target language texts (e.g., Barzilay & McKeown, 2001; Pang et al., 2003). Such corpora are of limited availability, however, since multiple translations of the same document are uncommon in non-literary domains.</Paragraph> <Paragraph position="1"> The second strain of corpora construction involves mining paraphrase strings or sentences from news articles, with document clustering typically providing the topical coherence necessary to boost the likelihood that any two arbitrary sentences in the cluster are paraphrases. In this vein, Shinyama et al. (2002) use named entity anchors to extract paraphrases within a narrow domain. Barzilay & Lee (2003) employ</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> Multiple Sequence Alignment (MSA, e.g., </SectionTitle> <Paragraph position="0"> Durbin et al., 1998) to align strings extracted from closely related news articles. Although the MSA approach can produce dramatic results, it is chiefly effective in extracting highly templatic data, and appears to be of limited extensibility to broad domain application (Quirk et al. 2004).</Paragraph> <Paragraph position="1"> Recent work by Dolan, et al. (2004) describes the construction of broad-domain corpora of aligned paraphrase pairs extracted from newscluster data on the World Wide Web using two heuristic strategies: 1) pairing sentences based on a word-based edit distance heuristic; and 2) a naive text-feature-based heuristic in which the first two sentences of each article in a cluster are cross-matched with each other, their assumption being that the early sentences of a news article will tend to summarize the whole article and are thus likely to contain the same information as other early sentences of other articles in the cluster. The word-based edit distance heuristic yields pairs that are relatively clean but offer relatively minor rewrites in generation, especially when compared to the MSA model of (Barzilay & Lee, 2003). The text-based heuristic, on the other hand, results in a noisy &quot;comparable&quot; corpus: only 29.7% of sentence pairs are paraphrases, resulting in degraded performance on alignment metrics. This latter technique, however, does afford large numbers of pairings that are widely divergent at the string level; capturing these is of primary interest to paraphrase research. 
<Paragraph position="3"> In this paper, we use an annotated corpus and an SVM classifier to refine the output of this second heuristic, in an attempt to better identify sentence pairs containing richer paraphrase material and to minimize the noise generated by unwanted and irrelevant data.</Paragraph> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3 Constructing a Classifier 3.1 Sequential Minimal Optimization </SectionTitle> <Paragraph position="0"> Although any of a number of machine learning algorithms, including Decision Trees, might be equally applicable here, Support Vector Machines (Vapnik, 1995) have been used extensively and with considerable success in text classification problems (Dumais, 1998; Dumais et al., 1998; Joachims, 2002). In particular, SVMs are known to be robust in the face of noisy training data. Since they permit solutions in high-dimensional space, SVMs lend themselves readily to bulk inclusion of lexical features such as morphological and synonymy information.</Paragraph> <Paragraph position="1"> For our SVM, we employed an off-the-shelf implementation of the Sequential Minimal Optimization (SMO) algorithm described in Platt (1999). SMO offers the benefit of relatively short training times over very large feature sets and, in particular, appears well suited to handling the sparse features encountered in natural language classification tasks. SMO has been deployed in a variety of text classification tasks (e.g., Dumais, 1998; Dumais et al., 1998).</Paragraph> <Paragraph position="2"> [Table residue: example sentence pairs recovered from the extraction. Edit-distance pair: "[...] announced Wednesday that it would close its doors by Dec. 1, 2004." / "San Jose Medical Center has announced that it will close its doors by Dec. 1, 2004." First-two-sentences pair: "The genome of the fungal pathogen that causes Sudden Oak Death has been sequenced by US scientists" / "Researchers announced Thursday they've completed the genetic blueprint of the blight-causing culprit responsible for Sudden Oak Death".]</Paragraph> </Section> <Section position="5" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Datasets </SectionTitle> <Paragraph position="0"> To construct our corpus, we collected news articles from news clusters on the World Wide Web.</Paragraph> <Paragraph position="1"> A database of 13,127,938 candidate sentence pairs was assembled from 9,516,684 sentences in 32,408 clusters collected over a 2-year period, using simple heuristics to identify those sentence pairs that were most likely to be paraphrases and thereby prune the overall search space.</Paragraph> <Paragraph position="2"> [Table residue: selection criteria recovered from the extraction. Word-based Levenshtein edit distance of 1 < e <= 20 and a length-ratio constraint (threshold not recoverable); both sentences among the first three sentences of each file and a length ratio > 50%.]</Paragraph> <Paragraph position="3"> From this database, we extracted three datasets. The extraction criteria and characteristics of these datasets are given in Table 2. The datasets are labeled L(evenshtein)12, F(irst)2 and F(irst)3, reflecting their primary selection characteristics. The L12 dataset represents the best case achieved so far, with Alignment Error Rates beginning to approach those reported for alignment of closely parallel bilingual corpora.</Paragraph>
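To make the selection criteria concrete, the following minimal sketch (not from the original paper) expresses them as filters over candidate records; the record layout, helper names, and the L12 length-ratio threshold (not recoverable from the text) are assumptions, and the edit-distance floor applied to F2 and F3 anticipates the exclusion described in the next paragraph.

# Illustrative sketch of the three dataset selection criteria; not code from
# the paper. Field names are hypothetical; unrecoverable thresholds are
# assumed placeholders.
from dataclasses import dataclass

@dataclass
class Candidate:
    edit_distance: int    # word-based Levenshtein edit distance e
    length_ratio: float   # length of shorter sentence / length of longer sentence
    max_position: int     # 1-based article position of the later of the two sentences

def in_L12(c, ratio_threshold=0.5):
    # 1 < e <= 20 plus a length-ratio constraint (threshold assumed here).
    return 1 < c.edit_distance <= 20 and c.length_ratio > ratio_threshold

def in_F2(c):
    # Both sentences among the first two sentences of their articles,
    # length ratio threshold assumed to match F3, and e > 12 to avoid overlap with L12.
    return c.max_position <= 2 and c.length_ratio > 0.5 and c.edit_distance > 12

def in_F3(c):
    # Both sentences among the first three sentences; length ratio > 50%; e > 12.
    return c.max_position <= 3 and c.length_ratio > 0.5 and c.edit_distance > 12

# Toy usage:
print(in_L12(Candidate(edit_distance=8, length_ratio=0.9, max_position=4)))   # True
print(in_F3(Candidate(edit_distance=15, length_ratio=0.7, max_position=3)))   # True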
<Paragraph position="4"> The F2 dataset was constructed from the first two sentences of each article in each cluster, on the same assumptions as those used in Dolan et al. (2004). To avoid conflating the two data types, however, sentence pairs with an edit distance of 12 or less were excluded. Since this resulted in a corpus significantly smaller than is desirable for exploring extraction techniques, we also created a third dataset, F3, consisting of the cross-pairings of the first three sentences of each article in each cluster, again excluding pairs where the edit distance is e <= 12.</Paragraph> </Section> <Section position="6" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 Training Data </SectionTitle> <Paragraph position="0"> Our training data consisted of 10,000 sentence pairs extracted from randomly held-out clusters and hand-tagged by two annotators according to whether, in their judgment (1 or 0), the sentence pairs constituted paraphrases. The annotators were presented with the sentence pairs in isolation, but were informed that the pairs came from related document sets (clusters). A conservative interpretation of valid paraphrase was adopted: if one sentence was a superstring of the other, e.g., if a clause had no counterpart in the other sentence, the pair was counted as a non-paraphrase. Wherever the two annotators disagreed, the pair was classed as a non-paraphrase. The resultant data set contains 2,968 positive and 7,032 negative examples.</Paragraph> </Section> </Section> </Paper>