<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-5002">
  <Title>Automatically Constructing a Corpus of Sentential Paraphrases</Title>
  <Section position="3" start_page="0" end_page="9" type="intro">
    <SectionTitle>
2 Motivation
</SectionTitle>
    <Paragraph position="0"> The success of Statistical Machine Translation (SMT) has sparked a fruitful line of investigation that treats paraphrase acquisition and generation essentially as a monolingual machine translation problem (e.g., Barzilay &amp; Lee, 2003; Pang et al., 2003; Quirk et al., 2004; Finch et al., 2004). However, the lack of a standardly accepted corpus on which to train and evaluate models is a major stumbling block to the successful application of SMT models or other machine learning algorithms to paraphrase tasks. Since paraphrase is apparently not a common &quot;natural&quot; task--under normal circumstances people do not attempt to create extended paraphrase texts--the field lacks a large, readily identifiable dataset, comparable to the Canadian Hansard corpus in SMT, that can serve as a standard against which algorithms can be trained and evaluated.</Paragraph>
    <Paragraph position="1"> What paraphrase data is currently available is usually too small to be viable for either training or testing, or exhibits narrow topic coverage, limiting its broad-domain applicability. One class of paraphrase data that is relatively widely available consists of multiple translations of sentences in a second language. These, however, tend to be rather restricted in their domain (e.g., the ATR English-Chinese paraphrase corpus, which consists of translations of travel phrases (Zhang &amp; Yamamoto, 2002)), are limited to short hand-crafted predicates (e.g., the ATR Japanese-English corpus (Shirai et al., 2002)), or exhibit quality problems stemming from the translators' insufficient command of the target language, e.g., the Linguistic Data Consortium's Multiple-Translation Chinese Corpus (Huang et al., 2002). Multiple translations of novels, such as those used in Barzilay &amp; McKeown (2001), provide a relatively limited dataset to work with, and, since these usually involve works that are out of copyright, usually exhibit older styles of language that have little in common with modern language resources or application requirements.</Paragraph>
    <Paragraph position="2"> Likewise, the data made available by Barzilay &amp; Lee (2003; http://www.cs.cornell.edu/Info/Projects/NLP/statpar.html), while invaluable in understanding and evaluating their results, is too limited in size and domain coverage to serve as either training or test data.</Paragraph>
    <Paragraph position="3"> Attempting to evaluate models of paraphrase acquisition and generation under these limitations can thus be an exercise in frustration. Accordingly, we have tried to create a reasonably large corpus of naturally occurring, non-handcrafted sentence pairs, along with accompanying human judgments, that can be used as a resource for training or testing purposes. Since the search space of candidate sentence pairs occurring &quot;in the wild&quot; is huge, and provides far too many negative examples for humans to wade through, clustered news articles were used to constrain the initial search space to data likely to yield paraphrase pairs.</Paragraph>
  </Section>
</Paper>