<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-5002">
  <Title>Automatically Constructing a Corpus of Sentential Paraphrases</Title>
  <Section position="10" start_page="14" end_page="14" type="concl">
    <SectionTitle>
9 Conclusion
</SectionTitle>
    <Paragraph position="0"> We have used heuristic techniques and a classifier to automatically create a corpus of 5801 &quot;naturally occurring&quot; (non-constructed) sentence pairs, labeled according to whether, in the judgment of our evaluators, the sentences &quot;mean the same thing&quot; or not. To our knowledge, MSRP constitutes the largest currently available broad-domain corpus of paraphrase pairs that does not have its origins in translations from another language. We hope that others will use it, find it useful, and provide feedback when it is not.</Paragraph>
    <Paragraph position="1"> The methodology that we have described for extracting this corpus is not limited to news clusters: others can readily adapt it to any flat corpus that contains a large number of semantically similar sentences and on which topic-based document clustering is possible. We have shown that by allowing a statistical learning algorithm to constrain the search space, it is possible to identify a candidate corpus of manageable size, on the basis of which human judges can label sentence pairs for paraphrase content quickly and cost-effectively. We hope that others will follow our example.</Paragraph>
  </Section>
</Paper>