File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/05/i05-5002_abstr.xml

Size: 1,132 bytes

Last Modified: 2025-10-06 13:44:19

<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-5002">
  <Title>Automatically Constructing a Corpus of Sentential Paraphrases</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> An obstacle to research in automatic paraphrase identification and generation is the lack of large-scale, publicly-available labeled corpora of sentential paraphrases. This paper describes the creation of the recently-released Microsoft Research Paraphrase Corpus, which contains 5801 sentence pairs, each hand-labeled with a binary judgment as to whether the pair constitutes a paraphrase. The corpus was created using heuristic extraction techniques in conjunction with an SVM-based classifier to select likely sentence-level paraphrases from a large corpus of topic-clustered news data. These pairs were then submitted to human judges, who confirmed that 67% were in fact semantically equivalent. In addition to describing the corpus itself, we explore a number of issues that arose in defining guidelines for the human raters.</Paragraph>
  </Section>
</Paper>