<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3216">
  <Title>A Phrase-Based HMM Approach to Document/Abstract Alignment</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> There is a wealth of document/abstract pairs that statistical summarization systems could leverage to learn how to create novel abstracts. Detailed studies of such pairs (Jing, 2002) show that human abstractors perform a range of very sophisticated operations when summarizing texts, including reordering, fusion, and paraphrasing. Unfortunately, existing document/abstract alignment models are not powerful enough to capture these operations.</Paragraph>
    <Paragraph position="1"> To get around directly tackling this problem, researchers in text summarization have employed one of several techniques.</Paragraph>
    <Paragraph position="2"> Some researchers (Banko et al., 2000) have developed simple statistical models for aligning documents and headlines. These models, which implement IBM Model 1 (Brown et al., 1993), treat documents and headlines as simple bags of words and learn probabilistic word-based mappings between the words in the documents and the words in the headlines. As our results show, these models are too weak to capture the operations that humans employ when summarizing texts beyond the headline level.</Paragraph>
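    <!-- Editorial sketch, not part of the original paper. The word-based mapping
    described in the preceding paragraph can be illustrated with a minimal
    IBM Model 1 style EM loop. This is a sketch under stated assumptions, not the
    implementation of Banko et al. (2000): the function name train_model1, the toy
    pairs, and the iteration count are all hypothetical, and the NULL word of the
    standard Model 1 is omitted for brevity. Each document and headline is treated
    as a bag of words, and t[(h, d)] estimates the probability of headline word h
    given document word d.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate word mapping probabilities t[(h, d)] ~ p(h | d) with EM.

    pairs: list of (document_words, headline_words) tuples (toy input format,
    assumed for this sketch).
    """
    # Start from a uniform distribution over the headline vocabulary.
    headline_vocab = {h for _, headline in pairs for h in headline}
    uniform = 1.0 / len(headline_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (h, d) links
        total = defaultdict(float)  # expected count of links leaving d

        # E-step: spread each headline word over the document words
        # in proportion to the current mapping probabilities.
        for document, headline in pairs:
            for h in headline:
                norm = sum(t[(h, d)] for d in document)
                for d in document:
                    frac = t[(h, d)] / norm
                    count[(h, d)] += frac
                    total[d] += frac

        # M-step: renormalise the expected counts per document word.
        t = defaultdict(
            float, {(h, d): c / total[d] for (h, d), c in count.items()}
        )
    return t


# Toy document/headline pairs, invented purely for illustration.
pairs = [
    (["macintosh", "sales"], ["mac", "sales"]),
    (["macintosh", "retailer"], ["mac", "seller"]),
    (["largest", "retailer"], ["biggest", "seller"]),
]
t = train_model1(pairs)
```

    Because word order never enters the computation, a model of this kind cannot
    represent phrase-level alignments or reordering as such, which is the
    limitation the paragraph above points to. -->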
    <Paragraph position="3"> Other researchers have developed models that make unreasonable assumptions about the data, which leads to the use of only a very small percentage of the available data. For instance, the document and sentence compression models of Daumé III, Knight, and Marcu (Knight and Marcu, 2002; Daumé III and Marcu, 2002a) assume that sentences/documents can be summarized only through deletion of contiguous text segments. Knight and Marcu found that from a corpus of 39,060 abstract sentences, only 1,067 sentence extracts existed: a recall of only 2.7%.</Paragraph>
    <Paragraph position="4"> An alternative technique employed in a large variety of systems is to treat the summarization problem as a sentence extraction problem. Such systems can be trained either on human-constructed extracts or on extracts generated automatically from document/abstract pairs (see Marcu (1999) and Jing and McKeown (1999) for two such approaches).</Paragraph>
    <Paragraph position="5"> None of these techniques is adequate. Even for a relatively simple sentence from an abstract, we can see that none of the assumptions listed above holds.</Paragraph>
    <Paragraph position="6"> In Figure 1, we observe several phenomena: Alignments can occur at the granularity of words and at the granularity of phrases.</Paragraph>
    <Paragraph position="7"> The ordering of phrases in an abstract can be different from the ordering in the document.</Paragraph>
    <Paragraph position="8"> Some abstract words do not have direct correspondents in the document, and some document words are never used.</Paragraph>
    <Paragraph position="9"> It is thus desirable to be able to automatically construct alignments between documents and their abstracts, so that the correspondences between the pairs are obvious. One might be initially tempted to use readily-available machine translation systems like GIZA++ (Och and Ney, 2003) to perform such alignments. However, as we will show, the alignments produced by such a system are inadequate for this task. [Figure 1 text: "Connecting Point has become the single largest Mac retailer after tripling it 's Macintosh sales since January 1989 ." "Connecting Point Systems tripled it 's sales of Apple Macintosh systems since last January . It is now the single largest seller of Macintosh ."]</Paragraph>
    <Paragraph position="10"> The solution that we propose to this problem is an alignment model based on a novel mathematical structure we call the Phrase-Based HMM.</Paragraph>
  </Section>
</Paper>