<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1057">
  <Title>A Formal Model for Information Selection in Multi-Sentence Text Extraction</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Many natural language processing tasks involve the collection and assembling of pieces of information from multiple sources, such as different documents or different parts of a document. Text summarization clearly entails selecting the most salient information (whether generically or for a specific task) and putting it together in a coherent summary. Question answering research has recently started examining the production of multi-sentence answers, where multiple pieces of information are included in the final output.</Paragraph>
    <Paragraph position="1"> When the answer or summary consists of multiple separately extracted (or constructed) phrases, sentences, or paragraphs, additional factors influence the selection process. Obviously, each of the selected text snippets should individually be important. However, when many of the competing passages are included in the final output, the issue of information overlap between the parts of the output comes up, and a mechanism for addressing redundancy is needed. Current approaches in both summarization and long answer generation are primarily oriented towards making good decisions for each potential part of the output, rather than examining whether these parts overlap. Most current methods adopt a statistical framework, without full semantic analysis of the selected content passages; this makes the comparison of content across multiple selected text passages hard, and necessarily approximated by the textual similarity of those passages.</Paragraph>
    <Paragraph position="2"> Thus, most current summarization or longanswer question-answering systems employ two levels of analysis: a content level, where every textual unit is scored according to the concepts or features it covers, and a textual level, when, before being added to the final output, the textual units deemed to be important are compared to each other and only those that are not too similar to other candidates are included in the final answer or summary.</Paragraph>
    <Paragraph position="3"> This comparison can be performed purely on the basis of text similarity, or on the basis of shared features that may be the same as the features used to select the candidate text units in the first place.</Paragraph>
    <Paragraph position="4"> In this paper, we propose a formal model for integrating these two tasks, simultaneously performing the selection of important text passages and the minimization of information overlap between them.</Paragraph>
    <Paragraph position="5"> We formalize the problem by positing a textual unit space, from which all potential parts of the summary or answer are drawn, a conceptual unit space, which represents the distinct conceptual pieces of information that should be maximally included in the final output, and a mapping between conceptual and textual units. All three components of the model are application- and task-dependent, allowing for different applications to operate on text pieces of different granularity and aim to cover different conceptual features, as appropriate for the task at hand. We cast the problem of selecting the best textual units as an optimization problem over a general scoring function that measures the total coverage of conceptual units by any given set of textual units, and provide general algorithms for obtaining a solution.</Paragraph>
    <Paragraph position="6"> By integrating redundancy checking into the selection of the textual units we provide a unified framework for addressing content overlap that does not require external measures of similarity between textual units. We also account for the partial overlap of information between textual units (e.g., a single shared clause), a situation which is common in natural language but not handled by current methods for reducing redundancy.</Paragraph>
  </Section>
class="xml-element"></Paper>