<?xml version="1.0" standalone="yes"?>
<Paper uid="P00-1038">
  <Title>Query-Relevant Summarization using FAQs</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> An important distinction in document summarization is between generic summaries, which capture the central ideas of the document in much the same way that the abstract of this paper was designed to distill its salient points, and query-relevant summaries, which reflect the relevance of a document to a user-specified query. This paper discusses query-relevant summarization, sometimes also called &quot;user-focused summarization&quot; (Mani and Bloedorn, 1998).</Paragraph>
    <Paragraph position="1"> Query-relevant summaries are especially important in the &quot;needle(s) in a haystack&quot; document retrieval problem: a user has an information need expressed as a query (What countries export smoked salmon?), and a retrieval system must locate within a large collection of documents those documents most likely to fulfill this need. Many interactive retrieval systems--web search engines like Altavista, for instance--present the user with a small set of candidate relevant documents, each summarized; the user must then perform a kind of triage to identify likely relevant documents from this set. The web page summaries presented by most search engines are generic, not query-relevant, and thus provide very little guidance to the user in assessing relevance. Query-relevant summarization (QRS) aims to provide a more effective characterization of a document by accounting for the user's information need when generating a summary.</Paragraph>
    <Paragraph position="2"> Given a user query q, search engines typically first (a) identify a set of documents which appear potentially relevant to the query, and then (b) produce a short characterization σ(d, q) of each document's relevance to q. The purpose of σ(d, q) is to assist the user in finding documents that merit a more detailed inspection.</Paragraph>
    <Paragraph position="3"> As with almost all previous work on summarization, this paper focuses on the task of extractive summarization: selecting as summaries text spans--either complete sentences or paragraphs--from the original document.</Paragraph>
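Since an extractive summarizer only ever selects spans of the original document, the candidate set is easy to enumerate. The sketch below, with a naive punctuation-based splitter standing in for a real sentence segmenter, illustrates the idea:

```python
import re

def candidate_summaries(document):
    # Each sentence-level span of the document is a candidate
    # extractive summary; a real system might add paragraph spans too.
    spans = re.findall(r"[^.!?]+[.!?]", document)
    return [span.strip() for span in spans]

doc = "Snow is not unusual in France. Still, skiers prefer the Alps."
print(candidate_summaries(doc))
```

A real segmenter must of course handle abbreviations, quotations, and other punctuation that this toy splitter ignores.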
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Statistical models for summarization
</SectionTitle>
      <Paragraph position="0"> From a document d and query q, the task of query-relevant summarization is to extract a portion s from d which best reveals how the document relates to the query. To begin, we start with a collection C of (d, q, s) triplets, where s is a human-constructed summary of d relative to the query q.</Paragraph>
      <Paragraph position="2"> [Figure: three imaginary (d, q, s) triplets, one of whose documents begins &quot;Snow is not unusual in France...&quot; Learning a model of summarization requires a set of documents summarized with respect to queries. Here we show three imaginary triplets (d, q, s), but the statistical learning techniques described in Section 2 require thousands of examples.]</Paragraph>
      <Paragraph position="5"> From such a collection of data, we fit the best function f : (q, d) → s mapping document/query pairs to summaries.</Paragraph>
      <Paragraph position="6"> The mapping we use is a probabilistic one, meaning the system assigns a value p(s | d, q) to every possible summary s of (d, q). The QRS system will summarize a (d, q) pair by selecting s* = arg max_s p(s | d, q).</Paragraph>
      <Paragraph position="8"> There are at least two ways to interpret p(s | d, q).</Paragraph>
      <Paragraph position="9"> First, one could view p(s | d, q) as a &quot;degree of belief&quot; that the correct summary of d relative to q is s.</Paragraph>
      <Paragraph position="10"> Of course, what constitutes a good summary in any setting is subjective: any two people performing the same summarization task will likely disagree on which part of the document to extract. We could, in principle, ask a large number of people to perform the same task.</Paragraph>
      <Paragraph position="11"> Doing so would impose a distribution p(· | d, q) over candidate summaries. Under the second, or &quot;frequentist&quot; interpretation, p(s | d, q) is the fraction of people who would select s--equivalently, the probability that a person selected at random would prefer s as the summary. The statistical model p(· | d, q) is parametric; the values of its parameters are learned by inspection of the (d, q, s) triplets. The learning process involves maximum-likelihood estimation of probabilistic language models and the statistical technique of shrinkage (Stein, 1955).</Paragraph>
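A minimal sketch of these ingredients, assuming a unigram language model: the maximum-likelihood estimate over a candidate span is shrunk (linearly interpolated, with a hypothetical weight lam) toward a background distribution, and candidates are scored by how well they explain the query. This is only a stand-in for the full p(s | d, q) model developed in Section 2:

```python
import math
from collections import Counter

def unigram_mle(tokens):
    # Maximum-likelihood unigram estimate: relative word frequencies.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def shrunk_prob(word, p_span, p_background, lam=0.7):
    # "Shrinkage": interpolate the sparse MLE toward a background model
    # so unseen words do not get probability zero.
    return lam * p_span.get(word, 0.0) + (1.0 - lam) * p_background.get(word, 1e-6)

def score(candidate, query_tokens, p_background, lam=0.7):
    # Score a candidate span by how well it "explains" the query words.
    p_span = unigram_mle(candidate.lower().replace(".", "").split())
    return sum(math.log(shrunk_prob(w, p_span, p_background, lam))
               for w in query_tokens)

def best_summary(candidates, query, p_background):
    # arg max over candidate spans, as in the selection rule above.
    q = query.lower().split()
    return max(candidates, key=lambda s: score(s, q, p_background))

background = {"snow": 0.01, "in": 0.05, "france": 0.02}
cands = ["Snow is not unusual in France.",
         "Wine is popular in France."]
print(best_summary(cands, "snow in france", background))
```

The interpolation weight and the background distribution here are illustrative placeholders; in practice both would be estimated from the training collection.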
      <Paragraph position="12"> This probabilistic approach easily generalizes to the generic summarization setting, where there is no query. In that case, the training data consists of (d, s) pairs, where s is a summary of the document d. The goal, in this case, is to learn and apply a mapping from documents d to summaries s.</Paragraph>
      <Paragraph position="15"> [Figure: excerpt from an FAQ, containing the passages &quot;Amniocentesis, or amnio, is a prenatal test in which...&quot;, &quot;What can it detect? One of the main uses of amniocentesis is to detect chromosomal abnormalities...&quot;, and &quot;What are the risks of amnio? The main risk of amnio is that it may increase the chance of miscarriage...&quot; An FAQ is a list of question/answer pairs on a single topic; the FAQ depicted here is part of an informational document on amniocentesis. This paper views answers in a FAQ as different summaries of the FAQ: the answer to the j-th question is a summary of the FAQ relative to that question.]</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.2 Using FAQ data for summarization
</SectionTitle>
      <Paragraph position="0"> We have proposed using statistical learning to construct a summarization system, but have not yet discussed the one crucial ingredient of any learning procedure: training data. The ideal training data would contain a large number of heterogeneous documents, a large number of queries, and summaries of each document relative to each query. We know of no such publicly-available collection. Many studies on text summarization have focused on the task of summarizing newswire text, but there is no obvious way to use news articles for query-relevant summarization within our proposed framework.</Paragraph>
      <Paragraph position="1"> In this paper, we propose a novel data collection for training a QRS model: frequently-asked question documents. Each frequently-asked question document (FAQ) is comprised of questions and answers about a specific topic. We view each answer in a FAQ as a summary of the document relative to the question which preceded it. That is, an FAQ with N question/answer pairs comes equipped with N different queries and summaries: the answer to the j-th question is a summary of the document relative to the j-th question. While a somewhat unorthodox perspective, this insight allows us to enlist FAQs as labeled training data for the purpose of learning the parameters of a statistical QRS model.</Paragraph>
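Under this view, turning an FAQ into labeled training data is mechanical. A sketch, assuming the FAQ has already been segmented into question/answer pairs:

```python
def faq_to_triplets(qa_pairs):
    """Build (d, q, s) training triplets from an FAQ given as
    (question, answer) pairs: the full FAQ text is the document d,
    each question is a query q, and its answer is the summary s of d
    relative to q."""
    document = "\n".join(q + "\n" + a for q, a in qa_pairs)
    return [(document, q, a) for q, a in qa_pairs]

faq = [("What is amnio?", "Amniocentesis is a prenatal test."),
       ("What are the risks?", "It may increase the chance of miscarriage.")]
triplets = faq_to_triplets(faq)
print(len(triplets))  # one triplet per question/answer pair
```

Every triplet shares the same document, so an FAQ with N question/answer pairs yields N labeled examples, exactly as described above.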
      <Paragraph position="2"> FAQ data has some properties that make it particularly attractive for text learning: • There exist a large number of Usenet FAQs--several thousand documents--publicly available on the Web. Moreover, many large companies maintain their own FAQs to streamline the customer-response process.</Paragraph>
      <Paragraph position="3"> • FAQs are generally well-structured documents, so the task of extracting the constituent parts (queries and answers) is amenable to automation.</Paragraph>
      <Paragraph position="4"> There have even been proposals for standardized FAQ formats, such as RFC1153 and the Minimal Digest Format (Wancho, 1990).</Paragraph>
      <Paragraph position="5"> • Usenet FAQs cover an astonishingly wide variety of topics, ranging from extraterrestrial visitors to mutual-fund investing. If there's an online community of people with a common interest, there's likely to be a Usenet FAQ on that subject.</Paragraph>
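The second property listed above, that extracting questions and answers is automatable, can be illustrated with a deliberately rough segmenter. It assumes (hypothetically) that every question sits on its own line ending in a question mark; structured formats such as the Minimal Digest Format provide far more reliable cues:

```python
def parse_faq(text):
    # Rough segmentation: a line ending in "?" starts a new question;
    # the lines that follow, until the next question, form its answer.
    pairs, question, answer_lines = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.endswith("?"):
            if question is not None:
                pairs.append((question, " ".join(answer_lines)))
            question, answer_lines = line, []
        elif line:
            answer_lines.append(line)
    if question is not None:
        pairs.append((question, " ".join(answer_lines)))
    return pairs

faq_text = "What can it detect?\nChromosomal abnormalities.\nWhat are the risks?\nPossible miscarriage."
print(parse_faq(faq_text))
```

Real FAQ extraction would also handle multi-line questions, numbered headings, and digest delimiters, but the structure of these documents keeps even such refinements straightforward.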
      <Paragraph position="6"> There has been a small amount of published work involving question/answer data, including (Sato and Sato, 1998) and (Lin, 1999). Sato and Sato used FAQs as a source of summarization corpora, although in quite a different context than that presented here. Lin used the datasets from a question/answer task within the Tipster project, a dataset of considerably smaller size than the FAQs we employ. Neither of these papers focused on a statistical machine learning approach to summarization.</Paragraph>
    </Section>
  </Section>
</Paper>