<?xml version="1.0" standalone="yes"?>
<Paper uid="P00-1038">
  <Title>Query-Relevant Summarization using FAQs</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 A probabilistic model of summarization
</SectionTitle>
    <Paragraph position="0"> summarization Given a query a45 and document a44 , the query-relevant summarization task is to find</Paragraph>
    <Paragraph position="2"> the a posteriori most probable summary for a89a97a44a50a49a51a45a83a92 .</Paragraph>
    <Paragraph position="3"> Using Bayes' rule, we can rewrite this expression as</Paragraph>
    <Paragraph position="5"> where the last line follows by dropping the dependence on a44 in a96a112a89a91a45a142a98a99a46a128a49a51a44a40a92 .</Paragraph>
    <Paragraph position="6"> Equation (1) is a search problem: find the summary a46 a138 which maximizes the product of two factors: 1. The relevance a96a112a89a91a45a151a98a105a46a149a92 of the query to the summary: A document may contain some portions directly relevant to the query, and other sections bearing little or no relation to the query. Consider, for instance, the problem of summarizing a  survey on the history of organized sports relative to the query &amp;quot;Who was Lou Gehrig?&amp;quot; A summary mentioning Lou Gehrig is probably more relevant to this query than one describing the rules of volleyball, even if two-thirds of the survey happens to be about volleyball.</Paragraph>
    <Paragraph position="7"> 2. The fidelity a96a112a89a91a46a112a98a105a44a143a92 of the summary to the document: Among a set of candidate summaries whose relevance scores are comparable, we should prefer that summary a46 which is most representative of the document as a whole. Summaries of documents relative to a query can often mislead a reader into overestimating the relevance of an unrelated document. In particular, very long documents are likely (by sheer luck) to contain some portion which appears related to the query. A document having nothing to do with Lou Gehrig may include a mention of his name in passing, perhaps in the context of amyotropic lateral sclerosis, the disease from which he suffered. The fidelity term guards against this occurrence by rewarding or penalizing candidate summaries, depending on whether they are germane to the main theme of the document.</Paragraph>
    <Paragraph position="8"> More generally, the fidelity term represents a prior, query-independent distribution over candidate summaries. In addition to enforcing fidelity, this term could serve to distinguish between more and less fluent candidate summaries, in much the same way that traditional language models steer a speech dictation system towards more fluent hypothesized transcriptions.</Paragraph>
    <Paragraph position="9"> In words, (1) says that the best summary of a document relative to a query is relevant to the query (exhibits a large a96a112a89a97a45a142a98a105a46a145a92 value) and also representative of the document from which it was extracted (exhibits a large a96a112a89a97a46a112a98a105a44a40a92 value). We now describe the parametric form of these models, and how one can determine optimal values for these parameters using maximum-likelihood estimation.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Language modeling
</SectionTitle>
      <Paragraph position="0"> The type of statistical model we employ for both a96a112a89a91a45a142a98a99a46a149a92 and a96a112a89a91a46a112a98a105a44a143a92 is a unigram probability distribution over words; in other words, a language model.</Paragraph>
      <Paragraph position="1"> Stochastic models of language have been used extensively in speech recognition, optical character recognition, and machine translation (Jelinek, 1997; Berger et al., 1994). Language models have also started to find their way into document retrieval (Ponte and Croft, 1998; Ponte, 1998).</Paragraph>
      <Paragraph position="2"> The fidelity model a96a152a89a91a46a112a98a105a44a143a92 One simple statistical characterization of an a153 -word document a44 a100 a48a114a154a156a155 a49 a154a133a157 a49a149a158a145a158a149a158 a154a128a159 a53 is the frequency of each word in a44 --in other words, a marginal distribution over words. That is, if word a160 appears a136 times in a44 , then a96a162a161a143a89a91a160a163a92  In the text classification literature, this characterization of a44 is known as a &amp;quot;bag of words&amp;quot; model, since the distribution a96a162a161 does not take account of the order of the words within the document a44 , but rather views a44 as an unordered set (&amp;quot;bag&amp;quot;) of words. Of course, ignoring word order amounts to discarding potentially valuable information. In Figure 3, for instance, the second question contains an anaphoric reference to the preceding question: a sophisticated context-sensitive model of language might be able to detect thatit in this context refers to amniocentesis, but a context-free model will not.</Paragraph>
      <Paragraph position="3"> The relevance model a96a152a89a91a45a142a98a99a46a149a92 In principle, one could proceed analogously to (2), and take</Paragraph>
      <Paragraph position="5"> for a length-a136 query a45 a100 a48 a193 a155 a49a51a193 a157 a158a149a158a145a158a41a193a141a194a123a53 . But this strategy suffers from a sparse estimation problem. In contrast to a document, which we expect will typically contain a few hundred words, a normal-sized summary contains just a handful of words. What this means is that a96 a111 will assign zero probability to most words, and</Paragraph>
      <Paragraph position="7"> swer in document a217 is a convex combination of five distributions: (1) a uniform modela212a83a218 . (2) a corpus-wide model a212a220a219 ; (3) a model a212a83a221a223a222 constructed from the document containing a84 a214a215 ; (4) a model a212a120a224a40a222a225 constructed from a84 a214a215 and the neighboring sentences in a37a162a214 ; (5) a model a212a83a226 a222a225 constructed from a84 a214a215 alone. (The a212a120a224 distribution is omitted for clarity.) any query containing a word not in the summary will receive a relevance score of zero.</Paragraph>
      <Paragraph position="8"> (The fidelity model doesn't suffer from zeroprobabilities, at least not in the extractive summarization setting. Since a summary a46 is part of its containing document a44 , every word in a46 also appears in a44 , and therefore a96 a161 a89 a175 a92a228a227a230a229 for every word a175 a183a231a46 . But we have no guarantee, for the relevance model, that a summary contains all the words in the query.) We address this zero-probability problem by interpolating or &amp;quot;smoothing&amp;quot; the a96 a111 model with four more robustly estimated unigram word models. Listed in order of decreasing variance but increasing bias away from a96 a111 , they are: a96a83a232 : a probability distribution constructed using not only a46 , but also all words within the six summaries (answers) surrounding a46 in a44 . Since a96 a232 is calculated using more text than just a46 alone, its parameter estimates should be more robust that those of a96 a111 . On the other hand, the a96 a232 model is, by construction, biased away from a96 a111 , and therefore provides only indirect evidence for the relation between a45 and a46 .</Paragraph>
      <Paragraph position="9"> a96a162a161 : a probability distribution constructed over the entire document a44 containing a46 . This model has even less variance than a96a83a232 , but is even more biased away from a96 a111 .</Paragraph>
      <Paragraph position="10"> a96a83a233 : a probability distribution constructed over all documents a44 .</Paragraph>
      <Paragraph position="11"> a96a43a234 : the uniform distribution over all words.</Paragraph>
      <Paragraph position="12"> Figure 4 is a hierarchical depiction of the various language models which come into play in calculating a96a112a89a91a45a142a98a99a46a149a92 . Each summary model a96 a111 lives at a leaf node, and the relevancea96a112a89a97a45a142a98a99a46a149a92 of a query to that summary is a convex combination of the distributions at each node  known as shrinkage (Stein, 1955), a simple form of the EM algorithm (Dempster et al., 1977).</Paragraph>
      <Paragraph position="13"> As a practical matter, if one assumes the a168 a111 model assigns probabilities independently of a46 , then we can drop the a168 a111 term when ranking candidate summaries, since the score of all candidate summaries will receive an identical contribution from the a168 a111 term. We make this simplifying assumption in the experiments reported in the following section.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Results
</SectionTitle>
    <Paragraph position="0"> To gauge how well our proposed summarization technique performs, we applied it to two different real-world collections of answered questions: Usenet FAQs: A collection of a171a223a229 a170 frequently-asked question documents from the comp.* Usenet hierarchy. The documents contained a170a3a29 a229a223a229 questions/answer pairs in total.</Paragraph>
    <Paragraph position="1"> Call-center data: A collection of questions submitted by customers to the companies Air Canada, Ben and Jerry, Iomagic, and Mylex, along with the answers supplied by company</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2By incorporating a
</SectionTitle>
      <Paragraph position="0"> a212a83a221 model into the relevance model, equation (6) has implicitly resurrected the dependence on a37 which we dropped, for the sake of simplicity, in deriving (1). representatives. These four documents contain</Paragraph>
      <Paragraph position="2"> We conducted an identical, parallel set of experiments on both. First, we used a randomly-selected subset of 70% of the question/answer pairs to calculate the language models a96 a111 a49a41a96 a232 a49a208a96 a161 a49a41a96 a233 --a simple matter of counting word frequencies. Then, we used this same set of data to estimate the model weights</Paragraph>
      <Paragraph position="4"> a234a90a53 using shrinkage. We reserved the remaining 30% of the question/answer pairs to evaluate the performance of the system, in a manner described below.</Paragraph>
      <Paragraph position="5"> Figure 5 shows the progress of the EM algorithm in calculating maximum-likelihood values for the smoothing coefficients a235a236 , for the first of the three runs on the Usenet data. The quick convergence and the final a235a236 values were essentially identical for the other partitions of this dataset.</Paragraph>
      <Paragraph position="6"> The call-center data's convergence behavior was similar, although the final a235a236 values were quite different. Figure 6 shows the final model weights for the first of the three experiments on both datasets. For the Usenet FAQ data, the corpus language model is the best predictor of the query and thus receives the highest weight. This may seem counterintuitive; one might suspect that answer to the query (a46 , that is) would be most similar to, and therefore the best predictor of, the query. But the corpus model, while certainly biased away from the distribution of words found in the query, contains (by construction) no zeros, whereas each summary model is typically very sparse.</Paragraph>
      <Paragraph position="7"> In the call-center data, the corpus model weight is lower at the expense of a higher document model weight. We suspect this arises from the fact that the documents in the Usenet data were all quite similar to one another in lexical content, in contrast to the call-center documents. As a result, in the call-center data the document containing a46 will appear much more relevant than the corpus as a whole.</Paragraph>
      <Paragraph position="8"> To evaluate the performance of the trained QRS model, we used the previously-unseen portion of the FAQ data in the following way. For each test a89a91a44a50a49a51a45a83a92 pair, we recorded how highly the system ranked the correct summary a46 a138 --the answer to a45 in a44 --relative to the other answers in a44 . We repeated this entire sequence three times for both the Usenet and the call-center data.</Paragraph>
      <Paragraph position="9"> For these datasets, we discovered that using a uniform fidelity term in place of the a96a112a89a97a46 a98a83a44a40a92 model described above yields essentially the same result. This is not surprising: while the fidelity term is an important component of a real summarization system, our evaluation was conducted in an answer-locating framework, and in this context the fidelity term--enforcing that the summary be similar to the entire document from which  using a single, randomly-selected 70% portion of the Usenet FAQ dataset. Left: The weights a20 for the models are initialized to a33a14a34a36a35 , but within a few iterations settle to their final values. Right: The progression of the likelihood of the training data during the execution of the EM algorithm; almost all of the improvement comes in the first five iterations.  components of the relevance model a212a40a35a38a33 a213a91a84a82a42 . Left: Weights assigned to the constituent models from the Usenet FAQ data. Right: Corresponding breakdown for the call-center data. These weights were calculated using shrinkage.</Paragraph>
      <Paragraph position="10"> it was drawn--is not so important.</Paragraph>
      <Paragraph position="11"> From a set of rankings a48a48a49 a155 a49 a49 a157 a49a145a158a149a158a149a158 a49a42a50 a53 , one can measure the the quality of a ranking algorithm using the harmonic mean rank:  A lower number indicates better performance; a51 a100a88a170 , which is optimal, means that the algorithm consistently assigns the first rank to the correct answer. Table 1 shows the harmonic mean rank on the two collections. The third column of Table 1 shows the result of a QRS system using a uniform fidelity model, the fourth corresponds to a standard tfidf-based ranking method (Ponte, 1998), and the last column reflects the performance of randomly guessing the correct summary from all answers in the document.</Paragraph>
      <Paragraph position="12"> trial # trials LM tfidf random  rization on the Usenet and call-center datasets. The numbers reported in the three rightmost columns are harmonic mean ranks: lower is better.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Extensions
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Question-answering
</SectionTitle>
      <Paragraph position="0"> The reader may by now have realized that our approach to the QRS problem may be portable to the problem of question-answering. By question-answering, we mean a system which automatically extracts from a potentially lengthy document (or set of documents) the answer to a user-specified question. Devising a high-quality question-answering system would be of great service to anyone lacking the inclination to read an entire user's manual just to find the answer to a single question. The success of the various automated question-answering services on the Internet (such as AskJeeves) underscores the commercial importance of this task.</Paragraph>
      <Paragraph position="1"> One can cast answer-finding as a traditional document retrieval problem by considering each candidate answer as an isolated document and ranking each candidate answer by relevance to the query. Traditional tfidf-based ranking of answers will reward candidate answers with many words in common with the query.</Paragraph>
      <Paragraph position="2"> Employing traditional vector-space retrieval to find answers seems attractive, since tfidf is a standard, timetested algorithm in the toolbox of any IR professional. What this paper has described is a first step towards more sophisticated models of question-answering.</Paragraph>
      <Paragraph position="3"> First, we have dispensed with the simplifying assumption that the candidate answers are independent of one another by using a model which explicitly accounts for the correlation between text blocks--candidate answers--within a single document. Second, we have put forward a principled statistical model for answerranking; a101a54a103a99a104a50a106a108a101a110a109a32a53 a96a112a89a97a46 a98a40a44a50a49a52a45a83a92 has a probabilistic interpretation as the best answer to a45 within a44 is a46 . Question-answering and query-relevant summarization are of course not one and the same. For one, the criterion of containing an answer to a question is rather stricter than mere relevance. Put another way, only a small number of documents actually contain the answer to a given query, while every document can in principle be summarized with respect to that query.</Paragraph>
      <Paragraph position="4"> Second, it would seem that the a96a112a89a97a46a50a98a99a44a40a92 term, which acts as a prior on summaries in (1), is less appropriate in a question-answering setting, where it is less important that a candidate answer to a query bears resemblance to the document containing it.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Generic summarization
</SectionTitle>
      <Paragraph position="0"> Although this paper focuses on the task of query-relevant summarization, the core ideas--formulating a probabilistic model of the problem and learning the values of this model automatically from FAQ-like data--are equally applicable to generic summarization. In this case, one seeks the summary which best typifies the document. Applying Bayes' rule as in (1),  The first term on the right is a generative model of documents from summaries, and the second is a prior distribution over summaries. One can think of this factorization in terms of a dialogue. Alice, a newspaper editor, has an idea a46 for a story, which she relates to Bob. Bob researches and writes the story a44 , which we can view as a &amp;quot;corruption&amp;quot; of Alice's original idea a46 . The task of generic summarization is to recover a46 , given only the generated document a44 , a model a96a112a89a97a44a241a98a99a46a149a92 of how the Alice generates summaries from documents, and a prior distributiona96a112a89a91a46a145a92 on ideas a46 . The central problem in information theory is reliable communication through an unreliable channel. We can interpret Alice's idea a46 as the original signal, and the process by which Bob turns this idea into a document a44 as the channel, which corrupts the original message.</Paragraph>
      <Paragraph position="1"> The summarizer's task is to &amp;quot;decode&amp;quot; the original, condensed message from the document.</Paragraph>
      <Paragraph position="2"> We point out this source-channel perspective because of the increasing influence that information theory has exerted on language and information-related applications. For instance, the source-channel model has been used for non-extractive summarization, generating titles automatically from news articles (Witbrock and Mittal, 1999).</Paragraph>
      <Paragraph position="3"> The factorization in (6) is superficially similar to (1), but there is an important difference: a185 a89a91a44a108a98a105a46a149a92 is a generative, from a summary to a larger document, whereas a185 a89a97a45a142a98a105a46a145a92 is compressive, from a summary to a smaller query. This distinction is likely to translate in practice into quite different statistical models and training procedures in the two cases.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>