<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1115">
  <Title>Using Random Walks for Question-focused Sentence Retrieval</Title>
  <Section position="4" start_page="915" end_page="915" type="metho">
    <SectionTitle>
2 Formal description of the problem
</SectionTitle>
    <Paragraph position="0"> Our goal is to build a question-focused sentence retrieval mechanism using a topic-sensitive version of the LexRankmethod. Incontrast to previous PRsystems such as Okapi (Robertson et al., 1992), which ranks documents for relevancy and then proceeds to find paragraphs related to a question, we address the finer-grained problem of finding sentences containing answers. In addition, the input to our system is a set of documents relevant to the topic of the query that the user has already identified (e.g. via a search engine). Our system does not rank the input documents, nor is it restricted in terms of the number of sentences that may be selected from the same document. null The output of our system, a ranked list of sentences relevant to the user's question, can be subsequently used as input to an answer selection system in order to find specific answers from the extracted sentences. Alternatively, the sentences can be returned to the user as a question-focused summary. This is similar to &amp;quot;snippet retrieval&amp;quot; (Wu et al., 2004). However, in our system answers are extracted from a set of multiple documents rather than on a document-by-document basis.</Paragraph>
  </Section>
  <Section position="5" start_page="915" end_page="917" type="metho">
    <SectionTitle>
3 Our approach: topic-sensitive LexRank
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="915" end_page="916" type="sub_section">
      <SectionTitle>
3.1 The LexRank method
</SectionTitle>
      <Paragraph position="0"> In (Erkan and Radev, 2004), the concept of graph-based centrality was used to rank a set of sentences, in producing generic multi-document summaries.</Paragraph>
      <Paragraph position="1"> To apply LexRank, a similarity graph is produced for the sentences in an input document set. In the graph, each node represents a sentence. There are edges between nodes for which the cosine similarity between the respective pair of sentences exceeds a given threshold. The degree of a given node is an indication of how much information the respective sentence has in common with other sentences.</Paragraph>
      <Paragraph position="2"> Therefore, sentences that contain the most salient information in the document set should be very central within the graph.</Paragraph>
      <Paragraph position="3"> Figure 2 shows an example of a similarity graph for a set of five input sentences, using a cosine similarity threshold of 0.15. Once the similarity graph is constructed, the sentences are then ranked according to their eigenvector centrality. As previously mentioned, theoriginal LexRankmethod performed well in the context of generic summarization. Below, we describe a topic-sensitive version of LexRank, which is more appropriate for the question-focused sentence retrieval problem. In the new approach, the  score of asentence isdetermined byamixture model of the relevance of the sentence to the query and the similarity of the sentence to other high-scoring sentences. null</Paragraph>
    </Section>
    <Section position="2" start_page="916" end_page="916" type="sub_section">
      <SectionTitle>
3.2 Relevance to the question
</SectionTitle>
      <Paragraph position="0"> In topic-sensitive LexRank, we first stem all of the sentences in a set of articles and compute word IDFs by the following formula:</Paragraph>
      <Paragraph position="2"> where N is the total number of sentences in the cluster, and sf w is the number of sentences that the word w appears in.</Paragraph>
      <Paragraph position="3"> We also stem the question and remove the stop words from it. Then the relevance of a sentence s to the question q is computed by:</Paragraph>
      <Paragraph position="5"> are the number of times w appears in s and q, respectively. This model has proven to be successful in query-based sentence retrieval (Allan et al., 2003), and is used as our competitive baseline in this study (e.g. Tables 4, 5 and 7).</Paragraph>
    </Section>
    <Section position="3" start_page="916" end_page="917" type="sub_section">
      <SectionTitle>
3.3 The mixture model
</SectionTitle>
      <Paragraph position="0"> The baseline system explained above does not make use of any inter-sentence information in a cluster.</Paragraph>
      <Paragraph position="1"> We hypothesize that a sentence that is similar to the high scoring sentences in the cluster should also have a high score. For instance, if a sentence that gets a high score in our baseline model is likely to contain an answer to the question, then a related sentence, which may not be similar to the question itself, is also likely to contain an answer.</Paragraph>
      <Paragraph position="2"> This idea is captured by the following mixture model, where p(s|q), the score of a sentence s given a question q, is determined as the sum of its relevance to the question (using the same measure as the baseline described above) and the similarity to the other sentences in the document cluster:</Paragraph>
      <Paragraph position="4"> where C is the set of all sentences in the cluster. The value of d, which we will also refer to as the &amp;quot;question bias,&amp;quot; is a trade-off between two terms in the  graph with a cosine threshold of 0.15.</Paragraph>
      <Paragraph position="5"> equation and is determined empirically. For higher values of d, we give more importance to the relevance to the question compared to the similarity to the other sentences in the cluster. The denominators in both terms are for normalization, which are described below. We use the cosine measure weighted by word IDFs as the similarity between two sentences in a cluster:</Paragraph>
      <Paragraph position="7"> Equation 3 can be written in matrix notation as follows:</Paragraph>
      <Paragraph position="9"> A is the square matrix such that for a given index i, all the elements in the i th column are proportional to rel(i|q). B is also a square matrix such that each entry B(i,j) is proportional to sim(i,j).Bothmatrices are normalized so that row sums add up to 1. Note that as a result of this normalization, all rows of the resulting square matrixQ =[dA+(1[?]d)B] also add up to 1. Such a matrix is called stochastic and defines a Markov chain. If we view each sentence as a state in a Markov chain, thenQ(i,j) specifies the transition probability from state i to state j in the corresponding Markov chain. The vector p we are looking for in Equation 5 is the stationary distribution of the Markov chain. An intuitive interpretation of the stationary distribution can be under- null stood by the concept of a random walk on the graph representation of the Markov chain.</Paragraph>
      <Paragraph position="10"> With probability d, a transition is made from the current node (sentence) to the nodes that are similar to the query. With probability (1-d), a transition is made to the nodes that are lexically similar to the current node. Every transition is weighted according to the similarity distributions. Each element of the vector p gives the asymptotic probability of ending up at the corresponding state in the long run regardless of the starting state. The stationary distribution of a Markov chain can be computed by a simple iterative algorithm, called power method.</Paragraph>
      <Paragraph position="11">  A simpler version of Equation 5, where A is a uniform matrix andBis a normalized binary matrix, is known as PageRank (Brin and Page, 1998; Page et al., 1998) and used to rank the web pages by the Google search engine. It was also the model used to rank sentences in (Erkan and Radev, 2004).</Paragraph>
    </Section>
    <Section position="4" start_page="917" end_page="917" type="sub_section">
      <SectionTitle>
3.4 Experiments with topic-sensitive LexRank
</SectionTitle>
      <Paragraph position="0"> We experimented with different values of d on our training data. We also considered several threshold values for inter-sentence cosine similarities, where we ignored the similarities between the sentences that are below the threshold. In the training phase of the experiment, we evaluated all combinations of LexRank with d in the range of [0,1] (in increments of 0.10) and with a similarity threshold ranging from [0,0.9] (in increments of 0.05). We then found all configurations that outperformed the baseline. These configurations were then applied to our development/test set. Finally, our best sentence retrieval system was applied to our test data set and evaluated against the baseline. The remainder of the paper will explain this process and the results in detail. null</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="917" end_page="920" type="metho">
    <SectionTitle>
4 Experimental setup
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="917" end_page="917" type="sub_section">
      <SectionTitle>
4.1 Corpus
</SectionTitle>
      <Paragraph position="0"> We built a corpus of 20 multi-document clusters of complex news stories, such as plane crashes, political controversies and natural disasters. The data  The stationary distribution is unique and the power method is guaranteed to converge provided that the Markov chain is ergodic (Seneta, 1981). A non-ergodic Markov chain can be made ergodic by reserving a small probability for jumping to any other state from the current state (Page et al., 1998). clusters and their characteristics are shown in Table 1. The news articles were collected from various sources. &amp;quot;Newstracker&amp;quot; clusters were collected automatically by our Web-based news summarization system. The number of clusters randomly assigned to the training, development/test and test data sets were 11, 3 and 6, respectively.</Paragraph>
      <Paragraph position="1"> Next, we assigned each cluster of articles to an annotator, who was asked to read all articles in the cluster. He or she then generated a list of factual questions key to understanding the story. Once we collected the questions for each cluster, two judges independently annotated nine of the training clusters. For each sentence and question pair in a given cluster, the judges were asked to indicate whether or not the sentence contained a complete answer to the question. Once an acceptable rate of inter-judge agreement was verified on the first nine clusters (Kappa (Carletta, 1996) of 0.68), the remaining 11 clusters were annotated by one judge each.</Paragraph>
      <Paragraph position="2"> In some cases, the judges did not find any sentences containing the answer for a given question.</Paragraph>
      <Paragraph position="3"> Such questions were removed from the corpus. The final number of questions annotated for answers over the entire corpus was 341, and the distributions of questions per cluster can be found in Table 1.</Paragraph>
    </Section>
    <Section position="2" start_page="917" end_page="919" type="sub_section">
      <SectionTitle>
4.2 Evaluation metrics and methods
</SectionTitle>
      <Paragraph position="0"> To evaluate our sentence retrieval mechanism, we produced extract files, which contain a list of sentences deemed to be relevant to the question, for the system and from human judgment. To compare different configurations of our system to the baseline system, we produced extracts at a fixed length of 20 sentences. While evaluations of question answering systems are often based on a shorter list of ranked sentences, we chose to generate longer lists for several reasons. One is that we are developing a PR system, of which the output can then be input to an answer extraction system for further processing. In such a setting, we would most likely want to generate a relatively longer list of candidate sentences. As previously mentioned, in our corpus the questions often have more than one relevant answer, so ideally, our PR system would find many of the relevant sentences, sending them on to the answer component to decide which answer(s) should be returned to the user. Each system's extract file lists the document  and sentence numbers of the top 20 sentences. The &amp;quot;gold standard&amp;quot; extracts list the sentences judged as containing answers to a given question by the annotators (and therefore have variable sizes) in no particular order.</Paragraph>
      <Paragraph position="1">  We evaluated the performance of the systems using two metrics - Mean Reciprocal Rank (MRR) (Voorhees and Tice, 2000) and Total Reciprocal Document Rank (TRDR) (Radev et al., 2005).</Paragraph>
      <Paragraph position="2"> MRR, used in the TREC Q&amp;A evaluations, is the reciprocal rank of the first correct answer (or sentence, in our case) to a given question. This measure gives us an idea of how far down we must look in the ranked list in order to find a correct answer. To contrast, TRDR is the total of the reciprocal ranks of all answers found by the system. In the context of answering questions from complex stories, wherethere is often more than one correct answer to a question, and where answers are typically time-dependent, we should focus on maximizing TRDR, which gives us  For clusters annotated by two judges, all sentences chosen by at least one judge were included.</Paragraph>
      <Paragraph position="3"> a measure of how many of the relevant sentences were identified by the system. However, we report both the average MRR and TRDR over all questions in a given data set.</Paragraph>
      <Paragraph position="4"> 5 LexRank versus the baseline system In the training phase, we searched the parameter space for the values of d (the question bias) and the similarity threshold inorder to optimize the resulting TRDR scores. For our problem, we expected that a relatively low similarity threshold pair with a high question bias would achieve the best results. Table 2 shows the effect of varying the similarity threshold.  The notation LR[a,d] is used, where a is the similarity threshold and d is the question bias. The optimal range for the parameter a was between 0.14 and 0.20. This is intuitive because if the threshold is too high, such that only the most lexically similar sentences are represented in the graph, the method does not find sentences that are related but are more lex- null A threshold of -1 means that no threshold was used such that all sentences were included in the graph.</Paragraph>
      <Paragraph position="5">  ically diverse (e.g. paraphrases). Table 3 shows the effect of varying the question bias at two different similarity thresholds (0.02 and0.20). Itisclear that a high question bias isneeded. However, asmall probability for jumping to a node that is lexically similar to the given sentence (rather than the question itself) is needed. Table 4 shows the configurations of LexRank that performed better than the baseline system on the training data, based on mean TRDR scores over the 184 training questions. We applied all four of these configurations to our unseen development/test data, in order to see if we could further differentiate their performances.</Paragraph>
    </Section>
    <Section position="3" start_page="919" end_page="920" type="sub_section">
      <SectionTitle>
5.1 Development/testing phase
</SectionTitle>
      <Paragraph position="0"> The scores for the four LexRank systems and the baseline on the development/test data are shown in  formed the baseline, both in terms of average MRR and TRDR scores. An analysis of the average scores over the 72 questions within each of the three clusters for the best system, LR[0.20,0.95], is shown in Table 6. While LexRank outperforms the base-line system on the first two clusters both in terms of MRR and TRDR, their performances are not substantially different on the third cluster. Therefore, we examined properties of the questions within each cluster in order to see what effect they might have on system performance.</Paragraph>
      <Paragraph position="1"> We hypothesized that the baseline system, which compares the similarity of each sentence to the question using IDF-weighted word overlap, should perform well on questions that provide many content words. To contrast, LexRank might perform better when the question provides fewer content words, since it considers both similarity to the query and inter-sentence similarity. Out of the 72 questions in the development/test set, the baseline system outperformed LexRank on 22 of the questions. In fact, the average number of content words among these 22 questions was slightly, but not significantly, higher than the average on the remaining questions (3.63 words per question versus 3.46). Given this observation, we experimented with two mixed strategies, in which the number of content words in a question determined whether LexRank or the baseline system was used for sentence retrieval. We tried threshold values of 4 and 6 content words, however, this did not improve the performance over the pure strategy of system LR[0.20,0.95]. Therefore, we applied this</Paragraph>
    </Section>
    <Section position="4" start_page="920" end_page="920" type="sub_section">
      <SectionTitle>
5.2 Testing phase
</SectionTitle>
      <Paragraph position="0"> As shown in Table 7, LR[0.20,0.95] outperformed the baseline system on the test data both in terms of average MRR and TRDR scores. The improvement in average TRDR score was statistically significant with a p-value of 0.0619. Since we are interested in a passage retrieval mechanism that finds sentences relevant to a given question, providing input to the question answering component of our system, the improvement in average TRDR score is very promising. While we saw in Section 5.1 that LR[0.20,0.95] may perform better on some question or cluster types than others, weconclude that it beats the competitive baseline when one is looking to optimize mean TRDR scores over a large set of questions. However, in future work, we will continue to improve the performance, perhaps by developing mixed strategies using different configurations of LexRank.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="920" end_page="920" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> The idea behind using LexRank for sentence retrieval is that a system that considers only the similarity between candidate sentences and the input query, and not the similarity between the candidate sentences themselves, is likely to miss some important sentences. When using any metric to compare sentences and a query, there is always likely to be a tie between multiple sentences (or, similarly, there may be cases where fewer than the number of desired sentences have similarity scores above zero).</Paragraph>
    <Paragraph position="1"> LexRank effectively provides a means to break such ties. An example of such a scenario is illustrated in Tables 8and 9, whichshowthe top ranked sentences by the baseline and LexRank, respectively for the question &amp;quot;What caused the Kursk to sink?&amp;quot; from the Kursk submarine cluster. It can be seen that all top five sentences chosen by the baseline system have</Paragraph>
  </Section>
  <Section position="8" start_page="920" end_page="920" type="metho">
    <SectionTitle>
Rank Sentence Score Relevant?
1 The Russian governmental commission on the 4.2282 N
</SectionTitle>
    <Paragraph position="0"> accident of the submarine Kursk sinking in the Barents Sea on August 12 has rejected 11 original explanations for the disaster, but still cannot conclude what caused the tragedy indeed, Russian Deputy Premier Ilya Klebanov said here Friday.</Paragraph>
    <Paragraph position="1"> 2 There has been no final word on what caused 4.2282 N the submarine to sink while participating in a major naval exercise, but Defense Minister Igor Sergeyev said the theory that Kursk may have collided with another object is receiving increasingly concrete confirmation.</Paragraph>
  </Section>
  <Section position="9" start_page="920" end_page="920" type="metho">
    <SectionTitle>
3 Russian Deputy Prime Minister Ilya Klebanov 4.2282 Y
</SectionTitle>
    <Paragraph position="0"> said Thursday that collision with a big object caused the Kursk nuclear submarine to sink to the bottom of the Barents Sea.</Paragraph>
  </Section>
  <Section position="10" start_page="920" end_page="920" type="metho">
    <SectionTitle>
4 Russian Deputy Prime Minister Ilya Klebanov 4.2282 Y
</SectionTitle>
    <Paragraph position="0"> said Thursday that collision with a big object caused the Kursk nuclear submarine to sink to the bottom of the Barents Sea.</Paragraph>
  </Section>
  <Section position="11" start_page="920" end_page="921" type="metho">
    <SectionTitle>
5 President Clinton's national security adviser, 4.2282 N
</SectionTitle>
    <Paragraph position="0"> Samuel Berger, has provided his Russian counterpart with a written summary of what U.S. naval and intelligence officials believe caused the nuclear-powered submarine Kursk to sink last month in the Barents Sea, officials  on the question &amp;quot;What caused the Kursk to sink?&amp;quot;. the same sentence score (similarity to the query), yet the top ranking two sentences are not actually relevant according to the judges. To contrast, LexRank achieved a better ranking of the sentences since it is better able to differentiate between them. It should be noted that both for the LexRank and baseline systems, chronological ordering of the documents and sentences is preserved, such that in cases where two sentences have the same score, the one published earlier is ranked higher.</Paragraph>
    <Paragraph position="1"> 7Conclusion We presented topic-sensitive LexRank and applied it to the problem of sentence retrieval. In a Web-based news summarization setting, users of our system could choose to see the retrieved sentences (as in Table 9) as a question-focused summary. As indicated in Table 9, each of the top three sentences were judged by our annotators as providing a complete answer to the respective question. While the first two sentences provide the same answer (a collision caused the Kursk to sink), the third sentence provides a different answer (an explosion caused the disaster). While the last two sentences do not provide answers according to our judges, they do provide context information about the situation. Alternatively, the user might prefer to see the extracted  said Thursday that collision with a big object caused the Kursk nuclear submarine to sink to the bottom of the Barents Sea.</Paragraph>
  </Section>
  <Section position="12" start_page="921" end_page="921" type="metho">
    <SectionTitle>
2 Russian Deputy Prime Minister Ilya Klebanov 0.0133 Y
</SectionTitle>
    <Paragraph position="0"> said Thursday that collision with a big object caused the Kursk nuclear submarine to sink to the bottom of the Barents Sea.</Paragraph>
  </Section>
  <Section position="13" start_page="921" end_page="921" type="metho">
    <SectionTitle>
3 The Russian navy refused to confirm this, 0.0125 Y
</SectionTitle>
    <Paragraph position="0"> but officers have said an explosion in the torpedo compartment at the front of the submarine apparently caused the Kursk to sink.</Paragraph>
  </Section>
class="xml-element"></Paper>