<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1033"> <Title>An Analysis of the AskMSR Question-Answering System</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 System Architecture </SectionTitle> <Paragraph position="0"> As shown in Figure 1, the architecture of our system can be described by four main steps: query reformulation, n-gram mining, n-gram filtering, and n-gram tiling. In the remainder of this section, we briefly describe these components. A more detailed description can be found in [Brill et al., 2001].</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Query Reformulation </SectionTitle> <Paragraph position="0"> Given a question, the system generates a number of weighted rewrite strings which are likely substrings of declarative answers to the question. For example, &quot;When was the paper clip invented?&quot; is rewritten as &quot;The paper clip was invented&quot;. We then look through the collection of documents in search of such patterns. Since many of these string rewrites will result in no matching documents, we also produce less precise rewrites that have a much greater chance of finding matches. For each query, we generate a rewrite that is a backoff to a simple ANDing of all of the non-stop words in the query.</Paragraph> <Paragraph position="1"> The rewrites generated by our system are simple string-based manipulations. We do not use a parser or part-of-speech tagger for query reformulation, but we do use a lexicon for a small percentage of rewrites, in order to determine the possible parts of speech of a word as well as its morphological variants. Although we created the rewrite rules and associated weights manually for the current system, it may be possible to learn query-to-answer reformulations and their weights (e.g., Agichtein et al., 2001; Radev et al., 2001).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 N-Gram Mining </SectionTitle> <Paragraph position="0"> Once the set of query reformulations has been generated, each rewrite is formulated as a search engine query and sent to a search engine, from which page summaries are collected and analyzed. From the page summaries returned by the search engine, n-grams are collected as possible answers to the question. For reasons of efficiency, we use only the page summaries returned by the engine and not the full text of the corresponding web pages.</Paragraph> <Paragraph position="1"> The returned summaries contain the query terms, usually with a few words of surrounding context. The summary text is processed in accordance with the patterns specified by the rewrites. Unigrams, bigrams and trigrams are extracted and subsequently scored according to the weight of the query rewrite that retrieved them. These scores are summed across all summaries containing the n-gram (which is the opposite of the usual inverse document frequency component of document/passage ranking schemes). We do not count frequency of occurrence within a summary (the usual tf component in ranking schemes). Thus, the final score for an n-gram is based on the weights associated with the rewrite rules that generated it and the number of unique summaries in which it occurred.</Paragraph> </Section>
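To make the scoring scheme concrete, the following sketch (not the authors' code) shows one way the weighted n-gram mining described above could be implemented. The rewrite weights, the whitespace tokenizer, and the helper names (mine_ngrams, extract_ngrams) are illustrative assumptions; only the scoring rule itself, summing rewrite weights over the unique summaries containing an n-gram while ignoring within-summary frequency, comes from the text.

```python
from collections import defaultdict

def extract_ngrams(tokens, max_n=3):
    """All unigrams, bigrams and trigrams of a token sequence."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def mine_ngrams(rewrites):
    """rewrites: list of (rewrite_weight, list_of_summary_strings).

    score(n-gram) = sum, over the unique summaries containing it, of the
    weight of the rewrite that retrieved that summary.  Frequency of
    occurrence within a single summary is deliberately not counted.
    """
    scores = defaultdict(float)
    for weight, summaries in rewrites:
        for summary in summaries:
            tokens = summary.lower().split()          # naive tokenizer (assumption)
            for gram in set(extract_ngrams(tokens)):  # set(): each summary counts once
                scores[gram] += weight
    return scores

# Illustrative rewrite weights and summaries, not taken from the paper.
rewrites = [
    (5.0, ["the paper clip was invented in 1899"]),
    (1.0, ["paper clip history 1899 patent", "office supplies paper clip"]),
]
print(sorted(mine_ngrams(rewrites).items(), key=lambda kv: -kv[1])[:5])
```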
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 N-Gram Filtering </SectionTitle> <Paragraph position="0"> Next, the n-grams are filtered and reweighted according to how well each candidate matches the expected answer type, as specified by a handful of handwritten filters. The system uses filtering in the following manner. First, the query is analyzed and assigned one of seven question types, such as who-question, what-question, or how-many-question. Based on the query type that has been assigned, the system determines which collection of filters to apply to the set of potential answers found during n-gram mining. The candidate n-grams are analyzed for features relevant to the filters, and then rescored according to the presence of such information.</Paragraph> <Paragraph position="1"> A collection of 15 simple filters was developed based on human knowledge about question types and the domain from which their answers can be drawn. These filters used surface string features, such as capitalization or the presence of digits, and consisted of handcrafted regular expression patterns.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 N-Gram Tiling </SectionTitle> <Paragraph position="0"> Finally, we apply an answer tiling algorithm, which both merges similar answers and assembles longer answers from overlapping smaller answer fragments. For example, &quot;A B C&quot; and &quot;B C D&quot; are tiled into &quot;A B C D&quot;. The algorithm proceeds greedily from the top-scoring candidate: all subsequent candidates (up to a certain cutoff) are checked to see if they can be tiled with the current candidate answer. If so, the higher-scoring candidate is replaced with the longer tiled n-gram, and the lower-scoring candidate is removed. The algorithm stops only when no n-grams can be further tiled.</Paragraph> </Section> </Section>
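As an illustration of the tiling step, here is a minimal sketch of a greedy answer-tiling procedure in the spirit of the description above. It is not the authors' implementation: the overlap test, the cutoff, and the score handling are simplifying assumptions.

```python
def tile(a, b):
    """Return the tiled string if b overlaps a prefix/suffix of a (or is
    contained in a), else None.  E.g. "A B C" + "B C D" -> "A B C D"."""
    ta, tb = a.split(), b.split()
    # b subsumed by a (contiguous word-level containment)
    if any(ta[i:i + len(tb)] == tb for i in range(len(ta) - len(tb) + 1)):
        return a
    for k in range(min(len(ta), len(tb)), 0, -1):   # longest overlap first
        if ta[-k:] == tb[:k]:
            return " ".join(ta + tb[k:])
        if tb[-k:] == ta[:k]:
            return " ".join(tb + ta[k:])
    return None

def tile_answers(candidates, cutoff=50):
    """candidates: (answer_string, score) pairs sorted by descending score.
    Greedily merge lower-scoring candidates into higher-scoring ones and
    stop only when nothing more can be tiled."""
    cands = list(candidates[:cutoff])
    changed = True
    while changed:
        changed = False
        for i in range(len(cands)):
            for j in range(i + 1, len(cands)):
                merged = tile(cands[i][0], cands[j][0])
                if merged is not None:
                    cands[i] = (merged, cands[i][1])   # score handling simplified (assumption)
                    del cands[j]                       # drop the lower-scoring candidate
                    changed = True
                    break
            if changed:
                break
    return cands

print(tile_answers([("A B C", 3.0), ("B C D", 2.0), ("X Y", 1.0)]))
```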
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> For experimental evaluations we used the first 500 TREC-9 queries (201-700) (Voorhees and Harman, 2000). We used the patterns provided by NIST for automatic scoring. A few patterns were slightly modified to accommodate the fact that some of the answer strings returned using the Web were not available for judging in TREC-9. We did this in a very conservative manner, allowing for more specific correct answers (e.g., Edward J. Smith vs. Edward Smith) but not more general ones (e.g., Smith vs. Edward Smith), and also allowing for simple substitutions (e.g., 9 months vs. nine months). There are also substantial time differences between the Web and TREC databases (e.g., the correct answer to Who is the president of Bolivia? changes over time), but we did not modify the answer key to accommodate these time differences, because doing so would make comparison with earlier TREC results impossible. These changes influence the absolute scores somewhat but do not change relative performance, which is our focus here.</Paragraph> <Paragraph position="1"> All runs are completely automatic, starting with queries and generating a ranked list of 5 candidate answers. For the experiments reported in this paper we used Google as a backend because it provides query-relevant summaries that make our n-gram mining efficient. Candidate answers are a maximum of 50 bytes long, and typically much shorter than that. We report the Mean Reciprocal Rank (MRR) of the first correct answer, the Number of Questions Correctly Answered (NAns), and the proportion of Questions Correctly Answered (%Ans).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Basic System Performance </SectionTitle> <Paragraph position="0"> Using our current system with default settings, we obtain an MRR of 0.507 and answer 61% of the queries correctly (Baseline, Table 1). The average answer length was 12 bytes, so the system is returning short answers, not passages. Although it is impossible to compare our results precisely with those of the TREC-9 groups, this is very good performance and would place us near the top of the 50-byte runs for TREC-9.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Contributions of Components </SectionTitle> <Paragraph position="0"> Table 1 summarizes the contributions of the different system components to this overall performance. We report summary statistics as well as the percent change in performance when components are removed (%Drop MRR).</Paragraph> <Paragraph position="1"> Query Rewrites: As described earlier, queries are transformed to successively less precise formats, with a final backoff to simply ANDing all the non-stop query terms. More precise queries have higher weights associated with them, so n-grams found in their responses are given priority. If we set all the rewrite weights to be equal, MRR drops from 0.507 to 0.489, a drop of 3.6%. Another way of looking at the importance of the query rewrites is to examine performance when the only rewrite the system uses is the backoff AND query. Here the drop is more substantial, down to 0.450, which represents a drop of 11.2%.</Paragraph> <Paragraph position="2"> Query rewrites are one way in which we capitalize on the tremendous redundancy of data on the web: the occurrence of multiple linguistic formulations of the same answers increases the chances of being able to find an answer that occurs within the context of a simple pattern match with the query. Our simple rewrites help compared to doing just AND matching. Soubbotin and Soubbotin (2001) have used more specific regular expression matching to good advantage, and we could certainly incorporate some of those ideas as well.</Paragraph> <Paragraph position="3"> N-Gram Mining and Filtering: Unigrams, bigrams and trigrams are extracted from the (up to) 100 best-matching summaries for each rewrite, and scored according to the weight of the query rewrite that retrieved them. The score assigned to an n-gram is a weighted sum across the summaries containing the n-gram, where the weights are those associated with the rewrite that retrieved a particular summary. The best-scoring n-grams are then filtered according to the seven query types. For example, the filter for the query How many dogs pull a sled in the Iditarod? prefers a number, so candidate n-grams like dog race, run, Alaskan, dog racing, many mush move down the list and pool of 16 dogs (which is a correct answer) moves up. Removing the filters decreases MRR by 17.9% relative to baseline (down to 0.416). Our simple n-gram filtering is the most important individual component of the system.</Paragraph>
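To give a flavor of the filtering, the sketch below rescores candidates with handcrafted regular expressions keyed to a question type. The question-type heuristics, the particular expressions, and the boost factor are invented for illustration; the paper states only that seven question types and 15 surface-string filters (capitalization, digits, and so on) are used.

```python
import re

# Illustrative filters; the real system uses 15 handcrafted surface-string filters.
FILTERS = {
    "how-many": re.compile(r"\b\d+\b"),                           # prefers a number
    "who":      re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b"),  # capitalized name-like strings
    "when":     re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b"),        # a four-digit year
}

def question_type(question):
    """Very rough question typing (an assumption, not the paper's classifier)."""
    q = question.lower()
    if q.startswith("how many"):
        return "how-many"
    for t in ("who", "when"):
        if q.startswith(t):
            return t
    return None

def rescore(question, candidates, boost=2.0):
    """Reweight (n-gram, score) pairs that match the expected answer type."""
    f = FILTERS.get(question_type(question))
    if f is None:
        return candidates
    return sorted(((g, s * boost if f.search(g) else s) for g, s in candidates),
                  key=lambda kv: -kv[1])

print(rescore("How many dogs pull a sled in the Iditarod?",
              [("dog race", 3.0), ("pool of 16 dogs", 2.5), ("Alaskan", 2.0)]))
```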
<Paragraph position="4"> N-Gram Tiling: Finally, n-grams are tiled to create longer answer strings. This is done in a simple greedy statistical manner from the top of the list down. Not doing this tiling decreases performance by 14.2% relative to baseline (down to 0.435). The advantages gained from tiling are two-fold. First, with tiling, substrings do not take up several answer slots, so the three answer candidates San, Francisco, and San Francisco are conflated into the single answer candidate San Francisco. In addition, longer answers can never be found with only trigrams; for example, light amplification by stimulated emission of radiation can only be returned by tiling these shorter n-grams into a longer string.</Paragraph> <Paragraph position="5"> Not surprisingly, removing all of our major components except the n-gram accumulation (weighted sum of occurrences of unigrams, bigrams and trigrams) results in substantially worse performance than our full system, giving an MRR of 0.266, a decrease of 47.5%. The simplest entirely statistical system, with no linguistic knowledge or processing employed, would use only AND queries, do no filtering, but do statistical tiling. This system uses redundancy only in summing n-gram counts across summaries. It has an MRR of 0.338, which is a 33% drop from the best version of our system with all components enabled. Note, however, that even with absolutely no linguistic processing, the performance attained is still very reasonable on an absolute scale; in fact, only one TREC-9 50-byte run achieved higher accuracy than this.</Paragraph> <Paragraph position="6"> To summarize, we find that all of our processing components contribute to the overall accuracy of the question-answering system. The precise weights assigned to different query rewrites seem relatively unimportant, but the rewrites themselves contribute considerably to overall accuracy.</Paragraph> <Paragraph position="7"> N-gram tiling turns out to be extremely effective, serving in a sense as a &quot;poor man's named-entity recognizer&quot;. Because of the effectiveness of our tiling algorithm over large amounts of data, we do not need to use any named-entity recognition components. The component that identifies which filters to apply to the harvested n-grams, along with the actual regular expression filters themselves, contributes the most to overall performance.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Component Problems </SectionTitle> <Paragraph position="0"> Above we described how the various components contribute to the performance of the system. In this section we look at which components the errors can be attributed to. In Table 2, we show the distribution of error causes, looking at those questions for which the system returned no correct answer in the top five hypotheses.</Paragraph> <Paragraph position="1"> The biggest source of error is not knowing what units are likely to appear in an answer to a given question (e.g., How fast can a Corvette go: xxx mph). Interestingly, 34% of our errors (Time and Correct) are not really errors, but are due to time problems or cases where the answer returned is truly correct but not present in the TREC-9 answer key.
16% of the failures come from the inability of our n-gram tiling algorithm to build up the full string necessary to provide a correct answer.</Paragraph> <Paragraph position="2"> Number retrieval problems come from the fact that we cannot query the search engine for a number without specifying the number. For example, a good rewrite for the query How many islands does Fiji have would be << Fiji has <NUM> islands >>, but we are unable to give this type of query to the search engine. We classify only 12% of the failures as truly outside the system's current paradigm, rather than as something that is either already correct or fixable with minor system enhancements.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Knowing When We Don't Know </SectionTitle> <Paragraph position="0"> Typically, when deploying a question-answering system, there is some cost associated with returning incorrect answers to a user. Therefore, it is important that a QA system have some idea of how likely an answer is to be correct, so that it can choose not to answer rather than answer incorrectly. In the TREC QA track, no distinction is made in scoring between returning a wrong answer to a question for which an answer exists and returning no answer. However, to deploy a real system, we need the capability of making a trade-off between precision and recall, allowing the system not to answer a subset of questions in the hope of attaining high accuracy on the questions that it does answer.</Paragraph> <Paragraph position="1"> Most question-answering systems use hand-tuned weights that are often combined in an ad hoc fashion into a final score for an answer hypothesis (Harabagiu et al., 2000; Hovy et al., 2000; Prager et al., 2000; Soubbotin and Soubbotin, 2001; Brill et al., 2001). Is it still possible to induce a useful precision-recall (ROC) curve when the system is not outputting meaningful probabilities for answers? We have explored this issue within the AskMSR question-answering system.</Paragraph> <Paragraph position="2"> Ideally, we would like to be able to determine the likelihood of answering correctly solely from an analysis of the question. If we can determine that we are unlikely to answer a question correctly, then we need not expend the time, CPU cycles and network traffic necessary to try to answer it.</Paragraph> <Paragraph position="3"> We built a decision tree to try to predict whether the system will answer correctly, based on a set of features extracted from the question string: word unigrams and bigrams, sentence length (QLEN), the number of capitalized words in the sentence, the number of stop words in the sentence (NUMSTOP), the ratio of the number of non-stop words to stop words, and the length of the longest word (LONGWORD). We use a decision tree because we also wanted to use it as a diagnostic tool, indicating which question types we need to put further development effort into. The decision tree built from these features is shown in Figure 2.</Paragraph> <Paragraph position="4"> The first split of the tree asks whether the word &quot;How&quot; appears in the question. Indeed, the system performs worst on &quot;How&quot; question types. We do best on short &quot;Who&quot; questions with a large number of stop words.</Paragraph>
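The question-string features listed above might be extracted as in the sketch below. QLEN, NUMSTOP, and LONGWORD are the names used in the paper; the stop-word list, the tokenization, and the names NUMCAP and NONSTOP_RATIO are assumptions made for illustration.

```python
# Small illustrative stop-word list; the paper does not specify the list used.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "was", "to", "for",
              "how", "what", "who", "when", "where", "why", "do", "does", "did"}

def question_features(question):
    """Features of the question string used to predict answer correctness."""
    tokens = question.rstrip("?").split()
    n_stop = sum(t.lower() in STOP_WORDS for t in tokens)
    feats = {
        "QLEN": len(tokens),                              # sentence length
        "NUMCAP": sum(t[0].isupper() for t in tokens),    # capitalized words
        "NUMSTOP": n_stop,                                # stop words
        "NONSTOP_RATIO": (len(tokens) - n_stop) / max(n_stop, 1),
        "LONGWORD": max(len(t) for t in tokens),          # length of the longest word
    }
    for t in tokens:                                      # word unigram indicators
        feats["UNI=" + t.lower()] = 1
    for a, b in zip(tokens, tokens[1:]):                  # word bigram indicators
        feats["BI=" + a.lower() + "_" + b.lower()] = 1
    return feats

print(question_features("How many islands does Fiji have?"))
```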
<Paragraph position="5"> We can induce an ROC curve from this decision tree by sorting the leaf nodes from the highest probability of being correct to the lowest.</Paragraph> <Paragraph position="6"> Then we can gain precision at the expense of recall by not answering questions that fall into the leaf nodes with the highest probability of error. The result of doing this can be seen in Figures 3 and 4, in the line labeled &quot;Question Features&quot;. The decision tree was trained on TREC-9 data. Figure 3 shows the results when it is applied to the same training data, and Figure 4 shows the results when testing on TREC-10 data. As we can see, the decision tree overfits the training data and does not generalize sufficiently to give useful results on the TREC-10 (test) data.</Paragraph> <Paragraph position="7"> Next, we explored how well answer correctness correlates with answer score in our system. As discussed above, the final score assigned to an answer candidate is a somewhat ad hoc score based upon the number of retrieved passages the n-gram occurs in, the weight of the rewrite used to retrieve each passage, which filters apply to the n-gram, and the effects of merging n-grams in answer tiling. In Table 3, we show the correlation coefficient calculated between whether a correct answer appears in the top 5 answers output by the system and (a) the score of the system's first-ranked answer and (b) the score of the first-ranked answer minus the score of the second-ranked answer. A correlation coefficient of 1 indicates a strong positive association, whereas a correlation of -1 indicates a strong negative association. We see that there is indeed a correlation between the scores output by the system and answer accuracy, with the correlation being tighter when just considering the score of the first answer.</Paragraph> <Paragraph position="8"> Because a number of answers returned by our system are correct but scored as wrong according to the TREC answer key because of time mismatches, we also looked at the correlation while limiting ourselves to TREC-9 questions that were not time-sensitive. Using this subset of questions, the correlation coefficient between whether a correct answer appears in the system's top five answers and the score of the #1 answer increases from .363 to .401. In Figures 3 and 4, we show the ROC curve induced by deciding when not to answer a question based on the score of the first-ranked answer (the line labeled &quot;score of #1 answer&quot;). Note that the score of the top-ranked answer is a significantly better predictor of accuracy than what we attain by considering features of the question string, and it gives consistent results across the two data sets.</Paragraph> <Paragraph position="9"> Finally, we looked into whether other attributes were indicative of the likelihood of answer correctness. For every question, a set of snippets is gathered. Some of these snippets come from AND queries and others come from more refined exact string match rewrites. In Table 4, we show MRR as a function of the number of non-AND snippets retrieved. For instance, when all of the snippets come from AND queries, the resulting MRR was only .238. For questions with 100 to 400 snippets retrieved from exact string match rewrites, the MRR was .628.</Paragraph>
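The precision/recall trade-off discussed in this section can be traced out by refusing to answer whenever the score of the top-ranked answer falls below a threshold and sweeping that threshold. The sketch below illustrates the idea under assumed definitions (precision over attempted questions, recall over all questions); it is not the authors' evaluation code.

```python
def score_threshold_curve(results):
    """results: (top_answer_score, answered_correctly) pairs, one per question.
    For each threshold, the system abstains on questions whose top score is
    below it, trading recall for precision."""
    curve = []
    for t in sorted({score for score, _ in results}):
        attempted = [ok for score, ok in results if score >= t]
        precision = sum(attempted) / len(attempted)   # correct / attempted
        recall = sum(attempted) / len(results)        # correct / all questions
        curve.append((t, precision, recall))
    return curve

# Toy scores and correctness flags (illustrative only).
toy = [(0.9, True), (0.7, True), (0.6, False), (0.4, True), (0.2, False)]
for t, p, r in score_threshold_curve(toy):
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```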
<Paragraph position="10"> We built a decision tree to predict whether a correct answer appears in the top 5 answers, based on all of the question-derived features described earlier, the score of the number-one-ranked answer, and a number of additional features describing the state of the system in processing a particular query. Some of these features include the total number of matching passages retrieved, the number of non-AND matching passages retrieved, whether a filter applied, and the weight of the best rewrite rule for which matching passages were found. We show the resulting decision tree in Figure 5, and the ROC curve constructed from this decision tree in Figures 3 and 4 (the line labeled &quot;All Features&quot;). In this case, the decision tree does give a useful ROC curve on the test data (Figure 4), but it does not outperform the simple technique of using the ad hoc score of the best answer returned by the system. Still, the decision tree has proved to be a useful diagnostic in helping us understand the weaknesses of our system.</Paragraph> </Section> </Paper>