<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1004"> <Title>In Question Answering, Two Heads Are Better Than One</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 A Multi-Agent QA Architecture </SectionTitle> <Paragraph position="0"> In order to enable a multi-source and multi-strategy approach to question answering, we developed a modular and extensible QA architecture as shown in Figure 1 (Chu-Carroll et al., 2003a; Chu-Carroll et al., 2003b).</Paragraph> <Paragraph position="1"> With a consistent interface defined for each component, this architecture allows for easy plug-and-play of individual components for experimental purposes.</Paragraph> <Paragraph position="2"> In our architecture, a question is first processed by the question analysis component. The analysis results are represented as a QFrame, which minimally includes a set of question features that help activate one or more answering agents. Each answering agent takes the QFrame and generates its own set of requests to a variety of knowledge sources. This may include performing search against a text corpus and extracting answers from the resulting passages, or performing a query against a structured knowledge source, such as WordNet (Miller, 1995) or Cyc (Lenat, 1995). The (intermediate) results from the individual answering agents are then passed on to the answer resolution component, which combines and resolves the set of results, and either produces the system's final answers or feeds the intermediate results back to the answering agents for further processing.</Paragraph> <Paragraph position="3"> We have developed multiple answering agents, some general purpose and others tailored for specific question types. Figure 1 shows the answering agents currently available in PIQUANT. The knowledge-based and statistical answering agents are general-purpose agents that adopt different processing strategies and consult a number of different text resources. The definition-Q agent targets definition questions (e.g., &quot;What is penicillin?&quot; and &quot;Who is Picasso?&quot;) with a technique called Virtual Annotation using the external knowledge source WordNet (Prager et al., 2001). The KSP-based answering agent focuses on a subset of factoid questions with specific logical forms, such as capital(?COUNTRY) and state tree(?STATE). The answering agent sends requests to the KSP (Knowledge Sources Portal), which returns, if possible, an answer from a structured knowledge source (Chu-Carroll et al., 2003a).</Paragraph> <Paragraph position="4"> In the rest of this paper, we briefly describe our two general-purpose answering agents. We then focus on a multi-level answer resolution algorithm, applicable at different points in the QA process of these two answering agents. Finally, we discuss experiments conducted to discover effective methods for combining results from multiple answering agents.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Component Answering Agents </SectionTitle> <Paragraph position="0"> We focus on two end-to-end answering agents designed to answer short, fact-seeking questions from a collection of text documents, as motivated by the requirements of the TREC QA track (Voorhees, 2003). Both answering agents adopt the classic pipeline architecture, consisting roughly of question analysis, passage retrieval, and answer selection components. 
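To make the data flow of Figure 1 concrete, the following Python sketch outlines the kind of interfaces such a pipeline implies. It is a minimal illustration under our own assumptions, not PIQUANT's actual code; the class and function names (QFrame, AnsweringAgent, analyze_question, resolve_answers) are hypothetical.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class QFrame:
        """Hypothetical question-analysis result: the question plus features
        such as expected answer types and keywords."""
        question: str
        answer_types: List[str] = field(default_factory=list)
        keywords: List[str] = field(default_factory=list)

    @dataclass
    class Candidate:
        text: str
        confidence: float

    class AnsweringAgent:
        """Common interface assumed for every answering agent."""
        def answer(self, qframe: QFrame) -> List[Candidate]:
            raise NotImplementedError

    def analyze_question(question: str) -> QFrame:
        # Placeholder: a real implementation would assign answer types and keywords.
        return QFrame(question=question)

    def resolve_answers(per_agent: List[List[Candidate]]) -> List[Candidate]:
        # Placeholder: combine and re-rank candidates from all agents.
        merged = [c for candidates in per_agent for c in candidates]
        return sorted(merged, key=lambda c: c.confidence, reverse=True)

    def ask(question: str, agents: List[AnsweringAgent]) -> List[Candidate]:
        qframe = analyze_question(question)                  # question analysis
        per_agent = [agent.answer(qframe) for agent in agents]
        return resolve_answers(per_agent)                    # answer resolution

Under this reading, ask() hides whether an agent consults a text corpus or a structured source such as WordNet or Cyc behind the common answer() interface.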
Although the answering agents adopt fundamentally different strategies in their individual components, they have performed quite comparably in past TREC QA tracks (Voorhees, 2001; Voorhees, 2002).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Knowledge-Based Answering Agent </SectionTitle> <Paragraph position="0"> Our first answering agent utilizes a primarily knowledge-driven approach, based on Predictive Annotation (Prager et al., 2000). A key characteristic of this approach is that potential answers, such as person names, locations, and dates, in the corpus are predictively annotated. In other words, the corpus is indexed not only with keywords, as is typical for most search engines, but also with the semantic classes of these pre-identified potential answers.</Paragraph> <Paragraph position="1"> During the question analysis phase, a rule-based mechanism is employed to select one or more expected answer types, from a set of about 80 classes used in the predictive annotation process, along with a set of question keywords. A weighted search engine query is then constructed from the keywords, their morphological variations, synonyms, and the answer type(s). The search engine returns a hit list of typically 10 passages, each consisting of 1-3 sentences. The candidate answers in these passages are identified and ranked based on three criteria: 1) match in semantic type between candidate answer and expected answer, 2) match in weighted grammatical relationships between question and answer passages, and 3) frequency of answer in candidate passages (redundancy).</Paragraph> <Paragraph position="2"> The answering agent returns the top n ranked candidate answers along with a confidence score for each answer.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Statistical Answering Agent </SectionTitle> <Paragraph position="0"> The second answering agent takes a statistical approach to question answering (Ittycheriah, 2001; Ittycheriah et al., 2001). It models the distribution p(c|q,a), which measures the &quot;correctness&quot; (c) of an answer (a) to a question (q), by introducing a hidden variable representing the answer type (e) as follows: p(c|q,a) = \sum_e p(c,e|q,a) = \sum_e p(c|e,q,a) p(e|q,a). The term p(e|q,a) is the answer type model, which predicts, from the question and a proposed answer, the answer type they both satisfy. The term p(c|e,q,a) is the answer selection model.</Paragraph> <Paragraph position="1"> Given a question, an answer, and the predicted answer type, it seeks to model the correctness of this configuration. These distributions are modeled using a maximum entropy formulation (Berger et al., 1996), using training data which consists of human judgments of question-answer pairs. For the answer type model, 13K questions were annotated with 31 categories. For the answer selection model, 892 questions from the TREC 8 and TREC 9 QA tracks were used, along with 4K trivia questions.</Paragraph> <Paragraph position="2"> During runtime, the question is first analyzed by the answer type model, which selects one out of a set of 31 types for use by the answer selection model.
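To illustrate how the two models interact at scoring time, the sketch below marginalizes over the hidden answer type exactly as in the equation above. The callable interfaces and the three-type list are hypothetical placeholders for the maximum entropy models and the 31 categories, not the agent's actual implementation.

    from typing import Callable, Dict

    ANSWER_TYPES = ["PERSON", "DATE", "LOCATION"]  # illustrative stand-in for the 31 categories

    def answer_correctness(
        question: str,
        answer: str,
        p_type: Callable[[str, str], Dict[str, float]],   # p(e | q, a): answer type model
        p_correct: Callable[[str, str, str], float],       # p(c=1 | e, q, a): answer selection model
    ) -> float:
        """Estimate p(c=1 | q, a) by summing over the hidden answer type e."""
        type_dist = p_type(question, answer)
        return sum(p_correct(e, question, answer) * type_dist.get(e, 0.0)
                   for e in ANSWER_TYPES)

Ranking candidates then amounts to sorting them by answer_correctness(question, answer, type_model, selection_model).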
In parallel with the answer type prediction, the question is expanded using local context analysis (Xu and Croft, 1996) with an encyclopedia, and the top 1000 documents are retrieved by the search engine.</Paragraph> <Paragraph position="3"> From these documents, the top 100 passages are chosen that 1) maximize the question word match, 2) have the desired answer type, 3) minimize the dispersion of question words, and 4) have syntactic structures similar to that of the question. From these passages, candidate answers are extracted and ranked using the answer selection model. The top n candidate answers are then returned, each with an associated confidence score.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Answer Resolution </SectionTitle> <Paragraph position="0"> Given two answering agents with the same pipeline architecture, there are multiple points in the process at which (intermediate) results can be combined, as illustrated in Figure 2. More specifically, it is possible for one answering agent to provide input to the other after the question analysis, passage retrieval, and answer selection phases.</Paragraph> <Paragraph position="1"> In PIQUANT, the knowledge-based agent may accept input from the statistical agent after each of these three phases.1 The contributions from the statistical agent are taken into consideration by the knowledge-based answering agent in a phase-dependent fashion. The rest of this section details our combination strategies for each phase.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Question-Level Combination </SectionTitle> <Paragraph position="0"> One of the key tasks of the question analysis component is to determine the expected answer type, such as PERSON for &quot;Who discovered America?&quot; and DATE for &quot;When did World War II end?&quot; This information is taken into account by most existing QA systems when ranking candidate answers, and can also be used in the passage retrieval process to increase the precision of candidate passages.</Paragraph> <Paragraph position="1"> We seek to improve the knowledge-based agent's performance in passage retrieval and answer selection through better answer type identification by consulting the statistical agent's expected answer type. This task, however, is complicated by the fact that QA systems employ different sets of answer types, often with different granularities and/or with overlapping types. For instance, while one system may generate ROYALTY for the question &quot;Who was the King of France in 1702?&quot;, another system may produce PERSON as the most specific answer type in its repertoire. This is a serious problem in our setting, as the knowledge-based agent uses over 80 answer types while the statistical agent adopts only 31 categories.</Paragraph> <Paragraph position="2"> In order to distinguish actual answer type discrepancies from those due to granularity differences, we first manually created a mapping between the two sets of answer types. This mapping specifies, for each answer type used by the statistical agent, a set of possible corresponding types used by the knowledge-based agent.
For example, the GEOLOGICALOBJ class is mapped to a set of finer-grained classes: RIVER, MOUNTAIN, LAKE, and OCEAN.</Paragraph> <Paragraph position="3"> At processing time, the statistical agent's answer type is mapped to the knowledge-based agent's classes (SA-types) and merged with the types selected by the knowledge-based agent itself (KBA-types) as follows: 1. If the intersection of KBA-types and SA-types is non-null, i.e., the two agents produced consistent answer types, then the merged set is KBA-types.</Paragraph> <Paragraph position="4"> 2. Otherwise, the two sets of answer types are truly in disagreement, and the merged set is the union of KBA-types and SA-types.</Paragraph> <Paragraph position="5"> The merged answer types are then used by the knowledge-based agent in further processing.</Paragraph> <Paragraph position="6"> 1 Although it is possible for the statistical agent to receive input from the knowledge-based agent as well, we have not pursued that option because of implementation issues.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Passage-Level Combination </SectionTitle> <Paragraph position="0"> The passage retrieval component selects, from a large text corpus, a small number of short passages from which answers are identified. Oftentimes, multiple passages that answer a question are retrieved. Some of these passages may be better suited than others for the answer selection algorithm employed downstream. For example, consider &quot;When was Benjamin Disraeli prime minister?&quot;, whose answer can be found in both passages below: 1. Benjamin Disraeli, who had become prime minister in 1868, was born into Judaism but was baptized a Christian at the age of 12.</Paragraph> <Paragraph position="1"> 2. France had a Jewish prime minister in 1936, England in 1868, and Spain, of all countries, in 1835, but none of them, Leon Blum, Benjamin Disraeli or Juan Alvarez Mendizabel, were devoutly observant, as Lieberman is.</Paragraph> <Paragraph position="2"> Although the correct answer, 1868, is present in both passages, it is substantially easier to identify the answer from the first passage, where it is directly stated, than from the second passage, where recognition of parallel constructs is needed to identify the correct answer. Because of strategic differences in question analysis and passage retrieval, our two answering agents often retrieve different passages for the same question. Thus, we perform passage-level combination to make a wider variety of passages available to the answer selection component, as shown in Figure 2. The potential advantages are threefold. First, passages from agent 2 may contain answers absent in passages retrieved by agent 1. Second, agent 2 may have retrieved passages better suited for the downstream answer selection algorithm than those retrieved by agent 1. Third, passages from agent 2 may contain additional occurrences of the correct answer, which boosts the system's confidence in the answer through the redundancy measure.2 Our passage-level combination algorithm adds to the passages extracted by the knowledge-based agent the top-ranked passages from the statistical agent that contain candidate answers of the right type.
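Returning briefly to the question-level combination of Section 4.1, the merging rules above can be sketched as follows. Only the GEOLOGICALOBJ entry of the mapping comes from the text; the function name and the treatment of unmapped types are our own illustrative assumptions.

    from typing import Dict, Set

    # Illustrative fragment of the manually created SA -> KBA type mapping;
    # only the GEOLOGICALOBJ entry is taken from the text above.
    SA_TO_KBA: Dict[str, Set[str]] = {
        "GEOLOGICALOBJ": {"RIVER", "MOUNTAIN", "LAKE", "OCEAN"},
    }

    def merge_answer_types(kba_types: Set[str], sa_type: str) -> Set[str]:
        """Merge the knowledge-based agent's types with the statistical agent's type."""
        sa_types = SA_TO_KBA.get(sa_type, {sa_type})   # assumption: unmapped types pass through unchanged
        if kba_types & sa_types:
            # Rule 1: the agents agree (up to granularity); keep the KBA types.
            return kba_types
        # Rule 2: genuine disagreement; use the union of both sets.
        return kba_types | sa_types

With this mapping, merge_answer_types({"LAKE"}, "GEOLOGICALOBJ") returns {"LAKE"} under rule 1, while disjoint type sets fall through to the union under rule 2.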
More specifically, for the passage-level combination, the statistical agent's passages are semantically annotated and the top 10 passages containing at least one candidate of the expected answer type(s) are selected.3</Paragraph> <Paragraph position="3"> 2 On the other hand, such redundancy may result in error compounding, as discussed in Section 5.3.</Paragraph> <Paragraph position="4"> 3 We selected the top 10 passages so that the same number of passages are considered from both answering agents.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Answer-Level Combination </SectionTitle> <Paragraph position="0"> The answer selection component identifies, from a set of passages, the top n answers for the given question, with their associated confidence scores. An answer-level combination algorithm takes the top answer(s) from the individual answering agents and determines the overall best answer(s). Of our three combination algorithms, this most closely resembles traditional ensemble methods, as voting takes place among the end results of the individual answering agents to determine the final output of the ensemble.</Paragraph> <Paragraph position="1"> We developed two answer-level combination algorithms, both utilizing a simple confidence-based voting mechanism, based on the premise that answers selected by both agents with high confidence are more likely to be correct than those identified by only one agent.4 In both algorithms, named entity normalization is first performed on all candidate answers considered. In the first algorithm, only the top answer from each agent is taken into account. If the two top answers are equivalent, the answer is selected with the combined confidence from both agents; otherwise, the more confident answer is selected.5 In the second algorithm, the top 5 answers from each agent are allowed to participate in the voting process. Each instance of an answer votes with a weight equal to its confidence value, and the weights of equivalent answers are again summed. The answer with the highest weight, or confidence value, is selected as the system's final answer. Since, in our evaluation, the second algorithm uniformly outperforms the first, it is adopted as our answer-level combination algorithm in the rest of the paper.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Performance Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Experimental Setup </SectionTitle> <Paragraph position="0"> To assess the effectiveness of our multi-level answer resolution algorithm, we devised experiments to evaluate the impact of the question, passage, and answer-level combination algorithms described in the previous section.</Paragraph> <Paragraph position="1"> The baseline systems are the knowledge-based and statistical agents performing individually against a single reference corpus. In addition, our earlier experiments showed that when employing a single answer-finding strategy, consulting multiple text corpora yielded better performance than using a single corpus. We thus configured a version of our knowledge-based agent to make use of three available text corpora,6 the AQUAINT corpus (news articles from 1998-2000), the TREC corpus (news articles from 1988-1994),7 and a subset of the Encyclopedia Britannica.
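Stepping back to the answer-level combination of Section 4.3, the second (top-5) voting algorithm admits a straightforward sketch. The normalize function is a hypothetical stand-in for the named entity normalization step, and the data layout is assumed rather than taken from PIQUANT.

    from collections import defaultdict
    from typing import Dict, List, Tuple

    def normalize(answer: str) -> str:
        # Placeholder for named entity normalization (e.g., matching variant surface forms).
        return answer.strip().lower()

    def vote(agent_answers: List[List[Tuple[str, float]]], top_k: int = 5) -> str:
        """Confidence-weighted voting over each agent's top-k (answer, confidence) pairs."""
        weights: Dict[str, float] = defaultdict(float)
        surface_form: Dict[str, str] = {}
        for answers in agent_answers:
            for answer, confidence in answers[:top_k]:
                key = normalize(answer)
                weights[key] += confidence      # equivalent answers pool their confidence
                surface_form.setdefault(key, answer)
        best = max(weights, key=weights.get)
        return surface_form[best]

Because equivalent answers pool their confidence, an answer supported by both agents tends to outweigh one proposed by a single agent at a similar confidence level.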
The multi-source version of the knowledge-based agent just described will be used in all answer resolution experiments in conjunction with the statistical agent. We configured multiple versions of PIQUANT to evaluate our question, passage, and answer-level combination algorithms individually and cumulatively. For cumulative effects, we 1) combined the algorithms pair-wise, and 2) employed all three algorithms together. The two test sets were selected from the TREC 10 and 11 QA track questions (Voorhees, 2002; Voorhees, 2003). For both test sets, we eliminated those questions that did not have known answers in the reference corpus. Furthermore, from the TREC 10 test set, we discarded all definition questions,8 since the knowledge-based agent adopts a specialized strategy for handling definition questions which greatly reduces potential contributions from other answering agents. This results in a TREC 10 test set of 313 questions and a TREC 11 test set of 453 questions.</Paragraph> <Paragraph position="2"> 8 Definition questions were intentionally excluded by the track coordinator in the TREC 11 test set.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Experimental Results </SectionTitle> <Paragraph position="0"> We ran each of the baseline and combined systems on the two test sets. For each run, the system outputs its top answer and its confidence score for each question. All answers for a run are then sorted in descending order of the confidence scores. Two established TREC QA evaluation metrics are adopted to assess the results for each run: 1. % Correct: Percentage of correct answers.</Paragraph> <Paragraph position="1"> 2. Average Precision: A confidence-weighted score that rewards systems with high confidence in correct answers, defined as \frac{1}{N} \sum_{i=1}^{N} \frac{\#\text{ correct up to question } i}{i}, where N is the number of questions.</Paragraph> <Paragraph position="2"> Table 1 shows our experimental results. The top section shows the comparable baseline results from the statistical agent (SA-SS) and the single-source knowledge-based agent (KBA-SS). It also includes results for the multi-source knowledge-based agent (KBA-MS), which improve upon those for its single-source counterpart (KBA-SS).</Paragraph> <Paragraph position="3"> The middle section of the table shows the answer resolution results, including applying the question, passage, and answer-level combination algorithms individually (Q, P, and A, respectively), applying them pair-wise (Q+P, P+A, and Q+A), and employing all three algorithms (Q+P+A). Finally, the last row of the table shows the relative improvement obtained by comparing the best performing system configuration (highlighted in boldface) with the better performing single-source, single-strategy baseline system (SA-SS or KBA-SS, in italics).</Paragraph> <Paragraph position="4"> Overall, PIQUANT's multi-strategy and multi-source approach achieved a 35.0% relative improvement in the number of correct answers and a 32.8% improvement in average precision on the TREC 11 data set. Of the combined improvement, approximately half was achieved by the multi-source aspect of PIQUANT, while the other half was obtained by PIQUANT's multi-strategy feature. Although the absolute average precision values are comparable on both test sets and the absolute percentage of correct answers is lower on the TREC 11 data, the improvement is greater on TREC 11 in both cases.
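For concreteness, the average precision metric defined above can be computed as in the following sketch, assuming the per-question correctness judgments have already been sorted in descending order of confidence; this is a direct transcription of the formula, not the official TREC scoring code.

    from typing import List

    def average_precision(correct_sorted_by_confidence: List[bool]) -> float:
        """Confidence-weighted score: (1/N) * sum_i (# correct in the first i answers) / i."""
        n = len(correct_sorted_by_confidence)
        total, correct_so_far = 0.0, 0
        for i, is_correct in enumerate(correct_sorted_by_confidence, start=1):
            correct_so_far += int(is_correct)
            total += correct_so_far / i
        return total / n if n else 0.0

For example, average_precision([True, False, True]) evaluates to (1/3)(1/1 + 1/2 + 2/3), roughly 0.72.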
The difference stems from the fact that the TREC 10 questions were taken into account for manual rule refinement in the knowledge-based agent, resulting in higher baselines on the TREC 10 test set. We believe that the larger improvement on the previously unseen TREC 11 data is a more reliable estimate of PIQUANT's performance on future test sets.</Paragraph> <Paragraph position="7"> We applied an earlier version of our combination algorithms, which performed at a level between our current P and P+A algorithms, in our submission to the TREC 11 QA track.</Paragraph> <Paragraph position="8"> Using the average precision metric, that version of PIQUANT was among the top 5 best performing systems out of 67 runs submitted by 34 groups.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Discussion and Analysis </SectionTitle> <Paragraph position="0"> A cursory examination of the results in Table 1 allows us to draw two general conclusions about PIQUANT's performance. First, all three combination algorithms applied individually improved upon the baseline using both evaluation metrics on both test sets. In addition, overall performance is generally better the later in the process the combination occurs, i.e., the answer-level combination algorithm outperformed the passage-level combination algorithm, which in turn outperformed the question-level combination algorithm. Second, the cumulative improvement from multiple combination algorithms is in general greater than that from the components. For instance, the Q+A algorithm uniformly outperformed the Q and A algorithms alone. Note, however, that the Q+P+A algorithm did not uniformly achieve the highest performance; we suspect that this is because of compounding errors that occurred during the multiple combination process.</Paragraph> <Paragraph position="1"> In ensemble methods, the individual components must make different mistakes in order for the combined system to potentially perform better than the component systems (Dietterich, 1997). We examined the differences in results between the two answering agents from their question analysis, passage retrieval, and answer selection components. We focused our analysis on the potential gain/loss from incorporating contributions from the statistical agent, and how the potential was realized as actual performance gain/loss in our end-to-end system.</Paragraph> <Paragraph position="2"> At the question level, we examined those questions for which the two agents proposed incompatible answer types. On the TREC 10 test set, the statistical agent introduced correct answer types in 6 cases and incorrect answer types in 9 cases. As a result, in some cases the question-level combination algorithm improved system performance (comparing A and Q+A) and in others it degraded performance (comparing P and Q+P). On the other hand, on the TREC 11 test set, the statistical agent introduced correct and incorrect answer types in 15 and 6 cases, respectively. As a result, in most cases performance improved when the question-level combination algorithm was invoked. The difference in question analysis performance again reflects the fact that TREC 10 questions were used in question analysis rule refinement in the knowledge-based agent.</Paragraph> <Paragraph position="3"> At the passage level, we examined, for each question, whether the candidate passages contained the correct answer. Table 2 shows the distribution of questions for which correct answers were (+) and were not (-) present in the passages for both agents.
The boldfaced cells represent questions for which the statistical agent retrieved passages with correct answers while the knowledge-based agent did not. There were 43 and 58 such questions in the TREC 10 and TREC 11 test sets, respectively, yet employing the passage-level combination algorithm resulted in only an additional 18 and 8 correct answers on the two test sets. This is because the statistical agent proposes, in its 10 passages, an average of 29 candidate answers of the proper semantic type per question, most of which are incorrect. As the downstream answer selection component takes redundancy into account in answer ranking, incorrect answers may reinforce one another and become top-ranked answers. This suggests that the relative contributions of our answer selection features may not be optimally tuned for our multi-agent approach to QA. We plan to investigate this issue in future work.</Paragraph> <Paragraph position="4"> At the answer level, we analyzed each agent's top 5 answers, used in the combination algorithm's voting process. Table 3 shows the distribution of questions for which a correct answer was found in 1st place, in 2nd-5th place, or not found in the top 5. Since we employ a linear voting strategy based on confidence scores, we classify the cells in Table 3 as follows, based on the perceived likelihood that the correct answer for a question in each cell wins in the voting process. The boldfaced and underlined cells contain highly likely candidates, since a correct answer was found in 1st place by both agents.9 The boldfaced cells consist of likely candidates, since a 1st place correct answer was supported by a 2nd-5th place answer.</Paragraph> <Paragraph position="5"> The italicized and underlined cells contain possible candidates, while the rest of the cells cannot produce correct 1st place answers using our current voting algorithm. On TREC 10 data, 194 questions fall into the highly likely, likely, and possible categories, out of which the voting algorithm successfully selected 155 correct answers in 1st place. On TREC 11 data, 197 correct answers were selected out of 248 questions that fall into these categories. These results represent success rates of 79.9% and 79.4% for our answer-level combination algorithm on the two test sets.</Paragraph> <Paragraph position="6"> 9 These cells are not marked as definite because in a small number of cases, the two answers are not equivalent. For example, for the TREC 9 question, &quot;Who is the emperor of Japan?&quot;, Hirohito, Akihito, and Taisho are all considered correct answers based on the reference corpus.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> There has been much work in employing ensemble methods to increase system performance in machine learning.</Paragraph> <Paragraph position="1"> In NLP, such methods have been applied to tasks such as POS tagging (Brill and Wu, 1998), word sense disambiguation (Pedersen, 2000), parsing (Henderson and Brill, 1999), and machine translation (Frederking and Nirenburg, 1994).</Paragraph> <Paragraph position="2"> In question answering, a number of researchers have investigated federated systems for identifying answers to questions. For example, (Clarke et al., 2003) and (Lin et al., 2003) employ techniques for utilizing both unstructured text and structured databases for question answering.
However, the approaches taken by both these systems differ from ours in that they enforce an order between the two strategies: for select question types they first attempt to locate answers in structured databases, falling back to unstructured text when that fails. In contrast, we explore both options in parallel and combine the results from multiple answering agents.</Paragraph> <Paragraph position="4"> The multi-agent approach to question answering most similar to ours is that by Burger et al. (2003). They applied ensemble methods to combine the 67 runs submitted to the TREC 11 QA track, using an unweighted centroid method for selecting among the 67 proposed answers for each question. However, their combined system did not outperform the top-scoring system(s). Furthermore, their approach differs from ours in that they focused on combining the end results of a large number of systems, while we investigated a tightly-coupled design for combining two answering agents.</Paragraph> </Section> </Paper>