<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1114"> <Title>Methods for Using Textual Entailment in Open-Domain Question Answering</Title> <Section position="4" start_page="905" end_page="905" type="metho"> <SectionTitle> QUESTION ANSWERING SYSTEM </SectionTitle> <Paragraph position="0"> lows. Section 2 describes the three methods of using textual entailment in open-domain question answering that we have identified, while Section 3 presents the textual entailment system we have used. Section 4 details our experimental methods and our evaluation results. Finally, Section 5 provides a discussion of our findings, and Section 6 summarizes our conclusions.</Paragraph> </Section> <Section position="5" start_page="905" end_page="906" type="metho"> <SectionTitle> 2 Integrating Textual Entailment in </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="905" end_page="906" type="sub_section"> <SectionTitle> Question Answering </SectionTitle> <Paragraph position="0"> In this section, we describe three different methods for integrating a textual entailment (TE) system into the architecture of an open-domain Q/A system.</Paragraph> <Paragraph position="1"> Work on the semantics of questions (Groenendijk, 1999; Lewis, 1988) has argued that the formal answerhood relation found between a question and a set of (correct) answers can be cast in terms of logical entailment. Under these approaches (referred to as licensing by (Groenendijk, 1999) and aboutness by (Lewis, 1988)), p is considered to be an answer to a question ?q iff ?q logically entails the set of worlds in which p is true (i.e., ?p). While the notion of textual entailment has been defined far less rigorously than logical entailment, we believe that the recognition of textual entailment between a question and a set of candidate answers, or between a question and questions generated from answers, can enable Q/A systems to identify correct answers with greater precision than current keyword- or pattern-based techniques.</Paragraph> <Paragraph position="2"> As illustrated in Figure 1, most open-domain Q/A systems consist of a sequence of three modules: (1) a question processing (QP) module; (2) a passage retrieval (PR) module; and (3) an answer processing (AP) module. Questions are first submitted to the QP module, which extracts a set of relevant keywords from the text of the question and identifies the question's expected answer type (EAT). The keywords, along with the question's EAT, are then used by the PR module to retrieve a ranked list of paragraphs which may contain answers to the question. These paragraphs are then sent to the AP module, which extracts an exact candidate answer from each passage and ranks each candidate answer according to the likelihood that it is a correct answer to the original question.</Paragraph> <Paragraph position="3"> Method 1. In Method 1, answers in a ranked list that do not meet the minimum conditions for TE are removed from consideration; the remaining answers are then re-ranked based on the entailment confidence (a real-valued number ranging from 0 to 1) that the TE system assigns to each remaining example. The system then outputs a new ranked list that contains only answers entailed by the user's question.</Paragraph> <Paragraph position="4"> Table 1 provides an example where Method 1 could be used to make the right prediction for a set of answers.
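As an illustration of the re-ranking step just described, the sketch below shows how a ranked answer list can be filtered and re-ordered. It is not the authors' implementation: the entailment call is a toy word-overlap stand-in for the TE system's judgment and confidence output, and the threshold is assumed for illustration only.

from typing import Callable, List, Tuple

def toy_entails(question: str, answer: str) -> Tuple[bool, float]:
    """Toy stand-in for the TE system: word-overlap 'confidence' with a fixed
    threshold.  The real system returns a yes/no judgment and a confidence in
    [0, 1] from its decision-tree classifier (Section 3)."""
    q_words = set(question.lower().split())
    a_words = set(answer.lower().split())
    confidence = len(q_words & a_words) / max(len(q_words), 1)
    return confidence >= 0.3, confidence

def rerank_answers(question: str,
                   ranked_answers: List[str],
                   entails: Callable[[str, str], Tuple[bool, float]] = toy_entails
                   ) -> List[Tuple[str, float]]:
    """Method 1 (sketch): drop answers that are not entailed by the question,
    then order the survivors by the entailment confidence assigned to each."""
    kept = []
    for answer in ranked_answers:
        judged, confidence = entails(question, answer)
        if judged:                      # keep only entailed answers
            kept.append((answer, confidence))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

In Table 1, this is the step that promotes an entailed answer from a low initial rank to the top position while discarding non-entailed candidates.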
Even though A1 was ranked in sixth position, the identification of a high-confidence positive entailment enabled it to be returned as the top answer. In contrast, the recognition of a negative entailment for A2 caused this answer to be dropped from consideration altogether.</Paragraph> <Paragraph position="5"> Q1: What did Peter Minuit buy for the equivalent of $24.00?
(0.89) 1st: Everyone knows that, back in 1626, Peter Minuit bought Manhattan from the Indians for $24 worth of trinkets.</Paragraph> <Paragraph position="7"> In 1626, an enterprising Peter Minuit flagged down some passing locals, plied them with beads, cloth and trinkets worth an estimated $24, and walked away with the whole island.</Paragraph> <Paragraph position="8"> Method 2. Because answer processing is a resource-intensive process for most Q/A systems, we expect that TE information can be used to limit the number of passages considered during AP. As illustrated in Method 2 in Figure 1, the list of passages retrieved by the PR module can be ranked (or filtered) using TE information. Once ranking is complete, answer extraction takes place only on the set of entailed passages that the system considers likely to contain a correct answer to the user's question.</Paragraph> <Paragraph position="9"> Method 3. In previous work (Harabagiu et al., 2005b), we described techniques that can be used to automatically generate well-formed natural language questions from the text of paragraphs retrieved by a PR module. In our current system, sets of automatically generated questions (AGQs) are created using a stand-alone AutoQUAB generation module, which assembles question-answer pairs (known as QUABs) from the top-ranked passages returned in response to a question. Table 2 lists some of the questions that this module has produced for the question Q2: How hot does the inside of an active volcano get?</Paragraph> <Paragraph position="10"> Q2: How hot does the inside of an active volcano get?
A2: Tamagawa University volcano expert Takeyo Kosaka said lava fragments belched out of the mountain on January 31 were as hot as 300 degrees Fahrenheit. The intense heat from a second eruption on Tuesday forced rescue operations to stop after 90 minutes. Because of the high temperatures, the bodies of only five of the volcano's initial victims were retrieved.</Paragraph> <Paragraph position="11"> Positive Entailment
AGQ1: What temperature were the lava fragments belched out of the mountain on January 31?
AGQ2: How many degrees Fahrenheit were the lava fragments belched out of the mountain on January 31?
Negative Entailment
AGQ3: When did rescue operations have to stop?
AGQ4: How many bodies of the volcano's initial victims were retrieved?
Following (Groenendijk, 1999), we expect that if a question ?q logically entails another question ?q′, then some subset of the answers entailed by ?q′ should also be interpreted as valid answers to ?q. By establishing TE between a question and AGQs derived from passages identified by the Q/A system for that question, we expect to identify a set of answer passages that contain correct answers to the original question.
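A minimal sketch of this selection procedure is given below. The helper names are illustrative only: the AGQ generator stands in for AutoQUAB and the entailment call stands in for the TE system, and the fallback to confidence ranking when no AGQ is entailed mirrors the behavior described later in this section.

from typing import Callable, List, Tuple

def select_by_agq_entailment(question: str,
                             passages: List[str],
                             generate_agqs: Callable[[str], List[str]],
                             entails: Callable[[str, str], Tuple[bool, float]]
                             ) -> str:
    """Method 3 (sketch): keep passages whose automatically generated
    questions (AGQs) are entailed by the original question; when none are,
    fall back to ranking every passage by its best entailment confidence."""
    entailed, fallback = [], []
    for passage in passages:
        for agq in generate_agqs(passage):
            judged, confidence = entails(question, agq)
            fallback.append((confidence, passage))
            if judged:
                entailed.append((confidence, passage))
    pool = entailed if entailed else fallback
    if not pool:
        return ""                        # no passages (or no AGQs) available
    pool.sort(key=lambda pair: pair[0], reverse=True)
    return pool[0][1]                    # passage paired with the best AGQ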
For example, in Table 2, we find that entailment between questions indicates the correctness of a candidate answer: here, establishing that Q2 entails AGQ1 and AGQ2 (but not AGQ3 or AGQ4) enables the system to select A2 as the correct answer.</Paragraph> <Paragraph position="12"> When at least one of the AGQs generated by the AutoQUAB module is entailed by the original question, all AGQs that do not reach TE are filtered from consideration; the remaining passages are assigned an entailment confidence score and are sent to the AP module in order to provide an exact answer to the question. Following this process, candidate answers extracted by the AP module are re-associated with their AGQs and resubmitted to the TE system (as in Method 1). Question-answer pairs deemed to be positive instances of entailment are then stored in a database and used as additional training data for the AutoQUAB module. When no AGQs are found to be entailed by the original question, however, passages are ranked according to their entailment confidence and sent to AP for further processing and validation.</Paragraph> </Section> </Section> <Section position="6" start_page="906" end_page="909" type="metho"> <SectionTitle> 3 The Textual Entailment System </SectionTitle> <Paragraph position="0"> Processing textual entailment, or recognizing whether the information expressed in a text can be inferred from the information expressed in another text, can be performed in four ways. We can try to (1) derive linguistic information from the pair of texts and cast inference recognition as a classification problem; (2) evaluate the probability that an entailment holds between the two texts; (3) represent the knowledge from the pair of texts in some representation language that can be associated with an inferential mechanism; or (4) use the classical AI definition of entailment, build models of the world in which the two texts are respectively true, and then check whether the models associated with one text are included in the models associated with the other text. Although we believe that each of these methods should be investigated fully, we decided to focus only on the first method, which allowed us to build the TE system illustrated in Figure 2.</Paragraph> <Paragraph position="1"> Our TE system consists of (1) a Preprocessing Module, which derives linguistic knowledge from the text pair; (2) an Alignment Module, which takes advantage of the notions of lexical alignment and textual paraphrases; and (3) a Classification Module, which uses a machine learning classifier (based on decision trees) to make an entailment judgment for each pair of texts.</Paragraph> <Paragraph position="2"> As described in (Hickl et al., 2006), the Preprocessing Module is used to syntactically parse texts, identify the semantic dependencies of predicates, label named entities, normalize temporal and spatial expressions, resolve instances of coreference, and annotate predicates with polarity, tense, and modality information.</Paragraph> <Paragraph position="3"> Following preprocessing, texts are sent to the Alignment Module, which uses a Maximum Entropy-based classifier in order to estimate the probability that pairs of constituents selected from the texts encode corresponding information that could be used to inform an entailment judgment.
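To make this pairing step concrete, the sketch below enumerates the chunk-pair matrix detailed in Section 3.1 and scores each pair. The scorer here is a toy word-overlap stand-in for the Maximum Entropy classifier, and the acceptance threshold is an assumption introduced purely for illustration.

from itertools import product
from typing import Callable, List, Tuple

def toy_alignment_score(chunk_t: str, chunk_h: str) -> float:
    """Toy stand-in scorer: normalized word overlap (illustration only; the
    real module uses a Maximum Entropy classifier over the features listed
    in Section 3.1)."""
    a, b = set(chunk_t.lower().split()), set(chunk_h.lower().split())
    return len(a & b) / max(len(a | b), 1)

def align_chunks(text_chunks: List[str],
                 hyp_chunks: List[str],
                 score: Callable[[str, str], float] = toy_alignment_score,
                 threshold: float = 0.5) -> List[Tuple[str, str, float]]:
    """Sketch of lexical alignment: score every pair in the Ct x Ch matrix
    and keep pairs whose alignment probability p(a) clears a threshold."""
    alignments = []
    for chunk_t, chunk_h in product(text_chunks, hyp_chunks):
        p = score(chunk_t, chunk_h)
        if p >= threshold:
            alignments.append((chunk_t, chunk_h, p))
    return sorted(alignments, key=lambda triple: triple[2], reverse=True)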
This module assumes that, since sets of entailing texts necessarily predicate about the same set of individuals or events, systems should be able to identify elements from each text that convey similar types of presuppositions. Examples of predicates and arguments aligned by this module are presented in Figure 3.</Paragraph> <Paragraph position="4"> Aligned constituents are then used to extract sets of phrase-level alternations (or paraphrases) from the WWW that can capture correspondences between texts longer than individual constituents. The top 8 candidate paraphrases for two of the aligned elements from Figure 3 are presented in Table 3.</Paragraph> <Paragraph position="5"> Finally, the Classification Module employs a decision tree classifier in order to determine whether an entailment relationship exists for each pair of texts. This classifier is learned using features extracted from the previous modules, including features derived from (1) the (lexical) alignment of the texts, (2) syntactic and semantic dependencies discovered in each text passage, (3) paraphrases derived from web documents, and (4) semantic and pragmatic annotations. (A complete list of features can be found in Figure 4.) Based on these features, the classifier outputs both an entailment judgment (either yes or no) and a confidence value, which is used to rank answers or paragraphs in the architecture illustrated in Figure 1.</Paragraph> <Section position="1" start_page="907" end_page="907" type="sub_section"> <SectionTitle> Judgment Paraphrase </SectionTitle> <Paragraph position="0"> (Table 3: candidate paraphrases and their judgments.)
YES: lava fragments in pyroclastic flows can reach 400 degrees
YES: an active volcano can get up to 2000 degrees
NO: an active volcano above you are slopes of 30 degrees
YES: the active volcano with steam reaching 80 degrees
YES: lava fragments such as cinders may still be as hot as 300 degrees
NO: lava is a liquid at high temperature: typically from 700 degrees</Paragraph> </Section> <Section position="2" start_page="907" end_page="908" type="sub_section"> <SectionTitle> 3.1 Lexical Alignment </SectionTitle> <Paragraph position="0"> Several approaches to the RTE task have argued that the recognition of textual entailment can be enhanced when systems are able to identify or align corresponding entities, predicates, or phrases found in a pair of texts. In this section, we show that by using a machine learning-based classifier which combines lexico-semantic information from a wide range of sources, we are able to identify aligned constituents in pairs of texts with over 90% accuracy.</Paragraph> <Paragraph position="1"> We believe the alignment of corresponding entities can be cast as a classification problem which uses lexico-semantic features in order to compute an alignment probability p(a), corresponding to the likelihood that a term selected from one text entails a term from another text. We used constituency information from a chunk parser to decompose the pair of texts into a set of disjoint segments known as alignable chunks.</Paragraph> <Paragraph position="2"> (Figure 4: features used by the Classification Module.) The first group of features is derived from the PropBank-style annotations assigned by the semantic parser. ◊1◊ ENTITY-ARG MATCH: This is a boolean feature which fires when aligned entities were assigned the same argument role label.
◊2◊ ENTITY-NEAR-ARG MATCH: This feature collapses the arguments Arg1 and Arg2 (as well as the ArgM subtypes) into single categories for the purpose of counting matches.</Paragraph> <Paragraph position="3"> ◊3◊ PREDICATE-ARG MATCH: This boolean feature is flagged when at least two aligned arguments have the same role.</Paragraph> <Paragraph position="4"> ◊4◊ PREDICATE-NEAR-ARG MATCH: This feature collapses the arguments Arg1 and Arg2 (as well as the ArgM subtypes) into single categories for the purpose of counting matches.</Paragraph> <Paragraph position="5"> PARAPHRASE FEATURES: These three features are derived from the paraphrases acquired for each pair.</Paragraph> <Paragraph position="6"> ◊1◊ SINGLE PATTERN MATCH: This is a boolean feature which fires when a paraphrase matches either of the texts.</Paragraph> <Paragraph position="7"> ◊2◊ BOTH PATTERN MATCH: This is a boolean feature which fires when paraphrases match both texts.</Paragraph> <Paragraph position="8"> ◊3◊ CATEGORY MATCH: This is a boolean feature which fires when paraphrases from the same paraphrase cluster match both texts.</Paragraph> <Paragraph position="9"> SEMANTIC/PRAGMATIC FEATURES: These six features are extracted by the preprocessing module.</Paragraph> <Paragraph position="10"> ◊1◊ NAMED ENTITY CLASS: This feature has a different value for each of the 150 named entity classes.</Paragraph> <Paragraph position="11"> ◊2◊ TEMPORAL NORMALIZATION: This boolean feature is flagged when the temporal expressions are normalized to the same ISO 8601 equivalents.</Paragraph> <Paragraph position="12"> ◊3◊ MODALITY MARKER: This boolean feature is flagged when the two texts use the same modal verbs.</Paragraph> <Paragraph position="13"> ◊4◊ SPEECH-ACT: This boolean feature is flagged when the lexicons indicate the same speech act in both texts.</Paragraph> <Paragraph position="14"> ◊5◊ FACTIVITY MARKER: This boolean feature is flagged when the factivity markers indicate either TRUE or FALSE in both texts simultaneously. ◊6◊ BELIEF MARKER: This boolean feature is set when the belief markers indicate either TRUE or FALSE in both texts simultaneously. CONTRAST FEATURES: These six features are derived from the opposing information provided by antonymy relations or chains. ◊1◊ NUMBER OF LEXICAL ANTONYMY RELATIONS: This feature counts the number of antonyms from WordNet that are discovered between the two texts.</Paragraph> <Paragraph position="15"> ◊2◊ NUMBER OF ANTONYMY CHAINS: This feature counts the number of antonymy chains that are discovered between the two texts. ◊3◊ CHAIN LENGTH: This feature represents a vector with the lengths of the antonymy chains discovered between the two texts.
◊4◊ NUMBER OF GLOSSES: This feature is a vector representing the number of Gloss relations used in each antonymy chain.</Paragraph> <Paragraph position="16"> ◊5◊ NUMBER OF MORPHOLOGICAL CHANGES: This feature is a vector representing the number of Morphological-Derivation relations found in each antonymy chain.</Paragraph> <Paragraph position="17"> ◊6◊ NUMBER OF NODES WITH DEPENDENCIES: This feature is a vector indexing the number of nodes in each antonymy chain that contain dependency relations.</Paragraph> <Paragraph position="18"> ◊7◊ TRUTH-VALUE MISMATCH: This is a boolean feature which fires when two aligned predicates differ in any truth value.</Paragraph> <Paragraph position="19"> ◊8◊ POLARITY MISMATCH: This is a boolean feature which fires when predicates are assigned opposite polarity values.</Paragraph> <Paragraph position="20"> Alignable chunks from one text (Ct) and the other text (Ch) are then assembled into an alignment matrix (Ct × Ch). Each pair of chunks (p ∈ Ct × Ch) is then submitted to a Maximum Entropy-based classifier which determines whether or not the pair of chunks represents a case of lexical entailment.</Paragraph> <Paragraph position="21"> Three classes of features were used in the Alignment Classifier: (1) a set of statistical features (e.g., cosine similarity), (2) a set of lexico-semantic features (including WordNet Similarity (Pedersen et al., 2004), named entity class equality, and part-of-speech equality), and (3) a set of string-based features (such as Levenshtein edit distance and morphological stem equality).</Paragraph> <Paragraph position="22"> As in (Hickl et al., 2006), we used a two-step approach to obtain sufficient training data for the Alignment Classifier. First, humans were tasked with annotating a total of 10,000 alignment pairs (extracted from the 2006 PASCAL Development Set) as either positive or negative instances of alignment. These annotations were then used to train a hillclimber that was used to annotate a larger set of 450,000 alignment pairs selected at random from the training corpora described in Section 3.3. These machine-annotated examples were then used to train the Maximum Entropy-based classifier used in our TE system. Table 4 presents results from TE's linear- and Maximum Entropy-based Alignment Classifiers on a sample of 1000 alignment pairs selected at random from the 2006 PASCAL Test Set.</Paragraph> </Section> <Section position="3" start_page="908" end_page="909" type="sub_section"> <SectionTitle> 3.2 Paraphrase Acquisition </SectionTitle> <Paragraph position="0"> Much recent work on automatic paraphrasing (Barzilay and Lee, 2003) has used relatively simple statistical techniques to identify text passages that contain the same information from parallel corpora.
Since sentence-level paraphrases are generally assumed to contain information about the same event, these approaches have generally assumed that all of the available paraphrases for a given sentence will include at least one pair of entities which can be used to extract sets of paraphrases from text.</Paragraph> <Paragraph position="1"> The TE system uses a similar approach to gather phrase-level alternations for each entailment pair.</Paragraph> <Paragraph position="2"> In our system, the two highest-confidence entity alignments returned by the Lexical Alignment module were used to construct a query, which was used to retrieve the top 500 documents from Google, as well as all matching instances from the training corpora described in Section 3.3. This method did not always extract true paraphrases of either text. In order to increase the likelihood that only true paraphrases were considered as phrase-level alternations for an example, extracted sentences were clustered using complete-link clustering, following a technique proposed in (Barzilay and Lee, 2003).</Paragraph> </Section> <Section position="4" start_page="909" end_page="909" type="sub_section"> <SectionTitle> 3.3 Creating New Sources of Training Data </SectionTitle> <Paragraph position="0"> In order to obtain more training data for our TE system, we extracted more than 200,000 examples of textual entailment from large newswire corpora.</Paragraph> <Paragraph position="1"> Positive Examples. Following an idea proposed in (Burger and Ferro, 2005), we created a corpus of approximately 101,000 textual entailment examples by pairing the headline and first sentence from newswire documents. In order to increase the likelihood of including only positive examples, pairs in which the headline and the first sentence did not share an entity (or an NP) were filtered out.</Paragraph> </Section> <Section position="5" start_page="909" end_page="909" type="sub_section"> <SectionTitle> Judgment Example </SectionTitle> <Paragraph position="0"> YES Text-1: Sydney newspapers made a secret deal not to report on the fawning and spending during the city's successful bid for the 2000 Olympics, former Olympics Minister Bruce Baird said today.</Paragraph> <Paragraph position="1"> Text-2: Papers Said To Protect Sydney Bid
YES Text-1: An IOC member expelled in the Olympic bribery scandal was consistently drunk as he checked out Stockholm's bid for the 2004 Games and got so offensive that he was thrown out of a dinner party, Swedish officials said.</Paragraph> <Paragraph position="2"> Text-2: Officials Say IOC Member Was Drunk
Negative Examples. Two approaches were used to gather negative examples for our training set. First, we extracted 98,000 pairs of sequential sentences that included mentions of the same named entity from a large newswire corpus.
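The two heuristics described so far can be pictured with the short sketch below. The document layout and the entity extractor are hypothetical stand-ins (a real pipeline would use the named entity recognizer from the preprocessing module); the sketch illustrates the pairing and filtering logic rather than the authors' actual code.

from typing import Callable, List, Set, Tuple

Doc = Tuple[str, List[str]]          # (headline, sentences) -- assumed layout

def positive_pairs(docs: List[Doc],
                   entities: Callable[[str], Set[str]]) -> List[Tuple[str, str]]:
    """Sketch of the positive-example heuristic: pair each headline with the
    document's first sentence, keeping only pairs that share an entity (or NP),
    as in the filtering step described above."""
    pairs = []
    for headline, sentences in docs:
        if sentences and entities(headline) & entities(sentences[0]):
            pairs.append((sentences[0], headline))   # first sentence entails headline
    return pairs

def negative_pairs(docs: List[Doc],
                   entities: Callable[[str], Set[str]]) -> List[Tuple[str, str]]:
    """Sketch of the first negative-example heuristic: adjacent sentences that
    mention the same named entity, which rarely stand in an entailment relation."""
    pairs = []
    for _, sentences in docs:
        for s1, s2 in zip(sentences, sentences[1:]):
            if entities(s1) & entities(s2):
                pairs.append((s1, s2))
    return pairs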
We also extracted 21,000 pairs of sentences linked by connectives such as even though, in contrast, and but.</Paragraph> <Paragraph position="3"> NO Text-1: [...]ing up oil slicks are extremely costly and are never completely efficient.</Paragraph> <Paragraph position="4"> Text-2: In contrast, he stressed, Clean Mag has a 100 percent pollution retrieval rate, is low cost and can be recycled.</Paragraph> </Section> </Section> <Section position="7" start_page="909" end_page="910" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> In this section, we describe results from four sets of experiments designed to explore how textual entailment information can be used to enhance the quality of automatic Q/A systems. We show that by incorporating features from TE into a Q/A system which employs no other form of textual inference, we can improve accuracy by more than 20% over a baseline.</Paragraph> <Paragraph position="1"> We conducted our evaluations on a set of 500 factoid questions selected randomly from questions previously evaluated during the annual TREC Q/A evaluations. Of these 500 questions, 335 (67.0%) were automatically assigned an answer type from our system's answer type hierarchy; the remaining 165 (33.0%) questions were classified as having an unknown answer type. In order to provide a baseline for our experiments, we ran a version of our Q/A system, known as FERRET (Harabagiu et al., 2005a), that does not make use of textual entailment information when identifying answers to questions. Results from this baseline are presented in Table 7.</Paragraph> <Paragraph position="2"> The performance of the TE system described in Section 3 was first evaluated in the 2006 PASCAL RTE Challenge. In this task, systems had to determine whether the meaning of a sentence (referred to as a hypothesis) could be reasonably inferred from the meaning of another sentence (known as a text). Four types of sentence pairs were evaluated in the 2006 RTE Challenge, including pairs derived from the output of (1) automatic question answering (QA) systems, (2) information extraction (IE) systems, (3) information retrieval (IR) systems, and (4) multi-document summarization (SUM) systems. The accuracy of our TE system across these four tasks is presented in Table 8.</Paragraph> <Paragraph position="3"> In previous work (Hickl et al., 2006), we found that the type and amount of training data available to our TE system significantly (p < 0.05) impacted its performance on the 2006 RTE Test Set. When our system was trained on the training corpora described in Section 3.3, the overall accuracy increased from 65.25% to 75.38%. In order to provide training data that replicated the task of recognizing entailment between a question and an answer, we assembled a corpus of 5000 question-answer pairs selected from answers that our baseline Q/A system returned in response to a new set of 1000 questions selected from the TREC test sets. 2500 positive training examples were created from answers identified by human annotators as correct answers to a question, while 2500 negative examples were created by pairing questions with incorrect answers returned by the Q/A system.</Paragraph> <Paragraph position="4"> After training our TE system on this corpus, we performed the following four experiments: Method 1. In the first experiment, the ranked lists of answers produced by the Q/A system were submitted to the TE system for validation.
Under this method, answers that were not entailed by the question were removed from consideration; the top-ranked entailed answer was then returned as the system's answer to the question. Results from this method are presented in Table 9.</Paragraph> <Paragraph position="5"> Method 2. In this experiment, entailment information was used to rank passages returned by the PR module. After an initial relevance ranking was determined by the PR engine, the top 50 passages were paired with the original question and submitted to the TE system. Passages were re-ranked using the entailment judgment and the entailment confidence computed for each pair and then submitted to the AP module. Features derived from the entailment confidence were then combined with the keyword- and relation-based features described in (Harabagiu et al., 2005a) in order to produce a final ranking of candidate answers. Results from this method are presented in Table 9. Method 3. In the third experiment, TE was used to select AGQs that were entailed by the question submitted to the Q/A system. Here, AutoQUAB was used to generate questions for the top 50 candidate answers identified by the system. When at least one of the top 50 AGQs was entailed by the original question, the answer passage associated with the top-ranked entailed question was returned as the answer. When none of the top 50 AGQs were entailed by the question, question-answer pairs were re-ranked based on the entailment confidence, and the top-ranked answer was returned. Results for both of these conditions are presented in Table 9.</Paragraph> <Paragraph position="6"> Hybrid Method. Finally, we found that the best results could be obtained by combining aspects of each of these three strategies. Under this approach, candidate answers were initially ranked using features derived from entailment classifications performed between (1) the original question and each candidate answer and (2) the original question and the AGQ generated from each candidate answer. Once a ranking was established, answers that were not judged to be entailed by the question were also removed from the final ranking. Results from this hybrid method are provided in Table 9.</Paragraph> </Section> <Section position="8" start_page="910" end_page="911" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> The experiments reported in this paper suggest that current TE systems may be able to provide open-domain Q/A systems with the forms of semantic inference needed to perform accurate answer validation. While probabilistic or web-based methods for answer validation have been previously explored in the literature (Magnini et al., 2002), these approaches have modeled the relationship between a question and a (correct) answer in terms of relevance and have not tried to approximate the deeper semantic phenomena that are involved in determining answerhood.</Paragraph> <Paragraph position="1"> Our work suggests that considerable gains in performance can be obtained by incorporating TE during both answer processing and passage retrieval. While the best results were obtained using the Hybrid Method (which boosted performance by nearly 28% for questions with known EATs), each of the individual methods managed to boost the overall accuracy of the Q/A system by at least 7%. When TE was used to filter non-entailed answers from consideration (Method 1), the overall accuracy of the Q/A system increased by 12% over the baseline (when an EAT could be identified) and by nearly 9% (when no EAT could be identified).
In contrast, when entailment information was used to rank passages and candidate answers, performance increased by 22% and 10%, respectively. Somewhat smaller performance gains were achieved when TE was used to select amongst AGQs generated by our Q/A system's AutoQUAB module (Method 3). We expect that by adding features to the TE system specifically designed to account for the semantic contributions of a question's EAT, we may be able to boost the performance of this method.</Paragraph> </Section> </Paper>