<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1201">
  <Title>Question answering via Bayesian inference on lexical relations</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 Proposed approach
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 An inferencing approach to QA
</SectionTitle>
      <Paragraph position="0"> Given a question and a passage that contains the answer, how do we correlate the two? Take, for example, the question &amp;quot;What type of animal is Winnie the Pooh?&amp;quot; and the answer passage &amp;quot;A Canadian town that claims to be the birthplace of Winnie the Pooh wants to erect a giant statue of the famous bear; but Walt Disney Studios will not permit it.&amp;quot;</Paragraph>
      <Paragraph position="1"> It is clear that there is a linkage between the question word animal and the answer word bear. That the word bear occurred in the answer, in the context of Winnie, means that there was a hidden &amp;quot;cause&amp;quot; for the occurrence of bear, and that was the concept {animal}.</Paragraph>
      <Paragraph position="2"> In general, there could be multiple words in the question and answer that are connected by many hidden causes. This scenario is depicted in figure 1. The causes themselves may have hidden causes associated with them.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="1" end_page="4" type="metho">
    <SectionTitle>
QUESTION ANSWER
</SectionTitle>
    <Paragraph position="0"> These causal relationships are represented in ontologies and WordNets. The familiar English WordNet, in particular, encodes relations between words and concepts. For instance WordNet gives the hypernymy relation between the concepts {animal} and {bear}.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 WordNet
</SectionTitle>
      <Paragraph position="0"> WordNet (Fellbaum, 1998b) is an online lexical reference system in which English nouns, verbs, adjectives and adverbs are organized into synonym sets or synsets, each representing one underlying lexical concept. Noun synsets are related to each other through hypernymy (generalization), hyponymy (specialization), holonymy (whole of) and meronymy (part of) relations. Of these, (hypernymy, hyponymy) and (meronymy, holonymy) are complementary pairs.</Paragraph>
      <Paragraph position="1"> The verb and adjective synsets are very sparsely connected with each other. No relation is available between noun and verb synsets. However, 4500 adjective synsets are related to noun synsets through pertainym (pertaining to) and attribute (attributed with) relations. Figure 2 shows that the synset {dog, domestic dog, canis familiaris} has a hyponymy link to {corgi, welsh corgi} and a meronymy link to {flag} (&amp;quot;a conspicuously marked or shaped tail&amp;quot;). While the hyponymy link helps us answer the question (TREC#371) &amp;quot;A corgi is a kind of what?&amp;quot;, the meronymy connection here is perhaps more confusing than useful: this sense of flag is rare.</Paragraph>
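The relations described above can be illustrated with a tiny hand-coded fragment of the noun hierarchy. This is only a sketch: the relation tables here are invented abbreviations of WordNet's actual links, and a real system would read them from WordNet itself.

```python
# Toy fragment of WordNet's noun hierarchy (hand-coded for illustration;
# synset names are abbreviated stand-ins for full synsets).
hypernym = {
    'corgi': 'dog',
    'dog': 'canine',
    'canine': 'carnivore',
    'carnivore': 'placental mammal',
    'placental mammal': 'mammal',
    'mammal': 'animal',
}
# Meronymy (part-of): the rare "flag" sense, a dog's tail.
part_meronym = {'dog': ['flag']}

def hypernym_chain(synset):
    """Follow hypernymy (generalization) links up to the root."""
    chain = [synset]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain

print(hypernym_chain('corgi'))
```

Following the chain from corgi reaches animal through the intermediate synsets, which is exactly the kind of path the inferencing approach of section 3.3 must traverse.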
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Inferencing on lexical relations
</SectionTitle>
      <Paragraph position="0"> It is surprisingly difficult to make the simple idea of bridging passage to query through lexical networks perform well in practice. Continuing the example of Winnie the bear (section 3.1), the English WordNet has five synsets on the path from bear to animal: {carnivore ...}, {placental mammal ...},</Paragraph>
      <Paragraph position="2"> Some of these intervening synsets would be extremely unlikely to be associated with a corpus that is not about zoology; a common person would more naturally think of a bear as a kind of animal, skipping through the intervening nodes.</Paragraph>
      <Paragraph position="3"> It is, however, dangerous to design an algorithm which is generally eager to skip across links in a lexical network. E.g., few QA applications are expected to need an expansion of &amp;quot;bottle&amp;quot; beyond &amp;quot;vessel&amp;quot; and &amp;quot;container&amp;quot; to &amp;quot;instrumentality&amp;quot; and beyond. Another example would be the shallow verb hierarchy in the English WordNet, with completely dissimilar verbs within very few links of each other.</Paragraph>
      <Paragraph position="4"> There is also the problem of missing links.</Paragraph>
      <Paragraph position="5"> Another important issue is which 'hidden causes' (synsets) should be inferred to have caused words in the text. This is a classical problem called word sense disambiguation (WSD). For instance, the word dog belongs to 6 noun synsets in WordNet. Which of these 6 synsets should be treated as the 'hidden cause' that generated the word dog in the passage could be inferred from the fact that collie is related to dog only through one of the latter's senses: its sense as {dog, domestic dog, Canis familiaris}. But this problem of finding the 'appropriate' hidden causes is, in general, non-trivial. Given that state-of-the-art WSD systems perform no better than 74% (Sanderson, 1994) (Lewis and Jones, 1996) (Fellbaum, 1998b), in this paper we use a probabilistic approach to WSD, called 'soft WSD' (Pushpak, ): hidden nodes are considered to have probabilistically 'caused' words in the question and answer; in other words, causes are probabilistically 'switched on'.</Paragraph>
      <Paragraph position="6"> Clearly, any scoring algorithm that seeks to utilize WordNet links must also discriminate between them based (at least) on usage statistics of the connected synsets. Also required is an estimate of the likelihood of instantiating a synset into a token because it was &amp;quot;activated&amp;quot; by a closely related synset. We find a Bayesian belief network (BBN) a natural structure to encode such combined knowledge from WordNet and corpus.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
3.4 Bayesian Belief Network
</SectionTitle>
      <Paragraph position="0"> A Bayesian Network (Heckerman, 1995) for a set of random variables X = {X_1, X_2, ..., X_n} consists of a directed acyclic graph (DAG) that encodes a set of conditional independence assertions about variables in X, and a set of local probability distributions associated with each variable. Let Pa_i denote the set of immediate parents of X_i in the DAG; the local distribution P(X_i | Pa_i) is stored in a &amp;quot;conditional probability table&amp;quot; (CPT). Figure 3 shows a Bayesian belief network interpretation for a part of WordNet. The synset {corgi, welsh corgi} has a causal relation from {dog, domestic dog, canis familiaris}. A possible conditional probability table for the network is shown to the right of the structure.</Paragraph>
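The BBN factorization just described, P(X) = prod_i P(X_i | Pa_i), can be made concrete with the two-node network of figure 3, where {dog, ...} is the parent of {corgi, ...}. The CPT numbers below are invented placeholders, not the values in the paper's figure.

```python
# Two-node BBN: dog -> corgi. All probability values are illustrative.
p_dog = 0.3                               # P(dog = present)
p_corgi_given = {True: 0.8, False: 0.05}  # CPT row: P(corgi = present | dog)

def joint(dog_present, corgi_present):
    """BBN factorization: P(dog, corgi) = P(dog) * P(corgi | dog)."""
    pd = p_dog if dog_present else 1 - p_dog
    pc = p_corgi_given[dog_present]
    if not corgi_present:
        pc = 1 - pc
    return pd * pc

# Sanity check: the four joint entries must sum to 1.
total = sum(joint(d, c) for d in (True, False) for c in (True, False))
```

The same factorization extends to the full WordNet-derived network: each synset or word node carries one CPT conditioned on its parents.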
      <Paragraph position="1">  The idea of constructing a BBN from WordNet has been proposed by (Rebecca, 1998). But that idea centers on hard sense disambiguation to find the 'correct' sense of each word in the text.</Paragraph>
      <Paragraph position="2"> In this paper, we particularly explore the idea of doing soft sense disambiguation, i.e., synsets are probabilistically considered to be causes of their constituent words. Moreover, WSD is not an end in itself. The goal is to connect the words within the question and the answer passage, and also across the two. WSD is only a by-product.</Paragraph>
      <Paragraph position="3"> Our goal is to build a QA system which implements a clear division of labor between the knowledge base and the scoring algorithm, codifies the knowledge base in a uniform manner, and thereby enables a generic algorithm and a shared, extensible knowledge base. Based on the discussion above, our knowledge representation must be probabilistic, and our system must combine and be robust to multiple, noisy sources of information from query and answer terms.</Paragraph>
      <Paragraph position="4"> Moreover, we would like to be able to learn important properties of our knowledge base from continual training of our system with corpus samples as well as samples of successful and unsuccessful (question, answer) pairs. In essence, we would like to automate as far as possible, the customization of lexical networks to QA tasks. Given the English WordNet, it should be possible to reconstruct our algorithm completely from this paper.</Paragraph>
      <Paragraph position="5"> Toward these ends, we describe how to induce a Bayesian Belief Network (BBN) from a lexical network of relations. Specifically, we propose a semi-supervised learning mechanism which simultaneously trains the BBN and associates text tokens (words) with WordNet synsets in a probabilistic manner (&amp;quot;soft WSD&amp;quot;). Finally, we use the trained BBN to score passages in response to a question.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
3.5 Building a BBN from WordNet
</SectionTitle>
      <Paragraph position="0"> Our model of the BBN is that each synset from WordNet is a boolean event associated with a question, a passage, or both. Textual tokens are also events. Each event is a node in the BBN. Events can cause other events to happen in a probabilistic manner, which is encoded in CPTs. The specific form of CPT we use is the well-known noisy-OR of Pearl (Pearl, 1988).</Paragraph>
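The noisy-OR CPT mentioned above can be written compactly: each active parent independently fails to activate the child with probability 1 - p_i, so the child is active unless every active parent fails. A minimal sketch:

```python
def noisy_or(link_probs, parent_states):
    """Noisy-OR CPT (Pearl, 1988).

    link_probs[i]    -- probability that parent i alone activates the child
    parent_states[i] -- whether parent i is 'present'
    Returns P(child = present | parent_states).
    """
    fail = 1.0
    for p, on in zip(link_probs, parent_states):
        if on:
            fail *= 1.0 - p  # parent i independently fails to activate
    return 1.0 - fail
```

With link probabilities 0.9 and 0.5, one active parent gives 0.9, both give 1 - 0.1 * 0.5 = 0.95, and no active parent gives 0, which is the characteristic noisy-OR behavior.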
      <Paragraph position="1"> We introduce a node in the BBN for each noun, verb, and adjective synset in WordNet. We also introduce a node for each (non-stop-word) token in the corpus and all questions. Hyponymy, meronymy, and attribute links are introduced from WordNet.</Paragraph>
      <Paragraph position="2"> Sense links are used to attach tokens to potentially matching synsets. E.g., the string &amp;quot;flag&amp;quot; may be attached to the synset nodes {sag, droop, swag, flag} and {a conspicuously marked or shaped tail}. (The purpose of probabilistic disambiguation is to estimate the probability that the string &amp;quot;flag&amp;quot; was caused by each connected synset node.) This process creates a hierarchy in which the parent-child relationship is defined by the semantic relations in WordNet: s is a parent of t iff s is the hypernym, holonym, or attribute-of t, or s is a synset containing the word t. The process by which the Bayesian network is built from the WordNet hypergraph of synsets and from the mapping between words and synsets is depicted in figure 4. We define going up the hierarchy as the traversal from child to parent.</Paragraph>
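The parent relation just defined can be sketched as edge construction over a toy fragment. The relation tables are hand-coded here (the real system reads them from WordNet), and word nodes carry a `w:` prefix purely to keep them distinct from synset nodes with the same spelling.

```python
# Toy relations: parent synset -> child synsets, and synset -> its words.
hyponym_of = {'animal': ['dog'], 'dog': ['corgi']}
words_of = {'dog': ['w:dog', 'w:domestic dog'], 'corgi': ['w:corgi']}

def children(node):
    """BBN children of a node: narrower synsets plus constituent word nodes."""
    return hyponym_of.get(node, []) + words_of.get(node, [])

# Build the directed edge list of the BBN (parent -> child).
edges = [(parent, child)
         for parent in sorted(set(hyponym_of) | set(words_of))
         for child in children(parent)]
```

Every edge runs from a broader synset to a narrower one or from a synset to a word it contains, so the resulting graph is a DAG suitable for the BBN.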
      <Paragraph position="3"> Ideally, we should update the entire BBN and its CPTs while scanning over the training corpus. In practice, BBN training and inference are CPU- and memory-intensive processes.</Paragraph>
      <Paragraph position="4"> We compromise by first attaching the token nodes to their synsets and then walking up the WordNet hierarchy, up to a maximum height decided purely by CPU and memory limitations. We believe that the probabilistic influence from distant nodes is too feeble and unreliable to warrant modeling.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="4" end_page="5" type="metho">
    <SectionTitle>
4 Our QA system
</SectionTitle>
    <Paragraph position="0"> The overall question answering system that we propose is depicted in figure 5. The corresponding algorithm is outlined in figure 6.</Paragraph>
    <Paragraph position="1"> The question triggers the TFIDF retrieval module to pick up the 50 most relevant documents. These documents are subjected to a sliding window to produce n passages of length l each. The Bayesian belief network described in section 3.5 ranks these passages; the top-ranked passage is taken to contain the answer. The belief network parameters are the CPTs, which are initialized as noisy-OR CPTs. The Bayesian belief network is trained offline using EM (Dempster, 1977) on windows sliding over the whole corpus.
Figure 6: The overall question answering algorithm
1: Construct a Bayesian network structure using the WordNet structure
2: Train the Bayesian network parameters on the corpus containing the answers
3: Do question answering with the trained Bayesian network
Figure 7: The training algorithm
1: while CPTs do not converge do
2:   for each window of k words in the text do
3:     Clamp the word nodes in the Bayesian network to a state of 'present'
4:     for each node in the Bayesian network do
5:       find its joint probabilities with all configurations of its parent nodes (E step)
6:     end for
7:   end for
8:   Update the conditional probability tables for all random variables (M step)
9: end while</Paragraph>
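The sliding-window passage extraction step can be sketched in a few lines. The stride parameter is an assumption for illustration; the paper only specifies overlapping windows of a fixed length.

```python
def sliding_passages(tokens, length, stride=1):
    """Cut a token list into overlapping passages of `length` tokens each,
    as done to the top-50 TFIDF documents before BBN ranking."""
    last_start = max(len(tokens) - length, 0)
    return [tokens[i:i + length] for i in range(0, last_start + 1, stride)]
```

For a four-token document and length 2, this yields the three overlapping passages starting at positions 0, 1, and 2; documents shorter than the window produce a single truncated passage.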
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.1 Training the belief network
</SectionTitle>
      <Paragraph position="0"> Figure 7 describes the algorithm for training the BBN obtained from WordNet. We initialize the CPTs as noisy-OR. The instances we use for training are windows of length k from the corpus. Since the corpus is normally not tagged with WordNet senses, all variables other than the words observed in the window (i.e., the synset nodes in the BBN) are hidden or unobserved. Hence we use the Expectation Maximization algorithm (Dempster, 1977) for parameter learning. For each instance, we find the expected values of the hidden variables, given the present state of each of the observed variables. These expected values are used after each pass through the corpus to update the CPT of each node. Iterations through the corpus continue until the sum of the squares of the Kullback-Leibler divergences between CPTs in successive iterations falls below a threshold, i.e., until the convergence criterion is met. We thus customize the Bayesian network CPTs to a particular corpus by learning the local CPTs.</Paragraph>
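The convergence criterion above, the sum of squared Kullback-Leibler divergences between successive CPTs falling below a threshold, is straightforward to implement. Each CPT row is treated as a discrete distribution; the threshold value is an invented placeholder.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) between two discrete
    distributions given as aligned probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def converged(old_cpts, new_cpts, threshold=1e-4):
    """EM stopping rule from figure 7: stop when the sum of squared KL
    divergences between successive CPT rows is below the threshold."""
    total = sum(kl(old, new) ** 2 for old, new in zip(old_cpts, new_cpts))
    return total < threshold
```

Identical CPTs give a divergence of zero and trigger convergence immediately; a large swing such as (0.9, 0.1) moving to (0.1, 0.9) keeps the loop running.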
    </Section>
    <Section position="2" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
4.2 Ranking answer passages
</SectionTitle>
      <Paragraph position="0"> Given a question, we rank the passages by the joint probability of the question words given the candidate answer. Every question or answer can be viewed as an event in which its word nodes are switched to the state 'present'. Therefore, if P_1, P_2, ..., P_n are passages and Q is the question, the answer is that passage P_i which maximizes Pr(Q|P_i) over all passages P_i deemed candidate answers. Pr(Q|P_i) is the joint probability of the words of Q, each being in state 'present' in the Bayesian network, given that all the word nodes for P_i are clamped to the state 'present' in the belief network.</Paragraph>
      <Paragraph position="1">
1: Load the Bayesian network parameters
2: for each question q do
3:   for each candidate passage p do
4:     Clamp the variables (nodes) corresponding to the passage words in the network to a state of 'present'
5:     Find the joint probability of all question words being in state 'present', i.e., Pr(q|p)
6:   end for
7: end for
8: Report the passages in decreasing order of Pr(q|p)
Figure 8: The ranking algorithm.</Paragraph>
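The ranking loop above can be sketched as follows. The BBN inference step (clamping passage words and computing the joint probability of the question words) is abstracted into a caller-supplied function, since full belief-network inference is beyond a short sketch; the `toy_prob` scorer is an invented stand-in, not the paper's model.

```python
def rank_passages(question_words, passages, prob_word_given_passage):
    """Rank candidate passages by Pr(Q | P): the joint probability of the
    question words being 'present' given the passage words are clamped
    'present'. BBN inference is abstracted into prob_word_given_passage."""
    def score(passage):
        prob = 1.0
        for w in question_words:
            prob *= prob_word_given_passage(w, passage)
        return prob
    return sorted(passages, key=score, reverse=True)

def toy_prob(word, passage):
    """Invented stand-in for BBN inference: high probability if the
    question word literally occurs in the passage."""
    return 0.9 if word in passage else 0.1

ranked = rank_passages(['animal'], [['bear', 'statue'], ['animal', 'farm']], toy_prob)
```

With a real BBN, `prob_word_given_passage` would also reward passages whose words are linked to the question word through synset nodes, e.g. bear to animal via hypernymy.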
      <Paragraph position="2"> The reason for choosing Pr(Q|P_i) over Pr(P_i|Q) is that (a) Q typically contains very few words.</Paragraph>
      <Paragraph position="3"> Pr(P_i|Q), therefore, may not help in bridging the relation between answer words. (b) The passage will be penalized if it contains many words which are not present in the question and are also not closely related to the question words through WordNet.</Paragraph>
      <Paragraph position="4"> This could happen despite the fact that the passage contains a few words which are all present in the question and/or are semantically closely related to the question, in addition to containing the answer to the question. Also, (c) if the passages P_i are of varying lengths, the scores Pr(Q|P_i) are brought to the same scale, namely that of the question words, which are fixed across passages/snippets, whereas Pr(P_i|Q) can be affected and penalized by long snippets.</Paragraph>
      <Paragraph position="5"> In fact, our apprehensions about using Pr(P_i|Q) are justified in the experimental section: the QA performance obtained using Pr(P_i|Q) is drastically poorer, in fact worse than the baseline QA algorithm.</Paragraph>
      <Paragraph position="6"> Dealing with non-WordNet words: Suppose there is a word w in the question which is not in WordNet. As with the answer passages, we could have ignored such words, but the question may be seeking an answer to precisely such a word. Also, since the question contains very few words, no word in it should be ignored. We deal with this situation in the following way. We call a word a connecting word if it is the key word that links the passage to the question. Note that for WordNet words, the connecting nodes were WordNet concepts. In the case of non-WordNet words, we don't have any hidden connecting nodes, so we consider the words themselves to be possible connections. Let connect_w be a random variable which takes the state 'present' if w is a connecting word between the question and the answer, and 'absent' otherwise. Let wQ and wP be random variables that are 'present' if w occurs in the question or answer respectively, else 'absent'. By Bayes rule, we get the following probability that the word w occurs in the question, given that it occurs in the answer (1 = present, 0 = absent).</Paragraph>
      <Paragraph position="8"> The quantities Pr(connect_w = 1) and their complements are estimated from question-answer pairs. Moreover, the occurrence of non-WordNet words is assumed to be independent of each other and also of the occurrence of WordNet words.</Paragraph>
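One way such a probability can be computed is by marginalizing over the hidden connect_w variable. This is a sketch under the assumption that wQ and wP are conditionally independent given connect_w; the numeric values are invented placeholders, not estimates from the paper.

```python
# Marginalize over the hidden connect_w variable to get Pr(wQ = 1 | wP = 1).
# All probability values below are illustrative, not estimated.
p_connect_given_wp = 0.4                 # Pr(connect_w = 1 | wP = 1)
p_wq_given_connect = {1: 0.7, 0: 0.05}   # Pr(wQ = 1 | connect_w)

# Pr(wQ = 1 | wP = 1) = sum over c of Pr(wQ = 1 | c) * Pr(c | wP = 1),
# assuming wQ is independent of wP given connect_w.
p_wq_given_wp = sum(
    p_wq_given_connect[c] * (p_connect_given_wp if c else 1 - p_connect_given_wp)
    for c in (0, 1)
)
```

In practice the component probabilities would be estimated from successful and unsuccessful (question, answer) pairs, as the paragraph above describes.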
    </Section>
  </Section>
</Paper>