<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1201">
  <Title>Question answering via Bayesian inference on lexical relations</Title>
  <Section position="7" start_page="5" end_page="8" type="evalu">
    <SectionTitle>
5 Experiments and results
</SectionTitle>
    <Paragraph position="0"> We perform extensive experiments to evaluate our system, using the TREC http://trec.nist.</Paragraph>
    <Paragraph position="1"> gov/data/qa.html QA benchmark. We find that our algorithm is a substantial improvement beyond a baseline IR approach to passage ranking.</Paragraph>
    <Paragraph position="2"> Based on published numbers, it also appears to be in the same league as the top performers at recent TREC QA events. We also note that training our system improves the quality of our ranking, even though WSD accuracy does not increase, which affirms the belief that passage scoring need not depend on perfect WSD, given we use a robust, 'soft WSD'.</Paragraph>
    <Paragraph position="3"> See section a5 3.3.</Paragraph>
    <Section position="1" start_page="5" end_page="6" type="sub_section">
      <SectionTitle>
5.1 Experimental setup
</SectionTitle>
      <Paragraph position="0"> We use the Text REtrieval Conference (TREC) (Vorhees, 2000) corpus and question/answers from its QA track. The corpus is 2 GB of newspaper articles. There is a set a122 of about 690 factual questions. For each question, we retrieve the top a123a20a124 documents using a standard TFIDF-based IR engine such as SMART. We used the question set and corresponding top 50 document collection from TREC 2001 for our experiments. We used MXPOST (Ratnaparkhi, 1996), a maximum entropy based POS tagger. The part of speech tag is used while mapping document and question terms to their corresponding nodes in the BBN.</Paragraph>
      <Paragraph position="1"> The passage length we chose was a4a12a10a126a125a20a124 words. Unless otherwise stated explicitly, the maximum  height upto which the BBN was used for inferencing for each Q-passage pair can be assumed to be a127 .</Paragraph>
    </Section>
    <Section position="2" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
5.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> TREC QA evaluation has two runs based on the length of system response to a question. In the first the response is a passage up to 250 bytes in size.</Paragraph>
      <Paragraph position="1"> The second, more ambitious run asks for shorter responses of up to 50 bytes. (More recently, TREC has updated its requirements to demand exact, extracted answers.) To determine if the response is actually an answer to the question, TREC provides a set of regular expressions for each question. The presence of any of these in the response indicates that it is a valid answer. For evaluation the system is required to submit its top five responses for each question. This is used to calculate the performance measure mean reciprocal rank (MRR) for the system, defined as</Paragraph>
      <Paragraph position="3"> Here a35a75a139a141a140a62a142 a136 is the first rank at which correct answer occurs for question a59a146a145a126a122 . If for a question a59 the correct answer is not in the top 5 responses then  a147a57a148a150a149a86a151a100a152 is taken to be zero.</Paragraph>
    </Section>
    <Section position="3" start_page="6" end_page="7" type="sub_section">
      <SectionTitle>
5.3 Results
</SectionTitle>
      <Paragraph position="0"> IR baseline: IR technology is widely accessible, and forms our baseline. We construct 250-byte windows of text as passages and compute the similarity between these passages and the query. Because we would not like to penalize passages for having terms not in the question (provided they have at least some query terms), we use an asymmetric TFIDF similarity. Under this measure, the score of a passage is the sum of the IDFs of the question terms contained in the passage. If a153 is the document collection and a153a155a154 is the set of documents containing a85 , then one common form of IDF weighting (used by SMART again)</Paragraph>
      <Paragraph position="2"> The IR baseline MRR is only about 0.3, which is far short of Falcon, which has an MRR of almost 0.7.</Paragraph>
      <Paragraph position="3"> The baseline MRR is low for the obvious reasons: the IR engine cannot bridge the lexical gap.</Paragraph>
      <Paragraph position="4">  Base BBN: Initialized with our default parameters, our BBN-based approach achieves an MRR of 0.429, which is already a significant step up from the IR baseline. A large component of this improvement is caused by conflating different strings to common synsets.</Paragraph>
      <Paragraph position="5"> Trained BBN: We recalibrated our system after training the BBN with the corpus. This resulted in a visible improvement in our MRR, from 0.429 to 0.467, which takes us into the same league as the systems from University of Waterloo and Queens College, reported at TREC QA.</Paragraph>
      <Paragraph position="6"> Tables a5 1 and a5 2 summarize our MRR results and juxtapose them with the published MRRs for some of the best-performing QA system in TREC 2000.</Paragraph>
      <Paragraph position="7"> Given that we have invested zero customization effort in WordNet, it is impressive that our MRR compares favorably with all but the best system.</Paragraph>
      <Paragraph position="8"> Experiments for varying heights of BBN: The MRR obtained went down to a124a42a21a166a165 a127 when the height of the traced BBN was restricted to a114 , i.e. only words and their immediate synsets were considered. It is significant to note that even with immediate synset expansion, there is a marginal improvement over assymmetric TFIDF. The MRR improved to a124a42a21a127 a125 and a124a42a21 a127 a123 when the height was increased to a125 and a165 respectively. These results are tabulated in table a5 3. Experiments for restricting to WordNets of different parts of speech: The MRR found by using only the noun WordNet was a124a42a21a127 a114a13a123 . Words in the remaining parts of speech were treated as  non WordNet words in this experiment. The MRR dropped to a124a42a21a166a165 a127 a124 when only the adjective WordNet was used. The MRR found using only the verb WordNet was a low a124a42a21a166a165a141a125 . This is because the verb WordNet is very shallow and many semantically distant verbs are connected closely together. The MRR score obtained by considering noun+adjective part of WordNet was a124a42a21a127a141a127 a125 , that obtained by considering noun+verb part of WordNet was a124a42a21a166a165a141a173a141a165 and that obtained by considering verb+adjective part of Word-Net was a124a42a21a166a165a141a165a141a165 . These results are summarized in table a5 4. The results seem to justify the observation that the verb WordNet in its current form is shallow in height and has high in/out degree for each node; this is mainly due to the high ambiguity of verbs.</Paragraph>
      <Paragraph position="9"> But coupled with noun and adjective WordNets, the verb WordNet improves overall performance.</Paragraph>
      <Paragraph position="10"> Miscellaneous experiments: The MRR obtained by considering only WordNet words was a124a42a21a166a165a73a174a43a124 which indicates that we cannot afford to ignore the non-WordNet words. Also it seems that inducing 'semantic-similarity' between words not in the WordNet vocabulary is not so much required. By skipping Bayesian inferencing altogether, we get an MRR of a124a42a21a166a165a20a124 which is the same as for asymmetric TFIDF mentioned earlier. The MRR drastically fell to a124a42a21a175a124a73a125a42a114 when a60a61a37a57a60a176a167a22a169 a48a131a122a113a170a171a83 a167 a85a87a172a64a79a43a81a33a41 was used to rank the passages. This partly justifies the apprehension about finding the probability of passage given question which was expressed earlier - that is, passages get penalized if they contain lots of words which are not either not there in the question or are not related to words in the question. These results are summarized in table a5 5.</Paragraph>
      <Paragraph position="11"> The effect of WSD: It is interesting to note that training does not substantially affect disambiguation accuracy (which stays at about 75%), and MRR improves despite this fact. This seems to indicate that learning joint distributions between query and candidate answer keywords (via synset nodes, which are &amp;quot;bottleneck&amp;quot; variables in BBN parlance) is as important for QA as is WSD. Furthermore, we conjecture that &amp;quot;soft&amp;quot; WSD is key to maintaining QA MRR in the face of modest WSD accuracy.</Paragraph>
    </Section>
    <Section position="4" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
5.4 Analysis
</SectionTitle>
      <Paragraph position="0"> In the following, we analyse how Bayesian inferencing on lexical relations contributes towards ranking passages.</Paragraph>
      <Paragraph position="1"> How joint probability helps For finding the probability of question given a passage, we take the joint probability of the question words, conditioned on the (evidence of) answer words. Thus we attempt to overcome the usual bottleneck of assumption of independence of words as in the naive Bayes model.</Paragraph>
      <Paragraph position="2"> The relations of question words between themselves and with words in the answer is what precisely helps in giving a joint probability that is different from a naive product of marginals. This will be illustrated in section a5 5.5.</Paragraph>
      <Paragraph position="3"> How parameter smoothing helps If a question word does not occur in the answer, the marginal probability of that word should be high if it strongly relates to one or more words in the answer through WordNet. Without using WordNet, one could resort to finding this marginal probability from a corpus. These probabilities are remarkably low even for words that are very semantically related to words in the answer and this will be illustrated in the case studies in section a5 5.5. This problem could be attributed to data sparsity</Paragraph>
    </Section>
    <Section position="5" start_page="7" end_page="8" type="sub_section">
      <SectionTitle>
5.5 Case studies
</SectionTitle>
      <Paragraph position="0"> Case 1: This example shows that the passage in figure a5 10 contains the correct answer to the question in figure a5 9 and was given rank a114 . The interesting observation is that the words kind and type are related correctly through the WordNet to give high marginal probability to the word kind (0.557435) in the question, even though it does not occur in the answer. This is depicted in figure a5 12.</Paragraph>
      <Paragraph position="1"> The marginal probability of the same word (given that its is absent in the answer passage), as determined by corpus statistics is 0.00020202 - which is very small. This illustrates the advantage of parameter smoothing.</Paragraph>
      <Paragraph position="2">  kind: 0.557435 ....corgis: They are of course collie-type dogs originally bred for cattle herding. As such they will chase anything particularly ankles.... null Figure 10: Answer for Q1, Rank 1, Score(Joint Probability) = 0.893133, (Document ID:AP881106-0015) Bayesian Marginal Probs: corgi:1.000000, kind:0.006421 ....current favorite. So are bulldogs. Jack Russell terriers are popular with the horsy set. &amp;quot; The short-legged welsh corgi is big ( QueenP elizabeth ii has at least one ). And so, of course, is the english bull terrier (thanks to Anheuser-Busch, Bud Light and Spuds. MacKen- null Additionally, the joint probability of question words given the passage words of figure a5 10 (a124a42a21a166a177a141a173a141a165a42a114a13a165a141a165 ) is not the product of their  this is that the word dog that occurs in the answer passage is related to the word corgie in the question through WordNet as shown in figurea5 13. It can be seen easily that these lexical relations increase the joint probability of the question words, given the answer words, over the product of the marginals of the individual words.</Paragraph>
      <Paragraph position="3"> In contrast, the passage of figure a5 11 which contains no answer to the question, also contains no word which is closely related to the word  kind through WordNet. Therefore, the marginal probabilities as well as the joint probability of same question words given this passage are low as compared to the passage of figurea5 10. As a result the second passage gets a low rank.</Paragraph>
      <Paragraph position="4"> Case 2: The passage in figure a5 15 was highest ranked for the question in figurea5 14, even though it does not contain the answer central america. This is because, all question words occur in the passage and therefore, the passage gets a rank of a114 . This highlights a limitation of our mechanism. On the other hand, the passage ranked a125 a24a20a188 contains the answer. It gets a joint probability score of a124a42a21a166a177a141a173a20a124a67a114a13a173a141a125 , even though the word belize does not occur in the answer. This is because belize is connected to the word central america and also to country through WordNet. The passage shown in figurea5 17, which does not contain the answer, got a pretty low rank of 10 because it induced a low joint probability of</Paragraph>
      <Paragraph position="6"> a123a42a114 on the question even though the word belize was present in the passage, because locate was absent in the passage and it is not immediatly connected to other words in the passage. This again illustrates the advantage of using Bayesian inferencing on lexical relations.</Paragraph>
      <Paragraph position="7"> Case 3: Here we present an example to illustrate where the mechanism can go wrong due to the of absence of links. The passage in figure a5 19 induces a conditional joint probability of a114 on the question in figure a5 18, because the passage contains all the words present in the question. The passage however does not answer the question. On the other hand, the passage shown in figure a5 20 contains the answer, but induces a lower joint probability on the question - because the verb stand for is not closely related, through WordNet to any of the words in the passage. In fact, one would have expected stands for and stand for to be related to each other through  Bayesian Marginal Probs: belize: 1.000000, locate: 1.000000 ....settlers has been confirmed to the east of the historic monuments that are being used as a reference point with Belize . She pointed out that in case they prove the settlement is located in the protected Mayan biosphere area and that it was established illegally , the settlers will have to leave the area , but the.....</Paragraph>
      <Paragraph position="9"> Bayesian Marginal Probs: belize: 0.889529, locate: 1.000000 ....confirmed that the Belizean Government will assume responsibility for its own defense as of 1 January 1994 and announced that it had started the &amp;quot; immediate withdrawal of the UK troops stationed in that country located in the central american isthmus . Lourdes ......</Paragraph>
      <Paragraph position="10"> Figure 16: Answer to Q2, Rank = 2, Score(Joint Probability) = 0.890192, DocID: FBIS3-50428 Bayesian Marginal Probs: belize: 1.000000, locate: 0.033451 ....prepared to begin negotiations on the territorial dispute with Guatemala ; : adding that a commission has been created for this purpose and only the final details must be settled . The Guatemalan Government has recognized Belize 's independence ; : therefore , we have accepted the fact that a .....</Paragraph>
      <Paragraph position="11"> Figure 17: Non-answer for Q2, Rank = 10, Score(Joint Probability) = 0.452310, DocID: FBIS4-56830 WordNet.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>