<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1041">
  <Title>Answer Extraction</Title>
  <Section position="4" start_page="297" end_page="299" type="evalu">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="297" end_page="298" type="sub_section">
      <SectionTitle>
3.1 Results on the TREC-8 Evaluation
</SectionTitle>
      <Paragraph position="0"> The system was evaluated in the TREC-8 question-answering track. TREC provided 198 questions as a blind test set: systems were required to provide five potential answers for each question, ranked in order of plausibility. The output from each system was then scored by hand by evaluators at NIST, each answer being marked as either correct or incorrect. The system's score on a particular question is a function of whether it got a correct answer in the five ranked answers, with higher scores for the answer appearing higher in the ranking. The system receives a score of 1, 1/2, 1/3, 1/4, 1/5, or 0, re2perhaps less desirably, people would not be recognized as a synonym of lives in this example: 200 people would be indistinguishable from 200 pumpkins.</Paragraph>
      <Paragraph position="1">  spectively, according as the correct answer is ranked 1st, 2nd, 3rd, 4th, 5th, or lower in the system output. The final score for a system is calculated as its mean score on the 198 questions.</Paragraph>
      <Paragraph position="2"> The TREC evaluation considered two question-answering scenarios: one where answers were limited to be less than 250 bytes in length, the other where the limit was 50 bytes. The output from the passage retrieval component (section 2.1), with some trimming of passages to ensure they were less than 250 bytes, was submitted to the 250 byte scenario.</Paragraph>
      <Paragraph position="3"> The output of the full entity-based system was submitted to the 50 byte track. For comparison, we also submitted the output of a 50-byte system based on IR techniques alone. In this system single-sentence passages were retrieved as potential answers, their score being calculated using conventional IR methods. Some trimming of sentences so that they were less than 50 bytes in length was performed.</Paragraph>
      <Paragraph position="4"> Figure 1 shows results on the TREC-8 evaluation.</Paragraph>
      <Paragraph position="5"> The 250-byte passage-based system found a correct answer somewhere in the top five answers on 68% of the questions, with a final score of 0.545. The 50-byte passage-based system found a correct answer on 38.9% of all questions, with an average score of 0.261. The reduction in accuracy when moving from the 250-byte limit to the 50-byte limit is expected, because much higher precision is required; the 50-byte limit allows much less extraneous material to be included with the answer. The benefit of the including less extraneous material is that the user can interpret the output with much less effort.</Paragraph>
      <Paragraph position="6"> Our entity-based system found a correct answer in the top five answers on 46% of the questions, with a final score of 0.356. The performance is not as good as that of the 250-byte passage-based system.</Paragraph>
      <Paragraph position="7"> But when less extraneous material is permitted, the entity-based system outperforms the passage-based approach. The accuracy of the entity-based system is significantly better than that of the 50-byte passage-based system, and it returns virtually no extraneous material, as reflected in the average answer length of only 10.5 bytes. The implication is that NLP techniques become increasingly useful when short answers are required.</Paragraph>
    </Section>
    <Section position="2" start_page="298" end_page="298" type="sub_section">
      <SectionTitle>
3.2 Error Analysis of the Entity-Based
System
3.2.1 Ranking of Answers
</SectionTitle>
      <Paragraph position="0"> As a first point, we looked at the performance of the entity-based system, considering the queries where the correct answer was found somewhere in the top 5 answers (46% of the 198 questions). We found that on these questions, the percentage of answers ranked 1, 2, 3, 4, and 5 was 66%, 14%, 11%, 4%, and 4% respectively. This distribution is by no means uniform; it is clear that when the answer is somewhere in the top five, it is very likely to be ranked 1st or 2nd. The system's performance is quite bimodah it either completely fails to get the answer, or else recovers it with a high ranking.</Paragraph>
    </Section>
    <Section position="3" start_page="298" end_page="298" type="sub_section">
      <SectionTitle>
3.2.2 Accuracy on Different Categories
</SectionTitle>
      <Paragraph position="0"> Figure 2 shows the distribution of question types in the TREC-8 test set (&amp;quot;Percentage of Q's&amp;quot;), and the performance of the entity-based system by question type (&amp;quot;System Accuracy&amp;quot;). We categorized the questions by hand, using the eight categories described in section 2.3, plus two categories that essentially represent types that were not handled by the system at the time of the TREC competition: Monetary Amount and Miscellaneous.</Paragraph>
      <Paragraph position="1"> &amp;quot;System Accuracy&amp;quot; means the percentage of questions for which the correct answer was in the top five returned by the system. There is a sharp division in the performance on different question types. The categories Person, Location, Date and Quantity are handled fairly well, with the correct answer appearing in the top five 60% of the time. These four categories make up 67% of all questions. In contrast, the other question types, accounting for 33% of the questions, are handled with only 15% accuracy.</Paragraph>
      <Paragraph position="2"> Unsurprisingly, the Miscellaneous and Other Named Entity categories are problematic; unfortunately, they are also rather frequent. Figure 3 shows some examples of these queries. They include a large tail of questions seeking other entity types (mountain ranges, growth rates, films, etc.) and questions whose answer is not even an entity (e.g., &amp;quot;Why did David Koresh ask the FBI for a word processor?&amp;quot;) For reference, figure 4 gives an impression of the sorts of questions that the system does well on (correct answer in top five).</Paragraph>
    </Section>
    <Section position="4" start_page="298" end_page="299" type="sub_section">
      <SectionTitle>
3.2.3 Errors by Component
</SectionTitle>
      <Paragraph position="0"> Finally, we performed an analysis to gauge which components represent performance bottlenecks in the current system. We examined system logs for a 50-question sample, and made a judgment of what caused the error, when there was an error. Figure 5 gives the breakdown. Each question was assigned to exactly one line of the table.</Paragraph>
      <Paragraph position="1"> The largest body of errors, accounting for 18% of the questions, are those that are due to unhandled  different question types. &amp;quot;System Accuracy&amp;quot; means percent of questions for which the correct answer was in the top five returned by the system. &amp;quot;Good&amp;quot; types are in the upper block, &amp;quot;Bad&amp;quot; types are in the lower block.</Paragraph>
      <Paragraph position="2">  particular, by component responsible. Numbers are percent of questions in a 50-question sample.</Paragraph>
      <Paragraph position="3"> five, but not at rank one, are almost all due to failures of entity ranking) Various factors contributing to misrankings are the heavy weighting assigned to answers in the top-ranked passage, the failure to adjust frequencies by &amp;quot;complexity&amp;quot; (e.g., it is significant if 22.5 million occurs several times, but not if 3 occurs several times), and the failure of the system to consider the linguistic context in which entities  Miscellaneous questions.</Paragraph>
      <Paragraph position="4"> types, of which half are monetary amounts. (Questions with non-entity answers account for another 4%.) Another large block (16%) is due to the passage retrieval component: the correct answer was not present in the retrieved passages. The linguistic components together account for the remaining 14% of error, spread evenly among them.</Paragraph>
      <Paragraph position="5"> The cases in which the correct answer is in the top</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>