<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1906">
  <Title>Passage Selection to Improve Question Answering</Title>
  <Section position="5" start_page="1" end_page="1" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> This section presents the experiments developed for training and evaluating our approach. The experiments have been run on the TREC-9 QA Track question set and document collections.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.1 Data collection
</SectionTitle>
      <Paragraph position="0"> The TREC-9 question test set consists of 682 questions whose answers are included in the document collection. The document set comprises 978,952 documents from the following TIPSTER and TREC collections: AP Newswire,</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.2 Training
</SectionTitle>
      <Paragraph position="0"> The training experiments had two objectives: (1) to calculate the optimum number of sentences (N) that defines passage length, and (2) to test two different ways of applying our method.</Paragraph>
      <Paragraph position="1"> The first training experiment works on the output of one of the current best-performing IR systems (the ATT system): it re-sorts that output (the first 1,000 ranked documents) using IR-n. The second experiment uses our proposal as the main IR system, that is, it indexes the whole collection with IR-n. For each experiment, several passage lengths were tested: 5, 10, 15 and 20 sentences. The relevance of each returned document was measured with the tool provided by the TREC organization, which determines whether a passage contains the right answer. The two experiments are summed up in Figure 1.</Paragraph>
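The passage-based view of a document described above can be sketched as follows. This is a minimal illustration, not the actual IR-n implementation: the function name and the non-overlapping window split are assumptions, since the paper only states that passages are defined as N consecutive sentences.

```python
def sentence_passages(sentences, n):
    """Split a document, given as a list of sentences, into consecutive
    passages of at most n sentences each (non-overlapping sketch)."""
    return [sentences[i:i + n] for i in range(0, len(sentences), n)]

# toy document of 12 sentences, passage length n = 5
doc = ["sentence %d" % i for i in range(12)]
passages = sentence_passages(doc, 5)
print(len(passages))          # 3 passages
print(len(passages[-1]))      # last passage holds the remaining 2 sentences
```

In the experiments, n is varied over 5, 10, 15 and 20 to find the passage length that maximizes answer coverage.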
      <Paragraph position="2">  These experiments were performed using only the first 100 questions included in the data collection. Table 1 shows training results for passages of 5, 10, 15 and 20 sentences using both approaches. These results measure the number of questions whose correct answer was included in the top n retrieved passages (or documents) for the training question set. The first experiment (IR-n Ref) uses IR-n on the 1,000 documents returned by the ATT system, while the second one (IR-n) applies passage retrieval over the whole collection.</Paragraph>
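The evaluation measure used in Table 1 can be sketched as a simple coverage count: the number of questions whose answer appears in the top n retrieved passages. The substring-based answer checker below is a hypothetical stand-in for the TREC-provided tool mentioned in the text.

```python
def coverage_at_n(ranked_passages_per_question, answer_checker, n):
    """Count questions whose correct answer is found in the top-n
    retrieved passages, per the evaluation described in the paper."""
    hits = 0
    for (question, answer), passages in ranked_passages_per_question:
        if any(answer_checker(answer, p) for p in passages[:n]):
            hits += 1
    return hits

def contains_answer(answer, passage):
    # hypothetical checker; the real TREC tool uses answer patterns
    return answer in passage

data = [
    (("capital of France?", "Paris"),
     ["London is large", "Paris is the capital of France"]),
    (("year of moon landing?", "1969"),
     ["The moon landing was in 1969", "irrelevant text"]),
    (("color of the sky?", "blue"),
     ["red things", "green things"]),
]
print(coverage_at_n(data, contains_answer, 2))  # 2 of 3 questions covered
```

Dividing this count by the question-set size gives the percentages reported later for the full 682-question evaluation.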
      <Paragraph position="3"> As we can see, IR-n Ref and IR-n obtain similar results, although using our approach to re-rank the output of a good IR system performs slightly better than applying IR-n over the whole document collection. Regarding the number of sentences used to define passage length, the best results are obtained with passages of 20 sentences. In this case, both tests significantly improve on the performance of the ATT system: with a passage length of 20 sentences, the improvement ranges from 12 (IR-n Ref) and 10 (IR-n) points when only the first 5 documents retrieved are considered, to 8 and 7 points respectively when the first 200 documents are taken into account.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.3 Experiment
</SectionTitle>
      <Paragraph position="0"> In order to evaluate our proposal, we compared the quality of the information retrieved by our approaches with the ranked list retrieved by the ATT IR system. For this evaluation, all 682 questions included in the data collection were processed and the number N of sentences per passage was set to 20. Table 2 shows the results of this evaluation experiment: the percentage of questions whose answer can be found in the first n documents returned by the ATT IR system and in the best n passages returned by IR-n Ref and IR-n respectively.</Paragraph>
      <Paragraph position="1"> These results are also presented in Figure 2. They confirm the training results: both approaches perform better than the ATT system, and improvements range from 6 to 12 points for a 20-sentence passage length.</Paragraph>
    </Section>
  </Section>
</Paper>