
<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1906">
  <Title>Passage Selection to Improve Question Answering</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 IR-n overview
</SectionTitle>
    <Paragraph position="0"> In this section, we describe the architecture of the proposed PR system, namely IR-n, focusing on its three main modules: indexing, passage retrieval and query expansion.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Indexing module
</SectionTitle>
      <Paragraph position="0"> The main aim of this module is to generate the dictionaries that contain all the required information for the passage retrieval module. It requires the following information for each term: * The number of documents that contain this term.</Paragraph>
      <Paragraph position="1"> * For each document: [?] The number of times this term appears in the document.</Paragraph>
      <Paragraph position="2"> [?] The position of each term in the document represented as the number of sentence it appears in.</Paragraph>
      <Paragraph position="3"> As term, we consider the stem produced by the Porter stemmer on those words that do not appear in a list of stop-words, list that is similar to those generally used for IR. On the other hand, query terms are also extracted in the same way, that is to say, we only consider the stems of query words that do not appear in the stop-words list.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.2 Passage retrieval module
</SectionTitle>
      <Paragraph position="0"> This module extracts the passages according to its similarity with the user's query. The scheme in this process is the following:  1. Query terms are sorted according to the number of documents they appear in. Terms that appear in fewer documents are processed firstly. 2. The documents that contain any query term are selected.</Paragraph>
      <Paragraph position="1"> 3. The following similarity measure is calculated  for each passage p (contained in the selected documents) with the query q:</Paragraph>
      <Paragraph position="3"> is the number of times that the term t appears in the passage p. f q,t represents the number of times that the term t appears in the query q. N is the number of documents in the collection and f</Paragraph>
      <Paragraph position="5"> is refers to the number of documents that contain the term t.  4. Only the most relevant passage of each document is selected for retrieval. 5. The selected passages are sorted by their similarity measure.</Paragraph>
      <Paragraph position="6"> 6. Passages are associated with the document they pertain and they are presented in a ranked list form.</Paragraph>
      <Paragraph position="7">  As we can notice, the similarity measure is similar to the cosine measure presented in [15]. The only difference is that the size of each passage (the number of terms) is not used to normalise the results. This proposal performs normalization according to the fixed number of sentences per passage. This difference makes the calculation simpler than other discourse-based PR or IR systems. Another important detail to remark is that we are using N as the number of documents in the collection, instead of the number of passages according to the considerations presented in [9].</Paragraph>
      <Paragraph position="8"> As it has been commented, our PR system uses variable-sized passages that are based on a fixed number of sentences (with different number of terms per passage). The passages overlap each other, that is to say, if a passage contains N sentences, the first passage will be formed by the sentences from 1 to N, the second one from 2 to N+1, and so on. We decided to overlap just one sentence according to the experiments and results presented in [12]. This work studied the optimum number of overlapping sentences in each passage for retrieval purposes concluding, that best results were obtained when only one overlapping sentence was used. Regarding to the optimum number (N) of sentences per passage considered in this paper, it will be experimentally obtained.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>