<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2069">
  <Title>Examining the Content Load of Part of Speech Blocks for Information Retrieval</Title>
  <Section position="6" start_page="533" end_page="537" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> We present the experiments realised to test the two hypotheses formulated in Section 1. Section 4.1 presents our experimental settings, and Section 4.2 our evaluation results.</Paragraph>
    <Section position="1" start_page="533" end_page="534" type="sub_section">
      <SectionTitle>
4.1 Experimental Settings
</SectionTitle>
      <Paragraph position="0"> We induce POS blocks from the English language component of the second release of the parallel Europarl corpus(75MB)2. We POS tag the corpus using the TreeTagger3, which is a probabilistic POS tagger that uses the Penn TreeBank tagset  VV, VVD, VVG, VVN, VVP, VVZ VB (Marcus et al., 1993). Since we are solely interested in a POS analysis, we introduce a stage of tagset simplification, during which, any information on top of surface POS classification is lost (Table 1). Practically, this leads to 48 original TreeBank (TB) tag classes being narrowed down to 15 Reduced TreeBank (RTB) tag classes. Additionally, tag names are shortened into two-letter names, for reasons of computational efficiency.</Paragraph>
      <Paragraph position="1"> We consider the TBR tags JJ, FW, NN, and VB as open-class, and the remaining tags as closed class (Lyons, 1977). We extract 214,398,227 POS block tokens and 19,343 POS block types from the corpus. null We retrieve relevant documents from two standard TREC test collections, namely WT2G (2GB) and WT10G (10GB), from the 1999 and 2000 TREC Web tracks, respectively. We use the queries 401-450 from the ad-hoc task of the 1999 Web track, for the WT2G test collection, and the queries 451-500 from the ad-hoc task of the 2000 Web track, for the WT10G test collection, with their respective relevance assessments. Each query contains three fields, namely title, description, and narrative. The title contains keywords describing the information need. The description expands briefly on the information need. The narrative part consists of sentences denoting key concepts to be considered or ignored. We use all three  query fields to match query terms to document keyword descriptors, but extract POS blocks only from the narrative field of the queries. This choice is motivated by the two following reasons. Firstly, the narrative includes the longest sentences in the whole query. For our experiments, longer sentences provide better grounds upon which we can test our hypotheses, since the longer a sentence, the more POS blocks we can match within it. Secondly, the narrative field contains the most noise in the whole query. Especially when using bag-of-words term weighting, such as in our evaluation, information on what is not relevant to the query only introduces noise. Thus, we select the most noisy field of the query to test whether the application of our hypotheses indeed results in the reduction of noise.</Paragraph>
      <Paragraph position="2"> During indexing, we remove stopwords, and stem the collections and the queries, using Porter's4 stemming algorithm. We use the Terrier5 IR platform, and apply five different weighting schemes to match query terms to document descriptors. In IR, term weighting schemes estimate the relevance a0a2a1a4a3a6a5a8a7a10a9 of a document a3 for a query  is the frequency of a term in a document; a66a58a67 , and a68 are parameters; a69 and a70a72a71a72a73 a69 are the document length and the average document length in the collection, respectively; a74 is the number of documents in the collec- null all query terms. We also use the well-established probabilistic BM25 weighting scheme (Robertson et al., 1995), and three distinct weighting schemes from the more recent Divergence From Randomness (DFR) framework (Amati, 2003), namely BB2, PL2, and DLH. Note that, even though we use three weighting schemes from the DFR framework, the said schemes are statistically different to one another. Also, DLH is the only parameter-free  a5a29a3a30a9 variables automatically from the collection statistics.</Paragraph>
      <Paragraph position="3"> We use the default values of all parameters, namely, for the TF IDF and BM25 weighting schemes (Robertson et al., 1995), a66a58a67 a13 a67a19a83a85a84 ,</Paragraph>
      <Paragraph position="5"> tions; while for the PL2 and BB2 term weighting schemes (Amati, 2003), a92 a13a94a93 a83a85a95a19a88 for the WT2G test collection, and a92 a13 a91a31a83a85a91a35a95 for the WT10G test collection. We use default values, instead of tuning the term weighting parameters, because our focus lies in testing our hypotheses, and not in optimising retrieval performance. If the said parameters are optimised, retrieval performance may be further improved. We measure the retrieval performance using the Mean Average Precision (MAP) measure (van Rijsbergen, 1979).</Paragraph>
      <Paragraph position="6"> Throughout all experiments, we set POS block length at a0 = 4. We employ Good-Turing and Laplace smoothing, and set the threshold of high probability of occurrence empirically at a0 = 0.01. We present all evaluation results in tables, the format of which is as follows: GT and LA indicate Good-Turing and Laplace respectively, and a96a98a97 denotes the % difference in MAP from the baseline. Statistically significant scores, as per the Wilcoxon test (a99a101a100a102a88a30a83a103a88a104a91 ), appear in boldface, while highest a96 percentages appear in italics.</Paragraph>
    </Section>
    <Section position="2" start_page="534" end_page="537" type="sub_section">
      <SectionTitle>
4.2 Evaluation Results
</SectionTitle>
      <Paragraph position="0"> Our retrieval baseline consists in testing the performance of each term weighting scheme, with each of the two test collections, using the original queries. We introduce two retrieval combinations on top of the baseline, which we call POS and POSC. The POS retrieval experiments, which relate to our first hypothesis, and the POSC retrieval experiments, which relate to our second hypothesis, are described in Section 4.2.1. Section 4.2.2 presents the assessment of our hypotheses using a performance-boosting retrieval technique, namely query expansion.</Paragraph>
      <Paragraph position="1">  The aim of the POS and POSC experiments is to test our first and second hypotheses, respectively. Firstly, to test the first hypothesis, namely that there is a direct connection between the removal of low-frequency POS blocks from the queries and noise reduction in the queries, we remove all low-frequency POS blocks from the narrative field of  the queries. Secondly, to test our second hypothesis as an extension of our first hypothesis, we refilter the queries used in the POS experiments by removing from them POS blocks that contain more closed class than open class tags. The processes involved in both hypotheses take place prior to the removal of stop words and stemming of the queries. Table 2 displays the relevant evaluation results.</Paragraph>
      <Paragraph position="2"> Overall, the removal of low-probability POS blocks from the queries (Hypothesis 1 section in Table 2) is associated with an improvement in retrieval performance over the baseline in most cases, which sometimes is statistically significant. This improvement is quite similar across the two statistical estimators. Moreover, two interesting patterns emerge. Firstly, the DFR weighting schemes seem to be divided, performance-wise, between the parametric BB2 and PL2, which are associated with the highest improvement in retrieval performance, and the non-parametric DLH, which is associated with the lowest improvement, or even deterioration in retrieval performance.</Paragraph>
      <Paragraph position="3"> This may indicate that the parameter used in BB2 and PL2 is not optimal, which would explain a low baseline, and thus a very high improvement over it. Secondly, when comparing the improvement in performance related to the WT2G and the WT10G test collections, we observe a more marked improvement in retrieval performance with WT2G than with WT10G.</Paragraph>
      <Paragraph position="4"> The combination of our two hypotheses (Hypotheses 1+2 section in Table 2) is associated with an improvement in retrieval performance over the baseline in most cases, which sometimes is statistically significant. This improvement is very similar across the two statistical estimators, namely Good-Turing and Laplace. When combining hypotheses 1+2, retrieval performance improves more than it did for hypothesis 1 only, for the WT2G test collection, which indicates that our second hypothesis might further reduce the amount of noise in the queries successfully.</Paragraph>
      <Paragraph position="5"> For the WT10G collection, we object similar results, with the exception of DLH. Generally, the improvement in performance associated to the WT2G test collection is more marked than the improvement associated to WT10G.</Paragraph>
      <Paragraph position="6"> To recapitulate on the evaluation outcomes of our two hypotheses, we report an improvement in retrieval performance over the baseline for most, but not all cases, which is sometimes statistically significant. This may be indicative of successful noise reduction in the queries, as per our hypotheses. Also, the difference in the improvement in retrieval performance across the two test collections may suggest that data sparseness affects retrieval performance.</Paragraph>
      <Paragraph position="7">  with Query Expansion Query expansion (QE) is a performance-boosting technique often used in IR, which consists in extracting the most relevant terms from the top retrieved documents, and in using these terms to expand the initial query. The expanded query is then used to retrieve documents anew.</Paragraph>
      <Paragraph position="8"> Query expansion has the distinct property of improving retrieval performance when queries do not contain noise, but harming retrieval performance when queries contain noise, furnishing us with a strong baseline, against which we can measure our hypotheses. We repeat the experiments described in Section 4.2.1 with query expansion.</Paragraph>
      <Paragraph position="9"> We use the Bo1 query expansion scheme from the DFR framework (Amati, 2003). We optimise the query expansion settings, so as to maximise its performance. This provides us with an even stronger baseline, against which we can compare our proposed technique, which we tune empirically too through the tuning of the threshold a0 . We optimise query expansion on the basis of the corresponding relevance assessments available for the queries and collections employed, by selecting the most relevant terms from the top retrieved documents. For the WT2G test collection, the relevant terms / top retrieved documents ratio we use is (i) 20/5 with TF IDF, BM25, and DLH; (ii) 30/5 with PL2; and (iii) 10/5 with BB2. For the WT10G collection, the said ratio is (i) 10/5 for TF IDF; (ii) 20/5 for BM25 and DLH; and (iii) 5/5 for PL2 and BB2.</Paragraph>
      <Paragraph position="10"> We repeat our POS and POSC retrieval experiments with query expansion. Table 3 displays the relevant evaluation results.</Paragraph>
      <Paragraph position="11"> Query expansion has overall improved retrieval performance (compare Tables 2 and 3), for both test collections, with two exceptions, where query expansion has made no difference at all, namely for BB2 and PL2, with the WT10G collection.</Paragraph>
      <Paragraph position="12"> The removal of low-probability POS blocks from the queries, as per our first hypothesis, combined with query expansion, is associated with an im- null provement in retrieval performance over the new baseline at all times, which is sometimes statistically significant. This may indicate that noise has been further reduced in the queries. Also, the two statistical estimators lead to similar improvements in retrieval performance. When we compare these results to the ones reported with identical settings but without query expansion (Table 2), we observe the following. Firstly, the previously reported division in the DFR weighting schemes, where BB2 and PL2 improved the most from our hypothesised noise reduction in the queries, while DLH improved the least, is no longer valid. The improvement in retrieval performance now associated to DLH is similar to the improvement associated with the other weighting schemes. Secondly, the difference in the retrieval improvement previously observed between the two test collections is now smaller.</Paragraph>
      <Paragraph position="13"> To recapitulate on the evaluation outcomes of our two hypotheses combined with query expansion, we report an improvement in retrieval performance over the baseline at all times, which is sometimes statistically significant. It appears that the combination of our hypotheses with query expansion tones down previously reported sharp differences in retrieval improvements over the base-line (Table 2), which may be indicative of further noise reduction.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>