<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1045">
  <Title>Less is more: Eliminating index terms from subordinate clauses</Title>
  <Section position="4" start_page="0" end_page="349" type="metho">
    <SectionTitle>
2 System description
</SectionTitle>
    <Paragraph position="0"> At the core of the Microsoft English Grammar (MEG) is a broad-coverage parser that produces conventional phrase structure analyses augmented with grammatical relations; this parser is the basis for the grammar checker in Microsoft Word (Heidorn 1999). Syntactic analyses undergo further processing in order to derive logical forms (LFs), which are graph structures that describe labeled dependencies among content words in the original input. LFs normalize certain syntactic alternations (e.g.</Paragraph>
    <Paragraph position="1"> active/passive) and resolve both intrasentential anaphora and long-distance dependencies.</Paragraph>
    <Paragraph position="2"> Over the past two years we have been exploring the use of MEG LFs as a means of improving IR precision. This work, which is embodied in a natural language query feature in the Microsoft Encarta 99 encyclopedia, augments a traditional keyword document index with a second index that contains linguistically informed terms. Two types of terms are stored in this linguistic index:
1. LF triples. These are subgraphs extracted from the LF. Each triple has the form word1-relation-word2, describing a dependency relation between two content words. For example, for the sentence Abraham Lincoln, the president, was assassinated by John Wilkes Booth, we extract the following LF triples:
assassinate--LSubj--John_Wilkes_Booth
assassinate--LObj--Abraham_Lincoln
Abraham_Lincoln--Equiv--president
2. Subject terms. These are terms that indicate which words served as the grammatical head of a surface syntactic subject in the document, for example:
Subject: Abraham_Lincoln
This linguistic index is used to postfilter the output of a conventional statistical search algorithm. An input natural language query is first submitted to the statistical search algorithm as a set of content words, resulting in a ranked set of documents. This ranked set is then re-ranked by attempting to find overlap between the set of linguistic terms stored for each of these documents and corresponding linguistic terms determined by processing the query in MEG. Documents that contain linguistic matches are heuristically ranked according to the nature of the match. Documents that fail to match do not receive a rank, and are typically not displayed to the user. The process of building a secondary linguistic index and matching terms from the query is referred to as natural language matching (NLM) in the discussion below. NLM has been used to filter documents retrieved by several different search technologies operating on different genres of text.</Paragraph>
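    <Paragraph> The postfiltering step can be illustrated with a minimal Python sketch. It assumes that LF triples and subject terms are stored as plain strings and that matching documents are ordered by the number of shared terms; the paper does not specify its ranking heuristics, so that ordering, the function name nlm_postfilter, and the sample documents are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical linguistic index: document id -> set of NLM terms
# (LF triples and subject terms stored as plain strings).
LinguisticIndex = Dict[str, Set[str]]


def nlm_postfilter(ranked_docs: List[str],
                   index: LinguisticIndex,
                   query_terms: Set[str]) -> List[str]:
    """Keep only documents sharing at least one linguistic term with the query.

    Matching documents are ordered by number of shared terms (an assumed
    stand-in for the heuristic ranking); ties keep the statistical order.
    """
    scored = []
    for position, doc_id in enumerate(ranked_docs):
        overlap = len(index.get(doc_id, set()).intersection(query_terms))
        if overlap > 0:
            scored.append((-overlap, position, doc_id))
    return [doc_id for _, _, doc_id in sorted(scored)]


# Illustrative data only; the terms for doc_b and doc_c are invented.
index = {
    "doc_a": {"assassinate--LSubj--John_Wilkes_Booth",
              "assassinate--LObj--Abraham_Lincoln",
              "Subject: Abraham_Lincoln"},
    "doc_b": {"Subject: Abraham_Lincoln"},
    "doc_c": {"Subject: Ford_Theatre"},
}
query_terms = {"assassinate--LObj--Abraham_Lincoln", "Subject: Abraham_Lincoln"}
reranked = nlm_postfilter(["doc_c", "doc_b", "doc_a"], index, query_terms)
```

Here doc_c shares no linguistic term with the query and is dropped, while doc_a moves ahead of doc_b because it shares two terms rather than one.</Paragraph>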
    <Paragraph position="3"> Since NLM was intended for use in consumer products, it was important to minimize index size. We needed an algorithm that would enable us to achieve reductions in index size without adversely affecting precision and recall. At the time when we were conducting these experiments, there did not exist any sufficiently large publicly available corpora of questions and relevant documents for the two genres of interest to us: the world wide web and encyclopedia text. We therefore gathered queries and documents for a web sample (section 3.2) and Encarta 99 (section 3.3), and had nonlinguists perform double-blind evaluations of relevance.</Paragraph>
    <Paragraph position="4"> Three implementation-specific aspects of the NLM index should be noted. First, in order to limit index size, duplicate instances of a term occurring in the same document are stored only once. Second, because of the particular compression scheme used to build the index, all terms require the same number of bits for storage, regardless of the length or number of words they contain. Third, the top ten percent of the NLM terms were suppressed, by analogy with stop words in conventional indexing schemes. Such high frequency terms tended not to be good predictors of document relevance.</Paragraph>
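    <Paragraph> The first and third of these aspects can be sketched as follows. This is an illustration under stated assumptions, not the actual indexing code: it assumes that "top ten percent" means the ten percent of distinct terms with the highest document frequency, and the names build_nlm_index and suppress_fraction are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def build_nlm_index(doc_terms: Dict[str, Iterable[str]],
                    suppress_fraction: float = 0.10) -> Dict[str, Set[str]]:
    """Build a per-document term index with duplicates collapsed and the
    most frequent terms suppressed."""
    # A set stores each term once per document.
    index = {doc_id: set(terms) for doc_id, terms in doc_terms.items()}

    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for terms in index.values() for term in terms)

    # Assumption: suppress the top fraction of distinct terms by document frequency.
    n_suppressed = int(len(doc_freq) * suppress_fraction)
    suppressed = {term for term, _ in doc_freq.most_common(n_suppressed)}

    return {doc_id: terms - suppressed for doc_id, terms in index.items()}


# Ten distinct placeholder terms; "t0" occurs in every document.
example_index = build_nlm_index({
    "d1": ["t0", "t1", "t1", "t2", "t3"],
    "d2": ["t0", "t4", "t5", "t6"],
    "d3": ["t0", "t7", "t8", "t9"],
})
```

In this toy corpus the duplicate "t1" in d1 is stored once, and "t0", the single most frequent of the ten distinct terms, is suppressed everywhere.</Paragraph>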
  </Section>
  <Section position="5" start_page="349" end_page="349" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> We conducted experiments in which we eliminated terms from the NLM index, and then measured precision and recall. The experiments were performed on two test corpora: web pages returned by the Alta Vista search service (section 3.2) and articles from the Encarta electronic encyclopedia (section 3.3).</Paragraph>
    <Section position="1" start_page="349" end_page="349" type="sub_section">
      <SectionTitle>
3.1 The kinds of subordinate clauses
</SectionTitle>
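      <Paragraph position="0"> In order to test the hypothesis that information contained in subordinate clauses is less useful for IR than matrix clause information, we modified the indexing algorithm so that it eliminated terms that occurred in certain kinds of subordinate clauses. We experimented with several clause types (the original list did not survive extraction; the types discussed in sections 3.2 and 3.3 are abbreviated clauses (ABBCL), adverbial clauses (ADVCL), complement clauses (COMPCL), infinitival clauses (INFCL), present participial clauses (PRPRTCL), and relative clauses (RELCL)). In the experiments described below, terms were eliminated from documents during indexing. However, terms were never eliminated from the queries.</Paragraph>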
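      <Paragraph> A minimal sketch of this document-side elimination follows. It assumes a hypothetical representation in which each extracted term carries a single label for the clause it came from ("MAIN" or a subordinate clause type); nested clauses would need a richer representation than this. The sample terms are invented for illustration.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: (term, label of the clause it was extracted from).
TermOccurrence = Tuple[str, str]


def index_terms(occurrences: Iterable[TermOccurrence],
                eliminated_clause_types: Set[str]) -> Set[str]:
    """Return the terms to index for one document, skipping every occurrence
    that comes from an eliminated clause type."""
    return {term for term, clause_type in occurrences
            if clause_type not in eliminated_clause_types}


doc = [
    ("assassinate--LObj--Abraham_Lincoln", "MAIN"),
    ("shoot--LSubj--John_Wilkes_Booth", "RELCL"),
    ("Subject: Abraham_Lincoln", "MAIN"),
    ("Subject: Abraham_Lincoln", "RELCL"),  # also occurs in a main clause
]
kept = index_terms(doc, eliminated_clause_types={"RELCL"})
```

A term is dropped only if every one of its occurrences falls in an eliminated clause type; query terms would bypass this function entirely, since terms were never eliminated from queries.</Paragraph>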
    </Section>
    <Section position="2" start_page="349" end_page="349" type="sub_section">
      <SectionTitle>
3.2 Alta Vista experiments
</SectionTitle>
      <Paragraph position="0"> We gathered 120 natural language queries from colleagues for submission to Alta Vista. 2 The queries averaged 3.7 content words, with a standard deviation of 1.7. 3 The following are illustrative of the queries submitted: Are there any air-conditioned hotels in Bali? Has anyone ported Eliza to Win95? What are the current weather conditions at Steven's Pass? What makes a cat purr? Where is Xian? When will the next non-rerun showing of Star Trek air?</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="349" end_page="349" type="metho">
    <SectionTitle>
2 Alta Vista's main search page
</SectionTitle>
    <Paragraph position="0"> (http://altavista.com) encourages users to submit natural language queries.</Paragraph>
    <Paragraph position="1"> 3 Words like &quot;know&quot; and &quot;find&quot;, which are common in natural language queries, are included in these counts.</Paragraph>
    <Paragraph position="2"> We examined the first thirty documents returned by Alta Vista (or fewer documents for queries that did not return at least thirty documents). This document set comprised 3,440 documents. Since we were not able to determine what percentage of the web Alta Vista accounted for, it was not possible to calculate the recall of this document set. In the discussion below, we calculate recall as a percentage of the relevant documents returned by Alta Vista. Precision and recall are averaged across all queries submitted to Alta Vista. The documents returned by Alta Vista were indexed using NLM (section 2) and filtered to retain only documents that contained matches.</Paragraph>
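    <Paragraph> The evaluation described above can be sketched as follows. Recall here is relative to the relevant documents among those the search engine returned, not to the whole web. The function names and the treatment of empty sets (scored as 0.0) are assumptions made for illustration.

```python
from typing import List, Set, Tuple


def precision_recall(kept: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall for one query.

    `kept` holds the documents that survive NLM filtering; `relevant` holds
    the relevant documents among those the search engine returned.
    """
    hits = len(kept.intersection(relevant))
    precision = hits / len(kept) if kept else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def macro_average(per_query: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Average precision and recall across queries."""
    n = len(per_query)
    return (sum(p for p, _ in per_query) / n,
            sum(r for _, r in per_query) / n)


# Two invented queries with invented judgments.
results = [
    precision_recall(kept={"a", "b"}, relevant={"a", "c"}),
    precision_recall(kept={"d"}, relevant={"d"}),
]
averaged = macro_average(results)
```

Averaging per query, rather than pooling all documents, gives each query equal weight regardless of how many documents it returned.</Paragraph>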
    <Paragraph position="3"> Table 1 contrasts the baseline NLM figures (indexing based on terms in all clauses) with the results of eliminating from the documents all terms that occurred in subordinate clauses.</Paragraph>
    <Paragraph position="4"> To measure the trade-off between precision and recall, we calculated the F-measure (Van Rijsbergen 1980), defined as $F = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$,</Paragraph>
    <Paragraph position="6"> where $P$ is precision, $R$ is recall and $\beta$ is the relative weight assigned to precision and recall (for these experiments, $\beta = 1$).</Paragraph>
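    <Paragraph> With equal weighting, the F-measure reduces to the harmonic mean of precision and recall. The sketch below assumes the standard Van Rijsbergen formulation and also shows a relative-change helper of the kind needed to express results as percentages of a baseline; the helper name relative_change and the zero-denominator convention are assumptions.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Van Rijsbergen's F-measure; beta = 1 weights precision and recall equally."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention when both precision and recall are zero
    return (beta ** 2 + 1) * precision * recall / denominator


def relative_change(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    return 100.0 * (new - baseline) / baseline


balanced = f_measure(0.5, 0.5)
halved = relative_change(0.5, 0.25)
```

Because the harmonic mean is dominated by the smaller of its two arguments, a sharp drop in recall pulls the F-measure down even when precision holds steady.</Paragraph>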
    <Paragraph position="7"> As Table 1 shows, by eliminating terms from all subordinate clauses in the documents, the NLM index size was reduced by 31.4% with only a minor impact (-0.82%) on F-measure.</Paragraph>
    <Paragraph position="8"> Given unique indexing of terms per document, and a constant size per term (section 2), we can deduce that 31.4% of the terms in the NLM index occurred only in subordinate clauses. Had they occurred even once in a main clause, they would not have been removed from the index.</Paragraph>
    <Paragraph position="9"> We ran two comparison experiments. In the first comparison, we deleted one third of all terms as they were produced. Table 2 gives the average results of three runs of this experiment. In each run, a different set of one third of the terms was deleted. Although fewer terms were omitted (28.8% 4 versus 31.4% when all terms in</Paragraph>
  </Section>
  <Section position="7" start_page="349" end_page="354" type="metho">
    <SectionTitle>
4 Terms eliminated from a subordinate clause in
</SectionTitle>
    <Paragraph position="0"> one sentence might persist in the index if they occurred in the main clause of another sentence in the same document, hence a reduction of slightly less than 33.3%.</Paragraph>
    <Paragraph position="1"> subordinate clauses were eliminated), the detrimental effect on F-measure was 5.3 times greater than when terms occurring in subordinate clauses were deleted.</Paragraph>
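    <Paragraph> Why deleting one third of the term occurrences shrinks the index by less than one third can be seen in a small sketch. It assumes that deletion removes every third occurrence in production order, with offsets 0, 1 and 2 giving three non-overlapping runs; the actual selection procedure is not described, and the toy corpus is invented.

```python
from typing import Dict, List


def index_size(doc_occurrences: Dict[str, List[str]]) -> int:
    """Number of stored entries when each term is stored once per document."""
    return sum(len(set(terms)) for terms in doc_occurrences.values())


def delete_every_third(doc_occurrences: Dict[str, List[str]],
                       offset: int) -> Dict[str, List[str]]:
    """Drop every third term occurrence, counting across the corpus in
    production order; offsets 0, 1 and 2 give non-overlapping thirds."""
    kept: Dict[str, List[str]] = {}
    counter = 0
    for doc_id, terms in doc_occurrences.items():
        kept[doc_id] = []
        for term in terms:
            if counter % 3 != offset:
                kept[doc_id].append(term)
            counter += 1
    return kept


corpus = {"d1": ["a", "b", "a", "c", "b", "a"], "d2": ["d", "e", "f"]}
baseline = index_size(corpus)
sizes = [index_size(delete_every_third(corpus, k)) for k in range(3)]
average_reduction = sum(1 - s / baseline for s in sizes) / len(sizes)
```

In the third run, the deleted occurrences of "a" are duplicates of occurrences that remain in the same document, so the stored index loses nothing for them and the average reduction falls below one third.</Paragraph>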
    <Paragraph position="2"> Table 1. Alta Vista: Effects of eliminating subordinate clauses. In the second comparison experiment, we tested the converse of the operation described in the discussion of Table 1 above: we eliminated all search terms from the main clauses of documents, leaving only search terms that occurred in subordinate clauses. Table 3 shows the dramatic effect of this operation: as we expected, the index size was greatly reduced (by 73.8%). However, F-measure was seriously affected, falling by more than two thirds (-68.99%). The effect on F-measure is primarily due to the severe impact on recall, which fell from a tolerable baseline of 43.2% to an unacceptable 7.5%. Compared with eliminating subordinate clause information, eliminating main clause information reduced index size by a factor of approximately 2.4:1 (73.8% versus 31.4%) but reduced F-measure by a factor of approximately 84:1 (-68.99% versus -0.82%). The impact on F-measure from eliminating terms in main clauses is clearly disproportionate to the reduction in index size.</Paragraph>
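    <Paragraph> The arithmetic behind this comparison can be checked directly from the figures quoted in the text. The sketch below also computes F-measure loss per percentage point of index reduction, a derived quantity introduced here only for illustration; the variable names are hypothetical.

```python
# Figures quoted in the text (percent).
SUBORDINATE = {"index_reduction": 31.4, "f_loss": 0.82}
MAIN = {"index_reduction": 73.8, "f_loss": 68.99}

# How much larger each effect is when main clause terms are eliminated.
size_ratio = MAIN["index_reduction"] / SUBORDINATE["index_reduction"]  # about 2.35
f_ratio = MAIN["f_loss"] / SUBORDINATE["f_loss"]                       # about 84.1

# F-measure lost per percentage point of index size saved (illustrative metric).
cost_subordinate = SUBORDINATE["f_loss"] / SUBORDINATE["index_reduction"]
cost_main = MAIN["f_loss"] / MAIN["index_reduction"]
```

On this view, each percentage point of index size saved costs far less F-measure when the terms come from subordinate clauses than when they come from main clauses.</Paragraph>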
    <Paragraph position="3"> Table 3. Alta Vista: Effect of eliminating main clauses. Table 4 isolates the effects of eliminating terms from each kind of subordinate clause. Most remarkable is the fact that eliminating terms that only occur in relative clauses (RELCL) yields a 7.3% reduction in index size while actually improving F-measure. Also worthy of special note is the fact that two kinds of subordinate clauses can be eliminated with no perceptible effect on F-measure: eliminating complement clauses (COMPCL) yields a reduction in index size of 7.4%, and eliminating present participial clauses (PRPRTCL) yields a reduction in index size of [...]. Because clauses can be embedded within other clause types, the effects illustrated in Table 4 are not additive. For example, an infinitival clause (INFCL) may contain a noun phrase with an embedded relative clause (RELCL). Elimination of all terms in the infinitival clause would therefore also lead to elimination of terms in the relative clause.</Paragraph>
    <Section position="1" start_page="352" end_page="354" type="sub_section">
      <SectionTitle>
3.3 Encarta experiments
</SectionTitle>
      <Paragraph position="0"> We gathered 348 queries from middle school students for submission to Encarta, an [...] I need to know where hyenas live.</Paragraph>
      <Paragraph position="1"> In what event is Amy VanDyken the closest to the world record in swimming? What color is a giraffe's tongue? What is the life-expectancy of an elephant? We indexed the text of the Encarta articles, approximately 33,000 files containing approximately 576,000 sentences, using a simple statistical indexing engine. We then submitted each query and gathered the first thirty ranked documents, for a total of 5,218 documents. We constructed an NLM index for the documents returned and, in a second pass, filtered documents using NLM. In the discussion below, recall is calculated as a percentage of the relevant documents that the statistical search returned.</Paragraph>
      <Paragraph position="2"> Table 5 compares the baseline NLM accuracy (indexing all terms) to the accuracy of eliminating terms that occurred in subordinate clauses. The reduction in index size (29.0%) is comparable to the reduction observed in the Alta Vista experiment (31.4%). However, the effect on F-measure of eliminating terms from subordinate clauses is more marked (-4.91%) than in the Alta Vista experiment (-0.82%).</Paragraph>
      <Paragraph position="3"> Table 5. Encarta: Effects of eliminating subordinate clauses. The impact on F-measure is still substantially less than the average of three runs during which arbitrary non-overlapping thirds of the terms were eliminated, as illustrated in Table 6. Eliminating one third of the terms resulted in an 11.57% reduction in F-measure compared to the baseline, approximately 2.4 times greater than the impact of eliminating material in subordinate clauses.</Paragraph>
      <Paragraph position="4"> Table 6. Encarta: Effects of eliminating one third of terms. As Table 7 shows, eliminating terms from main clauses and retaining information in subordinate clauses has a profound effect on recall for the Encarta corpus. As with the Alta Vista experiment (section 3.2), it is instructive to compare the results in Table 7 to the results obtained when terms in subordinate clauses were deleted (Table 5). Approximately 2.7 times as many terms were eliminated from the index, yet the effect on F-measure is almost thirteen times worse.</Paragraph>
      <Paragraph position="5"> Table 8 isolates the effects for Encarta of eliminating terms from each kind of subordinate clause. It is interesting to compare the reduction in index size and the relative change in F-measure for Encarta, a relatively homogeneous corpus of academic articles, to the heterogeneous web sample of section 3.2. For both corpora, eliminating terms that only occur in abbreviated clauses (ABBCL) or present participial clauses (PRPRTCL) results in modest reductions in index size without negatively affecting F-measure. Eliminating terms from adverbial clauses (ADVCL) or infinitival clauses (INFCL) also produces similar effects on the two corpora: a reduction in index size with a modest (less than 1%) reduction in F-measure. Relative clauses (RELCL) and complement clauses (COMPCL), however, behave differently across the two corpora. In both cases, the effects on F-measure are positive for web documents and negative for Encarta articles. The negative impact of the elimination of material from relative clauses in Encarta can perhaps be attributed to the pervasive use of non-restrictive relative clauses in the definitional encyclopedia text, as illustrated by the relative clauses in the following examples: Sargon II (ruled 722-705 BC), who followed Tiglath-pileser's successor, Shalmaneser V (ruled 727-722 BC), to the throne, extended Assyrian domination in all directions, from southern Anatolia to the Persian Gulf. Amaral, Tarsila do (1886-1973), Brazilian painter whose works were instrumental in the development of modernist painting in Brazil. After the so-called Boston Tea Party in 1773, when Bostonians destroyed tea belonging to the East India Company, Parliament enacted four measures as an example to the other rebellious colonies.</Paragraph>
      <Paragraph position="6"> Another peculiar characteristic of the Encarta corpus, namely the pervasive use of complement-taking nominal expressions such as the belief that and the fact that, possibly explains the negative impact of the elimination of complement clause material in Table 8.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="354" end_page="354" type="metho">
    <SectionTitle>
4 Discussion
</SectionTitle>
    <Paragraph position="0"> Although the results presented in section 3 are compelling, it may be possible to refine the identification of clauses from which index terms can be eliminated. In particular, complement clauses subordinate to speech act verbs would appear from failure analysis to warrant special attention. For example, in the following sentence our linguistic intuitions suggest that the content of the complement clause is more informative than the attribution to a speaker in the main clause: John said that the President would not resign in disgrace. Of course, more fine-grained distinctions of this type can only be made given sufficiently rich linguistic analyses as input.</Paragraph>
    <Paragraph position="1"> Another compelling topic for future research would be the impact of less sophisticated analyses to identify various kinds of subordinate clauses.</Paragraph>
    <Paragraph position="2"> The terms eliminated in the experiments presented in this paper were linguistic in nature. However, we would expect similar results if conventional word-based terms were eliminated in similar fashion. In future research, we intend to experiment with eliminating terms from a conventional statistical engine, combining this technique with the standard method of eliminating high-frequency index terms. Rather than eliminating terms from an index, it may also prove fruitful to investigate weighting terms according to the kind of clause in which they occur.</Paragraph>
  </Section>
</Paper>