<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1008"> <Title>Using decision trees to select the grammatical relation of a noun phrase</Title> <Section position="5" start_page="0" end_page="66" type="metho"> <SectionTitle> 2. Data </SectionTitle> <Paragraph position="0"> A total of 5,252 mentions were annotated from the Encarta electronic encyclopedia and 4,937 mentions from the Wall Street Journal (WSJ). Sentences were parsed using the Microsoft English Grammar (Heidorn 1999) to extract mentions and linguistic features. These analyses were then hand-corrected to eliminate noise in the training data caused by inaccurate parses, allowing us to determine the upper bound on accuracy for the classification task if the computational analysis were perfect. Zero anaphors were annotated only when they occurred as subjects of coordinated clauses. They have been excluded from the present study since they are invariably discourse-given subjects.</Paragraph> </Section> <Section position="6" start_page="66" end_page="67" type="metho"> <SectionTitle> 3. Features </SectionTitle> <Paragraph position="0"> Nineteen linguistic features were annotated, along with information about the referent of each mention.</Paragraph> <Paragraph position="1"> On the basis of the reference information we extracted the feature \[InformationStatus\], distinguishing &quot;discourse-new&quot; versus &quot;discourse-old&quot;. All mentions without a prior coreferential mention in the text were classified as discourse-new, even if they would not traditionally be considered referential. \[InformationStatus\] is not directly observable since it requires the analyst to make decisions about the referent of a mention.</Paragraph> <Paragraph position="2"> In addition to the feature \[InformationStatus\], the following eighteen observable features were annotated. These are all features that we can reasonably expect syntactic parsers to extract with sufficient accuracy today or in the near future.</Paragraph> <Paragraph position="3">
* \[ClausalStatus\]: Does the mention occur in a main clause (&quot;M&quot;), complement clause (&quot;C&quot;), or subordinate clause (&quot;S&quot;)?
* \[Coordinated\]: The mention is coordinated with at least one sibling.
* \[Definite\]: The mention is marked with the definite article or a demonstrative pronoun.
* \[Fem\]: The mention is unambiguously feminine.
* \[GrRel\]: The grammatical relation of the mention (see below, this section).
* \[HasPossessive\]: Modified by a possessive pronoun or a possessive NP with the clitic 's or s'.
* \[HasPP\]: Contains a postmodifying prepositional phrase.
* \[HasRelCl\]: Contains a postmodifying relative clause.
* \[InQuotes\]: The mention occurs in quoted material.
* \[Lex\]: The specific inflected form of a pronoun, e.g. he, him.
* \[Masc\]: The mention is unambiguously masculine.
* \[NounClass\]: We distinguish common nouns versus proper names. Within proper names, we distinguish the name of a place (&quot;Geo&quot;) versus other proper names (&quot;ProperName&quot;).
* \[Plural\]: The head of the mention is morphologically marked as plural.
* \[POS\]: The part of speech of the head of the mention.
* \[Prep\]: The governing preposition, if any.
* \[RelCl\]: The mention is a child of a relative clause.
* \[TopLevel\]: The mention is not embedded within another mention.
* \[Words\]: The total number of words in the mention, discretized to the following values: {0, 1, 2, 3, 4, 5, 6to10, 11to15, above15}.</Paragraph>
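Concretely, each annotated mention amounts to a fixed-width feature record. The sketch below shows one way to represent such a record in Python; the class and field names are ours, chosen to mirror the bracketed feature labels above, and the paper specifies no implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Mention:
    """One annotated mention; fields mirror the bracketed feature labels."""
    clausal_status: str    # [ClausalStatus]: "M", "C", or "S"
    coordinated: bool      # [Coordinated]
    definite: bool         # [Definite]
    fem: bool              # [Fem]
    gr_rel: str            # [GrRel]: fine-grained grammatical relation (the class to predict)
    has_possessive: bool   # [HasPossessive]
    has_pp: bool           # [HasPP]
    has_rel_cl: bool       # [HasRelCl]
    in_quotes: bool        # [InQuotes]
    lex: Optional[str]     # [Lex]: inflected pronoun form, e.g. "he", "him"
    masc: bool             # [Masc]
    noun_class: str        # [NounClass]: common noun, "Geo", or "ProperName"
    plural: bool           # [Plural]
    pos: str               # [POS]: part of speech of the head
    prep: Optional[str]    # [Prep]: governing preposition, if any
    rel_cl: bool           # [RelCl]
    top_level: bool        # [TopLevel]
    words: str             # [Words]: discretized length, e.g. "2", "6to10", "above15"
    # Not directly observable; annotated from reference information:
    information_status: Optional[str] = None  # "discourse-new" or "discourse-old"
```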
<Paragraph position="17"> Gender (\[Fem\], \[Masc\]) was only annotated for common nouns whose default word sense is gendered (e.g. &quot;mother&quot;, &quot;father&quot;), for common nouns with specific morphology (e.g. with the -ess suffix) and for gender-marked proper names (e.g. &quot;John&quot;, &quot;Mary&quot;). Gender was not marked for pronouns, to avoid difficult encoding decisions such as the use of generic &quot;he&quot;. Gender was also not marked for cases that would require world knowledge.</Paragraph> <Paragraph position="18"> The feature \[GrRel\] was given a much finer-grained analysis than is usual in computational linguistics. Studies in Preferred Argument Structure (PAS) have demonstrated the need to distinguish finer-grained categories than the traditional grammatical relations of English grammar (&quot;subject&quot;, &quot;object&quot;, etc.) in order to account for distributional phenomena in discourse. For example, subjects of intransitive verbs pattern with the direct objects of transitive verbs as being the preferred locus for introducing new mentions. Subjects of transitives, however, are strongly dispreferred slots for the expression of new information. The use of fine-grained grammatical relations enables us to make rather specific claims about the distribution of mentions. The taxonomy of fine-grained grammatical relations is given below in Figure 1.</Paragraph> </Section> <Section position="7" start_page="67" end_page="71" type="metho"> <SectionTitle> 4. Decision trees </SectionTitle> <Paragraph position="0"> For a set of annotated examples, we used decision-tree tools to construct the conditional probability of a specific grammatical relation, given other features in the domain. (Comparison experiments were also done with Support Vector Machines (Platt 2000, Vapnik 1998) using a variety of kernel functions; the results obtained were indistinguishable from those reported here.) The decision trees are constructed using a Bayesian learning approach that identifies tree structures with high posterior probability (Chickering et al. 1997). In particular, a candidate tree structure (S) is evaluated against data (D) using Bayes' rule as follows:</Paragraph> <Paragraph position="1"> p(S | D) = p(D | S) p(S) / p(D) </Paragraph> <Paragraph position="2"> For simplicity, we specify a prior distribution over tree structures using a single parameter kappa (k). Assuming that N(S) probabilities are needed to parameterize a tree with structure S, we use: p(S) = c * k^N(S), where 0 < k ≤ 1, and c is a constant such that p(S) sums to one. Note that smaller values of kappa cause simpler structures to be favored. As kappa grows closer to one (k = 1 corresponds to a uniform prior over all possible tree structures), the learned decision trees become more elaborate. Decision trees were built for k ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 0.999}.</Paragraph> <Paragraph position="3"> Having selected a decision tree, we use the posterior means of the parameters to specify a probability distribution over the grammatical relations.</Paragraph>
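To make the role of kappa concrete, the following minimal Python sketch scores candidate structures under this prior. It is an illustration under stated assumptions, not the authors' implementation: `log_marginal_likelihood` stands in for the Bayesian score log p(D|S) of Chickering et al. (1997), which is not reimplemented here.

```python
import math

KAPPAS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 0.999]

def log_structure_prior(n_params: int, kappa: float) -> float:
    # log p(S) = log c + N(S) * log(kappa); log c is shared by all candidate
    # structures for a fixed kappa, so it can be dropped when comparing them.
    return n_params * math.log(kappa)

def log_posterior(log_marginal_likelihood: float, n_params: int,
                  kappa: float) -> float:
    # log p(S|D) = log p(D|S) + log p(S) - log p(D); p(D) is constant across
    # structures. Smaller kappa penalizes parameter-heavy (more elaborate) trees.
    return log_marginal_likelihood + log_structure_prior(n_params, kappa)

# Hypothetical usage: pick the candidate structure with the highest score.
# candidates = [(log_p_d_given_s, n_params), ...]  # one entry per tree
# best = max(candidates, key=lambda c: log_posterior(c[0], c[1], kappa=0.7))
```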
<Paragraph position="4"> To avoid overfitting, nodes containing fewer than fifty examples were not split during the learning process. In building decision trees, 70% of the data was used for training and 30% for held-out evaluation.</Paragraph> <Paragraph position="5"> The decision trees constructed can be rather complex, making them difficult to present visually. Figure 2 gives a simpler decision tree that predicts the grammatical relation of a mention for Encarta at k=0.7. The tree was constructed using a subset of the morphological and syntactic features: \[Coordinated\], \[HasPP\], \[Lex\], \[NounClass\], \[Plural\], \[POS\], \[Prep\], \[RelCl\], \[TopLevel\], \[Words\]. Grammatical relations with only a residual probability are omitted for the sake of clarity. The top-ranked grammatical relation at each leaf node appears in bold type. Selecting the top-ranked grammatical relation at each node results in a correct decision 58.82% of the time in the held-out test data. By way of comparison, the best decision tree for Encarta computed using all morphological and syntactic features yields 66.05% accuracy at k=0.999.</Paragraph> <Paragraph position="7"> The distributional facts about the pronoun &quot;he&quot; represented in Figure 2 illustrate the utility of the fine-grained taxonomy of grammatical relations. The pronoun &quot;he&quot; in embedded NPs (\[Prep\] = &quot;-&quot;,</Paragraph> <Paragraph position="9"> relations have only residual probabilities. The use of the traditional notion of subject would fail to capture the fact that, in this syntactic context, the pronoun &quot;he&quot; tends not to occur as Sc, the subject of a copula.</Paragraph> <SectionTitle> 5. Evaluating decision trees </SectionTitle> <Paragraph position="0"> Decision trees were constructed and evaluated for each corpus. We were particularly interested in the accuracy of models built using only observable features. If accurate modeling were to require more abstract discourse features such as \[InformationStatus\], a feature that is not directly observable, then a machine-learning approach to modeling the distribution of mentions would not be computationally feasible. Also of interest was the generality of the models.</Paragraph> <Section position="1" start_page="68" end_page="69" type="sub_section"> <SectionTitle> 5.1 Using Observable Features Only </SectionTitle> <Paragraph position="0"> Decision trees were built for Encarta and the Wall Street Journal using all features except the non-observable discourse feature \[InformationStatus\]. The best accuracy when evaluated against held-out test data and selecting the top-ranked grammatical relation at each leaf node was 66.05% for Encarta at k=0.999 and 65.18% for the Wall Street Journal at k=0.99. Previous studies in Preferred Argument Structure (Corston 1996, Du Bois 1987) have established pairings of fine-grained grammatical relations with respect to abstract discourse factors. New mentions in discourse, for example, tend to be introduced as the subjects of intransitive verbs or as direct objects, and are extremely unlikely to occur as the subjects of transitive verbs. Some languages even give the same morphological and syntactic treatment to subjects of intransitives and direct objects, marking them (so-called &quot;absolutive&quot; case marking) in opposition to subjects of transitives (so-called &quot;ergative&quot; marking). Human referents, on the other hand, tend to occur as the subjects of transitive verbs and as the subjects of intransitive verbs, rather than as objects. Such discourse tendencies perhaps motivate the use of one set of pronouns (the so-called &quot;nominative&quot; pronouns {&quot;he&quot;, &quot;she&quot;, &quot;we&quot;, &quot;I&quot;, &quot;they&quot;}) in a language like English for subjects and a different set of pronouns for objects (the so-called &quot;accusative&quot; set {&quot;him&quot;, &quot;her&quot;, &quot;us&quot;, &quot;me&quot;, &quot;them&quot;}). Thus, we can see that distributional facts about mentions in discourse sometimes cross-cut the morphological and syntactic encoding strategies of a language. With a fine-grained set of grammatical relations, we can allow the decision trees to discover such groupings of relations, rather than attempting to specify the groupings in advance.</Paragraph> <Paragraph position="1"> We evaluated the accuracy of the decision trees by counting as a correct decision a grammatical relation that matched the top-ranked grammatical relation for a leaf node or the second-ranked grammatical relation for that leaf node. With this evaluation criterion, the accuracy for Encarta is 81.92% at k=0.999 and for the Wall Street Journal, 80.70% at k=0.9.</Paragraph>
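Both evaluation criteria used here (top-ranked only, and top two) reduce to asking whether the true relation is among the k most probable relations at the leaf a mention reaches. A minimal Python sketch, assuming a hypothetical `classify_to_leaf` function that maps a feature record to its leaf's probability distribution:

```python
def top_k_accuracy(examples, classify_to_leaf, k=1):
    """Fraction of examples whose true grammatical relation is among the
    k most probable relations at the leaf the example is routed to."""
    correct = 0
    for features, true_gr_rel in examples:
        leaf_dist = classify_to_leaf(features)  # dict: relation -> probability
        top_k = sorted(leaf_dist, key=leaf_dist.get, reverse=True)[:k]
        if true_gr_rel in top_k:
            correct += 1
    return correct / len(examples)

# Hypothetical usage, matching the numbers reported for Encarta:
# top_k_accuracy(test_set, tree.classify, k=1)  # ~0.6605
# top_k_accuracy(test_set, tree.classify, k=2)  # ~0.8192
```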
<Paragraph position="2"> It is clearly naive to assume a baseline for comparison in which all grammatical relations have an equal probability of occurrence, i.e. 1/12 or 0.083.</Paragraph> <Paragraph position="3"> Rather, in Table 1 we compare the accuracy to that obtained by predicting the most frequent grammatical relations observed in the training data. The decision trees perform substantially above this baseline. The top two grammatical relations in the two corpora do not form a natural class. In the Wall Street Journal texts, for example, the top two grammatical relations are Ot (object of a transitive verb) and PPN (prepositional phrase complement of an NP). It is difficult to see how mentions in these two grammatical relations might be related. Objects of transitive verbs, for example, are typically entities affected by the action of the verb. Prepositional phrase complements of NPs, however, are prototypically used to express attributes of the NP, e.g. &quot;the man with the red hat&quot;. The grammatical relations paired by taking the top two predictions at each leaf node in the decision trees constructed for the Wall Street Journal and Encarta, however, frequently correspond to classes that have been previously observed in the literature on Preferred Argument Structure. The groupings {Ot, Si}, {Ot, Sc} and {Si, St}, for example, occur on multiple leaf nodes in the decision trees for both corpora.</Paragraph> </Section> <Section position="2" start_page="69" end_page="70" type="sub_section"> <SectionTitle> 5.2 Using All Features </SectionTitle> <Paragraph position="0"> Decision trees were built for Encarta and the Wall Street Journal using all features including the discourse feature \[InformationStatus\]. As it turned out, the feature \[InformationStatus\] was not selected during the automatic construction of the decision tree for the Wall Street Journal. The performance of the decision trees on held-out test data from the Wall Street Journal therefore remained the same as that given in section 5.1. For Encarta, the addition of \[InformationStatus\] yielded only a modest improvement in accuracy. Accuracy when selecting the top-ranked grammatical relation rose from 66.05% at k=0.999 to 67.32% at k=0.999. Applying a paired t-test, this is statistically significant at the 0.01 level. Selecting the top two grammatical relations caused accuracy to rise from 81.92% at k=0.999 to 82.23% at k=0.999, not a statistically significant improvement.</Paragraph>
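As an illustration of such a test, the sketch below compares two models' per-example correctness on the same held-out set with scipy; the pairing scheme is our assumption, since the paper does not describe it.

```python
from scipy import stats

def paired_significance(correct_without, correct_with):
    # correct_without, correct_with: parallel 0/1 sequences, one entry per
    # held-out example, indicating whether each model predicted the true
    # grammatical relation for that example.
    t_stat, p_value = stats.ttest_rel(correct_with, correct_without)
    return t_stat, p_value
```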
<Paragraph position="1"> The fact that the discourse feature \[InformationStatus\] does not make a marked impact on accuracy is not surprising. The information status of an NP is an important factor in determining elements of form, such as the decision to use a pronoun versus a lexical NP, or the degree of elaboration (e.g. by means of adjectives, post-modifying PPs and relative clauses). Those elements of form can be viewed as proxies for the feature \[InformationStatus\]. Pronouns and definite NPs, for example, typically refer to given entities, and are therefore compatible with the grammatical relation St. Similarly, long indefinite lexical NPs are likely to be new mentions.</Paragraph> <Paragraph position="2"> In a separate set of experiments conducted on the same data, we built decision trees to predict the information status of the referent of a noun phrase using the other linguistic features (grammatical relation, clausal status, definiteness, and so on). Zero anaphors were excluded, yielding 4,996 noun phrases for Encarta and 4,758 noun phrases for the Wall Street Journal. The accuracy of the decision trees was 80.45% for Encarta and 78.36% for the Wall Street Journal. To exclude the strong associations between personal pronouns and information status, we also built decision trees for only the lexical noun phrases in the two corpora, a total of 4,542 noun phrases for Encarta and 4,153 noun phrases for the Wall Street Journal. The accuracy of the decision trees was 78.14% for Encarta and 77.45% for the Wall Street Journal. The feature \[InformationStatus\] can thus be seen to be highly inferable given the other features used.</Paragraph> </Section> <Section position="3" start_page="70" end_page="71" type="sub_section"> <SectionTitle> 5.3 Domain-specificity of the Decision Trees </SectionTitle> <Paragraph position="0"> The decision trees built for the Encarta and Wall Street Journal corpora differ considerably, as is to be expected for such distinct genres. To measure the specificity of the decision trees, we built models using all the data for one corpus and evaluated on all the data in the other corpus, using all features except \[InformationStatus\]. Table 2 gives the baseline figures for this cross-domain evaluation, selecting the most frequent grammatical relations in the training data. The peak accuracy from the decision trees is given in parentheses for comparison. The decision trees perform well above the baseline.</Paragraph> <Paragraph position="1"> Table 3 compares the accuracy of decision trees applied across domains to that of trees constructed and evaluated within a given domain. The extremely specialized sublanguage of Encarta does not generalize well to the Wall Street Journal. In particular, when selecting the top-ranked grammatical relation, the most severe evaluation of the accuracy of the decision trees, training on Encarta and evaluating on the Wall Street Journal results in a drop in accuracy of 7.54% compared to the Wall Street Journal within-corpus model. By way of contrast, decision trees built from the Wall Street Journal data do generalize well to Encarta, even yielding a modest 0.41% improvement in accuracy over the model built for Encarta. Since the Encarta data contains more mentions (5,252) than the Wall Street Journal data (4,937), this effect is not simply due to differences in the size of the training set.</Paragraph> </Section>
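The cross-domain setup reduces to a train/evaluate loop over ordered corpus pairs. A sketch under stated assumptions: `train_tree` and `top_k_accuracy` are hypothetical helpers (the latter as sketched in section 5.1), and `encarta_mentions` / `wsj_mentions` hold the annotated data.

```python
corpora = {"Encarta": encarta_mentions, "WSJ": wsj_mentions}

for train_name, train_data in corpora.items():
    for test_name, test_data in corpora.items():
        if train_name == test_name:
            continue  # within-domain results use a 70/30 split instead
        # Train on all data from one corpus, evaluate on all of the other.
        tree = train_tree(train_data, kappa=0.999)
        acc = top_k_accuracy(test_data, tree.classify, k=1)
        print(f"train={train_name} test={test_name} top-1 accuracy={acc:.2%}")
```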
<Section position="4" start_page="71" end_page="71" type="sub_section"> <SectionTitle> 5.4 Combining the Data </SectionTitle> <Paragraph position="0"> Combining the Wall Street Journal and Encarta data into one dataset and using 70% of the data for training and 30% for testing yielded mixed results.</Paragraph> <Paragraph position="1"> Selecting the top-ranked grammatical relation for the combined data yielded 66.01% at k=0.99, compared to the Encarta-specific accuracy of 66.05% and the Wall Street Journal-specific peak accuracy of 65.18%. Selecting the top two grammatical relations, the peak accuracy for the combined data was 81.39% at k=0.99, a result approximately midway between the corpus-specific results obtained in section 5.1, namely 81.92% for Encarta and 80.70% for the Wall Street Journal.</Paragraph> <Paragraph position="2"> The Wall Street Journal corpus contains a diverse range of articles, including op-ed pieces, mundane financial reporting, and world news. The addition of the relatively homogeneous Encarta articles appears to result in models that are even more robust than those constructed solely on the basis of the Wall Street Journal data. The addition of the heterogeneous Wall Street Journal articles, however, dilutes the focus of the model constructed for Encarta. This perhaps explains the fact that the peak accuracy of the combined model lies above that for the Wall Street Journal but below that for Encarta.</Paragraph> </Section> </Section> </Paper>