<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1024">
  <Title>Independence Assumptions Considered Harmful</Title>
  <Section position="4" start_page="0" end_page="183" type="metho">
    <SectionTitle>
2 Statistical Language Modeling
</SectionTitle>
    <Paragraph position="0"> By &quot;statistical language model&quot;, we refer to a mathematical object that &quot;imitates the properties&quot; of some aspects of natural language, and in turn makes predictions that are useful from a scientific or engineering point of view. Much recent work in this framework has used written and spoken natural language data to estimate parameters for statistical models that were characterized by serious limitations: models were either limited to a single explanatory variable or, if more than one explanatory variable was considered, the variables were assumed to be independent. In this section, we describe a method for statistical language modeling that transcends these limitations.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Categorical Data Analysis
</SectionTitle>
      <Paragraph position="0"> Categorical data analysis is the area of statistics that addresses categorical statistical variables: variables whose values are one of a set of categories. An example of such a linguistic variable is PART-OF-SPEECH, whose possible values might include noun, verb, determiner, preposition, etc.</Paragraph>
      <Paragraph position="1"> We distinguish between a set of explanatory variables and one response variable. A statistical model can be used to perform prediction in the following manner: given the values of the explanatory variables, what is the probability distribution for the response variable, i.e., what are the probabilities for the different possible values of the response variable?</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="182" type="sub_section">
      <SectionTitle>
2.2 The Contingency Table
</SectionTitle>
      <Paragraph position="0"> The basic tool used in categorical data analysis is the contingency table (sometimes called the &quot;cross-classified table of counts&quot;). A contingency table is a matrix with one dimension for each variable, including the response variable. Each cell in the contingency table records the frequency of data with the appropriate characteristics.</Paragraph>
      <Paragraph position="1"> Since each cell concerns a specific combination of features, this provides a way to estimate probabilities of specific feature combinations from the observed frequencies, as the cell counts can easily be converted to probabilities. Prediction is achieved by determining the value of the response variable given the values of the explanatory variables.</Paragraph>
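      <Paragraph position="2"> The conversion from cell counts to probabilities can be illustrated with a short Python sketch. The observations, the feature values and the helper names `build_table` and `predict` are invented for illustration and are not taken from our experiments; the sketch works on raw counts, without any smoothing.

```python
from collections import Counter

def build_table(observations):
    """Count how often each (explanatory values..., response) combination occurs."""
    return Counter(observations)

def predict(table, explanatory):
    """Return P(response | explanatory values) computed from the raw cell counts."""
    cells = {key[-1]: n for key, n in table.items() if key[:-1] == explanatory}
    total = sum(cells.values())
    if total == 0:
        return {}  # unseen combination: no estimate is available
    return {response: n / total for response, n in cells.items()}

# Hypothetical observations of (CAPITALIZED, INFLECTION, POS)
data = [
    ("yes", "none", "proper-noun"),
    ("yes", "none", "proper-noun"),
    ("yes", "none", "noun"),
    ("no", "-ed", "verb"),
    ("no", "-ed", "verb"),
    ("no", "-ed", "adjective"),
    ("no", "none", "noun"),
]
table = build_table(data)
dist = predict(table, ("yes", "none"))
```

Here `dist` gives two thirds of the probability mass to proper-noun and one third to noun, whereas `predict(table, ("yes", "-ed"))` returns an empty distribution because that combination was never observed. This is the sparse data problem that motivates smoothing the table.</Paragraph>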
    </Section>
    <Section position="3" start_page="182" end_page="182" type="sub_section">
      <SectionTitle>
2.3 The Loglinear Model
</SectionTitle>
      <Paragraph position="0"> A loglinear model is a statistical model of the effect of a set of categorical variables and their combinations on the cell counts in a contingency table. It can be used to address the problem of sparse data, since it can act as a &quot;smoothing device, used to obtain cell estimates for every cell in a sparse array, even if the observed count is zero&quot; (Bishop, Fienberg, and Holland, 1975).</Paragraph>
      <Paragraph position="1"> Marginal totals (sums for all values of some variables) of the observed counts are used to estimate the parameters of the loglinear model; the model in turn delivers estimated expected cell counts, which are smoother than the original cell counts.</Paragraph>
      <Paragraph position="2"> The mathematical form of a loglinear model is as follows. Let $m_{ijk\ldots}$ be the expected cell count for cell $(i, j, k, \ldots)$ in the contingency table. The general form of a loglinear model is as follows: $$\log m_{ijk\ldots} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + \ldots \qquad (1)$$ In this formula, $u$ denotes the mean of the logarithms of all the expected counts, $u + u_{1(i)}$ denotes the mean of the logarithms of the expected counts with value $i$ of the first variable, $u + u_{2(j)}$ denotes the mean of the logarithms of the expected counts with value $j$ of the second variable, $u + u_{12(ij)}$ denotes the mean of the logarithms of the expected counts with value $i$ of the first variable and value $j$ of the second variable, and so on.</Paragraph>
      <Paragraph position="3"> Thus, the term $u_{1(i)}$ denotes the deviation of the mean of the logarithms of the expected cell counts with value $i$ of the first variable from the grand mean $u$. Similarly, the term $u_{12(ij)}$ denotes the deviation of the mean of the logarithms of the expected cell counts with value $i$ of the first variable and value $j$ of the second variable from the grand mean $u$. In other words, $u_{12(ij)}$ represents the combined effect of the values $i$ and $j$ for the first and second variables on the logarithms of the expected cell counts.</Paragraph>
      <Paragraph position="4"> In this way, a loglinear model provides a way to estimate expected cell counts that depend not only on the main effects of the variables, but also on the interactions between variables. This is achieved by adding &quot;interaction terms&quot; such as $u_{12(ij)}$ to the model. For further details, see (Fienberg, 1980).</Paragraph>
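      <Paragraph position="5"> As a small worked example of the decomposition into a grand mean, main effects and an interaction term, the following Python sketch computes the terms for a two-way table. The expected counts are invented. As an assumption on our part about how the terms are normalized, the interaction term is computed as what is left of each log count after the grand mean and both main effects have been subtracted, which is the definition under which the loglinear formula holds exactly for a two-way table.

```python
import math

# Hypothetical 2x2 table of expected counts m[i][j]
m = [[40.0, 10.0],
     [5.0, 20.0]]

logs = [[math.log(x) for x in row] for row in m]
rows, cols = len(m), len(m[0])

# Grand mean of the log counts
u = sum(sum(row) for row in logs) / (rows * cols)
# Main effects: row and column means of the log counts, relative to the grand mean
u1 = [sum(logs[i]) / cols - u for i in range(rows)]
u2 = [sum(logs[i][j] for i in range(rows)) / rows - u for j in range(cols)]
# Interaction: remainder after the grand mean and both main effects (assumed normalization)
u12 = [[logs[i][j] - u - u1[i] - u2[j] for j in range(cols)] for i in range(rows)]

# With the interaction term, every cell is reproduced exactly
rebuilt = [[math.exp(u + u1[i] + u2[j] + u12[i][j]) for j in range(cols)]
           for i in range(rows)]
# Without it, the two variables contribute independently on the log scale
no_interaction = [[math.exp(u + u1[i] + u2[j]) for j in range(cols)]
                  for i in range(rows)]
```

Comparing `rebuilt` with `no_interaction` shows how much of the structure of the table is carried by the interaction term alone.</Paragraph>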
    </Section>
    <Section position="4" start_page="182" end_page="182" type="sub_section">
      <SectionTitle>
2.4 The Iterative Estimation Procedure
</SectionTitle>
      <Paragraph position="0"> For some loglinear models, it is possible to obtain closed forms for the expected cell counts. For more complicated models, the iterative proportional fitting algorithm for hierarchical loglinear models (Deming and Stephan, 1940) can be used. Briefly, this procedure works as follows.</Paragraph>
      <Paragraph position="1"> Let the values for the expected cell counts that are estimated by the model be represented by the symbol $\hat{m}_{ijk\ldots}$. The interaction terms in the loglinear models represent constraints on the estimated expected marginal totals. Each of these marginal constraints translates into an adjustment scaling factor for the cell entries. The iterative procedure has the following steps:
 1. Start with initial estimates for the estimated expected cell counts. For example, set all $\hat{m}_{ijk\ldots} = 1.0$.</Paragraph>
      <Paragraph position="2"> 2. Adjust each cell entry by multiplying it by the scaling factors. This moves the cell entries towards satisfaction of the marginal constraints specified by the model.</Paragraph>
      <Paragraph position="3"> 3. Iterate through the adjustment steps until the maximum difference $\epsilon$ between the marginal totals observed in the sample and the estimated marginal totals falls below a chosen threshold, e.g. $\epsilon = 0.1$.</Paragraph>
      <Paragraph position="4"> After each cycle, the estimates satisfy the constraints specified in the model, and the estimated expected marginal totals come closer to matching the observed totals. Thus, the process converges. This results in Maximum Likelihood estimates for both multinomial and independent Poisson sampling schemes (Agresti, 1990).</Paragraph>
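      <Paragraph position="5"> The iterative procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in our experiments: the table is stored as a dictionary from cell index tuples to counts, each marginal constraint is given as a tuple of variable positions, and the names `margin` and `ipf`, the `max_cycles` guard and the example counts are ours.

```python
from itertools import product

def margin(table, axes):
    """Sum the table over all variables that are not listed in axes."""
    totals = {}
    for cell, value in table.items():
        key = tuple(cell[a] for a in axes)
        totals[key] = totals.get(key, 0.0) + value
    return totals

def ipf(observed, shape, margins, eps=0.1, max_cycles=100):
    """Fit expected counts whose listed marginal totals match the observed ones."""
    cells = list(product(*(range(n) for n in shape)))
    fitted = {cell: 1.0 for cell in cells}            # step 1: initial estimates
    targets = [margin(observed, axes) for axes in margins]
    for _ in range(max_cycles):
        for axes, target in zip(margins, targets):    # step 2: scale to each margin
            current = margin(fitted, axes)
            for cell in cells:
                key = tuple(cell[a] for a in axes)
                factor = target.get(key, 0.0) / current[key] if current[key] > 0 else 0.0
                fitted[cell] *= factor
        # step 3: stop once every fitted margin is within eps of the observed margin
        worst = max(
            abs(target.get(key, 0.0) - value)
            for axes, target in zip(margins, targets)
            for key, value in margin(fitted, axes).items()
        )
        if eps > worst:
            break
    return fitted

# Hypothetical 2x2x2 table with one empty cell, fitted with all two-way interactions
observed = {
    (0, 0, 0): 10, (0, 0, 1): 4, (0, 1, 0): 6, (0, 1, 1): 0,
    (1, 0, 0): 3, (1, 0, 1): 7, (1, 1, 0): 2, (1, 1, 1): 8,
}
fitted = ipf(observed, shape=(2, 2, 2), margins=[(0, 1), (0, 2), (1, 2)])
```

In this example the fitted table assigns a positive expected count to the cell (0, 1, 1) although its observed count is zero, which is the smoothing effect that makes the loglinear model useful for sparse data.</Paragraph>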
    </Section>
    <Section position="5" start_page="182" end_page="183" type="sub_section">
      <SectionTitle>
2.5 Modeling Interactions
</SectionTitle>
      <Paragraph position="0"> For natural language classification and prediction tasks, the aim is to estimate a conditional probability distribution $P(H \mid E)$ over the possible values of the hypothesis $H$, where the evidence $E$ consists of a number of linguistic features $e_1, e_2, \ldots$. Much of the previous work in this area assumes independence between the linguistic features:</Paragraph>
      <Paragraph position="2"> For example, a model to predict Part-of-Speech of a word on the basis of its morphological affix and its capitalization might assume independence between the two explanatory variables as follows:</Paragraph>
      <Paragraph position="4"> This results in a considerable computational simplification of the model but, as we shall see below,</Paragraph>
      <Paragraph position="5"> leads to a considerable loss of information and a concomitant decrease in prediction accuracy. With a loglinear model, on the other hand, such independence assumptions are not necessary. The loglinear model provides a posterior distribution that is properly conditioned on the evidence, and maximizing the conditional probability $P(H \mid E)$ leads to minimum error rate classification (Duda and Hart, 1973).</Paragraph>
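      <Paragraph position="6"> The loss of information caused by an independence assumption can be made concrete with a toy example in Python. Everything in this sketch is an assumption made for illustration: the counts are invented, the independence model is taken to be of the form $P(H) \prod_i P(e_i \mid H)$, normalized over the hypotheses, and the properly conditioned estimate is read from raw cross-classified counts rather than from a smoothed table.

```python
from collections import Counter

# Hypothetical counts of (e1, e2, H); H depends on the combination of e1 and e2
counts = Counter({
    ("a", "x", "H1"): 10, ("b", "y", "H1"): 10,
    ("a", "y", "H2"): 10, ("b", "x", "H2"): 10,
})
hypotheses = sorted({h for _, _, h in counts})
total = sum(counts.values())

def normalize(scores):
    """Scale the scores so that they sum to one."""
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

def independent(e1, e2):
    """P(H | e1, e2), assuming e1 and e2 are independent given H."""
    scores = {}
    for h in hypotheses:
        n_h = sum(n for (_, _, hh), n in counts.items() if hh == h)
        n_e1 = sum(n for (f1, _, hh), n in counts.items() if hh == h and f1 == e1)
        n_e2 = sum(n for (_, f2, hh), n in counts.items() if hh == h and f2 == e2)
        scores[h] = (n_h / total) * (n_e1 / n_h) * (n_e2 / n_h)
    return normalize(scores)

def joint(e1, e2):
    """P(H | e1, e2) read directly from the cross-classified counts."""
    return normalize({h: counts[(e1, e2, h)] for h in hypotheses})

by_independence = independent("a", "x")
by_conditioning = joint("a", "x")
```

For the evidence ("a", "x"), the independence model cannot separate the two hypotheses and assigns each a probability of 0.5, while conditioning on the combination of features assigns probability 1.0 to H1.</Paragraph>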
    </Section>
  </Section>
  <Section position="5" start_page="183" end_page="185" type="metho">
    <SectionTitle>
3 Predicting Part-of-Speech
</SectionTitle>
    <Paragraph position="0"> We will now turn to the empirical evidence supporting the argument against independence assumptions. In this section, we will compare two models for predicting the Part-of-Speech of an unknown word: a simple model that treats the various explanatory variables as independent, and a model using log-linear smoothing of a contingency table that takes into account the interactions between the explanatory variables.</Paragraph>
    <Section position="1" start_page="183" end_page="183" type="sub_section">
      <SectionTitle>
3.1 Constructing the Model
</SectionTitle>
      <Paragraph position="0"> The model was constructed in the following way.</Paragraph>
      <Paragraph position="1"> First, features that could be used to guess the POS of a word were determined by examining the training portion of a text corpus. The initial set of features consisted of the following:
 * INCLUDES-NUMBER. Does the word include a number?
 * CAPITALIZED. Is the word in sentence-initial position and capitalized, in any other position and capitalized, or in lower case?
 * INCLUDES-PERIOD. Does the word include a period?
 * INCLUDES-COMMA. Does the word include a comma?
 * FINAL-PERIOD. Is the last character of the word a period?
 * INCLUDES-HYPHEN. Does the word include a hyphen?
 * ALL-UPPER-CASE. Is the word in all upper case?
 * SHORT. Is the length of the word three characters or less?
 * INFLECTION. Does the word carry one of the English inflectional suffixes?
 * PREFIX. Does the word carry one of a list of frequently occurring prefixes?
 * SUFFIX. Does the word carry one of a list of frequently occurring suffixes?
 Next, exploratory data analysis was performed in order to determine relevant features and their values, and to approximate which features interact. Each word of the training data was then turned into a feature vector, and the feature vectors were cross-classified in a contingency table. The contingency table was smoothed using a loglinear model.</Paragraph>
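      <Paragraph position="2"> Turning a word into a feature vector can be sketched in Python as follows. The affix lists, the value names and the function names are hypothetical stand-ins chosen for illustration; the lists of inflections, prefixes and suffixes that were actually used are not reproduced here.

```python
# Hypothetical affix lists, for illustration only
INFLECTIONS = ("ing", "ed", "s")
PREFIXES = ("un", "re", "anti")
SUFFIXES = ("tion", "ness", "able")

def first_match(word, affixes, at_end):
    """Return the first listed affix that the word carries, or 'none'."""
    lower = word.lower()
    for affix in affixes:
        carries = lower.endswith(affix) if at_end else lower.startswith(affix)
        if carries and len(lower) > len(affix):
            return affix
    return "none"

def capitalization(word, sentence_initial):
    """Three-valued CAPITALIZED feature."""
    if not word[:1].isupper():
        return "lower-case"
    return "sentence-initial-capitalized" if sentence_initial else "capitalized"

def features(word, sentence_initial=False):
    """Map a word to the initial feature set described in the text."""
    return {
        "INCLUDES-NUMBER": any(ch.isdigit() for ch in word),
        "CAPITALIZED": capitalization(word, sentence_initial),
        "INCLUDES-PERIOD": "." in word,
        "INCLUDES-COMMA": "," in word,
        "FINAL-PERIOD": word.endswith("."),
        "INCLUDES-HYPHEN": "-" in word,
        "ALL-UPPER-CASE": word.isupper(),
        "SHORT": 3 >= len(word),
        "INFLECTION": first_match(word, INFLECTIONS, at_end=True),
        "PREFIX": first_match(word, PREFIXES, at_end=False),
        "SUFFIX": first_match(word, SUFFIXES, at_end=True),
    }

vector = features("Re-examined")
```

Each such vector, together with the POS tag of the word, identifies one cell of the contingency table whose count is incremented.</Paragraph>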
    </Section>
    <Section position="2" start_page="183" end_page="183" type="sub_section">
      <SectionTitle>
3.2 Data
</SectionTitle>
      <Paragraph position="0"> Training and evaluation data was obtained from the Penn Treebank Brown corpus (Marcus, Santorini, and Marcinkiewicz, 1993). The characteristics of &quot;rare&quot; words that might show up as unknown words differ from the characteristics of words in general,</Paragraph>
      <Paragraph position="1"> so a two-step procedure was employed a first time to obtain a set of &quot;rare&quot; words as training data, and again a second time to obtain a separate set of &quot;rare&quot; words as evaluation data. There were 17,000 words in the training data, and 21,000 words in the evaluation data. Ambiguity resolution accuracy was evaluated for the &quot;overall accuracy&quot; (the percentage of words for which the most likely POS tag is correct), and &quot;cutoff factor accuracy&quot; (accuracy of the answer set consisting of all POS tags whose probability lies within a factor F of the most likely POS (de Marcken, 1990)).</Paragraph>
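      <Paragraph position="2"> The two accuracy measures can be sketched in Python as follows. The sketch assumes one particular reading of the cutoff factor, namely that a tag belongs to the answer set if its probability is at least F times the probability of the most likely tag; the function names and the example distributions are invented.

```python
def answer_set(distribution, factor):
    """Tags whose probability is at least `factor` times that of the most likely tag."""
    best = max(distribution.values())
    return {tag for tag, p in distribution.items() if p >= factor * best}

def evaluate(predictions, gold_tags, factor=0.4):
    """Return (overall accuracy, cutoff factor accuracy), both in percent."""
    top_hits = 0
    set_hits = 0
    for dist, gold in zip(predictions, gold_tags):
        top_hits += max(dist, key=dist.get) == gold
        set_hits += gold in answer_set(dist, factor)
    n = len(gold_tags)
    return 100.0 * top_hits / n, 100.0 * set_hits / n

# Hypothetical predicted distributions for two unknown words
predictions = [
    {"NN": 0.5, "JJ": 0.35, "VB": 0.15},
    {"NN": 0.6, "VB": 0.3, "JJ": 0.1},
]
gold_tags = ["JJ", "NN"]
overall, cutoff = evaluate(predictions, gold_tags)
```

In this example the first word is counted as an error for overall accuracy, because JJ is not the most likely tag, but as correct for cutoff factor accuracy, because JJ lies within the answer set.</Paragraph>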
    </Section>
    <Section position="3" start_page="183" end_page="184" type="sub_section">
      <SectionTitle>
3.3 Accuracy Results
</SectionTitle>
      <Paragraph position="0"> (Weischedel et al., 1993) describe a model for unknown words that uses four features, but treats the features as independent. We reimplemented this model by using four features: POS, INFLECTION, CAPITALIZED, and HYPHENATED. In Figures 1-2, the results for this model are labeled 4 Independent Features. For comparison, we created a log-linear model with the same four features; the results for this model are labeled 4 Loglinear Features.</Paragraph>
      <Paragraph position="1"> The highest accuracy was obtained by the log-linear model that includes all two-way interactions and consists of two contingency tables with the following features: POS, ALL-UPPER-CASE,</Paragraph>
      <Paragraph position="2"> HYPHENATED, INCLUDES-NUMBER, CAPITALIZED, INFLECTION, SHORT, PREFIX, and SUFFIX. The results for this model are labeled 9 Loglinear Features. The parameters for all three unknown word models were estimated from the training data, and the models were evaluated on the evaluation data.</Paragraph>
      <Paragraph position="3"> The accuracy of the different models in assigning the most likely POSs to words is summarized in Figure 1. In the left diagram, the two barcharts show two different accuracy measures: percent correct (Overall Accuracy), and percent correct within the F=0.4 cutoff factor answer set (F=0.4 Set Accuracy). In both cases, the loglinear model with four features obtains higher accuracy than the method that assumes independence between the same four features. The loglinear model with nine features further improves this score.</Paragraph>
    </Section>
    <Section position="4" start_page="184" end_page="184" type="sub_section">
      <SectionTitle>
3.4 Effect of Number of Features on
Accuracy
</SectionTitle>
      <Paragraph position="0"> The performance of the loglinear model can be improved by adding more features, but this is not possible with the simpler model that assumes independence between the features. Figure 2 shows the performance of the two types of models with feature sets that ranged from a single feature to nine features.</Paragraph>
      <Paragraph position="1"> As the diagram shows, the accuracies for both methods rise with the first few features, but then the two methods show a clear divergence. The accuracy of the simpler method levels off at around 50-55%, while the loglinear model reaches an accuracy of 70-75%. This shows that the loglinear model is able to tolerate redundant features and use information from more features than the simpler method, and therefore achieves better results at ambiguity resolution.</Paragraph>
    </Section>
    <Section position="5" start_page="184" end_page="185" type="sub_section">
      <SectionTitle>
3.5 Adding Context to the Model
</SectionTitle>
      <Paragraph position="0"> Next, we added a stochastic POS tagger (Charniak et al., 1993) to provide a model of context. A stochastic POS tagger assigns POS labels to words in a sentence by using two parameters:  the probability of observing tag $t_i$ given that the two previous tags $t_{i-1}, t_{i-2}$ occurred.</Paragraph>
      <Paragraph position="1"> The tagger maximizes the probability of the tag sequence $T = t_1, t_2, \ldots, t_n$ given the word sequence $W = w_1, w_2, \ldots, w_n$, which is approximated as follows:</Paragraph>
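      <Paragraph position="2"> A minimal Python sketch of how a tag sequence can be scored under such an approximation is given here. It assumes the usual trigram formulation, in which the score of a sequence is the product over all positions of a lexical probability $P(w_i \mid t_i)$, which we take to be the other parameter, and the contextual probability $P(t_i \mid t_{i-1}, t_{i-2})$. The probability tables, the START padding symbol and the function name are invented for illustration.

```python
import math

def sequence_log_prob(words, tags, lexical, contextual, start="START"):
    """Sum over i of log P(w_i | t_i) + log P(t_i | t_{i-1}, t_{i-2})."""
    history = (start, start)  # (t_{i-2}, t_{i-1}), padded at the sentence start
    total = 0.0
    for word, tag in zip(words, tags):
        total += math.log(lexical[(word, tag)])
        total += math.log(contextual[(history[0], history[1], tag)])
        history = (history[1], tag)
    return total

# Hypothetical probability tables
lexical = {("the", "DT"): 0.6, ("can", "NN"): 0.01, ("can", "MD"): 0.3}
contextual = {
    ("START", "START", "DT"): 0.4,
    ("START", "DT", "NN"): 0.5,
    ("START", "DT", "MD"): 0.01,
}
words = ["the", "can"]
noun_reading = sequence_log_prob(words, ["DT", "NN"], lexical, contextual)
modal_reading = sequence_log_prob(words, ["DT", "MD"], lexical, contextual)
```

With these invented numbers the noun reading receives the higher score, because the contextual probability outweighs the lexical preference of the word for the modal tag. A tagger would search over all tag sequences for the one with the highest score instead of comparing two candidates by hand.</Paragraph>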
      <Paragraph position="3"> The accuracy of the combination of the loglinear model for local features and the stochastic POS tagger for contextual features was evaluated empirically by comparing three methods of handling unknown words:
 * Unigram: Using the prior probability distribution $P(t)$ of the POS tags for rare words.
 * Probabilistic UWM: Using the probabilistic model that assumes independence between the features.</Paragraph>
      <Paragraph position="4"> * Classifier UWM: Using the loglinear model for unknown words.</Paragraph>
      <Paragraph position="5"> Separate sets of training and evaluation data for the tagger were obtained from the Penn Treebank Wall Street corpus. Evaluation of the combined system was performed on different configurations of the POS tagger on 30-40 different samples containing 4,000 words each.</Paragraph>
      <Paragraph position="6"> Since the tagger displays considerable variance in its accuracy in assigning POS to unknown words in context, we use boxplots to display the results. Figure 3 compares the tagging error rate on unknown words for the unigram method (left) and the log-linear method with nine features (right, labeled statistical classifier). This shows that the loglinear model significantly improves the Part-of-Speech tagging accuracy of a stochastic tagger on unknown words. The median error rate is lowered considerably, and samples with error rates over 32% are eliminated entirely.</Paragraph>
      <Paragraph position="8"/>
    </Section>
    <Section position="6" start_page="185" end_page="185" type="sub_section">
      <SectionTitle>
3.6 Effect of Proportion of Unknown
Words
</SectionTitle>
      <Paragraph position="0"> Since most of the lexical ambiguity resolution power of stochastic POS tagging comes from the lexical probabilities, unknown words represent a significant source of error. Therefore, we investigated the effect of different types of models for unknown words on the error rate for tagging text with different proportions of unknown words.</Paragraph>
      <Paragraph position="1"> Samples of text that contained different proportions of unknown words were tagged using the three different methods for handling unknown words described above. The overall tagging error rate increases significantly as the proportion of new words increases. Figure 4 shows a graph of overall tagging accuracy versus percentage of unknown words in the text. The graph compares the three different methods of handling unknown words. The diagram shows that the loglinear model leads to better overall tagging performance than the simpler methods, with a clear separation of all samples whose proportion of new words is above approximately 10%.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="185" end_page="187" type="metho">
    <SectionTitle>
4 Predicting PP Attachment
</SectionTitle>
    <Paragraph position="0"> In the second series of experiments, we compare the performance of different statistical models on the task of predicting Prepositional Phrase (PP) attachment.</Paragraph>
    <Section position="1" start_page="185" end_page="185" type="sub_section">
      <SectionTitle>
4.1 Features for PP Attachment
</SectionTitle>
      <Paragraph position="0"> First, an initial set of linguistic features that could be useful for predicting PP attachment was determined. The initial set included the following features:
 * PREPOSITION. Possible values of this feature include one of the more frequent prepositions in the training set, or the value other-prep.</Paragraph>
      <Paragraph position="1"> * VERB-LEVEL. Lexical association strength between the verb and the preposition.</Paragraph>
      <Paragraph position="2"> * NOUN-LEVEL. Lexical association strength between the noun and the preposition.</Paragraph>
      <Paragraph position="3"> * NOUN-TAG. Part-of-Speech of the nominal attachment site. This is included to account for correlations between attachment and syntactic category of the nominal attachment site, such as &quot;PPs disfavor attachment to proper nouns.&quot;
 * NOUN-DEFINITENESS. Does the nominal attachment site include a definite determiner? This feature is included to account for a possible correlation between PP attachment to the nominal site and definiteness, which was derived by (Hirst, 1986) from the principle of presupposition minimization of (Crain and Steedman, 1985).</Paragraph>
      <Paragraph position="4"> * PP-OBJECT-TAG. Part-of-speech of the object of the PP. Certain types of PP objects favor attachment to the verbal or nominal site. For example, temporal PPs, such as &quot;in 1959&quot;, where the prepositional object is tagged CD (cardinal), favor attachment to the VP, because the VP is more likely to have a temporal dimension.
 The association strengths for VERB-LEVEL and NOUN-LEVEL were measured using the Mutual Information between the noun or verb, and the preposition (footnote 1). The probabilities were derived as Maximum Likelihood estimates from all PP cases in the training data. The Mutual Information values were ordered by rank. Then, the association strengths were categorized into eight levels (A-H), depending on percentile in the ranked Mutual Information values.</Paragraph>
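      <Paragraph position="5"> The derivation of the association levels can be sketched in Python as follows. Several details are assumptions made for illustration: Mutual Information is computed as the base 2 logarithm of the ratio between the joint probability and the product of the two marginal probabilities, the eight levels are taken to be equally sized slices of the ranking with A as the strongest level, and the example pairs and function names are invented.

```python
import math
from collections import Counter

LEVELS = "ABCDEFGH"

def mutual_information(pairs):
    """log2 of P(head, prep) / (P(head) P(prep)), using Maximum Likelihood estimates."""
    pair_counts = Counter(pairs)
    head_counts = Counter(head for head, _ in pairs)
    prep_counts = Counter(prep for _, prep in pairs)
    n = len(pairs)
    return {
        (head, prep): math.log2(
            (c / n) / ((head_counts[head] / n) * (prep_counts[prep] / n))
        )
        for (head, prep), c in pair_counts.items()
    }

def association_levels(mi):
    """Rank the pairs by Mutual Information and cut the ranking into eight levels."""
    ranked = sorted(mi, key=mi.get, reverse=True)
    return {
        pair: LEVELS[rank * len(LEVELS) // len(ranked)]
        for rank, pair in enumerate(ranked)
    }

# Hypothetical (head word, preposition) pairs extracted from PP cases
pairs = (
    [("rise", "by")] * 4 + [("rise", "of")]
    + [("sale", "of")] * 4 + [("sale", "by")]
)
mi = mutual_information(pairs)
levels = association_levels(mi)
```

Replacing each noun or verb by the level of its association with the preposition keeps the number of cells in the contingency table small, since each of the two features has only eight possible values.</Paragraph>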
    </Section>
    <Section position="2" start_page="185" end_page="186" type="sub_section">
      <SectionTitle>
4.2 Experimental Data and Evaluation
</SectionTitle>
      <Paragraph position="0"> Training and evaluation data was prepared from the Penn Treebank. All 1.1 million words of parsed text in the Brown Corpus, and 2.6 million words of parsed WSJ articles, were used. All instances of PPs that are attached to VPs and NPs were extracted. This resulted in 82,000 PP cases from the Brown Corpus, and 89,000 PP cases from the WSJ articles. Verbs and nouns were lemmatized to their root forms if the root forms were attested in the corpus. If the root form did not occur in the corpus, then the inflected form was used.</Paragraph>
      <Paragraph position="1"> All the PP cases from the Brown Corpus, and 50,000 of the WSJ cases, were reserved as training data. The remaining 39,000 WSJ PP cases formed the evaluation pool. In each experiment, performance was evaluated on a series of 25 random samples of 100 PP cases from the evaluation pool, in order to provide a characterization of the error variance.</Paragraph>
      <Paragraph position="2"> Footnote 1: Mutual Information provides an estimate of the magnitude of the ratio between the joint probability P(verb/noun, preposition), and the joint probability assuming independence P(verb/noun)P(preposition); see (Church and Hanks, 1990).</Paragraph>
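      <Paragraph position="3"> The evaluation on 25 random samples of 100 PP cases can be sketched in Python as follows. The pool, the baseline predictor and the fixed random seed are invented for illustration, and the sketch assumes that cases are drawn without replacement within a sample while different samples are drawn independently of each other.

```python
import random
import statistics

def sampled_accuracies(cases, predict, n_samples=25, sample_size=100, seed=0):
    """Accuracy in percent on each of n_samples random samples from the pool."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_samples):
        sample = rng.sample(cases, sample_size)
        correct = sum(predict(features) == gold for features, gold in sample)
        accuracies.append(100.0 * correct / sample_size)
    return accuracies

def always_noun(features):
    """Trivial baseline that always predicts attachment to the noun."""
    return "noun"

# Hypothetical evaluation pool of (features, gold attachment) pairs
pool = [({"prep": "of"}, "noun")] * 650 + [({"prep": "to"}, "verb")] * 350
scores = sampled_accuracies(pool, always_noun)
median_accuracy = statistics.median(scores)
quartiles = statistics.quantiles(scores, n=4)
```

The median and the quartiles of the per-sample accuracies are the quantities that a boxplot displays, which is how the spread of the results is characterized.</Paragraph>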
    </Section>
    <Section position="3" start_page="186" end_page="186" type="sub_section">
      <SectionTitle>
4.3 Experimental Results: Two
Attachment Sites
</SectionTitle>
      <Paragraph position="0"> Previous work on automatic PP attachment disambiguation has only considered the pattern of a verb phrase containing an object, and a final PP. This leads to two possible attachment sites, the verb and the object of the verb. The pattern is usually further simplified by considering only the heads of the possible attachment sites, corresponding to the sequence &quot;Verb Noun1 Preposition Noun2&quot;.</Paragraph>
      <Paragraph position="1"> The first set of experiments concerns this pattern.</Paragraph>
      <Paragraph position="2"> There are 53,000 such cases in the training data, and 16,000 such cases in the evaluation pool. A number of methods were evaluated on this pattern according to the 25-sample scheme described above. The results are shown in Figure 5.</Paragraph>
      <Paragraph position="3"> Prepositional phrases exhibit a tendency to attach to the most recent possible attachment site; this is referred to as the principle of &quot;Right Association&quot;. For the &quot;V NP PP&quot; pattern, this means preferring attachment to the noun phrase. On the evaluation samples, a median of 65% of the PP cases were attached to the noun.</Paragraph>
      <Paragraph position="4"> (Hindle and Rooth, 1993) described a method for obtaining estimates of lexical association strengths between nouns or verbs and prepositions, and then using lexical association strength to predict PP attachment. In our reimplementation of this method, the probabilities were estimated from all the PP cases in the training set. Since our training data are bracketed, it was possible to estimate the lexical associations with much less noise than Hindle &amp; Rooth, who were working with unparsed text. The median accuracy for our reimplementation of Hindle &amp; Rooth's method was 81%. This is labeled &quot;Hindle &amp; Rooth&quot; in Figure 5.</Paragraph>
      <Paragraph position="5"> The loglinear model for this task used the features PREPOSITION, VERB-LEVEL, NOUN-LEVEL, and NOUN-DEFINITENESS, and it included all second-order interaction terms. This model achieved a median accuracy of 82%.</Paragraph>
      <Paragraph position="6"> Hindle &amp; Rooth's lexical association strategy only uses one feature (lexical association) to predict PP attachment, but, as the boxplot shows, the results from the loglinear model for the &quot;V NP PP&quot; pattern do not show any significant improvement.</Paragraph>
    </Section>
    <Section position="4" start_page="186" end_page="187" type="sub_section">
      <SectionTitle>
4.4 Experimental Results: Three
Attachment Sites
</SectionTitle>
      <Paragraph position="0"> As suggested by (Gibson and Pearlmutter, 1994), PP attachment for the &quot;Verb NP PP&quot; pattern is relatively easy to predict because the two possible attachment sites differ in syntactic category, and therefore have very different kinds of lexical preferences. For example, most PPs with of attach to nouns, and most PPs with to and by attach to verbs.</Paragraph>
      <Paragraph position="1"> In actual texts, there are often more than two possible attachment sites for a PP. Thus, a second, more realistic series of experiments was performed that investigated different PP attachment strategies for the pattern &quot;Verb Noun1 Noun2 Preposition Noun3&quot; that includes more than two possible attachment sites that are not syntactically heterogeneous. There were 28,000 such cases in the training data, and 8,000 cases in the evaluation pool.</Paragraph>
      <Paragraph position="2"> As in the first set of experiments, a number of methods were evaluated on the three attachment site pattern with 25 samples of 100 random PP cases.</Paragraph>
      <Paragraph position="3"> The results are shown in Figures 6-7. The baseline is again provided by attachment according to the principle of &quot;Right Attachment&quot; to the most recent possible site, i.e. attachment to Noun2. A median of 69% of the PP cases were attached to Noun2.</Paragraph>
      <Paragraph position="4"> Next, the lexical association method was evaluated on this pattern. First, the method described by Hindle &amp; Rooth was reimplemented by using the lexical association strengths estimated from all PP cases. The results for this strategy are labeled &quot;Basic Lexical Association&quot; in Figure 6. This method only achieved a median accuracy of 59%, which is worse than always choosing the rightmost attachment site.</Paragraph>
      <Paragraph position="5"> These results suggest that Hindle &amp; Rooth's scoring function worked well in the &quot;Verb Noun1 Preposition Noun2&quot; case not only because it was an accurate estimator of lexical associations between individual verbs/nouns and prepositions which determine PP attachment, but also because it accurately predicted the general verb-noun skew of prepositions.</Paragraph>
      <Paragraph position="6"> It seems natural that this pattern calls for a combination of a structural feature with lexical association strength. To implement this, we modified Hindle &amp; Rooth's method to estimate attachments to the verb, first noun, and second noun separately.</Paragraph>
      <Paragraph position="7"> This resulted in estimates that combine the structural feature directly with the lexical association strength. The modified method performed better than the original lexical association scoring function, but it still only obtained a median accuracy of 72%.</Paragraph>
      <Paragraph position="8"> This is labeled &quot;Split Hindle &amp; Rooth&quot; in Figure 7. To create a model that combines various structural and lexical features without independence assumptions, we implemented a loglinear model that includes the variables VERB-LEVEL, FIRST-NOUN-LEVEL, and SECOND-NOUN-LEVEL. The loglinear model also includes the variables PREPOSITION and PP-OBJECT-TAG. It was smoothed with a loglinear model that includes all second-order interactions.</Paragraph>
      <Paragraph position="9"> This method obtained a median accuracy of 79%; this is labeled &quot;Loglinear Model&quot; in Figure 7. As the boxplot shows, it performs significantly better than the methods that only use estimates of lexical association. Compared with the &quot;Split Hindle &amp; Rooth&quot; method, the samples are a little less spread out, and there is no overlap at all between the central 50% of the samples from the two methods.</Paragraph>
    </Section>
    <Section position="5" start_page="187" end_page="187" type="sub_section">
      <SectionTitle>
4.5 Discussion
</SectionTitle>
      <Paragraph position="0"> The simpler &quot;V NP PP&quot; pattern with two syntactically different attachment sites yielded a null result: the loglinear method did not perform significantly better than the lexical association method. This could mean that the results of the lexical association method cannot be improved by adding other features, but it is also possible that the features that could result in improved accuracy were not identified. The lexical association strategy does not perform well on the more difficult pattern with three possible attachment sites. The loglinear model, on the other hand, predicts attachment with significantly higher accuracy, achieving a clear separation of the central 50% of the evaluation samples.</Paragraph>
    </Section>
  </Section>
</Paper>