<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1005">
  <Title>A Statistical Decision Making Method: A Case Study on Prepositional Phrase Attachment*</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 PPA Problem
</SectionTitle>
    <Paragraph position="0"> Resolving the PPA problem is a common problem in any NLP system that deals with syntactic parsing or text understanding. The Naive Bayes classifier and leading machine learning systems, such as C4.5 (Quinlan, 1993), CN2 (Clark and Niblett, 1989) and PEBLS (Cost and Sahberg, 1993), fail to provide prediction with competitive accuracy rates on this problem (see Table 4 on page 40). A sentence can be so ambiguous that it may not be possible to determine the correct attachment without extra contextual information. (Ratnaparkhi et al., 1994) reported that human experts could reach an accuracy of 93%, if cases were given as whole sentences out of context.</Paragraph>
    <Paragraph position="1"> The PPA problem is illustrated by the following example: I described the problem on the paper. (1) This is an ambiguous sentence, which can be interpreted two different ways, depending on the site of PPA. The prepositional phrase (PP) in the above sentence is &amp;quot;on the paper.&amp;quot; If it is attached to Kayaalp, Pedersen ~ Bruce 33 Statistical PP Attachment Mehmet Kayaalp, Ted Pedersen and Rebecca Bruce (1997) A Statistical Decision Making Method: A Case Study on Prepositional Phrase Attachment. In T.M. Ellison (ed.) CoNLL97: Computational Natural Language Learning, ACL pp 33-42.</Paragraph>
    <Paragraph position="2"> (~) 1997 Association for Computational Linguistics the (object) noun &amp;quot;problem,&amp;quot; then the interpretation would be equal to (2); on the other hand, if it is attached to the verb &amp;quot;describe,&amp;quot; then it would be interpreted as (3).</Paragraph>
    <Paragraph position="3"> I described the problem that was on the paper(2) On the paper, I described the problem. (3) In this paper, we address only the type of PPA problem illustrated above and don't consider other less frequent PPA problems. For the linguistic details of the problem, the reader can refer to (Hirst, 1987).</Paragraph>
    <Paragraph position="4"> We use the PPA data created by (Brill and Resnik, 1994) and (Ratnaparkhi et al., 1994) to objectively compare the performances of the systems. Both data were extracted from the Penn Treebank Wall Street Journal (WSJ) Corpus (Marcus et al., 1993). In order to distinguish these data from each other, we call the former one BSzR data and the latter one IBM data. Both PPA data were formatted in tuples with five variables (4), which denote the class (i.e., the PPA attachment site) and the features (i.e., verb, object noun, preposition and PP noun) in the respective order* Values of these variables for the above example (1) are illustrated in (5), where (A, B, C, D, E) (4) (verb lnoun , &amp;quot;describe&amp;quot;, &amp;quot;problem&amp;quot;, &amp;quot;on&amp;quot;, &amp;quot;paper&amp;quot;X5) For representation convenience, we can map the values of these variables to positive integers as in Table 1. Then, the examples, (2) and (3) can be con- null ated integer labels at the Levels column. The number of levels of five variables are 2, 3845, 5162, 81 and 6625.</Paragraph>
    <Paragraph position="5"> verted to tuples (6) and (7), respectively.</Paragraph>
    <Paragraph position="7"> Using this convention, the PPA data can be represented in a contingency table (Table 2) with five dimensions, where each dimension is dedicated to a variable. The size of a contingency table is determined by the cardinality of values (a.k.a. levels) of these variables (8); for the IBM data, there are 2.13 x 1013 cells in the table (9). Each cell in the table corresponds to a unique combination of the variable values and all combinations are represented in the table.</Paragraph>
    <Paragraph position="9"> cell contains frequency with which the corresponding 5-tuple (i.e., a unique PPA instance) occurs in the data.</Paragraph>
    <Paragraph position="11"> Considering that there are 27,937 PPA observations in the training and test data together, a search space of more than 21 trillion possible distinct cases (represented in the cells of contingency table) indicates that the data is extremely sparse.</Paragraph>
    <Paragraph position="12"> To solve PPA problem, NLP researchers designed domain specific classifier systems. Those systems can be categorized in two classes:  1. Rule based systems (Boggess et al., 1991), (Brill and Resnik, 1994) 2. Statistical and information theoretic approaches (Hindle and Rooth, 1993), (Ratna null parkhi et al., 1994),(Collins and Brooks, 1995), (Franz, 1996) Using lexical collocations to determine PPA with statistical techniques was first proposed by (Hindle and Rooth, 1993). They suggested a score called Lexical Association to predict PPA. It is a log likelihood ratio of probability estimates of two PPA sites. The probability of attachment was based on the frequencies of the 2-tuples (B, D), and (C, D), where B, C, D stand for the variables: verb, object noun</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Statistical PP Attachment
</SectionTitle>
      <Paragraph position="0"> and preposition, respectively. While (Hindle and Rooth, 1993) stated that this approach was not successful in estimating PPA using small 2-tuple frequencies, which comprised a major portion of the PPA data, the accuracy reported was 79.7%, which is a substantial improvement over the lower bound of 65% (10): tions used in the function). If all fail, the assignment is noun attachment, since 52% of the time the attachment site on the training data was noun.</Paragraph>
      <Paragraph position="2"> The lower bound for the B&amp;R data is 63% (Brill and Resnik, 1994) and for the IBM data is 52% (Ratnaparkhi et al., 1994).</Paragraph>
      <Paragraph position="3"> (Ratnaparkhi et al., 1994) was the first to considered the full four feature set defined in (4). The approach made use of a maximum entropy model (Berger et al., 1996) formulated from frequency information for various combinations of the observed features. The combinations that reduced the entropy most, were chosen. The accuracy of PPA classification using this approach was 77.7% on the IBM data. (For performance comparison of various approaches on available data, please refer to Table 4 on page 40.) (Brill and Resnik, 1994) suggested a rule based approach where the antecedent of each rule specifies values for the feature variables in (4). A typical rule might be as follows:</Paragraph>
      <Paragraph position="5"> 471 such inference rules are found useful and ordered to reduce the error-rate to a minimum. They reported an accuracy of 80.8% on the data that we also use. They also duplicated the experiment of (Hindle and Rooth, 1993), which scored around 5% less than the rule-based approach.</Paragraph>
      <Paragraph position="6"> (Collins and Brooks, 1995) proposed a specific heuristic computation to predict PPAs. The idea originated from the back-off model (Katz, 1987). If the combination of feature values observed for a test instance is also observed in the training set, then that test instance is classified with the most frequent PPA site for those feature values in the training set. Otherwise, probability estimates for the two PPA sites are obtained from functions(12)-(14), via a process similar to model switching. If the highest complexity formulation, (12), cannot be used to classify a test instance (i.e., the required feature value combinations are not observed in the training data), then the decision process is switched to the next function, where functions are ranked based on complexity (i.e., the arity of the frequency distribu-</Paragraph>
      <Paragraph position="8"> If a higher order function cannot classify a test instance, then the decision process is switched to the next function. If all fail, the guess is the noun attachment, since 52% of the time the attachment site on the training data was noun.</Paragraph>
      <Paragraph position="9"> While the probability estimates in (14) are maximum likelihood estimates (MLEs), the estimates in  (12) and (13) are heuristic formulations (i.e., not MLEs). The rationale behind these formulae are: 1. a decision made by utilizing more feature variables should be favorable over the others, 2. the preposition feature D is essential; thus, it is better to keep it in all n-grams of the decision  functions.</Paragraph>
      <Paragraph position="10"> They used IBM data, which we also use, and reported an accuracy of 84.1%.</Paragraph>
      <Paragraph position="11"> (Franz, 1996) proposed a new feature set, which provided a more compact representation of the PPA data. Using a hierarchical log-linear model containing only second order interactions, he achieved a classification performance comparable to that of (Hindle and Rooth, 1993). He also designed another experiment with a less common PPA problem with three attachment sites.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Decomposable Models
</SectionTitle>
    <Paragraph position="0"> In this paper, PPA is cast as a problem in supervised learning, where a probabilistic classifier is induced from tagged training data in the form of 5-tuples (6) and (7). The task is to predict the value of the tag A given the values of the feature variables B through E.</Paragraph>
    <Paragraph position="1"> Probabilistic models (e.g., decomposable models) specify joint distribution functions that assign probability values to every unique combination of the model variables, where the sum of those values is equal to 1. We adopt a Maximum Likelihood Estimation (MLE) approach. Given a decomposable model, MLE yields the most probable tag to each Kayaalp, Pedersen ~ Bruce 35 Statistical PP Attachment test data instance represented by a 4-tuple of feature values.</Paragraph>
    <Paragraph position="2"> Decomposable models belong to the class of graphical models, 1 where variables are either interdependent or conditionally independent of one another. 2 All graphical models have a graphical representation such that each variable in the model is mapped to a vertex in the graph, and there is an undirected edge between each pair of vertices corresponding to a pair of interdependent variables. While edges represent interactions between pairs of variables, i.e., second order interactions, cliques 3 with n vertices represent n th order interactions. Any two vertices that are not directly connected by an edge are conditionally independent given the values of the vertices on the path that connects them.</Paragraph>
    <Paragraph position="3"> Decomposable models are graphical models that are isomorphic to chordal graphs. In chordal graphs, there is no cycle of four or more without a chord, where a cord is an edge joining two non-consecutive vertices on the cycle. The elementary components of a chordal graph are its cliques; therefore, a chordal graph can be represented as a set of its cliques.</Paragraph>
    <Paragraph position="4"> The chordM graph in Figure 1 represents a decom- null ABD.ABE.ACE. Edges of the separators, AB and AE (corresponding to ABD N ABE and ABE N ACE), are drawn thicker. A separator is a set of vertices whose removal disconnects the graph.</Paragraph>
    <Paragraph position="5"> posable model, which we can mnemonically denote as (15).</Paragraph>
    <Paragraph position="7"> In this model, variables A, B, and D are stochastically dependent since they form a clique. Similar statements can be made for the other cliques in the model. The interactions between AB and AE,</Paragraph>
    <Paragraph position="9"> tex pair is connected with an edge.</Paragraph>
    <Paragraph position="10"> denoted by the corresponding edges AB, AE are observed in two out of the three cliques which indicates their relative importance in describing this distribution. The variable A is observed in all three cliques of the model because we consider only those cliques that contain the class variable A in defining the model. There are three edges missing, BC, CD, and DE, which distinguish this model from the saturated model ABCDE. These missing edges denote three conditional independence relations:  1. The variables D and E are conditionally independent given AB (intersection of two cliques, ABD N ABE).</Paragraph>
    <Paragraph position="11"> 2. The variables B and C are conditionally independent given AE (ABE n ACE).</Paragraph>
    <Paragraph position="12"> 3. The variables C and D are conditionally independent given A (ABD N ACE).</Paragraph>
    <Paragraph position="13">  This approach to classifying PPA is the first to make use of conditional independence in modeling the distribution of feature variables.</Paragraph>
    <Paragraph position="14"> A well known example of a decomposable model is the Naive Bayes model in which all feature variables are conditionally independent given the value of classification variable. For the PPA problem, the Naive Bayes model is AB.AC.AD.AE.</Paragraph>
    <Paragraph position="15"> Decomposable models are important because they are those graphical models that express the joint probability distributions of the variables in terms of the product of their marginal distributions, where each factor of the product corresponds to a clique or a separator in the graphical representation of the model. Because the joint distribution functions of decomposable models have such closed-form expressions, the parameters as Maximum Likelihood Estimates (MLEs) can be calculated directly from the training data without the need for an iterative fitting procedure; hence, those MLEs are also called direct estimates (Bishop et al., i975).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Maximum Likelihood Estimation
</SectionTitle>
      <Paragraph position="0"> Let the PPA variables, IAI = I, IBI = J,..., IEI = M resulting in an I x J x K x L x M contingency table (e.g., Table 2). Let the count in each cell (i.e., the frequency with which the corresponding 5-tuple is observed in the training data) be denotes as nijktrn.</Paragraph>
      <Paragraph position="1"> When all variables are considered to be interdependent (i.e., the saturated decomposable model) the maximum likelihood estimate of the probability of any 5-tuple is equal to the count in the corresponding cell noklm divided by the total count N, which is equal to 24,840 for the IBM training data (Table 2).</Paragraph>
      <Paragraph position="2">  Estimates of the marginal probability distributions can be calculated in a similar fashion. If we are interested in the probability of observing a verb attachment when &amp;quot;describe&amp;quot; is the noun, and &amp;quot;on&amp;quot; is the preposition (i.e., A = 1,B = 1, D = 1), regardless of the values of the other variables, it can be calculated as in (17) and (18).</Paragraph>
      <Paragraph position="4"> Let c denote the specific cell coordinates (e.g., 11111 in (16)), and let the model .A4 = {C: U C2 U * .. CO }, where Cd denotes a clique in the graph representation of A//, then the direct estimates (MLEs) are computed as in (19).</Paragraph>
      <Paragraph position="6"> where the factors in the numerator are the marginal probabilities for c in the cliques {Cd}, whose union represents the model. The intersections of cliques {Cd} yield separators {Sa} and the marginal probabilities for c in {Sd} are factors in the denominator (Lauritzen, 1996). For the saturated model,</Paragraph>
      <Paragraph position="8"> MLEs of the model (15) can be computed as in (22), and using this model, MLEs of the examples (2) and (3) can be calculated as in (23) and (24), respectively.</Paragraph>
      <Paragraph position="10"> As seen in this example, decomposable models provide us not only a very powerful representation medium but also computational efficiency in estimating parameters.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Model Switching
</SectionTitle>
    <Paragraph position="0"> Let E1 and E2 be equal to MLEs in (23) and (24).</Paragraph>
    <Paragraph position="1"> There are four cases in determining the class based on these equations.</Paragraph>
    <Paragraph position="3"> In cases (25) and (26), there is no classification and no recall for this test instance with this model. In (27) and (28), the classifications are noun and verb attachments, respectively.</Paragraph>
    <Paragraph position="4"> For the PPA data with five variables, there are only 110 decomposable models, corresponding to all chordal graphs of order five or less, where every clique of the order two and higher contains the vertex that represents the class variable. Since this number is not large, we considered all of these models for classification. 4 Let all test instances be composed in the set T and let</Paragraph>
    <Paragraph position="6"> where 7~ is a set of test instances that can be classified with model A,4i for (1 &lt; i &lt; m = 110); i.e., the outcomes of C/n(AI7~, .h4~) is either in (27) or in (28).</Paragraph>
    <Paragraph position="7"> These estimates may not always be correct, unless the information in features are sufficient and the classification model is perfect; therefore, each set of estimates associated with ~ and A//i has a precision value:</Paragraph>
    <Paragraph position="9"> where 7~c and &amp;quot;/~w are sets of correctly and wrongly classified test instances in set 7~. If we have an ordered list of models (.A41, A42,..., .A4m) as a certificate, where precision(.MilT~ ) &gt; precision(A4~+: l'/~+l) (32) we could use the certificate to maximize the overall classification accuracy.</Paragraph>
    <Paragraph position="10"> Since the first model .N'/1 is associated with the highest precision value, the probability that a test instance is correctly classified with .A41 is higher than 4 For problems with larger variable set additional techniques (Edwards and HavrAnek, 1987) or (Madigan and Raftery, 1994) are necessary to reduce the model space. Kayaalp, Pedersen ~4 Bruce 37 Statistical PP Attachment that probability for any other model; therefore, .M1 should be used to classify all possible test instances.</Paragraph>
    <Paragraph position="12"> After ~ is classified, the process is repeated for the remaining test instances 7 -1 with M2 that is the most &amp;quot;precise&amp;quot; model remaining in the model set.</Paragraph>
    <Paragraph position="13"> This cycle can be generalized as 7-/- 1 = 7~ t3 7 -i, where &amp;quot;/~ N 7 --/= { }, and T = T O (34) and will be iterated k times, where T k = {}. The overall classification accuracy then be calculated as</Paragraph>
    <Paragraph position="15"> The question remains now, how we can find the list of models (.M1, .M2,..., .Mk,..., .Mm) ordered by precision. Since precision is a measure that can be acquired after classifying all test instances, how can we order models based on precision before testing? One approach is to use the error rates of the models acquired through cross-validation. The technique we use here is called leave-one-out cross-validation (Lachenbruch and Mickey, 1968). Let the training data set be TO, where every data instance Pi E T~, i = 1,2,...,r and r = I~ I. When amodelA//j is applied to a data instance Pi, in this technique, all training instances except Pi (i.e., TC/ - pi) are used to compute the direct estimate for pi. This process is repeated for every data instance (i.e., r times).</Paragraph>
    <Paragraph position="16"> This technique is applied to all training instances for every model. The precision score of each model is collected, and based on those scores, the models are ordered.</Paragraph>
    <Paragraph position="17"> If k (the number of models used to classify all PPA instances) is small, then it is expected that after each iteration the test instances remaining to be classified would be decreased significantly; hence, the characteristics of T i-1 and T i might differ substantially and ordering the remaining models based on T i, rather than &amp;quot;T 0, might increase the overall accuracy.</Paragraph>
    <Paragraph position="18"> A second experiment is designed to apply this recursive strategy to order the models via the same cross-validation process. First, the most precise model for the entire (training) data is identified.</Paragraph>
    <Paragraph position="19"> Then, the data instances that are classified with the first model are excluded from the original data set, as in (33). Within the remaining data instances, all models in {.A42, M3,..., Adrn} are searched for the current most precise model. This model selection  cycle is iterated exhaustively (34) until all data instances are classified. The models selected for the IBM data are shown in Table 3.</Paragraph>
    <Paragraph position="20"> The MLE algorithm is a table look up, where each table contains marginal values for a clique of variables as defined in the graph representation. If those values could be stored in a memory array, the time complexity of MLE could be O(1); however, the number of values is huge, thus we have to store each set of clique marginals on disk, and currently the access to the data is through sequential file access with a time complexity O(n), where n is the number of training instances. MLEs need to be computed for m models and for n training instances. During each recursive step a considerable part of the training instances are classified (around 5%); thus we may represent the process as</Paragraph>
    <Paragraph position="22"> Therefore, the average time complexity for the current program is O(mn 2 log(mn2)), but through memoization, 5 the overhead of the recursion will be drastically reduced in newer versions of the program.</Paragraph>
    <Paragraph position="23"> 5A standard dynamic programming technique that stores computed information in a table, which is looked up when that information is needed next time.</Paragraph>
    <Paragraph position="24"> Kayaalp, Pedersen 8/Bruce 38 Statistical PP Attachment The software of MS1 is developed in Perl and is freely available for research purposes only. Interested parties may contact the first author.</Paragraph>
  </Section>
class="xml-element"></Paper>