<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1016"> <Title>Resolving PP attachment Ambiguities with Memory-Based Learning</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Memory-Based Learning </SectionTitle> <Paragraph position="0"> Classification-based machine learning algorithms can be applied to learning disambiguation problems by providing them with a set of examples derived from an annotated corpus. Each example consists of an input vector representing the context of an attachment ambiguity in terms of features (e.g. syntactic features, words, or lexical features in the case of PP-attachment), and an output class (one of a finite number of possible attachment positions, representing the correct attachment position for the input context). Machine learning algorithms extrapolate from the examples to new input cases, either by extracting regularities from the examples in the form of rules, decision trees, connection weights, or probabilities in greedy learning algorithms, or by a more direct use of analogy in lazy learning algorithms. It is the latter approach that we investigate in this paper. It is our experience that lazy learning (such as the Memory-Based Learning approach adopted here) is more effective for several language-processing problems (see Daelemans (1995) for an overview) than more eager learning approaches. Because language-processing tasks can typically only be described as a complex interaction of regularities, subregularities and (families of) exceptions, storing all empirical data as potentially useful in analogical extrapolation works better than extracting the main regularities and forgetting the individual examples (Daelemans, 1996).</Paragraph> <Paragraph position="1"> Analogy from Nearest Neighbors The techniques used are variants and extensions of the classic k-nearest neighbor (k-NN) classifier algorithm. The instances of a task are stored in a table, together with the associated &quot;correct&quot; output. When a new pattern is processed, the k nearest neighbors of the pattern are retrieved from memory using some similarity metric. The output is then determined by extrapolation from the k nearest neighbors. The most common extrapolation method is majority voting, which simply chooses the most common class among the k nearest neighbors as the output.</Paragraph> <Paragraph position="2"> Similarity metrics The most basic metric for patterns with symbolic features is the Overlap metric given in Equations 1 and 2, where $\Delta(X, Y)$ is the distance between patterns X and Y, represented by n features, $w_i$ is a weight for feature i, and $\delta$ is the distance per feature. The k-NN algorithm with this metric, and equal weighting for all features, is called IB1 (Aha, Kibler, and Albert, 1991). Usually k is set to 1.

$$\Delta(X, Y) = \sum_{i=1}^{n} w_i \, \delta(x_i, y_i) \quad (1) \qquad \delta(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases} \quad (2)$$</Paragraph> <Paragraph position="3"> This metric simply counts the number of (mis)matching feature values in both patterns. If no information about the importance of features is available, this is a reasonable choice. But if we have information about feature relevance, we can add linguistic bias to weight or select different features (Cardie, 1996). An alternative, more empiricist, approach is to look at the behavior of features in the set of examples used for training.</Paragraph> <Paragraph position="4"> We can compute statistics about the relevance of features by looking at which features are good predictors of the class labels.
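To make this concrete, here is a minimal sketch of a memory-based classifier using the Overlap metric of Equations 1 and 2 — our own illustration, not the authors' implementation; the optional weight vector anticipates the feature-weighting schemes discussed below, and the toy patterns are hypothetical:

```python
from collections import Counter

def overlap_distance(x, y, weights=None):
    # Equations 1 and 2: weighted count of mismatching feature values.
    if weights is None:
        weights = [1.0] * len(x)
    return sum(w * (xi != yi) for w, xi, yi in zip(weights, x, y))

def knn_classify(memory, query, k=1, weights=None):
    # Retrieve the k nearest stored examples and extrapolate by
    # majority voting over their classes (IB1 when all weights are equal).
    ranked = sorted(memory, key=lambda ex: overlap_distance(ex[0], query, weights))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical V, N1, P, N2 patterns with their attachment classes.
memory = [(("eats", "pizza", "with", "fork"), "V"),
          (("eats", "pizza", "with", "anchovies"), "N")]
print(knn_classify(memory, ("eats", "pasta", "with", "spoon"), k=1))
```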
Information Theory provides a useful tool for measuring feature relevance in this way; see Quinlan (1993).</Paragraph> <Paragraph position="5"> Information Gain (IG) weighting looks at each feature in isolation, and measures how much information it contributes to our knowledge of the correct class label. The Information Gain of feature f is measured by computing the difference in uncertainty (i.e. entropy) between the situations without and with knowledge of the value of that feature (Equations 3 and 4):</Paragraph> <Paragraph position="6"> $$IG(f) = \frac{H(C) - \sum_{v \in V_f} P(v)\, H(C|v)}{si(f)} \quad (3) \qquad si(f) = -\sum_{v \in V_f} P(v) \log_2 P(v) \quad (4)$$</Paragraph> <Paragraph position="7"> where C is the set of class labels, $V_f$ is the set of values for feature f, and $H(C) = -\sum_{c \in C} P(c) \log_2 P(c)$ is the entropy of the class labels. The probabilities are estimated from relative frequencies in the training set. The normalizing factor $si(f)$ (split info) is included to avoid a bias in favor of features with more values.</Paragraph> <Paragraph position="8"> It represents the amount of information needed to represent all values of the feature (Equation 4). The resulting IG values can then be used as weights in Equation 1. The k-NN algorithm with this metric is called IB1-IG; see Daelemans and van den Bosch (1992).</Paragraph> <Paragraph position="9"> The possibility of automatically determining the relevance of features implies that many different and possibly irrelevant features can be added to the feature set. This is a very convenient methodology if theory does not constrain the choice sufficiently beforehand, or if we wish to measure the importance of various information sources experimentally.</Paragraph> <Paragraph position="10"> MVDM and LexSpace Although IB1-IG solves the problem of feature relevance to a certain extent, it does not take into account that the symbols used as values in the input vector features (in this case words, syntactic categories, etc.) are not all equally similar to each other. According to the Overlap metric, the words Japan and China are as similar as Japan and pizza. We would like Japan and China to be more similar to each other than Japan and pizza. This linguistic knowledge could be encoded into the word representations by hand, e.g. by replacing words with semantic labels, but again we prefer a more empiricist approach, in which distances between values of the same feature are computed differentially on the basis of properties of the training set. To this end, we use the Modified Value Difference Metric (MVDM; see Cost and Salzberg (1993)), a variant of a metric first defined in Stanfill and Waltz (1986). This metric (Equation 5) computes the frequency distribution of each value of a feature over the categories. Depending on the similarity of their distributions, pairs of values are assigned a distance:</Paragraph> <Paragraph position="11"> $$\delta(V_1, V_2) = \sum_{i=1}^{n} \left| P(C_i|V_1) - P(C_i|V_2) \right| \quad (5)$$</Paragraph> <Paragraph position="12"> In this equation, $V_1$ and $V_2$ are two possible values for feature f; the distance is the sum over all n categories; and $P(C_i|V)$ is estimated by the relative frequency of the value V being classified as category i.</Paragraph> <Paragraph position="13"> In our PP-attachment problem, the effect of this metric is that words (as feature values) are grouped according to the category distribution of the patterns they belong to. It is possible to cluster the distributions of the values over the categories, and obtain classes of similar words in this fashion. For an example of this type of unsupervised learning as a side-effect of supervised learning, see Daelemans, Berck, and Gillis (1996).
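The estimation of the value difference matrix of Equation 5 from training material can be sketched as follows — our own illustration under stated assumptions; the helper names and toy values are hypothetical:

```python
from collections import defaultdict

def mvdm_table(values, labels):
    # Estimate P(C_i | v) for every value v of one feature from
    # (value, class) pairs, as required by Equation 5.
    counts = defaultdict(lambda: defaultdict(int))
    for v, c in zip(values, labels):
        counts[v][c] += 1
    classes = sorted(set(labels))
    return {v: [counts[v][c] / sum(counts[v].values()) for c in classes]
            for v in counts}

def mvdm_distance(table, v1, v2):
    # Equation 5: sum over categories of |P(C_i|v1) - P(C_i|v2)|.
    return sum(abs(a - b) for a, b in zip(table[v1], table[v2]))

# Hypothetical toy data: preposition values with attachment classes.
table = mvdm_table(["with", "with", "of", "of", "of"],
                   ["V", "N", "N", "N", "N"])
print(mvdm_distance(table, "with", "of"))
```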
In a sense, the MVDM can be interpreted as implicitly implementing a statistically induced, distributed, non-symbolic representation of the words. In this case, the category distribution for a specific word is its lexical representation. Note that the representation for each word is entirely dependent on its behavior with respect to a particular classification task.</Paragraph> <Paragraph position="14"> In many practical applications of MB-NLP, we are confronted with a very limited set of examples.</Paragraph> <Paragraph position="15"> This poses a serious problem for the MVD metric.</Paragraph> <Paragraph position="16"> Many values occur only once in the whole data set. This means that if two such values occur with the same class, the MVDM will regard them as identical, and if they occur with two different classes their distance will be maximal. In many cases, the latter condition reduces the MVDM to the Overlap metric, and additionally some cases will be counted as an exact match on the basis of very shaky evidence. It is, therefore, worthwhile to investigate whether the value difference matrix $\delta(V_1, V_2)$ can be reused from one task to another.</Paragraph> <Paragraph position="17"> This would make it possible to reliably estimate all the $\delta$ parameters on a task for which we have a large amount of training material, and to profit from their availability for the MVDM of a smaller domain.</Paragraph> <Paragraph position="18"> Such a possibility of reusing lexical similarity is found in the application of Lexical Space representations (Schütze, 1994; Zavrel and Veenstra, 1995). In LexSpace, each word is represented by a vector of real numbers that constitutes a &quot;fingerprint&quot; of the word's distributional behavior across local contexts in a large corpus. The distances between vectors can be taken as a measure of similarity. In Table 1, a number of examples of nearest neighbors are shown (the 10 nearest neighbors of the word in upper case, listed by ascending distance).</Paragraph> <Paragraph position="19"> For each focus word f, a score is kept of the number of co-occurrences of words from a fixed set of C context words $w_i$ ($1 \leq i \leq C$) in a large corpus.</Paragraph> <Paragraph position="20"> Previous work by Hughes (1994) indicates that the two neighbors on the left and on the right (i.e. the words in positions n-2, n-1, n+1, n+2, relative to word n) are a good choice of context. The position of a word in Lexical Space is thus given by a four-component vector, of which each component has as many dimensions as there are context words.</Paragraph> <Paragraph position="21"> The dimensions represent the conditional probabilities $P(w_i^{-2}|f), \ldots, P(w_i^{+2}|f)$.</Paragraph> <Paragraph position="22"> We derived the distributional vectors of all 71479 unique words present in the 3 million words of Wall Street Journal text taken from the ACL/DCI CD-ROM I (1991). For the contexts, i.e. the dimensions of Lexical Space, we took the 250 most frequent words.</Paragraph> <Paragraph position="23"> To reduce the 1000-dimensional Lexical Space vectors to a manageable format, we applied Principal Component Analysis (PCA; using the simplesvd package, which was kindly provided by Hinrich Schütze and can be obtained from ftp://csli.stanford.edu/pub/prosit/papers/simplesvd/) to reduce them to a much lower number of dimensions. PCA accomplishes a dimension reduction that preserves as much of the structure of the original data as possible.
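A simplified reconstruction of this vector-building procedure is sketched below, under our own assumptions; numpy's SVD stands in here for the simplesvd package used in the paper, and the function names are hypothetical:

```python
import numpy as np
from collections import Counter, defaultdict

def lexspace_vectors(tokens, n_context=250):
    # For each focus word f, estimate P(w_i | f) for the context words w_i
    # in the four positions -2, -1, +1, +2, and concatenate the four blocks.
    context = [w for w, _ in Counter(tokens).most_common(n_context)]
    idx = {w: i for i, w in enumerate(context)}
    offsets = (-2, -1, 1, 2)
    counts = defaultdict(lambda: np.zeros(len(offsets) * n_context))
    for i, focus in enumerate(tokens):
        for b, off in enumerate(offsets):
            j = i + off
            if 0 <= j < len(tokens) and tokens[j] in idx:
                counts[focus][b * n_context + idx[tokens[j]]] += 1
    vectors = {}
    for focus, v in counts.items():
        blocks = v.reshape(len(offsets), n_context)
        sums = np.maximum(blocks.sum(axis=1, keepdims=True), 1)
        vectors[focus] = (blocks / sums).ravel()  # conditional probabilities
    return vectors

def pca_reduce(matrix, dims=25):
    # Project the row vectors onto their first `dims` principal components.
    centered = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dims].T
```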
Using a measure of the correctness of the classification of a word in Lexical Space with respect to a linguistic categorization (see Zavrel and Veenstra (1995)), we found that PCA can reduce the dimensionality from 1000 to as few as 25 dimensions with virtually no loss, and sometimes even an improvement, in the quality of the organization.</Paragraph> <Paragraph position="24"> Note that the LexSpace representations are task independent in that they only reflect the structure of neighborhood relations between words in text. However, if the task at hand has some positive relation to context prediction, Lexical Space representations are useful.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 MBL for PP attachment </SectionTitle> <Paragraph position="0"> This section describes experiments with a number of Memory-Based models for PP attachment disambiguation. The first model is based on lexical information only, i.e. the attachment decision is made by looking only at the identity of the words in the pattern. The second model considers the issue of lexical representation in the MBL framework, by taking as features either task-dependent (MVDM) or task-independent (LexSpace) syntactic vector representations for words. The introduction of vector representations leads to a number of modifications to the distance metrics and extrapolation rules in the MBL framework. A final experiment examines a number of weighted voting rules.</Paragraph> <Paragraph position="1"> The experiments in this section are conducted on a simplified version of the &quot;full&quot; PP-attachment problem, i.e. the attachment of a PP in the sequence VP NP PP. The data consist of four-tuples of words, extracted from the Wall Street Journal Treebank (Marcus, Santorini, and Marcinkiewicz, 1993) by a group at IBM (Ratnaparkhi, Reynar, and Roukos, 1994); the dataset is available from ftp://ftp.cis.upenn.edu/pub/adwait/PPattachData/, and we would like to thank Michael Collins for pointing this benchmark out to us. They took all sentences that contained the pattern VP NP PP and extracted the head words from the constituents, yielding a V N1 P N2 pattern. For each pattern they recorded whether the PP was attached to the verb or to the noun in the treebank parse. Example sentences 1 and 2 would then become: (3) eats, pizza, with, fork, V; (4) eats, pizza, with, anchovies, N.</Paragraph> <Paragraph position="2"> The data set contains 20801 training patterns, 3097 test patterns, and an independent validation set of 4039 patterns for parameter optimization. It has been used in statistical disambiguation methods by Ratnaparkhi, Reynar, and Roukos (1994) and Collins and Brooks (1995); this allows a comparison of our models to the methods they tested. All of the models described below were trained on all of the training examples, and the results are given for the 3097 test patterns. For the benchmark comparison with other methods from the literature, we use only results for which all parameters have been optimized on the validation set.</Paragraph> <Paragraph position="3"> In addition to the computational work, Ratnaparkhi, Reynar, and Roukos (1994) performed a study with three human subjects, all experienced treebank annotators, who were given a small random sample of the test sentences (either as four-tuples or as full sentences), and who had to give the same binary decision.
The humans, when given the four-tuple, gave the same answer as the Treebank parse 88.2 % of the time, and when given the whole sentence, 93.2 % of the time. As a baseline, we can consider either the Late Closure principle, which always attaches to the noun and yields a score of only 59.0 % correct, or the most likely attachment associated with the preposition, which reaches an accuracy of 72.2 %.</Paragraph> <Paragraph position="4"> The training data for this task are rather sparse.</Paragraph> <Paragraph position="5"> Of the 3097 test patterns, only 150 (4.8 %) occurred in the training set; 791 (25.5 %) patterns had at least 1 mismatching word with any pattern in the training set; 1963 (63.4 %) patterns at least 2 mismatches; and 193 (6.2 %) patterns at least 3 mismatches.</Paragraph> <Paragraph position="6"> Moreover, the test set contains many words that are not present in any of the patterns in the training set. Table 2 shows the counts of feature values and unknown values. This table also gives the Information Gain estimates of feature relevance.</Paragraph> <Paragraph position="7"> Overlap-Based Models In a first experiment, we used the IB1 algorithm and the IB1-IG algorithm. The results of these algorithms and of other methods from the literature are given in Table 3 (scores taken from Ratnaparkhi et al. (1994) and, for Transformations and Back-Off, from Collins &amp; Brooks (1995); the C4.5 decision tree results and the baselines have been computed by the authors). The addition of IG weights clearly helps, as the high weight of the P feature in effect penalizes the retrieval of patterns which do not match in the preposition. As we have argued in Zavrel and Daelemans (1997), this corresponds exactly to the behavior of the Back-Off algorithm of Collins and Brooks (1995), so it comes as no surprise that the accuracy of both methods is the same. Note that the Back-Off model was constructed after performing a number of validation experiments on held-out data to determine which terms to include and, more importantly, which to exclude from the back-off sequence. This process is much more laborious than the automatic computation of IG weights on the training set.</Paragraph> <Paragraph position="8"> The other methods for which results have been reported on this dataset include decision trees, Maximum Entropy (Ratnaparkhi, Reynar, and Roukos, 1994), and Error-Driven Transformation-Based Learning (Brill and Resnik, 1994; the results of Brill's method on the present benchmark were reconstructed by Collins and Brooks (1995)), which were clearly outperformed by both IB1 and IB1-IG, even though e.g. Brill &amp; Resnik used more elaborate feature sets (words and WordNet classes). Adding more elaborate features is also possible in the MBL framework. In this paper, however, we focus on making more effective use of the existing features. Because the Overlap metric neglects information about the degree of mismatch if feature values are not identical, it is worthwhile to look at more fine-grained representations and metrics.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Continuous Vector Representations for Words </SectionTitle> <Paragraph position="0"> In experiments with Lexical Space representations, every word in a pattern was replaced by its PCA-compressed LexSpace vector, yielding patterns with 25x4 numerical features and a discrete target category.
The distance metric used was the sum of the LexSpace vector distances per feature, where the distance between two vectors is computed as one minus the cosine, normalized by the cumulative norm. Because no two patterns have the same distance in this case, using only the nearest neighbor(s) means extrapolating from exactly one nearest neighbor.</Paragraph> <Paragraph position="1"> In preliminary experiments, this was found to give bad results, so we also experimented with various settings of k, the parameter that determines the number of neighbors considered for the analogy. The same was done for the MVDM metric, which shows similar behavior. We found that LexSpace performed best when k was set to 13 (83.3 % correct); MVDM obtained its best score when k was set to 50 (80.5 % correct). Although these parameters were found by optimization on the test set, we can see in Figure 1 that LexSpace actually outperforms MVDM for all settings of k. Thus, the representations from LexSpace, which represent the behavior of the values independently of the requirements of this particular classification task, outperform the task-specific representations used by MVDM. The reason is that the task-specific representations are derived only from the small number of occurrences of each value in the training set, whereas the amount of text available to refine the LexSpace vectors is practically unlimited. Lexical Space, however, does not outperform the simple Overlap metric (83.7 % correct) in this form. We suspected that the reason for this is that when continuous representations are used, the number of neighbors is exactly fixed to k, whereas the number of neighbors used with the Overlap metric is, in effect, dependent on the specificity of the match.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Weighted Voting </SectionTitle> <Paragraph position="0"> This section examines possibilities for improving the behavior of LexSpace vectors for MBL by considering various weighted voting methods.</Paragraph> <Paragraph position="1"> The fixed number of neighbors in the continuous metrics can result in an oversmoothing effect. The k-NN classifier tries to estimate the conditional class probabilities from samples in a local region of the data space. The radius of the region is determined by the distance of the k-th furthest neighbor.</Paragraph> <Paragraph position="2"> If k is very small and i) the nearest neighbors are not nearby due to data sparseness, or ii) the nearest neighbor classes are unreliable due to noise, the &quot;local&quot; estimate tends to be very poor, as illustrated in Figure 1. Increasing k, and thus taking into account a larger region around the query in the data space, makes it possible to overcome this effect by smoothing the estimate. However, when the majority voting method is used, smoothing can easily become oversmoothing, because the radius of the neighborhood is as large as the distance of the k'th nearest neighbor, irrespective of the local properties of the data.
Selected points from beyond the &quot;relevant neighborhood&quot; will receive a weight equal to that of the close neighbors in the voting function, which can result in unnecessary classification errors.</Paragraph> <Paragraph position="3"> A solution to this problem is the use of a weighted voting rule, which weights the vote of each of the nearest neighbors by a function of its distance to the test pattern (query). This type of voting rule was first proposed by Dudani (1976). In his scheme, the nearest neighbor gets a weight of 1, the furthest neighbor a weight of 0, and the other weights are scaled linearly to the interval in between (Equation 6):</Paragraph> <Paragraph position="4"> $$w_j = \begin{cases} \frac{d_k - d_j}{d_k - d_1} & \text{if } d_k \neq d_1 \\ 1 & \text{otherwise} \end{cases} \quad (6)$$</Paragraph> <Paragraph position="5"> where $d_j$ is the distance to the query of the j'th nearest neighbor, $d_1$ the distance of the nearest neighbor, and $d_k$ the distance of the furthest (k'th) neighbor.</Paragraph> <Paragraph position="6"> Dudani further proposed the inverse distance weight (Equation 7), which has recently become popular in the MBL literature (Wettschereck, 1994). In Equation 7, a small constant is usually added to the denominator to avoid division by zero:</Paragraph> <Paragraph position="7"> $$w_j = \frac{1}{d_j} \quad (7)$$</Paragraph> <Paragraph position="8"> Another weighting function considered here is based on the work of Shepard (1987), who argues for a universal perceptual law, in which the relevance of a previous stimulus for the generalization to a new stimulus is an exponentially decreasing function of its distance in a psychological space. This gives the weighted voting function of Equation 8, where $\alpha$ and $\beta$ are constants determining the slope and the power of the exponential decay function. In the experiments reported below, $\alpha = 3.0$ and $\beta = 1.0$:</Paragraph> <Paragraph position="9"> $$w_j = e^{-\alpha d_j^{\beta}} \quad (8)$$</Paragraph> <Paragraph position="10"> Figure 2 shows the results on the test set for a wide range of k for these voting methods when applied to the LexSpace-represented PP-attachment dataset.</Paragraph> <Paragraph position="11"> With the inverse distance weighting function the results are better than with majority voting, but here, too, we see a steep drop for k's larger than 17.</Paragraph> <Paragraph position="12"> Using Dudani's weighting function, the results become optimal for larger values of k, and remain good for a wide range of k values. Dudani's weighting function also gives us the best overall result, i.e. if we use the best possible setting of k for each method, as determined by performance on the validation set (see Table 4, which gives results on the PP attachment test set with Lexical Space representations; the values of k, the voting function, and the IG weights were determined on the training and validation sets).</Paragraph> <Paragraph position="13"> The Dudani-weighted k-nearest neighbor classifier (k=30) slightly outperforms Collins &amp; Brooks' (1995) Back-Off model. A further small increase was obtained by combining LexSpace representations with IG weighting of the features and Dudani's weighted voting function. Although the improvement over Back-Off is quite limited, these results are nonetheless interesting because they show that MBL can gain from the introduction of extra information sources, whereas this is very difficult in the Back-Off algorithm.
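The cosine-based distance and the three voting functions of Equations 6-8 can be summarized in the following sketch — our own illustration, not the authors' code; the neighbor list and toy distances are hypothetical:

```python
import math
from collections import defaultdict

def cosine_distance(u, v):
    # One minus the cosine between two LexSpace vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - num / den if den else 1.0

def dudani_weights(dists):
    # Equation 6: nearest neighbor weight 1, furthest 0, linear in between.
    d1, dk = dists[0], dists[-1]
    if dk == d1:
        return [1.0] * len(dists)
    return [(dk - d) / (dk - d1) for d in dists]

def inverse_weights(dists, eps=1e-6):
    # Equation 7, with a small constant eps to avoid division by zero.
    return [1.0 / (d + eps) for d in dists]

def exp_weights(dists, alpha=3.0, beta=1.0):
    # Equation 8: exponential decay of relevance with distance.
    return [math.exp(-alpha * d ** beta) for d in dists]

def weighted_vote(neighbors, weight_fn):
    # neighbors: list of (distance, label) sorted by ascending distance.
    dists = [d for d, _ in neighbors]
    scores = defaultdict(float)
    for w, (_, label) in zip(weight_fn(dists), neighbors):
        scores[label] += w
    return max(scores, key=scores.get)

# Hypothetical neighbors of a query (distance, attachment class):
print(weighted_vote([(0.1, "N"), (0.3, "V"), (0.35, "N")], dudani_weights))
```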
For comparison, consider that the performance of the Maximum Entropy model with distributional word-class features is still only 81.6 % on this data.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Discussion </SectionTitle> <Paragraph position="0"> If we compare the accuracy of humans on the V,N,P,N patterns (88.2 % correct) with that of our most accurate method (84.4 %), we see that the paradigm of learning disambiguation methods from corpus statistics offers good prospects for an effective solution to the problem. After the initial effort by Hindle and Rooth (1993), it has become clear that this area needs statistical methods in which an easy integration of many information sources is possible. A number of methods have been applied to the task with this goal in mind.</Paragraph> <Paragraph position="1"> Brill and Resnik (1994) applied Error-Driven Transformation-Based Learning to this task, using the verb, noun1, preposition, and noun2 features.</Paragraph> <Paragraph position="2"> Their method tries to maximize accuracy with a minimal number of rules. They found an increase in performance by using semantic information from WordNet. Ratnaparkhi, Reynar, and Roukos (1994) used a Maximum Entropy model and a decision tree on the dataset they extracted from the Wall Street Journal corpus. They also report performance gains with word features derived by an unsupervised clustering method. Ratnaparkhi et al. ignored low frequency events. The accuracy of these two approaches is not optimal. This is most likely due to the fact that they treat low frequency events as noise, even though these contain a lot of information in a sparse domain such as PP-attachment. Franz (1996) used a Loglinear model for PP attachment. The features he used were the preposition, the verb level (the lexical association between the verb and the preposition), the noun level (the same for noun1), the noun tag (POS tag for noun1), noun definiteness (of noun1), and the PP-object tag (POS tag for noun2). A Loglinear model keeps track of the interaction between all the features, though at a fairly high computational cost. The dataset that was used in Franz' work is no longer available, making a direct comparison of the performance impossible. Collins and Brooks (1995) used a Back-Off model, which enables them to take low frequency effects into account, on the Ratnaparkhi dataset (with good results). In Zavrel and Daelemans (1997) it is shown that Memory-Based and Back-Off type methods are closely related, which is mirrored in the performance levels. Collins and Brooks got slightly better results (84.5 %) after reducing the sparse data problem by preprocessing the dataset, e.g. replacing all four-digit numbers with 'YEAR'. The experiments with Lexical Space representations have as yet not shown impressive performance gains over Back-Off, but they have demonstrated that the MBL framework is well-suited to experimentation with rich lexical representations.</Paragraph> </Section> </Paper>