File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/n03-1033_metho.xml
Size: 28,274 bytes
Last Modified: 2025-10-06 14:08:12
<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1033"> <Title>Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Bidirectional Dependency Networks </SectionTitle> <Paragraph position="0"> When building probabilistic models for tag sequences, we often decompose the global probability of sequences using a directed graphical model (e.g., an HMM (Brants, 2000) or a conditional Markov model (CMM) (Ratnaparkhi, 1996)). In such models, the probability assigned to a tagged sequence of words x = <t,w> is the product of a sequence of local portions of the graphical model, one from each time slice. For example, in the left-to-right</Paragraph> <Paragraph position="2"> That is, the replicated structure is a local model P(t0|t[?]1,w0).2 Of course, if there are too many conditioned quantities, these local models may have to be estimated in some sophisticated way; it is typical in tagging to populate these models with little maximum entropy models. For example, we might populate a model for P(t0|t[?]1,w0) with a maxent model of the form:</Paragraph> <Paragraph position="4"> In this case, the w0 and t[?]1 can have joint effects on t0, but there are not joint features involving all three variables (though there could have been such features). We say that this model uses the feature templates <t0, t[?]1> (previous tag features) and <t0,w0> (current word features).</Paragraph> <Paragraph position="5"> Clearly, both the preceding tag t[?]1 and following tag t+1 carry useful information about a current tag t0. Unidirectional models do not ignore this influence; in the case of a left-to-right CMM, the influence of t[?]1 on t0 is explicit in the P(t0|t[?]1,w0) local model, while the influence of t+1 on t0 is implicit in the local model at the next position (via P(t+1|t0,w+1)). The situation is reversed for the right-to-left CMM in figure 1(b).</Paragraph> <Paragraph position="6"> From a seat-of-the-pants machine learning perspective, when building a classifier to label the tag at a certain position, the obvious thing to do is to explicitly include in the local model all predictive features, no matter on which side of the target position they lie. There are two good formal reasons to expect that a model explicitly conditioning on both sides at each position, like figure 1(c) could be advantageous. First, because of smoothing effects and interaction with other conditioning features (like the words), left-to-right factors like P(t0|t[?]1,w0) do not always suffice when t0 is implicitly needed to determine t[?]1. For example, consider a case of observation bias (Klein and Manning, 2002) for a first-order left-to-right CMM. The word to has only one tag (TO) in the PTB tag set. The TO tag is often preceded by nouns, but rarely by modals (MD). In a sequence will to fight, that trend indicates that will should be a noun rather than a modal verb. However, that effect is completely lost in a CMM like (a): P(twill|will,<start> ) prefers the modal tagging, and P(TO|to, twill) is roughly 1 regardless of twill . While the model has an arrow between the two tag positions, that path of influence is severed.3 The same problem exists in the other direction. If we use the symmetric right- null or &quot;observation bias&quot;, these effects are really just unwanted explaining-away effects (Cowell et al., 1999, 19), where two nodes which are not actually in causal competition have been modeled as if they were.</Paragraph> <Paragraph position="7"> P(A)P(B|A), (b) the Bayes' net for P(A|B)P(B), (c) a bidirectional net with models of P(A|B) and P(B|A), which is not a Bayes' net.</Paragraph> <Paragraph position="8"> to-left model, fight will receive its more common noun tagging by symmetric reasoning. However, the bidirectional model (c) discussed in the next section makes both directions available for conditioning at all locations, using replicated models of P(t0|t[?]1, t+1,w0), and will be able to get this example correct.4</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Semantics of Dependency Networks </SectionTitle> <Paragraph position="0"> While the structures in figure 1(a) and (b) are well-understood graphical models with well-known semantics, figure 1(c) is not a standard Bayes' net, precisely because the graph has cycles. Rather, it is a more general dependency network (Heckerman et al., 2000). Each node represents a random variable along with a local conditional probability model of that variable, conditioned on the source variables of all incoming arcs. In this sense, the semantics are the same as for standard Bayes' nets.</Paragraph> <Paragraph position="1"> However, because the graph is cyclic, the net does not correspond to a proper factorization of a large joint probability estimate into local conditional factors. Consider the two-node cases shown in figure 2. Formally, for the net in (a), we can write P(a, b) = P(a)P(b|a). For (b) we write P(a, b) = P(b)P(a|b). However, in (c), the nodes A and B carry the information P(a|b) and P(b|a) respectively. The chain rule doesn't allow us to reconstruct P(a, b) by multiplying these two quantities. Under appropriate conditions, we could reconstruct P(a, b) from these quantities using Gibbs sampling, and, in general, that is the best we can do. However, while reconstructing the joint probabilities from these local conditional probabilities may be difficult, estimating the local probabilities themselves is no harder than it is for acyclic models: we take observations of the local environments and use any maximum likelihood estimation method we desire. In our experiments, we used local maxent models, but if the event space allowed, (smoothed) relative counts would do.</Paragraph> <Paragraph position="2"> 4The effect of indirect influence being weaker than direct influence is more pronounced for conditionally structured models, but is potentially an issue even with a simple HMM. The probabilistic models for basic left-to-right and right-to-left HMMs with emissions on their states can be shown to be equivalent using Bayes' rule on the transitions, provided start and end symbols are modeled. However, this equivalence is violated in practice by the addition of smoothing.</Paragraph> <Paragraph position="3"> function bestScore()</Paragraph> <Paragraph position="5"> order bidirectional CMM (memoized version).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Inference for Linear Dependency Networks </SectionTitle> <Paragraph position="0"> Cyclic or not, we can view the product of local probabilities from a dependency network as a score:</Paragraph> <Paragraph position="2"> where Pa(xi) are the nodes with arcs to the node xi . In the case of an acyclic model, this score will be the joint probability of the event x, P(x). In the general case, it will not be. However, we can still ask for the event, in this case the tag sequence, with the highest score. For dependency networks like those in figure 1, an adaptation of the Viterbi algorithm can be used to find the maximizing sequence in polynomial time. Figure 3 gives pseudocode for the concrete case of the network in figure 1(d); the general case is similar, and is in fact just a max-plus version of standard inference algorithms for Bayes' nets (Cowell et al., 1999, 97). In essence, there is no difference between inference on this network and a second-order left-to-right CMM or HMM. The only difference is that, when the Markov window is at a position i, rather than receiving the score for P(ti|ti[?]1, ti[?]2,wi), one receives the score for P(ti[?]1|ti, ti[?]2,wi[?]1).</Paragraph> <Paragraph position="3"> There are some foundational issues worth mentioning. As discussed previously, the maximum scoring sequence need not be the sequence with maximum likelihood according to the model. There is therefore a worry with these models about a kind of &quot;collusion&quot; where the model locks onto conditionally consistent but jointly unlikely sequences. Consider the two-node network in figure 2(c). If we have the following distribution of observations (in the form ab) <11, 11, 11, 12, 21, 33> , then clearly the most likely state of the network is 11. However, the score of 11 is P(a = 1|b = 1)P(b = 1|a = 1) = 3/4 x 3/4 = 9/16, while the score of 33 is 1. An additional related problem is that the training set loss (sum of negative logarithms of the sequence scores) does not bound the training set error (0/1 loss on sequences) from above. Consider the following training set, for the same network, with each entire data point considered as a label: <11, 22> . The relative-frequency model assigns loss 0 to both training examples, but cannot do better than 50% error in regenerating the training data labels. These issues are further discussed in Heckerman et al. (2000).</Paragraph> <Paragraph position="4"> Preliminary work of ours suggests that practical use of dependency networks is not in general immune to these theoretical concerns: a dependency network can choose a sequence model that is bidirectionally very consistent but does not match the data very well. However, this problem does not appear to have prevented the networks from performing well on the tagging problem, probably because features linking tags and observations are generally much sharper discriminators than tag sequence features.</Paragraph> <Paragraph position="5"> It is useful to contrast this framework with the conditional random fields of Lafferty et al. (2001). The CRF approach uses similar local features, but rather than chaining together local models, they construct a single, globally normalized model. The principal advantage of the dependency network approach is that advantageous bidirectional effects can be obtained without the extremely expensive global training required for CRFs.</Paragraph> <Paragraph position="6"> To summarize, we draw a dependency network in which each node has as neighbors all the other nodes that we would like to have influence it directly. Each node's neighborhood is then considered in isolation and a local model is trained to maximize the conditional likelihood over the training data of that node. At test time, the sequence with the highest product of local conditional scores is calculated and returned. We can always find the exact maximizing sequence, but only in the case of an acyclic net is it guaranteed to be the maximum likelihood sequence.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> The part of speech tagged data used in our experiments is the Wall Street Journal data from Penn Treebank III (Marcus et al., 1994). We extracted tagged sentences from the parse trees.5 We split the data into training, development, and test sets as in (Collins, 2002). Table 1 lists character5Note that these tags (and sentences) are not identical to those obtained from the tagged/pos directories of the same disk: hundreds of tags in the RB/RP/IN set were changed to be more consistent in the parsed/mrg version. Maybe we were the last to discover this, but we've never seen it in print.</Paragraph> <Paragraph position="1"> istics of the three splits.6 Except where indicated for the model BEST, all results are on the development set.</Paragraph> <Paragraph position="2"> One innovation in our reporting of results is that we present whole-sentence accuracy numbers as well as the traditional per-tag accuracy measure (over all tokens, even unambiguous ones). This is the quantity that most sequence models attempt to maximize (and has been motivated over doing per-state optimization as being more useful for subsequent linguistic processing: one wants to find a coherent sentence interpretation). Further, while some tag errors matter much more than others, to a first cut getting a single tag wrong in many of the more common ways (e.g., proper noun vs. common noun; noun vs.</Paragraph> <Paragraph position="3"> verb) would lead to errors in a subsequent processor such as an information extraction system or a parser that would greatly degrade results for the entire sentence. Finally, the fact that the measure has much more dynamic range has some appeal when reporting tagging results.</Paragraph> <Paragraph position="4"> The per-state models in this paper are log-linear models, building upon the models in (Ratnaparkhi, 1996) and (Toutanova and Manning, 2000), though some models are in fact strictly simpler. The features in the models are defined using templates; there are different templates for rare words aimed at learning the correct tags for unknown words.7 We present the results of three classes of experiments: experiments with directionality, experiments with lexicalization, and experiments with smoothing.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Experiments with Directionality </SectionTitle> <Paragraph position="0"> In this section, we report experiments using log-linear CMMs to populate nets with various structures, exploring the relative value of neighboring words' tags. Table 2 lists the discussed networks. All networks have the same vertical feature templates: <t0,w0> features for known words and various <t0,s(w1n)> word signature features for all words, known or not, including spelling and capitalization features (see section 3.3).</Paragraph> <Paragraph position="1"> Just this vertical conditioning gives an accuracy of 93.69% (denoted as &quot;Baseline&quot; in table 2).8 Condition6Tagger results are only comparable when tested not only on the same data and tag set, but with the same amount of training data. Brants (2000) illustrates very clearly how tagging performance increases as training set size grows, largely because the percentage of unknown words decreases while system performance on them increases (they become increasingly restricted as to word class).</Paragraph> <Paragraph position="2"> 7Except where otherwise stated, a count cutoff of 2 was used for common word features and 35 for rare word features (templates need a support set strictly greater in size than the cutoff before they are included in the model).</Paragraph> <Paragraph position="3"> 8Charniak et al. (1993) noted that such a simple model got 90.25%, but this was with no unknown word model beyond a prior distribution over tags. Abney et al. (1999) raise this baseline to 92.34%, and with our sophisticated unknown word model, it gets even higher. The large number of unambiguous tokens and ones with very skewed distributions make the base-</Paragraph> <Paragraph position="5"> ing on the previous tag as well (model L, <t0, t[?]1> features) gives 95.79%. The reverse, model R, using the next tag instead, is slightly inferior at 95.14%. Model L+R, using both tags simultaneously (but with only the individual-direction features) gives a much better accuracy of 96.57%. Since this model has roughly twice as many tag-tag features, the fact that it outperforms the uni-directional models is not by itself compelling evidence for using bidirectional networks. However, it also out-performs model L+L2 which adds the <t0, t[?]2> secondprevious word features instead of next word features, which gives only 96.05% (and R+R2 gives 95.25%). We conclude that, if one wishes to condition on two neighboring nodes (using two sets of 2-tag features), the symmetric bidirectional model is superior.</Paragraph> <Paragraph position="6"> High-performance taggers typically also include joint three-tag counts in some way, either as tag trigrams (Brants, 2000) or tag-triple features (Ratnaparkhi, 1996, Toutanova and Manning, 2000). Models LL, RR, and CR use only the vertical features and a single set of tag-triple features: the left tags (t[?]2, t[?]1 and t0), right tags (t0, t+1, t+2), or centered tags (t[?]1, t0, t+1) respectively. Again, with roughly equivalent feature sets, the left context is better than the right, and the centered context is better than either unidirectional context.</Paragraph> <Paragraph position="7"> line for this task high, while substantial annotator noise creates an unknown upper bound on the task.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Lexicalization </SectionTitle> <Paragraph position="0"> Lexicalization has been a key factor in the advance of statistical parsing models, but has been less exploited for tagging. Words surrounding the current word have been occasionally used in taggers, such as (Ratnaparkhi, 1996), Brill's transformation based tagger (Brill, 1995), and the HMM model of Lee et al. (2000), but nevertheless, the only lexicalization consistently included in tagging models is the dependence of the part of speech tag of a word on the word itself.</Paragraph> <Paragraph position="1"> In maximum entropy models, joint features which look at surrounding words and their tags, as well as joint features of the current word and surrounding words are in principle straightforward additions, but have not been incorporated into previous models. We have found these features to be very useful. We explore here lexicalization both alone and in combination with preceding and following tag histories.</Paragraph> <Paragraph position="2"> Table 3 shows the development set accuracy of several models with various lexical features. All models use the same rare word features as the models in Table 2. The first two rows show a baseline model using the current word only. The count cutoff for this feature was 0 in the first model and 2 for the model in the second row. As there are no tag sequence features in these models, the accuracy drops significantly if a higher cutoff is used (from a per tag accuracy of about 93.7% to only 60.2%).</Paragraph> <Paragraph position="3"> The third row shows a model where a tag is decided solely by the three words centered at the tag position (3W). As far as we are aware, models of this sort have not been explored previously, but its accuracy is surprisingly high: despite having no sequence model at all, it is more accurate than a model which uses standard tag fourgram HMM features (<t0,w0> ,<t0, t[?]1> , <t0, t[?]1, t[?]2> ,<t0, t[?]1, t[?]2, t[?]3> , shown in Table 2, model L+LL+LLL).</Paragraph> <Paragraph position="4"> The fourth and fifth rows show models with bi-directional tagging features. The fourth model (3W+TAGS) uses the same tag sequence features as the last model in Table 2 (<t0, t[?]1> ,<t0, t[?]1, t[?]2> , <t0, t[?]1, t+1> ,<t0, t+1> ,<t0, t+1, t+2> ) and current, previous, and next word. The last model has in addition the feature templates <t0,w0, t[?]1> ,<t0,w0, t+1> , <t0,w[?]1,w0> , and <t0,w0,w+1> , and includes the improvements in unknown word modeling discussed in section 3.3.9 We call this model BEST. BEST has a token accuracy on the final test set of 97.24% and a sentence accuracy of 56.34% (see Table 4). A 95% confidence interval for the accuracy (using a binomial model) is (97.15%, 97.33%).</Paragraph> <Paragraph position="5"> In order to understand the gains from using right context tags and more lexicalization, let us look at an example of an error that the enriched models learn not to make. An interesting example of a common tagging error of the simpler models which could be corrected by a deterministic fixup rule of the kind used in the IDIOMTAG module of (Marshall, 1987) is the expression as X as (often, as far as). This should be tagged as/RB X/{RB,JJ} as/IN in the Penn Treebank. A model using only current word and two left tags (model L+L2 in Table 2), made 87 errors on this expression, tagging it as/IN X as/IN - since the tag sequence probabilities do not give strong reasons to disprefer the most common tagging of as (it is tagged as IN over 80% of the time). However, the model 3W+TAGS, which uses two right tags and the two surrounding words in addition, made only 8 errors of this kind, and model BEST made only 6 errors.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Unknown word features </SectionTitle> <Paragraph position="0"> Most of the models presented here use a set of unknown word features basically inherited from (Ratnaparkhi, 1996), which include using character n-gram prefixes and suffixes (for n up to 4), and detectors for a few other prominent features of words, such as capitalization, hyphens, and numbers. Doing error analysis on unknown words on a simple tagging model (with <t0, t[?]1> , <t0, t[?]1, t[?]2> , and <w0, t0> features) suggested several additional specialized features that can usefully improve 9Thede and Harper (1999) use <t[?]1, t0,w0> templates in their &quot;full-second order&quot; HMM, achieving an accuracy of 96.86%. Here we can add the opposite tiling and other features.</Paragraph> <Paragraph position="1"> performance. By far the most significant is a crude company name detector which marks capitalized words followed within 3 words by a company name suffix like Co.</Paragraph> <Paragraph position="2"> or Inc. This suggests that further gains could be made by incorporating a good named entity recognizer as a preprocessor to the tagger (reversing the most common order of processing in pipelined systems!), and is a good example of something that can only be done when using a conditional model. Minor gains come from a few additional features: an allcaps feature, and a conjunction feature of words that are capitalized and have a digit and a dash in them (such words are normally common nouns, such as CFC-12 or F/A-18). We also found it advantageous to use prefixes and suffixes of length up to 10. Together with the larger templates, these features contribute to our unknown word accuracies being higher than those of previously reported taggers.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Smoothing </SectionTitle> <Paragraph position="0"> With so many features in the model, overtraining is a distinct possibility when using pure maximum likelihood estimation. We avoid this by using a Gaussian prior (aka quadratic regularization or quadratic penalization) which resists high feature weights unless they produce great score gain. The regularized objective F is:</Paragraph> <Paragraph position="2"> Since we use a conjugate-gradient procedure to maximize the data likelihood, the addition of a penalty term is easily incorporated. Both the total size of the penalty and the partial derivatives with repsect to each lj are trivial to compute; these are added to the log-likelihood and log-likelihood derivatives, and the penalized optimization procedes without further modification.</Paragraph> <Paragraph position="3"> We have not extensively experimented with the value of s2 - which can even be set differently for different parameters or parameter classes. All the results in this paper use a constant s2 = 0.5, so that the denominator disappears in the above expression. Experiments on a simple model with s made an order of magnitude higher or lower both resulted in worse performance than with s2 = 0.5.</Paragraph> <Paragraph position="4"> Our experiments show that quadratic regularization is very effective in improving the generalization performance of tagging models, mostly by increasing the number of features which could usefully be incorporated. The tures with support [?] cutoff are excluded from the model). number of features used in our complex models - in the several hundreds of thousands, is extremely high in comparison with the data set size and the number of features used in other machine learning domains. We describe two sets of experiments aimed at comparing models with and without regularization. One is for a simple model with a relatively small number of features, and the other is for a model with a large number of features.</Paragraph> <Paragraph position="5"> The usefulness of priors in maximum entropy models is not new to this work: Gaussian prior smoothing is advocated in Chen and Rosenfeld (2000), and used in all the stochastic LFG work (Johnson et al., 1999). However, until recently, its role and importance have not been widely understood. For example, Zhang and Oles (2001) attribute the perceived limited success of logistic regression for text categorization to a lack of use of regularization. At any rate, regularized conditional loglinear models have not previously been applied to the problem of producing a high quality part-of-speech tagger: Ratnaparkhi (1996), Toutanova and Manning (2000), and Collins (2002) all present unregularized models. Indeed, the result of Collins (2002) that including low support features helps a voted perceptron model but harms a maximum entropy model is undone once the weights of the maximum entropy model are regularized.</Paragraph> <Paragraph position="6"> Table 5 shows results on the development set from two pairs of experiments. The first pair of models use common word templates <t0,w0> ,<t0, t[?]1, t[?]2> and the same rare word templates as used in the models in table 2. The second pair of models use the same features as model BEST with a higher frequency cutoff of 5 for common word features.</Paragraph> <Paragraph position="7"> For the first pair of models, the error reduction from smoothing is 5.3% overall and 20.1% on unknown words.</Paragraph> <Paragraph position="8"> For the second pair of models, the error reduction is even bigger: 16.2% overall after convergence and 5.8% if looking at the best accuracy achieved by the unsmoothed model (by stopping training after 75 iterations; see below). The especially large reduction in unknown word error reflects the fact that, because penalties are effectively stronger for rare features than frequent ones, the presence of penalties increases the degree to which more general cross-word signature features (which apply to unknown words) are used, relative to word-specific sparse features (which do not apply to unknown words).</Paragraph> <Paragraph position="9"> Secondly, use of regularization allows us to incorporate features with low support into the model while improving performance. Whereas Ratnaparkhi (1996) used feature support cutoffs and early stopping to stop overfitting of the model, and Collins (2002) contends that including low support features harms a maximum entropy model, our results show that low support features are useful in a regularized maximum entropy model. Table 6 contrasts our results with those from Collins (2002). Since the models are not the same, the exact numbers are incomparable, but the difference in direction is important: in the regularized model, performance improves with the inclusion of low support features.</Paragraph> <Paragraph position="10"> Finally, in addition to being significantly more accurate, smoothed models train much faster than unsmoothed ones, and do not benefit from early stopping. For example, the first smoothed model in Table 5 required 80 conjugate gradient iterations to converge (somewhat arbitrarily defined as a maximum difference of 10[?]4 in feature weights between iterations), while its corresponding unsmoothed model required 335 iterations, thus training was roughly 4 times slower.10 The second pair of models required 134 and 370 iterations respectively. As might be expected, unsmoothed models reach their highest generalization capacity long before convergence and accuracy on an unseen test set drops considerably with further iterations. This is not the case for smoothed models, as their test set accuracy increases almost monotonically with training iterations.11 Figure 4 shows a graph of training iterations versus accuracy for the second pair of models on the development set.</Paragraph> </Section> </Section> class="xml-element"></Paper>