<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1308">
  <Title>Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger</Title>
  <Section position="2" start_page="0" end_page="65" type="metho">
    <SectionTitle>
1 The Baseline Maximum Entropy Model
</SectionTitle>
    <Paragraph position="0"> We started with a maximum entropy based tagger that uses features very similar to the ones proposed in Ratnaparkhi (1996). The tagger learns a loglinear conditional probability model from tagged text, using a maximum entropy method.</Paragraph>
    <Paragraph position="1"> The model assigns a probability for every tag t in the set T of possible tags given a word and its context h, which is usually def'med as the sequence of several words and tags preceding the word. This model can be used for estimating the probability of a tag sequence h...tn given a sentence w~...wn: n n p(t,...t n I wl...w,) = I~I p(t, \[t~.. 2,_,,w~...w,) = II p(ti I h~) iffil i=! As usual, tagging is the process of assigning the maximum likelihood tag sequence to a string of words.</Paragraph>
    <Paragraph position="2"> The idea of maximum entr?py modeling is to choose the probability distribution p that has the highest entropy out of those distributions  that satisfy a certain set of constraints. The constraints restrict the model to behave in accordance with a set of statistics collected from the training data. The statistics are expressed as the expected values of appropriate functions defined on the contexts h and tags t. In particular, the constraints demand that the expectations of the features for the model match the empirical expectations of the *features over the training data.</Paragraph>
    <Paragraph position="3"> For example, if we want to constrain the model to tag make as a verb or noun with the same frequency as the empirical model induced by the training data, we define the features: fl(h,t)=l iff wi=makeandt=NN f2(h,t)=l iff w i=makeandt=VB Some commonly used statistics for part of speech tagging are: how often a certain word was tagged in a certain way; how often two tags appeared in sequence or how often three tags appeared in sequence. These look a lot like the statistics a Markov Model would use. However, in the maximum entropy framework it is possible to easily define and incorporate much more complex statistics, not restricted to n-gram sequences.</Paragraph>
    <Paragraph position="4"> The constraints in our model are that the expectations of these features according to the joint distribution p are equal to the expectations of the features in the empirical (training data) distribution ~ : Ep~h.,)fi (h, t) = E~h,,) ~ (h, t). Having defined a set of constraints that our model should accord with, we proceed to find the model satisfying the constraints that maximizes the conditional entropy of p. The intuition is that such a model assumes nothing apart from that it should satisfy the given constraints. Following Berger et al. (1996), we approximate p(h,t), the joint distribution of contexts and tags, by the product of ~(h), the empirical distribution of histories h, and the conditional distribution p(t l h): p(h,t) = ~(h). p(t lh). Then for the example above, our constraints would be the following, for j E {1,2}: ~(h, t)f i (h, t) = ~ ,~(h)p(t \[ h)f i (h, t) hEH.tET hsH, t~T This approximation is used to enable efficient computation. The expectation for a feature f is:</Paragraph>
    <Paragraph position="6"> where H is the space of possible contexts h when predicting a part of speech tag t. Since the contexts contain sequences of words and tags and other information, the space H is huge. But using this approximation, we can instead sum just over the smaller space of observed contexts X in the training sample, because the empirical prior ~(h) is zero for unseen contexts h:</Paragraph>
    <Paragraph position="8"> The model that is a solution to this constrained optimization task is an exponential (or equivalently, loglinear) model with the parametric form:</Paragraph>
    <Paragraph position="10"> where the denominator is a normalizing term (sometimes referred to as the partition function).</Paragraph>
    <Paragraph position="11"> The parameters X: correspond to weights for the features 3T-We will not discuss in detail the characteristics of the model or the parameter estimation procedure used - Improved Iterative Scaling.</Paragraph>
    <Paragraph position="12"> For a more extensive discussion of maximum entropy methods, see Berger et al. (1996) and Jelinek (1997). However, we note that our parameter estimation algorithm directly uses equation (1). Ratnaparkhi (1996: 134) suggests use of an approximation summing over the training data, which does not sum over possible tags:</Paragraph>
    <Paragraph position="14"> However, we believe this passage is in error: such an estimate is ineffective in the iterative scaling algorithm. Further, we note that expectations of the form (1) appear in Ratnaparkhi (1998: 12).</Paragraph>
    <Paragraph position="15"> 1.1 Features in the Baseline Model In our baseline model, the context available when predicting the part of speech tag of a word wi in a sentence of words {wl... wn} with tags {tl... t~} is {ti.l tin wi wi+l}. The features that define the constraints on the model are obtained by instantiation of feature templates as in Ratnaparkhi (1996). Special feature templates exist for rare words in the training data, to increase the model's predictioff-capacity for unknown words.</Paragraph>
    <Paragraph position="16">  The actual feature templates for this model are shown in the next table. They are a subset of the features used in Ratnaparkhi (1996).</Paragraph>
    <Paragraph position="17">  No. Feature Type Template 1. General wi=X &amp; ti =T 2. General b.l=Tl &amp; ti=T 3. General tia=T\] &amp; ti.2=T2 &amp; ti=T 4. General Wi+I=X &amp; ti =T 5. Rare Suffix of wi =S, IS1&lt;5 ,~ t,=T 6. Rare Prefix of w~=P, l&lt;IPI&lt;5 &amp; ti=T 7. Rare w~ contains a number &amp; t~=T 8. Rare wi contains an uppercase character &amp; t~=T 9. Rare w~ contains a hyphen &amp; ti=T  Table 1 Baseline Model Features General feature templates can be instantiated by arbitrary contexts, whereas rare feature templates are instantiated only by histories where the current word wi is rare. Rare words are defined to be words that appear less than a certain number of times in the training data (here, the value 7 was used).</Paragraph>
    <Paragraph position="18"> In order to be able to throw out features that would give misleading statistics due to sparseness or noise in the data, we use two different cutoff values for general and rare feature templates (in this implementation, 5 and 45 respectively). As seen in Table 1 the features are conjunctions of a boolean function on the history h and a boolean function on the tag t. Features whose first conjuncts are true for more than the corresponding threshold number of histories in the training data are included in the model.</Paragraph>
    <Paragraph position="19"> The feature templates in Ratnaparkhi (1996) that were left out were the ones that look at the previous word, the word two positions before the current, and the word two positions after the current. These features are of the same form as template 4 in Table 1, but they look at words in different positions.</Paragraph>
    <Paragraph position="20"> Our motivation for leaving these features out was the results from some experiments on successively adding feature templates. Adding template 4 to a model that incorporated the general feature templates 1 to 3 only and the rare feature templates 5-8 significantly increased the accuracy on the development set from 96.0% to 96.52%. The addition of a feature template that looked at the preceding word and the current tag to the resulting model slightly reduced the accuracy.</Paragraph>
    <Section position="1" start_page="64" end_page="65" type="sub_section">
      <SectionTitle>
1.2 Testing and Performance
</SectionTitle>
      <Paragraph position="0"> The model was trained and tested on the part-of-speech tagged WSJ section of the Penn Treebank. The data was divided into contiguous parts: sections 0-20 were used for training, sections 21-22 as a development test set, and sections 23-24 as a final test set. The data set sizes are shown below together with numbers of unknown words.</Paragraph>
      <Paragraph position="1">  The testing procedure uses a beam search to find the tag sequence with maximal probability given a sentence. In our experiments we used a beam of size 5. Increasing the beam size did not result in improved accuracy.</Paragraph>
      <Paragraph position="2"> The preceding tags for the word at the beginning of the sentence are regarded as having the pseudo-tag NA. In this way, the information that a word is the first word in a sentence is available to the tagger. We do not have a special end-of-sentence symbol.</Paragraph>
      <Paragraph position="3"> We used a tag dictionary for known words in testing. This was built from tags found in the training data but augmented so as to capture a few basic systematic tag ambiguities that are found in English. Namely, for regular verbs the -ed form can be either a VBD or a VBN and similarly the stem form can be either a VBP or VB. Hence for words that had occurred with only one of these tags in the training data the other was also included as possible for assignment.</Paragraph>
      <Paragraph position="4"> The results on the test set for the Baseline model are shown in Table 3.</Paragraph>
      <Paragraph position="5">  accuracy figure for our model is higher overall  but lower for unknown words. This may stem from the differences between the two models' feature templates, thresholds, and approximations of the expected values for the features, as discussed in the beginning of the section, or may just reflect differences in the choice of training and test sets (which are not precisely specified in Ratnaparkhi (1996)).</Paragraph>
      <Paragraph position="6"> The differences are not great enough to justify any definite statement about the different use of feature templates or other particularities of the model estimation. One conclusion that we can draw is that at present the additional word features used in Ratnaparkhi (1996) - looking at words more than one position away from the current - do not appear to be helping the overall performance of the models.</Paragraph>
    </Section>
    <Section position="2" start_page="65" end_page="65" type="sub_section">
      <SectionTitle>
1.3 Discussion of Problematic Cases
</SectionTitle>
      <Paragraph position="0"> A large number of words, including many of the most common words, can have more than one syntactic category. This introduces a lot of ambiguities that the tagger has to resolve. Some of the ambiguities are easier for taggers to resolve and others are harder.</Paragraph>
      <Paragraph position="1"> Some of the most significant confusions that the Baseline model made on the test set can be seen in Table 5. The row labels in Table 5 signify the correct tags, and the column labels signify the assigned tags. For example, the number 244 in the (NN, JJ) position is the number of words that were NNs but were incorrectly assigned the JJ category. These particular confusions, shown in the table, account for a large percentage of the total error (2652/3651 = 72.64%). Table 6 shows part of the Baseline model's confusion matrix for just unknown words.</Paragraph>
      <Paragraph position="2"> Table 4 shows the Baseline model's overall assignment accuracies for different parts of speech. For example, the accuracy on nouns is greater than the accuracy on adjectives. The accuracy on NNPS (plural proper nouns) is a surprisingly low 41.1%.</Paragraph>
      <Paragraph position="3">  of speech for the Baseline model.</Paragraph>
      <Paragraph position="4"> Tagger errors are of various types. Some are the result of inconsistency in labeling in the training data (Ratnaparkhi 1996), which usually reflects a lack of linguistic clarity or determination of the correct part of speech in context. For instance, the status of various noun premodifiers (whether chief or maximum is NN or JJ, or whether a word in -ing is acting as a JJ or VBG) is of this type. Some, such as errors between NN/NNP/NNPS/NNS largely reflect difficulties with unknown words. But other cases, such as VBN/VBD and VB/VBP/NN, represent systematic tag ambiguity patterns in English, for which the fight answer is invariably clear in context, and for which there are in general good structural contextual clues that one should be able to use to disarnbiguate. Finally, in another class of cases, of which the most prominent is probably the RP/IN/RB ambiguity of words like up, out, and on, the linguistic distinctions, while having a sound empirical basis (e.g., see Baker (1995: 198-201), are quite subtle, and often require semantic intuitions. There are not good syntactic cues for the correct tag (and furthermore, human taggers not infrequently make errors). Within this classification, the greatest hopes for tagging improvement appear to come from minimizing errors in the second and third classes of this classification.</Paragraph>
      <Paragraph position="5"> In the following sections we discuss how we include additional knowledge sources to help in the assignment of tags to forms of verbs, capitalized unknown words, particle words, and in the overall accuracy of part of speech assignments.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="65" end_page="67" type="metho">
    <SectionTitle>
2 Improving the Unknown Words Model
</SectionTitle>
    <Paragraph position="0"> The accuracy of the baseline model is markedly lower for unknown words than for previously seen ones. This is also the case for all other taggers, and reflects the importance of lexical information to taggers: in the best accuracy figures punished for corpus-based taggers, known word accuracy is around 97%, whereas unknown word accuracy is around 85%.</Paragraph>
    <Paragraph position="1"> In following experiments, we examined ways of using additional features to improve the accuracy of tagging unknown words. As previously discussed in Mikbeev (1999), it is possible to improve the accuracy on capitalized words that might be proper nouns or the first word in a sentence, etc.</Paragraph>
    <Paragraph position="2"> .r.</Paragraph>
    <Paragraph position="3">  1. Current word 15,832 15,832 15,837 15,927 2. Previous tag 1,424 1,424 1,424 1,424 3. Previous two tags 16,124 16,124 16,124 16,124 4. Next word 80,075 80,075 80,075 80,075 5. Suffixes 3,361 3,361 3,361 3,387 6. Prefixes 5,311 0 0 0 7. Contains uppercase character 34 34 34 34 8. Contains number 7 7 7 7 9. Contains hyphen 20 20 20 20 10. Capitalized and mid. sentence 0 33 33 33 11. All letters uppercase 0 30 30 30 12. VBPIVB feature 0 0 2 2 13. VBDIVBN feature 0 0 3 3 14. Particles, type 1 0 0 0 9 15. Particles, type 2 0 0 0 2,178  For example, the error on the proper noun category (NNP) accounts :for a significantly larger percent of the total error for unknown words than for known words. In the Baseline model, of the unknown word error 41.3% is due to words being NNP and assigned to some other category, or being of other category and assigned NNP. The percentage of the same type of error for known words is 16.2%.</Paragraph>
    <Paragraph position="4"> The incorporation of the following two feature schemas greatly improved NNP accuracy:  (1) A feature that looks at whether all the letters  of a word are uppercase. The feature that looked at capitalization before (cf. Table 1, feature No. 8) is activated when the word contains an uppercase character. This turns out to be a notable distinction because, for example, in titles in the WSJ data all words are in all uppercase, and the distribution of tags for these words is different from the overall distribution for words that contain an uppercase character.</Paragraph>
    <Paragraph position="5"> (2) A feature that is activated when the word contains an uppercase character and it is not at the start of a sentence. These word tokens also have a different tag distribution from the distribution for all tokens that contain an uppercase character.</Paragraph>
    <Paragraph position="6"> Conversely, empirically it was found that the prefix features for rare words were having a net negative effect on accuracy. We do not at present have a good explanation for this phenomenon. The addition of the features (1) and (2) and the removal of the prefix features considerably improved the accuracy on unknown words and the overall accuracy. The results on the test set after adding these features are shown below:  and removing prefix features.</Paragraph>
    <Paragraph position="7"> Unknown word error is reduced by 15% as compared to the Baseline model.</Paragraph>
    <Paragraph position="8"> It is important to note that (2) is composed of information already 'known' to the tagger in some sense. This feature can be viewed as the conjunction of two features, one of which is already in the baseline model, and the other of which is the negation of a feature existing in the baseline model - since for words at the beginning of a sentence, the preceding tag is the pseudo-tag NA, and there is a feature looking at the preceding tag. Even though-our maximum entropy model does not require independence among the predictors, it provides for free only a simple combination of feature weights, and additional 'interaction terms' are needed to model non-additive interactions (in log-space terms) between features.</Paragraph>
  </Section>
  <Section position="4" start_page="67" end_page="68" type="metho">
    <SectionTitle>
3 Features for Disambiguating Verb Forms
</SectionTitle>
    <Paragraph position="0"> Two of the most significant sources of classifier errors are the VBN/VBD ambiguity and the VBP/VB ambiguity. As seen in Table 5, VBN/VBD confusions account for 6.9% of the total word error. The VBP/VB confusions are a smaller 3.7% of the errors. In many cases it is easy for people (and for taggers) to determine the correct form. For example, if there is a to infinitive or a modal directly preceding the VB/VBP ambiguous word, the form is certainly non-finite. But often the modal can be several positions away from the current position - still obvious to a human, but out of sight for the baseline model.</Paragraph>
    <Paragraph position="1"> To help resolve a VB/VBP ambiguity in such cases, we can add a feature that looks at the preceding several words (we have chosen 8 as a threshold), but not across another verb, and activates if there is a to there, a modal verb, or a form of do, let, make, or help (verbs that frequently take a bare infinitive complement).</Paragraph>
    <Paragraph position="2"> Rather than having a separate feature look at each preceding position, we define one feature that looks at the chosen number of positions to the left. This both increases the scope of the available history for the tagger and provides a better statistic because it avoids fragmentation. We added a similar feature for resolving VBD/VBN confusions. It activates if there is a have or be auxiliary form in the preceding several positions (again the value 8 is used in the implementation).</Paragraph>
    <Paragraph position="3"> The form of these two feature templates was motivated by the structural rules of English and not induced from the training data, but it should be possible to look for &amp;quot;predictors&amp;quot; for certain parts of speech in the preceding words in the sentence by, for example, computing association strengths.</Paragraph>
    <Paragraph position="4"> The addition of the two feature schemas helped reduce the VB/VBP and VBD/VBN confusions. Below is the performance on the test set of the resulting model when features for disambiguating verb forms are added to the model of Section 2. The number of VB/VBP confusions  was reduced by 23.1% as compared to the baseline. The number of VBD/VBN confusions was reduced by 12.3%.</Paragraph>
  </Section>
  <Section position="5" start_page="68" end_page="69" type="metho">
    <SectionTitle>
4 Features for Particle Disambiguation
</SectionTitle>
    <Paragraph position="0"> As discussed in section 1.3 above, the task of determining RB/RP/IN tags for words like down, out, up is difficult and in particular examples, there are often no good local syntactic indicators.</Paragraph>
    <Paragraph position="1"> For instance, in (2), we find the exact same sequence of parts of speech, but (2a) is a particle use of on, while (2b) is a prepositional use.</Paragraph>
    <Paragraph position="2"> Consequently, the accuracy on the rarer RP (particles) category is as low as 41.5% for the Baseline model (cf. Table 4).</Paragraph>
    <Paragraph position="3"> (2) a. Kim took on the monster.</Paragraph>
    <Paragraph position="4"> b. Kim sat on the monster.</Paragraph>
    <Paragraph position="5"> We tried to improve the tagger's capability to resolve these ambiguities through adding information on verbs' preferences to take specific words as particles, or adverbs, or prepositions. There are verbs that take particles more than others, and particular words like out are much more likely to be used as a particle in the context of some verb than other words ambiguous between these tags.</Paragraph>
    <Paragraph position="6"> We added two different feature templates to capture this information, consisting as usual of a predicate on the history h, and a condition on the tag t. The first predicate is true if the current word is often used as a particle, and if there is a verb at most 3 positions to the left, which is &amp;quot;known&amp;quot; to have a good chance of taking the current word as a particle. The verb-particle pairs that are known by the system to be very common were collected through analysis of the training data in a preprocessing stage.</Paragraph>
    <Paragraph position="7"> The second feature template has the form: The last verb is v and the current word is w and w has been tagged as a particle and the current tag is t. The last verb is the pseudo-symbol NA if there is no verb in the previous three positions. These features were some help in reducing the RB/IN/RP confusions. The accuracy on the RP category rose to 44.3%. Although the overall confusions in this class were reduced, some of the errors were increased, for example, the number of INs classified as RBs rose slightly. There seems to be still considerable room to improve these results, though the attainable accuracy is limited by the accuracy with which these distinctions are marked in the Penn Treebank (on a quick informal study, this accuracy seems to be around 85%). The next table shows the final performance on the test set.</Paragraph>
    <Paragraph position="8">  For ease of comparison, the accuracies of all models on the test and development sets are shown in Table 7. We note that accuracy is lower on the development set. This presumably corresponds with Charniak's (2000: 136) observation that Section 23 of the Penn Treebank is easier than some others. Table 8 shows the different number of feature templates of each kind that have been instantiated for the different models as well as the total number of features each model has. It can be seen that the features which help disambiguate verb forms, which look at capitalization and the first of the feature templates for particles are a very small number as compared to the features of the other kinds. The improvement in classification accuracy therefore comes at the price of adding very few parameters to the maximum entropy model and does not result in increased model complexity.</Paragraph>
    <Paragraph position="9"> Conclusion Even when the accuracy figures for corpus-based part-of-speech taggers start to look extremely similar, it is still possible to move performance levels up. The work presented in this paper explored just a few information sources in addition to the ones usually used for tagging.</Paragraph>
    <Paragraph position="10"> While progress is slow, because each new feature applies only to a limited range of cases, nevertheless the improvement in accuracy as compared to previous results is noticeable, particularly for the individual decisions on which we focused.</Paragraph>
    <Paragraph position="11"> The potential of maximum entropy methods has not previously been fully exploited for the task of assignment of parts of speech. We incorporated into a maximum entropy-based tagger more linguistically sophisticated features, which are non-local and do not look just at particular positions in the text. We also added features that model the interactions of previously employed  predictors. All of these changes led to modest increases in tagging accuracy.</Paragraph>
    <Paragraph position="12"> This paper has thus presented some initial experiments in improving tagger accuracy through using additional information sources. In the future we hope to explore automatically discovering information sources that can be profitably incorporated into maximum entropy part-of-speech prediction.</Paragraph>
  </Section>
class="xml-element"></Paper>