<?xml version="1.0" standalone="yes"?> <Paper uid="J95-4004"> <Title>Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging</Title> <Section position="4" start_page="546" end_page="551" type="metho"> <SectionTitle> AAAAAA </SectionTitle> <Paragraph position="0"> Given the corpus above and the transformation: Change the label from A to B if the preceding label is A. If the effect of the application of a transformation is not written out until the entire file has been processed for that one transformation, then regardless of the order of processing the output will be: ABBBBB, since the triggering environment of a transformation is always checked before that transformation is applied to any surrounding objects in the corpus. If the effect of a transformation is recorded immediately, then processing the string left to right would result in: ABABAB, whereas processing right to left would result in: ABBBBB.</Paragraph> <Paragraph position="1"> 3. A Comparison With Decision Trees
The technique employed by the learner is somewhat similar to that used in decision trees (Breiman et al. 1984; Quinlan 1986; Quinlan and Rivest 1989). A decision tree is trained on a set of preclassified entities and outputs a set of questions that can be asked about an entity to determine its proper classification. Decision trees are built by finding the question whose resulting partition is the purest, splitting the training data according to that question, and then recursively reapplying this procedure to each resulting subset.</Paragraph> <Paragraph position="2"> We first show that the set of classifications that can be provided via decision trees is a proper subset of those that can be provided via transformation lists (an ordered list of transformation-based rules), given the same set of primitive questions. We then give some practical differences between the two learning methods.</Paragraph> <Section position="1" start_page="547" end_page="548" type="sub_section"> <SectionTitle> 3.1 Decision Trees ⊆ Transformation Lists </SectionTitle> <Paragraph position="0"> We prove here that for a fixed set of primitive queries, any binary decision tree can be converted into a transformation list. Extending the proof beyond binary trees is straightforward. [Figure not reproduced: the base-case decision tree.] This tree can be converted into the following transformation list [not reproduced].</Paragraph> <Paragraph position="1"> Assume that two decision trees T1 and T2 have corresponding transformation lists L1 and L2. Assume that the arbitrary label names chosen in constructing L1 are not used in L2, and that those in L2 are not used in L1. Given a new decision tree T3 constructed from T1 and T2 as follows [figure not reproduced: T3 has root query X, with T1 as its true branch and T2 as its false branch], we construct a new transformation list L3. Assume the first transformation in L1 is: Label with S' and the first transformation in L2 is: Label with S''. The first three transformations in L3 will then be:
1. Label with S
2. If X then S → S'
3. S → S''
followed by all of the rules in L1 other than the first rule, followed by all of the rules in L2 other than the first rule. The resulting transformation list will first label an item as S' if X is true, or as S'' if X is false. Next, the transformations from L1 will be applied if X is true, since S' is the initial-state label for L1. If X is false, the transformations from L2 will be applied, because S'' is the initial-state label for L2. ∎</Paragraph>
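<Paragraph> As a concrete illustration of the construction in this proof (this sketch is ours, not code from the paper; the names Leaf, Node, and to_transformation_list are illustrative), the following minimal Python program converts a binary decision tree into an equivalent transformation list by recursively combining the lists of its two subtrees, exactly as in the inductive step above, and then classifies objects by running the rules in order.

# A minimal sketch of the proof's construction, assuming a binary decision
# tree whose internal nodes hold a predicate and whose leaves hold labels.
import itertools

_counter = itertools.count()

class Leaf:
    def __init__(self, label):
        self.label = label

class Node:
    def __init__(self, query, if_true, if_false):
        self.query = query          # a predicate over the object being classified
        self.if_true = if_true
        self.if_false = if_false

def to_transformation_list(tree):
    """Return (initial_label, rules); each rule is (trigger, from_label, to_label).
    A trigger of None means the rule applies unconditionally."""
    if isinstance(tree, Leaf):
        # Base case: a one-leaf tree is "Label with L" and no further rules.
        return tree.label, []
    init1, rules1 = to_transformation_list(tree.if_true)
    init2, rules2 = to_transformation_list(tree.if_false)
    start = "S%d" % next(_counter)   # fresh start label, unused elsewhere
    # 1. Label with start; 2. If X then start -> init1; 3. start -> init2,
    # followed by the remaining rules of both sublists.
    rules = [(tree.query, start, init1), (None, start, init2)] + rules1 + rules2
    return start, rules

def classify(obj, tree_as_list):
    initial, rules = tree_as_list
    label = initial
    for trigger, frm, to in rules:
        if label == frm and (trigger is None or trigger(obj)):
            label = to
    return label

# Example: classify integers; "is it even?" then "is it greater than 10?" on the even branch.
tree = Node(lambda n: n % 2 == 0,
            Node(lambda n: n > 10, Leaf("big-even"), Leaf("small-even")),
            Leaf("odd"))
tlist = to_transformation_list(tree)
print([classify(n, tlist) for n in (3, 4, 12)])   # ['odd', 'small-even', 'big-even']
</Paragraph>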
</Section> <Section position="2" start_page="548" end_page="549" type="sub_section"> <SectionTitle> 3.2 Decision Trees ≠ Transformation Lists </SectionTitle> <Paragraph position="0"> We show here that there exist transformation lists for which no equivalent decision tree exists, for a fixed set of primitive queries. The following classification problem is one example. Given a sequence of characters, classify a character based on whether its position index is divisible by 4, querying using only a context of two characters to the left of the character being classified.</Paragraph> <Paragraph position="1"> Assuming transformations are applied left to right on the sequence, the above classification problem can be solved for sequences of arbitrary length if the effect of a transformation is written out immediately, or for sequences up to any prespecified length if a transformation is carried out only after all triggering environments in the corpus are checked. We present the proof for the former case.</Paragraph> <Paragraph position="2"> Given the input sequence:
A0 A1 A2 A3 A4 A5 A6 A7 A8 A9
the underlined characters should be classified as true because their indices are 0, 4, and 8. To see why a decision tree could not perform this classification, regardless of order of classification, note that, for the two characters before both A3 and A4, both the characters and their classifications are the same, although these two characters should be classified differently. Below is a transformation list for performing this classification. Once again, we assume transformations are applied left to right and that the result of a transformation is written out immediately, so that the result of applying transformation x to character a_i will always be known when applying transformation x to a_i+1.
1. Label with S
   RESULT: A/S A/S A/S A/S A/S A/S A/S A/S A/S A/S
2. If there is no previous character, then S → F
   RESULT: A/F A/S A/S A/S A/S A/S A/S A/S A/S A/S
3. If the character two to the left is labelled with F, then S → F
   RESULT: A/F A/S A/F A/S A/F A/S A/F A/S A/F A/S
4. If the character two to the left is labelled with F, then F → S
   RESULT: A/F A/S A/S A/S A/F A/S A/S A/S A/F A/S
5. F → yes
6. S → no
   RESULT: A/yes A/no A/no A/no A/yes A/no A/no A/no A/yes A/no
The extra power of transformation lists comes from the fact that intermediate results from the classification of one object are reflected in the current label of that object, thereby making this intermediate information available for use in classifying other objects. This is not the case for decision trees, where the outcome of questions asked is saved implicitly by the current location within the tree.</Paragraph>
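<Paragraph> The following minimal Python sketch (ours, not from the paper) applies the six transformations above to a ten-character sequence left to right, writing the effect of each change out immediately, and reproduces the classification shown in the RESULT lines: positions 0, 4, and 8 end up labelled yes.

# Apply the six transformations of Section 3.2 left to right, with the effect
# of each transformation written out immediately, so the label of a_i is
# already updated when a_{i+2} is examined.
def apply_rule(labels, rule):
    for i in range(len(labels)):
        labels[i] = rule(labels, i)
    return labels

labels = [None] * 10

rules = [
    # 1. Label with S
    lambda L, i: "S",
    # 2. If there is no previous character, then S -> F
    lambda L, i: "F" if i == 0 and L[i] == "S" else L[i],
    # 3. If the character two to the left is labelled with F, then S -> F
    lambda L, i: "F" if i >= 2 and L[i - 2] == "F" and L[i] == "S" else L[i],
    # 4. If the character two to the left is labelled with F, then F -> S
    lambda L, i: "S" if i >= 2 and L[i - 2] == "F" and L[i] == "F" else L[i],
    # 5. F -> yes
    lambda L, i: "yes" if L[i] == "F" else L[i],
    # 6. S -> no
    lambda L, i: "no" if L[i] == "S" else L[i],
]

for rule in rules:
    labels = apply_rule(labels, rule)

print(labels)
# ['yes', 'no', 'no', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no']
</Paragraph>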
</Section> <Section position="3" start_page="549" end_page="550" type="sub_section"> <SectionTitle> 3.3 Some Practical Differences Between Decision Trees and Transformation Lists </SectionTitle> <Paragraph position="0"> There are a number of practical differences between transformation-based error-driven learning and learning decision trees. One difference is that when training a decision tree, each time the depth of the tree is increased, the average amount of training material available per node at that new depth is halved (for a binary tree). In transformation-based learning, the entire training corpus is used for finding all transformations. Therefore, this method is not subject to the sparse data problems that arise as the depth of the decision tree being learned increases.</Paragraph> <Paragraph position="1"> Transformations are ordered, with later transformations being dependent upon the outcome of applying earlier transformations. This allows intermediate results in classifying one object to be available in classifying other objects. For instance, whether the previous word is tagged as to-infinitival or to-preposition may be a good cue for determining the part of speech of a word. If, initially, the word to is not reliably tagged everywhere in the corpus with its proper tag (or not tagged at all), then this cue will be unreliable. The transformation-based learner will delay positing a transformation triggered by the tag of the word to until other transformations have resulted in a more reliable tagging of this word in the corpus. For a decision tree to take advantage of this information, any word whose outcome is dependent upon the tagging of to would need the entire decision tree structure for the proper classification of each occurrence of to built into its decision tree path. If the classification of to were dependent upon the classification of yet another word, this would have to be built into the decision tree as well. Unlike decision trees, in transformation-based learning, intermediate classification results are available and can be used as classification progresses. Even if decision trees are applied to a corpus in a left-to-right fashion, they are allowed only one pass in which to properly classify.</Paragraph> <Paragraph position="2"> Since a transformation list is a processor and not a classifier, it can readily be used as a postprocessor to any annotation system. In addition to annotating from scratch, rules can be learned to improve the performance of a mature annotation system by using the mature system as the initial-state annotator. This can have the added advantage that the list of transformations learned using a mature annotation system as the initial-state annotator provides a readable description or classification of the errors the mature system makes, thereby aiding in the refinement of that system.</Paragraph> <Paragraph position="3"> The fact that it is a processor gives a transformation-based learner greater power than the classifier-based decision tree. For example, in applying transformation-based learning to parsing, a rule can apply any structural change to a tree. In tagging, a rule such as: Change the tag of the current word to X, and of the previous word to Y, if Z holds can easily be handled in the processor-based system, whereas it would be difficult to handle in a classification system.</Paragraph> <Paragraph position="4"> In transformation-based learning, the objective function used in training is the same as that used for evaluation, whenever this is feasible. In a decision tree, using system accuracy as an objective function for training typically results in poor performance, and some measure of node purity, such as entropy reduction, is used instead. The direct correlation between rules and performance improvement in transformation-based learning can make the learned rules more readily interpretable than decision tree rules for increasing population purity.</Paragraph>
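<Paragraph> As a small illustration of this extra freedom (a sketch of ours, not code from the paper; the particular condition Z used here is hypothetical), a processor-style rule that retags both the current and the previous word is just a function from one tagged corpus to another:

# Sketch: a processor-style rule may rewrite several tags at once.  The
# (hypothetical) condition Z is "current word is 'to' and the next tag is a
# verb"; the rule then retags both the current and the previous word.
def two_tag_rule(tagged, x="TO", y="VB", z=None):
    """tagged is a list of (word, tag) pairs; returns a new list."""
    if z is None:
        z = lambda s, i: s[i][0] == "to" and i + 1 < len(s) and s[i + 1][1].startswith("VB")
    out = list(tagged)
    for i in range(1, len(out)):
        if z(out, i):
            out[i] = (out[i][0], x)           # current word gets tag X
            out[i - 1] = (out[i - 1][0], y)   # previous word gets tag Y
    return out

sentence = [("want", "NN"), ("to", "IN"), ("go", "VB")]
print(two_tag_rule(sentence))
# [('want', 'VB'), ('to', 'TO'), ('go', 'VB')]
</Paragraph>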
</Section> <Section position="4" start_page="550" end_page="551" type="sub_section"> <SectionTitle> 4. Part-of-Speech Tagging: A Case Study in Transformation-Based Error-Driven Learning </SectionTitle> <Paragraph position="0"> In this section we describe the practical application of transformation-based learning to part-of-speech tagging. Part-of-speech tagging is a good application with which to test the learner, for several reasons. There are a number of large tagged corpora available, allowing for a variety of experiments to be run. Part-of-speech tagging is an active area of research; a great deal of work has been done in this area over the past few years (e.g., Jelinek 1985; Church 1988; DeRose 1988; Hindle 1989; DeMarcken 1990; Merialdo 1994; Brill 1992; Black et al. 1992; Cutting et al. 1992; Kupiec 1992; Charniak et al. 1993; Weischedel et al. 1993; Schutze and Singer 1994). Part-of-speech tagging is also a very practical application, with uses in many areas, including speech recognition and generation, machine translation, parsing, information retrieval, and lexicography. Insofar as tagging can be seen as a prototypical problem in lexical ambiguity, advances in part-of-speech tagging could readily translate to progress in other areas of lexical, and perhaps structural, ambiguity, such as word-sense disambiguation and prepositional phrase attachment disambiguation.[7] Also, it is possible to cast a number of other useful problems as part-of-speech tagging problems, such as letter-to-sound translation (Huang, Son-Bell, and Baggett 1994) and building pronunciation networks for speech recognition. Recently, a method has been proposed for using part-of-speech tagging techniques as a method for parsing with lexicalized grammars (Joshi and Srinivas 1994).</Paragraph> <Paragraph position="2"> When automated part-of-speech tagging was initially explored (Klein and Simmons 1963; Harris 1962), people manually engineered rules for tagging, sometimes with the aid of a corpus. As large corpora became available, it became clear that simple Markov-model based stochastic taggers that were automatically trained could achieve high rates of tagging accuracy (Jelinek 1985). Markov-model based taggers assign to a sentence the tag sequence that maximizes Prob(word | tag) * Prob(tag | previous n tags). These probabilities can be estimated directly from a manually tagged corpus.[8] These stochastic taggers have a number of advantages over the manually built taggers, including obviating the need for laborious manual rule construction, and possibly capturing useful information that may not have been noticed by the human engineer. However, stochastic taggers have the disadvantage that linguistic information is captured only indirectly, in large tables of statistics. Almost all recent work in developing automatically trained part-of-speech taggers has been on further exploring Markov-model based tagging (Jelinek 1985; Church 1988; DeRose 1988; DeMarcken 1990; Merialdo 1994; Cutting et al. 1992; Kupiec 1992; Charniak et al. 1993; Weischedel et al. 1993; Schutze and Singer 1994).</Paragraph> </Section> <Section position="5" start_page="551" end_page="551" type="sub_section"> <SectionTitle> 4.1 Transformation-based Error-driven Part-of-Speech Tagging </SectionTitle> <Paragraph position="0"> Transformation-based part-of-speech tagging works as follows.[9]
The initial-state annotator assigns each word its most likely tag as indicated in the training corpus. The method used for initially tagging unknown words will be described in a later section.</Paragraph> <Paragraph position="1"> An ordered list of transformations is then learned, to improve tagging accuracy based on contextual cues. These transformations alter the tagging of a word from X to Y iff either: (1) the word was not seen in the training corpus, or (2) the word was seen tagged with Y at least once in the training corpus.</Paragraph> <Paragraph position="2"> [7] In Brill and Resnik (1994), we describe an approach to prepositional phrase attachment disambiguation that obtains highly competitive performance compared to other corpus-based solutions to this problem. This system was derived in under two hours from the transformation-based part-of-speech tagger described in this paper. [8] One can also estimate these probabilities without a manually tagged corpus, using a hidden Markov model. However, it appears to be the case that directly estimating probabilities from even a very small manually tagged corpus gives better results than training a hidden Markov model on a large untagged corpus (see Merialdo (1994)). [9] Earlier versions of this work were reported in Brill (1992, 1994).</Paragraph> </Section> </Section> <Section position="5" start_page="551" end_page="561" type="metho"> <Paragraph position="0"> In taggers based on Markov models, the lexicon consists of probabilities of the somewhat counterintuitive but proper form P(WORD | TAG). In the transformation-based tagger, the lexicon is simply a list of all tags seen for a word in the training corpus, with one tag labeled as the most likely. Below we show a lexical entry for the word half in the transformation-based tagger.[10]
half: CD DT JJ NN PDT RB VB
This entry lists the seven tags seen for half in the training corpus, with NN marked as the most likely. Below are the lexical entries for half in a Markov model tagger, extracted from the same corpus [entries not reproduced]:</Paragraph> <Paragraph position="2"> It is difficult to make much sense of these entries in isolation; they have to be viewed in the context of the many contextual probabilities.</Paragraph> <Paragraph position="3"> [10] A description of the part-of-speech tags is provided in Appendix A.</Paragraph>
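<Paragraph> A minimal sketch (ours) of this lexicon representation: each entry records the tags observed for a word and the single most frequent one, which the initial-state annotator assigns. The toy corpus below exists only to show the data structure; real entries, such as the one for half above, come from the training corpus.

from collections import Counter, defaultdict

def build_lexicon(tagged_corpus):
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    # entry: (most likely tag, set of all tags seen)
    return {w: (c.most_common(1)[0][0], set(c)) for w, c in counts.items()}

def initial_state_annotate(words, lexicon, unknown_tag="NN"):
    return [(w, lexicon[w.lower()][0] if w.lower() in lexicon else unknown_tag)
            for w in words]

toy = [("half", "NN"), ("half", "NN"), ("half", "PDT"), ("runs", "VBZ")]
lex = build_lexicon(toy)
print(lex["half"])                                     # ('NN', {'NN', 'PDT'})
print(initial_state_annotate(["half", "runs", "blorp"], lex))
</Paragraph>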
<Paragraph position="4"> First, we will describe a nonlexicalized version of the tagger, where transformation templates do not make reference to specific words. In the nonlexicalized tagger, the transformation templates we use are: Change tag a to tag b when:
1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the two preceding (following) words is tagged z.
4. One of the three preceding (following) words is tagged z.
5. The preceding word is tagged z and the following word is tagged w.
6. The preceding (following) word is tagged z and the word two before (after) is tagged w.
where a, b, z, and w are variables over the set of parts of speech.</Paragraph> <Paragraph position="6"> To learn a transformation, the learner, in essence, tries out every possible transformation[11] and counts the number of tagging errors after each one is applied. After all possible transformations have been tried, the transformation that resulted in the greatest error reduction is chosen. Learning stops when no transformations can be found whose application reduces errors beyond some prespecified threshold. In the experiments described below, processing was done left to right. For each transformation application, all triggering environments are first found in the corpus, and then the transformation triggered by each triggering environment is carried out. The search is data-driven, so only a very small percentage of possible transformations really need be examined. In figure 3, we give pseudocode for the learning algorithm in the case where there is only one transformation template: Change the tag from X to Y if the previous tag is Z.</Paragraph> <Paragraph position="7"> [11] All possible instantiations of transformation templates.</Paragraph> <Paragraph position="8"> Figure 3: Pseudocode for learning transformations.
1. apply initial-state annotator to corpus
2. while transformations can still be found do
3.   for from_tag = tag1 to tagn
4.     for to_tag = tag1 to tagn
5.       for corpus_position = 1 to corpus_size
6.         if (correct_tag(corpus_position) == to_tag && current_tag(corpus_position) == from_tag)
7.           num_good_transformations(tag(corpus_position - 1))++
8.         else if (correct_tag(corpus_position) == from_tag && current_tag(corpus_position) == from_tag)
9.           num_bad_transformations(tag(corpus_position - 1))++
10.      find maxT (num_good_transformations(T) - num_bad_transformations(T))
11.      if this is the best-scoring rule found yet, then store as best rule: Change tag from from_tag to to_tag if previous tag is T
12.  apply best rule to training corpus
13.  append best rule to ordered list of transformations</Paragraph> <Paragraph position="9"> In each learning iteration, the entire training corpus is examined once for every pair of tags X and Y, finding the best transformation whose rewrite changes tag X to tag Y. For every word in the corpus whose environment matches the triggering environment, if the word has tag X and X is the correct tag, then making this transformation will result in an additional tagging error, so we increment the number of errors caused when making the transformation given the part-of-speech tag of the previous word (lines 8 and 9). If X is the current tag and Y is the correct tag, then the transformation will result in one less error, so we increment the number of improvements caused when making the transformation given the part-of-speech tag of the previous word (lines 6 and 7).</Paragraph> <Paragraph position="10"> In certain cases, a significant increase in speed for training the transformation-based tagger can be obtained by indexing where in the corpus different transformations can and do apply. For a description of a fast index-based training algorithm, see Ramshaw and Marcus (1994).</Paragraph>
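<Paragraph> The pseudocode in figure 3 translates fairly directly into the following runnable Python sketch (ours; the function and variable names are illustrative, and the threshold handling is an assumption). It greedily learns rules of the single template "change from_tag to to_tag if the previous tag is T", choosing at each iteration the instantiation with the largest net reduction in errors and applying it before searching again.

from collections import Counter

def learn_transformations(correct_tags, current_tags, threshold=1):
    current = list(current_tags)
    rules = []
    while True:
        best_rule, best_score = None, 0
        tags = set(correct_tags) | set(current)
        for from_tag in tags:
            for to_tag in tags:
                if from_tag == to_tag:
                    continue
                good, bad = Counter(), Counter()
                for i in range(1, len(current)):
                    if current[i] != from_tag:
                        continue
                    prev = current[i - 1]
                    if correct_tags[i] == to_tag:
                        good[prev] += 1        # this change would fix an error (lines 6-7)
                    elif correct_tags[i] == from_tag:
                        bad[prev] += 1         # this change would introduce an error (lines 8-9)
                for prev in good:
                    score = good[prev] - bad[prev]
                    if score > best_score:
                        best_score = score
                        best_rule = (from_tag, to_tag, prev)
        if best_rule is None or best_score < threshold:
            break
        from_tag, to_tag, prev = best_rule
        for i in range(1, len(current)):       # apply the best rule to the corpus
            if current[i] == from_tag and current[i - 1] == prev:
                current[i] = to_tag
        rules.append(best_rule)
    return rules, current

# Toy example: "to" followed by a noun that should really be a verb.
gold    = ["TO", "VB", "IN", "TO", "VB"]
initial = ["TO", "NN", "IN", "TO", "NN"]
print(learn_transformations(gold, initial))
# ([('NN', 'VB', 'TO')], ['TO', 'VB', 'IN', 'TO', 'VB'])
</Paragraph>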
<Paragraph position="11"> [Figure 4: the first twenty learned transformations; not fully reproduced here.] In figure 4, we list the first twenty transformations learned from training on the Penn Treebank Wall Street Journal Corpus (Marcus, Santorini, and Marcinkiewicz 1993). The first transformation states that a noun should be changed to a verb if the previous tag is TO, as in: to/TO conflict/NN→VB with. The second transformation fixes a tagging such as: might/MD vanish/VBP→VB. The third fixes might/MD not reply/NN→VB. The tenth transformation is for the token 's, which is a separate token in the Penn Treebank. 's is most frequently used as a possessive ending, but after a personal pronoun, it is a verb (John 's, compared to he 's). The transformations changing IN to WDT are for tagging the word that, to determine in which environments that is being used as a synonym of which.</Paragraph> </Section> <Section position="1" start_page="553" end_page="557" type="sub_section"> <SectionTitle> 4.2 Lexicalizing the Tagger </SectionTitle> <Paragraph position="0"> In general, no relationships between words have been directly encoded in stochastic n-gram taggers.[13] In the Markov model typically used for stochastic tagging, state transition probabilities (P(Tagi | Tagi-1 ... Tagi-n)) express the likelihood of a tag immediately following n other tags, and emit probabilities (P(Wordj | Tagi)) express the likelihood of a word, given a tag. Many useful relationships, such as that between a word and the previous word, or between a tag and the following word, are not directly captured by Markov-model based taggers. The same is true of the nonlexicalized transformation-based tagger, where transformation templates do not make reference to words.</Paragraph> <Paragraph position="1"> [13] In Kupiec (1992), a limited amount of lexicalization is introduced by having a stochastic tagger with word states for the 100 most frequent words in the corpus.</Paragraph> <Paragraph position="2"> To remedy this problem, we extend the transformation-based tagger by adding contextual transformations that can make reference to words as well as part-of-speech tags. The transformation templates we add are: Change tag a to tag b when:
The preceding (following) word is w.
The word two before (after) is w.
One of the two preceding (following) words is w.
The current word is w and the preceding (following) word is x.
The current word is w and the preceding (following) word is tagged z.
The current word is w.
The preceding (following) word is w and the preceding (following) tag is t.
The current word is w, the preceding (following) word is w2, and the preceding (following) tag is t.
where w and x are variables over all words in the training corpus, and z and t are variables over all parts of speech.</Paragraph>
<Paragraph position="3"> Below we list two lexicalized transformations that were learned, training once again on the Wall Street Journal. Change the tag:
(12) From IN to RB if the word two positions to the right is as.
(16) From VBP to VB if one of the previous two words is n't.[14]</Paragraph> <Paragraph position="4"> [14] In the Penn Treebank, n't is treated as a separate token, so don't becomes do/VBP n't/RB.</Paragraph> <Paragraph position="5"> The Penn Treebank tagging style manual specifies that in the collocation as ... as, the first as is tagged as an adverb and the second is tagged as a preposition. Since as is most frequently tagged as a preposition in the training corpus, the initial-state tagger will mistag the phrase as tall as as: as/IN tall/JJ as/IN. The first lexicalized transformation corrects this mistagging. Note that a bigram tagger trained on our training set would not correctly tag the first occurrence of as. Although adverbs are more likely than prepositions to follow some verb form tags, the fact that P(as | IN) is much greater than P(as | RB), and P(JJ | IN) is much greater than P(JJ | RB), leads to as being incorrectly tagged as a preposition by a stochastic tagger. A trigram tagger will correctly tag this collocation in some instances, since it conditions on more of the surrounding context, but the outcome will depend upon the context in which this collocation appears.</Paragraph> <Paragraph position="6"> The second transformation arises from the fact that when a verb appears in a context such as We do n't eat or We did n't usually drink, the verb is in base form. A stochastic trigram tagger would have to capture this linguistic information indirectly from frequency counts of all trigrams of the form shown in figure 5 (where a star can match any part-of-speech tag) and from the fact that P(n't | RB) is fairly high.</Paragraph>
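<Paragraph> A brief sketch (ours, not the tagger's actual code) of how learned transformation (12) repairs the initial-state tagging of as tall as:

# Rule (12): change IN to RB if the word two positions to the right is "as".
def rule_12(tagged):
    out = list(tagged)
    for i in range(len(out)):
        if out[i][1] == "IN" and i + 2 < len(out) and out[i + 2][0].lower() == "as":
            out[i] = (out[i][0], "RB")
    return out

phrase = [("as", "IN"), ("tall", "JJ"), ("as", "IN")]   # initial-state tagging
print(rule_12(phrase))
# [('as', 'RB'), ('tall', 'JJ'), ('as', 'IN')]  -- the Treebank convention: RB ... IN
</Paragraph>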
<Paragraph position="7"> In Weischedel et al. (1993), results are given when training and testing a Markov-model based tagger on the Penn Treebank Tagged Wall Street Journal Corpus. They cite results making the closed vocabulary assumption that all possible tags for all words in the test set are known. When training contextual probabilities on one million words, an accuracy of 96.7% was achieved. Accuracy dropped to 96.3% when contextual probabilities were trained on 64,000 words. We trained the transformation-based tagger on the same corpus, making the same closed-vocabulary assumption.[15] When training contextual rules on 600,000 words, an accuracy of 97.2% was achieved on a separate 150,000 word test set. When the training set was reduced to 64,000 words, accuracy dropped to 96.7%. The transformation-based learner achieved better performance, despite the fact that contextual information was captured in a small number of simple nonstochastic rules, as opposed to 10,000 contextual probabilities that were learned by the stochastic tagger. These results are summarized in table 1. When training on 600,000 words, a total of 447 transformations were learned. However, transformations toward the end of the list contribute very little to accuracy: applying only the first 200 learned transformations to the test set achieves an accuracy of 97.0%; applying the first 100 gives an accuracy of 96.8%. To match the 96.7% accuracy achieved by the stochastic tagger when it was trained on one million words, only the first 82 transformations are needed.</Paragraph> <Paragraph position="8"> [15] In both Weischedel et al. (1993) and here, the test set was incorporated into the lexicon, but was not used in learning contextual information. Testing with no unknown words might seem like an unrealistic test. We have done so for three reasons: (1) to allow for a comparison with previously quoted results, (2) to isolate known word accuracy from unknown word accuracy, and (3) in some systems, such as a closed vocabulary speech recognition system, the assumption that all words are known is valid. (We show results when unknown words are included later in the paper.)</Paragraph> <Paragraph position="9"> To see whether lexicalized transformations were contributing to the transformation-based tagger accuracy rate, we first trained the tagger using the nonlexical transformation template subset, then ran exactly the same test. Accuracy of that tagger was 97.0%. Adding lexicalized transformations resulted in a 6.7% decrease in the error rate (see table 1).[16] We found it a bit surprising that the addition of lexicalized transformations did not result in a much greater improvement in performance. When transformations are allowed to make reference to words and word pairs, some relevant information is probably missed due to sparse data. We are currently exploring the possibility of incorporating word classes into the rule-based learner, in hopes of overcoming this problem. The idea is quite simple. Given any source of word class information, such as WordNet (Miller 1990), the learner is extended such that a rule is allowed to make reference to parts of speech, words, and word classes, allowing for rules such as: Change the tag from X to Y if the following word belongs to word class Z. This approach has already been successfully applied to a system for prepositional phrase attachment disambiguation (Brill and Resnik 1994).</Paragraph> <Paragraph position="10"> [16] The training we did here was slightly suboptimal, in that we used the contextual rules learned with unknown words (described in the next section), and filled in the dictionary, rather than training on a corpus without unknown words.</Paragraph> </Section> <Section position="2" start_page="557" end_page="560" type="sub_section"> <SectionTitle> 4.3 Tagging Unknown Words </SectionTitle> <Paragraph position="0"> So far, we have not addressed the problem of unknown words. As stated above, the initial-state annotator for tagging assigns all words their most likely tag, as indicated in a training corpus. Below we show how a transformation-based approach can be taken for tagging unknown words, by automatically learning cues to predict the most likely tag for words not seen in the training corpus. If the most likely tag for unknown words can be assigned with high accuracy, then the contextual rules can be used to improve accuracy, as described above.</Paragraph> <Paragraph position="1"> In the transformation-based unknown-word tagger, the initial-state annotator naively assumes the most likely tag for an unknown word is "proper noun" if the word is capitalized and "common noun" otherwise.[17] Below, we list the set of allowable transformations.</Paragraph> <Paragraph position="2"> Change the tag of an unknown word (from X) to Y if:
Deleting the prefix (suffix) x, |x| ≤ 4, results in a word (x is any string of length 1 to 4).
The first (last) (1, 2, 3, 4) characters of the word are x.
Adding the character string x as a prefix (suffix) results in a word (|x| ≤ 4).
Word w ever appears immediately to the left (right) of the word.
Character z appears in the word.</Paragraph> <Paragraph position="7"> [17] If we change the tagger to tag all unknown words as common nouns, then a number of rules are learned of the form: change tag to proper noun if the prefix is "E", "A", "B", etc., since the learner is not provided with the concept of upper case in its set of transformation templates.</Paragraph>
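<Paragraph> A minimal sketch (ours) of the unknown-word machinery just described: the naive initial-state annotator plus two hand-picked instantiations of the templates above. The specific rules shown are illustrative stand-ins, not the rules actually learned in figure 6.

def initial_unknown_tag(word):
    # "proper noun" if capitalized, "common noun" otherwise
    return "NNP" if word[:1].isupper() else "NN"

def make_suffix_rule(suffix, from_tag, to_tag):
    # template: "the last characters of the word are x"
    return lambda word, tag: to_tag if tag == from_tag and word.endswith(suffix) else tag

def make_delete_suffix_rule(suffix, known_words, from_tag, to_tag):
    # template: "deleting the suffix x, |x| <= 4, results in a word"
    return lambda word, tag: (to_tag if tag == from_tag and word.endswith(suffix)
                              and word[:-len(suffix)] in known_words else tag)

known_words = {"walk", "tall"}
rules = [
    make_suffix_rule("s", "NN", "NNS"),                        # illustrative
    make_delete_suffix_rule("ing", known_words, "NN", "VBG"),  # illustrative
]

def tag_unknown(word):
    tag = initial_unknown_tag(word)
    for rule in rules:
        tag = rule(word, tag)
    return tag

for w in ["tables", "walking", "Paris"]:
    print(w, tag_unknown(w))
# tables NNS / walking VBG / Paris NNP
</Paragraph>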
<Paragraph position="8"> [Figure 6: The first 20 transformations for unknown words; figure not reproduced.]</Paragraph> <Paragraph position="9"> An unannotated text can be used to check the conditions in all of the above transformation templates. Annotated text is necessary in training to measure the effect of transformations on tagging accuracy. Since the goal is to label each lexical entry for new words as accurately as possible, accuracy is measured on a per-type and not a per-token basis.</Paragraph> <Paragraph position="10"> Figure 6 shows the first 20 transformations learned for tagging unknown words in the Wall Street Journal corpus. As an example of how rules can correct errors generated by prior rules, note that applying the first transformation will result in the mistagging of the word actress. The 18th learned rule fixes this problem. This rule states: Change a tag from plural common noun to singular common noun if the word has suffix ss.</Paragraph> <Paragraph position="11"> Keep in mind that no specific affixes are prespecified. A transformation can make reference to any string of characters up to a bounded length. So while the first rule specifies the English suffix "s", the rule learner was not constrained from considering such nonsensical rules as: Change a tag to adjective if the word has suffix "xhqr". Also, absolutely no English-specific information (such as an affix list) need be prespecified in the learner.[18]</Paragraph> <Paragraph position="12"> [18] This learner has also been applied to tagging Old English. See Brill (1993b). Although the transformations are not English-specific, the set of transformation templates would have to be extended to process languages with dramatically different morphology.</Paragraph> <Paragraph position="13"> We then ran the following experiment using 1.1 million words of the Penn Treebank Tagged Wall Street Journal Corpus. Of these, 950,000 words were used for training and 150,000 words were used for testing. Annotations of the test corpus were not used in any way to train the system. From the 950,000 word training corpus, 350,000 words were used to learn rules for tagging unknown words, and 600,000 words were used to learn contextual rules; 243 rules were learned for tagging unknown words, and 447 contextual tagging rules were learned. Unknown word accuracy on the test corpus was 82.2%, and overall tagging accuracy on the test corpus was 96.6%. To our knowledge, this is the highest overall tagging accuracy ever quoted on the Penn Treebank Corpus when making the open vocabulary assumption. Using the tagger without lexicalized rules, an overall accuracy of 96.3% and an unknown word accuracy of 82.0% is obtained. A graph of accuracy as a function of transformation number on the test set for lexicalized rules is shown in figure 7. Before applying any transformations, test set accuracy is 92.4%, so the transformations reduce the error rate by 50% over the baseline. The high baseline accuracy is somewhat misleading, as this includes the tagging of unambiguous words. Baseline accuracy when the words that are unambiguous in our lexicon are not considered is 86.4%. However, it is difficult to compare taggers using this figure, as the accuracy of the system depends on the particular lexicon used. For instance, in our training set the word the was tagged with a number of different tags, and so according to our lexicon the is ambiguous. If we instead used a lexicon where the is listed unambiguously as a determiner, the baseline accuracy would be 84.6%.</Paragraph> <Paragraph position="14"> For tagging unknown words, each word is initially assigned a part-of-speech tag based on word and word-distribution features. Then, the tag may be changed based on contextual cues, via contextual transformations that are applied to the entire corpus, both known and unknown words. When the contextual rule learner learns transformations, it does so in an attempt to maximize overall tagging accuracy, and not unknown-word tagging accuracy. Unknown words account for only a small percentage of the corpus in our experiments, typically two to three percent. Since the distributional behavior of unknown words is quite different from that of known words, and since a transformation that does not increase unknown-word tagging accuracy can still be beneficial to overall tagging accuracy, the contextual transformations learned are not optimal in the sense of leading to the highest tagging accuracy on unknown words. Better unknown-word accuracy may be possible by training and using two sets of contextual rules, one maximizing known-word accuracy and the other maximizing unknown-word accuracy, and then applying the appropriate transformations to a word when tagging, depending upon whether the word appears in the lexicon. We are currently experimenting with this idea.</Paragraph>
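<Paragraph> Putting the pieces together, the overall procedure described in this section can be sketched as follows (ours; the helper names and the toy lexicon are assumptions, not the paper's code): known words receive their most likely lexicon tag, unknown words receive a tag from word-form cues, and the ordered contextual transformations are then applied to the entire corpus.

def tag_corpus(words, lexicon, tag_unknown, contextual_rules):
    tagged = []
    for w in words:
        if w.lower() in lexicon:
            tagged.append((w, lexicon[w.lower()]))   # most likely tag for a known word
        else:
            tagged.append((w, tag_unknown(w)))       # word-form based guess
    for rule in contextual_rules:                    # ordered list of transformations
        tagged = rule(tagged)
    return tagged

# Toy stand-ins, only to make the sketch runnable.
lexicon = {"we": "PRP", "want": "VBP", "to": "TO"}
tag_unknown = lambda w: "NNP" if w[:1].isupper() else "NN"

def nn_to_vb_after_to(tagged):
    # shaped like the first learned contextual transformation (NN to VB after TO)
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == "NN" and out[i - 1][1] == "TO":
            out[i] = (out[i][0], "VB")
    return out

print(tag_corpus(["We", "want", "to", "blorf"], lexicon, tag_unknown, [nn_to_vb_after_to]))
# [('We', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('blorf', 'VB')]
</Paragraph>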
<Paragraph position="15"> In Weischedel et al. (1993), a statistical approach to tagging unknown words is shown. In this approach, a number of suffixes and important features are prespecified. Then, for unknown words:
p(W | T) = p(unknown word | T) * p(Capitalize-feature | T) * p(suffixes, hyphenation | T)
Using this equation for unknown word emit probabilities within the stochastic tagger, an accuracy of 85% was obtained on the Wall Street Journal corpus. This portion of the stochastic model has over 1,000 parameters, with 10^8 possible unique emit probabilities, as opposed to a small number of simple rules that are learned and used in the rule-based approach. In addition, the transformation-based method learns specific cues instead of requiring them to be prespecified, allowing for the possibility of uncovering cues not apparent to the human language engineer. We have obtained comparable performance on unknown words, while capturing the information in a much more concise and perspicuous manner, and without prespecifying any information specific to English or to a specific corpus.</Paragraph> <Paragraph position="16"> In table 2, we show tagging results obtained on a number of different corpora, in each case training on roughly 9.5 x 10^5 words total and testing on a separate test set of 1.5-2 x 10^5 words. Accuracy is consistent across these corpora and tag sets.</Paragraph> <Paragraph position="17"> In addition to obtaining high rates of accuracy and representing relevant linguistic information in a small set of rules, the part-of-speech tagger can also be made to run extremely fast. Roche and Schabes (1995) show a method for converting a list of tagging transformations into a deterministic finite state transducer with one state transition taken per word of input; the result is a transformation-based tagger whose tagging speed is about ten times that of the fastest Markov-model tagger.</Paragraph> </Section> <Section position="3" start_page="560" end_page="561" type="sub_section"> <SectionTitle> 4.4 K-Best Tags </SectionTitle>
<Paragraph position="0"> There are certain circumstances where one is willing to relax the one-tag-per-word requirement in order to increase the probability that the correct tag will be assigned to each word. In DeMarcken (1990) and Weischedel et al. (1993), k-best tags are assigned within a stochastic tagger by returning all tags within some threshold of probability of being correct for a particular word.</Paragraph> <Paragraph position="1"> We can modify the transformation-based tagger to return multiple tags for a word by making a simple modification to the contextual transformations described above. The initial-state annotator is the tagging output of the previously described one-best transformation-based tagger. The allowable transformation templates are the same as the contextual transformation templates listed above, but with the rewrite rule change tag X to tag Y modified to add tag X to tag Y or add tag X to word W. Instead of changing the tagging of a word, transformations now add alternative taggings to a word.</Paragraph> <Paragraph position="2"> When allowing more than one tag per word, there is a trade-off between accuracy and the average number of tags for each word. Ideally, we would like to achieve as large an increase in accuracy as possible with as few extra tags as possible. Therefore, in training we find transformations that maximize the function:
(number of corrected errors) / (number of additional tags)</Paragraph> <Paragraph position="3"> In table 3, we present results from first using the one-tag-per-word transformation-based tagger described in the previous section and then applying the k-best tag transformations. These transformations were learned from a separate 240,000 word corpus. As a baseline, we did k-best tagging of a test corpus. Each known word in the test corpus was tagged with all tags seen with that word in the training corpus, and the five most likely unknown-word tags were assigned to all words not seen in the training corpus. This resulted in an accuracy of 99.0%, with an average of 2.28 tags per word. The transformation-based tagger obtained the same accuracy with 1.43 tags per word, one third the number of additional tags as the baseline tagger.</Paragraph>
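<Paragraph> A small sketch (ours, not the paper's code) of the k-best training criterion - net errors corrected divided by additional tags introduced - scored here for a single candidate rule of the form "add tag Y if the previous tag is T":

# An error is corrected when the correct tag, previously missing from a
# word's tag set, becomes covered; the cost is the number of tags added.
def score_add_rule(correct_tags, tag_sets, best_tags, new_tag, trigger):
    corrected, added = 0, 0
    for i in range(1, len(correct_tags)):
        if best_tags[i - 1] == trigger and new_tag not in tag_sets[i]:
            added += 1
            if correct_tags[i] == new_tag:
                corrected += 1
    return corrected / added if added else 0.0

# Toy example: after "TO", also allow VB on words currently tagged only NN.
correct  = ["TO", "VB", "DT", "NN"]
best     = ["TO", "NN", "DT", "NN"]
tag_sets = [{"TO"}, {"NN"}, {"DT"}, {"NN"}]
print(score_add_rule(correct, tag_sets, best, "VB", "TO"))   # 1.0
</Paragraph> </Section> </Section> </Paper>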