<?xml version="1.0" standalone="yes"?> <Paper uid="P04-2011"> <Title>Beyond N in N-gram Tagging</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 A restriction of HMM tagging </SectionTitle> <Paragraph position="0"> HMM tagging rests on the simplifying assumption that the context of a given tag can be fully represented by just the previous two tags. This assumption leads to tagging errors when syntactic features that are needed to determine the tag at hand fall outside that range and are therefore ignored.</Paragraph> <Paragraph position="1"> One such error in tagging Dutch is related to the finiteness of verbs. It is discussed in section 2.1 and will be used to explain the proposed approach. Other possible applications of the technique include assignment of case in German, and assignment of chunk tags in addition to part-of-speech tags. These will be briefly discussed at the end of this paper.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 An example from Dutch </SectionTitle> <Paragraph position="0"> In experiments on tagging Dutch text performed in the context of (Prins and van Noord, 2004), the most frequent type of error is a typical example of a mistake caused by a lack of access to global context. In Dutch, the plural finite form of a verb is similar in appearance to the infinitive form of the verb. In example (1-a) the second verb in the sentence, vliegen, is correctly tagged as an infinitive, but in example (1-b) the added adverb creates a context in which the tagger incorrectly labels the verb as the finite plural form.</Paragraph> <Paragraph position="1"> Since a clause normally contains precisely one finite verb, this mistake could be avoided by remembering whether the finite verb for the current clause has already occurred, and using this information in classifying a newly observed verb as either finite or nonfinite. 
The trigram tagger has normally forgotten about any finite verb upon reaching a second verb, and is led into a mistake by other parts of the context even if the two verbs are close to each other.</Paragraph> <Paragraph position="2"> Basing the model on n-grams longer than trigrams is not a solution, as such n-grams would often not occur in the training data, making the associated probabilities hard to estimate.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Extending the model </SectionTitle> <Paragraph position="0"> Instead of considering longer n-grams, the model can be extended with specific long-distance context information. Just as sequences of tags can be modeled as a probabilistic network of events, modeling the probability of a tag given a number of preceding tags, the syntactic context can be modeled in the same way.</Paragraph> <Paragraph position="1"> For the example problem presented in section 2.1, this network would consist of two states: pre and post. In state pre the finite verb for the current clause has not yet been seen, while in state post it has. In general, the context feature C with values $C_1, \ldots, C_j$ and its probability distribution is to be incorporated in the model.</Paragraph> <Paragraph position="2"> In describing how the extra context information is added to the HMM, we will first look at how the standard model for POS tagging is constructed.</Paragraph> <Paragraph position="3"> Then the probability distribution on which the new model is based is introduced. 
A distinction is made between a naive approach, where the extra context is added to the model by extending the tagset, and a method where the context is added separately from the tags, which results in a much smaller increase in the number of probabilities to be estimated from the training data.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Standard model </SectionTitle> <Paragraph position="0"> In the standard second order HMM used for POS tagging (as described for example in chapter 10.2 of (Manning and Schütze, 1999)), a single state corresponds to two POS tags, and the observed symbols are words. The transitions between states are governed by probabilities that combine the probabilities for state transitions (tag sequences $t_{i-2}, t_{i-1}, t_i$) and output of observed symbols (words $w_i$):</Paragraph> <Paragraph position="2"> This probability distribution over tags and words is factorized into two separate distributions, using the chain rule $P(A, B \mid C) = P(A \mid C) \cdot P(B \mid C, A)$:</Paragraph> <Paragraph position="4"> Finally, the POS tagging assumption that the word only depends on the current tag is applied:</Paragraph> <Paragraph position="6"> If $\tau$ is the size of the tagset, $\omega$ the size of the vocabulary, and n the length of the tag n-grams used, then the number of parameters in this standard model is $\tau^n + \tau\omega$.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Extended model </SectionTitle> <Paragraph position="0"> As a starting point in adding the extra feature to the model, the same probability distribution used as a basis for the standard model is used:</Paragraph> <Paragraph position="2"> Naive method: extending the tagset. The contextual information C with j possible values could be added to the model by extending the set of tags, so that every tag t in the tagset is replaced by a set of tags $\{t_{c_1}, t_{c_2}, \ldots, t_{c_j}\}$. 
If $\tau$ is the size of the original tagset, then the number of parameters in this extended model would be $\tau^n j^n + \tau j \omega$, the number of tag n-grams being multiplied by eight in our example (j = 2 and n = 3, so $j^n = 8$). In experiments this increase in the number of parameters led to less accurate probability estimates.</Paragraph> <Paragraph position="3"> Better method: adding context to states as a separate feature. In order to avoid the problem associated with the naive method, the context feature is added to the states of the model separately from the tags. This way it is possible to combine probabilities from the different distributions in an appropriate manner, restricting the increase in the number of parameters. For example, the model can now be specified as first order with respect to the context feature. The probabilities associated with state transitions are defined as follows, where $c_i$ is the value of the new context feature at position i:</Paragraph> <Paragraph position="5"> As before, the probability distribution is factorized into separate distributions:</Paragraph> <Paragraph position="7"> The assumption made in the standard POS tagging model that words only depend on the corresponding tag is applied, as well as the assumption that the current context value only depends on the current tag and the previous context value:</Paragraph> <Paragraph position="9"> The total number of parameters for this model is $\tau^n j + \tau j^2 + \tau\omega$. In the case of the example problem this means the number of tag n-grams is multiplied by two. The experiments described in section 5 will make use of this model.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Training the model </SectionTitle> <Paragraph position="0"> The model's probabilities are estimated from annotated training data. Since the model is extended with global context, this context has to be part of the annotation. 
The Alpino wide-coverage parser for Dutch (Bouma et al., 2001) was used to automatically add the extra information to the data. For the example concerning finite plural verbs and infinitives, this means the parser labels every word in the sentence with one of the two possible context values. When the parser encounters a root clause (including imperative clauses and questions) or a subordinate clause (including relative clauses), it assigns the context value pre. When a finite verb is encountered, the value post is assigned. Past the end of a root clause or subordinate clause the context is reset to the value used before the embedded clause began. In all other cases, the value assigned to the previous position is continued.</Paragraph> <Paragraph position="1"> From the text annotated with POS tags and context labels, the n-gram probabilities and lexical probabilities needed by the model are estimated based on the frequencies of the corresponding sequences.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 The tagger </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Tagging method </SectionTitle> <Paragraph position="0"> The trigram HMM tagger used in the experiments of section 5 computes the a posteriori probability for every tag. This value is composed of the forward and backward probability of the tag at hand as defined in the forward-backward algorithm for HMM training. This idea is also described in (Jelinek, 1998) and (Charniak et al., 1996). The trigram data is combined with bigram and unigram data through linear interpolation to reduce the problem of sparse data.</Paragraph> <Paragraph position="1"> Applying the method known as linear interpolation, probabilities of unigrams, bigrams and trigrams are combined in a weighted sum using weights $\lambda_1$, $\lambda_2$ and $\lambda_3$ respectively. 
The weights are computed for every individual case using the notion of n-gram diversity (Collins, 1999). The diversity of an n-gram is the number of different tags that appear in the position following this n-gram in the training data. The weight $\lambda_3$ assigned to the trigram $t_1 t_2 t_3$ is computed on the basis of the diversity and frequency of the prefix bigram $t_1 t_2$, using the following equation, where c regulates the importance of diversity (c = 6 was used in the experiments described below), and C(x) and D(x) are respectively the count and diversity of x:</Paragraph> <Paragraph position="3"> The bigram weight $\lambda_2$ is computed as a fraction of $1 - \lambda_3$ using the bigram version of the above equation. The remaining weight $1 - \lambda_3 - \lambda_2$ is used as the unigram weight $\lambda_1$.</Paragraph> <Paragraph position="4"> The tagger uses a lexicon that has been created from the training data to assign an initial set of possible tags to every word. Words that were not seen during training are not in the lexicon, so another method has to be used to assign initial tags to these words. A technique described and implemented by Jan Daciuk (Daciuk, 1999) was used to create automata for associating words with tags based on suffixes of those words.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Tagging experiment </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Experiment setup 5.1.1 Method </SectionTitle> <Paragraph position="0"> An extended model was created featuring context information on the occurrence of the finite verb form. The tagger is used to tag a set of sentences, assigning one tag to each word, first using the standard model and then using the extended model. The results are compared in terms of tagging accuracy. 
The experiment is conducted twice, each time with a different data set for training and testing.</Paragraph> <Paragraph position="1"> The first set consists of a large amount of Dutch newspaper text that was annotated with syntactic tags by the Alpino parser. This is referred to as the Alpino data. The second and much smaller set of data is the Eindhoven corpus tagged with the Wotan tagset (Berghmans, 1994). This data set was also used in (van Halteren et al., 2001), so the second experiment allows for a comparison of the results with previous work on tagging Dutch. This data will be referred to as the Wotan data.</Paragraph> <Paragraph position="2"> For both sets the contextual information concerning finite verbs is added to the training data by the Alpino parser as described in section 3.3. Due to memory restrictions, the parser was not able to parse 265 of the 36K sentences of Wotan training data. These sentences received no contextual labels, and thus not all of the training data used in (van Halteren et al., 2001) could be used in the Wotan experiment.</Paragraph> <Paragraph position="3"> Training data for the Alpino experiment is four years of daily newspaper text, amounting to about 2M sentences (25M words). Test data is a collection of 3686 sentences (59K words) from the Parool newspaper. The data is annotated with a tagset consisting of 2825 tags. (The large size of the Alpino tagset is mainly due to a large number of infrequent tags representing specific uses of prepositions.) In the Wotan experiment, 36K sentences (628K words) are used for training (compared to 640K words in (van Halteren et al., 2001)), and 4176 sentences (72K words) are used for testing. The Wotan data is annotated with a tagset consisting of 345 tags (although 341 tags are reported in (van Halteren et al., 2001)). As a baseline method every word is assigned the tag it was most often seen with in the training data. 
Thus the baseline method is to tag each word w with a tag t such that $P(t \mid w)$ is maximized. Unknown words are represented by all words that occurred only once. The baseline accuracies are 85.9% on the Alpino data and 84.3% on the Wotan data.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Results </SectionTitle> <Paragraph position="0"> The results on the Alpino data are shown in table 1. Using the standard model, accuracy is 93.34% (3946 mistakes). Using the extended model, accuracy is 93.62% (3779 mistakes). This amounts to an overall error reduction of 4.23%. In tables 2 and 3 the 6 most frequent tagging errors are listed for tagging using the standard and extended model respectively. Mistakes where verb(pl) is mixed up with verb(inf) add up to 241 instances (6.11% of all mistakes) when using the standard model, as opposed to 82 cases (2.17%) using the extended model, an error reduction of 65.98%.</Paragraph> <Paragraph position="1"> The results on the Wotan data can be seen in table 4. Using the standard model, accuracy is 92.05% (5715 mistakes). This result is very similar to the 92.06% reported by Van Halteren, Zavrel and Daelemans in (van Halteren et al., 2001), who used the TnT trigram tagger (Brants, 2000) on the same training and testing data. Using the extended model, accuracy is 92.26% (5564 mistakes). This amounts to an overall error reduction of 2.64%. 
Mistakes where the plural verb is mixed up with the infinitive add up to 316 instances (5.53% of all mistakes) when using the standard model, as opposed to 199 cases (3.58%) using the extended model, an error reduction of 37.03%.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Discussion of results </SectionTitle> <Paragraph position="0"> Extending the standard trigram tagging model with syntactic information aimed at resolving the most frequent type of tagging error led to a considerable reduction of this type of error in stand-alone POS tagging experiments on two different data sets. At the same time, other types of errors were also reduced.</Paragraph> <Paragraph position="1"> The relative error reduction for the specific type of error involving finite and infinitive verb forms is almost twice as high in the case of the Alpino data as in the case of the Wotan data (respectively 65.98% and 37.03%). There are at least two possible explanations for this difference.</Paragraph> <Paragraph position="2"> The first is a difference in tagsets. Although the Wotan tagset is much smaller than the Alpino tagset, the former features a more detailed treatment of verbs. In the Alpino data, the difference between plural finite verb forms and nonfinite verb forms is represented through just two tags. In the Wotan data, this difference is represented by 20 tags. An extended model that predicts which of the two forms should be used in a given situation is therefore more complex in the case of the Wotan data.</Paragraph> <Paragraph position="3"> A further important difference between the two data sets is the available amount of training data (25 million words for the Alpino experiment compared to 628 thousand words for the Wotan experiment). In general a stochastic model such as the HMM will become more accurate when more training data is available. 
The Wotan experiment was repeated with increasing amounts of training data, and the results indicated that using more data would improve the results of both the standard and the extended model. The advantage of the extended model over the standard model increases slightly as more data is available, suggesting that the extended model would benefit more from extra data than the standard model.</Paragraph> </Section> </Section> </Paper>