<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1080">
<Title>Part of Speech Tagging in Context</Title>
<Section position="4" start_page="0" end_page="2" type="metho">
<SectionTitle>3 Building More Context into HMM Tagging</SectionTitle>
<Paragraph position="0"> In a traditional HMM tagger, the probability of transitioning into a state representing tag t_i is conditioned only on the previous tag t_{i-1}, and the probability of a word w_i is conditioned only on the current tag t_i. This formulation ignores dependencies that may exist between a word and the part-of-speech tags of the words which precede and follow it. For example, verbs which subcategorize strongly for a particular part of speech but can also be tagged as nouns or pronouns (e.g. &quot;thinking that&quot;) may benefit from modeling dependencies on future tags. To model this relationship, we now estimate the probability of a word w_i conditioned on the previous, current, and following tags, P(w_i | t_{i-1}, t_i, t_{i+1}). This change in structure, which we will call a contextualized HMM, is depicted in Figure 1. This type of structure is analogous to the context-dependent phone models used in acoustic modeling for speech recognition (e.g. Young, 1999, Section 4.3).</Paragraph>
<Section position="1" start_page="2" end_page="2" type="sub_section">
<SectionTitle>3.1 Model Definition</SectionTitle>
<Paragraph position="0"> In order to build both left and right context into an HMM part-of-speech tagger, we reformulate the joint probability of a tag sequence T = t_1 ... t_N and word sequence W = w_1 ... w_N as P(W, T) = \prod_{i=1}^{N} P(t_i | t_{i-1}) P(w_i | t_{i-1}, t_i, t_{i+1}). Given that we are using an increased context size during the estimation of lexical probabilities, thus fragmenting the data, we have found it desirable to smooth these estimates, for which we use a standard absolute discounting scheme (Ney, Essen and Kneser, 1994).</Paragraph>
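The sketch below illustrates how these contextualized lexical probabilities can be estimated with interpolated absolute discounting, backing off to the conventional P(w_i | t_i) estimate. It is an illustration under assumptions rather than the paper's implementation: the discount value, the choice of backoff distribution, and the use of directly observed counts (in unsupervised training these would be expected counts from forward-backward re-estimation) are all choices made here for concreteness.

```python
from collections import defaultdict

D = 0.5  # discount constant; an assumption here, in practice tuned on held-out data

ctx_counts = defaultdict(dict)  # (t_{i-1}, t_i, t_{i+1}) -> {word: count}
tag_counts = defaultdict(dict)  # t_i -> {word: count}

def _bump(table, key, word):
    table[key][word] = table[key].get(word, 0) + 1

def accumulate(tagged_sentences):
    """Collect counts from sentences given as lists of (word, tag) pairs;
    boundary symbols stand in for t_0 and t_{N+1}."""
    for sent in tagged_sentences:
        tags = ["<s>"] + [t for _, t in sent] + ["</s>"]
        for i, (word, _) in enumerate(sent):
            _bump(ctx_counts, (tags[i], tags[i + 1], tags[i + 2]), word)
            _bump(tag_counts, tags[i + 1], word)

def p_word_given_tag(word, tag):
    """Backoff distribution: relative-frequency estimate of P(w | t_i)."""
    counts = tag_counts.get(tag, {})
    total = sum(counts.values())
    return counts.get(word, 0) / total if total else 0.0

def p_word_given_context(word, prev_tag, tag, next_tag):
    """P(w_i | t_{i-1}, t_i, t_{i+1}) with interpolated absolute discounting:
    every observed count is discounted by D, and the freed probability mass
    goes to the backoff distribution P(w | t_i)."""
    counts = ctx_counts.get((prev_tag, tag, next_tag), {})
    total = sum(counts.values())
    if total == 0:  # tag context never observed: back off entirely
        return p_word_given_tag(word, tag)
    discounted = max(counts.get(word, 0) - D, 0.0) / total
    backoff_mass = D * len(counts) / total  # len(counts) = distinct words seen
    return discounted + backoff_mass * p_word_given_tag(word, tag)
```

Because the discounted mass D * len(counts) / total is exactly what the max(c - D, 0) terms give up, the estimate remains a proper distribution whenever the backoff distribution is one.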
</Section>
</Section>
<Section position="5" start_page="2" end_page="2" type="metho">
<SectionTitle>4 Unsupervised Tagging: A Comparison</SectionTitle>
<Section position="1" start_page="2" end_page="2" type="sub_section">
<SectionTitle>4.1 Corpora and Lexicon Construction</SectionTitle>
<Paragraph position="0"> For our comparison of unsupervised tagging methods, we implemented the HMM taggers described in Merialdo (1991) and Kupiec (1992), as well as the UTBL tagger described in Brill (1995). We also implemented a version of the contextualized HMM using the type of word classes utilized in the Kupiec model. The algorithms were trained and tested using version 3 of the Penn Treebank, with the training, development, and test split described in Collins (2002) and also employed by Toutanova et al. (2003) in testing their supervised tagging algorithm. Specifically, we allocated sections 00-18 for training, 19-21 for development, and 22-24 for testing. To avoid the problem of unknown words, each learner was provided with a lexicon constructed from tagged versions of the full Treebank. We did not begin with any estimates of the likelihoods of tags for words, but only the knowledge of which tags are possible for each word in the lexicon, i.e., something we could obtain from a manually constructed dictionary.</Paragraph>
</Section>
<Section position="2" start_page="2" end_page="2" type="sub_section">
<SectionTitle>4.2 The Effect of Lexicon Construction on Tagging Accuracy</SectionTitle>
<Paragraph position="0"> To our surprise, we found the initial tag accuracies of all methods using the full lexicon extracted from the Penn Treebank to be significantly lower than previously reported. We discovered this was due to several factors. One issue we noticed was that of a frequently occurring word being mistagged during Treebank construction, as shown in the example in Figure 2a. Since we are not starting out with any known estimates for the probabilities of tags given a word, the learner considers this tag to be just as likely as the word's other, more probable, possibilities. In another, more frequently occurring scenario, human annotators have chosen to tag all words in multi-word names, such as titles, with the proper-noun tag, NNP (Figure 2b). This has the effect of adding noise to the set of tags for many closed-class words.</Paragraph>
<Paragraph position="1"> Finally, we noticed that a certain number of frequently occurring words (e.g. a, to, of) are sometimes labeled with infrequently occurring tags (e.g. SYM, RB), as exemplified in Figure 2c. In the case of the HMM taggers, where we begin with uniform estimates of both the state transition probabilities and the lexical probabilities, the learner finds it difficult to distinguish between more and less probable tag assignments.</Paragraph>
<Paragraph position="2"> We later discovered that previous implementations of UTBL limited which part-of-speech assignments were placed into the lexicon, a step that was not explicitly detailed in the published reports. We then simulated, in a similar fashion, the construction of higher-quality lexicons by using the relative frequencies of tags for each word in the tagged Treebank to limit allowable word-tag assignments. That is, tags that appeared as the tag of a particular word less than X% of the time were omitted from that word's set of possible tags. We varied this threshold until accuracy no longer changed significantly on our held-out data. The effect of thresholding tags based on their relative frequency in the training set is shown for our set of part-of-speech taggers in Figure 3. As shown in Table 1, eliminating noisy part-of-speech assignments raised accuracy back into the realm of previously published results. The best test set accuracies for the learners in the class of HMM taggers are also given in Table 1.</Paragraph>
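A minimal sketch of this filtering step appears below, representing the lexicon as a map from words to their allowed tag sets. The 10% default threshold and the rule of keeping a word's single most frequent tag when all of its tags fall below the threshold are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter, defaultdict

def build_filtered_lexicon(tagged_corpus, threshold=0.10):
    """Build a word -> set-of-allowed-tags lexicon, omitting any tag that
    accounts for less than `threshold` of a word's occurrences."""
    tag_freq = defaultdict(Counter)  # word -> Counter over its tags
    for sent in tagged_corpus:       # sentences as lists of (word, tag)
        for word, tag in sent:
            tag_freq[word][tag] += 1
    lexicon = {}
    for word, counts in tag_freq.items():
        total = sum(counts.values())
        allowed = {t for t, c in counts.items() if c / total >= threshold}
        # Keep at least the most frequent tag so no word ends up tagless.
        lexicon[word] = allowed or {counts.most_common(1)[0][0]}
    return lexicon
```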
</Section>
</Section>
<Section position="6" start_page="2" end_page="2" type="metho">
<SectionTitle>5 Unsupervised Training With Noisy Lexicons</SectionTitle>
<Paragraph position="0"> While placing informed limitations on the tags that can be included in a lexicon can dramatically improve results, it depends on some form of supervision: either manually tagged data or a human editor who post-filters an automatically constructed list. In the interest of being as unsupervised as possible, we sought a way to cope with the noisy aspects of the unfiltered lexicon described in the previous section.</Paragraph>
<Paragraph position="1"> We suspected that in order to better control the training of the lexical probabilities, it would help to have a stable model of the state transition probabilities. We stabilized this model in two ways.</Paragraph>
<Section position="1" start_page="2" end_page="2" type="sub_section">
<SectionTitle>5.1 Using Unambiguous Tag Sequences To Initialize Contextual Probabilities</SectionTitle>
<Paragraph position="0"> First, we used our unfiltered lexicon along with our training corpus to extract unambiguous tag sequences. Specifically, we looked for trigrams in which every word had at most one possible part-of-speech tag. We then used these n-grams and their counts to bias the initial estimates of the state transition probabilities in the HMM taggers. This approach is similar to that described in Ratnaparkhi (1998), who used unambiguous phrasal attachments to train an unsupervised prepositional phrase attachment model.</Paragraph>
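The extraction step might be sketched as follows, assuming the unfiltered lexicon maps each word to its set of possible tags. How the resulting trigram counts are converted into initial transition estimates (e.g. relative frequency plus smoothing) is left abstract here, since the paper does not spell it out.

```python
from collections import Counter

def unambiguous_tag_trigrams(corpus, lexicon):
    """Count tag trigrams from stretches of three consecutive words that
    each have exactly one possible tag in the lexicon. `corpus` is a list
    of sentences, each a list of word strings (names are illustrative)."""
    counts = Counter()
    for sent in corpus:
        for i in range(len(sent) - 2):
            tag_sets = [lexicon.get(w, set()) for w in sent[i:i + 3]]
            if all(len(ts) == 1 for ts in tag_sets):
                counts[tuple(next(iter(ts)) for ts in tag_sets)] += 1
    return counts
```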
</Section>
<Section position="2" start_page="2" end_page="2" type="sub_section">
<SectionTitle>5.2 HMM Model Training Revised</SectionTitle>
<Paragraph position="0"> Second, we revised the training paradigm for HMMs, in which the lexical and transition probabilities are typically estimated simultaneously. We decided to train the transition model first, keeping the lexical probabilities constant and uniform. Starting from the estimates biased by the method just described, we train the transition model until it converges on a held-out set. We then hold this model fixed and train the lexical probabilities until they, too, converge on held-out data.</Paragraph>
</Section>
<Section position="3" start_page="2" end_page="2" type="sub_section">
<SectionTitle>5.3 Results</SectionTitle>
<Paragraph position="0"> We implemented this technique for the Kupiec, Merialdo, and contextualized HMM taggers. From our training data, we were able to extract on the order of 10,000 unique unambiguous tag sequences, which were then used to better initialize the state transition probabilities. As shown in Table 2, this method improved the tagging accuracy of the Merialdo and contextualized taggers over traditional simultaneous HMM training, reducing error by 0.4 in the case of Merialdo and by 0.7 for the contextualized HMM part-of-speech tagger. The tagging accuracy of the Kupiec HMM did not change under this paradigm.</Paragraph>
</Section>
</Section>
<Section position="7" start_page="2" end_page="2" type="metho">
<SectionTitle>6 Contextualized Tagging with Supervision</SectionTitle>
<Paragraph position="0"> As one more way to assess the potential benefit of using left and right context in an HMM tagger, we tested our tagging model in the supervised framework, using the same sections of the Treebank previously allocated for unsupervised training, development, and testing. In addition to comparing against a baseline tagger, which always chooses a word's most frequent tag, we implemented and trained a version of a standard HMM trigram tagger. For further comparison, we evaluated these part-of-speech taggers against Toutanova et al.'s supervised dependency-network-based tagger, which achieves the highest accuracy reported on this dataset to date. The best result for this tagger, 97.24%, makes use of both lexical and tag features drawn from the left and right sides of the target. We also chose to examine this tagger's results when using only the ⟨t_{i-1}, t_i, t_{i+1}⟩ feature templates, which represent the same amount of context built into our contextualized tagger.</Paragraph>
<Paragraph position="1"> As shown in Table 3, incorporating more context into an HMM when estimating the lexical probabilities improved accuracy from 95.87% to 96.59%, a relative error reduction of 17.4%. With the contextualized tagger we see a small improvement in accuracy over the current state of the art when the latter uses the same amount of context. It is important to note that this accuracy can be obtained without the intensive training required by Toutanova et al.'s log-linear models, falling only slightly below that of the full, training-intensive dependency-based conditional model.</Paragraph>
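For reference, the most-frequent-tag baseline mentioned above admits a very short sketch. The fallback to the globally most frequent tag for unseen words is an assumption made here for completeness; with the full-Treebank lexicon of Section 4.1, unknown words do not actually arise.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Most-frequent-tag baseline: tag each word with the tag it received
    most often in training."""
    word_tag = defaultdict(Counter)
    overall = Counter()
    for sent in tagged_corpus:          # sentences as lists of (word, tag)
        for word, tag in sent:
            word_tag[word][tag] += 1
            overall[tag] += 1
    default = overall.most_common(1)[0][0]  # globally most frequent tag
    table = {w: c.most_common(1)[0][0] for w, c in word_tag.items()}
    return lambda sentence: [table.get(w, default) for w in sentence]
```
</Section>
</Paper>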