<?xml version="1.0" standalone="yes"?> <Paper uid="I05-3005"> <Title>Morphological features help POS tagging of unknown words across language varieties</Title> <Section position="5" start_page="32" end_page="32" type="metho"> <SectionTitle> 3 Corpus analysis </SectionTitle> <Paragraph position="0"> We begin with an analytic study of potential problems for POS tagging on cross language variety data.</Paragraph> <Section position="1" start_page="32" end_page="32" type="sub_section"> <SectionTitle> 3.1 More unknown words across varieties? </SectionTitle> <Paragraph position="0"> We first test our hypothesis that a test set from a different language variety will contain more unknown words. Table 1 gives the number of words in our devset that were unseen in the XH-only training set (we describe our training/dev/test split more fully in the next section). The devset contains equal amounts of data from all three varieties (XH, HKSAR, and SM).</Paragraph> </Section> </Section> <Section position="6" start_page="32" end_page="33" type="metho"> <SectionTitle> 1 Xinhua Agency 2 Information Services Department of Hong Kong Special Administrative Region 3 Sinorama magazine </SectionTitle> <Paragraph position="0"> As Table 1 shows, in data taken from the same source as the training data (XH), 4.63% of the words were unseen in training, compared to the much larger numbers of unknown words in the cross-variety data sets (14.3% and 16.7%). Some of this difference is probably due to genre as well, especially for the out-</Paragraph> <Section position="1" start_page="32" end_page="33" type="sub_section"> <SectionTitle> 3.2 What are the unknown words? </SectionTitle> <Paragraph position="0"> In this section, we analyze the part-of-speech characteristics of the unknown words in our devset.</Paragraph> <Paragraph position="1"> Table 2 Word class distribution of unknown words in devset, XH, HKSAR, SM. Devset represents the conjunction of the three varieties. 
CC, DT, LC, P, PN, PU, and SP are considered closed classes by CTB.</Paragraph> <Paragraph position="2"> Table 2 shows that the majority of Chinese unknown words are common nouns (NN) and verbs (VV). This holds both within and across different varieties. Beyond the content words, we find that 10.96% and 21.31% of unknown words are function words in HKSAR and SM data. Such unknown function words include the determiner gewei (&quot;everybody&quot;), the conjunction huoshi (&quot;or&quot;), the preposition liantong (&quot;with&quot;), the pronoun nali (&quot;where&quot;), and symbols used as quotes &quot;&quot; and &quot;&quot; (punctuation). XH does contain words with similar function (huozhe &quot;or&quot;, yu &quot;with&quot;, dajia &quot;everybody&quot;, quotation marks &quot;&quot; and &quot;&quot;). Our result thus suggests that each Mandarin variety may have characteristic function words.</Paragraph> </Section> <Section position="2" start_page="33" end_page="33" type="sub_section"> <SectionTitle> 3.3 Cross language comparison </SectionTitle> <Paragraph position="0"> A key goal of our work is to understand the way that unknown words differ across languages. We thus compare Chinese, German, and English. Following Brants (2000), we extracted 10% of the data from the Penn Treebank Wall Street Journal (WSJ 4 ) and NEGRA5 (Brants et al., 1999) as observation samples to compare to the rest of the corpora.</Paragraph> <Paragraph position="1"> In these observation samples, we found that Chinese words are more ambiguous in POS than English and German; 29.9% of tokens in CTB have more than one POS tag, while only 19.8% and 22.9% of tokens are ambiguous in English and German, respectively.</Paragraph> <Paragraph position="2"> Table 3 shows that 40.6% of unknown words are proper nouns6 in English, while both Chinese and German have fewer than 15% of unknown words as proper nouns. Unlike English, 60% of the unknown words in Chinese and German are verbs and common nouns. 
In the next section we investigate the cause of this similarity between the Chinese and German unknown word distributions.</Paragraph> <Paragraph position="3"> Table 3 Comparison of unknown words in English, German and Mandarin. The English and German data are extracted from WSJ and NEGRA. Chinese data is our CTB devset.</Paragraph> </Section> </Section> <Section position="7" start_page="33" end_page="34" type="metho"> <SectionTitle> 4 Morphological analysis </SectionTitle> <Paragraph position="0"> In order to understand the causes of the similarity of Chinese and German, and to help suggest possible features, we turn here to an introduction to Chinese morphology and its implications for part-of-speech tagging.</Paragraph> <Paragraph position="1"> 4 WSJ unknown words are those in WSJ 19-21 but unseen in WSJ 0-18; these are the devset and training set from Toutanova et al. (2003).</Paragraph> <Paragraph position="2"> 5 The unknown words of NEGRA are words in a 10% randomly extracted set that were unseen in the rest of the corpus. 6 We treat NNP (proper noun) and NNPS (proper noun plural) as proper nouns, NN (noun) and NNS (noun plural) as other nouns, and V* as verbs in WSJ. We treat NE (Eigennamen) as proper nouns, NN (Normales Nomen) as other nouns, and V* as verbs in NEGRA. We treat NR as proper nouns, NN and NT as other nouns, and V* as verbs in CTB.</Paragraph> <Section position="1" start_page="33" end_page="33" type="sub_section"> <SectionTitle> 4.1 Chinese morphology </SectionTitle> <Paragraph position="0"> Chinese words are typically formed by four morphological processes: affixation, compounding, idiomization, and reduplication, as shown in Table 4.</Paragraph> <Paragraph position="1"> In affixation, a bound morpheme is added to other morphemes, forming a larger unit. Chinese has a small number of prefixes and infixes7 and numerous suffixes (Chao 1968, Li and Thompson 1981). 
Chinese prefixes include items such as gui (&quot;noble&quot;) in guixing (&quot;your name&quot;), bu (&quot;not&quot;) in budaode (&quot;immoral&quot;), and lao (&quot;senior&quot;) in laohu (&quot;tiger&quot;) and laoshu (&quot;mouse&quot;). There are a number of Chinese suffixes, including zhe (&quot;marks a person who is an agent of an action&quot;) in zuozhe (&quot;author&quot;), shi (&quot;master&quot;) in laoshi (&quot;teacher&quot;), ran (-ly) in huran (&quot;suddenly&quot;), and xing (-ity or -ness) in kenengxing (&quot;possibility&quot;).</Paragraph> <Paragraph position="2"> Compound words are composed of multiple stem morphemes. Chao (1968) describes a few of the different compounding rules in Mandarin, such as coordinate compounds, subject-predicate compounds, noun-noun compounds, adjective-noun compounds, and so on. Two examples of coordinate compounds are anpai ARRANGE-ARRANGE (&quot;to arrange, arrangement&quot;) and xuexi STUDY-STUDY (&quot;to study&quot;).</Paragraph> <Paragraph position="3"> Table 4 Chinese morphological rules and examples</Paragraph> </Section> <Section position="2" start_page="33" end_page="34" type="sub_section"> <SectionTitle> Examples </SectionTitle> <Paragraph position="0"> Prefix: lao (&quot;senior&quot;) in laohu (&quot;tiger&quot;)
Suffix: shi (&quot;master&quot;) in laoshi (&quot;teacher&quot;)
Compounding: xuexi (&quot;to study&quot;, &quot;study&quot;)
Idiomization: wanshiruyi (&quot;everything is fine&quot;)
Reduplication: changchang (&quot;taste a bit&quot;)
Compounding is extremely common in both Chinese and German. The phrase &quot;income tax&quot; is treated as an NP in English, but it is a word in German, Einkommensteuer, and in Chinese, suodesui. We suggest that it is this rich use of compounding that causes the wide variety of unknown common nouns and verbs in Chinese and German. However, there are still differences in their compounding rules. 
German compounds can be composed of a large number of elements, but Chinese compounds normally consist of two bases. Most German compounds are nouns, but Chinese has both noun and verb compounds.</Paragraph> <Paragraph position="1"> Two final types of Chinese morphological processes that we will not focus on are idiomization, in which a whole phrase such as wanshiruyi (&quot;everything is fine&quot;) functions as a word, and reduplication, in which a morpheme or word is repeated to form a new word, such as the formation of changchang (&quot;taste a bit&quot;) from chang (&quot;taste&quot;) (Chao 1968, Li and Thompson 1981).7</Paragraph> <Paragraph position="2"> 7 Chinese only has two infixes, which are de and bu (&quot;not&quot;). We do not discuss infixes in this paper, because they are handled phrasally rather than lexically in CTB.</Paragraph> </Section> <Section position="3" start_page="34" end_page="34" type="sub_section"> <SectionTitle> 4.2 Difficulty </SectionTitle> <Paragraph position="0"> The morphological characteristics of Chinese create various problems for part-of-speech tagging. First, Chinese suffixes are short and sparse. Because of the prevalence of compounding and the fact that the morphemes are short (1 character long), there are more than 4000 affixes. This means that the identity of an affix is often a sparsely-seen feature for predicting POS. Second, Chinese affixes are poor cues to POS because they are ambiguous; for example, 63% of Chinese suffix tokens in CTB have more than one possible tag, while only 31% of English suffix tokens in WSJ have more than one tag. Most English suffixes are derivational and inflectional suffixes like -able, -s, and -ed. Such functional suffixes are used to indicate word classes or syntactic function. Chinese, however, has no inflectional suffixes and only a few derivational suffixes, and so suffixes may not be as good a cue for word classes. 
Finally, since Chinese has no derivational morpheme for nominalization, it is difficult to distinguish a nominalization from a verb.</Paragraph> <Paragraph position="1"> These points suggest that morpheme identity, which is the major feature used in previous research on unknown words in English and German, will be insufficient in Chinese. This suggests the need for more sophisticated features, which we will introduce below.</Paragraph> </Section> </Section> <Section position="8" start_page="34" end_page="36" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> We evaluate our tagger under several experimental conditions: after showing the effects of data cleanup, we show basic results based on features found to be useful by previous research. Next, we introduce additional morphology-based unknown word features, and finally, we experiment with training data of variable sizes and different language varieties.</Paragraph> <Section position="1" start_page="34" end_page="34" type="sub_section"> <SectionTitle> 5.1 Data sets </SectionTitle> <Paragraph position="0"> To study the significance of training on different varieties of data, we created three training sets: training set I contains data only from one variety; training set II contains data from 3 varieties and is similar in total size to training set I. Training set III also contains data from 3 varieties and has twice as much data as training set I. To facilitate comparison of performance both between and within Mandarin varieties, both the devset and the test set we created are composed of three varieties of data. The XH test data we selected was identical to the test set used in previous parsing research by Bikel and Chiang (2000). For the remaining data, we included HKSAR and SM data that is similar in size to the XH test set. 
Table 5 details characteristics of the data sets.</Paragraph> </Section> <Section position="2" start_page="34" end_page="35" type="sub_section"> <SectionTitle> 5.2 The model </SectionTitle> <Paragraph position="0"> Our model builds on research into log-linear models by Ng and Low (2004), Toutanova et al. (2003), and Ratnaparkhi (1996). The first uses independent maximum entropy classifiers, with a sequence model imposing categorical valid tag sequence constraints. The latter two use maximum entropy Markov models (MEMMs) (McCallum et al., 2000), which use log-linear models to obtain the probabilities of a state transition given an observation and the previous state, as illustrated in Figure 1 (a).</Paragraph> <Paragraph position="1"> Figure 1 Graphical representation of transition probability calculation used in maximum entropy Markov models. (a) The previous state and the current word are used to calculate the transition probabilities for the next state transition. (b) Same as (a), but when the model is run right to left.</Paragraph> <Paragraph position="2"> Using left-to-right transition probabilities, as in Figure 1 (a), the equation for the MEMM can be formally stated as follows, where di represents the set of features the transition probabilities are conditioned on: P(t, w) = ∏i P(ti | di). Maximum entropy is used to calculate the probability P(ti | di) using the equation below. Here, fj(ti, di) represents a feature derived from the available contextual information (e.g. 
current word, previous tag, next word, etc.), and the probability is computed as P(ti | di) = exp(Σj λj fj(ti, di)) / Σt' exp(Σj λj fj(t', di)), where λj is the learned weight of feature fj.</Paragraph> <Paragraph position="4"> We also used a Gaussian prior to prevent overfitting.</Paragraph> <Paragraph position="5"> This technique allows us to utilize a large number of lexical and MEMM state-sequence-based features and also provides an intuitive framework for the use of morphological features generated from unknown word models.</Paragraph> </Section> <Section position="3" start_page="35" end_page="35" type="sub_section"> <SectionTitle> 5.3 Data cleanup </SectionTitle> <Paragraph position="0"> Before investigating the effect of our new features, we show the effects of data cleanup. Table 6 illustrates the 0.46% (absolute) performance gain obtained by cleaning character encoding errors and normalizing half width to full width.</Paragraph> <Paragraph position="1"> We also clustered punctuation symbols, since training set I has too many varieties of punctuation (36, compared to 9 in WSJ). We clustered punctuation marks, for example grouping &quot;&quot; and &quot;&quot; together. This mapping yields an overall improvement of 0.08%.</Paragraph> <Paragraph position="2"> All models in the following sections are then trained on font-normalized and punctuation-clustered data.</Paragraph> <Paragraph position="3"> Table 6 Improvement of tagging accuracy after data cleanup. The features used by all of the models are the identity of the two previous words, the current word, and the two following words. No features based on the sequence of tags were used.</Paragraph> </Section> <Section position="4" start_page="35" end_page="36" type="sub_section"> <SectionTitle> 5.4 Sequence features </SectionTitle> <Paragraph position="0"> We examined several tag sequence features from both the left and right sides of the current word. 
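As a concrete sketch of the log-linear scoring described in Section 5.2: the feature names, tiny tagset, and weights below are invented purely for illustration, and a real tagger would train the weights under the Gaussian prior rather than fix them by hand.

```python
import math

def maxent_prob(tag, context_feats, tagset, weights):
    """P(t_i | d_i) = exp(sum_j w_j f_j(t_i, d_i)) / Z(d_i).

    Binary indicator features are keyed as (feature, tag); absent
    (feature, tag) pairs contribute weight 0."""
    def score(t):
        return sum(weights.get((f, t), 0.0) for f in context_feats)
    z = sum(math.exp(score(t)) for t in tagset)
    return math.exp(score(tag)) / z

def memm_prob(tag_seq, feat_seq, tagset, weights):
    """P(t, w) = product over i of P(t_i | d_i), as in the MEMM equation."""
    p = 1.0
    for t, d in zip(tag_seq, feat_seq):
        p *= maxent_prob(t, d, tagset, weights)
    return p
```

In practice the tagger would decode with a search over tag sequences maximizing this product, rather than scoring one fixed sequence.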
We use the term lexical features to refer to features derived from the identity of a word, and the term tag sequence features to refer to features derived from the tags of surrounding words.</Paragraph> <Paragraph position="1"> These features have been shown to be useful in previous research on English (Toutanova et al. 2003, Brants 2000, Thede and Harper 1999). The models9 in Table 7 list the different tag sequence features used; they also use the same lexical features from the model 2Rw+2Lw shown in Table 6. The table shows that Model Lt+LLt, conditioning on the previous tag and the conjunction of the two previous tags, yields 88.27%. As such, using the sequence features <ti-1, ti-1ti-2> achieves the current best result. So far, there are no features specifically tailored toward unknown words in the model.</Paragraph> <Paragraph position="2"> 9 If a feature occurs fewer than 3 times, it is simply removed from the training data. All models are trained on training set I and evaluated on the devset.</Paragraph> <Paragraph position="3"> Starting with Model Lt+LLt from the last section, we introduce 8 features to improve the performance of the tagger on unknown words. In the sections that follow, the model using affixation in conjunction with the basic lexical features described above is considered to be our baseline.</Paragraph> <Paragraph position="4"> We considered words that occur fewer than 7 times in training set I as rare; if Wi is rare, an unknown word feature is used in place of a feature based on the actual word's identity. During evaluation, unknown word features are used for all words that occurred zero to 7 times in the training data. In addition, when tagging such rare and unknown words, we restrict the set of possible tags to just those tags that were associated with one or more rare words in the training data.</Paragraph> <Paragraph position="5"> Our affixation feature is motivated by similar features seen in inflectional language models. 
(Ng and Low 2004, Toutanova et al. 2003, Brants 2000, Ratnaparkhi 1996, Samuelsson 1993). Since Chinese also has affixation, it is reasonable to incorporate this feature into our model. For this feature, we use character n-gram prefixes and suffixes for n up to 4.10 While affix information can be very informative, we showed earlier that affixes in Chinese are sparse, short, and ambiguous. Thus, as our first new feature, we used a POS-vector of the set of tags a given affix could have. We used the training set to build a morpheme/POS dictionary with the possible tags for each morpheme.</Paragraph> <Paragraph position="6"> 10 Despite the short average word length, we found that affixes up to size 4 worked better than affixes only up to size 2, perhaps mainly because they help with long proper nouns and temporal expressions.</Paragraph> <Paragraph position="8"> Thus, for each prefix and suffix that occurs with each CTB tag in training set I, we associate a set of binary features corresponding to each CTB tag. In the example below, the prefix C occurred in both NN and VV words, but not AD or AS. This model smoothes affix identity, and the quantity of active CTBMorph features for a given affix expresses the degree of ambiguity associated with that affix.</Paragraph> <Paragraph position="9"> for each t in CTB tag set:
    for each single-character prefix or suffix k of W:
        if t.affixList contains k:
            f.appendPair(t, 1)
        else:
            f.appendPair(t, 0)</Paragraph> </Section> </Section> <Section position="9" start_page="36" end_page="37" type="metho"> <SectionTitle> 5.5.3 ASBC </SectionTitle> <Paragraph position="0"> One way to deal with robustness is to add more varied training data. For example, the Academia Sinica Balanced Corpus11 contains POS-tagged data from a different variety (Taiwanese Mandarin). But the tags in this corpus are not easily converted to the CTB tags. This problem of labeled data from very different tagsets can happen more generally. 
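The affixation n-grams and the CTBMorph pseudocode above can be fleshed out as a short sketch. The toy tagset, the corpus format, and the Latin letters standing in for Chinese characters are illustrative assumptions, not the paper's actual data.

```python
from collections import defaultdict

def affix_features(word, max_n=4):
    # Character n-gram prefixes and suffixes, for n up to 4.
    feats = []
    for n in range(1, min(max_n, len(word)) + 1):
        feats.append(("prefix%d" % n, word[:n]))
        feats.append(("suffix%d" % n, word[-n:]))
    return feats

def build_affix_table(tagged_words):
    # tag -> set of single-character prefixes/suffixes seen with that tag,
    # i.e. the morpheme/POS dictionary built from the training corpus.
    table = defaultdict(set)
    for word, tag in tagged_words:
        table[tag].add(word[0])
        table[tag].add(word[-1])
    return table

def ctbmorph_features(word, table, tagset):
    # One binary feature per (tag, affix) pair: 1 if that affix occurred
    # with that tag in the training corpus, else 0. The number of active
    # features expresses the affix's degree of ambiguity.
    feats = []
    for t in tagset:
        for affix in (word[0], word[-1]):
            feats.append((t, 1 if affix in table[t] else 0))
    return feats
```

Swapping in a different tagged corpus when building the table yields the ASBCMorph variant described below in the same way.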
We introduce two alternative methods for making use of such a corpus.</Paragraph> <Paragraph position="1"> 5.5.3.1 ASBCMorph (ASBCM) The ASBCMorph feature set is generated in an identical manner to the CTBMorph feature set, except that rather than generating the morpheme table using CTB, another corpus is used. The morpheme table is generated from the Academia Sinica Balanced Corpus, ASBC (Huang and Chen 1995), a 5M-word balanced corpus written in Taiwanese Mandarin. As the CTB annotation guide12 states, the mapping between the tag sets used in the two corpora is nontrivial. As such, the ASBC data cannot be directly used to augment the training set. However, using our ASBCMorph feature, we are still able to derive some benefit from such an alternative corpus.</Paragraph> <Paragraph position="2"> 5.5.3.2 ASBCWord (ASBCW) The ASBCWord feature is identical to the ASBCMorph feature, except that instead of using a table of tags that occur with each affix, we use a table of tags that a word occurs with in the ASBC data.</Paragraph> <Paragraph position="3"> Thus, a rare word in the CTB training/test set is augmented with features that correspond to all of the tags that the given word occurred with in the ASBC corpus, i.e., in this case, the POS tag of the identical word in ASBC, Cm>_.</Paragraph> <Paragraph position="4"> Wi=Cm>_ FASBCWord={(A,0),(Caa,0),(Cab,0)...(V_2,0)} This feature set contains only two feature values, based on whether a list of verb affixes contains the prefix or suffix of an unknown word. We use the verb affix list created by the Chinese Knowledge Information Processing Group13 at Academia Sinica. It contains 735 frequent verb prefixes and 282 frequent verb suffixes. For example, Prefix1=C, suffix1=>_ Fverb={(verb prefix, 1), (verb suffix, 0)} Radicals are the basic building blocks of Chinese characters. There are over 214 radicals, and all Chinese characters contain one or more of them. Sometimes radicals reflect the meaning of a character. 
For example, the characters for &quot;monkey&quot;, &quot;pig&quot;, and &quot;cat&quot; all contain the radical that roughly means &quot;something that is an animal&quot;. For our radical-based feature, we use the radical map from the Unihan database.14 The radicals associated with the characters in the prefix and suffix of unknown words were incorporated into the model as features, for example: Prefix1=C, suffix1=>_ FRADICAL={(radical prefix, B), (radical suffix, >7)} There is a convention that the suffix of a named entity indicates the essential meaning of the named entity. For example, the suffix bao (&quot;newspaper&quot;) appears in the Chinese translation of &quot;WSJ&quot;, huaerjieribao. The suffix he (&quot;river&quot;) is used to identify rivers, for example in huanghe (&quot;yellow river&quot;).</Paragraph> <Paragraph position="5"> To take advantage of this fact, we made 3 tables of named entity characters from the Chinese English Named Entity Lists (CENEL) (Huang 2002). These lists consist of a table of Chinese first name characters, a table of Chinese last name characters, and a table of named entity suffixes such as organization, place, and company names in CENEL. (Table caption: change in token accuracies and unknown word accuracies from the baseline for each feature introduced cumulatively. The fourth column shows the improvement from each feature set. The six columns on the right side of the table show the error rates for the 5 most frequent tagsets of unknown words and the rest of the unknown words. Error analysis: error rate % of unknown words in each POS.) Our named entity feature set contains 3 features, each corresponding to one of the three tables just described. To generate these features, first, we check if the prefix of an unknown word is in the Chinese last name table. Second, we check if the suffix is in the Chinese first name table. Third, we check if the suffix of an unknown word is in the table of named entity suffixes. 
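These three table lookups can be sketched as follows; the table contents used here are placeholders, not the actual CENEL data.

```python
def ne_features(word, last_names, first_names, ne_suffixes):
    # Three binary named-entity features: is the word's first character a
    # known last-name character, is its final character a known first-name
    # character, and is its final character a known named-entity suffix
    # (e.g. a character meaning "river" or "newspaper").
    prefix, suffix = word[0], word[-1]
    return [
        ("last name", 1 if prefix in last_names else 0),
        ("first name", 1 if suffix in first_names else 0),
        ("NE suffix", 1 if suffix in ne_suffixes else 0),
    ]
```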
In Chinese, last names are written before the first name, and the whole name is considered a word, for example: Prefix1=C, suffix1=>_ FNEM={(last name, 0), (first name, 0), (NE suffix, 0)} The length of a word can be a useful feature, because the majority of words in CTB have fewer than 3 characters. Words that have more than 3 characters are normally proper nouns, numbers, and idioms. Therefore, we incorporate this feature into the system. For example: Wi=Cm>_, Flength={(length, 3)}</Paragraph> </Section> </Paper>