<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1026"> <Title>How to get a Chinese Name (Entity): Segmentation and Combination Issues</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Language-Specific Issues in Chinese NE Recognition </SectionTitle> <Paragraph position="0"> Chinese does not have delimiters between words, so a key design issue in Chinese NE recognition is whether to build a character-based model or a word-based model. In this section, we use a hidden Markov model NE recognition system as an example to discuss language-specific issues in Chinese NE recognition.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 The Hidden Markov Model Classifier </SectionTitle> <Paragraph position="0"> NE recognition can be formulated as a classification task, where the goal is to label each token with a tag indicating whether it belongs to a specific NE or is not part of any NE. The HMM classifier used in our experiments follows the algorithm described in (Bikel et al., 1999). It performs sequence classification by assigning each token either one of the NE types or the label &quot;O&quot;, representing &quot;outside any NE&quot;. The states in the HMM are organized into regions, one region for each type of NE plus one for &quot;O&quot;. Within each region, a statistical language model is used to compute the likelihood of words occurring within that region. The transition probabilities are smoothed by deleted interpolation, and decoding is performed with the Viterbi algorithm.</Paragraph> </Section>
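To make the decoding step concrete, the following is a minimal sketch of Viterbi decoding over NE regions. The state inventory, the toy log-probability tables, and the flat unknown-character penalty are illustrative assumptions only; the actual system uses region-internal language models with deleted-interpolation smoothing, which are not reproduced here.
```python
# Minimal Viterbi decoder over NE "regions" (one state per NE type plus "O").
# start/trans/emit are assumed dicts of log-probabilities; `unk` is a flat
# penalty standing in for the smoothed unknown-character probability.
STATES = ["PER", "LOC", "ORG", "O"]

def viterbi(chars, start, trans, emit, unk=-10.0):
    """Return the most likely state sequence for a character sequence."""
    # lattice[i][s] = (best log-prob of a path ending in state s at char i, backpointer)
    lattice = [{s: (start[s] + emit[s].get(chars[0], unk), None) for s in STATES}]
    for i in range(1, len(chars)):
        col = {}
        for s in STATES:
            prev = max(STATES, key=lambda p: lattice[i - 1][p][0] + trans[p][s])
            col[s] = (lattice[i - 1][prev][0] + trans[prev][s]
                      + emit[s].get(chars[i], unk), prev)
        lattice.append(col)
    # Trace back from the best final state.
    best = max(STATES, key=lambda s: lattice[-1][s][0])
    path = [best]
    for i in range(len(chars) - 1, 0, -1):
        best = lattice[i][best][1]
        path.append(best)
    return path[::-1]
```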
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Character-Based, Word-Based, and Class-Based Models </SectionTitle> <Paragraph position="0"> To build a model for identifying Chinese NEs, we need to determine the basic unit of the model: character or word. On one hand, the word-based model is attractive because it allows the system to inspect a larger window of text, which may lead to more informative decisions. On the other hand, a word segmenter is error-prone, and its errors may propagate into the NE recognition.</Paragraph> <Paragraph position="1"> Two systems, a character-based HMM model and a word-based HMM model, were built for comparison. The word segmenter used in our experiments relies on dictionaries and surrounding words in local context to determine word boundaries. During training, the NE boundaries were provided to the word segmenter, which is restricted to enforce word boundaries at each entity boundary. Therefore, at training time, the word boundaries are consistent with the entity boundaries. At test time, however, the segmenter could create words which do not agree with the gold-standard entity boundaries.</Paragraph> <Paragraph position="2"> Table 1 reports the performance of the character-based HMM model, the word-based HMM model, and the class-based HMM model. (The precision, recall, and F-measure presented in this table and throughout this paper are based on correct identification of all the attributes of an NE, including boundary, content, and type.) The two corpora used in the evaluation, the IBM-FBIS corpus and the IEER corpus, differ greatly in data size and the number of NE types. The IBM-FBIS training data consists of 3.1 million characters, and the corresponding test data has 270,000 characters.</Paragraph> <Paragraph position="3"> As the table shows, for both corpora the character-based model outperforms the word-based model, with a lead of 3 to 5.5 points in F-measure. The performance gap between the two models is larger for the IEER data than for the IBM-FBIS data.</Paragraph> <Paragraph position="4"> We also built a class-based NE model. After word segmentation, class tags such as number, Chinese-name, foreign-name, date, and percent are used to replace words belonging to these classes; whether a word belongs to a specific class is determined by a rule-based normalizer. The performance of the class-based HMM model is also shown in Table 1.</Paragraph> <Paragraph position="5"> For the IBM-FBIS corpus, the class-based model outperforms the word-based model; for the IEER corpus, the class-based model is worse than the word-based model. In both cases, the performance difference between the word-based model and the class-based model is very small. The character-based model outperforms the class-based model in both tests.</Paragraph> <Paragraph position="6"> A more careful analysis indicates that although the word-based model performs worse than the character-based model overall in our evaluation, it performs better for certain NE types. For instance, the word-based model outperforms the character-based model on the organization category in both tests: while the character-based model has an F-measure of 65.07 (IBM-FBIS) and 64.76 (IEER) for organizations, the word-based model achieves 69.14 (IBM-FBIS) and 72.38 (IEER), respectively.</Paragraph> <Paragraph position="7"> One reason may be that organization names tend to contain many characters, and since the word-based model allows the system to analyze a larger window of text, it is more likely to make a correct guess.</Paragraph> <Paragraph position="8"> We can integrate the character-based model and the word-based model by combining the decisions from the two models. For instance, if we use the decisions of the word-based model for the organization category but the decisions of the character-based model for all other categories, the overall F-measure rises to 76.91 for the IEER data, higher than using either the character-based or the word-based model alone. Another way to integrate the two models is a hybrid model, which starts with a word-based model and backs off to a character-based model when the word is unknown; a sketch of this back-off is shown below.</Paragraph> </Section>
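A minimal sketch of what such a back-off could look like at the emission level. The `word_model` and `char_model` dictionaries and the flat unknown penalty are hypothetical stand-ins, not the paper's implementation.
```python
# Hybrid emission score: use the word-based model when the word was seen in
# training, otherwise back off to the character-based model.
# word_model[region] / char_model[region] are assumed dicts mapping a word
# (resp. character) to its log emission probability in that NE region.
def hybrid_emission(word, region, word_model, char_model, unk=-10.0):
    if word in word_model[region]:
        return word_model[region][word]
    # Unknown word: score it as the sum of its characters' log-probabilities
    # under the character-based model for the same region.
    return sum(char_model[region].get(ch, unk) for ch in word)
```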
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Granularity of Word Segmentation </SectionTitle> <Paragraph position="0"> We believe that one main reason for the lower performance of the word-based model is that the word granularity defined by the word segmenter is not suitable for the HMM model to perform the NE recognition task. What exactly constitutes a Chinese word has been a topic of major debate; we are interested in the best word granularity for our particular task.</Paragraph> <Paragraph position="1"> To illustrate the word granularity problem for NE tagging, take person names as an example. Our word segmenter marks a person's name as one word, consistent with the convention used by the Chinese treebank and many other word segmentation systems. While this may be useful in other applications, it is certainly not a good choice for our NE model.</Paragraph> <Paragraph position="2"> Chinese names typically contain two or three characters, with the family name preceding the first name. Only a limited set of characters is used as family names, while the first name can be any character(s). Therefore, the family name is a very important and useful feature for identifying an NE in the person category. By combining the family name and the first name into one word, this important feature is lost to the word-based model. In our tests, the word-based model performs much worse for the person category than the character-based model. We believe that, for the purpose of NE recognition, it is better to separate the family name from the first name in word segmentation, although this is not the convention used in the Chinese treebank.</Paragraph> <Paragraph position="3"> Other examples include the segmentation of words indicating dates, countries, locations, percentages, measures, and ordinals. For instance, &quot;July 4th&quot; is expressed by the four characters &quot;7th month 4th day&quot; in Chinese. The word segmenter marks the four characters as a single word; however, the second and the last character are actually good features for indicating a date, since dates are usually expressed using the same structure (e.g., &quot;March 25th&quot; is expressed by &quot;3rd month 25th day&quot;). For similar reasons, we believe it is better to keep the characters representing &quot;month&quot; and &quot;day&quot; separate, rather than combining the four characters into one word. A similar problem can be observed in English with tokens such as &quot;61-year-old man&quot;: if one is interested in identifying a person's age, 'year' and 'old' are good features for prediction.</Paragraph> <Paragraph position="4"> The above analysis suggests that a better way to apply a word segmenter in an NE system is to first adapt the segmenter so that the segmentation granularity is more appropriate to the particular task and model; a sketch of such a re-segmentation step is shown below. As a guideline, characters that are good features for identifying NEs should not be combined with other characters into words. Additional examples include characters expressing &quot;percent&quot; and characters representing monetary measures.</Paragraph> </Section>
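The following sketch illustrates such a granularity adaptation as a post-processing step on segmenter output. The surname and date-character lists are tiny illustrative assumptions, not the rules of our segmenter; a real system would use a full inventory of family-name characters.
```python
# Re-split segmenter output so characters that are strong NE cues stay
# separate tokens (family names, the date characters 年/月/日).
SURNAMES = {"王", "李", "张", "刘", "陈"}   # illustrative subset only
DATE_CHARS = {"年", "月", "日"}

def refine_segmentation(words):
    refined = []
    for w in words:
        if len(w) in (2, 3) and w[0] in SURNAMES:
            refined += [w[0], w[1:]]          # family name | first name
        elif len(w) > 1 and any(ch in DATE_CHARS for ch in w):
            refined += list(w)                # e.g. "7月4日" -> "7","月","4","日"
        else:
            refined.append(w)
    return refined

print(refine_segmentation(["张伟", "访问", "7月4日"]))
# -> ['张', '伟', '访问', '7', '月', '4', '日']
```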
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 The Effect of Segmentation Errors </SectionTitle> <Paragraph position="0"> Word segmentation errors can lead to mistakes in NE recognition. Suppose an NE consists of the four characters $C_1 C_2 C_3 C_4$; if the word segmentation merges $C_1$ with a character preceding it, then this NE cannot be correctly identified by the word-based model, since its boundary will be incorrect. Besides inducing NE boundary errors, incorrect word segmentation also leads to wrong matchings between training examples and test examples, which may result in mistakes in identifying entities.</Paragraph> <Paragraph position="1"> We computed the upper bound of the word-based model for the IBM-FBIS test presented in Table 1. The upper bound is computed by dividing the number of NEs whose boundaries are also recognized as word boundaries by the segmenter by the total number of NEs in the corpus; under this definition, precision, recall, and F-measure coincide. For the IBM-FBIS test data in Table 1, the upper bound of the word-based model is an F-measure of 95.7.</Paragraph>
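The computation behind this upper bound can be sketched as follows; the span/boundary offset representation is an assumed encoding, chosen for illustration.
```python
# Upper bound of the word-based model: the fraction of gold NEs whose start
# and end offsets coincide with word boundaries produced by the segmenter.
# Under this definition precision = recall = F-measure.
def word_model_upper_bound(gold_spans, boundaries):
    """gold_spans: list of (start, end) character offsets of gold NEs;
    boundaries: set of character offsets at which the segmenter places
    word boundaries."""
    ok = sum(1 for s, e in gold_spans if s in boundaries and e in boundaries)
    return 100.0 * ok / len(gold_spans)
```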
<Paragraph position="2"> We also ran the following experiment to measure the effect of word segmentation errors: we gave the boundaries of NEs in the test data to the word segmenter and forced it to mark entity boundaries as word boundaries. This eliminates the word segmentation errors that inevitably result in NE boundary errors. For the IBM-FBIS data, the word-based HMM model achieves 76.60 F-measure when the entity boundaries in the test data are given, and the class-based model achieves 77.77 F-measure, higher than the 77.19 F-measure of the character-based model in Table 1. For the IEER data, the F-measure of the word-based model improves from 70.83 to 73.74 when the entity boundaries are given, and the class-based model improves from 70.20 to 72.47.</Paragraph> <Paragraph position="3"> This suggests that, as Chinese word segmentation improves, the word-based model may achieve performance comparable to or better than the character-based model.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.5 Lexical Features </SectionTitle> <Paragraph position="0"> Capitalization in English gives good evidence of names. Our HMM classifier for English uses a set of word features to indicate whether a word contains all capitalized letters, only digits, or capitalized letters and a period, as described in (Bikel et al., 1999). Chinese, however, does not have capitalization. When we applied the HMM system to Chinese, we retained such features, since Chinese text also includes digits and romanized words (such as in product or company names). To investigate the usefulness of these features for Chinese, we removed them from the system and observed very little difference in overall performance (0.4 difference in F-measure).</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.6 Sensitivity to Corpus and Training Size Variation </SectionTitle> <Paragraph position="0"> To test the robustness of the model, we trained the system on the 100,000-word IBM-CT data and tested on the same IBM-FBIS data. The character-based model achieves 61.36 F-measure and the word-based model 58.40 F-measure, compared to 77.19 and 74.17, respectively, using the 20-times-larger IBM-FBIS training set. This represents an approximately 20% relative reduction in performance when training on a related yet different and considerably smaller training set. We plan to investigate further the relation between corpus type, corpus size, and performance.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Classifier Combination </SectionTitle> <Paragraph position="0"> This section investigates the combination of a set of classifiers for NE recognition. We first introduce the classifiers used in our experiments and then describe the combination methods.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 The Classifiers </SectionTitle> <Paragraph position="0"> Besides the HMM classifier described in the previous section, the following three classifiers were used in the experiments.</Paragraph> <Paragraph position="1"> The fnTBL Classifier. Transformation-based learning (TBL) is an algorithm with two major steps: it starts by assigning some classification to each example, and then automatically proposes, evaluates, and selects the classification changes that maximally decrease the number of errors.</Paragraph> <Paragraph position="2"> TBL has some attractive qualities that make it suitable for language-related tasks: it can automatically integrate heterogeneous types of knowledge, without the need for explicit modeling (similar to Snow (Dagan et al., 1997), Maximum Entropy, decision trees, etc.), and it is error-driven, thus directly minimizing the ultimate evaluation measure: the error rate. The TBL toolkit used in this experiment is described in (Florian and Ngai, 2001).</Paragraph> <Paragraph position="3"> The Maximum Entropy Classifier (MaxEnt). The model used here is based on the maximum entropy model used for shallow parsing (Ratnaparkhi, 1999). A sentence with NE tags is converted into a shallow tree: tokens not in any NE are assigned an &quot;O&quot; tag, while tokens within an NE are represented as constituents whose label is the same as the NE type. For example, the annotated sentence &quot;I will fly to (LOCATION New York) (DATEREF tomorrow)&quot; is represented as the tree &quot;(S I/O will/O fly/O to/O (LOCATION New/LOCATION York/LOCATION) (DATEREF tomorrow/DATEREF) )&quot;. Once an NE is represented as a shallow tree, NE recognition can be realized by performing shallow parsing.</Paragraph> <Paragraph position="4"> We use the tagging and chunking model described in (Ratnaparkhi, 1999) for shallow parsing. In the tagging model, the context consists of a window of five tokens (the token being tagged, two tokens to its left, and two tokens to its right) and the two tags to the left of the current token. Five groups of feature templates are used: token unigram, token bigram, token trigram, tag unigram, and tag bigram (all within the context window). In the chunking model, the context is limited to a window of three subtrees: the previous, current, and next subtree. Unigram and bigram chunk (or tag) labels are used as features. A sketch of the NE-to-shallow-tree conversion follows.</Paragraph>
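A sketch of this conversion, assuming whole-token NE labels as in the example above; adjacent tokens with the same label are merged into one constituent, which suffices for this illustration.
```python
def to_shallow_tree(tokens, tags):
    """Convert NE-tagged tokens into the bracketed shallow-tree string used
    by the MaxEnt chunker, e.g.
    (S I/O will/O fly/O to/O (LOCATION New/LOCATION York/LOCATION) ...)."""
    parts, i = [], 0
    while i < len(tokens):
        if tags[i] == "O":
            parts.append(f"{tokens[i]}/O")
            i += 1
        else:
            label, span = tags[i], []
            while i < len(tokens) and tags[i] == label:
                span.append(f"{tokens[i]}/{label}")
                i += 1
            parts.append(f"({label} {' '.join(span)})")
    return f"(S {' '.join(parts)} )"

print(to_shallow_tree(
    ["I", "will", "fly", "to", "New", "York", "tomorrow"],
    ["O", "O", "O", "O", "LOCATION", "LOCATION", "DATEREF"]))
# -> (S I/O will/O fly/O to/O (LOCATION New/LOCATION York/LOCATION) (DATEREF tomorrow/DATEREF) )
```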
<Paragraph position="5"> The Robust Risk Minimization (RRM) Classifier. This system is a variant of the text chunking system described in Zhang et al. (2002), where the NE recognition problem is regarded as a sequential token-based tagging problem. We denote by $\{w_i\}$ ($i = 1, 2, \ldots, n$) the sequence of tokenized text, which is the input to our system. In token-based tagging, the goal is to assign a class label $t_i$ to every token $w_i$. In our system, this is achieved by estimating the conditional probability $P(t_i = c \mid x_i)$ for every possible class-label value $c$, where $x_i$ is a feature vector associated with token $i$. The feature vector $x_i$ can depend on previously predicted class labels $\{t_j\}_{j &lt; i}$, but the dependency is typically assumed to be local. Given such a conditional probability model, in the decoding stage we estimate the best possible sequence of $t_i$'s using a dynamic programming approach.</Paragraph> <Paragraph position="6"> In our system, the conditional probability model has the following parametric form: $$P(t_i = c \mid x_i) = T(w_c \cdot x_i + b_c), \quad (1)$$ where $T(y) = \min(1, \max(0, y))$ is the truncation of $y$ into the interval $[0, 1]$, $w_c$ is a linear weight vector, and $b_c$ is a constant. The parameters $w_c$ and $b_c$ can be estimated from the training data. This classification method is based on approximately minimizing a risk function; the generalized Winnow method used in (Zhang et al., 2002) implements such a robust risk minimization method.</Paragraph> <Paragraph position="7"> We compared the performance of the four classifiers by training and testing them on the same data sets. We divided the IBM-FBIS corpus into three subsets: 2.8 million characters for training, 330,000 characters for development testing, and 330,000 characters for testing. Table 2 shows the results of each classifier on the development test set and the evaluation set. The RRM and fnTBL classifiers are the best performers on the test set, followed by MaxEnt. The HMM classifier lags behind the best system by around 6 points in F-measure. The presented results are for character-based models.</Paragraph> <Paragraph position="8"> For comparison, we also computed two baselines: one in which each character is labeled with its most frequent label (Baseline1 in Table 2), and one in which each entity seen in the training data is labeled with its most frequent classification (Baseline2 in Table 2; this baseline is computed using the software provided with the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003)).</Paragraph> </Section>
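Before turning to combination, a direct rendering of Equation (1) might look as follows; the weight and bias containers are assumed inputs, since the generalized-Winnow training procedure is not reproduced here.
```python
import numpy as np

# Truncated linear model of Equation (1): P(t = c | x) = T(w_c . x + b_c),
# with T clipping the score to [0, 1]. weights[c] and biases[c] are assumed
# to come from generalized-Winnow training (not shown).
def rrm_class_probs(x, weights, biases):
    return {c: float(np.clip(np.dot(w, x) + biases[c], 0.0, 1.0))
            for c, w in weights.items()}
```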
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Combination </SectionTitle> <Paragraph position="0"> The four classifiers differ in multiple dimensions, making them good candidates for combination. We explored various ways to combine the results from the four classifiers.</Paragraph> <Paragraph position="1"> For the first part of the experimental setup, we consider the following classification framework: given the (probabilistic) outputs $P_i(\cdot \mid w)$ of $m$ classifiers $C_1, \ldots, C_m$, the classifier combination problem can be viewed as a probability interpolation problem: compute the class probability distribution conditioned on the joint classifier output, $$P(c \mid w, c_1, \ldots, c_m) = f(c, w, c_1, \ldots, c_m),$$ where $c_i$ is the $i$th classifier's output, $w$ is an observable context (e.g., a word trigram), and $f$ is a combination function. A commonly used combining scheme is linear interpolation of the classifiers' class probability distributions: $$P(c \mid w, c_1, \ldots, c_m) = \sum_{i=1}^{m} \lambda_i(w) \, P_i(c \mid w, c_i). \quad (2)$$ The weights $\lambda_i(w)$ encode the importance given to classifier $i$ in the combination for the context $w$, and $P_i(c \mid w, c_i)$ is an estimate of the probability that the correct classification is $c$, given that the output of classifier $i$ on context $w$ is $c_i$. These parameters of Equation (2) can be estimated, if needed, on development data.</Paragraph> <Paragraph position="2"> Table 3 presents the combination results for different ways of estimating the interpolation parameters. A simple combination method is equal voting (van Halteren et al., 2001; Tjong Kim Sang et al., 2000), where the parameters are computed as $\lambda_i(w) = \frac{1}{m}$ and $P_i(c \mid w, c_i) = \delta(c, c_i)$, with $\delta$ the Kronecker delta. In other words, each of the classifiers votes with equal weight for the class that is most likely under its model, and the class receiving the largest number of votes wins (i.e., it is selected as the classification output). However, this procedure may lead to ties, where some classifications receive an identical number of votes; one usually resorts to randomly selecting one of the tied candidates in this case. Table 3 presents the average results obtained by this method, together with the variance obtained over 30 trials. To make the decision deterministic, the weights associated with the classifiers can instead be chosen as $\lambda_i(w) = 1 - P_i(\mathrm{error})$. In this method, presented in Table 3 as weighted voting, better-performing classifiers have a higher impact on the final classification.</Paragraph> <Paragraph position="3"> In the previously described methods, also known as voting, each classifier gives its entire vote to one classification: its own output. However, Equation (2) allows classifiers to give partial credit to alternative classifications, through the probability $P_i(c \mid w, c_i)$. In the experiments described here, this value is computed directly on the development data. However, the space of possible choices for $c$, $w$, and $c_i$ is large enough to make direct estimation unreliable, so we use two approximations, named Model 1 and Model 2 in Table 3: $P_i(c \mid w, c_i) = P_i(c \mid c_i)$ and $P_i(c \mid w, c_i) = P_i(c \mid w_2, c_i)$, respectively, where $w_2$ is the word associated with the context $w$. Both probability distributions are estimated as smoothed relative frequencies on the development data. Interestingly, both methods underperform the equal voting method, a fact which can be explained by inspecting the results in Table 2: the fnTBL method has an accuracy (computed on development data) lower than the MaxEnt accuracy, but it outperforms the latter on the test data. Since the parameters $P_i(c \mid w, c_i)$ are computed on the development data, they probably favor the MaxEnt method, resulting in lower performance. The equal voting method, on the other hand, does not suffer from this problem, as its parameters do not depend on the development data.</Paragraph> <Paragraph position="4"> In a last set of experiments, we extend the classification framework to a larger space, in which we compute the class probability distribution conditioned on an arbitrary set of features $f_1, \ldots, f_k$: $$P(c \mid f_1, f_2, \ldots, f_k). \quad (3)$$ This setup still allows us to use the classifications of the individual systems as features, but also admits other types of conditioning features; for instance, one can use the output of any classifier (e.g., POS tags, text chunk labels, etc.) as features.</Paragraph> <Paragraph position="5"> In the described experiments, we use the RRM method to compute the distribution in Equation (3), allowing the system to select a well-performing combination of features.</Paragraph>
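The voting and interpolation schemes of Equation (2) can be sketched in a few lines for a single token; the `credit_tables` argument is a hypothetical encoding of the $P_i(c \mid c_i)$ estimates used by Model 1.
```python
from collections import defaultdict

# Linear-interpolation combination (Equation (2)) for one token:
# P(c) = sum_i lambda_i * P_i(c | c_i). With uniform lambdas and
# P_i(c | c_i) = delta(c, c_i) this is equal voting; passing per-classifier
# accuracies as lambdas gives weighted voting; passing credit tables
# estimated on development data gives the Model 1 variant.
def combine(outputs, lambdas=None, credit_tables=None):
    m = len(outputs)
    lambdas = lambdas or [1.0 / m] * m
    scores = defaultdict(float)
    for i, (lam, c_i) in enumerate(zip(lambdas, outputs)):
        if credit_tables is not None:
            for c, p in credit_tables[i][c_i].items():   # partial credit
                scores[c] += lam * p
        else:
            scores[c_i] += lam                           # full vote
    return max(scores, key=scores.get)

# Equal voting over four hypothetical per-token outputs:
print(combine(["PERSON", "PERSON", "O", "ORG"]))  # -> 'PERSON'
```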
<Paragraph position="6"> At training time, the system was fed the output of each classifier on the development data; in subsequent experiments, it was also fed a flag stream identifying some of the tokens (numbers, romanized characters, etc.) and the output of each system in a different NE encoding scheme.</Paragraph> <Paragraph position="7"> In all the voting experiments, the NEs were encoded in the IOB1 scheme, since it seems to be the most amenable to combination. Briefly, the general IOB encoding scheme associates a label with each word, indicating whether the word begins a specific entity, continues the entity, or is outside any entity; Tjong Kim Sang and Veenstra (1999) describe the IOB schemes in detail. The final experiment also has access to the output of systems trained with the IOB2 encoding. The addition of each feature type resulted in better performance, with the final result yielding a 10% relative decrease in F-measure error compared with the best performing single system.</Paragraph> <Paragraph position="8"> Table 3 also includes an upper bound on the classifier combination performance: the performance of the switch oracle, which selects the correct classification if at least one classifier outputs it.</Paragraph> <Paragraph position="9"> Table 3 shows that, at least for the examined types of combination, using a robust feature-based classifier to compute the classification distribution yields better performance than combining the classifications through either voting or weighted interpolation. The RRM-based classifier is able to incorporate heterogeneous information from multiple sources, obtaining a 2.8-point absolute F-measure improvement over the best performing classifier and a 1.0-point F-measure gain over the next best combination method.</Paragraph> </Section> </Section> </Paper>