<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0118"> <Title>Voting between Dictionary-based and Subword Tagging Models for Chinese Word Segmentation</Title> <Section position="4" start_page="0" end_page="127" type="metho"> <SectionTitle> 2 System Description </SectionTitle> <Paragraph position="0"> In our segmentation system, a hybrid strategy is applied (Figure 1): First, forward maximum matching (Chen and Liu, 1992), a dictionary-based method, is used to generate a segmentation result. In addition, a CRF model using maximum subword-based tagging (Zhang et al., 2006) and a CRF model using minimum subword-based tagging, both of which are statistical methods, are used individually to segment the input. In the next step, the solutions from these three methods are combined via a hanzi-level majority voting algorithm. Then, a post-processing procedure is applied in order to get the final output: it merges adjoining words to match dictionary entries and then splits words that are inconsistent with entries in the training corpus.</Paragraph> <Section position="1" start_page="0" end_page="126" type="sub_section"> <SectionTitle> 2.1 Forward Maximum Matching </SectionTitle> <Paragraph position="0"> The maximum matching algorithm is a greedy segmentation approach. It proceeds through the sentence, matching the longest possible word at each point against the entries in the dictionary. In our system, the well-known forward maximum matching algorithm (Chen and Liu, 1992) is implemented.</Paragraph> <Paragraph position="1"> The maximum matching approach is simple and efficient, and it yields high in-vocabulary accuracy; however, the small size of the dictionary, which is obtained only from the training data, is the major bottleneck that prevents this approach from being applied by itself.</Paragraph> </Section> <Section position="2" start_page="126" end_page="126" type="sub_section"> <SectionTitle> 2.2 CRF Model with Maximum Subword-based Tagging </SectionTitle> <Paragraph position="0"> Conditional random fields (CRF), a statistical sequence modeling approach (Lafferty et al., 2001), have been widely applied to sequence learning tasks, including Chinese word segmentation. Most existing methods in this framework use character-based IOB tagging. For example, a sequence glossed &quot;(all) (extremely important)&quot; is labeled as &quot;(all)/O (until)/B (close)/I (heavy)/I (demand)/I&quot;, with one tag per hanzi of the four-hanzi word meaning &quot;extremely important&quot;.</Paragraph> <Paragraph position="1"> Recently, Zhang et al. (2006) proposed a maximum subword-based IOB tagger for Chinese word segmentation, and our system applies their approach, which obtains very high accuracy on the shared task data from previous SIGHAN competitions. In this method, all single-hanzi words and the most frequently occurring multi-hanzi words are extracted from the training corpus to form a lexicon subset. Then, each word in the training corpus is segmented for IOB tagging with the forward maximum matching algorithm, using the formed lexicon subset as the dictionary. In the above example, the tagging labels become &quot;(all)/O (until)/B (close)/I (important)/I&quot;, assuming that the subword glossed &quot;(important)&quot; is the longest subword in this word and is one of the most frequently occurring words in the training corpus.</Paragraph>
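<Paragraph position="2"> To make the matching procedures concrete, the following minimal Python sketch implements forward maximum matching, the forward shortest matching variant used by the minimum subword model in Section 2.3, and the induced subword IOB tagging. It is an illustration only, not the system's actual code: the function names, the max_len search bound, and the set-based dictionary are our own assumptions.

def max_match(text, dictionary, max_len=6):
    """Greedy left-to-right pass: take the longest dictionary match
    at each position, falling back to a single character."""
    words, i = [], 0
    while len(text) > i:
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

def shortest_match(text, dictionary, max_len=6):
    """Forward shortest matching: take the shortest dictionary match
    at each position (used by the minimum subword model)."""
    words, i = [], 0
    while len(text) > i:
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            if text[i:j] in dictionary:
                break
        else:
            j = i + 1  # no entry starts here: emit a single character
        words.append(text[i:j])
        i = j
    return words

def iob_tags(word, lexicon):
    """IOB-tag one gold word: segment it into subwords by forward
    maximum matching over the lexicon subset; a word that is a single
    subword is tagged O, otherwise the first subword is B and each
    continuation subword is I."""
    subs = max_match(word, lexicon)
    if len(subs) == 1:
        return [(subs[0], 'O')]
    return [(s, 'B' if k == 0 else 'I') for k, s in enumerate(subs)]

# e.g. iob_tags("acde", {"a", "c", "d", "e", "ac", "de"})
# returns [("ac", "B"), ("de", "I")]
</Paragraph>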
<Paragraph position="3"> After tagging the training corpus, we use the CRF++ package to train the CRF model. Let w0 denote the current word, w-1 the first word to the left, w-2 the second word to the left, w1 the first word to the right, and w2 the second word to the right. In our experiments, the types of unigram features used are w0, w-1, w1, w-2, w2, w0w-1, w0w1, w-1w1, w-2w-1, and w2w0. In addition, only combinations of the previous observation and the current observation are exploited as bigram features.</Paragraph> </Section> <Section position="3" start_page="126" end_page="126" type="sub_section"> <SectionTitle> 2.3 CRF Model with Minimum Subword-based Tagging </SectionTitle> <Paragraph position="0"> In our third model, we apply an approach similar to that of the previous section; however, instead of finding the maximum subwords, we explore the minimum subwords. First, we build the dictionary from the whole training corpus. Then, for each word in the training data, forward shortest matching is used to obtain the sequence of minimum-length subwords, and this sequence is tagged in the same IOB format as before. Suppose &quot;a&quot;, &quot;ac&quot;, &quot;de&quot; and &quot;acde&quot; are the only entries in the dictionary. Then, for the word &quot;acde&quot;, the sequence of subwords is &quot;a&quot;, &quot;c&quot; and &quot;de&quot;, and the tags assigned to &quot;acde&quot; are &quot;a/B c/I de/I&quot;. After tagging the training data set, the CRF++ package is run again to train this model, using the identical unigram and bigram feature sets used in the previous model.</Paragraph> <Paragraph position="1"> Meanwhile, the unsegmented test data is segmented by the forward shortest matching algorithm. After this initial segmentation, the result is fed into the trained CRF model for resegmentation by assigning IOB tags.</Paragraph> </Section> <Section position="4" start_page="126" end_page="126" type="sub_section"> <SectionTitle> 2.4 Majority Voting </SectionTitle> <Paragraph position="0"> Having the segmentation results from the above three models in hand, we next adopt a hanzi-level majority voting algorithm. First, each hanzi in a segmented sentence is tagged &quot;B&quot; if it is the first hanzi of a word or a single-hanzi word, and &quot;I&quot; otherwise. Then, a given hanzi is assigned whichever tag at least two of the three models agree on. For instance, suppose &quot;a c de&quot; is the segmentation result from forward maximum matching as well as from the CRF model with maximum subword-based tagging, and &quot;ac d e&quot; is the result from the third model. For &quot;a&quot;, since all three assign &quot;B&quot;, &quot;a&quot; is given the &quot;B&quot; tag; for &quot;c&quot;, because two of the segmentations tag it as &quot;B&quot;, &quot;c&quot; is given the &quot;B&quot; tag as well. Similarly, the tag for each remaining hanzi is determined by this majority voting process, and we obtain &quot;a c de&quot; as the result for this example.</Paragraph>
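<Paragraph position="1"> A minimal sketch of this hanzi-level voting, assuming each model's output is given as a list of words over the same sentence (the function names are ours, not from the paper):

def to_bi_tags(words):
    """Tag each hanzi B if it starts a word (or is a single-hanzi
    word) and I otherwise."""
    tags = []
    for w in words:
        tags.append('B')
        tags.extend('I' * (len(w) - 1))
    return tags

def majority_vote(segmentations):
    """segmentations: three word lists covering the same sentence.
    Each hanzi receives the tag at least two models agree on."""
    tag_seqs = [to_bi_tags(seg) for seg in segmentations]
    voted = ['B' if col.count('B') >= 2 else 'I' for col in zip(*tag_seqs)]
    # Rebuild words from the voted B/I sequence.
    text = ''.join(segmentations[0])
    words, current = [], ''
    for ch, tag in zip(text, voted):
        if tag == 'B' and current:
            words.append(current)
            current = ''
        current += ch
    words.append(current)
    return words

# The example from the text: two models output "a c de" and one
# outputs "ac d e"; the voted result is ['a', 'c', 'de'].
</Paragraph>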
<Paragraph position="2"> To test the performance of each of the three models and that of majority voting, we divide the MSRA corpus into a training set and a held-out set. Throughout the experiments we conducted, we find that the two CRF models perform much better than the pure hanzi-based CRF method, and that the voting process improves the performance further.</Paragraph> </Section> <Section position="5" start_page="126" end_page="127" type="sub_section"> <SectionTitle> 2.5 Post-processing </SectionTitle> <Paragraph position="0"> While analyzing errors in the segmentation of the held-out set, we found two inconsistency problems. The first is inconsistency between the dictionary and the result: certain words that appear in the dictionary are separated into consecutive words in the test result. The second is inconsistency among words in the dictionary itself; for instance, both the single word meaning &quot;scientific research&quot; and its split form &quot;science&quot; + &quot;research&quot; appear in the training corpus.</Paragraph> <Paragraph position="1"> To deal with the first phenomenon, we try to merge adjoining words in the segmented result to match dictionary entries. Suppose &quot;a b c de&quot; is the original voting result, and &quot;ab&quot;, &quot;abc&quot; and &quot;cd&quot; form the dictionary. Then we merge &quot;a&quot;, &quot;b&quot; and &quot;c&quot; together to get the longest match with the dictionary, and the output is &quot;abc de&quot;. For the second problem, we introduce a split procedure that considers only two consecutive words. First, all bigrams are extracted from the training corpus and their frequencies are counted. Then, for example, if &quot;a b&quot; appears more often than &quot;ab&quot; in the training corpus, whenever we encounter &quot;ab&quot; in the test result we split it into &quot;a b&quot;. These post-processing steps attempt to maximize the value of known words in the training data as well as to cope with the word segmentation inconsistencies of the training data.</Paragraph>
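<Paragraph position="2"> The two passes can be sketched as follows; this is an illustration under our own assumptions (a set-based dictionary, plain frequency tables, and a small bound on how many words may merge), not the system's actual code:

def merge_pass(words, dictionary, max_span=4):
    """Merge runs of adjoining words whose concatenation gives the
    longest possible dictionary entry."""
    out, i = [], 0
    while len(words) > i:
        merged_to = None
        for j in range(min(len(words), i + max_span), i + 1, -1):
            if ''.join(words[i:j]) in dictionary:
                merged_to = j  # longest match wins
                break
        if merged_to is None:
            out.append(words[i])
            i += 1
        else:
            out.append(''.join(words[i:merged_to]))
            i = merged_to
    return out

def split_pass(words, word_freq, bigram_freq):
    """Split a word into two consecutive words whenever the two-word
    form was seen more often in the training corpus."""
    out = []
    for w in words:
        split = None
        for k in range(1, len(w)):
            if bigram_freq.get((w[:k], w[k:]), 0) > word_freq.get(w, 0):
                split = [w[:k], w[k:]]
                break
        out.extend(split if split else [w])
    return out

# The example from the text: merge_pass(["a", "b", "c", "de"],
# {"ab", "abc", "cd"}) returns ['abc', 'de'].
</Paragraph>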
</Section> </Section> <Section position="5" start_page="127" end_page="128" type="metho"> <SectionTitle> 3 Experiments and Analysis </SectionTitle> <Paragraph position="0"> The Third International Chinese Language Processing Bakeoff includes four corpora for the word segmentation task: Academia Sinica (CKIP), City University of Hong Kong (CityU), Microsoft Research (MSRA), and University of Pennsylvania and University of Colorado, Boulder (UPUC).</Paragraph> <Paragraph position="1"> In this bakeoff, we test our system on the CityU, MSRA and UPUC corpora, following the closed track. That is, we only use training material from the training data for the particular corpus we are testing on. No other material or external knowledge is used, including part-of-speech information, externally generated word-frequency counts, Arabic and Chinese numbers, feature characters for place names, and common Chinese surnames.</Paragraph> <Section position="1" start_page="127" end_page="127" type="sub_section"> <SectionTitle> 3.1 Results on SIGHAN Bakeoff 2006 </SectionTitle> <Paragraph position="0"> To observe the effect of majority voting and the contribution of the post-processing step, the experiment is run for each corpus by first producing the outcome of majority voting and then producing the output of post-processing. In each experiment, the precision (P), recall (R), F-measure (F), out-of-vocabulary rate (OOV), OOV recall rate (R-OOV), and in-vocabulary recall rate (R-IV) are recorded. Tables 1, 2 and 3 show the scores for the CityU, MSRA and UPUC corpora, respectively. From those tables, we can see that the simple majority voting algorithm produces higher accuracy than each individual system and reasonably high F-scores overall. In addition, the post-processing step indeed helps to improve the performance.</Paragraph> </Section> <Section position="2" start_page="127" end_page="128" type="sub_section"> <SectionTitle> 3.2 Error analysis </SectionTitle> <Paragraph position="0"> The errors that occur in our system are mainly due to three factors. First, there is inconsistency between the gold segmentation and the training corpus. Although the post-processing step is intended to tackle the inconsistency problem within the training corpus, we cannot conclude that the segmentation of certain words in the gold test set always follows the convention of the training data. For example, in the MSRA training corpus, &quot;中国政府&quot; (Chinese government) is usually treated as a single word, while in the gold test set it is separated into the two words &quot;中国&quot; (Chinese) and &quot;政府&quot; (government). This inconsistency issue lowers the system performance; it affects, of course, all competing systems.</Paragraph> <Paragraph position="1"> Second, we do not have specific steps to deal with words formed with suffixes such as the agentive suffix meaning &quot;person&quot;. Compared to our system, the segmenter of Zhang (2005) contains a post-processing component for morphologically derived word recognition that addresses this problem. The lack of such a step prevents us from identifying certain types of words, such as the word for &quot;worker&quot;, as single words.</Paragraph> <Paragraph position="2"> In addition, unknown words remain troublesome because of the limited size of the training corpora. Among the unknown words we encounter are person names, numbers, dates, organization names and words transliterated from languages other than Chinese. For example, in our CityU test result, the transliterated person name &quot;Mihajlovic&quot; is incorrectly separated into two pieces. Moreover, in certain cases person names can also create ambiguity. Take the name &quot;Qiu Beifang&quot; in the UPUC test set as an example: without understanding the meaning of the whole sentence, it is difficult even for a human to determine whether it is a person name or whether it is the word for &quot;autumn&quot; followed by the word for &quot;north&quot;, with the meaning of &quot;the autumn in the north&quot;.</Paragraph> </Section> </Section> <Section position="6" start_page="128" end_page="128" type="metho"> <SectionTitle> 4 Alternative to Majority Voting </SectionTitle> <Paragraph position="0"> In designing the voting procedure, we also attempted to develop a segmentation lattice, which follows an underlying principle similar to the one applied in (Xu et al., 2005).</Paragraph> <Paragraph position="1"> In our approach, the segmentation result for an input sentence from each of our three models is transformed into an individual lattice. Each edge in a lattice is assigned a weight according to certain features, such as whether or not the word on that edge is in the dictionary. After building the three lattices, one for each model, we merge them together. Then, the shortest path, that is, the path with the minimum total weight, is extracted from the merged lattice, and the segmentation result is determined by this shortest path.</Paragraph>
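<Paragraph position="2"> A minimal sketch of this lattice alternative, with a single assumed in-dictionary feature and hand-picked weights standing in for the real feature set (all names and weight values here are our own assumptions, not the paper's):

def to_edges(words, dictionary):
    """One lattice per model: an edge spans the character positions
    of a word and carries a feature-based weight."""
    edges, pos = [], 0
    for w in words:
        weight = 0.5 if w in dictionary else 1.0  # assumed weights
        edges.append((pos, pos + len(w), w, weight))
        pos += len(w)
    return edges

def min_weight_path(edges, n):
    """Dynamic program over the merged lattice, a DAG on character
    positions 0..n; returns the word sequence on the lightest path."""
    INF = float('inf')
    best = [0.0] + [INF] * n
    back = [None] * (n + 1)
    for start in range(n):
        if best[start] == INF:
            continue
        for s, e, w, wt in edges:
            if s == start and best[e] > best[s] + wt:
                best[e] = best[s] + wt
                back[e] = (s, w)
    words, pos = [], n
    while pos:  # follow backpointers from n down to 0
        s, w = back[pos]
        words.append(w)
        pos = s
    return list(reversed(words))

def lattice_vote(segmentations, dictionary):
    """Merge the per-model lattices (union of their edges), then
    extract the minimum-weight path."""
    n = len(''.join(segmentations[0]))
    merged = {e for seg in segmentations for e in to_edges(seg, dictionary)}
    return min_weight_path(merged, n)
</Paragraph>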
<Paragraph position="3"> However, in the time we had to run our experiments on the test data, we were unable to optimize the edge weights to obtain high accuracy on a held-out set from the training corpora. Instead, we tried a simple scheme that distributed the weight for each feature uniformly; however, when tested on the shared task data from the 2005 SIGHAN bakeoff, the performance was not competitive with that of the simple majority voting method described above. As a result, we decided to abandon this approach for this year's SIGHAN bakeoff.</Paragraph> </Section> </Paper>