<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1701"> <Title>Unsupervised Training for Overlapping Ambiguity Resolution in Chinese Word Segmentation</Title> <Section position="5" start_page="2" end_page="2" type="evalu"> <SectionTitle> 4 Experiments and Discussions </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.1 Settings </SectionTitle> <Paragraph position="0"> We evaluate our approach on a manually annotated test set, selected randomly from People's Daily news articles of the year 1997 and containing approximately 460,000 Chinese characters, or 247,000 words. In the test set, 5,759 longest OASs are identified. Our lexicon contains 93,700 entries.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.2 OAS Distribution </SectionTitle> <Paragraph position="0"> We first investigate the distribution of the different types of OAS in the test set. In our approach, the performance upper bound (i.e., oracle accuracy) cannot reach 100%, because not all correct segmentations of OASs can be generated by FMM and BMM segmentation. It is therefore useful to know to what extent our approach can cover the problem.</Paragraph> <Paragraph position="1"> The results are shown in Table 2. We denote the entire OAS data set as C, and divide it into two subsets A and B according to the type of OAS. It can be seen from the table that in data set A</Paragraph> <Paragraph position="3"> (where O f and O b agree), MM segmentation achieves 98.8% accuracy. Meanwhile, in data set B</Paragraph> <Paragraph position="5"> (where O f and O b disagree), the oracle recall of the candidates proposed by FMM and BMM is 95.7% (97.2% on the entire data set C). These statistics are very close to those reported in Huang (1997).</Paragraph> <Paragraph position="6"> Here are some examples of overlapping ambiguities that cannot be covered by our approach.
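As background for the discussion in this section, the FMM and BMM procedures that generate the O f and O b candidates can be sketched as below. This is a minimal illustration, not the paper's implementation; the lexicon and the sample string are abstract stand-ins (single letters in place of Chinese characters).

```python
def fmm(text, lexicon, max_len=4):
    """Forward maximum matching: greedily take the longest
    lexicon word starting at the current position."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            # Fall back to a single character when nothing matches.
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

def bmm(text, lexicon, max_len=4):
    """Backward maximum matching: same idea, scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if text[i:j] in lexicon or i == j - 1:
                words.insert(0, text[i:j])
                j = i
                break
    return words

# A classic overlapping ambiguity string (OAS): for "abc" with both
# "ab" and "bc" in the lexicon, FMM and BMM disagree, yielding the
# two candidates that play the roles of O f and O b.
lexicon = {"ab", "bc", "a", "b", "c"}
print(fmm("abc", lexicon))  # ['ab', 'c']
print(bmm("abc", lexicon))  # ['a', 'bc']
```

When the two segmenters agree (data set A above), the common output is almost always correct; the interesting cases are exactly those where they disagree.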
For errors resulting from O shi4-ji4, g1998g10628 |g3324 |g1002g13438) serves as a good example. These two types of errors are usually composed of several words and require a much more complicated search process to determine the final correct output. Since the number of such errors is very small, they are not the target of our approach in this paper.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.3 Experimental Results of Ensemble of Naive Bayesian Classifiers </SectionTitle> <Paragraph position="0"> The classifiers are trained on the People's Daily news articles of the year 2000, which contain over 24 million characters. The training data is tokenized.</Paragraph> <Paragraph position="1"> That is, all OASs with O f ≠ O b</Paragraph> <Paragraph position="3"> are replaced with the token [GAP]. After tokenization, there are 16,078,000 tokens in the training data, of which 203,329 are [GAP], i.e., 1.26% of the entire training data set. Then a word trigram language model is constructed on the tokenized corpus, and each Bayesian classifier is built given the language model.</Paragraph> <Paragraph position="5"> Table 3 shows the accuracy of each classifier on data set B. The performance of the ensemble based on majority vote is 89.79% on data set B, and the overall accuracy on C is 94.13%. The ensemble consistently outperforms any of its members. Classifiers with both left and right context features perform better than the others because they are capable of segmenting some of the context-sensitive OASs. For example, contextual information is necessary to segment the OAS &quot;g11487g2500 g990&quot; (kan4-tai2-shang4, on the stand) correctly in both of the following sentences:</Paragraph> <Paragraph position="7"> (Stand in the highest stand) Both Pedersen (2000) and Brill (1998) found that the ultimate success of an ensemble depends on the assumption that the classifiers to be combined make complementary errors.
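The majority-vote combination described above can be sketched as follows. The nine classifiers are reduced to illustrative stubs here; in the actual system each classifier scores the O f and O b candidates with the [GAP]-tokenized trigram language model over its own left/right context-feature window.

```python
from collections import Counter

def majority_vote(classifiers, oas, context):
    """Resolve one OAS by majority vote over an ensemble.

    Each classifier is a callable returning "Of" (FMM candidate)
    or "Ob" (BMM candidate).  With an odd number of voters, as with
    the 9 classifiers used here, a binary vote cannot tie."""
    votes = Counter(clf(oas, context) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Illustrative stand-ins for the 9 Naive Bayesian classifiers; the
# v=v default binds each stub's fixed answer at definition time.
stub_votes = ["Of", "Of", "Ob", "Of", "Of", "Ob", "Of", "Ob", "Of"]
classifiers = [lambda oas, ctx, v=v: v for v in stub_votes]
print(majority_vote(classifiers, "abc", ("left ctx", "right ctx")))  # Of
```

The ensemble helps precisely when members err on different inputs, which is why the complementary-errors assumption below matters.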
We investigate this assumption in our experiments, and estimate the oracle accuracy of our approach. The results show that only 6.0% (180 out of 2,996) of the OASs in data set B are classified incorrectly by all 9 classifiers. In addition, we can see from Table 2 that 130 of these 180 errors cannot be corrected at all, because neither O f nor O b is the correct segmentation. Therefore, the oracle accuracy of the ensemble is 94.0%, which is very close to 95.7%, the theoretical upper bound of our approach on data set B described in Section 4.2. However, our majority-vote-based ensemble only achieves an accuracy close to 90%. This analysis thus suggests that further improvements can be made by using more powerful ensemble strategies.</Paragraph> </Section> <Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.4 Lexicalized Rule-Based OAS Disambiguation </SectionTitle> <Paragraph position="0"> We also conduct a series of experiments to evaluate the performance of a widely used lexicalized rule-based OAS disambiguation approach. As reported by Sun (1998) and Li (2001), over 90% of OASs can be disambiguated in a context-free way. Therefore, simply collecting a large number of correctly segmented OASs whose segmentation is independent of context would yield fairly good performance.</Paragraph> <Paragraph position="1"> We first collected 730,000 OASs with O f ≠ O b from 20 years of the People's Daily corpus, which contains about 650 million characters. Then the approximately 47,000 most frequently occurring OASs were extracted. For each extracted OAS, 20 sentences containing it were randomly selected from the corpus, and the correct segmentation was manually labeled. From the labeled data, 41,000 lexicalized disambiguation rules were finally extracted, each fixing an OAS to whichever MM segmentation (O f or O b) was correct in at least 95% of its labeled sentences; this proportion is very close to that reported in Sun (1998). Here is a sample rule extracted: g1461g5527g3332 => g1461g5527 |g3332.
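Rule lookup with a maximum-matching backup can be sketched as follows. The rule table and strings are toy placeholders standing in for the 41,000 extracted rules, not the actual data.

```python
def segment_oas(oas, rules, backoff_segmenter):
    """Lexicalized rule lookup with backoff: return the stored
    segmentation when a rule exists for this OAS, otherwise fall
    back to a maximum-matching segmenter (FMM or BMM)."""
    if oas in rules:
        return rules[oas]
    return backoff_segmenter(oas)

# Toy rule table in the spirit of "oas => seg1 | seg2": each entry
# records the segmentation that dominated the labeled samples.
rules = {"abc": ["ab", "c"]}
fmm_stub = lambda s: [s]  # stand-in for a real FMM segmenter
print(segment_oas("abc", rules, fmm_stub))  # ['ab', 'c']
print(segment_oas("xyz", rules, fmm_stub))  # ['xyz']
```

The backoff choice (FMM vs. BMM) only matters for OASs not covered by any rule, which is where the two variants in Table 4 differ.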
It means that among the 20 sentences that contain the character sequence &quot;g1461g5527g3332&quot;, at least 19 are segmented as &quot;g1461g5527 |g3332&quot;. The performance of the lexicalized rule-based approach is shown in Table 4, where for comparison we also include the performance of using FMM and BMM alone. In Table 4, Rule + FMM means that if no rule is applicable to an OAS, the FMM segmentation is used. Similarly, Rule + BMM means that the BMM segmentation is used as the backup. We can see from Table 4 that the rule-based systems significantly outperform their FMM and BMM counterparts, but do not perform as well as our method, even when no context features are used. This is because the rules cover only about 76% of the OASs in the test set, with a precision of 95%, and FMM or BMM performs poorly on the remaining OASs. Although the precision of these lexicalized rules is high, the room for further improvement is limited. For example, to achieve a higher coverage, say 90%, much more manually labeled training data (i.e.</Paragraph> <Paragraph position="2"> 81,000 OASs) would be needed.</Paragraph> </Section> </Section> </Paper>