<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1701"> <Title>Unsupervised Training for Overlapping Ambiguity Resolution in Chinese Word Segmentation</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Previous Work </SectionTitle> <Paragraph position="0"> Previous methods of resolving overlapping ambiguities can be grouped into rule-based approaches and statistical approaches.</Paragraph> <Paragraph position="1"> Maximum Matching (MM) based segmentation (Huang, 1997) can be regarded as the simplest rule-based approach: one starts from one end of the input sentence, greedily matches the longest word towards the other end, and repeats the process on the remaining unmatched character sequence until the entire sentence is processed. If the process starts at the beginning of the sentence, it is called Forward Maximum Matching (FMM). If the process starts at the end of the sentence, it is called Backward Maximum Matching (BMM). Although it is widely used due to its simplicity, MM-based segmentation performs poorly on real text.</Paragraph> <Paragraph position="2"> Zheng and Liu (1997) use a set of manually generated rules and report an accuracy of 81% on an open test set. Swen and Yu (1999) present a lexicon-based method. The basic idea is that for each entry in a lexicon, all possible ambiguity types are tagged, and for each ambiguity type, a solution strategy is used. They achieve an accuracy of 95%. Sun (1998) demonstrates that most overlapping ambiguities can be resolved without taking context information into account. He then proposes a lexicalized rule-based approach.</Paragraph> <Paragraph position="3"> His experiments show that using the 4,600 most frequent rules, 51% coverage can be achieved in an open test set.</Paragraph> <Paragraph position="4"> Statistical methods view overlapping ambiguity resolution as a search or classification task. 
For example, Liu (1997) uses a word unigram language model to search, among all possible segmentations of a Chinese character sequence, for the segmentation with the highest probability. A similar approach can be traced back to Zhang (1991). However, the method does not specifically target overlapping ambiguities, so disambiguation results are not reported. Sun (1999) presents a hybrid method that incorporates empirical rules and statistical probabilities, and reports an overall accuracy of 92%. Li (2001) defines word segmentation disambiguation as a binary classification problem. Li then uses a Support Vector Machine (SVM) with the mutual information between each Chinese character pair as a feature. The method achieves an accuracy of 92%. All the above methods utilize a supervised training procedure. However, a large manually labeled training set is not always available. To deal with this problem, unsupervised approaches have been proposed. For example, Sun (1997) detects word boundaries given an overlapping ambiguity string (OAS) using character-based statistical measures, such as mutual information and difference of t-test. He reports an accuracy of approximately 90%. In his approach, only the statistical information within 4 adjacent characters is exploited, and the lack of word-level statistics may prevent the disambiguation performance from being further improved.</Paragraph> </Section> </Paper>
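
The Forward and Backward Maximum Matching procedures described in the section above can be illustrated with a minimal Python sketch. It is not the implementation used in any of the cited work. The function names, the toy lexicon, the maximum word length, and the single-character fallback for unmatched characters are all assumptions made for illustration. The example input shows that the two directions can disagree on the same character sequence.

```python
# Minimal sketch of Forward and Backward Maximum Matching (FMM / BMM).
# The lexicon, MAX_WORD_LEN, and the single-character fallback are
# illustrative assumptions, not taken from the cited papers.

# Toy lexicon: "research", "graduate student", "life", "origin".
LEXICON = {"研究", "研究生", "生命", "起源"}
MAX_WORD_LEN = 3  # length of the longest lexicon entry


def forward_max_match(text, lexicon, max_len):
    """Segment text left to right, always taking the longest lexicon match."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len):
    """Segment text right to left, always taking the longest lexicon match."""
    words = []
    j = len(text)
    while j > 0:
        # Candidates end at position j and grow towards the left.
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                j -= size
                break
    words.reverse()  # words were collected from the end of the sentence
    return words


if __name__ == "__main__":
    sentence = "研究生命起源"
    fmm = forward_max_match(sentence, LEXICON, MAX_WORD_LEN)
    bmm = backward_max_match(sentence, LEXICON, MAX_WORD_LEN)
    print("FMM:", fmm)  # ['研究生', '命', '起源']
    print("BMM:", bmm)  # ['研究', '生命', '起源']
    print("Disagree:", fmm != bmm)
```

In this toy case the greedy forward pass commits to the three-character word first and leaves a stranded single character, while the backward pass recovers the two-character words. This is the kind of failure that makes purely greedy matching unreliable on real text.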