<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2206"> <Title>Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data</Title> <Section position="3" start_page="0" end_page="1265" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> A Chinese word consists of one or more characters. Chinese text is written as an unbroken sequence of characters; words are not delimited by spaces as they are in English. Word segmentation is therefore the first step in any Chinese information processing system [1].</Paragraph> <Paragraph position="1"> Almost all methods for Chinese word segmentation developed so far, both statistical and rule-based, have exploited two kinds of important resources: a lexicon and hand-crafted linguistic resources (manually segmented and tagged corpora, knowledge about unknown words, and linguistic rules) [1,2,3,5,6,8,9,10].</Paragraph> <Section position="1" start_page="0" end_page="1265" type="sub_section"> <SectionTitle> </SectionTitle> <Paragraph position="0"> This work was supported in part by the National Natural Science Foundation of China under grant No. 69433010.</Paragraph> <Paragraph position="1"> The lexicon is usually used to find segmentation candidates for input sentences, while the linguistic resources are used to resolve segmentation ambiguities. Preparing these resources (a well-defined lexicon, a widely accepted tag set, a consistently annotated corpus, etc.) is very hard because of the particularities of Chinese, and it is time-consuming. 
Furthermore, even if the lexicon is large enough and the annotated corpus is balanced and huge, the word segmenter will still face problems of data incompleteness, sparseness, and bias when it is used in different domains.</Paragraph> <Paragraph position="2"> An important issue in designing Chinese segmenters is thus how to reduce human supervision as much as possible.</Paragraph> <Paragraph position="3"> Palmer (1997) constructed a Chinese segmenter that used only a manually segmented corpus (without referring to any lexicon). A transformation-based algorithm was used to learn segmentation rules automatically from the segmented corpus. Sproat and Shih (1993) went further and proposed a method using neither a lexicon nor a segmented corpus: for input texts, character pairs with high mutual information are simply grouped into words. Although this strategy is very simple and has many limitations (e.g., it can only handle two-character words), it is fully automatic: the mutual information between characters can be estimated directly from a raw Chinese corpus.</Paragraph> <Paragraph position="4"> Following the line of Sproat and Shih, we present a new algorithm for segmenting Chinese texts that depends on neither a lexicon nor any hand-crafted resource. All data necessary for our system is derived from the raw corpus. The system may be viewed as a stand-alone segmenter in some applications (preliminary experiments show that its accuracy is acceptable); nevertheless, our main purpose is to study how, and how well, the work can be done by machine under extreme conditions, that is, without any human assistance. 
We believe that the performance of existing Chinese segmenters, that is, their ability to deal with segmentation ambiguities and unknown words and to adapt to new domains, will improve to some degree if what is gained from this approach is properly incorporated into those systems.</Paragraph> </Section> </Section> </Paper>