<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1117"> <Title>A Character-net Based Chinese Text Segmentation Method</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The segmentation of Chinese texts is a key problem in Chinese information processing. Two aspects of segmentation are especially difficult: resolving ambiguity, and recognizing unknown Chinese words, that is, words not included in the lexicon, such as person names and organization names. Many algorithms have been put forward for these problems [Liu 2000]. However, the existing algorithms do not share a universal data structure: each algorithm resolves one problem and is tied to its own specific data structure. In handling these difficulties, the first step is to identify all possible candidate words for segmentation. For example, from a0a2a1a4a3a4a5a7a6a9a8a11a10a7a12a4a13 a3a4a14a16a15a18a17a4a19a9a20a22a21 these words should be obtained: a1a16a3a24a23a18a5a4a6a24a23a18a6a9a8a7a23a25a8a11a10a24a23a18a10a4a12a9a23a11a13a9a23a26a3a9a23a26a14 a15a25a23a26a17a7a19a9a27 The ambiguous string is a0a28a5a7a6a24a8a11a10 a12a29a20. There are several methods for this step. The first uses forward maximum matching, backward maximum matching, and minimum matching to find the possible word strings in the character string [Guo 1997; Sproat et al. 1996; Gu and Mao 1994; Li et al. 1991; Wang et al. 1991b; Wang et al.</Paragraph> <Paragraph position="1"> 1990]. The second is a word-finding automaton based on the Aho-Corasick algorithm [Hong-I and Lua]. The former requires three scans of the input character string. In addition, during each scan, backtracking has to be performed whenever a dictionary search fails. After that, word recognition is built on the candidates.
The second requires building a state chart and is difficult to combine with other algorithms.</Paragraph> <Paragraph position="2"> In this paper, an algorithm is proposed for this problem that uses the connection information between Chinese characters to recognize all possible candidate words in a Chinese text. The method first establishes a Chinese character-net, which is intended as a universal data structure that is easy to combine with other algorithms for Chinese text segmentation and can use different kinds of information in a Chinese text. It then identifies all possible candidate words easily.</Paragraph> </Section> </Paper>