File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/02/w02-1117_metho.xml
Size: 6,639 bytes
Last Modified: 2025-10-06 14:08:05
<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1117"> <Title>A Character-net Based Chinese Text Segmentation Method</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Data Structure and Model </SectionTitle> <Paragraph position="0"> A Chinese character is treated as a node, and a connection between two characters is treated as an edge. If a character is the final character of a Chinese word, it is treated as a control node, and the weight of the edge leading into it is 1.</Paragraph> <Paragraph position="1"> The connection is defined as follows:</Paragraph> <Paragraph position="3"> In the structure, id is the sequence number of a connection edge; char1 is the first character node; char2 is the second character node; weight is the weight of the edge: if char1 and char2 are in a Chinese word and char2 is not the final character of the word, weight equals 0; if char2 is the final character of a word (char2 is a control node), weight equals 1.</Paragraph> <Paragraph position="4"> wlen is the length of the word; if char2 is not a control node, wlen is zero. wpos is the part of speech of the word; if char2 is not a control node, wpos is null. bchar is the first character of the word; if char2 is not a control node, bchar is null. route is the id of the preceding connection, used when the word is longer than two characters. 
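The field descriptions above can be illustrated with a minimal sketch, given here only under stated assumptions. The class name Connection, the helper connections_for_word, the example word and its part-of-speech tag are all hypothetical and do not come from the paper. The sketch also assumes that route stays empty for the first connection of a word, and it does not try to share connections between words that have the same prefix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Connection:
    id: int                # sequence number of the connection edge
    char1: str             # first character node
    char2: str             # second character node
    weight: int            # 1 if char2 is a control node (ends a word), else 0
    wlen: int              # word length; 0 unless char2 is a control node
    wpos: Optional[str]    # part of speech; None unless char2 is a control node
    bchar: Optional[str]   # first character of the word; None unless control node
    route: Optional[int]   # id of the preceding connection in the same word


def connections_for_word(word: str, pos: str, next_id: int) -> list:
    """Build the chain of connections for one word of two or more characters.

    Assumption: every word gets its own chain; no edges are shared.
    """
    conns = []
    prev_id = None
    last = len(word) - 1
    for k in range(1, len(word)):
        is_control = (k == last)  # char2 is the final character of the word
        conns.append(Connection(
            id=next_id,
            char1=word[k - 1],
            char2=word[k],
            weight=1 if is_control else 0,
            wlen=len(word) if is_control else 0,
            wpos=pos if is_control else None,
            bchar=word[0] if is_control else None,
            route=prev_id,
        ))
        prev_id = next_id
        next_id += 1
    return conns
```

For a three-character word, the call connections_for_word("计算机", "n", 1) would produce two connections under these assumptions: the first has weight 0 and no route, and the second is the control connection with weight 1, wlen 3, bchar "计" and route 1.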
For example, consider these words: &quot;a0a2a1 &quot;a21</Paragraph> <Paragraph position="6"/> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Algorithm </SectionTitle> <Paragraph position="0"> Based on the Chinese character net described in Section 2, the algorithm that finds all possible word candidates in a Chinese text is as follows:
Begin the algorithm
Variable
CString strSrc; // the source string
CString strRes; // the result: all possible word candidates
int i; // index of the current character in the source string
int iFind; // position of the final character of the last formed word
int len; // the length of the source string
Char str1[5]; // the current first character
Char str2[5]; // the current second character
BOOL Find=0; // flag: whether the current routes are inside words
int Frec=0; // flag: whether the route is inside a word
while (i &lt; len-1) {
get the first current character into str1 from the source string;
get the second current character into str2 from the source string;
select the connection between str1 and str2</Paragraph> <Paragraph position="2"> first character of the current formed word; if(its route matches the former right</Paragraph> <Paragraph position="4"> process the middle characters (between iFind and j) as single characters; add the candidate word to the result string strRes; set iFind to current value; } else set Frec = -1; reduce the current route from the route list;</Paragraph> <Paragraph position="6"> process the current character as single character; set iFind += 2; } else if(not find connection) { process the current character as single</Paragraph> <Paragraph position="8"> example is the following Chinese character string</Paragraph> <Paragraph position="10"> segmentation strings.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiment </SectionTitle> <Paragraph position="0"> Based on a 
basic Chinese word dictionary obtained from Beijing University, which contains 61135 Chinese words, we extract the connections between each pair of adjacent characters and build a Chinese character net with 76259 connections. The number of records thus increases by 24.7% ((76259-61135)/61135).</Paragraph> <Paragraph position="1"> In the character net, there are 2857 connections that share the same char1 and the same char2. In a general Chinese machine-readable lexicon, only about 12% of words are longer than three Chinese characters, about 70% of words have length 4, and about 15% of words have length 6. So, for the algorithm in this paper, the structure of the character net works well, and conflicts in selecting among connections with the same char1 and the same char2 seldom need to be processed. About 1500 Chinese characters can be processed per second.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Analysis of the Algorithm </SectionTitle> <Paragraph position="0"> In Chinese, the meaning of a character is atomic and basic, and the meaning of most Chinese words can be derived from the characters in the word; that is to say, the meaning of a Chinese word is compound or derived. This paper addresses the difficulties of segmenting Chinese texts on the basis of this idea. The information in a Chinese text is divided into three kinds: (1) about characters, (2) about connections between characters, and (3) about Chinese words, as shown in Fig. 2.</Paragraph> <Paragraph position="1"> [Fig. 2: connection between each two characters] In Fig. 2, two characters that are related to each other compose a connection. A connection, together with zero or more further connections, composes a Chinese word. 
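As a rough illustration of how connections compose into a word, the following minimal sketch walks the route back-pointers from a control connection to the first character of the word. The dictionary layout, the function name word_from_control and the example word are assumptions made for illustration only; the paper does not give this procedure.

```python
def word_from_control(conns_by_id, control_id):
    """Rebuild a word by following route pointers back from its control connection.

    conns_by_id maps a connection id to a dict with the keys
    "char1", "char2", "weight" and "route" (hypothetical layout).
    """
    conn = conns_by_id[control_id]
    chars = [conn["char2"]]          # the control node is the last character
    while conn is not None:
        chars.append(conn["char1"])  # collect characters from right to left
        conn = conns_by_id.get(conn["route"])  # None once route is empty
    return "".join(reversed(chars))


# Hypothetical net holding a single three-character word.
net = {
    1: {"char1": "计", "char2": "算", "weight": 0, "route": None},
    2: {"char1": "算", "char2": "机", "weight": 1, "route": 1},
}
word = word_from_control(net, 2)
```

Here connection 2 is the control connection (weight 1) and connection 1 is the "zero or more further connections" part, reached through route.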
A Chinese word is composed of one or several Chinese characters.</Paragraph> <Paragraph position="2"> For a character, the following information is available: (1) the probability that it is used in a person name, (2) whether it is a single-character word, etc. For a connection, the information is as described in Sections 2 and 3.</Paragraph> <Paragraph position="3"> For a word, the following information is available: (1) whether it is used as a prefix or a suffix (such as &quot;a0 a97 &quot;, &quot;a1a2a97 &quot;, &quot;a2a4a3 &quot;, &quot;a76a6a5 &quot;, &quot;a7 a120 &quot;); (2) mutual information between words, etc.</Paragraph> <Paragraph position="4"> In segmenting Chinese texts, we proceed character by character. First, the information about each character is processed; in this step we can, for example, obtain the possible person names. Second, the information about the connections between each pair of characters is obtained and processed using the Chinese character net described in this paper, which yields all possible candidate words in a Chinese text. Third, we use the information about words and between words to resolve segmentation ambiguity and to identify unknown words such as person names, place names and organization names.</Paragraph> <Paragraph position="5"> So the algorithm in this paper is easily combined with other existing algorithms.</Paragraph> </Section></Paper>