<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3025">
<Title>A Maximum Entropy Approach to Chinese Word Segmentation</Title>
<Section position="4" start_page="162" end_page="162" type="intro">
<SectionTitle> 2 Testing </SectionTitle>
<Paragraph position="0"> During testing, the probability of a boundary tag sequence t1 ... tn given a character sequence C1 ... Cn is determined by using the maximum entropy classifier to compute the probability that boundary tag ti is assigned to each individual character Ci. If we simply assigned each character its most probable boundary tag, the classifier could produce an invalid tag sequence (e.g., m followed by s). To rule out such sequences, we implemented a dynamic programming algorithm that considers only valid boundary tag sequences for a given input character sequence.</Paragraph>
<Paragraph position="1"> At each character position i, the algorithm considers every candidate last word that ends at position i and is K characters long (K = 1,...,20 in our experiments). To assign boundary tags to a candidate last word W of K characters, the first character of W is assigned tag b, the last character of W is assigned tag e, and the intervening characters are assigned tag m. (If W is a single-character word, its single character is assigned tag s.)</Paragraph>
<Paragraph position="2"> In this way, the dynamic programming algorithm considers only valid tag sequences.</Paragraph>
<Paragraph position="3"> After the maximum entropy classifier has segmented the text, a post-processing step corrects inconsistently segmented words of 3 or more characters. A word W is defined to be inconsistently segmented if the concatenation of 2 to 6 consecutive words elsewhere in the segmented output document matches W.
In the post-processing step, the segmentation of the characters of these consecutive words is changed so that they are segmented as a single word. To illustrate, if the concatenation of 2 consecutive words &quot;_d_7403_d_3496&quot; and &quot;_d_6868_d_2975&quot; in the segmented output document matches another word &quot;_d_7403_d_3496_d_6868_d_2975&quot;, then the 2 consecutive words &quot;_d_7403_d_3496&quot; and &quot;_d_6868_d_2975&quot; will be re-segmented as a single word &quot;_d_7403_d_3496_d_6868_d_2975&quot;.</Paragraph>
</Section>
</Paper>
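<!--
The decoding and post-processing steps described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names word_tags, segment, merge_inconsistent, toy_scorer and tag_logprob are hypothetical. The sketch assumes that a tag sequence is scored by summing per-character log-probabilities, and that post-processing runs as a single left-to-right pass that tries the longest run of words first. The text specifies neither detail. Latin letters stand in for Chinese characters.

```python
import math
from typing import Callable, List

MAX_WORD_LEN = 20  # K = 1..20, the range quoted in the text


def word_tags(k: int) -> str:
    """Boundary tags for a K-character word: s, or b + m... + e."""
    return "s" if k == 1 else "b" + "m" * (k - 2) + "e"


def segment(chars: str, tag_logprob: Callable[[int, str], float],
            max_len: int = MAX_WORD_LEN) -> List[str]:
    """Best-scoring segmentation over valid tag sequences only.

    tag_logprob(i, tag) is a hypothetical stand-in for the classifier's
    log-probability of `tag` at character position i. Summing these
    log-probabilities is an assumption of this sketch.
    """
    n = len(chars)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for chars[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for i in range(1, n + 1):
        for k in range(1, min(max_len, i) + 1):
            start = i - k
            score = best[start] + sum(
                tag_logprob(start + j, tag)
                for j, tag in enumerate(word_tags(k)))
            if score > best[i]:
                best[i], back[i] = score, start
    words = []
    i = n
    while i > 0:  # follow back-pointers from the end of the input
        words.append(chars[back[i]:i])
        i = back[i]
    return words[::-1]


def merge_inconsistent(words: List[str], min_chars: int = 3,
                       max_run: int = 6) -> List[str]:
    """Merge runs of 2..max_run consecutive words whose concatenation
    matches a word of at least min_chars characters in the same output.
    Longest-run-first, single pass: both are assumptions of this sketch.
    """
    vocab = {w for w in words if len(w) >= min_chars}
    out = []
    i = 0
    while i < len(words):
        for r in range(min(max_run, len(words) - i), 1, -1):
            joined = "".join(words[i:i + r])
            if joined in vocab:
                out.append(joined)  # re-segment the run as a single word
                i += r
                break
        else:
            out.append(words[i])    # no matching run starts here
            i += 1
    return out


def toy_scorer(preferred: str) -> Callable[[int, str], float]:
    """Toy scorer for demonstration: favours preferred[i] at position i."""
    return lambda i, tag: math.log(0.7 if tag == preferred[i] else 0.1)
```

With the toy scorer, segment("ab", toy_scorer("ms")) returns ["a", "b"]: the per-character favourites m, s form an invalid sequence, so the search falls back to the best valid one. Likewise, merge_inconsistent(["ab", "cd", "x", "abcd"]) returns ["abcd", "x", "abcd"], mirroring the two-word example in the text.
-->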