N-gram Based Two-Step Algorithm for Word Segmentation

2 N-gram Features

The n-gram features in this work are similar to those we used in the second bakeoff. The basic segmentation in (Kang and Lim, 2005) was performed with bigram features together with space tags, and trigram features were used as a postprocessing step to correct segmentation errors. The trigrams used for postprocessing are those that are highly biased toward one of the four tag types: we applied trigrams for error correction in which one trigram feature type accounts for 95% or more of the occurrences. (Single-character words are not as common in Korean as in Chinese, and their occurrence can be controlled through additional processing.) In addition, unigram features are used for smoothing the bigram when the bigram is not found in the training corpora. In the current work, we extended the n-gram features to trigrams.

The features are the tag-extended character n-grams _hA_i, _hA_iB_j, and A_iB_jC_k, where AB and ABC are Chinese character sequences of a bigram and a trigram, respectively, and the subscripts h, i, j, and k denote the word space tags at the character boundaries, marked as 1 (space tag) or 0 (non-space tag). For the unigram _hA_i, four types of tag features are calculated from the training corpora and their frequencies are stored. In the same way, eight types of bigram features and four types of trigram features are constructed. If we took all the inside and outside space tags of ABC, there would be sixteen types of trigram features _hA_iB_jC_k, which would cause a data sparseness problem, especially for small training corpora. To avoid this, we ignore the outside space tags h and k and construct four types of trigram features A_iB_jC.

Table 1 shows the number of n-gram features for each corpus. The total number of unique trigrams for the CITYU corpus is 1,341,612, of which 104,852 trigrams occur more than three times; that is less than one tenth of the total. An n-gram feature is thus a compound feature of <character, space-tag> combinations. Trigram classes are distinguished by their space-tag context; we simplify them to the four feature types of the C3T2 class A_iB_jC, in consideration of memory savings and the data sparseness problem.
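To make the feature inventory concrete, below is a minimal sketch of how C3T2 trigram features could be collected from a pre-segmented corpus. The function names and the toy corpus are ours, not the paper's; only the feature shape A_iB_jC (inside tags kept, outside tags h and k dropped) follows the description above.

```python
from collections import Counter

def space_tags(words):
    """Convert a segmented sentence into (characters, space tags).
    Tag t_i is 1 if a word boundary precedes character c_i, else 0.
    The paper ignores the sentence-initial tag, so callers matching
    its counts should skip position 0."""
    chars, tags = [], []
    for word in words:
        for k, ch in enumerate(word):
            chars.append(ch)
            tags.append(1 if k == 0 else 0)
    return chars, tags

def c3t2_features(words):
    """Yield C3T2 features A_i B_j C: a character trigram with the
    two inside space tags kept and the outside tags dropped to
    limit data sparseness."""
    chars, tags = space_tags(words)
    for n in range(len(chars) - 2):
        #      A         i            B             j            C
        yield (chars[n], tags[n + 1], chars[n + 1], tags[n + 2], chars[n + 2])

# Feature frequencies over a toy corpus of pre-segmented sentences.
corpus = [["word", "segmentation"], ["word", "spacing"]]
counts = Counter(f for sent in corpus for f in c3t2_features(sent))
```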
3 Word Segmentation Algorithm

Word segmentation is defined as choosing the best space-tag sequence for a sentence. We write CnTm for an n-gram feature class, where Cn refers to the number of characters and Tm to the number of space tags. According to this notation, the extended bigram _hA_iB_j and the extended unigram _hA_i are expressed as C2T3 and C1T2, respectively. More specifically, at each character position the algorithm determines a space tag '0' or '1' by using the word-spacing features.

3.1 The Features

We investigated a two-step algorithm that determines the space tag at each character position of a sentence using context-dependent n-gram features. It is based on the assumption that a space tag depends on the left and right character context together with the space tags it accompanies. Let t_i denote the space tag at position i of the character sequence c_1 ... c_n. In our previous work (Lim and Kang, 2005), the n-gram features (a) and (b), the tag-extended unigram (C1T2) and bigram (C2T3) features, were used to determine the space tag t_i. In this work, the core n-gram feature is the C3T2 class of trigram features, c_{i-1} t_i c_i t_{i+1} c_{i+1}. In addition, a simple character trigram with no space tags, c_{i-1} c_i c_{i+1}, is used.

Extended n-gram features with space tags are effective when the left or right tag is already fixed. Suppose, for example, that t_{i+1} is definitely set to 0 in a bigram context t_i c_i t_{i+1} c_{i+1}; the extended bigram then directly constrains t_i. However, none of the space tags are fixed at the beginning, so simple character n-gram features with no space tags must be used first.

3.2 Two-step Algorithm

The basic idea of our method is to cross-check the n-gram features at each space position by using the three trigram features that cover it. For a character sequence, the space tag t_i is located before the character c_i, not after it as is common in other tagging problems such as POS tagging. Simple n-grams with no space tags are calculated from the extended n-grams.

Because none of the space tags are fixed at the beginning, word segmentation is performed in two steps. In the first step, simple n-gram features are applied with the strong threshold values in Table 2. The space tags with high confidence are determined, and the remaining space tags are set in the next step.

In the second step, extended bigram features are applied wherever the left or right space tag was fixed in the first step; otherwise, the simple bigram probability is applied. In this step, the features are applied with weak threshold values to the tags that were not determined in the first step. Considering that the average length of Chinese words is about 1.6 characters, the threshold values are raised or lowered accordingly.

In the final step, error correction is performed with a 4-gram error-correction dictionary. It is constructed by running the segmenter on the training corpus and comparing the result to the answer; the error-correction data format is a 4-gram. If a 4-gram from the input matches a dictionary entry, its space tag is set unconditionally as specified in the 4-gram dictionary.
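The following is a compact sketch of the two-step tag assignment described above, assuming probability estimates are supplied from the n-gram counts. The helper signatures (p_simple, p_extended) and the threshold values are illustrative; the paper's actual thresholds in Table 2 are not reproduced here.

```python
UNSET = -1

def segment(chars, p_simple, p_extended,
            strong=(0.05, 0.95), weak=(0.35, 0.65)):
    """Two-step space-tag assignment (a sketch of the paper's idea).

    p_simple(chars, i)         -> P(t_i = 1) from tagless n-grams
    p_extended(chars, tags, i) -> P(t_i = 1) from extended n-grams
                                  conditioned on fixed neighbor tags
    """
    lo_s, hi_s = strong
    lo_w, hi_w = weak
    tags = [UNSET] * len(chars)
    tags[0] = 1  # a word always begins the sentence

    # Step 1: simple n-gram features with strong thresholds; only
    # high-confidence positions are decided here.
    for i in range(1, len(chars)):
        p = p_simple(chars, i)
        if p >= hi_s:
            tags[i] = 1
        elif p <= lo_s:
            tags[i] = 0

    # Step 2: extended features become usable once a left or right
    # neighbor tag is fixed; weak thresholds settle the rest.
    for i in range(1, len(chars)):
        if tags[i] != UNSET:
            continue
        neighbor_fixed = tags[i - 1] != UNSET or \
            (i + 1 < len(chars) and tags[i + 1] != UNSET)
        p = p_extended(chars, tags, i) if neighbor_fixed \
            else p_simple(chars, i)
        if p >= hi_w:
            tags[i] = 1
        elif p <= lo_w:
            tags[i] = 0
        else:
            # within the weak band: fall back to the simple estimate
            tags[i] = 1 if p_simple(chars, i) >= 0.5 else 0
    return tags

# With a dummy estimator that always returns 0.5, step 1 decides
# nothing and every remaining tag defaults to 1 in step 2.
tags = segment(list("abcd"), lambda c, i: 0.5, lambda c, t, i: 0.5)
```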
4 Experimental Results

We evaluated our system on the closed task on all four corpora. Table 3 shows the final results in bakeoff 2006. Threshold values were optimized for each training corpus. We expect that R_oov would improve if unknown-word processing were performed; R_iv could also improve if a lexicon were applied to correct segmentation errors. (For comparison with the Chinese figure above, the average length of Korean words is 3.2 characters.)

4.1 Step-by-step Analysis

To analyze the effectiveness of each step, we counted the number of space positions sentence by sentence. If the number of characters in a sentence is n, then the number of space positions is (n-1), because we ignore the first tag. As stated in Section 3, we assumed that trigrams with space-tag information would determine most of the space tags. Table 5 shows the application rate with strong threshold values: as expected, around 93.8%~95.9% of all space tags are set in step 1, with an error rate of 1.5%~2.8%.

Table 6 shows the application rate of n-grams with weak threshold values in step 2, where the space tags not determined in step 1 are set. The error rate in step 2 is 24.3%~30.1%.

4.2 4-gram Error Correction

We examined the effectiveness of 4-gram error correction. The number of 4-grams extracted from the training corpora is about 10,000 to 15,000. We counted the number of space tags modified by the 4-gram error-correction dictionary: Table 7 shows the number of modified space tags and the negative effects of 4-gram error correction. Table 8 shows the results before error correction; compared with the final results in Table 3, its F-measure is slightly lower.
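Below is a sketch of how such an error-correction dictionary could be built and applied. The exact 4-gram window is not recoverable from the text; the window c_{i-2}..c_{i+1} around tag t_i is our assumption. The paper only states that the dictionary is built by comparing the segmenter's training-corpus output to the answer, and that matching entries overwrite tags unconditionally.

```python
def collect_errors(chars, pred_tags, gold_tags):
    """Yield (4-gram context, gold tag) pairs at positions where the
    segmenter disagrees with the answer.  The window choice is a
    hypothetical stand-in for the paper's 4-gram format."""
    for i in range(2, len(chars) - 1):
        if pred_tags[i] != gold_tags[i]:
            yield "".join(chars[i - 2:i + 2]), gold_tags[i]

def build_dict(error_pairs):
    """Keep only contexts whose gold tag is unambiguous, so that a
    match always implies a single correction."""
    seen = {}
    for ctx, tag in error_pairs:
        if ctx in seen and seen[ctx] != tag:
            seen[ctx] = None  # conflicting evidence: drop below
        else:
            seen.setdefault(ctx, tag)
    return {c: t for c, t in seen.items() if t is not None}

def apply_corrections(chars, tags, table):
    """Overwrite tags unconditionally wherever a dictionary 4-gram
    matches, as the paper specifies."""
    for i in range(2, len(chars) - 1):
        ctx = "".join(chars[i - 2:i + 2])
        if ctx in table:
            tags[i] = table[ctx]
    return tags
```

Because matches overwrite tags unconditionally, an entry can also introduce new errors on unseen text, which is the negative effect measured in Table 7.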