<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2032"> <Title>Mostly-Unsupervised Statistical Segmentation of Japanese: Applications to Kanji</Title> <Section position="3" start_page="241" end_page="242" type="intro"> <SectionTitle> 2 Algorithm </SectionTitle> <Paragraph position="0"> Our algorithm employs counts of character n-grams in an unsegmented corpus to make segmentation decisions. We illustrate its use with an example (see Figure 2).</Paragraph> <Paragraph position="1"> Let &quot;A B C D W X Y Z&quot; represent an eight-kanji sequence. To decide whether there should be a word boundary between D and W, we check whether n-grams that are adjacent to the proposed boundary, such as the 4-grams sl =&quot;A B C D&quot; and 82 =&quot;W X Y Z&quot;, tend to be more frequent than n-grams that straddle it, such as the 4-gram tl ---- &quot;B C D W&quot;. If so, we have evidence of a word boundary between D and W, since there seems to be relatively little cohesion between the characters on opposite sides of this gap.</Paragraph> <Paragraph position="2"> The n-gram orders used as evidence in the segmentation decision are specified by the set N. For instance, if N = {4} in our example, then we pose the six questions of the form, &quot;Is #(s~) > #(tj)?&quot;, where #(x) denotes the number of occurrences of x in the (unsegmented) training corpus. If N = {2,4}, then two more questions (Is &quot;#(C D) > #(D W)?&quot; and &quot;Is #(W X) > #(O W)?&quot;) are added.</Paragraph> <Paragraph position="3"> More formally, let s~ and 8~ be the non-straddling n-grams just to the left and right of location k, respectively, and let t~ be the straddling n-gram with j characters to the right of location k.</Paragraph> <Paragraph position="5"> - are the non-straddling n-grams 81 and 82 more frequent than the straddling n-grams tl, t2, and t3? Let I> (y, z) be an indicator function that is 1 when y > z, and 0 otherwise, 2 In order to compensate for the fact that there are more n-gram questions than (n - 1)-gram questions, we calculate the fraction of affirmative answers separately for each n in N:</Paragraph> <Paragraph position="7"> Then, we average the contributions of each n-gram order:</Paragraph> <Paragraph position="9"> After vN(k) is computed for every location, boundaries are placed at all locations ~ such that either:</Paragraph> <Paragraph position="11"> The second condition is necessary to allow for single-character words (see Figure 3). Note that it also controls the granularity of the segmentation: low thresholds encourage shorter segments.</Paragraph> <Paragraph position="12"> Both the count acquisition and the testing phase are efficient. Computing n-gram statistics for all possible values of n simultaneously can be done in O(m log m) time using suffix arrays, where m is the training corpus size (Manber and Myers, 1993; Nagao and Mori, 1994). However, if the set N of n-gram orders is known in advance, conceptually simpler algorithms suffice. Memory allocation for :Note that we do not take into account the magnitude of the difference between the two frequencies; see section 5 for discussion.</Paragraph> <Paragraph position="13"> other three by the local maximum condition.</Paragraph> <Paragraph position="14"> count tables can be significantly reduced by omitting n-grams occurring only once and assuming the count of unseen n-grams to be one. 
<Paragraph position="13"> In the application phase, the algorithm is clearly linear in the test corpus size if |N| is treated as a constant.</Paragraph>
<Paragraph position="14"> Finally, we note that some pre-segmented data is necessary in order to set the parameters N and t.</Paragraph>
<Paragraph position="15"> However, as described below, very little such data was required to get good performance; we therefore deem our algorithm to be &quot;mostly unsupervised&quot;.</Paragraph>
</Section>
</Paper>