<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2032">
  <Title>Mostly-Unsupervised Statistical Segmentation of Japanese: Applications to Kanji</Title>
  <Section position="3" start_page="241" end_page="242" type="intro">
    <SectionTitle>
2 Algorithm
</SectionTitle>
    <Paragraph position="0"> Our algorithm employs counts of character n-grams in an unsegmented corpus to make segmentation decisions. We illustrate its use with an example (see Figure 2).</Paragraph>
    <Paragraph position="1"> Let "A B C D W X Y Z" represent an eight-kanji sequence. To decide whether there should be a word boundary between D and W, we check whether n-grams that are adjacent to the proposed boundary, such as the 4-grams s1 = "A B C D" and s2 = "W X Y Z", tend to be more frequent than n-grams that straddle it, such as the 4-gram t1 = "B C D W". If so, we have evidence of a word boundary between D and W, since there seems to be relatively little cohesion between the characters on opposite sides of this gap.</Paragraph>
    <Paragraph position="2"> The n-gram orders used as evidence in the segmentation decision are specified by the set N. For instance, if N = {4} in our example, then we pose the six questions of the form "Is #(s_i) > #(t_j)?", where #(x) denotes the number of occurrences of x in the (unsegmented) training corpus. If N = {2, 4}, then two more questions ("Is #(C D) > #(D W)?" and "Is #(W X) > #(D W)?") are added.</Paragraph>
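The questions above can be enumerated mechanically. The following Python sketch is illustrative only: the toy corpus, the helper name count, and the boundary index k are assumptions of this sketch, not the authors' code. It prints the 4-gram and 2-gram questions for the boundary between D and W.

    # Illustrative sketch only: toy corpus and helper names are assumptions,
    # not taken from the paper.
    corpus = "ABCDWXYZ" * 3 + "BCDWABCD"   # hypothetical unsegmented training text

    def count(ngram, text=corpus):
        # #(x): number of (possibly overlapping) occurrences of ngram in text
        return sum(1 for i in range(len(text) - len(ngram) + 1)
                   if text[i:i + len(ngram)] == ngram)

    seq = "ABCDWXYZ"   # the eight-character example
    k = 4              # proposed boundary between D (index 3) and W (index 4)

    for n in (2, 4):   # N = {2, 4}
        s1, s2 = seq[k - n:k], seq[k:k + n]                       # non-straddling n-grams
        straddling = [seq[k - n + j:k + j] for j in range(1, n)]  # t_1 .. t_{n-1}
        for s in (s1, s2):
            for t in straddling:
                print(f"Is #({s}) > #({t})?  {count(s) > count(t)}")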
    <Paragraph position="3"> More formally, let s^n_1 and s^n_2 be the non-straddling n-grams just to the left and right of location k, respectively, and let t^n_j be the straddling n-gram with j characters to the right of location k.</Paragraph>
    <Paragraph position="5"> - are the non-straddling n-grams s1 and s2 more frequent than the straddling n-grams t1, t2, and t3? Let I_>(y, z) be an indicator function that is 1 when y > z, and 0 otherwise. (Note that we do not take into account the magnitude of the difference between the two frequencies; see section 5 for discussion.) In order to compensate for the fact that there are more n-gram questions than (n - 1)-gram questions, we calculate the fraction of affirmative answers separately for each n in N:</Paragraph>
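A minimal sketch of the per-order score, reconstructed from the description above; the helper count is the hypothetical counting function from the previous sketch, and normalizing by 2(n - 1), the number of questions for order n, is one reading of "fraction of affirmative answers".

    def indicator_gt(y, z):
        # I_>(y, z): 1 when y > z, 0 otherwise (magnitudes are ignored)
        return 1 if y > z else 0

    def v_n(seq, k, n, count):
        # Fraction of affirmative answers among the 2 * (n - 1) questions
        # posed for order n at location k. Assumes n >= 2 and that k leaves
        # room for a full n-gram on each side.
        s1, s2 = seq[k - n:k], seq[k:k + n]                       # non-straddling
        straddling = [seq[k - n + j:k + j] for j in range(1, n)]  # t_1 .. t_{n-1}
        yes = sum(indicator_gt(count(s), count(t))
                  for s in (s1, s2) for t in straddling)
        return yes / (2 * (n - 1))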
    <Paragraph position="7"> Then, we average the contributions of each n-gram order:</Paragraph>
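Averaging those per-order fractions over the orders in N gives the score for a location; another small sketch under the same assumptions:

    def v_N(seq, k, N, count):
        # Average of the per-order fractions over all n in N
        return sum(v_n(seq, k, n, count) for n in N) / len(N)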
    <Paragraph position="9"> After v_N(k) is computed for every location, boundaries are placed at all locations k such that either:</Paragraph>
    <Paragraph position="11"> The second condition is necessary to allow for single-character words (see Figure 3). Note that it also controls the granularity of the segmentation: low thresholds encourage shorter segments.</Paragraph>
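The two conditions themselves appear in an equation not reproduced here. Based on the surrounding text (a local-maximum condition plus a threshold condition on the parameter t introduced below), a boundary-placement loop might look like the following sketch; this is a reconstruction, not a quotation of the authors' formula.

    def place_boundaries(seq, N, t, count):
        # Candidate locations that leave room for the longest n-gram on each side
        n_max = max(N)
        locations = range(n_max, len(seq) - n_max + 1)
        v = {k: v_N(seq, k, N, count) for k in locations}

        boundaries = set()
        for k in locations:
            left = v.get(k - 1, float("-inf"))
            right = v.get(k + 1, float("-inf"))
            is_local_max = v[k] > left and v[k] > right
            if is_local_max or v[k] >= t:     # threshold condition permits
                boundaries.add(k)             # single-character words
            # a low t places more boundaries, hence shorter segments
        return boundaries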
    <Paragraph position="12"> Both the count acquisition and the testing phase are efficient. Computing n-gram statistics for all possible values of n simultaneously can be done in O(m log m) time using suffix arrays, where m is the training corpus size (Manber and Myers, 1993; Nagao and Mori, 1994). However, if the set N of n-gram orders is known in advance, conceptually simpler algorithms suffice.</Paragraph>
    <Paragraph position="13"> [Figure 3 caption fragment: "... other three by the local maximum condition."]</Paragraph>
    <Paragraph position="14"> Memory allocation for count tables can be significantly reduced by omitting n-grams occurring only once and assuming the count of unseen n-grams to be one. In the application phase, the algorithm is clearly linear in the test corpus size if |N| is treated as a constant.</Paragraph>
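For a fixed N, the "conceptually simpler" route and the memory trick described above might be realized as follows; again an illustrative sketch, not the authors' implementation.

    from collections import Counter

    def build_count_tables(training_text, N):
        # One table per n in N; n-grams occurring only once are omitted
        tables = {}
        for n in N:
            counts = Counter(training_text[i:i + n]
                             for i in range(len(training_text) - n + 1))
            tables[n] = {g: c for g, c in counts.items() if c > 1}
        return tables

    def lookup(tables, ngram):
        # Unseen (or pruned) n-grams are assumed to have count one
        return tables[len(ngram)].get(ngram, 1)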
    <Paragraph position="15"> Finally, we note that some pre-segmented data is necessary in order to set the parameters N and t.</Paragraph>
    <Paragraph position="16"> However, as described below, very little such data was required to get good performance; we therefore deem our algorithm to be "mostly unsupervised".</Paragraph>
  </Section>
</Paper>