<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2032">
  <Title>Mostly-Unsupervised Statistical Segmentation of Japanese: Applications to Kanji</Title>
  <Section position="5" start_page="243" end_page="245" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> We report the average results over the five test sets using the optimal parameter settings for the corresponding training sets (we tried all nonempty subsets of {2, 3, 4, 5, 6} for the set of n-gram orders N and all values in {.05, .1, .15, ..., 1} for the threshold t).7 In all performance graphs, the &amp;quot;error bars&amp;quot; represent one standard deviation. The results for Chasen and Juman reflect the lexicon additions described in section 3.2.</Paragraph>
    <Paragraph position="1"> 7 For simplicity, ties were deterministically broken by preferring smaller sizes of N, shorter n-grams in N, and larger threshold values, in that order.</Paragraph>
    <Paragraph position="2"> Word and morpheme accuracy The standard metrics in word segmentation are word precision and recall. Treating a proposed segmentation as a non-nested bracketing (e.g., &amp;quot;|AB|C|&amp;quot; corresponds to the bracketing &amp;quot;[AB][C]&amp;quot;), word precision (P) is defined as the percentage of proposed brackets that exactly match word-level brackets in the annotation; word recall (R) is the percentage of word-level annotation brackets that are proposed by the algorithm in question; and word F combines precision and recall: F = 2PR/(P + R).</Paragraph>
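The word metrics above can be sketched in Python (an illustrative rendering, not the paper's own code), treating each segmentation as a set of character-span brackets:

```python
# Sketch of word precision, recall, and F-measure for non-nested
# bracketings. A segmentation like |AB|C| is given as ["AB", "C"].
def brackets(words):
    """Map a word list to its set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def word_prf(proposed, annotated):
    """Word precision, recall, and F = 2PR/(P + R)."""
    prop, gold = brackets(proposed), brackets(annotated)
    correct = len(prop.intersection(gold))
    p = correct / len(prop) if prop else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For example, proposing |AB|C| against the annotation |A|BC| matches no brackets, so precision, recall, and F are all zero.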
    <Paragraph position="3"> One problem with using word metrics is that morphological analyzers are designed to produce morpheme-level segments. To compensate, we altered the segmentations produced by Juman and Chasen by concatenating stems and affixes, as identified by the part-of-speech information the analyzers provided. (We also measured morpheme accuracy, as described below.) Figures 4 and 8 show word accuracy for Chasen, Juman, and our algorithm for parameter settings optimizing word precision, recall, and F-measure rates. Our algorithm achieves 5.27% higher precision and 0.25% better F-measure accuracy than Juman, and does even better (8.8% and 4.22%, respectively) with respect to Chasen. The recall performance falls (barely) between that of Juman and that of Chasen.</Paragraph>
    <Paragraph position="4"> As noted above, Juman and Chasen were designed to produce morpheme-level segmentations.</Paragraph>
    <Paragraph position="5"> We therefore also measured morpheme precision, recall, and F measure, all defined analogously to their word counterparts.</Paragraph>
    <Paragraph position="6"> Figure 5 shows our morpheme accuracy results.</Paragraph>
    <Paragraph position="7"> We see that our algorithm can achieve better recall (by 6.51%) and F-measure (by 1.38%) than Juman, and does better than Chasen by an even wider margin (11.18% and 5.39%, respectively). Precision was generally worse than that of the morphological analyzers. Word accuracy is a standard performance metric, but it is clearly very sensitive to the test annotation. Morpheme accuracy suffers from the same problem. Indeed, the authors of Juman and Chasen may well have constructed their standard dictionaries using different notions of word and morpheme than the definitions we used in annotating the data. We therefore developed two new, more robust metrics to measure the number of proposed brackets that would be incorrect with respect to any reasonable annotation (for instance, &amp;quot;[[data] [base]]&amp;quot; is a reasonable annotation because &amp;quot;data base&amp;quot; and &amp;quot;database&amp;quot; are interchangeable).</Paragraph>
    <Paragraph position="8"> Our novel metrics account for two types of errors. The first, a crossing bracket, is a proposed bracket that overlaps but is not contained within an annotation bracket (Grishman et al., 1992). Crossing brackets cannot coexist with annotation brackets, and it is unlikely that another human would create such brackets. The second type of error, a morpheme-dividing bracket, subdivides a morpheme-level annotation bracket; by definition, such a bracket results in a loss of meaning. See Figure 6 for some examples.</Paragraph>
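These two error types can be sketched as span tests (an illustrative Python rendering, assuming each bracket is a (start, end) character span; the crossing test follows the standard "overlap with neither containment" formulation):

```python
# Sketch of the two error types over (start, end) character spans.
def overlaps(b, a):
    """True if spans b and a share at least one character."""
    return b[1] > a[0] and a[1] > b[0]

def contains(outer, inner):
    """True if span outer fully covers span inner."""
    return inner[0] >= outer[0] and outer[1] >= inner[1]

def is_crossing(b, annotation):
    """b overlaps some annotation bracket, with neither containing the other."""
    return any(overlaps(b, a) and not contains(b, a) and not contains(a, b)
               for a in annotation)

def is_morpheme_dividing(b, morphemes):
    """b falls strictly inside a morpheme-level bracket, splitting it."""
    return any(contains(m, b) and b != m for m in morphemes)
```

Note that a bracket spanning several whole annotation brackets (a compound, in the paper's terms) is not counted as crossing, since it contains each bracket it overlaps.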
    <Paragraph position="9"> We define a compatible bracket as a proposed bracket that is neither crossing nor morpheme-dividing. The compatible brackets rate is simply the compatible brackets precision. Note that this metric accounts for different levels of segmentation simultaneously, which is beneficial because the granularity of Chasen and Juman's segmentation varies from morpheme level to compound word level (by our definition). For instance, well-known university names are treated as single segments by virtue of being in the default lexicon, whereas other university names are divided into the name and the word &amp;quot;university&amp;quot;. Using the compatible brackets rate, both segmentations can be counted as correct.</Paragraph>
    <Paragraph position="10"> We also use the all-compatible brackets rate, which is the fraction of sequences for which all the proposed brackets are compatible. Intuitively, this function measures the ease with which a human could correct the output of the segmentation algorithm: if the all-compatible brackets rate is high, then the errors are concentrated in relatively few sequences; if it is low, then a human doing post-processing would have to correct many sequences. Figure 7 shows the compatible and all-compatible brackets rates. Our algorithm does better on both metrics (for instance, when F-measure is optimized, by 2.16% and 1.9%, respectively, in comparison to Chasen, and by 3.15% and 4.96%, respectively, in comparison to Juman), regardless of the training optimization function (word precision, recall, or F); we cannot directly optimize the compatible brackets rate because &amp;quot;perfect&amp;quot; performance is possible simply by making the entire sequence a single segment.</Paragraph>
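A minimal sketch of both rates (hypothetical Python, assuming each sequence supplies its proposed brackets, word-level annotation brackets, and morpheme-level brackets as (start, end) character spans):

```python
# Sketch of the compatible and all-compatible brackets rates.
def is_compatible(b, words, morphemes):
    """b neither crosses a word bracket nor subdivides a morpheme bracket."""
    def overlaps(a):
        return b[1] > a[0] and a[1] > b[0]
    def contains(outer, inner):
        return inner[0] >= outer[0] and outer[1] >= inner[1]
    crossing = any(overlaps(a) and not contains(b, a) and not contains(a, b)
                   for a in words)
    dividing = any(contains(m, b) and b != m for m in morphemes)
    return not (crossing or dividing)

def compatible_rates(sequences):
    """sequences: list of (proposed, words, morphemes) span lists.
    Returns (compatible brackets rate, all-compatible brackets rate)."""
    total = good = all_ok = 0
    for proposed, words, morphemes in sequences:
        ok = [is_compatible(b, words, morphemes) for b in proposed]
        total += len(ok)
        good += sum(ok)
        all_ok += all(ok)
    return good / total, all_ok / len(sequences)
```

The first rate pools brackets over all sequences; the second counts whole sequences, which is why it tracks the post-editing effort described above.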
    <Paragraph position="11"> [Figure caption fragment: compatible and all-compatible bracket rates when word accuracy is optimized.]</Paragraph>
    <Paragraph position="12"> [Fragment, likely a table or figure footnote: ... before discarding overlaps with the test sets.]</Paragraph>
    <Section position="1" start_page="245" end_page="245" type="sub_section">
      <SectionTitle>
4.1 Discussion
</SectionTitle>
      <Paragraph position="0"> Minimal human effort is needed. In contrast to our mostly-unsupervised method, morphological analyzers need a lexicon and grammar rules built using human expertise. The workload of creating dictionaries on the order of hundreds of thousands of words (the size of Chasen's and Juman's default lexicons) is clearly much larger than that of annotating the small parameter-training sets for our algorithm. We also avoid the need to segment a large amount of parameter-training data because our algorithm draws almost all its information from an unsegmented corpus. Indeed, the only human effort involved in our algorithm is pre-segmenting the five 50-sequence parameter-training sets, which took only 42 minutes. In contrast, previously proposed supervised approaches have used segmented training sets ranging from 1000-5000 sentences (Kashioka et al., 1998) to 190,000 sentences (Nagata, 1996a).</Paragraph>
      <Paragraph position="1"> To test how much annotated training data is actually necessary, we experimented with using minuscule parameter-training sets: five sets of only five strings each (from which any sequences repeated in the test data were discarded). It took only 4 minutes to perform the hand segmentation in this case. As shown in Figure 8, relative word performance was not degraded, and was sometimes even slightly better. In fact, from the last column of Figure 8 we see that even if our algorithm has access to only five annotated sequences when Juman has access to ten times as many, we still achieve better precision and better F measure.</Paragraph>
      <Paragraph position="2"> Both the local maximum and threshold conditions contribute. In our algorithm, a location k is deemed a word boundary if VN(k) is either (1) a local maximum or (2) at least as big as the threshold t. It is natural to ask whether we really need two conditions, or whether just one would suffice.</Paragraph>
      <Paragraph position="3"> We therefore studied whether optimal performance could be achieved using only one of the conditions. Figure 9 shows that in fact both contribute to producing good segmentations. Indeed, in some cases, both are needed to achieve the best performance; also, each condition when used in isolation yields suboptimal performance with respect to some performance metrics.</Paragraph>
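The decision rule can be sketched as follows (illustrative Python; the score array v stands in for the paper's VN(k) values, whose computation from n-gram counts is not shown here):

```python
# Sketch of the boundary rule: position k is a word boundary when its
# score is a local maximum or is at least the threshold t.
def boundaries(v, t):
    cuts = []
    for k in range(len(v)):
        left = v[k - 1] if k > 0 else float("-inf")
        right = v[k + 1] if len(v) > k + 1 else float("-inf")
        local_max = v[k] > left and v[k] > right
        if local_max or v[k] >= t:
            cuts.append(k)
    return cuts
```

Lowering t adds boundaries at positions that are not local maxima, which is how the two conditions interact in the ablation above.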
      <Paragraph position="4"> [Figure 9 fragment: conditions achieving optimal performance; M = local maximum condition, T = threshold condition.]
accuracy    optimize precision    optimize recall    optimize F-measure
word        M                     M &amp; T             M
morpheme    M &amp; T                 T                  T</Paragraph>
    </Section>
  </Section>
</Paper>