<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1041">
  <Title>A Trainable Rule-based Algorithm for Word Segmentation</Title>
  <Section position="4" start_page="322" end_page="326" type="metho">
    <SectionTitle>
3 Results
</SectionTitle>
    <Paragraph position="0"> With the above algorithm in place, we can use the training data to produce a rule sequence to augment an initial segmentation approximation in order to obtain a better approximation of the desired segmentation. Furthermore, since all the rules are purely character-based, a sequence can be learned for any character set and thus any language. We used our rule-based algorithm to improve the word segmentation rate for several segmentation algorithms in</Paragraph>
    <Section position="1" start_page="322" end_page="322" type="sub_section">
      <SectionTitle>
3.1 Evaluation of segmentation
</SectionTitle>
      <Paragraph position="0"> Despite the number of papers on the topic, the evaluation and comparison of existing segmentation algorithms is virtually impossible. In addition to the problem of multiple correct segmentations of the same texts, the comparison of algorithms is difficult because of the lack of a single metric for reporting scores. Two common measures of performance are recall and precision, where recall is defined as the percent of words in the hand-segmented text identified by the segmentation algorithm, and precision is defined as the percentage of words returned by the algorithm that also occurred in the hand-segmented text in the same position. The component recall and precision scores are then used to calculate an F-measure (Rijsbergen, 1979), where F = (1 +/~)PR/(~P + R). In this paper we will report all scores as a balanced F-measure (precision and recall weighted equally) with/~ = 1, such that</Paragraph>
      <Paragraph position="2"/>
    </Section>
    <Section position="2" start_page="322" end_page="324" type="sub_section">
      <SectionTitle>
3.2 Chinese
</SectionTitle>
      <Paragraph position="0"> For our Chinese experiments, the training set consisted of 2000 sentences (60187 words) from a Xinhun news agency corpus; the test set was a separate set of 560 sentences (18783 words) from the same corpus. 5 We ran four experiments using this corpus, with four different algorithms providing the starting point for the learning of the segmentation transformations. In each case, the rule sequence learned from the training set resulted in a significant improvement in the segmentation of the test set.</Paragraph>
      <Paragraph position="1">  A very simple initial segmentation for Chinese is to consider each character a distinct word. Since the average word length is quite short in Chinese, with most words containing only 1 or 2 characters, 6 this character-as-word segmentation correctly identified many one-character words and produced an initial segmentation score of F=40.3. While this is a low segmentation score, this segmentation algorithm identifies enough words to provide a reasonable initial segmentation approximation. In fact, the CAW algorithm alone has been shown (Buckley et al., 1996; Broglio et al., 1996) to be adequate to be used successfully in Chinese information retrieval.</Paragraph>
      <Paragraph position="2"> Our algorithm learned 5903 transformations from the 2000 sentence training set. The 5903 transformations applied to the test set improved the score from F=40.3 to 78.1, a 63.3% reduction in the error</Paragraph>
      <Paragraph position="4"> ~J and ~K can be any character except J and K, respectively.</Paragraph>
      <Paragraph position="5"> rate. This is a very surprising and encouraging result, in that, from a very naive initial approximation using no lexicon except that implicit from the training data, our rule-based algorithm is able to produce a series of transformations with a high segmentation accuracy.</Paragraph>
      <Paragraph position="6">  algorithm A common approach to word segmentation is to use a variation of the maximum matching algorithm, frequently referred to as the &amp;quot;greedy algorithm.&amp;quot; The greedy algorithm starts at the first character in a text and, using a word list for the language being segmented, attempts to find the longest word in the list starting with that character. If a word is found, the maximum-matching algorithm marks a boundary at the end of the longest word, then begins the same longest match search starting at the character following the match. If no match is found in the word list, the greedy algorithm simply skips that character and begins the search starting at the next character. In this manner, an initial segmentation can be obtained that is more informed than a simple character-as-word approach. We applied the maximum matching algorithm to the test set using a list of 57472 Chinese words from the NMSU CHSEG segmenter (described in the next section).</Paragraph>
      <Paragraph position="7"> This greedy algorithm produced an initial score of F=64.4.</Paragraph>
      <Paragraph position="8"> A sequence of 2897 transformations was learned * from the training set; applied to the test set, they improved the score from F=64.4 to 84.9, a 57.8% error reduction. From a simple Chinese word list, the rule-based algorithm was thus able to produce asegmentation score comparable to segmentation algorithms developed with a large amount of domain knowledge (as we will see in the next section).</Paragraph>
      <Paragraph position="9"> This score was improved further when combining the character-as-word (CAW) and the maximum matching algorithms. In the maximum matching algorithm described above, when a sequence of characters occurred in the text, and no subset of the sequence was present in the word list, the entire sequence was treated as a single word. This often resulted in words containing 10 or more characters, which is very unlikely in Chinese. In this experiment, when such a sequence of characters was encountered, each of the characters was treated as a separate word, as in the CAW algorithm above.</Paragraph>
      <Paragraph position="10"> This variation of the greedy algorithm, using the same list of 57472 words, produced an initial score of F=82.9. A sequence of 2450 transformations was learned from the training set; applied to the test set, they improved the score from F=82.9 to 87.7, a 28.1% error reduction. The score produced using this variation of the maximum matching algorithm combined with a rule sequence (87.7) is nearly equal to the score produced by the NMSU segmenter segmenter (87.9) discussed in the next section.</Paragraph>
      <Paragraph position="11">  The previous three experiments showed that our rule sequence algorithm can produce excellent segmentation results given very simple initial segmentation algorithms. However, assisting in the adaptation of an existing algorithm to different segmentation schemes, as discussed in Section 1, would most likely be performed with an already accurate, fullydeveloped algorithm. In this experiment we demon- null strate that our algorithm can also improve the output of such a system.</Paragraph>
      <Paragraph position="12"> The Chinese segmenter CHSEG developed at the Computing Research Laboratory at New Mexico State University is a complete system for high-accuracy Chinese segmentation (Jin, 1994). In addition to an initial segmentation module that finds words in a text based on a list of Chinese words, CHSEG additionally contains specific modules for recognizing idiomatic expressions, derived words, Chinese person names, and foreign proper names.</Paragraph>
      <Paragraph position="13"> The accuracy of CHSEG on an 8.6MB corpus has been independently reported as F=84.0 (Ponte and Croft, 1996). (For reference, Ponte and Croft report scores of F=86.1 and 83.6 for their probabilistic Chinese segmentation algorithms trained on over 100MB of data.) On our test set, CHSEG produced a segmentation score of F=87.9. Our rule-based algorithm learned a sequence of 1755 transformations from the training set; applied to the test set, they improved the score from 87.9 to 89.6, a 14.0% reduction in the error rate. Our rule-based algorithm is thus able to produce an improvement to an existing high-performance system. null Table 1 shows a summary of the four Chinese experiments. null</Paragraph>
    </Section>
    <Section position="3" start_page="324" end_page="324" type="sub_section">
      <SectionTitle>
3.3 Thai
</SectionTitle>
      <Paragraph position="0"> While Thai is also an unsegmented language, the Thai writing system is alphabetic and the average word length is greater than Chinese. ~ We would therefore expect that our character-based transformations would not work as well with Thai, since a context of more than one character is necessary in many cases to make many segmentation decisions in alphabetic languages.</Paragraph>
      <Paragraph position="1"> The Thai corpus consisted of texts s from the Thai News Agency via NECTEC in Thailand. For our experiment, the training set consisted of 3367 sentences (40937 words); the test set was a separate set of 1245 sentences (13724 words) from the same corpus.</Paragraph>
      <Paragraph position="2"> The initial segmentation was performed using the maximum matching algorithm, with a lexicon of 9933 Thai words from the word separation filter in ctte~,a Thai language Latex package. This greedy algorithm gave an initial segmentation score of F=48.2 on the test set.</Paragraph>
      <Paragraph position="3">  Our rule-based algorithm learned a sequence of 731 transformations which improved the score from 48.2 to 63.6, a 29.7% error reduction. While the alphabetic system is obviously harder to segment, we still see a significant reduction in the segmenter error rate using the transformation-based algorithm. Nevertheless, it is doubtful that a segmentation with a score of 63.6 would be useful in too many applications, and this result will need to be significantly improved.</Paragraph>
    </Section>
    <Section position="4" start_page="324" end_page="325" type="sub_section">
      <SectionTitle>
3.4 De-segmented English
</SectionTitle>
      <Paragraph position="0"> Although English is not an unsegmented language, the writing system is alphabetic like Thai and the average word length is similar. 9 Since English language resources (e.g. word lists and morphological analyzers) are more readily available, it is instructive to experiment with a de-segmented English corpus, that is, English texts in which the spaces have been removed and word boundaries are not explicitly indicated. The following shows an example of an English sentence and its de-segmented version: About 20,000 years ago the last ice age ended.</Paragraph>
      <Paragraph position="1"> About20,000yearsagothelasticeageended.</Paragraph>
      <Paragraph position="2"> The results of such experiments can help us determine which resources need to be compiled in order to develop a high-accuracy segmentation algorithm in unsegmented alphabetic languages such as Thai. In addition, we are also able to provide a more detailed error analysis of the English segmentation (since the author can read English but not Thai).</Paragraph>
      <Paragraph position="3"> Our English experiments were performed using a corpus of texts from the Wall Street Journal (WSJ).</Paragraph>
      <Paragraph position="4"> The training set consisted of 2675 sentences (64632 words) in which all the spaces had been removed; the test set was a separate set of 700 sentences (16318 words) from the same corpus (also with all spaces removed).</Paragraph>
      <Paragraph position="5">  For an initial experiment, segmentation was performed using the maximum matching algorithm, with a large lexicon of 34272 English words compiled from the WSJ. ldeg In contrast to the low initial Thai score, the greedy algorithm gave an initial English segmentation score of F=73.2. Our rule-based algorithm learned a sequence of 800 transformations,  which improved the score from 73.2 to 79.0, a 21.6% error reduction.</Paragraph>
      <Paragraph position="6"> The difference in the greedy scores for English and Thai demonstrates the dependence on the word list in the greedy algorithm. For example, an experiment in which we randomly removed half of the words from the English list reduced the performance of the greedy algorithm from 73.2 to 32.3; although this reduced English word list was nearly twice the size of the Thai word list (17136 vs. 9939), the longest match segmentation utilizing the list was much lower (32.3 vs. 48.2). Successive experiments in which we removed different random sets of half the words from the original list resulted in greedy algorithm performance of 39.2, 35.1, and 35.5. Yet, despite the disparity in initial segmentation scores, the transformation sequences effect a significant error reduction in all cases, which indicates that the transformation sequences are effectively able to compensate (to some extent) for weaknesses in the lexicon. Table 2 provides a summary of the results using the greedy algorithm for each of the three languages.</Paragraph>
    </Section>
    <Section position="5" start_page="325" end_page="325" type="sub_section">
      <SectionTitle>
3.4.2 Basic morphological segmentation experiment
</SectionTitle>
      <Paragraph position="0"> As mentioned above, lexical resources are more readily available for English than for Thai. We can use these resources to provide an informed initial segmentation approximation separate from the greedy algorithm. Using our native knowledge of English as well as a short list of common English prefixes and suffixes, we developed a simple algorithm for initial segmentation of English which placed boundaries after any of the suffixes and before any of the prefixes, as well as segmenting punctuation characters. In most cases, this simple approach was able to locate only one of the two necessary boundaries for recognizing full words, and the initial score was understandably low, F=29.8. Nevertheless, even from this flawed initial approximation, our rule-based algorithm learned a sequence of 632 transformations which nearly doubled the word recall, improving the score from 29.8 to 53.3, a 33.5% error reduction.</Paragraph>
    </Section>
    <Section position="6" start_page="325" end_page="326" type="sub_section">
      <SectionTitle>
3.4.3 Amount of training data
</SectionTitle>
      <Paragraph position="0"> Since we had a large amount of English data, we also performed a classic experiment to determine the effect the amount of training data had on the ability of the rule sequences to improve segmentation.</Paragraph>
      <Paragraph position="1"> We started with a training set only slightly larger than the test set, 872 sentences, and repeated the maximum matching experiment described in Section 3.4.1. We then incrementally increased the amount of training data and repeated the experiment. The results, summarized in Table 3, clearly indicate (not surprisingly) that more training sentences produce both a longer rule sequence and a larger error reduction in the test data.</Paragraph>
      <Paragraph position="2">  Upon inspection of the English segmentation errors produced by both the maximum matching algorithm and the learned transformation sequences, one major category of errors became clear. Most apparent was the fact that the limited context transformations were unable to recover from many errors introduced by the naive maximum matching algorithm.</Paragraph>
      <Paragraph position="3"> For example, because the greedy algorithm always looks for the longest string of characters which can be a word, given the character sequence &amp;quot;economicsituation&amp;quot;, the greedy algorithm first recognized &amp;quot;economics&amp;quot; and several shorter words, segmenting the sequence as &amp;quot;economics it u at io n&amp;quot;. Since our transformations consider only a single character of context, the learning algorithm was unable to patch the smaller segments back together to produce the desired output &amp;quot;economic situation&amp;quot;. In some cases,  the transformations were able to recover some of the word, but were rarely able to produce the full desired output. For example, in one case the greedy algorithm segmented &amp;quot;humanactivity&amp;quot; as &amp;quot;humana c ti vi ty&amp;quot;. The rule sequence was able to transform this into &amp;quot;humana ctivity&amp;quot;, but was not able to produce the desired &amp;quot;human activity&amp;quot;. This suggests that both the greedy algorithm and the transformation learning algorithm need to have a more global word model, with the ability to recognize the impact of placing a boundary on the longer sequences of characters surrounding that point.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="326" end_page="326" type="metho">
    <SectionTitle>
4 Discussion
</SectionTitle>
    <Paragraph position="0"> The results of these experiments demonstrate that a transformation-based rule sequence, supplementing a rudimentary initial approximation, can produce accurate segmentation. In addition, they are able to improve the performance of a wide range of segmentation algorithms, without requiring expensive knowledge engineering. Learning the rule sequences can be achieved in a few hours and requires no language-specific knowledge. As discussed in Section 1, this simple algorithm could be used to adapt the output of an existing segmentation algorithm to different segmentation schemes as well as compensating for incomplete segmenter lexica, without requiring modifications to segmenters themselves.</Paragraph>
    <Paragraph position="1"> The rule-based algorithm we developed to improve word segmentation is very effective for segmenting Chinese; in fact, the rule sequences combined with a very simple initial segmentation, such as that from a maximum matching algorithm, produce performance comparable to manually-developed segmenters. As demonstrated by the experiment with the NMSU segmenter, the rule sequence algorithm can also be used to improve the output of an already highly-accurate segmenter, thus producing one of the best segmentation results reported in the literature. null In addition to the excellent overall results in Chinese segmentation, we also showed the rule sequence algorithm to be very effective in improving segmentation in Thai, an alphabetic language. While the scores themselves were not as high as the Chinese performance, the error reduction was nevertheless very high, which is encouraging considering the simple rule syntax used. The current state of our algorithm, in which only three characters are considered at a time, will understandably perform better with a language like Chinese than with an alphabetic language like Thai, where average word length is much greater. The simple syntax described in Section 2.2 can, however, be easily extended to consider larger contexts to the left and the right of boundaries; this extension would necessarily come at a corresponding cost in learning speed since the size of the rule space searched during training would grow accordingly. In the future, we plan to further investigate the application of our rule-based algorithm to alphabetic languages.</Paragraph>
  </Section>
class="xml-element"></Paper>