<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1019">
  <Title>An Iterative Algorithm to Build Chinese Language Models</Title>
  <Section position="4" start_page="140" end_page="141" type="intro">
    <SectionTitle>
4 Experiments and Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="140" end_page="141" type="sub_section">
      <SectionTitle>
4.1 Segmentation Accuracy
</SectionTitle>
      <Paragraph position="0"> Our first attempt is to see how accurate the segmentation algorithm proposed in section 2 is. To this end, we split the whole data set ~ into two parts, half for building LMs and half reserved for testing. The trigram model used in this experiment is the standard deleted interpolation model described in (Jelinek et al., 1992) with a vocabulary of 20K words.</Paragraph>
      <Paragraph position="1"> Since we lack an objective criterion to measure the accuracy of a segmentation system, we ask three ~The corpus has about 5 million characters and is coarsely pre-segmented.</Paragraph>
      <Paragraph position="2"> native speakers to segment manually 100 sentences picked randomly from the test set and compare them with segmentations by machine. The result is summed in table 2, where ORG stands for the original segmentation, P1, P2 and P3 for three human subjects, and TRI and UNI stand for the segmentations generated by trigram LM and unigram LM respectively. The number reported here is the arithmetic average of recall and precision, as was used in n_~ (Sproat et al., 1994), i.e., 1/2(~-~ + n2), where nc is the number of common words in both segmentations, nl and n2 are the number of words in each of the segmentations.</Paragraph>
      <Paragraph position="3">  We can make a few remarks about the result in table 2. First of all, it is interesting to note that the agreement of segmentations among human subjects is roughly at the same level of that between human subjects and machine. This confirms what reported in (Sproat et al., 1994). The major disagreement for human subjects comes from compound words, phrases and suffices. Since we don't give any specific instructions to human subjects, one of them tends to group consistently phrases as words because he was implicitly using semantics as his segmentation criterion. For example, he segments thesentence 3 dao4 jial li2 chil dun4 fan4(see table 3) as two words dao4 j+-al l+-2(go home) and chil dun4 :fem4(have a meal) because the two &amp;quot;words&amp;quot; are clearly two semantic units. The other two subjects and machine segment it as dao4 / jial li2/ chil/ dtm4 / fern4.</Paragraph>
      <Paragraph position="4"> Chinese has very limited morphology (Spencer, 1991) in that most grammatical concepts are conveyed by separate words and not by morphological processes. The limited morphology includes some ending morphemes to represent tenses of verbs, and this is another source of disagreement. For example, for the partial sentence zuo4 were2 le, where le functions as labeling the verb zuo4 wa.u2 as &amp;quot;perfect&amp;quot; tense, some subjects tend to segment it as two words zuo4 ~an2/ le while the other treat it as one single word.</Paragraph>
      <Paragraph position="5"> Second, the agreement of each of the subjects with either the original, trigram, or unigram segmentation is quite high (see columns 2, 6, and 7 in Table 2) and appears to be specific to the subject.</Paragraph>
      <Paragraph position="6">  Third, it seems puzzling that the trigram LM agrees with the original segmentation better than a unigram model, but gives a worse result when compared with manual segmentations. However, since the LMs are trained using the presegmented data, the trigram model tends to keep the original segmentation because it takes the preceding two words into account while the unigram model is less restricted to deviate from the original segmentation. In other words, if trained with &amp;quot;cleanly&amp;quot; segmented data, a trigram model is more likely to produce a better segmentation since it tends to preserve the nature of training data.</Paragraph>
    </Section>
    <Section position="2" start_page="141" end_page="141" type="sub_section">
      <SectionTitle>
4.2 Experiment of the iterative procedure
</SectionTitle>
      <Paragraph position="0"> In addition to the 5 million characters of segmented text, we had unsegmented data from various sources reaching about 13 million characters. We applied our iterative algorithm to that corpus.</Paragraph>
      <Paragraph position="1"> Table 4 shows the figure of merit of the resulting segmentation of the 100 sentence test set described earlier. After one iteration, the agreement with the original segmentation decreased by 3 percentage points, while the agreement with the human segmentation increased by less than one percentage point. We ran our computation intensive procedure for one iteration only. The results indicate that the impact on segmentation accuracy would be small. However, the new unsegmented corpus is a good source of automatically discovered words. A 20 examples picked randomly from about 1500 unseen words are shown in Table 5. 16 of them are reasonably good words and are listed with their translated meanings. The problematic words are marked with &amp;quot;?&amp;quot;.</Paragraph>
    </Section>
    <Section position="3" start_page="141" end_page="141" type="sub_section">
      <SectionTitle>
4.3 Perplexity of the language model
</SectionTitle>
      <Paragraph position="0"> After each segmentation, an interpolated trigram model is built, and an independent test set with 2.5 million characters is segmented and then used to measure the quality of the model. We got a perplexity 188 for a vocabulary of 80K words, and the alternating procedure has little impact on the perplexity. This can be explained by the fact that the change of segmentation is very little ( which is reflected in table reftab:accuracy-iter ) and the addition of unseen words(1.5K) to the vocabulary is also too little to affect the overall perplexity. The merit of the alternating procedure is probably its ability to detect unseen words.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>