<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1104">
  <Title>Adaptive Compression-based Approach for Chinese Pinyin Input</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Experiment and Result
</SectionTitle>
    <Paragraph position="0"> We use 220MB People Daily (91-95) as the training corpus and 58M People Daily (96) and stories download from Internet (400K) as the test corpus.</Paragraph>
    <Paragraph position="1"> We used SRILM language tools (Stolcke, 2002) to collect trigram counts and applied modified Kneser-Ney smoothing method to build the language model. Then we used disambig to translate Pinyin to Chinese characters. In PPM model we used the same count data collected by SRILM tools. We chose a trie structure to store the symbol and count. Adaptive PPM model updates the counts during Pinyin input. It is similar to a cache model (Kuhn and De Mori, 1990). We tested both static and adaptive PPM models on test corpus. PPM models run twice faster than SRILM tool disambig. It took 20 hours to translate Pinyin (People Daily 96) to character on a Sparc with two CPUs(900Mhz) using SRILM tools. The following Table 3 shows the results in terms of character error rate. People Daily(96) is the same domain as the training corpus. Results obtained testing on People Daily are consistently much better than Stories. Static PPM is a little worse than modified Kneser-Ney smoothing method. Adaptive PPM model testing on large corpus is better than small corpus as it takes time to adapt to the new model.</Paragraph>
  </Section>
class="xml-element"></Paper>