File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/p06-2108_intro.xml

Size: 5,906 bytes

Last Modified: 2025-10-06 14:03:47

<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2108">
  <Title>Using Word Support Model to Improve Chinese Input System</Title>
  <Section position="3" start_page="0" end_page="842" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> According to (Becker, 1985; Huang, 1985; Gu et al., 1991; Chung, 1993; Kuo, 1995; Fu et al., 1996; Lee et al., 1997; Hsu et al., 1999; Chen et al., 2000; Tsai and Hsu, 2002; Gao et al., 2002; Lee, 2003; Tsai, 2005), Chinese input methods (i.e. Chinese input systems) can be classified into two types: (1) keyboard-based approaches, including phonetic and pinyin based (Chang et al., 1991; Hsu et al., 1993; Hsu, 1994; Hsu et al., 1999; Kuo, 1995; Lua and Gan, 1992), arbitrary-code based (Fan et al., 1988) and structure-scheme based (Huang, 1985) methods; and (2) non-keyboard-based approaches, including optical character recognition (OCR) (Chung, 1993), online handwriting (Lee et al., 1997) and speech recognition (Fu et al., 1996; Chen et al., 2000). Currently, the most popular Chinese input systems use the phonetic and pinyin based approach, because Chinese people are taught to write the phonetic and pinyin syllables of each Chinese character in primary school.</Paragraph>
    <Paragraph position="1"> In Chinese, a word can be a mono-syllabic word, such as &quot;Shu (mouse)&quot;, a bisyllabic word, such as &quot;Dai Shu (kangaroo)&quot;, or a multi-syllabic word, such as &quot;Mi No Shu (Mickey mouse)&quot;. The corresponding phonetic or pinyin syllables of a Chinese word are called its syllable-word; for example, &quot;dai4 shu3&quot; is the pinyin syllable-word of &quot;Dai Shu (kangaroo)&quot;. According to our computation, the {minimum, maximum, average} numbers of words per distinct mono-syllable-word and poly-syllable-word (including bisyllable-word and multi-syllable-word) in the CKIP dictionary (Chinese Knowledge Information Processing Group, 1995) are {1, 28, 2.8} and {1, 7, 1.1}, respectively. The CKIP dictionary is one of the most commonly used Chinese dictionaries in the research field of Chinese natural language processing (NLP). Since the problem space of syllable-to-word (STW) conversion is much smaller than that of syllable-to-character (STC) conversion, most pinyin-based Chinese input systems (Hsu, 1994; Hsu et al., 1999; Tsai and Hsu, 2002; Gao et al., 2002; Microsoft Research Center in Beijing; Tsai, 2005) focus on STW conversion. In addition, STW conversion is the main task of Chinese language processing in typical Chinese speech recognition systems (Fu et al., 1996; Lee et al., 1993; Chien et al., 1993; Su et al., 1992).</Paragraph>
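The {minimum, maximum, average} statistic described above can be sketched as follows. This is a minimal illustration, not the authors' computation: the lexicon is a small hypothetical stand-in for the CKIP dictionary, and the names `TOY_LEXICON` and `homophone_stats` are assumptions introduced here.

```python
from statistics import mean

# Toy lexicon (hypothetical, NOT the CKIP dictionary): each key is a tonal
# pinyin syllable-word, each value lists the words that share that reading.
TOY_LEXICON = {
    "shu3": ["鼠", "數", "屬"],        # mono-syllable-word, three homophones
    "dai4": ["袋", "代", "帶", "待"],  # mono-syllable-word, four homophones
    "dai4 shu3": ["袋鼠"],             # bisyllable-word, unambiguous here
    "dai4 biao3": ["代表"],            # bisyllable-word, unambiguous here
}


def homophone_stats(lexicon, polysyllabic):
    """Return (min, max, average) number of words per distinct syllable-word.

    polysyllabic=False selects mono-syllable-words only;
    polysyllabic=True selects syllable-words with two or more syllables.
    """
    counts = [
        len(words)
        for syllables, words in lexicon.items()
        if (len(syllables.split()) != 1) == polysyllabic
    ]
    return min(counts), max(counts), mean(counts)


mono_stats = homophone_stats(TOY_LEXICON, polysyllabic=False)
poly_stats = homophone_stats(TOY_LEXICON, polysyllabic=True)
print("mono-syllable-words:", mono_stats)
print("poly-syllable-words:", poly_stats)
```

Even in this toy data the mono-syllable-words are more ambiguous than the poly-syllable-words, which is the property that makes word-level conversion the smaller problem.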
    <Paragraph position="2"> As per (Chung, 1993; Fong and Chung, 1994; Tsai and Hsu, 2002; Gao et al., 2002; Lee, 2003; Tsai, 2005), homophone selection and syllable-word segmentation are two critical problems in developing a Chinese input system. Incorrect homophone selection and syllable-word segmentation directly reduce STW conversion accuracy. Conventionally, there are two approaches to resolving these two problems: (1) the linguistic approach, based on syntax parsing, semantic template matching and contextual information (Hsu, 1994; Fu et al., 1996; Hsu et al., 1999; Kuo, 1995; Tsai and Hsu, 2002); and (2) the statistical approach, based on n-gram models where n is usually 2, i.e. the bigram model (Lin and Tsai, 1987; Gu et al., 1991; Fu et al., 1996; Ho et al., 1997; Sproat, 1990; Gao et al., 2002; Lee, 2003). According to these studies (Hsu, 1994; Tsai and Hsu, 2002; Gao et al., 2002; Lee, 2003; Tsai, 2005), the linguistic approach requires considerable effort in designing effective syntax rules, semantic templates or contextual information; however, it makes it easier than the statistical approach to understand why such a system makes a mistake. The statistical language model (SLM) used in the statistical approach requires less effort and has been widely adopted in commercial Chinese input systems.</Paragraph>
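To make the statistical approach concrete, the following is a minimal sketch of homophone selection with a bigram model. It is not the BiGram system or the WSM discussed in this paper: the candidate lists and counts are hypothetical toy data, add-one smoothing is an assumption chosen for simplicity, syllable-word segmentation is assumed to be given, and the search is brute force rather than dynamic programming.

```python
import math
from itertools import product

# Hypothetical toy data, invented for illustration only.
CANDIDATES = {
    "yi1": ["一", "醫"],
    "zhi1": ["隻", "知"],
    "dai4 shu3": ["袋鼠"],
}
BIGRAM_COUNTS = {("一", "隻"): 8, ("隻", "袋鼠"): 5, ("醫", "知"): 1}
UNIGRAM_COUNTS = {"一": 10, "醫": 4, "隻": 9, "知": 6, "袋鼠": 5}
VOCAB_SIZE = len(UNIGRAM_COUNTS)


def bigram_logprob(prev, word):
    """Add-one smoothed log P(word | prev) from the toy counts."""
    numerator = BIGRAM_COUNTS.get((prev, word), 0) + 1
    denominator = UNIGRAM_COUNTS.get(prev, 0) + VOCAB_SIZE
    return math.log(numerator / denominator)


def convert(syllable_words):
    """Return the candidate word sequence with the highest bigram score."""
    options = [CANDIDATES[s] for s in syllable_words]

    def score(sequence):
        # Sum of log-probabilities over adjacent word pairs.
        return sum(bigram_logprob(a, b) for a, b in zip(sequence, sequence[1:]))

    return max(product(*options), key=score)


best = convert(["yi1", "zhi1", "dai4 shu3"])
print("".join(best))
```

With these toy counts the pair statistics favour the reading meaning "one kangaroo" over the other homophone combinations. A practical system would replace the exhaustive search with Viterbi decoding and would also have to choose the segmentation.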
    <Paragraph position="3"> In our previous work (Tsai, 2005), a word-pair (WP) identifier was proposed and shown to be a simple and effective way to improve Chinese input systems, providing tonal and toneless STW accuracies of 98.5% and 90.7%, respectively, on the identified poly-syllabic words. That work also showed that the WP identifier can be used to reduce the over-weighting and corpus sparseness problems of bigram models and thus achieve better STW accuracy. As per our computation, poly-syllabic words cover about 70% of the characters in Chinese sentences. Since the identified character ratio of the WP identifier (Tsai, 2005) is about 55%, about 15 percentage points of room for improvement remain.</Paragraph>
    <Paragraph position="4"> The objective of this study is to present a word support model (WSM) that improves on our WP identifier by achieving a better identified character ratio and better STW accuracy on the identified poly-syllabic words with the same word-pair database. We conduct STW experiments to show that the tonal and toneless STW accuracies of a commercial input product (Microsoft Input Method Editor 2003, MSIME) and of an optimized bigram model, BiGram (Tsai, 2005), can both be improved by our WSM, and that these improvements are larger than those obtained for the same systems with the WP identifier.</Paragraph>
    <Paragraph position="5"> The remainder of this paper is arranged as follows. In Section 2, we present an auto word-pair (AUTO-WP) generation method used to generate the WP database. Then, we develop a word support model that uses the WP database to perform STW conversion by identifying words from Chinese syllables. In Section 3, we report and analyze our STW experimental results. Finally, in Section 4, we give our conclusions and suggest some future research directions.</Paragraph>
  </Section>
</Paper>