<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3002">
  <Title>Using Word-Pair Identifier to Improve Chinese Input System</Title>
  <Section position="2" start_page="0" end_page="10" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> More than 100 Chinese input methods have been developed in the past (Becker 1985, Huang 1985, Gu et al. 1991, Chung 1993, Kuo 1995, Fu et al.</Paragraph>
    <Paragraph position="1"> 1996, Lee et al. 1997, Hsu et al. 1999, Chen et al. 2000, Tsai and Hsu 2002, Gao et al. 2002, Lee 2003). Their underlying approaches can be classified into four types: (1) optical character recognition (OCR) based (Chung 1993); (2) on-line handwriting based (Lee et al. 1997); (3) speech based (Fu et al. 1996, Chen et al. 2000); and (4) keyboard based, which comprises phonetic and pinyin based (Chang et al. 1991, Hsu et al.</Paragraph>
    <Paragraph position="2"> 1993, Hsu 1994, Hsu et al. 1999, Kuo 1995, Lua and Gan 1992), arbitrary code based (Fan et al.</Paragraph>
    <Paragraph position="3"> 1988), and structure scheme based (Huang 1985) methods.</Paragraph>
    <Paragraph position="4"> Currently, the most popular method for Chinese input is phonetic and pinyin based, because Chinese people are taught to write the corresponding phonetic and pinyin syllables of each Chinese character and word in primary school.</Paragraph>
    <Paragraph position="5"> In Chinese, each character corresponds to at least one syllable, and each word can be mono-syllabic, such as &quot;鼠 (mouse)&quot;, bi-syllabic, such as &quot;袋鼠 (kangaroo)&quot;, or multi-syllabic, such as &quot;米老鼠 (Mickey Mouse)&quot;. Although there are more than 13,000 distinct Chinese characters (of which 5,400 are commonly used), there are only about 1,300 distinct syllables. According to Qiao et al. (1984), each Chinese syllable can map to anywhere from 3 to over 100 Chinese characters, with an average of 17 characters per syllable. According to our computation, the minimum, maximum and average numbers of Chinese words per syllable-word in the MOE-MANDARIN dictionary &quot;教育部國語辭典&quot; (one of the most commonly used Chinese dictionaries, published by the Ministry of Education in Taiwan; its online version is at (MOE)) are 1, 22 and 1.5, respectively. Since the problem space for syllable-to-word (STW) conversion is much smaller than that for syllable-to-character conversion, most existing Chinese input systems (Hsu 1994, Hsu et al. 1999, Tsai and Hsu 2002, Gao et al.</Paragraph>
    <Paragraph position="6"> 2002, MSIME) address syllable-to-word conversion rather than syllable-to-character conversion. In Chinese speech recognition research, STW conversion is the main Chinese language processing task in typical speech recognition systems (Fu et al.</Paragraph>
    <Paragraph position="7"> 1996, Lee et al. 1993, Chien et al. 1993, Su et al. 1992).</Paragraph>
    <Paragraph position="8"> Conventionally, there are two approaches to syllable-to-word (STW) conversion: (1) the linguistic approach, based on syntax parsing, semantic template matching and contextual information (Hsu 1994, Fu et al. 1996, Hsu et al. 1999, Kuo 1995, Tsai and Hsu 2002); and (2) the statistical approach, based on n-gram models, where n is usually 2 or 3 (Lin and Tsai 1987, Gu et al. 1991, Fu et al. 1996, Ho et al.</Paragraph>
    <Paragraph position="9"> 1997, Sproat 1990, Gao et al. 2002, Lee 2003).</Paragraph>
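    <Paragraph> The statistical approach above can be written compactly. The following is a sketch of the standard n-gram formulation, not necessarily the exact model used in any of the cited systems. The symbols are our own notation: S is the input syllable sequence, W = w_1 ... w_m is a candidate word sequence, syl(W) is a hypothetical function returning the pronunciation of W, and smoothing is omitted.

```latex
\hat{W} = \arg\max_{W \,:\, \mathrm{syl}(W) = S} P(W),
\qquad
P(W) \approx \prod_{i=1}^{m} P\left(w_i \mid w_{i-n+1}, \ldots, w_{i-1}\right)
```

With n = 2 (bigram), each word is conditioned on the previous word only; with n = 3 (trigram), it is conditioned on the previous two words.
    </Paragraph>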
    <Paragraph position="10"> Although the linguistic approach requires considerable effort to design effective syntax rules, semantic templates or contextual information, it makes it easier than the statistical approach to understand why the system makes a mistake (Hsu 1994, Tsai and Hsu 2002).</Paragraph>
    <Paragraph position="11"> On the other hand, the statistical language model (SLM) used in the statistical approach requires less effort and has been widely adopted in commercial Chinese input systems.</Paragraph>
    <Paragraph position="12"> According to previous studies (Chung 1993, Fong and Chung 1994, Tsai and Hsu 2002, Gao et al. 2002, Lee 2003), homophone selection and syllable-word segmentation are two critical problems in Chinese STW conversion.</Paragraph>
    <Paragraph position="13"> Incorrect homophone selection and failed syllable-word segmentation directly affect the STW conversion rate. For example, consider the syllable sequence &quot;yi1 du4 ji4 yu2 zhong1 guo2 de5 niang4 jiu3 ji4 shu4&quot;. (We use the forward (F) and the backward (B) longest syllable-word first strategies (Chen et al. 1986, Tsai and Hsu 2002), and &quot;/&quot; to indicate a syllable-word boundary.)</Paragraph>
    <Paragraph position="14"> Among the above syllable-word segmentations, there is an ambiguous syllable-word section:</Paragraph>
    <Paragraph position="16"> W}/{YTYN,kk}/), respectively. For the ambiguous syllable-word section, the set of word-pairs composed of two multi-syllabic Chinese words (hereafter taken to include bi-syllabic words), together with their word-pair frequencies in the UDN2001 corpus, is: {一度-覬覦(1), 一度-中國(1), 一度-技術(4), 覬覦-中國(1), 覬覦-技術(1), 中國-技術(26), 釀酒-技術(19)}. The UDN2001 corpus (Tsai and Hsu 2002) is a collection of 4,539,624 Chinese sentences extracted from all 2001 articles on the United Daily News (UDN) website in Taiwan.</Paragraph>
    <Paragraph position="17"> In this case, if the word-pair &quot;中國 (China)-技術 (technique)&quot;, which has the maximum frequency of 26, is used as the key word-pair, the set of word-pairs that co-occur with the key word-pair in the UDN2001 corpus is {一度-覬覦, 一度-中國, 一度-技術, 覬覦-中國, 覬覦-技術}.</Paragraph>
    <Paragraph position="18"> Then, using the key word-pair &quot;中國-技術&quot; and its co-occurrence word-pair set {一度-覬覦, 一度-中國, 一度-技術, 覬覦-中國, 覬覦-技術}, both the ambiguous syllable-word section (/du4ji4/yu2/ and /du4/ji4yu2/) and the homophone selection for the syllable-word /ji4 shu4/ (/{技術 (technique), 計數 (count)}/) can be resolved simultaneously. Thus, the Chinese words &quot;一度 (once)&quot;, &quot;覬覦 (covet)&quot;, &quot;中國 (China)&quot; and &quot;技術 (technique)&quot; in the syllable sequence &quot;yi1 du4 ji4 yu2 zhong1 guo2 de5 niang4 jiu3 ji4 shu4&quot; can be correctly identified. If the Microsoft Input Method Editor 2003 for Traditional Chinese (MSIME) is used to convert the same syllables, the output is &quot;一度 (once) 繼 (continue) 於 (to) 中國 (China) 的 (of) 釀酒 (making-wine) 技術 (technique).&quot; According to Gao et al. (2002), MSIME is a trigram-like Chinese input system. The two incorrectly converted words, &quot;繼 (continue)&quot; and &quot;於 (to)&quot;, illustrate two widely recognized major problems of SLM systems (Fu et al. 1996, Gao et al.</Paragraph>
    <Paragraph position="19"> 2002): unseen events (覬覦-中國) and over-weighting (於-中國).</Paragraph>
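    <Paragraph> The key word-pair step in the example above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names WORD_PAIR_FREQ, COOCCURRING, pick_key_word_pair and identified_words are hypothetical, words are keyed by the English glosses used in the text, and the co-occurrence set is simply copied from the example rather than computed from a corpus. Only the word-pairs and their UDN2001 frequencies come from the text.

```python
from typing import Dict, List, Tuple

WordPair = Tuple[str, str]

# Word-pair frequencies quoted in the text for the UDN2001 corpus,
# keyed by the English glosses of the Chinese words (an assumption
# made here so the sketch does not depend on any character encoding).
WORD_PAIR_FREQ: Dict[WordPair, int] = {
    ("once", "covet"): 1,
    ("once", "China"): 1,
    ("once", "technique"): 4,
    ("covet", "China"): 1,
    ("covet", "technique"): 1,
    ("China", "technique"): 26,
    ("making-wine", "technique"): 19,
}

# Word-pairs that the text lists as co-occurring with the key word-pair.
COOCCURRING: List[WordPair] = [
    ("once", "covet"),
    ("once", "China"),
    ("once", "technique"),
    ("covet", "China"),
    ("covet", "technique"),
]


def pick_key_word_pair(freq: Dict[WordPair, int]) -> WordPair:
    """Return the word-pair with the highest corpus frequency."""
    return max(freq, key=freq.get)


def identified_words(key_pair: WordPair, cooccurring: List[WordPair]) -> List[str]:
    """Collect the words supported by the key pair and its co-occurring pairs."""
    words = list(key_pair)
    for pair in cooccurring:
        for word in pair:
            if word not in words:
                words.append(word)
    return words


key = pick_key_word_pair(WORD_PAIR_FREQ)
print(key)                                  # ('China', 'technique')
print(identified_words(key, COOCCURRING))   # ['China', 'technique', 'once', 'covet']
```

The four words collected this way match the ones the text reports as correctly identified; how ties or conflicting word-pairs would be handled is not specified in this section and is left out of the sketch.
    </Paragraph>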
    <Paragraph position="20"> The objective of this study is to illustrate the effectiveness of word-pairs in resolving STW conversion and thereby improving Chinese input systems. We also conduct STW experiments to show that the tonal and toneless STW accuracies of a commercial input product and of a bigram model can be improved by our word-pair identifier without a tuning process. Here, &quot;tonal&quot; refers to syllables input with the four tones, such as &quot;niang4(釀) jiu3(酒) ji4(技) shu4(術)&quot;, and &quot;toneless&quot; refers to syllables input without the four tones, such as &quot;niang(釀) jiu(酒) ji(技) shu(術)&quot;. The remainder of this paper is arranged as follows. In Section 2, we present a method for auto-generating a word-pair (AUTO-WP) database from Chinese sentences. We then develop a word-pair identifier that uses the WP database to resolve homophone and segmentation ambiguities of STW conversion on the WP-related portion of Chinese syllables. In Section 3, we present our STW experiment results. Finally, in Section 4, we give our conclusions and suggest some future research directions.</Paragraph>
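    <Paragraph> The tonal and toneless input forms defined above differ only in the trailing tone digit of each syllable. As a small sketch (the function name strip_tones is hypothetical, and the assumption that tones are written as a trailing digit from 1 to 5 is taken from the examples in this section), a tonal sequence can be reduced to its toneless form as follows:

```python
def strip_tones(syllables: str) -> str:
    """Drop the trailing tone digit (1-5) from each space-separated syllable."""
    return " ".join(s.rstrip("12345") for s in syllables.split())


print(strip_tones("niang4 jiu3 ji4 shu4"))  # niang jiu ji shu
```

Because distinct tonal syllables can collapse to the same toneless syllable, toneless input generally leaves more homophone candidates to choose from.
    </Paragraph>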
  </Section>
</Paper>