<?xml version="1.0" standalone="yes"?> <Paper uid="I05-3002"> <Title>Using Word-Pair Identifier to Improve Chinese Input System</Title> <Section position="3" start_page="10" end_page="11" type="metho"> <SectionTitle> 2 Development of Word-Pair Identifier </SectionTitle> <Paragraph position="0"> The system dictionary of our word-pair identifier comprises 155,746 Chinese words taken from the MOE-MANDARIN dictionary (MOE) and 29,408 unknown words automatically extracted from the UDN2001 corpus by a Chinese word auto-confirmation (CWAC) system (Tsai et al. 2003). The system dictionary provides the knowledge of words and their corresponding pinyin syllable-words. The pinyin syllable-words were translated by phoneme-to-pinyin mappings, such as &quot;&quot;-to-&quot;ji4.&quot;</Paragraph> <Section position="1" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 2.1 Generating the Word-Pair Database </SectionTitle> <Paragraph position="0"> The steps by which our AUTO-WP automatically discovers word-pairs in a given Chinese sentence are as follows: Step 1. Segmentation: Generate the word segmentation of the given Chinese sentence by the backward maximum matching (BMM) technique (Chen et al. 1986) with the system dictionary. Take the Chinese sentence &quot;+^uDf-.(bring the military component parts here)&quot; as an example. Its BMM word segmentation is &quot;+(get)/^uD(military)/f(component parts)/-.(bring)&quot; and its forward maximum matching (FMM) word segmentation is &quot;+^u(a general)/D(use)/ f(component parts)/-.(bring).&quot; According to our previous work (Tsai et al. 2004), the word segmentation precision of BMM is about 1% greater than that of FMM.</Paragraph> <Paragraph position="1"> Step 2. Initial WP set: Extract all the combinations of word-pairs from the word segmentation of Step 1 to form the initial WP set. For the above case, six combinations of word-pairs are extracted: {&quot;+/^uD&quot;, &quot;+/f&quot;, &quot;+/-.&quot;, &quot;^u D/f&quot;, &quot;^uD/-.&quot;, &quot;f/-.&quot;}. Step 3. Final WP set: Select the word-pairs comprised of two multi-syllabic Chinese words to form the final WP set.</Paragraph> <Paragraph position="2"> For each word-pair in the final WP set, if the word-pair is not found in the WP database, insert it into the WP database and set its frequency to 1; otherwise, increase its frequency by 1. In the above case, the final WP set includes three word-pairs: {&quot;^uD /f&quot;, &quot;^uD/-.&quot;, &quot;f/-.&quot;}. By applying our AUTO-WP to the UDN2001 corpus (the training corpus), a total of 25,439,679 word-pairs were generated. In the generated WP database, the frequencies of the word-pairs &quot;^u D/f&quot;, &quot;^uD/-.&quot; and &quot;f/-.&quot; are 1, 1 and 2, respectively. The frequency of a word-pair is the number of sentences in the training corpus that contain the word-pair in the same word-pair order.</Paragraph> </Section>
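To make Steps 2 and 3 and the frequency update concrete, the following is a minimal Python sketch (our illustration, not the authors' code). It assumes the input sentence has already been BMM-segmented into a list of words and treats any word of two or more characters as multi-syllabic.

from itertools import combinations
from collections import defaultdict

# Word-pair (WP) database: maps an ordered pair (word1, word2) to the number
# of training sentences that contain it in that order.
wp_database = defaultdict(int)

def update_wp_database(segmented_sentence):
    """Add the word-pairs of one BMM-segmented sentence to the WP database."""
    # Step 2: every ordered combination of two words is a candidate word-pair.
    initial_wp_set = list(combinations(segmented_sentence, 2))
    # Step 3: keep only word-pairs whose words are both multi-syllabic
    # (two or more characters), giving the final WP set.
    final_wp_set = [(w1, w2) for (w1, w2) in initial_wp_set
                    if len(w1) >= 2 and len(w2) >= 2]
    # Insert a new word-pair with frequency 1, or increase an existing
    # word-pair's frequency by 1 (counted at most once per sentence).
    for pair in set(final_wp_set):
        wp_database[pair] += 1

Calling update_wp_database on every sentence of the training corpus would yield word-pair frequencies of the kind described above.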
<Section position="2" start_page="10" end_page="11" type="sub_section"> <SectionTitle> 2.2 Word-Pair Identifier </SectionTitle> <Paragraph position="0"> The algorithm of our WP identifier for a given sequence of Chinese syllables is as follows: Step 1. Input tonal or toneless syllables.</Paragraph> <Paragraph position="1"> Step 2. Generate all possible word-pairs comprised of two multi-syllabic Chinese words from the input syllables as the input of Step 3.</Paragraph> <Paragraph position="2"> Step 3. First, select the word-pairs that match a word-pair in the WP database to form the initial WP set. Then, from the initial WP set, select the word-pair with maximum frequency as the key word-pair.</Paragraph> <Paragraph position="3"> Finally, find the word-pairs that co-occur with the key word-pair in the training corpus to form the final WP set. If two or more word-pairs have the same maximum frequency, one of them is randomly selected as the key word-pair.</Paragraph> <Paragraph position="4"> Step 4. Arrange all word-pairs of the final WP set into a WP-sentence. If no word-pair can be identified in the input syllables, a NULL WP-sentence is produced.</Paragraph> <Paragraph position="5"> Table 1 gives a step-by-step example showing the details of applying our WP identifier to the Chinese syllables &quot;yi1 ge5 wen2 ming2 de5 shuai1 wei2 guo4 cheng2([a]5/5 [civilization]F,[of]X/V[decay]_I[process]).&quot; For this case, we obtain the WP-sentence &quot;5/ 5 de5shuai1wei2_I.&quot; As mentioned in Section 1, we found that this WP-sentence can also be used to correct the conversion errors of MSIME in its output &quot;[a]P#[famous] F,[of]X/V[decay]_I[process].&quot;</Paragraph> </Section> </Section>
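As a concrete reading of the four steps above, here is a simplified Python sketch (our illustration, not the authors' implementation). The syllable-word lexicon, the co-occurrence index, and the omission of overlap checks between candidate words are hypothetical simplifications, and the final arrangement into a WP-sentence is only indicated by returning the final WP set.

from itertools import combinations
import random

def identify_word_pairs(syllables, lexicon, wp_database, cooccurrence):
    """Simplified WP identifier for one input syllable sequence.

    syllables     -- list of pinyin syllables (tonal or toneless)
    lexicon       -- dict: tuple of syllables -> multi-syllabic words that can
                     be spelled with them (hypothetical syllable-word lexicon)
    wp_database   -- dict: (word1, word2) -> frequency in the training corpus
    cooccurrence  -- dict: key word-pair -> set of word-pairs co-occurring with
                     it in training sentences (hypothetical index)
    """
    # Step 2: read every multi-syllabic word off the input syllables, then
    # build all ordered combinations of two such words.
    words = []
    n = len(syllables)
    for start in range(n):
        for end in range(start + 2, n + 1):          # at least two syllables
            for w in lexicon.get(tuple(syllables[start:end]), []):
                words.append((start, w))
    words.sort()                                      # keep sentence order
    candidate_pairs = [(w1, w2) for (_, w1), (_, w2) in combinations(words, 2)]

    # Step 3: word-pairs found in the WP database form the initial WP set.
    initial_wp_set = [p for p in candidate_pairs if p in wp_database]
    if not initial_wp_set:
        return None                                   # Step 4: NULL WP-sentence

    # The word-pair with maximum frequency is the key word-pair; ties are
    # broken at random, as in the paper.
    best = max(wp_database[p] for p in initial_wp_set)
    key_pair = random.choice([p for p in initial_wp_set
                              if wp_database[p] == best])

    # Word-pairs that co-occur with the key word-pair form the final WP set,
    # which would then be arranged into a WP-sentence (Step 4).
    return [p for p in initial_wp_set
            if p == key_pair or p in cooccurrence.get(key_pair, set())]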
<Section position="4" start_page="11" end_page="13" type="metho"> <SectionTitle> 3 The STW Experiments </SectionTitle> <Paragraph position="0"> To evaluate the STW performance of our WP identifier, we define the STW accuracy, the identified character ratio (ICR) and the STW improvement by the following equations: STW accuracy = # of correct characters / # of total characters. (1) Identified character ratio (ICR) = # of characters of identified WPs / # of total characters in the testing sentences. (2) STW improvement (i.e. STW error reduction rate) = (accuracy of STW system with WP - accuracy of STW system) / (1 - accuracy of STW system). (3)</Paragraph> <Section position="1" start_page="11" end_page="12" type="sub_section"> <SectionTitle> 3.1 Generation of the Word-Pair Database </SectionTitle> <Paragraph position="0"> To conduct the STW experiments, we first used the phoneme-to-character (PTC) inverse translator provided in the GOING system to convert the testing sentences into their corresponding syllables. All erroneous PTC translations of GOING were then corrected by human post-editing.</Paragraph> <Paragraph position="1"> Next, we applied our WP identifier to convert these testing syllables back into their WP-sentences.</Paragraph> <Paragraph position="2"> Finally, we calculated the STW accuracy and identified character ratio by Equations (1) and (2). Note that all test sentences in this study are strings of Chinese characters.</Paragraph> <Paragraph position="3"> The training/testing corpora, the closed/open test sets and the testing WP database used in the STW experiments are described below: (1) Training corpus: We used the UDN2001 corpus mentioned in Section 1 as our training corpus. All knowledge of word frequencies, word-pairs and word-pair frequencies was automatically generated and computed from this corpus.</Paragraph> <Paragraph position="4"> (2) Testing corpus: The UDN2002 corpus was selected as our testing corpus. It is a collection of 3,321,504 Chinese sentences extracted from all 2002 articles on the United Daily News Website (UDN).</Paragraph> <Paragraph position="5"> (3) Closed test set: 10,000 sentences were randomly selected from the UDN2001 corpus as the closed test set. The {minimum, maximum, and mean} numbers of characters per sentence in the closed test set were {4, 37, and 12}.</Paragraph> <Paragraph position="6"> (4) Open test set: 10,000 sentences were randomly selected from the UDN2002 corpus as the open test set. We also checked that the selected open test sentences did not appear in the closed test set. The {minimum, maximum, and mean} numbers of characters per sentence in the open test set were {4, 43, and 13.7}.</Paragraph> <Paragraph position="7"> (5) Testing WP database: By applying our AUTO-WP to the UDN2001 corpus, we created 25,439,679 word-pairs as the testing WP database.</Paragraph> <Paragraph position="8"> We conducted the STW experiments in a progressive manner. The results and analysis of the experiments are described in Sub-sections 3.2 and 3.3.</Paragraph> </Section> <Section position="2" start_page="12" end_page="12" type="sub_section"> <SectionTitle> 3.2 STW Experiment of the WP Identifier </SectionTitle> <Paragraph position="0"> The purpose of this experiment is to demonstrate the tonal and toneless STW accuracies on the word-pairs identified by the WP identifier with the testing WP database.</Paragraph> <Paragraph position="1"> From Table 2, the average tonal and toneless STW accuracies of the WP identifier on the closed and open test sets are 98.5% and 90.7%, respectively. Between the closed and open test sets, the differences in the tonal and toneless STW accuracies of the WP identifier are 0.5% and 1.4%, respectively. These results strongly support that the WP identifier can be used to effectively perform Chinese STW conversion on the WP-related portion.</Paragraph> </Section> <Section position="3" start_page="12" end_page="12" type="sub_section"> <SectionTitle> 3.3 A Commercial IME System and a Bigram Model with WP Identifier </SectionTitle> <Paragraph position="0"> We selected Microsoft Input Method Editor 2003 for Traditional Chinese (MSIME) as our experimental commercial Chinese input system. In addition, an optimized bigram model called BiGram was developed. The BiGram STW system is a bigram-based model built with SRILM (Stolcke 2002), using Good-Turing back-off smoothing (Manning and Schuetze, 1999) as well as forward and backward longest-syllable-word-first strategies (Chen et al. 1986, Tsai et al. 2004). The training corpus and system dictionary of the BiGram system are the same as those of the WP identifier. All the bigram probabilities were computed from the UDN2001 corpus. Table 3a compares the results of MSIME and MSIME with the WP identifier on the closed and open test sentences. Table 3b compares the results of BiGram and BiGram with the WP identifier on the closed and open test sentences. In this experiment, the STW output of MSIME with the WP identifier, or of BiGram with the WP identifier, was obtained by directly replacing the identified word-pairs (WP-sentences) in the corresponding STW output of MSIME or BiGram.</Paragraph>
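The replacement step can be pictured with the following small Python sketch (our assumption about the representation, not the paper's code): the WP identifier's output is aligned with the input syllables, and wherever it proposes a character it overrides the baseline conversion.

def merge_wp_sentence(baseline_chars, wp_chars):
    """Combine a baseline STW conversion with an identified WP-sentence.

    Both lists are aligned with the input syllables: baseline_chars[i] is the
    character chosen by MSIME or BiGram for syllable i, and wp_chars[i] is the
    character proposed by the WP identifier, or None where it made no decision.
    """
    return [wp if wp is not None else base
            for base, wp in zip(baseline_chars, wp_chars)]

# Example: merge_wp_sentence(["A", "B", "C", "D"], [None, "X", "Y", None])
# returns ["A", "X", "Y", "D"], i.e. the WP-identified characters replace the
# baseline output only in the WP-covered positions.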
<Paragraph position="1"> From Table 3a, the tonal and toneless STW improvements of MSIME obtained by using the WP identifier are 27.5% and 22.1%, respectively. Meanwhile, from Table 3b, the tonal and toneless STW improvements of BiGram obtained by using the WP identifier are 18.9% and 18.8%, respectively. (Note that we also developed a TriGram STW system with the same sources and techniques as BiGram. However, the differences between the tonal and toneless STW accuracies of BiGram and TriGram are only about 0.2%.) To sum up the results of this experiment, we conclude that the WP identifier achieves better STW accuracy than the MSIME and BiGram systems on the WP-related portion. The results of Tables 3a and 3b indicate that the WP identifier can effectively improve the tonal and toneless STW accuracies of MSIME and BiGram without any tuning process. Appendix A presents two cases of STW results obtained from this experiment.</Paragraph> </Section> <Section position="4" start_page="12" end_page="13" type="sub_section"> <SectionTitle> 3.4 Error Analysis of the STW Conversion </SectionTitle> <Paragraph position="0"> We examined the top 300 cases of tonal and toneless STW conversion errors, respectively, from the open-test results of BiGram with the WP identifier. From our analysis, the STW conversion errors can be classified into three major types: (1) Unknown word problem: For any Chinese NLP system, unknown word extraction is one of the most difficult problems and a critical issue (Tsai et al. 2003). When an error is caused only by the lack of words in the system dictionary, we call it an unknown word problem.</Paragraph> <Paragraph position="1"> (2) Inadequate syllable segmentation problem: When an error is caused by syllable-word overlapping (that is, ambiguous syllable-word segmentation) rather than by an unknown word, we call it inadequate syllable segmentation.</Paragraph> <Paragraph position="2"> (3) Homophone problem: These are the remaining conversion errors, i.e. cases in which the wrong homophonous word or character was selected. Table 4 shows the coverage of the three problems. From Table 4, we make two observations: (1) The coverage of the unknown word problem is similar for the tonal and toneless STW systems. Since the unknown word problem is not specifically an STW problem, it can easily be taken care of through manual editing or semi-automatic learning during input. In practice, therefore, the tonal and toneless STW accuracies could be raised to 98% and 91%, respectively. Although some unknown words have been incorporated into the system dictionary by the CWAC system (Tsai et al. 2004), they can still suffer from inadequate syllable segmentation and failed homophone disambiguation.</Paragraph> <Paragraph position="3"> (2) The major source of conversion errors differs between the tonal and toneless STW systems. To improve tonal STW systems, the major targets should be the cases of failed homophone selection (53% coverage). For toneless STW systems, on the other hand, the cases of inadequate syllable segmentation (51% coverage) should be the focus of improvement.</Paragraph> <Paragraph position="4"> To sum up the above two observations, the bottlenecks of STW conversion lie in the second and third problems. To resolve these issues, we believe one simple and effective approach is to extend the size of the WP database, because our experimental results show that the WP identifier can achieve better tonal and toneless STW accuracies than those of MSIME and BiGram on the WP-related portion.</Paragraph> </Section> </Section> </Paper>