<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2108"> <Title>Using Word Support Model to Improve Chinese Input System</Title> <Section position="4" start_page="842" end_page="843" type="metho"> <SectionTitle> 2 Development of Word Support Model </SectionTitle> <Paragraph position="0"> The system dictionary of our WSM comprises 82,531 Chinese words taken from the CKIP dictionary and 15,946 unknown words automatically identified in the UDN2001 corpus by a Chinese Word Auto-Confirmation (CWAC) system (Tsai et al., 2003). The UDN2001 corpus is a collection of 4,539,624 Chinese sentences extracted from the entire 2001 UDN (United Daily News, 2001) website in Taiwan (Tsai and Hsu, 2002).</Paragraph> <Paragraph position="1"> The system dictionary provides the knowledge of words and their corresponding pinyin syllable-words. The pinyin syllable-words were derived by phoneme-to-pinyin mappings, such as &quot;IU/&quot;-to-&quot;ju2.&quot;</Paragraph> <Section position="1" start_page="842" end_page="843" type="sub_section"> <SectionTitle> 2.1 Auto-Generation of WP Database </SectionTitle> <Paragraph position="0"> Following (Tsai, 2005), the three steps of auto-generating word-pairs (AUTO-WP) for a given Chinese sentence are as follows (the details of AUTO-WP can be found in (Tsai, 2005)): Step 1. Get forward and backward word segmentations: Generate two word segmentations for the given Chinese sentence by the forward maximum matching (FMM) and backward maximum matching (BMM) techniques (Chen et al., 1986; Tsai et al., 2004) with the system dictionary. Step 2. Get initial WP set: Extract all combinations of word-pairs from the FMM and BMM segmentations of Step 1 to form the initial WP set.</Paragraph> <Paragraph position="1"> Step 3. Get final WP set: Select the word-pairs comprised of two poly-syllabic words from the initial WP set into the final WP set.
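Steps 1-3 can be sketched in Python as follows. This is a toy illustration, not the authors' implementation: single Latin letters stand in for Chinese characters, `dictionary` is a hypothetical system dictionary, and `max_len` caps the word length considered by the matchers.

```python
from itertools import combinations

def fmm(sentence, dictionary, max_len=4):
    """Forward maximum matching: greedily take the longest dictionary word."""
    words, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            if n == 1 or sentence[i:i + n] in dictionary:
                words.append(sentence[i:i + n])
                i += n
                break
    return words

def bmm(sentence, dictionary, max_len=4):
    """Backward maximum matching: same idea, scanning from the end."""
    words, j = [], len(sentence)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            if n == 1 or sentence[j - n:j] in dictionary:
                words.insert(0, sentence[j - n:j])
                j -= n
                break
    return words

def word_pairs(sentence, dictionary):
    """Steps 1-3: collect word-pairs of two poly-syllabic words
    from both segmentations (the final WP set)."""
    pairs = set()
    for seg in (fmm(sentence, dictionary), bmm(sentence, dictionary)):
        poly = [w for w in seg if len(w) > 1]   # poly-syllabic words only
        pairs.update(combinations(poly, 2))     # all word-pair combinations
    return pairs
```

With `dictionary = {"ab", "cd", "bcd"}`, the sentence `"abcd"` segments as `["ab", "cd"]` forward and `["a", "bcd"]` backward, so only the forward pass contributes a word-pair.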
For the final WP set, if a word-pair is not found in the WP database, insert it into the WP database with its frequency set to 1; otherwise, increase its frequency by 1.</Paragraph> </Section> <Section position="2" start_page="843" end_page="843" type="sub_section"> <SectionTitle> 2.2 Word Support Model </SectionTitle> <Paragraph position="0"> The four steps of applying our WSM to identify words for given Chinese syllables are as follows: Step 1. Input tonal or toneless syllables.</Paragraph> <Paragraph position="1"> Step 2. Generate all possible word-pairs comprised of two poly-syllabic words for the input syllables to form the WP set of Step 3.</Paragraph> <Paragraph position="2"> Step 3. Select the word-pairs that match a word-pair in the WP database to form the WP set. Then, compute the word support degree (WS degree) for each distinct word of the WP set. The WS degree is defined as the total number of times the word appears in the WP set. Finally, arrange the words and their corresponding WS degrees into the WSM set. If more than one word has the same syllable-word and WS degree, one of them is randomly selected into the WSM set.</Paragraph> <Paragraph position="3"> Step 4. Replace words of the WSM set, in descending order of WS degree, with the input syllables to form a WSM-sentence.
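The core of Steps 2-4, the word-support ranking, can be sketched as below. This is a simplified illustration under stated assumptions: `candidate_words` (a hypothetical input) is the list of poly-syllabic word candidates already generated for the input syllables, `wp_database` is the set of known word-pairs, and candidate generation from syllables and the final replacement into a sentence are omitted.

```python
from itertools import combinations
from collections import Counter

def wsm_rank(candidate_words, wp_database):
    """Rank candidate words by word support (WS) degree."""
    # Step 2: all word-pairs of two poly-syllabic candidates.
    pairs = list(combinations(sorted(candidate_words), 2))
    # Step 3: keep only pairs found in the WP database (the WP set),
    # then count how often each word occurs in it (its WS degree).
    wp_set = [p for p in pairs if p in wp_database]
    ws_degree = Counter(w for p in wp_set for w in p)
    # Step 4: words are applied in descending order of WS degree.
    return ws_degree.most_common()
```

For example, with candidates `["AB", "BC", "CD"]` and a WP database containing only the pair `("AB", "CD")`, the words `AB` and `CD` each receive a WS degree of 1 and `BC` receives none.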
If no words can be identified in the input syllables, a NULL WSM-sentence is produced.</Paragraph> <Paragraph position="4"> Table 1 gives a step-by-step example of applying our WSM to the Chinese syllables &quot;sui1 ran2 fu3 shi2 jin4 shi4 sui4 yue4 xi1 xu1(Sui Ran Fu Sip Jin Shi Sui Yue Xi Xu ).&quot; For these input syllables, we obtain the WSM-sentence &quot;Sui Ran Fu Sip Jin Shi Sui Yue Xi Xu .&quot; For the same syllables, the outputs of the MSIME, the BiGram and the WP identifier are &quot;Sui Ran Fu Shi Jin Shi Sui Yue Xi Xu ,&quot; &quot;Sui Ran</Paragraph> </Section> </Section> <Section position="5" start_page="843" end_page="846" type="metho"> <SectionTitle> 3 STW Experiments </SectionTitle> <Paragraph position="0"> To evaluate the STW performance of our WSM, we define the STW accuracy, identified character ratio (ICR) and STW improvement by the following equations: STW accuracy = # of correct characters / # of total characters. (1) Identified character ratio (ICR) = # of characters of identified WP / # of total characters in testing sentences. (2) STW improvement (i.e. STW error reduction rate) = (accuracy of STW system with WP - accuracy of STW system) / (1 - accuracy of STW system). (3)</Paragraph> <Section position="1" start_page="843" end_page="844" type="sub_section"> <SectionTitle> 3.1 Background </SectionTitle> <Paragraph position="0"> To conduct the STW experiments, we first use the inverse phoneme-to-character (PTC) translator provided in the GOING system to convert the testing sentences into their corresponding syllables. All erroneous GOING PTC translations were corrected by human post-editing.</Paragraph> <Paragraph position="1"> Then, we apply our WSM to convert the testing input syllables back into WSM-sentences.</Paragraph> <Paragraph position="2"> Finally, we calculate the STW accuracy and ICR by Equations (1) and (2).
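The three evaluation measures translate directly into code; a minimal sketch (function names are ours, and Equation (3) is written in its error-reduction-rate form):

```python
def stw_accuracy(correct_chars, total_chars):
    """Equation (1): fraction of correctly converted characters."""
    return correct_chars / total_chars

def identified_char_ratio(identified_wp_chars, total_chars):
    """Equation (2): fraction of characters covered by identified WP."""
    return identified_wp_chars / total_chars

def stw_improvement(acc_with_wp, acc_baseline):
    """Equation (3): share of the baseline's remaining error removed."""
    return (acc_with_wp - acc_baseline) / (1 - acc_baseline)
```

For instance, raising a system from 90% to 95% accuracy removes half of its remaining error, i.e. an STW improvement of 50%.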
Note that all test sentences in this study are composed of strings of Chinese characters.</Paragraph> <Paragraph position="3"> The training/testing corpora, closed/open test sets and system/user WP databases used in the following STW experiments are described below: (1) Training corpus: We used the UDN2001 corpus as our training corpus; it is a collection of 4,539,624 Chinese sentences extracted from the entire 2001 UDN (United Daily News, 2001) website in Taiwan (Tsai and Hsu, 2002).</Paragraph> <Paragraph position="4"> (2) Testing corpus: The Academia Sinica Balanced (AS) corpus (Chinese Knowledge Information Processing Group, 1996) was selected as our testing corpus. The AS corpus is one of the best-known traditional Chinese corpora in the Chinese NLP research field (Thomas, 2005).</Paragraph> <Paragraph position="5"> (3) Closed test set: 10,000 sentences were randomly selected from the UDN2001 corpus as the closed test set. The {minimum, maximum, mean} numbers of characters per sentence for the closed test set are {4, 37, 12}.</Paragraph> <Paragraph position="6"> (4) Open test set: 10,000 sentences were randomly selected from the AS corpus as the open test set. We also verified that none of the selected open test sentences appear in the closed test set. The {minimum, maximum, mean} numbers of characters per sentence for the open test set are {4, 40, 11}. (5) System WP database: By applying AUTO-WP to the UDN2001 corpus, we created 25,439,679 word-pairs as the system WP database.</Paragraph> <Paragraph position="7"> (6) User WP database: By applying AUTO-WP to the AS corpus, we created 1,765,728 word-pairs as the user WP database. We conducted the STW experiments in a progressive manner.
The results and analysis of the experiments are described in Subsections 3.2 and 3.3.</Paragraph> </Section> <Section position="2" start_page="844" end_page="844" type="sub_section"> <SectionTitle> 3.2 STW Experiment Results of the WSM </SectionTitle> <Paragraph position="0"> The purpose of this experiment is to demonstrate the tonal and toneless STW accuracies on the identified words when using the WSM with the system WP database. The comparative system is the WP identifier (Tsai, 2005). Table 2 shows the experimental results. The WP database and system dictionary of the WP identifier are the same as those of the WSM.</Paragraph> <Paragraph position="1"> Table 2 shows that the average tonal and toneless STW accuracies and ICRs of the WSM are all greater than those of the WP identifier.</Paragraph> <Paragraph position="2"> These results indicate that the WSM is a better way than the WP identifier to identify poly-syllabic words for Chinese syllables.</Paragraph> <Paragraph position="3"> Table 2. The results of the tonal and toneless STW experiments for the WP identifier and the WSM.</Paragraph> </Section> <Section position="3" start_page="844" end_page="845" type="sub_section"> <SectionTitle> 3.3 STW Experiment Results of Chinese Input Systems with the WSM </SectionTitle> <Paragraph position="0"> We selected Microsoft Input Method Editor 2003 for Traditional Chinese (MSIME) as our experimental commercial Chinese input system. In addition, following (Tsai, 2005), an optimized bigram model called BiGram was developed. The BiGram STW system is a bigram-based model developed with SRILM (Stolcke, 2002), using Good-Turing back-off smoothing (Manning and Schuetze, 1999) as well as forward and backward longest-syllable-word-first strategies (Chen et al., 1986; Tsai et al., 2004).
The system dictionary of the BiGram is the same as that of the WP identifier and the WSM.</Paragraph> <Paragraph position="1"> Table 3a compares the results of the MSIME, the MSIME with the WP identifier and the MSIME with the WSM on the closed and open test sentences. Table 3b compares the results of the BiGram, the BiGram with the WP identifier and the BiGram with the WSM on the closed and open test sentences. In this experiment, the STW output of the MSIME or the BiGram with the WP identifier or the WSM was obtained by directly replacing, in the corresponding STW output of the MSIME or the BiGram, the words identified by the WP identifier or the WSM.</Paragraph> <Paragraph position="2"> (a) STW accuracies and improvements of the words identified by the MSIME (Ms) with the WP identifier; (b) STW accuracies and improvements of the words identified by the MSIME (Ms) with the WSM. Table 3a. The results of the tonal and toneless STW experiments for the MSIME, the MSIME with the WP identifier and with the WSM.</Paragraph> <Paragraph position="3"> (a) STW accuracies and improvements of the words identified by the BiGram (Bi) with the WP identifier; (b) STW accuracies and improvements of the words identified by the BiGram (Bi) with the WSM. Table 3b. The results of the tonal and toneless STW experiments for the BiGram, the BiGram with the WP identifier and with the WSM.</Paragraph> <Paragraph position="4"> From Table 3a, the tonal and toneless STW improvements of the MSIME by using the WP identifier and the WSM are (18.9%, 10.1%) and (25.6%, 16.6%), respectively. From Table 3b, the tonal and toneless STW improvements of the BiGram by using the WP identifier and the WSM are (8.6%, 11.9%) and (17.1%, 22.0%), respectively.
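As a quick sanity check of how such improvement figures relate to accuracies, Equation (3) can be applied to hypothetical numbers (the accuracies below are illustrative, not taken from the tables):

```python
def error_reduction(acc_baseline, acc_with):
    """Equation (3): share of the baseline's remaining error removed."""
    return (acc_with - acc_baseline) / (1.0 - acc_baseline)

# Hypothetical illustration: a baseline at 90% accuracy that reaches
# 92% accuracy on the identified words yields a 20% STW improvement.
print(round(error_reduction(0.90, 0.92), 2))   # 0.2
```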
(Note that, as per (Tsai, 2005), the differences between the tonal and toneless STW accuracies of the BiGram and the TriGram are less than 0.3%.)</Paragraph> <Paragraph position="5"> Table 3c shows the results of the MSIME and the BiGram using the WSM as adaptation processing with both the system and user WP databases. From Table 3c, the average tonal and toneless STW improvements of the MSIME and the BiGram using the WSM as adaptation processing are 37.2% and 34.6%, respectively. (a) STW accuracies, ICRs and improvements of the words identified by the MSIME (Ms) with the WSM; (b) STW accuracies, ICRs and improvements of the words identified by the BiGram (Bi) with the WSM. Table 3c. The results of the tonal and toneless STW experiments for the MSIME and the BiGram using the WSM as adaptation processing.</Paragraph> <Paragraph position="6"> To sum up the above experimental results, we conclude that the WSM achieves better STW accuracy than the MSIME, the BiGram and the WP identifier on the identified-words portion. (Appendix A presents two cases of STW results obtained in this study.)</Paragraph> </Section> <Section position="4" start_page="845" end_page="846" type="sub_section"> <SectionTitle> 3.4 Error Analysis </SectionTitle> <Paragraph position="0"> We examined the top 300 tonal and toneless STW conversions from the open testing results of the BiGram with the WP identifier and the WSM, respectively. According to our analysis, the STW errors are caused by three problems: (1) Unknown word (UW) problem: For Chinese NLP systems, unknown word extraction is one of the most difficult problems and a critical issue. When an STW error is caused only by the lack of words in the system dictionary, we call it an unknown word problem.
(2) Inadequate syllable-word segmentation (ISWS) problem: When an error is caused by ambiguous syllable-word segmentation (including overlapping and combination ambiguities), we call it an inadequate syllable-word segmentation problem.</Paragraph> <Paragraph position="1"> (3) Homophone selection (HS) problem: The remaining STW conversion errors are homophone selection problems.</Paragraph> <Paragraph position="2"> Table 4. Error analysis results from the top 300 tonal and toneless STW conversions of the BiGram with the WP identifier and the WSM.</Paragraph> <Paragraph position="3"> Table 4 gives the analysis results for the three STW error types. From Table 4, we make three observations: (1) The coverage of the unknown word problem is similar for tonal and toneless STW conversions. In most Chinese input systems, unknown word extraction is not specifically an STW problem; it is therefore usually handled by online and offline manual editing (Hsu et al., 1999). The results of Table 4 show that most STW errors are caused by the ISWS and HS problems, not the UW problem. This observation is similar to that of our previous work (Tsai, 2005).</Paragraph> <Paragraph position="4"> (2) The major source of conversion errors differs between tonal and toneless STW systems. This observation is also similar to that of (Tsai, 2005). From Table 4, the main targets for improving tonal STW performance are the HS errors, because more than 50% of tonal STW errors are caused by the HS problem. On the other hand, since ISWS errors cover more than 50% of toneless STW errors, the main targets for improving toneless STW performance are the ISWS errors. (3) The total numbers of error characters of the BiGram with the WSM in the tonal and toneless STW conversions are both less than those of the BiGram with the WP identifier.
This observation answers the question: &quot;Why is the STW performance of Chinese input systems (MSIME and BiGram) with the WSM better than that of these systems with the WP identifier?&quot; To sum up the above three observations and all the STW experimental results, we conclude that the WSM achieves better STW improvements than the WP identifier because: (1) the identified character ratio of the WSM is 15% greater than that of the WP identifier with the same WP database and dictionary; and (2) the WSM not only maintains the ratio of the three STW error types but also reduces the total number of error characters in converted words compared with the WP identifier.</Paragraph> </Section> </Section> </Paper>