<?xml version="1.0" standalone="yes"?> <Paper uid="I05-3002"> <Title>Using Word-Pair Identifier to Improve Chinese Input System</Title> <Section position="5" start_page="13" end_page="14" type="concl"> <SectionTitle> 4 Conclusion and Future Directions </SectionTitle> <Paragraph position="0"> In this paper, we applied a WP identifier to support STW conversion in Chinese language processing and obtained high STW accuracy on the identified word-pairs. All of the WP data can be generated fully automatically by applying AUTO-WP to the system and user corpora. We are encouraged by the fact that WP knowledge achieves tonal and toneless STW accuracies of 98.5% and 90.7%, respectively, for the WP-related portion of the testing syllables. The WP identifier can be easily integrated into existing Chinese input systems by identifying word-pairs in a post-processing step.</Paragraph> <Paragraph position="1"> Our experimental results show that, by applying the WP identifier together with MSIME (a trigram-like model) and BiGram (an optimized bigram model), the tonal and toneless STW improvements of the two Chinese input systems are 27.5%/22.1% and 18.9%/18.8%, respectively. For the adaptation STW approach, we applied AUTO-WP to extract the word-pairs from the 10,000 open testing sentences into the testing WP database. With this adaptation, the tonal and toneless STW accuracies of MSIME with the adaptation WP identifier and BiGram with the adaptation WP identifier become 97.0%/97.2% and 91.1%/90.0%, respectively. Currently, our approach is quite basic when more than one WP occurs in the same sentence.</Paragraph> <Paragraph position="2"> Although there is room for improvement, we believe it would not noticeably affect STW accuracy. However, this issue will become important when we apply the WP knowledge to speech recognition.
According to our computations, the collection of testing WP knowledge covers approximately 50% and 40% of the characters in the UDN2001 and UDN2002 corpora, respectively. We will continue to expand our collection of WP knowledge, using a Web corpus (search engine results), to cover more characters in the UDN2001 and UDN2002 corpora and thereby improve our STW system. In other directions, we will try to improve our WP-based STW conversion with other statistical language models, such as HMMs, and extend it to other areas of NLP, especially word segmentation and speech recognition.</Paragraph> </Section> </Paper>