<?xml version="1.0" standalone="yes"?> <Paper uid="P96-1019"> <Title>An Iterative Algorithm to Build Chinese Language Models</Title> <Section position="5" start_page="141" end_page="141" type="metho"> <SectionTitle> 5 Conclusion </SectionTitle> <Paragraph position="0"> In this paper, we present an iterative procedure for building a Chinese language model (LM). We segment Chinese text into words using a word-based Chinese language model. However, constructing a Chinese LM itself requires word boundaries. To escape this chicken-and-egg problem, we propose an iterative procedure that alternates between two operations: segmenting text into words and building an LM.</Paragraph> <Paragraph position="1"> Starting with an initial segmented corpus and an LM based upon it, we use a Viterbi-like algorithm to segment another set of data. We then build an LM based on the second set and use that LM to re-segment the first corpus. The alternating procedure provides a self-organized way for the segmenter to automatically detect unseen words and correct segmentation errors. Our preliminary experiment shows that the alternating procedure not only improves the accuracy of our segmentation, but also discovers unseen words surprisingly well. We obtain a perplexity of 188 on a general Chinese corpus of 2.5 million characters.</Paragraph> </Section> <Section position="6" start_page="141" end_page="141" type="metho"> <SectionTitle> 6 Acknowledgment </SectionTitle> <Paragraph position="0"> The first author would like to thank various members of the Human Language Technologies Department at the IBM T. J. Watson Center for their encouragement and helpful advice. Special thanks go to Dr. Martin Franz for providing continuous help in using the IBM language model tools. The authors would also like to thank two anonymous reviewers for their comments and insight, which helped improve the final draft.</Paragraph> </Section> </Paper>