<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1122"> <Title>An Integrated Method for Chinese Unknown Word Extraction</Title> <Section position="3" start_page="1" end_page="1" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> A distinctive feature of the Chinese writing system is that it is character-based rather than word-based. Because there are no delimiters between words, word segmentation is a well-known problem. Any Chinese Information Processing (CIP) system that operates beyond the character level, such as information retrieval, automatic proofreading, text classification, text-to-speech conversion, syntactic parsing, information extraction, or machine translation, needs a built-in word segmentation component. Currently, the dictionary-based method is the basic and efficient approach to word segmentation, so most CIP systems require a fixed Chinese electronic dictionary. Yet unknown words (words outside the fixed dictionary) are being coined all the time. They are diverse, including proper nouns (person names, place names, organization names, etc.), domain-specific terms and abbreviations, and even author-coined terms, and they appear frequently in real text. They can cause ambiguity in Chinese word segmentation and lead to errors in downstream applications. At present, many systems (Tan et al., 1999; Liu, 2000; Song, 1993; Luo et al., 2001) focus on online recognition of proper nouns. They achieve encouraging results on news corpora, but their performance deteriorates on specialized text such as spoken-language corpora and novels.
The remaining types of unknown words are still an obstacle for application systems, even though they are important for specific collections of texts.</Paragraph> <Paragraph position="1"> For instance, according to our count on the Chinese novel Xiao Ao Jiang Hu (Jin Yong, 1967), there are almost 515 unknown word types (words outside our 243,539-item general dictionary), with 39,404 occurrences in total covering 112,654 characters. The novel contains 983,134 characters overall, so unknown words account for about 11.46% of its characters. Most of them, such as &quot;Dong Fang Bu Bai&quot; (person name), &quot;Pi Xie Jian Pu&quot; (common noun), and &quot;Ri Yue Shen Jiao&quot; (organization name), cannot be recognized by most current CIP systems. Without an efficient unknown word extraction method, most CIP systems cannot obtain satisfactory results.</Paragraph> </Section> </Paper>