File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/04/j04-1004_abstr.xml
Size: 6,266 bytes
Last Modified: 2025-10-06 13:43:23
<?xml version="1.0" standalone="yes"?> <Paper uid="J04-1004"> <Title>Xiaotie Deng ++</Title> <Section position="2" start_page="0" end_page="77" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Words are the basic linguistic units of natural language processing. The importance of word extraction is stressed in many papers. According to Huang, Chen, and Tsou (1996), the word is the basic unit in natural language processing (NLP), as it is at the lexical level where all modules interface. Possible modules involved are the lexicon, speech recognition, syntactic parsing, speech synthesis, semantic interpretation, and so on. Thus, the identification of lexical words and/or the delimitation of words in running texts is a prerequisite of NLP. Teahan et al. (2000) state that interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text searches, word-based compression, and key-phrase extraction.</Paragraph> <Paragraph position="1"> According to Guo (1997), words and tokens are the primary building blocks in almost all linguistic theories and language-processing systems, including those for Japanese (Kobayasi, Tokumaga, and Tanaka 1994), Korean (Yun, Lee, and Rim 1995), German (Pachunke et al. 1992), and English (Garside, Leech, and Sampson 1987), in various media, such as continuous speech and cursive handwriting, and in numerous applications, such as translation, recognition, indexing, and proofreading. The identification of words in natural language is nontrivial since, as observed by Chao (1968), linguistic words often represent a different set than do sociological words.</Paragraph> <Paragraph position="2"> * School of Computer Science and Technology, Jinan, PRC; Department of Computer Science, Tat Chee Avenue, Kowloon, Hong Kong. E-mail: fenghd@cs.cityu.edu.hk or fenghaodi@hotmail.com.</Paragraph> <Paragraph position="3"> + Department of Computer Science and Technology, Peking, PR China. E-mail: {ck99,zwm-dcs}@mails.tsinghua.edu.cn.</Paragraph> <Paragraph position="4"> ++ Department of Computer Science, Tat Chee Avenue, Kowloon, Hong Kong. E-mail: csdeng@cityu.edu.hk. Computational Linguistics Volume 30, Number 1</Paragraph>
<Paragraph position="5"> Chinese texts are character based, not word based. Each Chinese character stands for one phonological syllable and in most cases represents a morpheme. This presents a problem, as fewer than 10% of the word types (and fewer than 50% of the tokens in a text) in Chinese are composed of a single character (Chen et al. 1993). However, Chinese texts, and texts in some other Oriental languages such as Japanese, do not have delimiters such as spaces to mark the boundaries of meaningful words. Even in English text, some phrases consist of several words, but the problem is far less pervasive in English than in Chinese. How to extract words from Chinese texts is still an interesting problem. Note that word extraction is different from the very closely related problem of sentence segmentation. Word extraction aims to collect all of the meaningful strings in a text. Sentence segmentation partitions a sentence into several consecutive meaningful segments. Word extraction should be easier than sentence segmentation, and the problems involved in it can be solved using simpler methods.</Paragraph> <Paragraph position="6"> Some Chinese information-retrieval systems operate at the character level instead of the word level, for example, the Csmart system (Chien 1995). However, to further improve the efficiency of Chinese natural language processing, it is commonly considered important to apply results from linguistics (Kwok 1997). Lexicon construction is considered to be one of the most important tasks. 
Single Chinese characters can quite often carry different meanings. This ambiguity can be resolved when the characters are combined with other characters to form a word. Chinese words can be unigrams, bigrams, trigrams, or n-grams, where n > 3. According to the Frequency Dictionary of Modern Chinese (Beijing Language Institute 1986), among the 9,000 most frequent Chinese words, 26.7% are unigrams, 69.8% are bigrams, 2.7% are trigrams, 0.007% are four-grams, and 0.002% are five-grams. There are lexicons for identifying some (and probably most of the frequent) words. However, sometimes less-frequent words are more effective. Weeber, Vos, and Baayen (2000) recently extracted side-effect-related terms in a medical-information extraction system and found that many of the terms had a frequency of less than five. This indicates that low-frequency words may also carry very important information. Our experiments show that we can extract low-frequency words using a simple method without overly degrading the precision.</Paragraph> <Paragraph position="7"> There are generally two directions in which words can be formed (Huang, Chen, and Tsou 1996). One is the deductive strategy, whereby words are identified through the segmentation of running texts. The other is the inductive strategy, which identifies words through the compositional process of morpho-lexical rules. This strategy represents words with common characteristics (e.g., numeric compounds) by rules. In Chinese text segmentation there are three basic approaches (Sproat et al. 1996): pure heuristic, pure statistical, and a hybrid of the two. The heuristic approach identifies words by applying prior knowledge or morpho-lexical rules governing the derivation of new words. The statistical approach identifies words based on the distribution of their components in a large corpus. 
Sproat and Shih (1990) develop a purely statistical method that utilizes the mutual information between two characters, I(x, y) = log [p(x, y) / (p(x) p(y))]; the limitation of the method is that it can deal only with two-character words. Ge, Pratt, and Smyth (1999) introduce a simple probabilistic model based on the occurrence probability of the words that constitute a set of predefined assumptions. Chien (1997) develops a PAT-tree-based method that extracts significant words by observing mutual information of two overlapped patterns with the significance function</Paragraph> </Section></Paper>