<?xml version="1.0" standalone="yes"?> <Paper uid="J00-3004"> <Title>A Compression-based Algorithm for Chinese Word Segmentation</Title> <Section position="2" start_page="0" end_page="378" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Languages such as Chinese and Japanese are written without spaces or other word delimiters (except for punctuation marks)--indeed, the Western notion of a word boundary is literally alien (Wu 1998). Nevertheless, words are present in these languages. Chinese words often comprise several characters, typically two, three, or four; five-character words also exist, but they are rare. Many characters can stand alone as words in themselves, while on other occasions the same character is the first or second character of a two-character word, and on still others it participates as a component of a three- or four-character word. This phenomenon causes obvious ambiguities in word segmentation.</Paragraph> <Paragraph position="1"> Readers unfamiliar with Chinese can gain an appreciation of the problem of multiple interpretations from Figure 1, which shows two alternative interpretations of the same Chinese character sequence. The text is a joke that relies on the ambiguity of phrasing. Once upon a time, the story goes, a man set out on a long journey. Before he could return home the rainy season began, and he had to take shelter at a friend's house. But he overstayed his welcome, and one day his friend wrote him a note: the first line in Figure 1. The intended interpretation is shown in the second line, which means &quot;It is raining, the god would like the guest to stay. Although the god wants you to stay, I do not!&quot; On seeing the note, the visitor took the hint and prepared to leave. As a joke he amended the note with the punctuation shown in the third line, which leaves three sentences whose meaning is totally different: &quot;The rainy day, the staying day. Would you like me to stay? Sure!&quot;</Paragraph> <Paragraph position="2"> This example relies on ambiguity of phrasing, but the same kind of problem can arise with word segmentation. Figure 2 shows a more prosaic example. The ordinary sentence in its first line has two different interpretations, depending on context: &quot;I like New Zealand flowers&quot; and &quot;I like fresh broccoli.&quot;</Paragraph> <Paragraph position="3"> The fact that machine-readable Chinese text is invariably stored in unsegmented form causes difficulty in applications that use the word as the basic unit. For example, search engines index documents by storing a list of the words they contain, and allow the user to retrieve all documents that contain a specified combination of query terms. This presupposes that the documents are segmented into words. Failing to do so, and instead treating every character as a word in itself, greatly decreases the precision of retrieval, since large numbers of extraneous documents are returned that contain characters, but not words, from the query.</Paragraph> <Paragraph position="4"> Figure 3 (captioned &quot;Example of treating each character in a query as a word&quot;) illustrates what happens when each character in a query is treated as a single-character word. The intended query is &quot;physics&quot; or &quot;physicist.&quot; The first character returns documents about such things as &quot;evidence,&quot; &quot;products,&quot; &quot;body,&quot; &quot;image,&quot; and &quot;prices,&quot; while the second returns documents about &quot;theory,&quot; &quot;barber,&quot; and so on. Thus many documents that are completely irrelevant to the query will be returned, greatly decreasing the precision of information retrieval. Similar problems occur in word-based compression, speech recognition, and so on.</Paragraph>
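To make the retrieval problem concrete, the following sketch (hypothetical toy code, not from the paper) builds an inverted index two ways: over properly segmented words and over individual characters. The documents echo the example above: 物理 (physics), 物价 (prices), and 理发 (barber). Under character indexing, a query for 物理 also retrieves the two unrelated documents.

```python
from collections import defaultdict

def build_index(docs, tokenize):
    """Map each token to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def retrieve(index, query_tokens):
    """Disjunctive retrieval: documents matching at least one query token,
    as in a ranked search engine that does not require all terms."""
    results = set()
    for token in query_tokens:
        results |= index[token]
    return results

# Toy collection: the query characters for "physics" also occur in
# unrelated words, as in the paper's "prices" and "barber" examples.
docs = {1: "物理", 2: "物价", 3: "理发"}

word_index = build_index(docs, str.split)                     # one token per word
char_index = build_index(docs, lambda t: t.replace(" ", ""))  # one token per character

print(retrieve(word_index, ["物理"]))      # {1}: only the relevant document
print(retrieve(char_index, list("物理")))  # {1, 2, 3}: extraneous matches
```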
<Paragraph position="5"> It is true that most search engines allow the user to search for multiword phrases by enclosing them in quotation marks, and this facility could be used to search for multicharacter words in Chinese. This, however, runs the risk of retrieving irrelevant documents in which the same characters occur in sequence but with a different intended segmentation. More importantly, it imposes on the user an artificial requirement to perform manual segmentation on each full-text query.</Paragraph> <Paragraph position="6"> Word segmentation is an important prerequisite for such applications. However, it is a difficult and ill-defined task. According to Sproat et al. (1996) and Wu and Fung (1994), experiments show that only about 75% agreement between native speakers is to be expected on the &quot;correct&quot; segmentation, and the figure decreases as more people become involved.</Paragraph> <Paragraph position="7"> This paper describes a general scheme for segmenting text by inferring the position of word boundaries, thus supplying a necessary preprocessing step for applications like those mentioned above. Unlike other approaches, which involve a dictionary of legal words and are therefore language-specific, it works by using a corpus of already-segmented text for training and thus can easily be retargeted for any language for which a suitable corpus of segmented material is available. To infer word boundaries, a general adaptive text compression technique is used that predicts upcoming characters on the basis of their preceding context. Spaces are inserted into positions where their presence enables the text to be compressed more effectively. This approach means that we can capitalize on existing research in text compression to create good models for word segmentation. To build a segmenter for a new language, the only resource required is a corpus of segmented text to train the compression model.</Paragraph>
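As a minimal illustration of this idea only (the actual system uses the PPM compression model and the hidden Markov formulation described later, not this code), the sketch below trains a fixed-order character model with add-one smoothing on text in which word boundaries appear as literal spaces, then scores every possible space insertion by its code length in bits and keeps the candidate that compresses best.

```python
import math
from collections import defaultdict
from itertools import product

ORDER = 2  # context length: an order-2 (trigram) character model

def train(segmented_corpus):
    """Count character n-grams over training text in which word
    boundaries appear as literal space characters."""
    counts = defaultdict(lambda: defaultdict(int))
    alphabet = set()
    for line in segmented_corpus:
        text = "~" * ORDER + line  # '~' pads the initial context
        alphabet.update(text)
        for i in range(ORDER, len(text)):
            counts[text[i - ORDER:i]][text[i]] += 1
    return counts, alphabet

def code_length(text, counts, alphabet):
    """Bits needed to encode `text` under the model (add-one smoothing)."""
    bits = 0.0
    text = "~" * ORDER + text
    for i in range(ORDER, len(text)):
        context, ch = text[i - ORDER:i], text[i]
        total = sum(counts[context].values())
        p = (counts[context][ch] + 1) / (total + len(alphabet))
        bits -= math.log2(p)
    return bits

def segment(text, counts, alphabet):
    """Score every way of inserting spaces between characters and keep
    the candidate that compresses best.  This brute-force search is
    exponential; the paper instead uses dynamic programming over the
    states of a hidden Markov model."""
    best_bits, best_cand = float("inf"), text
    for gaps in product([False, True], repeat=len(text) - 1):
        cand = text[0] + "".join((" " if gap else "") + ch
                                 for gap, ch in zip(gaps, text[1:]))
        bits = code_length(cand, counts, alphabet)
        if bits < best_bits:
            best_bits, best_cand = bits, cand
    return best_cand

corpus = ["the red car", "his red car", "the car"]
counts, alphabet = train(corpus)
print(segment("theredcar", counts, alphabet))  # 'the red car': the training segmentation compresses best
```

The key design point is that segmentation quality comes entirely from the trained character model; nothing in the search is language-specific, which is why retargeting needs only a new training corpus.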
<Paragraph position="8"> The structure of this paper is as follows: The next section reviews previous work on the Chinese segmentation problem. Then we explain the operation of the adaptive text compression technique that will be used to predict word boundaries. Next we show how space insertion can be viewed as a problem of hidden Markov modeling, and how higher-order models, such as the ones used in text compression, can be employed in this way. The following section describes several experiments designed to evaluate the success of the new word segmenter. Finally we discuss the application of language segmentation in digital libraries.</Paragraph> <Paragraph position="9"> Our system for segmenting Chinese text is available on the World Wide Web at http://www.nzdl.org/cgi-bin/congb. It takes GB-encoded input text, which can be cut from a Chinese document and pasted into the input window. Once the segmenter has been invoked, the result is rewritten into the same window.</Paragraph> <SectionTitle> 2. Previous Methods for Segmenting Chinese </SectionTitle> <Paragraph position="10"> The problem of segmenting Chinese text has been studied by researchers for many years; see Wu and Tseng (1993) for a detailed survey. Several different algorithms have been proposed which, generally speaking, can be classified into dictionary-based and statistical-based methods, although other techniques that involve more linguistic information, such as syntactic and semantic knowledge, have been reported in the natural language processing literature.</Paragraph> <Paragraph position="11"> Cheng, Young, and Wong (1999) describe a dictionary-based method. Given a dictionary of frequently used Chinese words, an input string is compared with words in the dictionary to find the one that matches the greatest number of characters of the input. This is called the maximum forward match heuristic. An alternative is to work backwards through the text, resulting in the maximum backward match heuristic.</Paragraph> <Paragraph position="12"> It is easy to find situations where these heuristics fail. To use an English example, forward matching fails on input beginning &quot;the red ...&quot; (it is misinterpreted as &quot;there d ...&quot;), while backward matching fails on text ending &quot;... his car&quot; (it is misinterpreted as &quot;... hi scar&quot;). Analogous failures occur with Chinese text.</Paragraph>
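Both heuristics are easy to state in code. The sketch below (illustrative, with a toy English dictionary; not from Cheng, Young, and Wong) implements them and reproduces the failures just described; an unmatched character falls through as a single-character token.

```python
def forward_max_match(text, dictionary, max_len=6):
    """Greedy left-to-right segmentation: always take the longest
    dictionary word starting at the current position."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            # Fall through to a single character if nothing matches.
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

def backward_max_match(text, dictionary, max_len=6):
    """Greedy right-to-left segmentation: always take the longest
    dictionary word ending at the current position."""
    words, j = [], len(text)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if text[i:j] in dictionary or i == j - 1:
                words.insert(0, text[i:j])
                j = i
                break
    return words

dictionary = {"the", "there", "red", "his", "hi", "car", "scar"}
print(forward_max_match("thered", dictionary))   # ['there', 'd'], not ['the', 'red']
print(backward_max_match("hiscar", dictionary))  # ['hi', 'scar'], not ['his', 'car']
```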
<Paragraph position="13"> Dai, Khoo, and Loh (1999) use statistical methods to perform text segmentation. They concentrate on two-character words, because two characters is the most common word length in Chinese. Several different notions of character and bigram frequency are explored: relative frequency, document frequency, weighted document frequency, and local frequency. They also look at both contextual and positional information. Contextual information is found to be the single most important factor governing the probability that a bigram forms a word, and incorporating the weighted document frequency can improve the model significantly. In contrast, positional frequency is not found to be helpful in determining words.</Paragraph> <Paragraph position="14"> Ponte and Croft (1996) introduce two models for word segmentation: word-based and bigram models. Both utilize probabilistic automata. In the word-based method, a suffix tree of words in the lexicon is used to initialize the model. Each node is associated with a probability, which is estimated by segmenting training text using the longest-match strategy. This makes the segmenter easy to transplant to new languages. The bigram model uses the lexicon to initialize probability estimates for each bigram, and uses the Baum-Welch algorithm (Rabiner 1989) to update them as the training text is processed.</Paragraph> <Paragraph position="15"> Hockenmaier and Brew (1998) present an algorithm, based on Palmer's (1997) experiments, that applies a symbolic machine learning technique--transformation-based error-driven learning (Brill 1995)--to the problem of Chinese word segmentation. Using a set of rule templates and four distinct initial-state annotators, Palmer concludes that the learning technique works well. Hockenmaier and Brew investigate how performance is influenced by different rule templates and corpus size, using three rule templates: simple bigram rules, trigram rules, and more elaborate rules. Their experiments indicate that the size of the training data has the most significant influence on performance; good performance can be achieved with simple rules alone, provided the training corpus is large enough.</Paragraph> <Paragraph position="16"> Lee, Ng, and Lu (1999) have recently introduced a new segmentation method for a Chinese spell-checking application. Using a dictionary with single-character word occurrence frequencies, this scheme first divides text into sentences, then into phrases, and finally into words, using a small number of word combinations conditioned on a heuristic to avoid delay during spell-checking. When compared with forward maximum matching, the new method resolves more than 10% more ambiguities, but enjoys no obvious speed advantage.</Paragraph> <Paragraph position="17"> The way in which Chinese characters are used in names differs greatly from the way they are used in ordinary text, and some researchers, notably Sproat et al. (1996), have established special-purpose recognizers for Chinese names (and transliterated foreign names), designed to improve the accuracy of automatic segmenters by treating names specially. 2 Chinese names always take the form family name followed by given name. Whereas family names are limited to a small group of characters, given names can consist of any characters. They normally comprise one or two characters, but three-character names have arisen in recent years to ensure uniqueness when the family name is popular--such as Smith or Jones in English. Sproat et al. (1996) implement special recognizers not only for Chinese names and transliterated foreign names, but for components of morphologically derived words as well. The approach we present is not specially tailored for name recognition, but because it is fully adaptive it is likely that it would yield good performance on names if lists of names were provided as supplementary training text. This has not yet been tested.</Paragraph> <Paragraph position="18"> 2 In English there are significant differences between the frequency distribution of letters in names and in words--for example, compare the size of the T section of a telephone directory with the size of the T section of a dictionary--but such differences are far more pronounced in Chinese.</Paragraph> </Section> </Paper>