File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/97/w97-0120_intro.xml
Size: 8,725 bytes
Last Modified: 2025-10-06 14:06:20
<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0120"> <Title>A Self-Organlzing Japanese Word Segmenter using He-ristic Word Identification and Re-estimation</Title> <Section position="3" start_page="0" end_page="206" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Word segmentation is an important problem for Japanese because word boundaries are not marked in its writing system. Other Asian languages such as Chinese and Thai have the same problem.</Paragraph> <Paragraph position="1"> Any Japanese NLP application requ/res word segmentation as the first stage because there are phonological and semantic units whose pronunciation and meaning is not trivially derivable from that of the individual characters. Once word segmentation is done, all established techniques can be exploited to build practically important applications such as spelling correction \[Nagata, 1996\] and text retrieval \[Nie and Brisebois, 1996\] In a sense, Japanese word segmentation is a solved problem if (and only if) we have plenty of segmented training text. Around 95% word segmentation accuracy is reported by using a word-based language model and the Viterbi-like dynamic programi-g procedure \[Nagata, 1994, Takeuchi and Matsumoto, 1995, Yamamoto, 1996\]. However, manually segmented corpora are not always available in a particular target domain and manual segmentation is very expensive.</Paragraph> <Paragraph position="2"> The goal of our research is unsupervised learning of Japanese word segmentation. That is, to build a Japanese word segmenter from a list of initial words and unsegmented training text. Today, it is easy to obtain a 10K-100K word list from either commercial or public domain on-line Japanese dictionaries. Gigabytes of Japanese text are readily available from newspapers, patents, HTML documents, etc..</Paragraph> <Paragraph position="3"> Few works have examined unsupervised word segmentation in Japanese. Both \[Yamamoto, 1996\] and \[Takeuchi and Matsumoto, 1995\] built a word-based language model from unsegmented text using a re-estimation procedure whose initial segmentation was obtained by a rule-based word segreenter. The utility of this approach is limited because it presupposes the existence of a rule-based word segmenter like JUMAN \[Matsumoto et al., 1994\]. It is impossible to build a word segmenter for a new domain without human intervention.</Paragraph> <Paragraph position="4"> For Chinese word segmentation, more self-organized approaches have been tried. \[Sproat et al., 1996\] built a word unigram model using the Viterbi re-estimation whose initial estimates were derived from the frequencies in the corpus of the strings of each word in the lexicon. \[Chang et al., 1995\] combined a small seed segmented corpus and a large unsegmented corpus to build a word unigram model using the Viterbi re-estimation. \[Luo and Roukos, 1996\] proposed a re-estimation procedure which alternates word segmentation and word frequency re-estimation on each half of the training text divided into halves.</Paragraph> <Paragraph position="5"> One of the major problems in unsupervised word segmentation is the treatment of unseen words.</Paragraph> <Paragraph position="6"> \[Sproat et al., 1996\] wrote lexical rules for each productive morphological process, such as plural noun formation, Chinese personal names, and transliterations of foreign words. 
[Chang et al., 1995] used a statistical method called &quot;Two-Class Classifier&quot;, which decides whether a string is actually a word based on features derived from character N-grams.</Paragraph> <Paragraph position="7"> In this paper, we present a self-organized method to build a Japanese word segmenter from a small number of basic words and a large amount of unsegmented training text using a novel re-estimation procedure. The major contribution of this paper is its treatment of unseen words. We devised a statistical word formation model for unseen words which can be re-estimated. We show that it is very effective to combine a heuristic initial word identification method with a re-estimation procedure to filter out inappropriate word hypotheses. We also devised a new method to estimate initial word frequencies.</Paragraph> <Paragraph position="8"> Figure 1 shows the configuration of our Japanese word segmenter. In the following sections, we first describe the statistical language model and the word segmentation algorithm. We then describe the initial word frequency estimation method and the initial word identification method. Finally, we describe the experimental results of unsupervised word segmentation under various conditions.</Paragraph> <Paragraph position="9"> 2 Language Model and Word Segmentation Algorithm</Paragraph> <Section position="1" start_page="203" end_page="203" type="sub_section"> <SectionTitle> 2.1 Word Segmentation Model </SectionTitle> <Paragraph position="0"> Let the input Japanese character sequence be C = c_1 c_2 ... c_m. Our goal is to segment it into a word sequence W = w_1 w_2 ... w_n. The word segmentation task can be defined as finding the word segmentation Ŵ that maximizes the probability of the word sequence given the character sequence, P(W|C). Since the maximization is carried out with the character sequence C fixed, the word segmenter only has to maximize the probability of the word sequence P(W).</Paragraph> <Paragraph position="2"> We approximate the joint probability P(W) by the word unigram model, which is the product of word unigram probabilities: P(W) ≈ P(w_1) P(w_2) ... P(w_n).</Paragraph> <Paragraph position="4"> We used the word unigram model because of its computational efficiency.</Paragraph> </Section> <Section position="2" start_page="203" end_page="205" type="sub_section"> <SectionTitle> 2.2 Unknown Word Model </SectionTitle> <Paragraph position="0"> We defined a statistical word model to assign a reasonable word probability to an arbitrary substring in the input sentence. It is formally defined as the joint probability of the character sequence c_1 ... c_k if w_i is an unknown word. We decompose it into the product of word length probability and word spelling probability, P(c_1 ... c_k | &lt;UNK&gt;) = P(k) P(c_1 ... c_k | k),</Paragraph> <Paragraph position="2"> where k is the length of the character sequence and &lt;UNK&gt; represents an unknown word.</Paragraph> <Paragraph position="3"> We assume that the word length probability P(k) obeys a Poisson distribution whose parameter is the average word length λ in the training corpus. This means that we regard word length as the interval between hidden word boundary markers, which are randomly placed with an average interval equal to the average word length.</Paragraph> <Paragraph position="5"> We approximate the spelling probability given word length, P(c_1 ... c_k | k), by the product of character unigram probabilities regardless of word length: P(c_1 ... c_k | k) ≈ P(c_1) P(c_2) ... P(c_k).</Paragraph>
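As a rough illustration of the unknown word model above, the following Python sketch computes the probability of an arbitrary character string as the product of a Poisson word length probability and character unigram spelling probabilities. It is only a sketch under stated assumptions: the unshifted Poisson form, the function names (poisson_length_prob, unknown_word_prob), and the floor probability for unseen characters are illustrative choices, not necessarily the paper's exact formulation.

import math

def poisson_length_prob(k, avg_len):
    # P(k): word length probability modeled as a Poisson distribution whose
    # parameter is the average word length in the training corpus.
    # (Assumed unshifted Poisson; the paper's exact parameterization may differ.)
    return math.exp(-avg_len) * avg_len ** k / math.factorial(k)

def unknown_word_prob(chars, char_unigram, avg_len):
    # P(c_1 ... c_k | UNK) = P(k) * P(c_1) * ... * P(c_k):
    # word length probability times the character unigram spelling probability.
    spelling = 1.0
    for c in chars:
        spelling *= char_unigram.get(c, 1e-8)  # small floor for unseen characters
    return poisson_length_prob(len(chars), avg_len) * spelling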
<Paragraph position="7"> Character unigram probabilities can be estimated from unsegmented texts. The average word length λ can be computed once the word frequencies in the texts are obtained.</Paragraph> <Paragraph position="9"> λ = Σ_i |w_i| C(w_i) / Σ_i C(w_i), where |w_i| and C(w_i) are the length and the frequency of word w_i, respectively. Therefore, the only parameters we have to (re-)estimate in the language model are the word frequencies. For the corpus used in the experiment, we compared two pairs of distributions: the word length distribution of all words (λ = 1.6) and that of words appearing only once (λ = 4.8), each against its Poisson estimate. The latter is expected to be close to the distribution of unknown words. Although the estimates by the Poisson distribution are not so accurate, they enable us to build a robust and computationally efficient word model.</Paragraph> </Section> <Section position="3" start_page="205" end_page="206" type="sub_section"> <SectionTitle> 2.3 Viterbi Re-estimation </SectionTitle> <Paragraph position="0"> We used the Viterbi-like dynamic programming procedure described in [Nagata, 1994] to get the most likely word segmentation. The generalized Viterbi algorithm starts from the beginning of the input sentence and proceeds character by character. At each point in the sentence, it looks up the combination of the best partial word segmentation hypothesis ending at that point and all word hypotheses starting at that point.</Paragraph> <Paragraph position="1"> We used the Viterbi re-estimation procedure to refine the word unigram model because of its computational efficiency. It involves applying the above segmentation algorithm to a training corpus, using a set of initial estimates of the word frequencies. The best analysis of the corpus is taken to be the true analysis, the frequencies are re-estimated, and the algorithm is repeated until it converges.</Paragraph> </Section> </Section></Paper>
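As a rough illustration of Sections 2.1 and 2.3, the following Python sketch shows a Viterbi-like dynamic programming segmenter over a word unigram model together with the Viterbi re-estimation loop. It is not the paper's implementation: the names (segment, viterbi_reestimate, make_word_prob), the maximum candidate word length, and the fixed iteration count are illustrative assumptions; in particular, word_prob is assumed to return a unigram probability for known words and an unknown-word probability (as sketched above) otherwise.

import math
from collections import defaultdict

MAX_WORD_LEN = 8  # assumed upper bound on candidate word length

def segment(sentence, word_prob):
    # Viterbi-like dynamic programming: best[j] is the log probability of the
    # best segmentation of the first j characters; back[j] remembers where the
    # last word of that segmentation starts.
    n = len(sentence)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - MAX_WORD_LEN), j):
            p = word_prob(sentence[i:j])
            if p > 0.0 and best[i] + math.log(p) > best[j]:
                best[j] = best[i] + math.log(p)
                back[j] = i
    # recover the word sequence from the back pointers
    words, j = [], n
    while j > 0:
        words.append(sentence[back[j]:j])
        j = back[j]
    return list(reversed(words))

def viterbi_reestimate(corpus, make_word_prob, freqs, iterations=5):
    # Viterbi re-estimation: segment the corpus with the current frequency
    # estimates, take the best analysis as the true one, re-count word
    # frequencies, and repeat (here for a fixed number of iterations).
    for _ in range(iterations):
        word_prob = make_word_prob(freqs)
        counts = defaultdict(int)
        for sentence in corpus:
            for w in segment(sentence, word_prob):
                counts[w] += 1
        freqs = dict(counts)
    return freqs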