<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0205">
  <Title>Automatic Extraction of New Words from Japanese Texts using Generalized Forward-Backward Search</Title>
  <Section position="2" start_page="0" end_page="57" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> We present a novel new word extraction method from Japanese texts based on expected word frequencies. First, we compute expected word frequencies from Japanese texts using a robust stochastic N-best word segmenter. We then extract new words by filtering out erroneous word hypotheses whose expected word frequencies are lower than the predefined threshold. The method is derived from an approximation of the generalized version of the Forward-Backward algorithm. When the Japanese word segmenter is trained on a 4.7 million word segmented corpus and tested on 1000 sentences whose out-of-vocabulary rate is 2.1%, the accuracy of the new word extraction method is 43.7% recall and 52.3% precision.</Paragraph>
    <Paragraph position="1"> Introduction Segmentation of sentences into words is trivial in English because words are delimited by spaces.</Paragraph>
    <Paragraph position="2"> It is a simple task to count word frequencies in a given text. It is also a simple task to list all new words (unknown words), namely, the words in a given text that are not found in the system dictionary. However, several languages such as Japanese, Chinese and Thai do not put spaces between words and so in these languages word segmentation, word frequency counting, and new word extraction remain unsolved problems in computational linguistics.</Paragraph>
    <Paragraph position="3"> Most Japanese NLP applications require word segmentation as a first stage because there are phonological units and semantic units whose pronunciation and/or meaning is not trivially derivable from the pronunciation and/or meaning of the individual characters. It is well known that the accuracy of word segmentation greatly depends on the coverage of the dictionary, in other words, the Out-Of-Vocabulary (OOV) rate of the target texts.</Paragraph>
    <Paragraph position="4"> Our goal is to provide a method to automatically extract new words from Japanese texts. This method should adapt the dictionary of the word segmenter to new domains and applications. It should also maintain the dictionary by collecting new words in the target domain. The application of the word segmenter is described elsewhere (Nagata, 1996).</Paragraph>
    <Paragraph position="5"> The approach we take is as follows: First, we design a statistical language model that can assign a reasonable word probability to an arbitrary substring in the input sentence, whether or not it is truly a word. Second, we devise a method to obtain the expected word N-gram count in the target texts, using an N-best word segmentation algorithm (Nagata, 1994). Finally, we extract new words by filtering out spurious word hypotheses whose expected word frequencies are lower than the threshold.</Paragraph>
    <Section position="1" start_page="0" end_page="48" type="sub_section">
      <SectionTitle>
Japanese Morphological Analysis
</SectionTitle>
      <Paragraph position="0"> Before we start, we briefly explain the difficulties of Japanese morphological analysis, especially when the input sentence includes unknown words.</Paragraph>
      <Paragraph position="1"> Suppose the input is a Japanese sentence meaning "University of Pennsylvania celebrates the 50th anniversary of ENIAC", where the words ペンシルバニア (transliteration of 'Pennsylvania') and ENIAC (the name of the world's first computer) are not registered in the system dictionary. Figure 1 shows three possible analyses of the input sentence, where each box represents a word hypothesis whose meaning and part of speech are shown above and under the box. The tag &lt;UNK&gt; represents an unknown word.</Paragraph>
      <Paragraph position="2"> One of the hardest problems in handling unrestricted Japanese text is the identification of unknown words. In Figure 1, the string ENIAC is successfully tokenized as an unknown word. However, there is ambiguity in the segmentation of the string ペンシルバニア大学.</Paragraph>
      <Paragraph position="3"> In the first analysis, the system considers ペンシルバニア ('Pennsylvania') as an unknown word, because 大学 ('university') is registered in the dictionary. This is correct. In the second analysis, the system guesses バニア大学 ('Vania university') as an unknown word, because ペンシル (transliteration of 'pencil') is registered in the dictionary and some university names are registered in the dictionary, such as スタンフォード大学 ('Stanford University') and ケンブリッジ大学 ('Cambridge University'). In the third analysis, the system considers バニア ('Vania') as an unknown word, because both ペンシル and 大学 are registered in the dictionary.</Paragraph>
      <Paragraph position="4"> [Figure 1: Three analyses of the input sentence, with log probabilities -108.95, -110.49, and -111.90 and relative probabilities 0.790, 0.169, and 0.041. The analyses treat 'Pennsylvania', 'Vania university', and 'Vania', respectively, as the unknown word.]</Paragraph>
      <Paragraph position="8"> It is often the case that we have overlapping word hypotheses if the input sentence contains unknown words, such as ペンシルバニア, バニア大学, and バニア in Figure 1. We need a criterion to select the most likely word hypothesis from among the overlapping candidates. In fact, it is fairly difficult to get plausible analyses like the ones shown in Figure 1, because failure to identify an unknown word affects the segmentation of the neighboring words. Obviously, a robust word segmenter is the essential first step.</Paragraph>
      <Paragraph position="9"> In the following sections, we first describe a statistical language model to cope with unknown words. We then describe the word segmentation algorithm and the new word extraction method, with their derivation as an approximation of a generalization of the Forward-Backward algorithm (Baum, 1972). Finally, we present experimental results that demonstrate the method's effectiveness.</Paragraph>
    </Section>
    <Section position="2" start_page="48" end_page="49" type="sub_section">
      <SectionTitle>
Statistical Language Model
Segmentation Model (Tagging Model)
</SectionTitle>
      <Paragraph position="0"> Let the input Japanese character sequence be</Paragraph>
      <Paragraph position="2"> be defined as finding the word segmentation and part-of-speech assignment $(\hat{W}, \hat{T})$ that maximizes the joint probability of the word sequence and tag sequence given the character sequence, $P(W, T \mid C)$.</Paragraph>
      <Paragraph position="3"> Since the maximization is carried out with fixed character sequence C, the word segmenter only has to maximize the joint probability of word sequence and tag sequence P(W, T).</Paragraph>
      <Paragraph position="5"> We call P(W,T) the segmentation model, although it is usually called tagging model in English tagger research. In this paper, we compare three segmentation models: part of speech trigram, word unigram, and word bigram.</Paragraph>
      <Paragraph position="6"> In the part-of-speech trigram model (POS trigram model), the joint probability $P(W, T)$ is approximated by the product of part-of-speech trigram probabilities $P(t_i \mid t_{i-2}, t_{i-1})$ and word output probabilities for a given part of speech $P(w_i \mid t_i)$</Paragraph>
      <Paragraph position="8"> In the word unigram and word bigram models, the joint probability P(W,T) is approximated by the product of word unigram probabilities P(wi,ti) and word bigram probabilities</Paragraph>
      <Paragraph position="10"> Basically, parameters of these segmentation models are estimated by computing the relative frequencies of the corresponding events in the segmented training corpus. However, in order to handle unknown words, we have introduced a slight modification in computing the relative frequencies, as is described in the next section.</Paragraph>
    </Section>
    <Section position="3" start_page="49" end_page="49" type="sub_section">
      <SectionTitle>
Word Model
</SectionTitle>
      <Paragraph position="0"> We think of an unknown word as a word having a special part of speech &lt;UNK&gt;. We define a statistical word model to assign a word probability to each word hypothesis. It is formally defined as the joint probability of the character sequence $c_1 \ldots c_k$ if $w_i$ is an unknown word. We decompose it into the product of the word length probability and the word spelling probability,</Paragraph>
      <Paragraph position="2"> where $k$ is the length of the character sequence. We call $P(k)$ the word length model, and $P(c_1 \ldots c_k \mid k)$ the word spelling model.</Paragraph>
      <Paragraph position="3"> We assume that the word length probability $P(k)$ obeys a Poisson distribution whose parameter is the average word length $\lambda$ in the training corpus,</Paragraph>
      <Paragraph position="5"> This means that we regard word length as the interval between hidden word boundary markers, which are randomly placed with an average interval equal to the average word length. Although this word length model is very simple, it plays a key role in making the word segmentation algorithm robust.</Paragraph>
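      <Paragraph> The word length model can be illustrated with a short Python sketch. The exact form of the Poisson equation is not reproduced in this text, so the sketch assumes a Poisson distribution shifted by one character, which guarantees that every word has at least one character and that the mean equals the average word length. The function name word_length_prob is hypothetical.</Paragraph>

```python
import math


def word_length_prob(k, mean_length):
    """Assumed shifted Poisson: P(k) = r^(k-1) * exp(-r) / (k-1)!, with r = mean_length - 1."""
    if k >= 1:
        rate = mean_length - 1.0
        return rate ** (k - 1) * math.exp(-rate) / math.factorial(k - 1)
    # Words of length zero (or negative length) are impossible.
    return 0.0


# 4.49 is the average length of low frequency words reported in the experiments.
lengths = range(1, 41)
total_mass = sum(word_length_prob(k, 4.49) for k in lengths)
mean_of_model = sum(k * word_length_prob(k, 4.49) for k in lengths)
```

      <Paragraph> Under this assumption the probabilities over lengths 1 to 40 sum to almost exactly one, and the mean of the distribution recovers the average word length that was passed in.</Paragraph>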
      <Paragraph position="6"> We approximate the spelling probability given word length, $P(c_1 \ldots c_k \mid k)$, by the word-based character bigram model, regardless of word length. Since there are more than 3,000 characters in Japanese, the amount of training data would be too small if we divided them by word length.</Paragraph>
      <Paragraph position="8"> Here, the special symbol "#" indicates the word boundary marker.</Paragraph>
      <Paragraph position="9"> Note that the word-based character bigram model is different from the sentence-based character bigram model. The former is estimated from the corpus segmented into words. It assigns a large probability to a character sequence that appears in the beginning (prefixes), the middle, and the end (suffixes) of a word. It also assigns a small probability to a character sequence that appears across a word boundary.</Paragraph>
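      <Paragraph> A minimal Python sketch of a word-based character bigram model follows. It assumes unsmoothed relative-frequency estimates, a toy word list, and a chain that runs from the boundary marker "#" through the characters and back to "#"; the spelling equation itself is not reproduced in this text, so that chain is an assumption. The function names are hypothetical.</Paragraph>

```python
from collections import Counter

BOUNDARY = "#"  # word boundary marker


def train_char_bigrams(words):
    """Count character bigrams inside words, padding each word with the boundary marker."""
    pair_counts = Counter()
    context_counts = Counter()
    for word in words:
        chars = [BOUNDARY] + list(word) + [BOUNDARY]
        for prev, cur in zip(chars, chars[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts


def spelling_prob(word, pair_counts, context_counts):
    """Product of unsmoothed bigram probabilities P(cur | prev) over the padded word."""
    chars = [BOUNDARY] + list(word) + [BOUNDARY]
    prob = 1.0
    for prev, cur in zip(chars, chars[1:]):
        if context_counts[prev] == 0:
            return 0.0
        prob *= pair_counts[(prev, cur)] / context_counts[prev]
    return prob


# Toy training words; a real model would be trained on the segmented corpus.
pairs, contexts = train_char_bigrams(["ab", "ab", "ac"])
prob_seen = spelling_prob("ab", pairs, contexts)
prob_unseen_start = spelling_prob("ba", pairs, contexts)
```

      <Paragraph> Because the counts come from segmented words only, a string that starts with a character never seen word-initially receives zero probability in this unsmoothed sketch; a practical model would smooth these estimates.</Paragraph>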
      <Paragraph position="10"> By using the word model, we can create modified segmentation models that take unknown words into consideration. The parameters of the modified POS trigram, word unigram, and word bigram are estimated by Equations (8), (9), (10), and (11), in Figure 2.</Paragraph>
      <Paragraph position="11"> In Figure 2, $C(\cdot)$ denotes the count of the specified event in the training corpus. In the part-of-speech trigram model, $P(w_i \mid t_i)$ for an unknown word $w_i$ is obtained, by definition, from the word model $P(w_i \mid \langle\mathrm{UNK}\rangle)$. In the word unigram model, the unigram count $C(w_i)$ for an unknown word $w_i$ is given as the product of the total unigram count of unknown words $C(\langle\mathrm{UNK}\rangle)$ and the word model probability $P(w_i \mid \langle\mathrm{UNK}\rangle)$. The higher order N-gram counts involving unknown words are also obtained in the same manner.</Paragraph>
      <Paragraph position="12"> In order to compute the parameters in Figure 2, we need the counts involving unknown words, such as $C(t_{i-2}, t_{i-1}, \langle\mathrm{UNK}\rangle)$, $C(\langle\mathrm{UNK}\rangle)$, and $C((w_{i-1}, t_{i-1}), \langle\mathrm{UNK}\rangle)$. These counts are important because they represent the contexts in which unknown words are likely to appear. To estimate these counts, we replace all words appearing only once in the training corpus with the unknown word tag &lt;UNK&gt; before computing relative frequencies. The underlying idea of the replacement is the same as Turing's estimates in back-off smoothing (Katz, 1987). We redistribute the probability mass of low count sequences to "unseen" sequences.</Paragraph>
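      <Paragraph> The replacement step can be sketched in Python as follows. This is an illustration on a toy corpus, not the original implementation; the token "UNK" stands in for the unknown word tag, and the function names are hypothetical.</Paragraph>

```python
from collections import Counter

UNK = "UNK"  # stands in for the unknown word tag


def replace_singletons(sentences):
    """Replace every word that occurs exactly once in the corpus with UNK."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[word if counts[word] > 1 else UNK for word in sentence]
            for sentence in sentences]


def bigram_counts(sentences):
    """Count adjacent word pairs, so that contexts of UNK become observable events."""
    pairs = Counter()
    for sentence in sentences:
        pairs.update(zip(sentence, sentence[1:]))
    return pairs


# Toy corpus: "c" and "d" each occur once and are therefore mapped to UNK.
corpus = [["a", "b", "c"], ["a", "b", "d"]]
replaced = replace_singletons(corpus)
context_counts = bigram_counts(replaced)
```

      <Paragraph> After the replacement, the pair ("b", "UNK") has a count of 2, which is the kind of context count that the relative-frequency estimates for unknown words rely on.</Paragraph>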
    </Section>
    <Section position="4" start_page="49" end_page="49" type="sub_section">
      <SectionTitle>
Generalized Forward Backward
Reestimation
</SectionTitle>
      <Paragraph position="0"> Generalization of the Forward and</Paragraph>
    </Section>
    <Section position="5" start_page="49" end_page="51" type="sub_section">
      <SectionTitle>
Viterbi Algorithm
</SectionTitle>
      <Paragraph position="0"> In English part of speech taggers, the maximization of Equation (1) to get the most likely tag sequence, is accomplished by the Viterbi algorithm (Church, 1988), and the maximum likelihood estimates of the parameters of Equation (2) are obtained from untagged corpus by the Forward-Backward algorithm (Cutting et al., 1992). However, it is impossible to apply the Viterbi algorithm and the Forward-Backward algorithm for word segmentation of those languages that have no delimiter between words, such as Japanese and Chinese, because word segmentation hypotheses overlap one another.</Paragraph>
      <Paragraph position="1"> Figure 3 shows an example of overlapping word hypotheses and possible word segmentations for the string 全国都道府県 ('all prefectures in the nation'). We assume 全国 ('all nation'), 全 ('all'), 国都 ('national capital'), 都道府県 ('prefectures'), 都道 ('metropolitan road'), 都 ('metropolis'), 道府県 ('prefectures'), 道 ('road'), 府県 ('prefectures'), 府 ('prefecture'), and 県 ('prefecture') are registered in the dictionary. There are 15 possible word segmentations in this example. In Japanese, many words consist of a single character. Moreover, a sequence of characters may itself constitute a different word.</Paragraph>
      <Paragraph position="3"> ized Viterbi algorithm as follows. Let the input Japanese character sequence of length $n$ be $C = c_1 c_2 \ldots c_n$, and let $c_p^q$ denote the substring $c_{p+1} \ldots c_q$. We define a function $D$ that maps a character sequence $c_p^q$ to a list of word hypotheses $\{w_i\}$.</Paragraph>
      <Paragraph position="4"> Function $D$ is a generalization of the dictionary.</Paragraph>
      <Paragraph position="5"> Here, for simplicity, $w_i$ denotes a combination of orthography (formally denoted by $w_i$) and part of speech $t_i$. We use the word bigram as the segmentation model in the following example. Other segmentation models, such as the part-of-speech trigram and the word unigram, can be used in the same manner.</Paragraph>
      <Paragraph position="6"> In the generalized forward algorithm, the forward probability $\alpha_p^q(w_i)$ is the joint probability of the character sequence $c_0^q$ and the event that the final word in the segmentation of $c_0^q$ is $w_i$, which spans the substring $c_p^q$. Forward probabilities can be computed recursively as follows.</Paragraph>
      <Paragraph position="7"> $$\alpha_q^r(w_{i+1}) = \sum_{0 \le p &lt; q} \; \sum_{w_i \in D(c_p^q)} \alpha_p^q(w_i) \, P(w_{i+1} \mid w_i), \qquad 0 &lt; q &lt; n, \; q &lt; r \le n \qquad (12)$$ The generalized forward algorithm starts from the beginning of the input sentence, and proceeds character by character. At each point $q$ in the sentence, it sums the products of the forward probabilities of the word segmentation hypotheses ending at that point, $\alpha_p^q(w_i)$, and the transition probabilities to the word hypotheses starting at that point, $P(w_{i+1} \mid w_i)$.</Paragraph>
      <Paragraph position="9"> Figure 4 shows a snapshot of the generalized forward algorithm. The input is 全国都道府県, and the current point $q$ is 2. The word hypotheses ending at point 2 ($w_i \in D(c_p^2)$) are 全国 ($c_0^2$) and 国 ($c_1^2$). Those starting at point 2 ($w_{i+1} \in D(c_2^r)$) are 都道府県 ($c_2^6$), 都道 ($c_2^4$), and 都 ($c_2^3$). The string 都道府 ($c_2^5$) is not registered in the dictionary. All combinations of these words are examined.</Paragraph>
      <Paragraph position="10"> The generalized Viterbi algorithm can be obtained by replacing summation with maximization in Equation (12). Here, $\phi_p^q(w_i)$ is the probability of the most likely word segmentation sequence for the character sequence $c_0^q$ whose final word $w_i$ spans the substring $c_p^q$.</Paragraph>
      <Paragraph position="12"> Note that the original Forward algorithm and the Viterbi algorithm are the special cases of Equations (12) and (13) in which $p$ and $r$ are fixed as $p = q-1$ and $r = q+1$.</Paragraph>
      <Paragraph position="13"> In order to handle unknown words, the dictionary function $D$ returns a word hypothesis tagged as an unknown word if the substring $c_p^q$ is not registered in the dictionary, such as 都道府 ($c_2^5$) in Figure 4. The word model assigns a reasonable probability to the unknown word. Therefore, in the generalized forward algorithm and the generalized Viterbi algorithm, we hypothesize all substrings in the input sentence as words, and examine all possible combinations of these word hypotheses.</Paragraph>
      <Paragraph position="14"> Since we can define the generalized Backward algorithm in the same manner, we can define the generalized Forward-Backward algorithm to estimate the word N-gram counts in Japanese texts, and to reestimate the word N-gram probabilities in the segmentation model. However, we give a more intuitive account of the method to introduce an approximation of the generalized Forward-Backward algorithm.</Paragraph>
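      <Paragraph> The generalized forward and Viterbi recursions can be sketched in Python under several simplifying assumptions: each substring carries at most one word hypothesis (no part-of-speech ambiguity), unknown words are not hypothesized, and the transition function is a toy stand-in for the word bigram model. The example string 全国都道府県 and its dictionary are reconstructed from the English glosses in the text, with 国 taken from the discussion of Figure 4. The function name generalized_forward is hypothetical.</Paragraph>

```python
def generalized_forward(text, dictionary, transition, combine=sum):
    """Score segmentations of text; alpha[(p, q)] covers text[:q] with last word text[p:q].

    combine=sum gives the forward recursion, combine=max the Viterbi recursion.
    """
    n = len(text)
    # For every end point q, list the start points p of dictionary words text[p:q].
    starts = {q: [p for p in range(q) if text[p:q] in dictionary]
              for q in range(1, n + 1)}
    alpha = {}
    for q in range(1, n + 1):
        for p in starts[q]:
            word = text[p:q]
            if p == 0:
                # Sentence-initial word: no previous word.
                alpha[(p, q)] = transition(None, word)
                continue
            # Combine over all reachable words text[r:p] that end where this word starts.
            terms = [alpha[(r, p)] * transition(text[r:p], word)
                     for r in starts[p] if (r, p) in alpha]
            if terms:
                alpha[(p, q)] = combine(terms)
    finals = [alpha[(p, n)] for p in starts[n] if (p, n) in alpha]
    return combine(finals) if finals else 0


TEXT = "全国都道府県"
DICTIONARY = {"全国", "全", "国", "国都", "都道府県", "都道", "都",
              "道府県", "道", "府県", "府", "県"}

# With every transition set to 1, the forward sum counts the segmentations.
segmentation_count = generalized_forward(TEXT, DICTIONARY, lambda prev, word: 1)

# With a constant toy probability per word, max picks the segmentation with fewest words.
best_score = generalized_forward(TEXT, DICTIONARY, lambda prev, word: 0.5, combine=max)
```

      <Paragraph> In this sketch the count comes out as 15, which agrees with the number of segmentations stated for the example, and the best toy score of 0.25 corresponds to the two-word segmentation 全国 followed by 都道府県. The recursion is written in a pull form (each word looks back at the words ending at its start point), which is equivalent to pushing forward probabilities from point q to point r.</Paragraph>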
    </Section>
    <Section position="6" start_page="51" end_page="52" type="sub_section">
      <SectionTitle>
Expected Word N-gram Count
</SectionTitle>
      <Paragraph position="0"> By using the above mentioned word segmentation algorithm, we can get all word segmentation hypotheses of the input sentence. Once we get them, we can estimate word N-gram count in an unsegmented Japanese corpus.</Paragraph>
      <Paragraph position="1"> Let $O_j^i$ be the $j$th word segmentation hypothesis for the $i$th sentence in the corpus. $P(O_j^i)$ can be computed by using the segmentation model. The Bayes a posteriori estimate of the word unigram count $C^i(w_\alpha)$ and the word bigram count $C^i(w_\alpha, w_\beta)$ in the $i$th sentence can be computed as,</Paragraph>
      <Paragraph position="3"> Here, $n_j^i(w_\alpha)$ and $n_j^i(w_\alpha, w_\beta)$ denote the number of times the unigram $w_\alpha$ and the bigram $w_\alpha, w_\beta$ appeared in the $j$th candidate of the $i$th sentence. The estimate of the total unigram count $C(w_\alpha)$ and the total bigram count $C(w_\alpha, w_\beta)$ can be obtained by summing the counts over all sentences in the corpus.</Paragraph>
      <Paragraph position="5"> The estimate of the unigram probability and the bigram probability can be obtained as the relative frequency of the associated events.</Paragraph>
      <Paragraph position="7"> If necessary, we can reestimate the word N-gram probabilities by replacing $P(w_\alpha)$ and $P(w_\beta \mid w_\alpha)$ with $f(w_\alpha)$ and $f(w_\beta \mid w_\alpha)$.</Paragraph>
      <Paragraph position="8"> Extraction of New Words in Texts Expected word unigram counts (expected word frequencies) in the corpus (Equation (16)) can be used as a measure of likelihood that a particular substring in the input texts is actually a word. Let $\theta$ denote the minimum expected word frequency that we use to classify a given word hypothesis $w_\alpha$ as a word.</Paragraph>
      <Paragraph position="9"> $$C(w_\alpha) &gt; \theta \qquad (20)$$ Those words that are not found in the dictionary and whose expected frequencies in the corpus are larger than the threshold $\theta$ are extracted as the new words in the input texts.</Paragraph>
      <Paragraph position="10"> In theory, expected word N-gram counts can be obtained by the generalized Forward-Backward algorithm. In order to save computation time, however, we approximated the weighted sum of  called the Viterbi reestimation. Our method might be called N-best reestimation. It is designed to be more accurate than the Viterbi reestimation and more efficient than the generalized Forward-Backward algorithm.  N-best word segmentation hypotheses can be obtained by using the Forward-DP Backward-A* algorithm (Nagata, 1994). It consists of a forward dynamic programming search to record the probabilities of all partial word segmentation hypotheses, and a backward A* algorithm to extract the N-best hypotheses. It is a generalization of the tree-trellis search (Soong and Huang, 1991), in the sense that its forward Viterbi search is replaced with the generalized Viterbi search described in this paper.</Paragraph>
      <Paragraph position="11"> In reestimating the word N-gram probabilities, we introduce two modifications to the normal reestimation procedure. The first modification is that, instead of using the relative frequency in an unsegmented corpus (Equations (18) and (19)), we combine the N-gram count in the segmented corpus with the estimated N-gram count in the unsegmented corpus to increase estimate reliability. This is because a fairly large segmented Japanese corpus was available in our experiments.</Paragraph>
      <Paragraph position="13"> where $C_{seg}(\cdot)$ denotes the count in the segmented corpus, and $C_{unseg}(\cdot)$ denotes the estimated count in the unsegmented corpus.</Paragraph>
      <Paragraph position="14"> The second modification is that we prune the expected N-gram counts in the unsegmented corpus if they are lower than a predefined threshold, before computing Equations (21) and (22). This is because $C_{unseg}(\cdot)$ is unreliable, especially when $C_{seg}(\cdot)$ is low.</Paragraph>
    </Section>
    <Section position="7" start_page="52" end_page="52" type="sub_section">
      <SectionTitle>
Examples of Estimating Expected
Word Frequencies
</SectionTitle>
      <Paragraph position="0"> Finally, we show a simple example of estimating the word N-gram counts in an unsegmented sentence. Assume that the $i$th input sentence is the character sequence 言語学入門, which means "introduction to linguistics", and its best three word segmentation hypotheses are as shown in  the relative probabilities of the word segmentation hypotheses, corresponding to $P(O_j^i) / \sum_j P(O_j^i)$ in Equation (14). The expected word unigram count of each word hypothesis in the sentence is,</Paragraph>
      <Paragraph position="2"> The expected total number of the words in the sentence, $\sum_\alpha C^i(w_\alpha)$, is 2.3. If none of the word hypotheses is registered in the dictionary and the threshold $\theta$ is 0.15, we regard 入門 ('introduction'), 言語学 ('linguistics'), 言語 ('language'), and 学 ('study') as the new words. 言 ('say') and 語学 ('study of languages') are discarded.</Paragraph>
      <Paragraph position="3"> Let us give another example that shows the effect of summing the expected word unigram counts over all the sentences in the corpus. Suppose a sentence meaning "University of Pennsylvania celebrates the 50th anniversary of ENIAC." is in the corpus, and the first three word segmentation hypotheses are as shown in Figure 1, so that the expected word unigram counts for ペンシルバニア ('Pennsylvania'), バニア大学 ('Vania University'), and バニア ('Vania') are 0.790, 0.169, and 0.041, respectively. Suppose also that a sentence meaning "White House lies at Pennsylvania Avenue." is in the corpus, and the expected word unigram counts for ペンシルバニア ('Pennsylvania'), the hypothesis for 'Vania Avenue', and バニア ('Vania') are 0.825, 0.127, and 0.048, respectively. The expected word unigram counts in the corpus are,</Paragraph>
      <Paragraph position="5"> to be a new word. The more often the unknown word appears in the corpus, the more likely it is to be extracted, even if there is word segmentation ambiguity in each sentence.</Paragraph>
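      <Paragraph> The summation over sentences can be illustrated with a small Python sketch that uses the relative probabilities quoted in the two example sentences. Each hypothesis contributes its relative probability, $P(O_j^i) / \sum_j P(O_j^i)$, once for every occurrence of a word in it. Only the competing unknown word candidates are listed for each hypothesis, English glosses are used as word labels, and the threshold of 0.5 and the function names are assumptions made for the illustration.</Paragraph>

```python
from collections import defaultdict


def expected_counts(nbest_per_sentence):
    """Sum over sentences the relative probability of each hypothesis per word occurrence."""
    totals = defaultdict(float)
    for hypotheses in nbest_per_sentence:
        norm = sum(prob for prob, _ in hypotheses)
        for prob, words in hypotheses:
            for word in words:
                totals[word] += prob / norm
    return dict(totals)


def extract_new_words(totals, dictionary, threshold):
    """Keep hypotheses outside the dictionary whose expected count exceeds the threshold."""
    return sorted(word for word, count in totals.items()
                  if count > threshold and word not in dictionary)


# (relative probability, unknown word candidates) for the two example sentences.
nbest = [
    [(0.790, ["Pennsylvania"]), (0.169, ["Vania University"]), (0.041, ["Vania"])],
    [(0.825, ["Pennsylvania"]), (0.127, ["Vania Avenue"]), (0.048, ["Vania"])],
]
totals = expected_counts(nbest)
new_words = extract_new_words(totals, dictionary=set(), threshold=0.5)
```

      <Paragraph> Under these assumptions the expected count for 'Pennsylvania' accumulates to about 1.615 across the two sentences, while 'Vania' reaches only about 0.089, so only 'Pennsylvania' passes the assumed threshold.</Paragraph>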
    </Section>
    <Section position="8" start_page="52" end_page="53" type="sub_section">
      <SectionTitle>
Experiments
Language Data
</SectionTitle>
      <Paragraph position="0"> We used the EDR Japanese Corpus Version 1.0 (EDR, 1995) to train and test the word segmentation program. It is a corpus of approximately 5 million words (200,000 sentences). It was collected to build a Japanese Electronic Dictionary, and contains a variety of Japanese sentences taken from newspapers, magazines, dictionaries, encyclopedias, textbooks, etc. It has a variety of annotations on morphology, syntax, and semantics. We used word segmentation, pronunciation, and part of speech in the morphology information field of the annotation.</Paragraph>
      <Paragraph position="1"> In this experiment, we randomly selected 90% of the sentences in the EDR Corpus for training the word segmentation program. We made two test sets from the rest of the corpus, one for a small size experiment (100 sentences) and the other for a medium size experiment (1000 sentences). Table 1 shows the number of sentences, words, and characters for training and test sets. Note that the test sets were not part of the training set. That is, open data were tested in the experiment.</Paragraph>
      <Paragraph position="2">  The training texts contained 133281 word types. We discarded word types that appeared only once in the training texts. This resulted in 65152 word types being registered in the dictionary of the word segmenter. We trained three segmentation models, namely, part of speech trigram, word unigram, and word bigram, after we replaced those words that appeared only once in the training texts with the unknown word tag &lt;UNK&gt;, as described in the section on the word model. After this replacement, there were 758172 distinct word bigrams. Again, we discarded word bigrams that appeared only once in the training texts to save main memory, and used the remaining 294668 word bigrams. The word bigram probabilities were smoothed using deleted interpolation (Jelinek, 1985).</Paragraph>
      <Paragraph position="3"> The training texts contained 3534 character types. We discarded characters that appeared only once in the training texts; 3167 character types remained. We then replaced the discarded characters with the unknown character tag to train the word spelling model. There were 91198 distinct character bigrams in the words in the training texts. (Footnote 3: There are more than 3000 (some say more than 10000) characters in Japanese, and their frequency distribution is skewed. In order to save memory, we used a type of character bigram model that considers un-) We made two spelling models. The first was trained using all words in the training texts, while the second was trained using those words whose frequency is less than or equal to 2. In principle, the spelling model of unknown words must be trained using the low frequency words. However, it might suffer from the sparse data problem because the total number of word tokens for training is decreased from 4746461 to 103919. We also made two length models. The average word lengths of all words and that of low frequency words were 1.58 and 4.49, respectively. Note that the average word length is the only parameter of the word length model.</Paragraph>
    </Section>
    <Section position="9" start_page="53" end_page="54" type="sub_section">
      <SectionTitle>
Evaluation Measures
</SectionTitle>
      <Paragraph position="0"> Word segmentation accuracy is expressed in terms of recall and precision. First, we count the number of words in corpus segmentation (Std), the number of words in system segmentation (Sys), and the number of matching word segmentations (M).</Paragraph>
      <Paragraph position="1"> Recall is defined as M/Std, and precision is defined as M/Sys.</Paragraph>
      <Paragraph position="2"> Figure 6 shows an example of computing precision and recall for a sentence meaning "Rockefeller Laboratory is an academic laboratory founded by an American millionaire, Rockefeller". Because of the difference in the segmentation of ロックフェラー研究所 ('Rockefeller Laboratory'), the number of words in corpus segmentation (Std=15) differs from that of system segmentation (Sys=14). Note that the system correctly tokenized a word that is not registered in the dictionary.</Paragraph>
      <Paragraph position="3"> New word extraction accuracy is described in terms of recall, precision, and F-measure. First, we count the number of unknown words in the corpus segmentation (Std), the number of unknown words in the system segmentation (Sys), and the number of matching words (M). Here, unknown words are those that are not registered in the system dictionary. Recall is defined as M/Std, and precision is defined as M/Sys. Since recall and precision greatly depend on the frequency threshold, we used the F-measure to indicate the overall performance. F-measure is used in Information Retrieval, and is calculated by</Paragraph>
      <Paragraph position="5"> where $P$ is precision, $R$ is recall, and $\beta$ is the relative importance given to recall over precision.</Paragraph>
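      <Paragraph> The evaluation measures can be sketched in Python. The F-measure equation is not reproduced in this text, so the sketch assumes the form commonly used in Information Retrieval, $F = (\beta^2 + 1) P R / (\beta^2 P + R)$; the function names are hypothetical.</Paragraph>

```python
def recall_precision(matched, std, sys_count):
    """Recall is M/Std and precision is M/Sys, as defined in the text."""
    return matched / std, matched / sys_count


def f_measure(precision, recall, beta=1.0):
    """Assumed Information Retrieval form: (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1.0) * precision * recall / denominator


# Recall and precision figures quoted in the abstract, with equal weighting.
overall = f_measure(precision=0.523, recall=0.437, beta=1.0)
```

      <Paragraph> With the recall of 43.7% and precision of 52.3% quoted in the abstract and $\beta = 1.0$, this assumed form gives an F-measure of roughly 0.476.</Paragraph>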
      <Paragraph position="6"> known characters, like the word bigram model used in the segmentation model.</Paragraph>
    </Section>
    <Section position="10" start_page="54" end_page="54" type="sub_section">
      <SectionTitle>
Word Segmentation Accuracy
</SectionTitle>
      <Paragraph position="0"> In order to decide the best configuration of the underlying Japanese word segmenter, we compared three segmentation models: part of speech trigram, word unigram, and word bigram. We also compared three word models: all words, low frequency words, and the combination of the two.</Paragraph>
      <Paragraph position="1"> The third word model consisted of the spelling model trained using all words and the length model trained using low frequency words.</Paragraph>
      <Paragraph position="2"> Table 2 shows, for the small test set (100 sentences), the segmentation accuracy of the various combinations of the segmentation models and the word models.</Paragraph>
      <Paragraph position="3"> It is obvious that word bigram outperformed the part of speech trigram as well as word unigram. As for the word model, it seems the combination of the spelling model for all words and the length model for low frequency words is the best, but the difference is small. In the following experiment, we decided to use word bigram as the segmentation model, and the combination of the spelling model of all words and the length model of low frequency words as the word model.</Paragraph>
    </Section>
    <Section position="11" start_page="54" end_page="56" type="sub_section">
      <SectionTitle>
New Word Extraction Accuracy
</SectionTitle>
      <Paragraph position="0"> We tested the new word extraction method using the medium size test set (1000 sentences). It contains 538 unknown word types. Eight word types appeared twice in the test set. The other 530 word types appeared only once. The out-of-vocabulary rate of the test set is 2.2%. To count the expected word frequencies, we used the top-10 word segmentation hypotheses. We limited the maximum character length of an unknown word to 8 in order to save computation time.</Paragraph>
      <Paragraph position="1"> We tested three variations of the new word extraction method. The first one was "No Reestimation"; it uses the word segmenter's outputs as they are when extracting new words. The second and the third ones carry out reestimation before extraction, where the pruning thresholds of the expected N-gram counts in the reestimation are 0.95 and 0.50, respectively. Reestimations were carried out three times.</Paragraph>
      <Paragraph position="2"> Table 3 shows the new word extraction accuracies for a variety of expected word frequency thresholds $\theta$, with and without reestimation. In Table 3, we set $\beta = 1.0$ to compute the F-measure.</Paragraph>
      <Paragraph position="3"> As Table 3 shows, the higher the threshold is, the higher the precision and the lower the recall become. When we put equal importance on recall and precision, the best value for the expected word frequency threshold is around 0.10 where the recall  were not extracted (std-matched), when the frequency threshold was 0.5 and reestimation was not carried out. We find that the overall quality of the extracted word hypotheses is satisfactory, although the values of recall and precision are not so high. We discuss the reason for this in the next section.</Paragraph>
      <Paragraph position="4"> The problem of Japanese word segmentation is that people often cannot agree on a single word segmentation. Therefore, the reported performance could be greatly underestimated. Most of the new words extracted by the system are acceptable as words (at least for us), and are not necessarily wrong word entries. On the other hand, most of the new words not extracted by the system can be divided into shorter words that are registered in the dictionary.</Paragraph>
      <Paragraph position="5"> For example, in the first sentence of Figure 8, データ・コミュニケーション ('data communication') is regarded as one word in the corpus segmentation and counted as an unknown word in the test sentence. However, the system segmented it into データ ('data') and コミュニケーション ('communication'), both of which are found in the dictionary. In the second sentence of Figure 8, the system extracted ハノーバー公 ('Duke of Hanover') as a new word, while this word is divided into ハノーバー ('Hanover') and 公 ('Duke') in the corpus segmentation. Most extraction errors are of this category.</Paragraph>
      <Paragraph position="6"> There are three types of obvious extraction errors. The first type is the truncation of long words. Some transliterated Western-origin words exceed the predefined maximum length for unknown words. The third sentence of Figure 8 is an example of this type. In Japanese, 'illustration' is transliterated into 9 characters, イラストレーション, which exceeds the maximum unknown word length of 8 characters in our system. Since イラスト (the transliteration of 'illust', which also means illustration in Japanese) is registered in the dictionary, レーション (the transliteration of 'ration') is incorrectly extracted as a new word.</Paragraph>
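The truncation effect can be reproduced with a small sketch. It assumes, purely for illustration, that unknown-word candidates are all substrings of at most 8 characters. The helper candidate_substrings is hypothetical and is not the paper's search procedure.

```python
def candidate_substrings(text, max_len=8):
    """Enumerate every substring of text with 1..max_len characters."""
    return {text[i:j]
            for i in range(len(text))
            for j in range(i + 1, min(i + max_len, len(text)) + 1)}


word = "イラストレーション"  # transliteration of 'illustration'
candidates = candidate_substrings(word, max_len=8)
print(len(word))                  # number of characters in the full word
print(word in candidates)         # whether the full word can be proposed
print("レーション" in candidates)  # whether the remainder can be proposed
```

Because the full 9-character word is never among the candidates, the segmenter can only cover it with shorter pieces, such as the known word followed by an unknown remainder.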
      <Paragraph position="7"> The second type is the fragmentation of numerals. Since we did not use any tokenizers, numerals tend to be divided arbitrarily. In the second sentence in Figure 8, the system divided &amp;quot;1676&amp;quot; into &amp;quot;16&amp;quot; and &amp;quot;76&amp;quot;. In fact, it may output &amp;quot;1&amp;quot; and &amp;quot;676&amp;quot;, or &amp;quot;16&amp;quot;, &amp;quot;7&amp;quot;, and &amp;quot;6&amp;quot;, or any other split. The third type is the concatenation of noun(s) and particle. In other words, the system sometimes erroneously recognizes a noun phrase as a word. For example, the Japanese counterparts of &amp;quot;A of B&amp;quot;, &amp;quot;A and B&amp;quot;, and &amp;quot;A, B&amp;quot; are recognized as a single word. This may be because the probability of one long unknown word can be higher than the product of the probabilities of two short unknown (or infrequent) words and one known word. The fourth sentence of Figure 8 is an example of this type of error. The system considered 可制御かつ可観測 ('controllable and observable') as a word, while it is divided into 可 ('able'), 制御 ('control'), かつ ('and'), 可 ('able'), and 観測 ('observe') in the corpus.</Paragraph>
      <Paragraph position="8"> As for reestimation, Table 3 shows no significant improvements in the new word extraction accuracy. The only effect of reestimation, in our experiment, is to increase the expected word frequencies of the unknown word hypotheses whose expected word frequencies are greater than the pruning threshold of reestimation.</Paragraph>
      <Paragraph position="9"> This result does not necessarily mean that reestimation is useless, because most unknown words appeared only once in the test sentences. An ideal example to confirm that reestimation works well would have an unknown word appearing more than twice in the test sentences, where it is trivial to extract the word in one appearance but difficult in the others because of, for example, successive unknown words. If the test set were larger, or the out-of-vocabulary rate were higher, we believe that the effectiveness of reestimation would be more clearly shown.</Paragraph>
    </Section>
    <Section position="12" start_page="56" end_page="57" type="sub_section">
      <SectionTitle>
Related Work
</SectionTitle>
      <Paragraph position="0"> Recent years have seen several works on corpus-based word segmentation and dictionary construction for both Japanese and Chinese. For Chinese, (Sproat et al., 1994) used the word unigram model in their word segmenter based on a weighted finite-state transducer. Word frequencies were estimated by Viterbi reestimation (a reestimation procedure using the best analysis) from an unsegmented corpus of 20 million words. Initial estimates of the word frequencies were derived from the frequencies in the corpus of the strings of hanzi making up each word in the lexicon, whether or not each string is actually an instance of the word in question.</Paragraph>
      <Paragraph position="1"> (Chang et al., 1995) proposed an automatic dictionary construction method for Chinese from a large unsegmented corpus (311591 sentences) with the help of a small segmented seed corpus (1000 sentences). They combined Viterbi reestimation using the word unigram model with a post filter called the &amp;quot;Two-Class Classifier&amp;quot;, which is a linear discrimination function that decides whether a string is actually a word or not based on features derived from the character N-grams in a large unsegmented corpus. The system's performance was compared with a word list derived from two on-line Chinese dictionaries (21141 words). The reported recall and precision values were 56.88% and 77.37% for two-character words, and 6.12% and 85.97% for three-character words, respectively.</Paragraph>
      <Paragraph position="2"> For Japanese, (Nagao and Mori, 1994) proposed a method of computing arbitrary-length character N-grams, and showed that the character N-gram statistics obtained from a large corpus include information useful for word extraction. However, they did not report any evaluation of their word extraction method.</Paragraph>
      <Paragraph position="3"> (Teller and Batchelder, 1994) proposed a very naive probabilistic word segmentation method for Japanese, based on character type information and hiragana bigram frequencies. They claimed 98% word segmentation accuracy, while we claim 94.7%. However, their evaluation method is very optimistic, and completely different from ours.</Paragraph>
      <Paragraph position="4"> They count an error only when the system segmentation violates morpheme boundaries. In other words, they count an error only when the system segmentation is not acceptable to human judgement, while we count an error whenever the system segmentation does not exactly match the corpus segmentation, even if the corpus segmentation is inconsistent.</Paragraph>
      <Paragraph position="5"> We used the word bigram model for word segmentation, and expected word frequency for unknown word extraction. We compared the results with a segmented Japanese corpus, and reported 43.7% recall and 52.3% precision for 1000 sentences whose out-of-vocabulary rate is 2.1%. It is impossible to compare our results directly with (Chang et al., 1995), because the experimental conditions are completely different in terms of language (Chinese vs. Japanese), the size of the seed segmented corpus, the size of the target unsegmented corpus and its out-of-vocabulary rate, the size of the initial word list, and the type of reference data (on-line dictionary vs. segmented corpus).</Paragraph>
      <Paragraph position="6"> Our idea of filtering erroneous word hypotheses by expected word frequency is simple and straightforward. The major contribution of this paper is that we present a more accurate method for estimating word frequencies in an unsegmented corpus, even if it includes unknown words. This is achieved by introducing an explicit statistical model of unknown words, and by using an N-best word segmentation algorithm (Nagata, 1994) as an approximation of the generalized Forward-Backward algorithm.</Paragraph>
      <Paragraph position="7"> In English taggers, (Weischedel et al., 1993) proposed a statistical model to estimate the word output probability p(w_i | t_i) for an unknown word from spelling information such as inflectional endings, derivational endings, hyphenation, and capitalization. Our word model can be thought of as a generalization of their statistical model. One potential benefit of our statistical model and segmentation algorithm is that they are completely independent of the target language and its writing system. We intend to test our word segmentation method on other languages, such as Chinese and Thai.</Paragraph>
      <Paragraph position="8"> Conclusion We present a new word extraction method for Japanese based on expected word frequency, which is computed by using a statistical language model and an N-best word segmentation algorithm. Although we have encouraging initial results, there are a number of questions to be answered, for example, the minimum seed segmented corpus size required, the minimum initial word list required, and the effect of reestimation for a large unsegmented corpus with various out-of-vocabulary rates. Besides these questions, we are also thinking of assigning the part of speech to the extracted new words in order to construct a Japanese dictionary automatically.</Paragraph>
    </Section>
  </Section>
</Paper>