<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1036">
  <Title>A Part of Speech Estimation Method for Japanese Unknown Words using a Statistical Model of Morphology and Context</Title>
  <Section position="5" start_page="277" end_page="280" type="metho">
    <SectionTitle>
3 Unknown Word Model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="277" end_page="278" type="sub_section">
      <SectionTitle>
3.1 Baseline Model
</SectionTitle>
      <Paragraph position="0"> The simplest unknown word model depends only on the spelling. We think of an unknown word as a word having a special part of speech &lt;UNK&gt;. Then, the unknown word model is formally defined as the joint probability of the character sequence wi = cl .. * ck if it is an unknown word. Without loss of generality, we decompose it into the product of word length probability and word spelling probability given its</Paragraph>
      <Paragraph position="2"> where k is the length of the character sequence.</Paragraph>
      <Paragraph position="3"> We call P(kI&lt;UNK&gt; ) the word length model, and P(cz... ck Ik, &lt;UNK&gt;) the word spelling model.</Paragraph>
      <Paragraph position="4"> In order to estimate the entropy of English, (Brown et al., 1992) approximated P(kI&lt;UNK&gt; ) by a Poisson distribution whose parameter is the average word length A in the training corpus, and P(cz... cklk, &lt;UNK&gt;) by the product of character zerogram probabilities. This means all characters in the character set are considered to be selected independently and uniformly.</Paragraph>
      <Paragraph position="6"> where p is the inverse of the number of characters in the character set. If we assume JIS-X-0208 is used as the Japanese character set, p = 1/6879.</Paragraph>
      <Paragraph position="7"> Since the Poisson distribution is a single parameter distribution with lower bound, it is appropriate to use it as a first order approximation to the word length distribution. But the Brown model has two problems. It assigns a certain amount of probability mass to zero-length words, and it is too simple to express morphology.</Paragraph>
      <Paragraph position="8"> For Japanese word segmentation and OCR error correction, (Nagata, 1996) proposed a modified version of the Brown model. Nagata also assumed the word length probability obeys the Poisson distribution. But he moved the lower bound from zero to one.</Paragraph>
      <Paragraph position="10"> Instead of zerogram, He approximated the word spelling probability P(Cl...ck\[k, &lt;UNK&gt;) by the product of word-based character bigram probabilities, regardless of word length.</Paragraph>
      <Paragraph position="12"> where &lt;bow&gt; and &lt;eow&gt; are special symbols that indicate the beginning and the end of a word.</Paragraph>
    </Section>
    <Section position="2" start_page="278" end_page="278" type="sub_section">
      <SectionTitle>
3.2 Correction of Word Spelling
Probabilities
</SectionTitle>
      <Paragraph position="0"> We find that Equation (7) assigns too little probabilities to long words (5 or more characters). This is because the lefthand side of Equation (7) represents the probability of the string cl ... Ck in the set of all strings whose length are k, while the righthand side represents the probability of the string in the set of all possible strings (from length zero to infinity).</Paragraph>
      <Paragraph position="1"> Let Pb(cz ...ck\]&lt;UNK&gt;) be the probability of character string Cl...ck estimated from the character bigram model.</Paragraph>
      <Paragraph position="3"> Let Pb (kl &lt;UNK&gt;) be the sum of the probabilities of all strings which are generated by the character bigram model and whose length are k. More appropriate estimate for P(cl... cklk, &lt;UNK&gt;) is,</Paragraph>
      <Paragraph position="5"> But how can we estimate Pb(kI&lt;UNK&gt;)? It is difficult to compute it directly, but we can get a reasonable estimate by considering the unigram case.</Paragraph>
      <Paragraph position="6"> If strings are generated by the character unigram model, the sum of the probabilities of all length k strings equals to the probability of the event that the end of word symbol &lt;eow&gt; is selected after a character other than &lt;eow&gt; is selected k - 1 times.</Paragraph>
      <Paragraph position="8"> Throughout in this paper, we used Equation (9) to compute the word spelling probabilities.</Paragraph>
    </Section>
    <Section position="3" start_page="278" end_page="279" type="sub_section">
      <SectionTitle>
3.3 Japanese Orthography and Word
Length Distribution
</SectionTitle>
      <Paragraph position="0"> In word segmentation, one of the major problems of the word length model of Equation (6) is the decomposition of unknown words. When a substring of an unknown word coincides with other word in the dictionary, it is very likely to be decomposed into the dictionary word and the remaining substring. We find the reason of the decomposition is that the word  words and its estimate by Poisson distribution  and katakana words length model does not reflect the variation of the word length distribution resulting from the Japanese orthography.</Paragraph>
      <Paragraph position="1"> Figure 1 shows the word length distribution of infrequent words in the EDR corpus, and the estimate of word length distribution by Equation (6) whose parameter (A = 4.8) is the average word length of infrequent words. The empirical and the estimated distributions agree fairly well. But the estimates by Poisson are smaller than empirical probabilities for shorter words (&lt;= 4 characters), and larger for longer words (&gt; characters). This is because we rep- null consists of only katakana characters. It shows that the length of kanji words distributes around 3 characters, while that of katakana words distributes around 5 characters. The empirical word length distribution of Figure 1 is, in fact, a weighted sum of these two distributions.</Paragraph>
      <Paragraph position="2"> In the Japanese writing system, there are at least five different types of characters other than punctuation marks: kanji, hiragana, katakana, Roman alphabet, and Arabic numeral. Kanji which means 'Chinese character' is used for both Chinese origin words and Japanese words semantically equivalent to Chinese characters. Hiragana and katakana are syllabaries: The former is used primarily for grammatical function words, such as particles and inflectional endings, while the latter is used primarily to transliterate Western origin words. Roman alphabet is also used for Western origin words and acronyms. Arabic numeral is used for numbers.</Paragraph>
      <Paragraph position="3"> Most Japanese words are written in kanji, while more recent loan words are written in katakana.</Paragraph>
      <Paragraph position="4"> Katakana words are likely to be used for technical terms, especially in relatively new fields like computer science. Kanji words are shorter than katakana words because kanji is based on a large (&gt; 6,000) alphabet of ideograms while katakana is based on a small (&lt; 100) alphabet of phonograms.</Paragraph>
      <Paragraph position="5"> Table 2 shows the distribution of character type sequences that constitute the infrequent words in the EDR corpus. It shows approximately 65% of words are constituted by a single character type.</Paragraph>
      <Paragraph position="6"> Among the words that are constituted by more than two character types, only the kanji-hiragana and hiragana-kanji sequences are morphemes and others are compound words in a strict sense although they part of speech character bigram frequency  are identified as words in the EDR corpus 3 Therefore, we classified Japanese words into 9 word types based on the character types that constitute a word: &lt;sym&gt;, &lt;num&gt;, &lt;alpha&gt;, &lt;hira&gt;, &lt;kata&gt;, and &lt;kan&gt; represent a sequence of symbols, numbers, alphabets, hiraganas, katakanas, and kanjis, respectively. &lt;kan-hira&gt; and &lt;hira-kan&gt; represent a sequence of kanjis followed by hiraganas and that of hiraganas followed by kanjis, respectively. The rest are classified as &lt;misc&gt;.</Paragraph>
      <Paragraph position="7"> The resulting unknown word model is as follows.</Paragraph>
      <Paragraph position="8"> We first select the word type, then we select the length and spelling.</Paragraph>
      <Paragraph position="10"/>
    </Section>
    <Section position="4" start_page="279" end_page="280" type="sub_section">
      <SectionTitle>
3.4 Part of Speech and Word Morphology
</SectionTitle>
      <Paragraph position="0"> It is obvious that the beginnings and endings of words play an important role in tagging part of speech. Table 3 shows examples of common character bigrams for each part of speech in the infrequent words of the EDR corpus. The first example in Table 3 shows that words ending in ' --' are likely to be nouns. This symbol typically appears at the end of transliterated Western origin words written in katakana.</Paragraph>
      <Paragraph position="1"> It is natural to make a model for each part of speech. The resulting unknown word model is as follows.</Paragraph>
      <Paragraph position="3"> By introducing the distinction of word type to the model of Equation (12), we can derive a more sophisticated unknown word model that reflects both word 3 When a Chinese character is used to represent a semantically equivalent Japanese verb, its root is written in the Chinese character and its inflectional suffix is written in hi- ragana. This results in kanji-hiragana sequence. When a Chinese character is too difficult to read, it is transliterated in hiragana. This results in either hiragana-kanji or kanji-hiragana sequence.</Paragraph>
      <Paragraph position="4">  type and part of speech information. This is the unknown word model we propose in this paper. It first selects the word type given the part of speech, then the word length and spelling.</Paragraph>
      <Paragraph position="6"> The first factor in the righthand side of Equation (13) is estimated from the relative frequency of the corresponding events in the training corpus.</Paragraph>
      <Paragraph position="8"> Here, C(.) represents the counts in the corpus. To estimate the probabilities of the combinations of word type and part of speech that did not appeared in the training corpus, we used the Witten-Bell method (Witten and Bell, 1991) to obtain an estimate for the sum of the probabilities of unobserved events. We then redistributed this evenly among all unobserved events a The second factor of Equation (13) is estimated from the Poisson distribution whose parameter '~&lt;WT&gt;,&lt;U-t&gt; is the average length of words whose word type is &lt;WT&gt; and part of speech is &lt;U-t&gt;.</Paragraph>
      <Paragraph position="10"> If the combinations of word type and part of speech that did not appeared in the training corpus, we used the average word length of all words.</Paragraph>
      <Paragraph position="11"> To compute the third factor of Equation (13), we have to estimate the character bigram probabilities that are classified by word type and part of speech.</Paragraph>
      <Paragraph position="12"> Basically, they are estimated from the relative frequency of the character bigrams for each word type and part of speech.</Paragraph>
      <Paragraph position="14"> However, if we divide the corpus by the combination of word type and part of speech, the amount of each training data becomes very small. Therefore, we linearly interpolated the following five probabilities (Jelinek and Mercer, 1980).</Paragraph>
      <Paragraph position="16"/>
    </Section>
  </Section>
  <Section position="6" start_page="280" end_page="280" type="metho">
    <SectionTitle>
4 The Witten-Bell method
</SectionTitle>
    <Paragraph position="0"> serving novel events to be r/(n+r), where n is the total number of events seen previously, and r is the number of symbols that are distinct. The probability of the event observed c times is c/(n + r).</Paragraph>
    <Paragraph position="2"> cies of the character unigram and bigram for each word type and part of speech, f(ci) and f(cilci_l) are the relative frequencies of the character unigram and bigram. V is the number of characters (not tokens but types) appeared in the corpus.</Paragraph>
  </Section>
class="xml-element"></Paper>