<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1036">
  <Title>A Part of Speech Estimation Method for Japanese Unknown Words using a Statistical Model of Morphology and Context</Title>
  <Section position="4" start_page="0" end_page="277" type="intro">
    <SectionTitle>
2 Word Segmentation Model
2.1 Baseline Language Model and Search
Algorithm
</SectionTitle>
    <Paragraph position="0"> Let the input Japanese character sequence be C = Cl...Cm, and segment it into word sequence W = wl ... wn 1 . The word segmentation task can be defined as finding the word segmentation 12d that maximize the joint probability of word sequence given character sequence P(WIC ). Since the maximization is carried out with fixed character sequence C, the word segmenter only has to maximize the joint probability of word sequence P(W).</Paragraph>
    <Paragraph position="1"> = arg mwax P(WIC) = arg mwax P(W) (1) We call P(W) the segmentation model. We can use any type of word-based language model for P(W), such as word ngram and class-based ngram.</Paragraph>
    <Paragraph position="2"> We used the word bigram model in this paper. So, P(W) is approximated by the product of word bi-gram probabilities P(wi\[wi- 1).</Paragraph>
    <Paragraph position="4"> Here, the special symbols &lt;bos&gt; and &lt;eos&gt; indicate the beginning and the end of a sentence, respectively. null Basically, word bigram probabilities of the word segmentation model is estimated by computing the 1 In this paper, we define a word as a combination of its surface form and part of speech. Two words are considered to be equal only if they have the same surface form and part of speech.</Paragraph>
    <Paragraph position="5">  relative frequencies of the corresponding events in the word segmented training corpus, with appropriate smoothing techniques. The maximization search can be efficiently implemented by using the Viterbi-like dynamic programming procedure described in (Nagata, 1994).</Paragraph>
    <Section position="1" start_page="277" end_page="277" type="sub_section">
      <SectionTitle>
2.2 Modification to Handle Unknown
Words
</SectionTitle>
      <Paragraph position="0"> To handle unknown words, we made a slight modification in the above word segmentation model. We have introduced unknown word tags &lt;U-t&gt; for each part of speech t. For example, &lt;U-noun&gt; and &lt;Uverb&gt; represents an unknown noun and an unknown verb, respectively.</Paragraph>
      <Paragraph position="1"> If wl is an unknown word whose part of speech is t, the word bigram probability P(wi\[wl-a) is approximated as the product of word bigram probability P(&lt;U-t&gt;\[wi_l) and the probability of wi given it is an unknown word whose part of speech is t,</Paragraph>
      <Paragraph position="3"> Here, we made an assumption that the spelling of an unknown word solely depends on its part of speech and is independent of the previous word.</Paragraph>
      <Paragraph position="4"> This is the same assumption made in the hidden Markov model, which is called output independence.</Paragraph>
      <Paragraph position="5"> The probabilities P(&lt;U-t&gt;lwi_l ) can be estimated from the relative frequencies in the training corpus whose infrequent words are replaced with their corresponding unknown word tags based on their part of speeches 2 Table 1 shows examples of word bigrams including unknown word tags. Here, a word is represented by a list of surface form, pronunciation, and part of speech, which are delimited by a slash '/'. The first 2 Throughout in this paper, we use the term &amp;quot;infrequent words&amp;quot; to represent words that appeared only once in the corpus. They are also called &amp;quot;hapax legomena&amp;quot; or &amp;quot;hapax words&amp;quot;. It is well known that the characteristics of hapax legomena are similar to those of unknown words (Baayen and Sproat, 1996). example &amp;quot;C/)/no/particle &lt;U-noun&gt;&amp;quot; will appear in the most frequent form of Japanese noun phrases &amp;quot;A (c) B&amp;quot;, which corresponds to &amp;quot;B of A&amp;quot; in English. As Table 1 shows, word bigrams whose infrequent words are replaced with their corresponding part of speech-based unknown word tags are very important information source of the contexts where unknown words appears.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>