<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1061">
  <Title>Satoshi Sekine ++</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
* Accuracy for unknown words
</SectionTitle>
    <Paragraph position="0"> The morpheme model that will be described in Section 3.1 can detect word segments and their POS categories even for unknown words.</Paragraph>
    <Paragraph position="1"> However, the accuracy for unknown words is lower than that for known words. One of the solutions is to use dictionaries developed for a corpus on another domain to reduce the number of unknown words, but the improvement achieved is slight (Uchimoto et al., 2002). We believe that the reason for this is that definitions of a word segment and its POS category depend on a particular corpus, and the definitions from corpus to corpus differ word by word. Therefore, we need to put only words extracted from the same corpus into a dictionary. We are manually examining words that are detected by the morpheme model but that are not found in a dictionary. We are also manually examining those words that the morpheme model estimated as having low probability. During the process of manual examination, if we find words that are not found in a dictionary, those words are then put into a dictionary. Section 4.2.1 will describe the accuracy of detecting unknown words and show how much those words contribute to improving the morphological analysis accuracy when they are detected and put into a dictionary.</Paragraph>
    <Paragraph position="2"> * Insufficiency of features The model currently used for morphological analysis considers the information of a target morpheme and that of an adjacent morpheme on the left. To improve the model, we need to consider the information of two or more morphemes on the left of the target morpheme.</Paragraph>
    <Paragraph position="3"> However, too much information often leads to overtraining the model. Using all the information makes training the model difficult when there is too much of it. Therefore, the best way to improve the accuracy of the morphological information in the CSJ within the limited time available to us is to examine and revise the errors of automatic morphological analysis and to improve the model. We assume that the smaller the probability estimated by a model for an output morpheme is, then the greater the likelihood is that the output morpheme is wrong. Therefore, we examine output morphemes in ascending order of their probabilities. The expected improvement of the accuracy of the morphological information in the whole of the CSJ will be described in Section 4.2.1 Another problem concerning unknown words is that the cost of manual examination is high when there are several definitions for word segments and their POS categories. Since there are two types of word definitions in the CSJ, the cost would double. Therefore, to reduce the cost, we propose another method for detecting word segments and their POS categories. The method will be described in Section 3.2, and the advantages of the method will be described in Section 4.2.2 The next problem described here is one that we have to solve to make a language model for automatic speech recognition.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
* Pronunciation
</SectionTitle>
    <Paragraph position="0"> Pronunciation of each word is indispensable for making a language model for automatic speech recognition. In the CSJ, pronunciation is transcribed separately from the basic form written by using kanji and hiragana characters as shown in Fig. 1. Text targeted for morpho- null logical analysis is the basic form of the CSJ and it does not have information on actual pronunciation. The result of morphological analysis, therefore, is a row of morphemes that do not have information on actual pronunciation. To estimate actual pronunciation by using only the basic form and a dictionary is impossible. Therefore, actual pronunciation is assigned to results of morphological analysis by aligning the basic form and pronunciation in the CSJ. First, the results of morphological analysis, namely, the morphemes, are transliterated into katakana characters by using a dictionary, and then they are aligned with pronunciation in the CSJ by using a dynamic programming method.</Paragraph>
    <Paragraph position="1"> In this paper, we will mainly discuss methods for detecting word segments and their POS categories in the whole of the CSJ.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Models and Algorithms
</SectionTitle>
    <Paragraph position="0"> This section describes two methods for detecting word segments and their POS categories. The first method uses morpheme models and is used to detect any type of word segment. The second method uses a chunking model and is only used to detect long word segments.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Morpheme Model
</SectionTitle>
      <Paragraph position="0"> Given a tokenized test corpus, namely a set of strings, the problem of Japanese morphological analysis can be reduced to the problem of assigning one of two tags to each string in a sentence. A string is tagged with a 1 or a 0 to indicate whether it is a morpheme. When a string is a morpheme, a grammatical attribute is assigned to it. A tag designated asa1isthus assigned one of a number, n,of grammatical attributes assigned to morphemes, and the problem becomes to assign an attribute (from 0 to n) to every string in a given sentence.</Paragraph>
      <Paragraph position="1"> We define a model that estimates the likelihood that a given string is a morpheme and has a grammatical attribute i(1 [?] i [?] n) as a morpheme model. We implemented this model within an ME modeling framework (Jaynes, 1957; Jaynes, 1979; Berger et al., 1996). The model is represented by Eq. (1):</Paragraph>
      <Paragraph position="3"> where a is one of the categories for classification, and it can be one of (n+1) tags from 0 to n (This is  called a &amp;quot;future.&amp;quot;), b is the contextual or conditioning information that enables us to make a decision among the space of futures (This is called a &amp;quot;history.&amp;quot;), and Z l (b) is a normalizing constant determined by the requirement that  is dependent on a set of &amp;quot;features&amp;quot; which are binary functions of the history and future. For instance, one of our features is</Paragraph>
      <Paragraph position="5"> in our experiments are described in detail in Section 4.1.1.</Paragraph>
      <Paragraph position="6"> Given a sentence, probabilities of n tags from 1 to n are estimated for each length of string in that sentence by using the morpheme model. From all possible division of morphemes in the sentence, an optimal one is found by using the Viterbi algorithm. Each division is represented as a particular division of morphemes with grammatical attributes in a sentence, and the optimal division is defined as a division that maximizes the product of the probabilities estimated for each morpheme in the division. For example, the sentence &amp;quot;6</Paragraph>
      <Paragraph position="8"> Mh`b&amp;quot; in basic form as shown in Fig. 1 is analyzed as shown in Fig. 2. &amp;quot;6 r s&amp;quot; is analyzed as three morphemes, &amp;quot;6(noun)&amp;quot;, &amp;quot;  (suffix)&amp;quot;, and &amp;quot;r s(noun)&amp;quot;, for short words, and as one morpheme, &amp;quot;6 r s(noun)&amp;quot; for long words. In conventional models (e.g., (Mori and Nagao, 1996; Nagata, 1999)), probabilities were estimated for candidate morphemes that were found in a dictionary or a corpus and for the remaining strings obtained by eliminating the candidate morphemes from a given sentence. Therefore, unknown words were apt to be either concatenated as one word or divided into both a combination of known words and a single word that consisted of more than one character. However, this model has the potential to correctly detect any length of unknown words.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Chunking Model
</SectionTitle>
      <Paragraph position="0"> The model described in this section can be applied when several types of words are defined in a corpus and one type of words consists of compounds of other types of words. In the CSJ, every long word consists of one or more short words.</Paragraph>
      <Paragraph position="1"> Our method uses two models, a morpheme model for short words and a chunking model for long words. After detecting short word segments and their POS categories by using the former model, long word segments and their POS categories are detected by using the latter model. We define four labels, as explained below, and extract long word segments by estimating the appropriate labels for each short word according to an ME model. The four labels are listed below: Ba: Beginning of a long word, and the POS category of the long word agrees with the short word.</Paragraph>
      <Paragraph position="2"> Ia: Middle or end of a long word, and the POS category of the long word agrees with the short word.</Paragraph>
      <Paragraph position="3"> B: Beginning of a long word, and the POS category of the long word does not agree with the short word.</Paragraph>
      <Paragraph position="4"> I: Middle or end of a long word, and the POS category of the long word does not agree with the short word.</Paragraph>
      <Paragraph position="5"> A label assigned to the leftmost constituent of a long word is &amp;quot;Ba&amp;quot; or &amp;quot;B&amp;quot;. Labels assigned to other constituents of a long word are &amp;quot;Ia&amp;quot;, or &amp;quot;I&amp;quot;. For example, the short words shown in Fig. 2 are labeled as shown in Fig. 3. The labeling is done deterministically from the beginning of a given sentence to its end. The label that has the highest probability as estimated by an ME model is assigned to each short word. The model is represented by Eq. (1). In Eq.</Paragraph>
      <Paragraph position="6"> (1), a can be one of four labels. The features used in our experiments are described in Section 4.1.2.</Paragraph>
      <Paragraph position="7">  When a long word that does not include a short word that has been assigned the label &amp;quot;Ba&amp;quot; or &amp;quot;Ia&amp;quot;, this indicates that the word's POS category differs from all of the short words that constitute the long word. Such a word must be estimated individually.</Paragraph>
      <Paragraph position="8"> In this case, we estimate the POS category by using transformation rules. The transformation rules are automatically acquired from the training corpus by extracting long words with constituents, namely short words, that are labeled only &amp;quot;B&amp;quot; or &amp;quot;I&amp;quot;. A rule is constructed by using the extracted long word and the adjacent short words on its left and right. For example, the rule shown in Fig. 4 was acquired in our experiments. The middle division of the consequent part represents a long word &amp;quot;o&amp;quot; (auxiliary verb), and it consists of two short words &amp;quot;o&amp;quot; (post-positional particle) and &amp;quot;&amp;quot; (verb). If several different rules have the same antecedent part, only the rule with the highest frequency is chosen. If no rules can be applied to a long word segment, rules are generalized in the following steps.</Paragraph>
      <Paragraph position="9">  1. Delete posterior context 2. Delete anterior and posterior contexts 3. Delete anterior and posterior contexts and lexical entries.</Paragraph>
      <Paragraph position="10">  If no rules can be applied to a long word segment in any step, the POS category noun is assigned to the long word.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>