<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1038">
  <Title>Self-Organizing Markov Models and Their Application to Part-of-Speech Tagging</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Underlying Model
</SectionTitle>
    <Paragraph position="0"> The tagging model is probabilistically defined as finding the most probable tag sequence when a word sequence is given (equation (1)).</Paragraph>
    <Paragraph position="2"> By applying Bayes' formula and eliminating a redundant term that does not affect the argument maximization, we obtain equation (2), which is a combination of two separate models: the tag language model, P(t_{1,k}), and the tag-to-word translation model, P(w_{1,k}|t_{1,k}). Because the number of word sequences w_{1,k} and tag sequences t_{1,k} is infinite, the model of equation (2) is not computationally tractable. Introducing the Markov assumption reduces the complexity of the tag language model, and the independence assumption between words makes the tag-to-word translation model simple; the result is equation (3), the well-known Hidden Markov Model.</Paragraph>
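The equations referenced above appear to have been lost in extraction; a reconstruction consistent with the surrounding prose (the exact notation and numbering are assumed) is:

```latex
% (1) tagging as finding the most probable tag sequence
T(w_{1,k}) = \operatorname*{argmax}_{t_{1,k}} P(t_{1,k} \mid w_{1,k})

% (2) Bayes' formula; P(w_{1,k}) does not affect the argmax
           = \operatorname*{argmax}_{t_{1,k}} P(t_{1,k})\, P(w_{1,k} \mid t_{1,k})

% (3) Markov assumption on tags plus independence between words
     \approx \operatorname*{argmax}_{t_{1,k}} \prod_{i=1}^{k} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)
```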
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Effect of Context Classification
</SectionTitle>
    <Paragraph position="0"> Let's focus on the Markov assumption which is made to reduce the complexity of the original tagging problem and to make the tagging problem tractable. We can imagine the following process through which the Markov assumption can be introduced in terms of context classification:</Paragraph>
    <Paragraph position="2"> In equation (5), a classification function Φ(t_{1,i-1}) is introduced, which maps the infinite set of contextual patterns into a finite set of equivalence classes. By defining the function as follows, we can get the equation below.</Paragraph>
    <Paragraph position="4"> Equation (7) classifies all the contextual patterns ending in same tags into the same classes, and is equivalent to the Markov assumption.</Paragraph>
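The equations for this step were also lost in extraction; a plausible reconstruction from the prose (numbering assumed) is:

```latex
% (5): the tag language model rewritten with a
% context-classification function \Phi
P(t_{1,k}) \;=\; \prod_{i=1}^{k} P\bigl(t_i \mid \Phi(t_{1,i-1})\bigr)

% (7): defining \Phi to keep only the immediately preceding tag
% classifies all patterns ending in the same tag together,
% which is exactly the first-order Markov assumption
\Phi(t_{1,i-1}) \;=\; t_{i-1}
\qquad\Longrightarrow\qquad
P(t_{1,k}) \;\approx\; \prod_{i=1}^{k} P(t_i \mid t_{i-1})
```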
    <Paragraph position="5"> The assumption or the definition of the above classification function is based on human intuition.</Paragraph>
    <Paragraph position="7"> Although this simple definition mostly works well, it is not based on any intensive analysis of real data, so there is room for improvement. Figures 1 and 2 illustrate the effect of context classification on the compiled distributions of syntactic classes, which we believe provide the clue to the improvement.</Paragraph>
    <Paragraph position="8"> Among the four distributions shown in Figure 1, the top one illustrates the distribution of syntactic classes in the Brown corpus that appear after all conjunctions. In this case, we are considering the first-order context (the immediately preceding word in terms of part-of-speech). The following three illustrate the distributions collected after taking the second-order context into consideration. In these cases, we have extended the context to second order, or equivalently, we have classified the first-order context classes again into second-order context classes. The figure shows that distributions like P(·|vb,conj) and P(·|vbp,conj) are very different from the first-order one, while distributions like P(·|fw,conj) are not.</Paragraph>
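The first- and second-order context distributions discussed above can be compiled with a few lines of code. This is a minimal sketch on an invented toy corpus (not the Brown corpus); the tag names mimic the ones in the figure.

```python
from collections import Counter, defaultdict

# Toy tagged sentences (hypothetical data, not the Brown corpus)
sentences = [
    [("run", "vb"), ("and", "conj"), ("jump", "vb")],
    [("runs", "vbz"), ("and", "conj"), ("jumps", "vbz")],
    [("dogs", "nns"), ("and", "conj"), ("cats", "nns")],
    [("eat", "vb"), ("and", "conj"), ("sleep", "vb")],
]

first_order = defaultdict(Counter)   # counts for P(. | t_{i-1})
second_order = defaultdict(Counter)  # counts for P(. | t_{i-2}, t_{i-1})

for sent in sentences:
    tags = ["$"] + [t for _, t in sent]   # "$" marks sentence start
    for i in range(1, len(tags)):
        first_order[tags[i - 1]][tags[i]] += 1
        if i >= 2:
            second_order[(tags[i - 2], tags[i - 1])][tags[i]] += 1

def normalize(counter):
    total = sum(counter.values())
    return {t: c / total for t, c in counter.items()}

# Distribution after any conjunction vs. after the pattern (vb, conj):
print(normalize(first_order["conj"]))
print(normalize(second_order[("vb", "conj")]))
```

Even on this toy data, the second-order distribution P(·|vb,conj) is sharper than the first-order P(·|conj), mirroring the effect the figure illustrates.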
    <Paragraph position="9"> Figure 2 shows another way of context extension, so-called lexicalization. Here, the initial first-order context class (the top one) is classified again by referring to the lexical information (the following three). We see that the distribution after the preposition out is quite different from the distributions after other prepositions.</Paragraph>
    <Paragraph position="10"> From the above observations, we can see that applying the Markov assumption may discard much useful contextual information, and conversely, that a better context classification can yield a better context model.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Related Works
</SectionTitle>
    <Paragraph position="0"> One straightforward way of context extension is to extend the context uniformly. Trigram tagging models can be thought of as the result of uniformly extending the context of bigram tagging models. TnT (Brants, 2000), based on a second-order HMM, is an example of this class of models and is regarded as one of the best part-of-speech taggers available.</Paragraph>
    <Paragraph position="1"> The uniform extension can be achieved relatively easily, but due to the exponential growth of the model size, it can only be performed in a restrictive way.</Paragraph>
    <Paragraph position="2"> Another way of context extension is the selective extension of context. In the case of context extension from lower order to higher order, as in the examples of Figure 1, the extension involves taking more information about the same type of contextual feature. We call this kind of extension homogeneous context extension. (Brants, 1998) presents this type of context extension method through model merging and splitting, and prediction suffix tree learning (Schütze and Singer, 1994; Ron et al., 1996) is another well-known method that can perform homogeneous context extension.</Paragraph>
    <Paragraph position="3"> On the other hand, Figure 2 illustrates heterogeneous context extension; this type of extension involves taking more information about other types of contextual features. (Kim et al., 1999) and (Pla and Molina, 2001) present this type of context extension method, so-called selective lexicalization. The selective extension can be a good alternative to the uniform extension, because the growth rate of the model size is much smaller, and thus various contextual features can be exploited.</Paragraph>
    <Paragraph position="5"> In the following sections, we describe a novel method of selective context extension which performs both homogeneous and heterogeneous extension simultaneously.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Self-Organizing Markov Models
</SectionTitle>
    <Paragraph position="0"> Our approach to selective context extension makes use of the statistical decision tree framework. The states of Markov models are represented in statistical decision trees, and by growing the trees the context can be extended (or the states can be split).</Paragraph>
    <Paragraph position="1"> We have named the resulting models Self-Organizing Markov Models to reflect their ability to automatically organize the structure.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Statistical Decision Tree Representation of
Markov Models
</SectionTitle>
      <Paragraph position="0"> The decision tree is a well-known structure that is widely used for classification tasks. When there are several contextual features relating to the classification of a target feature, a decision tree organizes the features as internal nodes in such a manner that more informative features occupy higher levels, so the most informative feature becomes the root node.</Paragraph>
      <Paragraph position="1"> Each path from the root node to a leaf node represents a context class and the classification information for the target feature in the context class will be contained in the leaf node1.</Paragraph>
      <Paragraph position="2"> In the case of part-of-speech tagging, a classification is made at each position (or time) of a word sequence, where the target feature is the syntactic class of the word at the current position and the contextual features may include the syntactic classes or the lexical forms of preceding words. 1While ordinary decision trees store deterministic classification information in their leaves, statistical decision trees store the probabilistic distribution of possible decisions.</Paragraph>
      <Paragraph position="4"/>
      <Paragraph position="6"> Figure 3 shows an example of a Markov model for a simple language having nouns (N), conjunctions (C), prepositions (P) and verbs (V), together with its equivalent decision tree. The dollar sign ($) represents sentence initialization. On the left-hand side is the graph representation of the Markov model, and on the right-hand side is the decision tree representation, where the test for the immediately preceding syntactic class (represented by P-1) is placed at the root, each branch represents a result of the test (labeled on the arc), and the corresponding leaf node contains the probabilistic distribution of the syntactic classes for the current position2.</Paragraph>
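A toy sketch of this decision-tree representation, in the spirit of Figure 3: the root tests the preceding tag (P-1) and each leaf holds a distribution over the current tag. All probabilities below are invented for illustration, not taken from the paper.

```python
# Decision-tree form of a first-order Markov model over tags
# N, C, P, V, with "$" marking sentence start. Internal nodes
# test a contextual feature; leaves hold tag distributions.
tree = {
    "test": "P-1",  # immediately preceding syntactic class
    "branches": {
        "$": {"N": 0.7, "V": 0.3},
        "N": {"V": 0.5, "C": 0.3, "P": 0.2},
        "V": {"N": 0.6, "P": 0.4},
        "C": {"N": 0.5, "V": 0.5},
        "P": {"N": 1.0},
    },
}

def classify(tree, context):
    """Follow tests from the root until a leaf distribution is reached."""
    node = tree
    while isinstance(node, dict) and "test" in node:
        node = node["branches"][context[node["test"]]]
    return node

# The context class "preceding tag is P" yields its leaf distribution:
print(classify(tree, {"P-1": "P"}))
```

Splitting a state then corresponds to replacing a leaf with a new internal node that tests a further feature, which is exactly the context extension described next.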
      <Paragraph position="7"> The example shown in Figure 4 involves a further classification of context. On the left-hand side it is represented in terms of state splitting, while on the right-hand side it is represented in terms of context extension (lexicalization), where a context class representing contextual patterns ending in P (a preposition) is extended by referring to the lexical form and is classified again into the preposition out versus other prepositions. 2The distribution doesn't appear in the figure explicitly. Just imagine each leaf node has the distribution for the target feature in the corresponding context.</Paragraph>
      <Paragraph position="8"> Figure 5 shows yet another further classification of context. It involves a homogeneous extension of context, while the previous one involves a heterogeneous extension. Unlike prediction suffix trees, which grow along an implicitly fixed order, decision trees don't presume any implicit order among contextual features and thus can naturally accommodate various features having no underlying order.</Paragraph>
      <Paragraph position="9"> In order for a statistical decision tree to be a Markov model, it must meet the following restrictions: There must exist at least one contextual feature that is homogeneous with the target feature.</Paragraph>
      <Paragraph position="10"> When the target feature at a certain time is classified, all the required contextual features must be visible. The first restriction states that in order to be a Markov model, there must be inter-relations between the target features at different times. The second restriction explicitly states that in order for the decision tree to be able to classify contextual patterns, all the contextual features must be visible, and implicitly states that homogeneous contextual features that appear later than the current target feature cannot be contextual features. Due to the second restriction, the Viterbi algorithm can be used with self-organizing Markov models to find an optimal sequence of tags for a given word sequence.</Paragraph>
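Because every contextual feature is visible at decoding time, standard Viterbi dynamic programming applies unchanged. A minimal first-order sketch (the two-tag model and all probabilities are invented for illustration; with a self-organizing model, the transition lookup would instead walk the decision tree):

```python
import math

tags = ["N", "V"]
# Invented transition and emission probabilities; "$" is sentence start.
trans = {("$", "N"): 0.8, ("$", "V"): 0.2,
         ("N", "N"): 0.3, ("N", "V"): 0.7,
         ("V", "N"): 0.6, ("V", "V"): 0.4}
emit = {("N", "flies"): 0.4, ("V", "flies"): 0.6,
        ("N", "time"): 0.9, ("V", "time"): 0.1}

def viterbi(words):
    delta = {"$": 0.0}          # best log-probability ending in each tag
    back = []                   # backpointers, one dict per position
    for w in words:
        new_delta, pointers = {}, {}
        for t in tags:
            scores = {p: s + math.log(trans[(p, t)]) + math.log(emit[(t, w)])
                      for p, s in delta.items()}
            best_prev = max(scores, key=scores.get)
            new_delta[t] = scores[best_prev]
            pointers[t] = best_prev
        delta = new_delta
        back.append(pointers)
    # Trace the best path backwards (back[0] points to "$", so skip it).
    t = max(delta, key=delta.get)
    path = [t]
    for pointers in reversed(back[1:]):
        t = pointers[t]
        path.append(t)
    return list(reversed(path))

print(viterbi(["time", "flies"]))
```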
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Learning Self-Organizing Markov Models
</SectionTitle>
      <Paragraph position="0"> Self-organizing Markov models can be induced from manually annotated corpora through the SDTL algorithm (algorithm 1) we have designed, a variation of the ID3 algorithm (Quinlan, 1986). SDTL is a greedy algorithm: at each node-making phase the most informative feature is selected (line 2), and it is recursive in the sense that the algorithm is called again to make child nodes (line 3). Though theoretically any statistical decision tree growing algorithm can be used to train self-organizing Markov models, there are practical problems when such algorithms are applied to language learning. One of the main obstacles is the fact that features used for language learning often have huge value sets, which cause intensive fragmentation of the training corpus during the growing process and eventually raise the sparse data problem.</Paragraph>
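The greedy, recursive shape of the algorithm can be sketched as follows. This is not the paper's SDTL: information gain stands in for the López distance, the value-selection step of line 1 is omitted, and the data is an invented toy sample.

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def grow(samples, features):
    """Greedy recursive tree growth (ID3-style sketch)."""
    labels = [lab for _, lab in samples]
    if not features or len(set(labels)) == 1:
        return Counter(labels)          # leaf: distribution over the target

    def gain(f):                        # stand-in for the Lopez distance
        split = {}
        for ctx, lab in samples:
            split.setdefault(ctx[f], []).append(lab)
        return entropy(labels) - sum(
            len(ls) / len(labels) * entropy(ls) for ls in split.values())

    best = max(features, key=gain)      # line 2: most informative feature
    branches = {}
    for ctx, lab in samples:
        branches.setdefault(ctx[best], []).append((ctx, lab))
    rest = [f for f in features if f != best]
    return {"test": best,               # line 3: recurse to make children
            "branches": {v: grow(s, rest) for v, s in branches.items()}}

# Each sample: (context features, current tag)
data = [({"P-1": "P", "W-1": "out"}, "P"),
        ({"P-1": "P", "W-1": "in"}, "N"),
        ({"P-1": "N", "W-1": "dog"}, "V"),
        ({"P-1": "N", "W-1": "cat"}, "V")]
tree = grow(data, ["P-1", "W-1"])
print(tree["test"])
```

Note how the huge-valued lexical feature W-1 fragments this tiny sample into singleton leaves, which is exactly the sparse-data problem the value-selection mechanism below is designed to curb.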
      <Paragraph position="1"> To deal with this problem, the algorithm incorporates a value selection mechanism (line 1) where only meaningful values are selected into a reduced value set. The meaningful values are statistically defined as follows: if the distribution of the target feature varies significantly by referring to the value v, then v is accepted as a meaningful value. We adopted the χ2-test to determine the difference between the distributions of the target feature before and after referring to the value v. The use of the χ2-test enables us to make a principled decision about the threshold based on a certain confidence level3.</Paragraph>
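A sketch of this value-selection idea: a value v is kept only if the target distribution observed after v deviates significantly, by a Pearson χ2 statistic, from the overall distribution. The threshold 3.84 (the 95% critical value for one degree of freedom) and the sample data are illustrative choices, not the paper's settings.

```python
from collections import Counter

def chi_square(observed, expected):
    """Pearson chi-square statistic: counts in `observed`,
    reference probabilities in `expected`."""
    n = sum(observed.values())
    return sum((observed.get(t, 0) - n * p) ** 2 / (n * p)
               for t, p in expected.items() if p > 0)

def meaningful_values(samples, feature, threshold=3.84):
    """Sketch of SDTL line 1: keep a value v only if the target
    distribution after seeing v differs significantly from the
    overall target distribution."""
    overall = Counter(lab for _, lab in samples)
    total = sum(overall.values())
    expected = {t: c / total for t, c in overall.items()}
    selected = []
    for v in {ctx[feature] for ctx, _ in samples}:
        observed = Counter(lab for ctx, lab in samples if ctx[feature] == v)
        if chi_square(observed, expected) > threshold:
            selected.append(v)
    return selected

# Invented data: "out" and "in" skew the tag distribution, "and" doesn't.
samples = ([({"W-1": "out"}, "P")] * 10 +
           [({"W-1": "in"}, "N")] * 10 +
           [({"W-1": "and"}, "P")] * 5 +
           [({"W-1": "and"}, "N")] * 5)
print(meaningful_values(samples, "W-1"))
```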
      <Paragraph position="2"> To evaluate the contribution of contextual features to the target classification (line 2), we adopted the López distance (López, 1991). While other measures, including Information Gain or Gain Ratio (Quinlan, 1986), can also be used for this purpose, the López distance has been reported to yield slightly better results (López, 1998).</Paragraph>
      <Paragraph position="3"> The probabilistic distribution of the target feature estimated at each node-making phase (line 4) is smoothed by using Jelinek and Mercer's interpolation method (Jelinek and Mercer, 1980) along the ancestor nodes. The interpolation parameters are estimated by the deleted interpolation algorithm introduced in (Brants, 2000).</Paragraph>
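The smoothing step can be sketched as mixing a leaf's raw distribution with those of its ancestors, so sparse leaves back off to broader context classes. The lambda weights and distributions below are fixed by hand for illustration; the paper estimates the weights by deleted interpolation.

```python
def interpolate(chain, lambdas):
    """Jelinek-Mercer mixture along an ancestor chain.
    chain: distributions from leaf up to root; lambdas sum to 1."""
    tags = set().union(*chain)
    return {t: sum(l * d.get(t, 0.0) for l, d in zip(lambdas, chain))
            for t in tags}

leaf = {"N": 1.0}                      # sparse leaf: only N observed
parent = {"N": 0.6, "V": 0.4}          # broader context class
root = {"N": 0.5, "V": 0.3, "P": 0.2}  # unconditional distribution

smoothed = interpolate([leaf, parent, root], [0.6, 0.3, 0.1])
print(smoothed)
```

Tags unseen at the leaf (here V and P) receive non-zero smoothed probability from the ancestors, which is the point of the interpolation.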
    </Section>
  </Section>
</Paper>