<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1023">
  <Title>Decision Tree Models Applied to the Labeling of Text with Parts-of-Speech</Title>
  <Section position="3" start_page="0" end_page="117" type="metho">
    <SectionTitle>
2. Decision Trees
</SectionTitle>
    <Paragraph position="0"> The problem at hand is to predict a tag for a given word in a sentence, taking into consideration the tags assigned to previous words, as well as the remaining words in the sentence. Thus, if we wish to predict tag S,~ for word w~ in a sentence S -- wl, w2, * * *, wN, then we must form an estimate of the probability The usual hidden Markov model, trained as described the last section of this paper, incorrectly labeled the verb announced as having the active rather than the passive aspect. If, however, a decision procedure is used to resolve the ambiguity, the context may be queried to determine the nature of the verb as well its agent. We can easily imagine, for example, that if the battery of available questions is rich enough to include such queries as &amp;quot;Is the previous noun inanimate?&amp;quot; and &amp;quot;Does the preposition by appear within three words of the word being tagged?&amp;quot; then such ambiguities may be probabilistically resolved. Thus it is evident that the success of the decision approach will rely in the questions as well as the manner in which they are asked.</Paragraph>
    <Paragraph position="1"> In the experiments described in this paper, we con-P(S,~ \[ $1, $2,...S,~-1 and wl, w2 . .., war).</Paragraph>
    <Paragraph position="2"> We will refer to a sequence ($1,..., t,~-l; wl,..., wN) as a history. A generic history is denoted as H, or as H = (HT, Hw), when we wish to separate it into its tag and word components. The set of histories is denoted by 7-/, and a pair (t, H) is called an event.</Paragraph>
    <Paragraph position="3"> A tag is chosen from a fixed tag vocabulary VT, and words are chosen from a word vocabulary Vw. Given a training corpus E of events, the decision tree method proceeds by placing the observed histories into equivalence classes by asking binary questions about them.</Paragraph>
    <Paragraph position="4"> Thus, a tree is grown with each node labeled by a question q : 7-/ --~ {True, False}. The entropy of tags at a  leaf L of the tree q- is given by</Paragraph>
    <Paragraph position="6"> and the average entropy of tags in the tree is given by</Paragraph>
    <Paragraph position="8"> The method of growing trees that we have employed adopts a greedy algorithm, described in \[1\], to minimize the average entropy of tags.</Paragraph>
    <Paragraph position="9"> Specifically, the tree is grown in the following manner. Each node n is associated with a subset E, C E of training events. For a given node n, we compute for each question q, the conditional entropy of tags at n, given</Paragraph>
    <Paragraph position="11"> The node n is then assigned the question q with the lowest conditional entropy. The reduction in entropy at node n resulting in asking question q is H'(T In) - B(TIn, q).</Paragraph>
    <Paragraph position="12"> If this reduction is significant, as determined by evaluating the question on held-out data, then two descendent nodes of n are created, corresponding to the equivalence classes of events</Paragraph>
    <Paragraph position="14"> The algorithm continues to split nodes by choosing the questions which maximize the reduction in entropy, until either no further splits are possible, or until a maximum number of leaves is obtained.</Paragraph>
  </Section>
  <Section position="4" start_page="117" end_page="118" type="metho">
    <SectionTitle>
3. Maximum Entropy Models
</SectionTitle>
    <Paragraph position="0"> The above algorithm for growing trees has as its objective function the entropy of the joint distribution of tags and histories. More generally, if we suppose that tags and histories arise according to some distribution ~(4, HT, Hw) in textual data, the coding theory point-of-view encourages us to try to construct a model for generating tags and histories according to a distribution</Paragraph>
    <Paragraph position="2"> Typically, one may be able to obtain estimates for certain marginals ofp. In the case of tagging, we have estimates of the marginals q(t, HT) = ~H- p(4, HT, Hw) from the EM algorithm applied to label~'ed or partially labeled text. The marginals r(4, Hw) = ~ HTp(4, HT, Hw) might be estimated using decision trees applied to labelled text. To minimize D(p II q) subject to knowing these marginals, introducing Lagrange multipliers a and fl leads us to minimize the function</Paragraph>
    <Paragraph position="4"> Differentiating with respect to p and solving this equation, we find that the maximum en4ropy solution p takes the form p(4, HT, Hw) = 7f(4, HT)g(4, Hw)p(4, HT, Hw) for some normalizing constant 7. In particular, in the case where we know no better than to take ~ equal to the uniform distribution, we obtain the solution p(4, HT, HW) = q(t, HT) r(t, Hw) q(4) where the marginal q(4) is assumed to satisfy</Paragraph>
    <Paragraph position="6"> Note that the usual HMM tagging model is given by</Paragraph>
    <Paragraph position="8"> which has the form of a maximum entropy model, even though the marginals P(wn, in) and P(4., 4.-2, 4,~-1) are modelled as bigram and trigram statistics, estimated according to the maximum likelihood criterion using the EM algorithm.</Paragraph>
    <Paragraph position="9"> In principle, growing a decision tree to estimate the full density p(4n, HT, Hw) will provide a model with smaller Kullback information. In practice, however, the quantity of training data is severely limited, and the statistics at the leaves will be unreliable. In the model described above we assume that we are able to construct more reliable estimates of the marginals separating the word  and tag components of the history, and we then combine these marginals according to the maximum entropy criterion. In the experiments that we performed, such models performed slightly better than those for which the full distribution p(tn, HT, Hw) was modeled with a tree.</Paragraph>
  </Section>
  <Section position="5" start_page="118" end_page="118" type="metho">
    <SectionTitle>
4. Constructing Questions
</SectionTitle>
    <Paragraph position="0"> The method of mutual information clustering, described in \[2\], can be used to obtain a set of binary features to assign to words, which may in turn be employed as binary questions in growing decision trees. Mutual information clustering proceeds by beginning with a vocabulary V, and initially assigning each word to a distinct class. At each step, the average mutual information between adjacent classes in training text is computed using a bigram model, and two classes are chosen to be merged based upon the criterion of minimizing the loss in average mutual information that the merge affects. If this process is continued until only one class remains, the result is a binary tree, the leaves of which are labeled by the words in the original vocabulary. By labeling each branch by 0 or 1, we obtain a bit string assigned to each word.</Paragraph>
    <Paragraph position="1"> Like all methods in statistical language modeling, this approach is limited by the problems of statistical significance imposed by the lack of sufficient training data.</Paragraph>
    <Paragraph position="2"> However, the method provides a powerful way of automatically extracting both semantic and syntactic features of large vocabularies. We refer to \[2\] for examples of the features which this procedure yields.</Paragraph>
  </Section>
  <Section position="6" start_page="118" end_page="118" type="metho">
    <SectionTitle>
5. Smoothing the Leaf Distributions
</SectionTitle>
    <Paragraph position="0"> After growing a decision tree according to the procedures outlined above, we obtain an equivalence class of histories together with an empirical distribution of tags at each leaf. Because the training data, which is in any case limited, is split exponentially in the process of growing the tree, many of the leaves are invariably associated with a small number of events. Consequently, the empirical distributions at such leaves may not be reliable, and it is desirable to smooth them against more reliable statistics.</Paragraph>
    <Paragraph position="1"> One approach is to form the smoothed distributions P(. \[ n) from the empirical distributions P(. \[ n) for a node n by setting P(t I n) = An P(t I n) + (1 - An) P(t I parent(n)) where parent(n) is the parent node of n (with the convention that parent(root) -- root), and 0 _&lt; An _&lt; 1 can be thought of as the confidence placed in the empirical distribution at the node.</Paragraph>
    <Paragraph position="2"> In order to optimize the coefficients An, we seek to maximize the probability that the correct prediction is made for every event in a corpus PSg held-out from the training corpus used to grow the tree. That is, we attempt to maximize the objective function</Paragraph>
    <Paragraph position="4"> as a function of the coefficients A = (A1, A2,...) where L(H) is the leaf of the history H. While finding the maximizing A is generally an intractable problem, the EM algorithm can be adopted to estimate coefficients which locally maximize the above objective function. Since this is a straightforward application of the EM algorithm we will not present the details of the calculation here.</Paragraph>
  </Section>
</Paper>