<?xml version="1.0" standalone="yes"?>
<Paper uid="J94-2001">
<Title>Tagging English Text with a Probabilistic Model</Title>
<Section position="3" start_page="0" end_page="156" type="metho">
<SectionTitle> 2. The Problem of Tagging </SectionTitle>
<Paragraph position="0"> We suppose that the user has defined a set of tags (attached to words). Consider a sentence W = w_1 w_2 ... w_n and a sequence of tags T = t_1 t_2 ... t_n of the same length.</Paragraph>
<Paragraph position="1"> We call the pair (W, T) an alignment. We say that word w_i has been assigned the tag t_i in this alignment.</Paragraph>
<Paragraph position="2"> We assume that the tags have some linguistic meaning for the user, so that among all possible alignments of a sentence there is a single one that is correct from a grammatical point of view.</Paragraph>
<Paragraph position="3"> A tagging procedure is a procedure φ that selects a sequence of tags (and so defines an alignment) for each sentence:

    φ : W → T = φ(W)
</Paragraph>
<Paragraph position="4"> There are (at least) two measures for the quality of a tagging procedure:

    * at sentence level: perf_S(φ) = percentage of sentences correctly tagged
    * at word level: perf_W(φ) = percentage of words correctly tagged

In practice, performance at sentence level is generally lower than performance at word level, since all the words of a sentence have to be tagged correctly for the sentence itself to count as correctly tagged.</Paragraph>
<Paragraph position="5"> The standard measure used in the literature is performance at word level, and this is the one considered here.</Paragraph>
</Section>
<Section position="4" start_page="156" end_page="157" type="metho">
<SectionTitle> 3. Probabilistic Formulation </SectionTitle>
<Paragraph position="0"> In the probabilistic formulation of the tagging problem we assume that the alignments are generated by a probabilistic model according to a probability distribution p(W, T). In this case, depending on the criterion chosen for evaluation, the optimal tagging procedure is as follows:

    * For evaluation at sentence level, choose the most probable sequence of tags for the sentence:

        argmax_T p(T | W) = argmax_T p(W, T)

      We call this procedure Viterbi tagging. It is achieved using a dynamic programming scheme.</Paragraph>
<Paragraph position="1">
    * For evaluation at word level, choose the most probable tag for each word in the sentence:

        φ(W)_i = argmax_t p(t_i = t | W) = argmax_t Σ_{T : t_i = t} p(W, T)

      where φ(W)_i is the tag assigned to word w_i by the tagging procedure φ in the context of the sentence W. We call this procedure Maximum Likelihood (ML) tagging.</Paragraph>
<Paragraph position="2"> It is interesting to note that the most commonly used method is Viterbi tagging (see DeRose 1988; Church 1989), although it is not the optimal method for evaluation at word level.</Paragraph>
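To make the difference between the two criteria concrete, here is a small brute-force sketch in Python. It is only an illustration: the bigram-tag toy model (a biclass rather than triclass model, for brevity), the probabilities, and the helper names are invented, and this is not the implementation used in the paper. It enumerates every alignment of a short sentence and compares the alignment chosen by Viterbi tagging with the tags chosen word by word by ML tagging.

```python
from itertools import product

# Toy model: p(W, T) factored as a bigram tag model times word emissions.
# All numbers below are invented for illustration only.
TAGS = ["DET", "NOUN", "VERB"]
trans = {  # p(t_i | t_{i-1}), with "<s>" as a start symbol
    "<s>":  {"DET": 0.6,  "NOUN": 0.3, "VERB": 0.1},
    "DET":  {"DET": 0.05, "NOUN": 0.85, "VERB": 0.1},
    "NOUN": {"DET": 0.1,  "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5,  "NOUN": 0.4, "VERB": 0.1},
}
emit = {  # p(w_i | t_i)
    "DET":  {"the": 0.9, "dog": 0.0, "barks": 0.1},
    "NOUN": {"the": 0.0, "dog": 0.7, "barks": 0.3},
    "VERB": {"the": 0.0, "dog": 0.2, "barks": 0.8},
}

def p_joint(words, tags):
    """p(W, T) for one alignment (W, T)."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= trans[prev][t] * emit[t][w]
        prev = t
    return p

def viterbi_tagging(words):
    """Sentence-level criterion: argmax_T p(W, T), found by brute force."""
    return list(max(product(TAGS, repeat=len(words)),
                    key=lambda T: p_joint(words, T)))

def ml_tagging(words):
    """Word-level criterion: for each i, argmax_t of the sum of p(W, T) over T with t_i = t."""
    best = []
    for i in range(len(words)):
        posterior = {t: 0.0 for t in TAGS}
        for T in product(TAGS, repeat=len(words)):
            posterior[T[i]] += p_joint(words, T)
        best.append(max(posterior, key=posterior.get))
    return best

sentence = ["the", "dog", "barks"]
print("Viterbi tagging:", viterbi_tagging(sentence))
print("ML tagging:     ", ml_tagging(sentence))
```

The two procedures coincide on many sentences but can differ when the probability mass is spread over several competing alignments, which is why the choice of evaluation criterion matters.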
<Paragraph position="3"> The reasons why Viterbi tagging is nevertheless the preferred method are presumably that:

    * Viterbi tagging is simpler to implement than ML tagging and requires less computation (although both have the same asymptotic complexity);
    * Viterbi tagging provides the best single interpretation of the whole sentence, which is linguistically appealing;
    * ML tagging may produce sequences of tags that are linguistically impossible (because the choice of each tag depends on all contexts taken together).</Paragraph>
<Paragraph position="4"> However, in our experiments we will show that Viterbi and ML tagging result in very similar performance.</Paragraph>
<Paragraph position="5"> Of course, the real tags have not been generated by a probabilistic model and, even if they had been, we would not be able to determine this model exactly because of practical limitations. Therefore the models that we construct are only approximations of an ideal model that does not exist. It so happens that, despite these assumptions and approximations, such models are still able to perform reasonably well.</Paragraph>
</Section>
<Section position="5" start_page="157" end_page="159" type="metho">
<SectionTitle> 4. The Triclass Model </SectionTitle>
<Paragraph position="0"> We have the mathematical expression:</Paragraph>
<Paragraph position="1">

    p(W, T) = ∏_{i=1..n} p(t_i | w_1 t_1 ... w_{i-1} t_{i-1}) · p(w_i | w_1 t_1 ... w_{i-1} t_{i-1} t_i)
</Paragraph>
<Paragraph position="2"> The triclass (or tri-POS (Derouault 1986), or trigram (Codogno et al. 1987), or HK) model is based on the following approximations:

    * The probability of the tag given the past depends only on the last two tags:

        p(t_i | w_1 t_1 ... w_{i-1} t_{i-1}) = h(t_i | t_{i-2} t_{i-1})

    * The probability of the word given the past depends only on its own tag:

        p(w_i | w_1 t_1 ... w_{i-1} t_{i-1} t_i) = k(w_i | t_i)

(The name HK model comes from the notation chosen for these probabilities.) In order to define the model completely we have to specify the values of all the h and k probabilities. If N_W is the size of the vocabulary and N_T the number of different tags, then there are:</Paragraph>
<Paragraph position="3">

    N_T^3 probabilities h(t_3 | t_1 t_2) and N_T × N_W probabilities k(w | t)
</Paragraph>
<Paragraph position="4"> Note that this number grows only linearly with respect to the size of the vocabulary, which makes this model attractive for very large vocabularies.</Paragraph>
<Paragraph position="5"> The triclass model by itself allows any word to have any tag. However, if we have a dictionary that specifies the list of possible tags for each word, we can use this information to constrain the model: if t is not a valid tag for the word w, then we are sure that k(w | t) = 0.</Paragraph>
<Paragraph position="6"> There are thus at most as many nonzero values for the k probabilities as there are (word, tag) pairs allowed in the dictionary.</Paragraph>
<Section position="1" start_page="158" end_page="159" type="sub_section">
<SectionTitle> 5.1 Relative Frequency Training </SectionTitle>
<Paragraph position="0"> If we have some tagged text available, we can compute the number of times N(w, t) a given word w appears with the tag t, and the number of times N(t_1, t_2, t_3) the sequence of tags (t_1, t_2, t_3) appears in this text. We can then estimate the probabilities h and k by computing the relative frequencies of the corresponding events on these data:</Paragraph>
<Paragraph position="1">

    h_rf(t_3 | t_1 t_2) = N(t_1, t_2, t_3) / N(t_1, t_2)
    k_rf(w | t) = N(w, t) / N(t)
</Paragraph>
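As a concrete illustration of these relative frequency estimates, the following Python sketch accumulates the counts N(w, t), N(t), N(t_1, t_2, t_3), and N(t_1, t_2) from a tiny tagged text and turns them into h_rf and k_rf. The toy corpus and helper names are invented, and padding sentence starts with a "<s>" symbol is an added convention, not something specified in the paper.

```python
from collections import defaultdict

# A tiny tagged corpus: each sentence is a list of (word, tag) pairs.
# The data is invented for illustration only.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

n_wt = defaultdict(int)       # N(w, t)
n_t = defaultdict(int)        # N(t)
n_t1t2t3 = defaultdict(int)   # N(t1, t2, t3)
n_t1t2 = defaultdict(int)     # N(t1, t2)

for sent in corpus:
    tags = ["<s>", "<s>"] + [t for _, t in sent]   # start padding (an added convention)
    for w, t in sent:
        n_wt[(w, t)] += 1
        n_t[t] += 1
    for t1, t2, t3 in zip(tags, tags[1:], tags[2:]):
        n_t1t2t3[(t1, t2, t3)] += 1
        n_t1t2[(t1, t2)] += 1

def h_rf(t3, t1, t2):
    """Relative frequency estimate of h(t3 | t1 t2)."""
    return n_t1t2t3[(t1, t2, t3)] / n_t1t2[(t1, t2)] if n_t1t2[(t1, t2)] else 0.0

def k_rf(w, t):
    """Relative frequency estimate of k(w | t)."""
    return n_wt[(w, t)] / n_t[t] if n_t[t] else 0.0

print(h_rf("VERB", "DET", "NOUN"))   # 1.0 on this toy corpus
print(k_rf("dog", "NOUN"))           # 1.0
print(k_rf("barks", "VERB"))         # 0.5
```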
<Paragraph position="2"> These estimates assign a probability of zero to any sequence of tags that did not occur in the training data. But such sequences may occur if we consider other texts.</Paragraph>
<Paragraph position="3"> A probability of zero for a sequence creates problems because any alignment that contains this sequence also gets a probability of zero. Therefore, it may happen that, for some sequences of words, all alignments get a probability of zero and the model becomes useless for such sentences.</Paragraph>
<Paragraph position="4"> To avoid this, we interpolate these distributions with uniform distributions; i.e., we consider the interpolated model defined by:

    h_int(t_3 | t_1 t_2) = λ h_rf(t_3 | t_1 t_2) + (1 − λ) u_h
    k_int(w | t) = λ k_rf(w | t) + (1 − λ) u_k

where u_h = 1/N_T and u_k = 1/N_W are the uniform distributions over tags and words, respectively.</Paragraph>
<Paragraph position="7"> The interpolation coefficient λ is computed using the deleted interpolation algorithm (Jelinek and Mercer 1980). (It would also be possible to use two coefficients, one for the interpolation on h and one for the interpolation on k.) The value of this coefficient is expected to increase as the size of the training text increases, since the relative frequencies should then be more reliable. This interpolation procedure is also called "smoothing." Smoothing is performed as follows: some quantity of tagged text from the training data is not used in the computation of the relative frequencies; it is called the "held-out" data. The coefficient λ is chosen to maximize the probability that the interpolated model assigns to the held-out data.</Paragraph>
<Paragraph position="8"> This maximization can be performed by the standard Forward-Backward (FB) or Baum-Welch algorithm (Baum and Eagon 1967; Jelinek 1976; Bahl, Jelinek, and Mercer 1983; Poritz 1988), by considering λ and 1 − λ as the transition probabilities of a Markov model.</Paragraph>
<Paragraph position="9"> It can be noted that more complicated interpolation schemes are possible. For example, different coefficients can be used depending on the count of (t_1, t_2), with the intuition that relative frequencies can be trusted more when this count is high. Another possibility is to interpolate also with models of different orders, such as h_rf(t_3 | t_2) or h_rf(t_3).</Paragraph>
<Paragraph position="10"> Smoothing can also be achieved with procedures other than interpolation. One example is the "backing-off" strategy proposed by Katz (1987).</Paragraph>
</Section>
<Section position="2" start_page="159" end_page="159" type="sub_section">
<SectionTitle> 5.2 Maximum Likelihood Training </SectionTitle>
<Paragraph position="0"> Using a triclass model M it is possible to compute the probability of any sequence of words W according to this model:</Paragraph>
<Paragraph position="1">

    p_M(W) = Σ_T p_M(W, T)
</Paragraph>
<Paragraph position="2"> where the sum is taken over all possible alignments. Maximum Likelihood (ML) training finds the model M that maximizes the probability of the training text:

    max_M ∏_W p_M(W)

where the product is taken over all the sentences W in the training text. This is the problem of training a hidden Markov model (hidden because the sequence of tags is not observed). A well-known solution to this problem is the Forward-Backward (FB) or Baum-Welch algorithm (Baum and Eagon 1967; Jelinek 1976; Bahl, Jelinek, and Mercer 1983), which iteratively constructs a sequence of models, each improving the probability of the training data.</Paragraph>
<Paragraph position="3"> The advantage of this approach is that it does not require any tagging of the text, but it makes the assumption that the correct model is the one in which tags are used to best predict the word sequence.</Paragraph>
</Section>
</Section>
<Section position="6" start_page="159" end_page="159" type="metho">
<SectionTitle> 6. Tagging Algorithms </SectionTitle>
<Paragraph position="0"> The Viterbi algorithm is easily implemented using a dynamic programming scheme (Bellman 1957).</Paragraph>
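As an illustration of such a dynamic programming scheme for the triclass model, here is a minimal Viterbi sketch in Python. It is not the paper's implementation: the search state at position i is taken to be the tag pair (t_{i-1}, t_i), the tables h and k are assumed to be given as nested dictionaries, and the "<s>" start symbol and the use of log probabilities are added conventions.

```python
import math

def viterbi_triclass(words, tags, h, k, start="<s>"):
    """Return argmax_T p(W, T) under the triclass model, by dynamic programming.

    h[(t1, t2)][t3] plays the role of h(t3 | t1 t2) and k[t][w] the role of k(w | t);
    both tables are assumed to be provided (e.g. relative frequency or interpolated
    estimates). Log probabilities avoid underflow on long sentences.
    """
    if not words:
        return []
    # delta[(t_prev, t)]: best log-probability of an alignment of the words read so far
    # that ends with the tag pair (t_prev, t); back stores the tag before that pair.
    delta = {(start, start): 0.0}
    back = {}
    for i, w in enumerate(words):
        new_delta = {}
        for (t1, t2), score in delta.items():
            for t3 in tags:
                p_trans = h.get((t1, t2), {}).get(t3, 0.0)
                p_emit = k.get(t3, {}).get(w, 0.0)
                if p_trans == 0.0 or p_emit == 0.0:
                    continue  # this extension has probability zero
                cand = score + math.log(p_trans) + math.log(p_emit)
                if cand > new_delta.get((t2, t3), -math.inf):
                    new_delta[(t2, t3)] = cand
                    back[(i, t2, t3)] = t1
        delta = new_delta
    if not delta:
        return None  # every alignment has probability zero (hence the need for smoothing)
    # Pick the best final tag pair, then follow the back-pointers.
    t_prev, t_last = max(delta, key=delta.get)
    seq = [t_prev, t_last]
    for i in range(len(words) - 1, 0, -1):
        seq.insert(0, back[(i, seq[0], seq[1])])
    return seq[-len(words):]  # drop the start padding
```

With relative frequency or interpolated estimates plugged in for h and k, this returns the most probable alignment; when every alignment has probability zero the function returns None, which is precisely the situation that smoothing is meant to avoid.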
<Paragraph position="1"> The Maximum Likelihood algorithm appears more complex at first glance, because it involves computing the sum of the probabilities of a large number of alignments. However, in the case of a hidden Markov model, these computations can be arranged in a way similar to the one used during the FB algorithm, so that the overall amount of computation needed becomes linear in the length of the sentence (Baum and Eagon 1967).</Paragraph>
</Section>
</Paper>