<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3235">
<Title>Error Measures and Bayes Decision Rules Revisited with Applications to POS Tagging</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2 Bayes Decision Rule for Minimum Error Rate </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.1 The Bayes Posterior Risk </SectionTitle>
<Paragraph position="0"> Knowing that any task in NLP is a difficult one, we want to keep the number of wrong decisions as small as possible. This point of view has been used in pattern classification for more than 40 years as the starting point for many classification techniques. To classify an observation vector $y$ into one out of several classes $c$, we resort to statistical decision theory and try to minimize the average risk or loss incurred in taking a decision. The result is known as the Bayes decision rule (Chapter 2 in (Duda and Hart, 1973)):</Paragraph>
<Paragraph position="1"> \[ y \;\rightarrow\; \hat{c}(y) \;=\; \arg\min_{c} \Big\{ \sum_{\tilde{c}} \Pr(\tilde{c} \mid y)\, L[c,\tilde{c}] \Big\} \] </Paragraph>
<Paragraph position="2"> where $L[c,\tilde{c}]$ is the so-called loss function or error measure, i.e. the loss we incur in making decision $c$ when the true class is $\tilde{c}$.</Paragraph>
<Paragraph position="3"> In the following, we will consider two specific forms of the loss function or error measure $L[c,\tilde{c}]$. The first is the measure for string errors, which is the typical loss function used in virtually all statistical approaches. The second is the measure for symbol errors, which is the more appropriate measure for POS tagging and also for speech recognition without insertion and deletion errors (such as isolated word recognition).</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.2 String Error </SectionTitle>
<Paragraph position="0"> For POS tagging, the starting point is the observed sequence of words $y = w_1^N = w_1 \ldots w_N$, i.e. the sequence of words for which the POS tag sequence $c = g_1^N = g_1 \ldots g_N$ has to be determined.</Paragraph>
<Paragraph position="1"> The first error measure we consider is the string error: the error is zero only if the POS symbols of the two strings are identical at every position. In this case, the loss function is:</Paragraph>
<Paragraph position="2"> \[ L[c,\tilde{c}] \;=\; 1 - \delta(c,\tilde{c}) \] </Paragraph>
<Paragraph position="3"> with the Kronecker delta $\delta(c,\tilde{c})$. In other words, the errors are counted at the string level and not at the level of single symbols. Inserting this cost function into the Bayes risk (see Section 2.1), we immediately obtain the following form of the Bayes decision rule for minimum string error:</Paragraph>
<Paragraph position="4"> \[ w_1^N \;\rightarrow\; \hat{g}_1^N(w_1^N) \;=\; \arg\max_{g_1^N} \Pr(g_1^N \mid w_1^N) \] </Paragraph>
<Paragraph position="5"> This is the starting point for virtually all statistical approaches in NLP, such as speech recognition and machine translation. However, this decision rule is only optimal when we consider string errors, e.g. the sentence error rate in POS tagging and in speech recognition. In practice, however, the empirical errors are counted at the symbol level, as the toy example below illustrates.</Paragraph>
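As a small illustration (the posterior values and tag symbols below are invented for the example), one can compare the minimum string-error decision, i.e. the single most probable tag string, with the string assembled from position-wise posterior maxima; the two can differ, and only the latter minimizes the expected number of symbol errors:

```python
# Toy illustration (invented numbers): a posterior Pr(g_1^N | w_1^N) over three
# tag strings of length N = 2, where the most probable single string does not
# minimize the expected number of symbol errors.
posterior = {
    ("A", "X"): 0.4,   # the single most probable string
    ("B", "Y"): 0.3,
    ("B", "Z"): 0.3,
}
N = 2

# Minimum string error: pick the string with the highest posterior probability.
string_rule = max(posterior, key=posterior.get)

# Minimum symbol error: at each position m, pick the tag g maximizing the
# marginal p_m(g | w_1^N), i.e. the summed posterior of all strings with g_m = g.
marginals = [{} for _ in range(N)]
for tags, prob in posterior.items():
    for m, g in enumerate(tags):
        marginals[m][g] = marginals[m].get(g, 0.0) + prob
symbol_rule = tuple(max(marginals[m], key=marginals[m].get) for m in range(N))

def expected_symbol_errors(decision):
    # Expected number of positions at which the decision differs from the truth.
    return sum(prob * sum(d != t for d, t in zip(decision, truth))
               for truth, prob in posterior.items())

print(string_rule, expected_symbol_errors(string_rule))   # ('A', 'X') -> 1.2
print(symbol_rule, expected_symbol_errors(symbol_rule))   # ('B', 'X') -> 1.0
```

Note that the string-error rule still maximizes the probability that the entire string is correct; the two criteria simply reward different things.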
<Paragraph position="6"> Apart from (Goel and Byrne, 2003), this inconsistency between the decision rule and the error measure is not addressed in the literature.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.3 Symbol Error </SectionTitle>
<Paragraph position="0"> Instead of the string error rate, we can also consider the error rate of single POS tag symbols (Bahl et al., 1974; Merialdo, 1994).</Paragraph>
<Paragraph position="1"> This error measure is defined by the loss function:</Paragraph>
<Paragraph position="2"> \[ L[g_1^N, \tilde{g}_1^N] \;=\; \sum_{n=1}^{N} \big( 1 - \delta(g_n, \tilde{g}_n) \big) \] </Paragraph>
<Paragraph position="3"> This loss function has to be inserted into the Bayes decision rule in Section 2.1. The computation of the expected loss, i.e. the averaging over all classes $\tilde{c} = \tilde{g}_1^N$, can be performed in closed form. We omit the details of the straightforward calculations and state only the result. It turns out that we will need the marginal (and posterior) probability distribution</Paragraph>
<Paragraph position="4"> \[ p_m(g \mid w_1^N) \;=\; \sum_{g_1^N:\, g_m = g} \Pr(g_1^N \mid w_1^N) \] </Paragraph>
<Paragraph position="5"> where the sum is carried out over all POS tag strings $g_1^N$ with $g_m = g$, i.e. the tag $g_m$ at position $m$ is fixed at $g_m = g$. The question of how to perform this summation efficiently will be considered later, after we have introduced the model distributions.</Paragraph>
<Paragraph position="6"> Thus we have obtained the Bayes decision rule for minimum symbol error at position $m = 1,\ldots,N$:</Paragraph>
<Paragraph position="7"> \[ w_1^N \;\rightarrow\; \hat{g}_m(w_1^N) \;=\; \arg\max_{g} \, p_m(g \mid w_1^N) \] </Paragraph>
<Paragraph position="8"> By construction, this decision rule has the special property that it does not put direct emphasis on local coherence of the POS tags produced. In other words, this decision rule may produce a POS tag string that is linguistically less likely.</Paragraph>
</Section>
</Section>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 The Modelling Approaches to POS Tagging </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<Paragraph position="0"> The derivation of the Bayes decision rule assumes that the probability distribution $\Pr(g_1^N, w_1^N)$ (or $\Pr(g_1^N \mid w_1^N)$) is known. Unfortunately, this is not the case in practice. Therefore, the usual approach is to approximate the true but unknown distribution by a model distribution $p(g_1^N, w_1^N)$ (or $p(g_1^N \mid w_1^N)$).</Paragraph>
<Paragraph position="1"> We will review two popular modelling approaches, namely the generative model and the direct model, and consider the associated Bayes decision rules for both minimum string error and minimum symbol error.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.1 Generative Model: Trigram Model </SectionTitle>
<Paragraph position="0"> We replace the true but unknown joint distribution $\Pr(g_1^N, w_1^N)$ by a model distribution $p(g_1^N, w_1^N)$:</Paragraph>
<Paragraph position="1"> \[ \Pr(g_1^N, w_1^N) \;\approx\; p(g_1^N, w_1^N) \;=\; p(g_1^N) \cdot p(w_1^N \mid g_1^N) \] </Paragraph>
<Paragraph position="2"> We apply the so-called chain rule to factorize each of the distributions $p(g_1^N)$ and $p(w_1^N \mid g_1^N)$ into a product of conditional probabilities using specific dependence assumptions:</Paragraph>
<Paragraph position="3"> \[ p(g_1^N) \;=\; \prod_{n=1}^{N} p(g_n \mid g_{n-2}^{n-1}), \qquad p(w_1^N \mid g_1^N) \;=\; \prod_{n=1}^{N} p(w_n \mid g_n) \] </Paragraph>
<Paragraph position="4"> with suitable definitions for the case $n = 1$.</Paragraph>
<Paragraph position="5"> Here, the specific dependence assumptions are that the conditional probabilities can be represented by a POS trigram model $p(g_n \mid g_{n-2}^{n-1})$ and a word membership model $p(w_n \mid g_n)$. Thus we obtain a probability model whose structure fits into the mathematical framework of a so-called Hidden Markov Model (HMM); a minimal sketch of this factorization is given below.</Paragraph>
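To make the factorization concrete, the following sketch evaluates $p(g_1^N, w_1^N) = \prod_n p(g_n \mid g_{n-2}^{n-1})\, p(w_n \mid g_n)$ for a given tag string; the probability tables, tagset, boundary convention, and example sentence are invented toy values, not taken from the paper's models:

```python
# Sketch: joint probability of a tagged sentence under a trigram tag model
# and a word membership model. All probability tables are invented toy numbers.
BOS = "<s>"  # boundary tag standing in for the undefined predecessors at n = 1, 2

# p(g | g_prev2, g_prev1): POS trigram model (toy values).
p_tag = {
    (BOS, BOS, "DET"): 0.6, (BOS, "DET", "NOUN"): 0.7,
    ("DET", "NOUN", "VERB"): 0.5,
}
# p(w | g): word membership model (toy values).
p_word = {
    ("the", "DET"): 0.4, ("dog", "NOUN"): 0.01, ("barks", "VERB"): 0.02,
}

def joint_probability(words, tags):
    """p(g_1^N, w_1^N) = prod_n p(g_n | g_{n-2}, g_{n-1}) * p(w_n | g_n)."""
    prob = 1.0
    history = [BOS, BOS]                                   # convention for n = 1, 2
    for w, g in zip(words, tags):
        prob *= p_tag.get((history[0], history[1], g), 1e-8)   # crude fallback for unseen events
        prob *= p_word.get((w, g), 1e-8)
        history = [history[1], g]                          # shift the trigram history
    return prob

print(joint_probability(["the", "dog", "barks"], ["DET", "NOUN", "VERB"]))
```

Maximizing exactly this factorized score over all tag strings by dynamic programming is what the Viterbi algorithm mentioned below does.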
<Paragraph position="6"> This approach is therefore often also referred to as HMM-based POS tagging. However, this terminology is misleading: the POS tag sequence is observable, whereas in a Hidden Markov Model the state sequence is hidden and cannot be observed. In the experiments, we will use a 7-gram POS model; it is clear how to extend the equations from the trigram case to the 7-gram case.</Paragraph>
<Paragraph position="7"> Using the above model distribution, we directly obtain the decision rule for minimum string error:</Paragraph>
<Paragraph position="8"> \[ \hat{g}_1^N \;=\; \arg\max_{g_1^N} \Big\{ \prod_{n=1}^{N} p(g_n \mid g_{n-2}^{n-1})\, p(w_n \mid g_n) \Big\} \] </Paragraph>
<Paragraph position="9"> Since the model distribution is basically a second-order model (or trigram model), there is an efficient algorithm for finding the most probable POS tag string. This is achieved by a suitable dynamic programming algorithm, which is often referred to as the Viterbi algorithm in the literature.</Paragraph>
<Paragraph position="10"> To apply the Bayes decision rule for minimum symbol error rate, we first compute the marginal</Paragraph>
<Paragraph position="11"> \[ p_m(g, w_1^N) \;=\; \sum_{g_1^N:\, g_m = g} \prod_{n=1}^{N} p(g_n \mid g_{n-2}^{n-1})\, p(w_n \mid g_n) \] </Paragraph>
<Paragraph position="12"> Again, since the model is a second-order model, the sum over all possible POS tag strings $g_1^N$ (with $g_m = g$) can be computed efficiently using a suitable extension of the forward-backward algorithm (Bahl et al., 1974).</Paragraph>
<Paragraph position="13"> Thus we obtain the decision rule for minimum symbol error at positions $m = 1,\ldots,N$:</Paragraph>
<Paragraph position="14"> \[ w_1^N \;\rightarrow\; \hat{g}_m(w_1^N) \;=\; \arg\max_{g} \, p_m(g, w_1^N) \] </Paragraph>
<Paragraph position="15"> Here, after the marginal probability $p_m(g, w_1^N)$ has been computed, finding the most probable POS tag at position $m$ is computationally easy. Instead, the lion's share of the computational effort goes into computing the marginal probability $p_m(g, w_1^N)$ itself.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.2 Direct Model: Maximum Entropy </SectionTitle>
<Paragraph position="0"> We replace the true but unknown posterior distribution $\Pr(g_1^N \mid w_1^N)$ by a model distribution $p(g_1^N \mid w_1^N)$ and apply the chain rule:</Paragraph>
<Paragraph position="1"> \[ p(g_1^N \mid w_1^N) \;=\; \prod_{n=1}^{N} p(g_n \mid g_{n-2}^{n-1},\, w_{n-2}^{n+2}) \] </Paragraph>
<Paragraph position="2"> As for the generative model, we have made specific assumptions: there is a second-order dependence on the predecessor tags, and the dependence on the words $w_1^N$ is limited to a window $w_{n-2}^{n+2}$ around position $n$. The resulting model is still rather complex and requires further specification; its overall structure is sketched below.</Paragraph>
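Before the per-position conditional distributions are specified, the structure of the direct model can be sketched as follows; the padding symbols, the tagset, and the uniform stand-in conditional `uniform` are invented placeholders for whatever parameterization is actually chosen:

```python
# Sketch of the direct model's structure: the posterior of a tag string is the
# product of per-position conditionals that depend on the two predecessor tags
# and the word window w_{n-2}^{n+2}.
PAD_W, PAD_G = "<pad>", "<s>"

def posterior(words, tags, cond_prob):
    """p(g_1^N | w_1^N) = prod_n cond_prob(g_n | g_{n-2}, g_{n-1}, w_{n-2}^{n+2})."""
    padded = [PAD_W, PAD_W] + list(words) + [PAD_W, PAD_W]
    prob = 1.0
    g_prev2, g_prev1 = PAD_G, PAD_G
    for n, g_n in enumerate(tags):
        window = tuple(padded[n : n + 5])        # w_{n-2}, ..., w_{n+2}
        prob *= cond_prob(g_n, g_prev2, g_prev1, window)
        g_prev2, g_prev1 = g_prev1, g_n          # shift the tag history
    return prob

TAGS = ["DET", "NOUN", "VERB"]

def uniform(g, g_prev2, g_prev1, window):
    """Hypothetical stand-in conditional model: uniform over the toy tagset."""
    return 1.0 / len(TAGS)

print(posterior(["the", "dog", "barks"], ["DET", "NOUN", "VERB"], uniform))
```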
<Paragraph position="3"> The typical procedure is to resort to log-linear modelling, which is also referred to as maximum entropy modelling (Ratnaparkhi, 1996; Berger et al., 1996).</Paragraph>
<Paragraph position="4"> For the minimum string error, we obtain the decision rule:</Paragraph>
<Paragraph position="5"> \[ \hat{g}_1^N \;=\; \arg\max_{g_1^N} \Big\{ \prod_{n=1}^{N} p(g_n \mid g_{n-2}^{n-1},\, w_{n-2}^{n+2}) \Big\} \] </Paragraph>
<Paragraph position="6"> Since this is still a second-order model, we can use dynamic programming to compute the most likely POS string.</Paragraph>
<Paragraph position="7"> For the minimum symbol error, the marginal (and posterior) probability $p_m(g \mid w_1^N)$ has to be computed:</Paragraph>
<Paragraph position="8"> \[ p_m(g \mid w_1^N) \;=\; \sum_{g_1^N:\, g_m = g} \prod_{n=1}^{N} p(g_n \mid g_{n-2}^{n-1},\, w_{n-2}^{n+2}) \] </Paragraph>
<Paragraph position="9"> which, due to the specific structure of the model $p(g_n \mid g_{n-2}^{n-1},\, w_{n-2}^{n+2})$, can be calculated efficiently using only a forward algorithm (without a 'backward' part).</Paragraph>
<Paragraph position="10"> Thus we obtain the decision rule for minimum symbol error at positions $m = 1,\ldots,N$:</Paragraph>
<Paragraph position="11"> \[ \hat{g}_m \;=\; \arg\max_{g} \, p_m(g \mid w_1^N) \] </Paragraph>
<Paragraph position="12"> As in the case of the generative model, the computational effort lies in computing the posterior probability $p_m(g \mid w_1^N)$ rather than in finding the most probable tag at position $m$.</Paragraph>
</Section>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 The Training Procedure </SectionTitle>
<Paragraph position="0"> So far, we have said nothing about how we train the free parameters of the model distributions. We use fairly conventional training procedures, which we mention only for the sake of completeness.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.1 Generative Model </SectionTitle>
<Paragraph position="0"> We consider the trigram-based model. The free parameters here are the entries of the POS trigram distribution $p(g \mid g'', g')$ and of the word membership distribution $p(w \mid g)$. These unknown parameters are estimated from a labelled training corpus, i.e. a collection of sentences in which each word is annotated with its POS tag.</Paragraph>
<Paragraph position="1"> In principle, the free parameters of the models are estimated as relative frequencies. For the test data, we have to allow for both POS trigrams (or n-grams) and (single) words that were not seen in the training data. This problem is tackled by applying smoothing methods that were originally designed for language modelling in speech recognition (Ney et al., 1997).</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.2 Direct Model </SectionTitle>
<Paragraph position="0"> For the maximum entropy model, the free parameters are the so-called $\lambda_i$ or feature parameters (Berger et al., 1996; Ratnaparkhi, 1996). The training criterion is to maximize the logarithm of the model probabilities $p(g_n \mid g_{n-2}^{n-1},\, w_{n-2}^{n+2})$ summed over all positions $n$ in the training corpus. The corresponding algorithm is referred to as the GIS algorithm (Berger et al., 1996). As usual with maximum entropy models, the problem of smoothing does not seem to be critical and is not addressed explicitly.</Paragraph>
</Section>
</Section>
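To complement Section 4.2, the following sketch shows the log-linear form of the conditional model and the training criterion, i.e. the logarithm of the model probabilities of the reference tags summed over all corpus positions; the feature templates, tagset, and toy corpus are invented, and the actual optimization by GIS is not shown:

```python
# Sketch: log-linear / maximum-entropy parameterization of the conditional
# model and the training criterion of Section 4.2. Feature templates and the
# toy corpus are simplified stand-ins, not the paper's setup.
import math

TAGS = ["DET", "NOUN", "VERB"]

def features(g, g_prev2, g_prev1, window):
    """Binary feature functions f_i(history, g); names are illustrative only."""
    w = window[2]                                  # the current word w_n
    return {
        f"tag={g}&word={w}": 1.0,
        f"tag={g}&prev={g_prev1}": 1.0,
        f"tag={g}&prev2={g_prev2},{g_prev1}": 1.0,
    }

def cond_prob(lambdas, g, g_prev2, g_prev1, window):
    """p(g | g_{n-2}^{n-1}, w_{n-2}^{n+2}) = exp(sum_i lambda_i f_i) / Z(history)."""
    def score(tag):
        return sum(lambdas.get(name, 0.0) * value
                   for name, value in features(tag, g_prev2, g_prev1, window).items())
    z = sum(math.exp(score(tag)) for tag in TAGS)  # normalization over the tagset
    return math.exp(score(g)) / z

def log_likelihood(lambdas, corpus):
    """Training criterion: sum over all positions n of log p(g_n | history_n)."""
    return sum(math.log(cond_prob(lambdas, g, g_prev2, g_prev1, window))
               for g, g_prev2, g_prev1, window in corpus)

# A GIS or other gradient-based optimizer would now maximize log_likelihood
# with respect to the lambda parameters.
toy_corpus = [("DET", "<s>", "<s>", ("<pad>", "<pad>", "the", "dog", "barks"))]
print(log_likelihood({}, toy_corpus))              # all lambdas 0: N * log(1/|TAGS|)
```
</Paper>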