<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1082">
  <Title>Tagging with Hidden Markov Models Using Ambiguous Tags</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Computing probability distributions for ambiguous tags
</SectionTitle>
    <Paragraph position="0"> Probabilistic models for part-of-speech taggers are built in two stages: in the first stage, counts are collected from a tagged training corpus; in the second, probabilities are computed on the basis of these counts. Two types of counts are collected: lexical counts, noted C_l(w, T), indicating how many times word w has been tagged T in the training corpus, and syntactic counts C_s(T_1, T_2, T_3), indicating how many times the tag sequence T_1 T_2 T_3 occurred in the training corpus. Lexical counts are stored in a lexicon and syntactic counts in a 3-gram database.</Paragraph>
    <Paragraph position="1"> These real counts will be used to compute flctitious counts for ambiguous tags on the basis of which probability distributions will be estimated. The rationale behind the computation of the counts (lexical as well as syntactic) of an ambiguous tag T1:::j is that they must re ect the homogeneity of the counts of fT1 :::Tjg. If they are all equal, the count of T1:::j should be maximal.</Paragraph>
    <Paragraph position="2"> Impurity functions (Breiman et al., 1984) perfectly model this behavior1: an impurity function ' is a function deflned on the set of all N-tuples of numbers (p1;:::;pN) satisfying 8j 2 [1;:::;N];pj , 0 and PNj=1 pj = 1 with the following properties: 1Entropy would be another candidate for such computation. The same experiments have also been conducted using entropy and lead to almost the same results. + ' reaches its maximum at the point</Paragraph>
    <Paragraph position="4"> Given an impurity function ', we deflne the impurity measure of a N-tuple of counts C =</Paragraph>
    <Paragraph position="6"> whose maximal value is equal to N!1N .</Paragraph>
    <Paragraph position="7"> The impurity measure will be used to compute both lexical and syntactic flctitious counts as described in the two following sections.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Lexical counts
</SectionTitle>
      <Paragraph position="0"> Lexical counts for an ambiguous tag T1;:::;n are computed using lexical impurity Il(w;T1;:::;n) which measures the impurity of the n-tuple</Paragraph>
      <Paragraph position="2"> A high lexical impurity Il(w;T1;:::;n) means that w is ambiguous with respect to the difierent classes T1;:::;Tn. It reaches its maximum when w has the same probability to belong to any of them. The lexical count Cl(w;T1;:::;n) is computed using the following formula:</Paragraph>
      <Paragraph position="4"> This formula is used to update a lexicon, for each lexical entry, the counts of the ambiguous tags are computed and added to the entry. The two entries daily and deals whose original counts are represented below2:  daily RB 32 JJ 41 deals NNS 1 VBZ 13 2RB, JJ, NNS and VBZ stand respectively for adverb, adjective, plural noun and verb (3rd person singular, present).</Paragraph>
      <Paragraph position="5"> are updated to3: daily RB 32 JJ 41 JJ_RB 36 deals NNS 1 VBZ 13 NNS_VBZ 2</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Syntactic counts
</SectionTitle>
      <Paragraph position="0"> Syntactic counts of the form Cs(X;Y;T1;:::;n) are computed using syntactic impurity Is(X;Y;T1;:::;n) which measures the impurity of</Paragraph>
      <Paragraph position="2"> A maximum syntactic impurity means that all the tags T1;:::;Tn have the same probability of occurrence after the tag sequence X Y . If any of them has a probability of occurrence equal to zero after such a tag sequence, the impurity is also equal to zero. The syntactic count Cs(X;Y;T1;:::;n) is computed using the following formula:</Paragraph>
      <Paragraph position="4"> Such a formula is used to update the 3-gram database in three steps. First, syntactic counts of the form Cs(X;Y;T1;:::;n) (with X and Y unambiguous) are computed, then syntactic counts of the form Cs(X;T1;:::;n;Y ) (with X unambiguous and Y possibly ambiguous) and eventually, syntactic counts of the form Cs(T1;:::;n;X;Y ) (for X and Y possibly ambiguous). The following four real 3-grams:  which will be added to the 3-gram database.</Paragraph>
      <Paragraph position="5"> Note that the real 3-grams are not modifled during this operation.</Paragraph>
      <Paragraph position="6"> Once the lexicon and the 3-gram database have been updated, both real and flctitious counts are used to estimate lexical and syntactic probability distribution. These probability distributions constitute the model. The tagging process itself, based on the Viterbi search algorithm, is unchanged.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Data sparseness
</SectionTitle>
      <Paragraph position="0"> The introduction of new tags in the tagset increases the number of states in the HMM and therefore the number of parameters to be estimated. It is important to notice that even if the number of parameters increases, the model does not become more sensitive to data sparseness problems than the original model was. The reason is that flctitious counts are computed based on actual counts. The occurrence, in the training corpus, of an event (as the occurrence of a sequence of tags or the occurrence of a word with a given tag) is used for estimating both the probability of the event associated to the simple tag and the probabilities of the events associated with the ambiguous tags which contain the simple tag. For example, the occurrence of the word w with tag T, in the training corpus, will be used to estimate the lexical probability P(wjT) as well as the lexical probabilities P(wjT0) for every ambiguous tag T0 of which T may be a component.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Learning ambiguous tags from errors
</SectionTitle>
    <Paragraph position="0"> errors Since ambiguous tags are not given a priori, candidates can be selected based on the errors made by the tagger. The idea developed in this section consists in learning iteratively ambiguous tags on the basis of the errors made by a tagger. When a word w tagged T1 in a reference corpus has been wrongly tagged T2 by the tagger, that means that T1 and T2 are lexically and syntactically ambiguous, with respect to w and a given context. Consequently, T1;2 is a potential candidate for an ambiguous tag.</Paragraph>
    <Paragraph position="1"> The process of discovering ambiguous tags starts with a tagged training corpus whose tagset is called T0. A standard tagger, M0, is trained on this corpus. M0 is used to tag the training corpus. A confusion matrix is then computed and the most frequent error is selected to form an ambiguous tag which is added to T0 to constitute T1. M0 is then updated with the new ambiguous tag to constitue M1, as described in section 2. The process is iterated : the training corpus is tagged with Mi, the most frequent error is used to constitue Ti+1 and a new tagger Mi+1 is built, based on Mi.</Paragraph>
    <Paragraph position="2"> The process continues until the result of the tagging on the development corpus converges or the number of iterations has reached a given threshold. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Experiments
</SectionTitle>
      <Paragraph position="0"> The model described in section 2 has been tested on the Brown corpus (Francis and Ku*cera, 1982), tagged with the 45 tags of the Penn treebank tagset (Marcus et al., 1993), which constitute the initial tagset T0. The corpus has been divided in a training corpus of 961;3 K words, a development corpus of 118;6 K words and a test corpus of 115;6 K words.</Paragraph>
      <Paragraph position="1"> The development corpus was used to detect the convergence and the flnal model was evaluated on the test corpus. The iterative tag learning algorithm converged after 50 iterations.</Paragraph>
      <Paragraph position="2"> A standard trigram model (without ambiguous tags) M0 was trained on the training corpus using the CMU-Cambridge statistical language modeling toolkit (Clarkson and Rosenfeld, 1997). Smoothing was done through backofi on bigrams and unigrams using linear discounting (Ney et al., 1994).</Paragraph>
      <Paragraph position="3"> The lexical probabilities were estimated on the training corpus. Unknown words (words of the development and test corpus not present in the lexicon) were taken into account by a simple technique: the words of the development corpus not present in the training corpus were used to estimate the lexical counts of unknown words Cl(UNK;t). During tagging, if a word is unknown, the probability distribution of word UNK is used. The development corpus contains 4097 unknown words (3:4% of the corpus) and the test corpus 3991 (3:3%).</Paragraph>
      <Paragraph position="4">  The result of the tagging process consists in a sequence of ambiguous and non ambiguous tags.</Paragraph>
      <Paragraph position="5"> This result can no longer be evaluated using accuracy alone (or word error rate), as it is usually the case in part of speech tagging, since the introduction of ambiguous tags allows the tagger to assign multiple tags to a word. This is why two measures have been used to evaluate the output of the tagger with respect to a gold standard: the recall and the ambiguity rate.</Paragraph>
      <Paragraph position="6"> Given an output of the tagger T = t1 :::tn, where ti is the tag associated to word i by the tagger, and a gold reference R = r1 :::rn where r1 is the correct tag for word wi, the recall of T is computed as follows :</Paragraph>
      <Paragraph position="8"> where -(p) equals to 1 if predicate p is true and 0 otherwise. A recall of 1 means that for every word occurrence, the correct tag is an element of the tag given by the tagger.</Paragraph>
      <Paragraph position="9"> The ambiguity rate of T is computed as fol-</Paragraph>
      <Paragraph position="11"> where AMB(ti) is the ambiguity of tag ti. An ambiguity rate of 1 means that no ambiguous tag has been introduced. The maximum ambiguity rate for the development corpus (when all the possible tags of a word are kept) is equal to 2:4.</Paragraph>
      <Paragraph position="12">  The successive models Mi are based on the different tagsetsTi. Their output is evaluated with the two measures described above. But these flgures by themselves are di-cult to interpret if we cannot compare them with the output of another tagging process based on the same tagset. The only point of comparision at hand is model M0 but it is based on tagset T0, which does not contain ambiguous tags. In order to create such a point of comparison, a baseline model Bi is built at every iteration. The general idea is to replace in the training corpus, all occurrences of tags that appear as an element of an ambiguous tag of Ti by the ambiguous tag itself. After the replacement stage, a model Bi is computed and used to tag the development corpus. The output of the tagging is evaluated using recall and ambiguity rate and can be compared to the output of model Mi.</Paragraph>
      <Paragraph position="13"> The replacement stage described above is actually too simplistic and gives rise to very poor baseline models. There are two problems with this approach. The flrst is that a tag Ti can appear as a member of several ambiguous tags and we must therefore decide which one to choose.</Paragraph>
      <Paragraph position="14"> The second, is that a word tagged Ti in the reference corpus might be unambiguous, it would therefore be \unfair&amp;quot; to associate to it an ambiguous tag. This is the reason why the replacement step is more elaborate. At iteration i, for each couple (wj;Tj) of the training corpus, a lookup is done in the lexicon, which gives access to all the possible non ambiguous tags word wj can have. If there is an ambiguous tag T in Ti such that all its elements are possible tags of wj then, couple (wj;Tj) is replaced with (wj;T) in the corpus. If several ambiguous tags fulflll this condition, the ambiguous tag which has the highest lexical count for wj is chosen.</Paragraph>
      <Paragraph position="15"> Another simple way to build a baseline would be to produce the n best solutions of the tagger, then take for each word of the input the tags associated to it in the difierent solutions and make an ambiguous tag out of these tags.</Paragraph>
      <Paragraph position="16"> This solution was not adopted for two reasons.</Paragraph>
      <Paragraph position="17"> The flrst is that this method mixes tags from difierent solutions of the tagger and can lead to completely incoherent tags sequences. It is di-cult to measure the in uence of this incoherence on the post-tagging stages of the application and we didn't try to measure it empirically. But the idea of potentially producing solutions which are given very poor probabilities by the model is unappealing. The second reason is that we cannot control anymore which ambiguous tags will be created (although this feature might be desirable in some cases). It will be therefore di-cult to compare the result with our models (the tagsets will be difierent).4</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>