<?xml version="1.0" standalone="yes"?>
<Paper uid="J94-2001">
  <Title>Tagging English Text with a Probabilistic Model</Title>
  <Section position="7" start_page="159" end_page="165" type="evalu">
    <SectionTitle>
7. Experiments
</SectionTitle>
    <Paragraph position="0"> The main objective of this paper is to compare RF and ML training. This is done in Section 7.2. We also take advantage of the environment that we have set up to perform other experiments, described in Section 7.3, that have some theoretical interest but did not bring any improvement in practice. One concerns the difference between Viterbi and ML tagging, and the other concerns the use of constraints during training. We shall begin by describing the textual data that we are using, before presenting the tagging experiments with these various training and tagging methods.</Paragraph>
    <Section position="1" start_page="160" end_page="160" type="sub_section">
      <SectionTitle>
7.1 Text Data
</SectionTitle>
      <Paragraph position="0"> We use the &quot;treebank&quot; data described in Beale (1988). It contains 42,186 sentences (about one million words) from the Associated Press. These sentences have been tagged manually at the Unit for Computer Research on the English Language (University of Lancaster, U.K.), in collaboration with IBM U.K. (Winchester) and the IBM Speech Recognition group in Yorktown Heights (USA). In fact, these sentences are not only tagged but also parsed. However, we do not use the information contained in the parse.</Paragraph>
      <Paragraph position="1"> In the treebank 159 different tags are used. These tags were projected on a smaller system of 76 tags designed by Evelyne Tzoukermann and Peter Brown (see Appendix).</Paragraph>
      <Paragraph position="2"> The results quoted in this paper all refer to this smaller system.</Paragraph>
      <Paragraph position="3"> We built a dictionary that indicates the list of possible tags for each word, by taking all the words that occur in this text and, for each word, all the tags that are assigned to it somewhere in the text. In some sense, this is an optimal dictionary for this data, since a word will not have all its possible tags (in the language), but only the tags that it actually had within the text.</Paragraph>
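      <Paragraph> The dictionary construction can be illustrated with a short sketch. The corpus representation (lists of word/tag pairs) and all names below are illustrative assumptions, not the authors' code:
from collections import defaultdict

def build_dictionary(tagged_sentences):
    """Map each word to the set of tags it actually receives in the corpus."""
    possible_tags = defaultdict(set)
    for sentence in tagged_sentences:          # sentence: list of (word, tag) pairs
        for word, tag in sentence:
            possible_tags[word].add(tag)
    return dict(possible_tags)

# Toy usage: "can" ends up with exactly the tags observed in the text.
corpus = [[("the", "DET"), ("can", "NOUN")], [("we", "PRON"), ("can", "AUX")]]
lexicon = build_dictionary(corpus)
</Paragraph>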
      <Paragraph position="4"> We separated this data into two parts:
* a set of 40,186 tagged sentences, the training data, which is used to build the models;
* a set of 2,000 tagged sentences (45,583 words), the test data, which is used to test the quality of the models.</Paragraph>
    </Section>
    <Section position="2" start_page="160" end_page="163" type="sub_section">
      <SectionTitle>
7.2 Basic Experiments
</SectionTitle>
      <Paragraph position="0"> RF training, Viterbi tagging In this experiment, we extracted N tagged sentences from the training data. We then computed the relative frequencies on these sentences and built a &quot;smoothed&quot; model using the procedure previously described. This model was then used to tag the 2,000 test sentences. We experimented with different values of N, for each of which we indicate the value of the interpolation coefficient and the number and percentage of correctly tagged words. Results are indicated in Table 1.</Paragraph>
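      <Paragraph> As a hedged sketch of this RF training step, the fragment below computes relative frequencies for a biclass simplification of the model (transitions h(t2/t1) and emissions k(w/t)) and interpolates each distribution with the uniform one. The fixed coefficient lam is purely illustrative; the paper estimates the interpolation coefficient with the procedure described earlier:
from collections import defaultdict

def relative_frequencies(tagged_sentences):
    """Relative-frequency estimates h(t2|t1) and k(w|t) for a biclass model
    (the paper uses a triclass model; this is a simplification for brevity)."""
    trans = defaultdict(lambda: defaultdict(float))
    emit = defaultdict(lambda: defaultdict(float))
    for sent in tagged_sentences:
        prev = "START"                         # illustrative boundary tag
        for word, tag in sent:
            trans[prev][tag] += 1.0
            emit[tag][word] += 1.0
            prev = tag
    for table in (trans, emit):                # normalize counts into probabilities
        for row in table.values():
            total = sum(row.values())
            for key in row:
                row[key] /= total
    return trans, emit

def smooth(row, support, lam=0.9):
    """Interpolate a relative-frequency distribution with the uniform one.
    lam is a fixed illustrative value; the paper estimates this coefficient."""
    u = 1.0 / len(support)
    return {x: lam * row.get(x, 0.0) + (1.0 - lam) * u for x in support}
</Paragraph>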
      <Paragraph position="1"> As expected, as the size of the training data increases, the interpolation coefficient increases and the quality of the tagging improves.</Paragraph>
      <Paragraph position="2"> When N = 0, the model is made up of uniform distributions. In this case, all alignments for a sentence are equally probable, so that the choice of the correct tag is just a choice at random. However, the percentage of correct tags is relatively high (more than three out of four) because:
* almost half of the words of the text have a single possible tag, so that no mistake can be made on these words;
* about a quarter of the words of the text have only two possible tags so that, on the average, a random choice is correct every other time.</Paragraph>
      <Paragraph position="3"> Note that this behavior is obviously very dependent on the system of tags that is used. Note also that reasonable results are obtained quite rapidly: using 2,000 tagged sentences (less than 50,000 words), the tagging error rate is already less than 5%, and using 10 times as much data (20,000 tagged sentences) provides a further improvement of only 1.5%.</Paragraph>
      <Paragraph position="4"> ML training, Viterbi tagging In ML training we take all the training data available (40,186 sentences) but we only use the word sequences, not the associated tags (except to compute the initial model, as will be described later). This is possible since the FB algorithm is able to train the model using the word sequence only.</Paragraph>
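      <Paragraph> To make the FB training concrete, here is a minimal Baum-Welch sketch for a biclass (bigram) model with scaled forward-backward passes; the triclass parameters h and k of the paper are re-estimated analogously. The array layout, the scaling scheme, and the use of numpy are our own assumptions, not the authors' implementation. Note that emission entries set to zero by the dictionary stay zero across iterations:
import numpy as np

def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass over one word sequence.
    obs: list of word ids; pi: (T,) initial tag probabilities;
    A: (T,T) tag-transition probabilities; B: (T,V) word-emission probabilities
    (entries are zero wherever the dictionary forbids a tag for a word)."""
    n, T = len(obs), len(pi)
    alpha = np.zeros((n, T)); beta = np.zeros((n, T)); scale = np.zeros(n)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for i in range(1, n):
        alpha[i] = (alpha[i - 1] @ A) * B[:, obs[i]]
        scale[i] = alpha[i].sum(); alpha[i] /= scale[i]
    beta[n - 1] = 1.0
    for i in range(n - 2, -1, -1):
        beta[i] = (A @ (B[:, obs[i + 1]] * beta[i + 1])) / scale[i + 1]
    gamma = alpha * beta                       # posterior tag probabilities per position
    xi = np.zeros((T, T))                      # expected transition counts
    for i in range(1, n):
        xi += np.outer(alpha[i - 1], B[:, obs[i]] * beta[i]) * A / scale[i]
    return gamma, xi, float(np.log(scale).sum())   # log-likelihood of the sentence

def baum_welch(word_sequences, pi, A, B, iterations=5):
    """ML (Forward-Backward) re-estimation from untagged word sequences.
    Degenerate zero rows in the count matrices are not handled in this sketch."""
    for _ in range(iterations):
        pi_c = np.zeros_like(pi); A_c = np.zeros_like(A); B_c = np.zeros_like(B)
        for obs in word_sequences:
            gamma, xi, _ = forward_backward(obs, pi, A, B)
            pi_c += gamma[0]; A_c += xi
            for i, w in enumerate(obs):
                B_c[:, w] += gamma[i]
        pi = pi_c / pi_c.sum()
        A = A_c / A_c.sum(axis=1, keepdims=True)
        B = B_c / B_c.sum(axis=1, keepdims=True)   # dictionary zeros remain zero
    return pi, A, B
</Paragraph>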
      <Paragraph position="5"> In the first experiment we took the model made up of uniform distributions as the initial one. The only constraints in this model came from the values k(w/t) that were set to zero when the tag t was not possible for the word w (as found in the dictionary). We then ran the FB algorithm and evaluated the quality of the tagging. The results are shown in Figure 1. (Perplexity is a measure of the average branching factor for probabilistic models.) This figure shows that ML training both improves the perplexity of the model and reduces the tagging error rate. However, this error rate remains at a relatively high level, higher than that obtained with RF training on only 100 tagged sentences. Having shown that ML training is able to improve the uniform model, we then wanted to know whether it was also able to improve more accurate models. We therefore took as the initial model each of the models obtained previously by RF training and, for each one, performed ML training using all of the training word sequences. The results are shown graphically in Figure 2 and numerically in Table 2.</Paragraph>
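      <Paragraph> For reference, the perplexity quoted here can be computed from the average per-word log-probability of the word sequences under the model. A small sketch reusing forward_backward from the previous fragment; the logarithm base (natural log) is our assumption, since the paper does not state it at this point:
import numpy as np

def perplexity(word_sequences, pi, A, B):
    """Perplexity of the word sequences under the model: the exponential of the
    average negative log-probability per word."""
    total_logprob, total_words = 0.0, 0
    for obs in word_sequences:
        _, _, loglik = forward_backward(obs, pi, A, B)   # sketch defined above
        total_logprob += loglik
        total_words += len(obs)
    return float(np.exp(-total_logprob / total_words))
</Paragraph>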
      <Paragraph position="6"> These results show that, when little tagged data is used, the model obtained by relative frequency is not very good and Maximum Likelihood training is able to improve it. However, as the amount of tagged data increases, the models obtained by relative frequency become more accurate, and Maximum Likelihood training improves them during the first iterations only, after which it degrades them. If we use more than 5,000 tagged sentences, even the first iteration of ML training degrades the tagging. (This number is of course dependent on both the particular system of tags and the kind of text used in this experiment.)</Paragraph>
      <Paragraph position="7"> These results call for some comments. ML training is a theoretically sound procedure, and one that is routinely and successfully used in speech recognition to estimate the parameters of hidden Markov models that describe the relations between sequences of phonemes and the speech signal. Although ML training is guaranteed to improve perplexity, perplexity is not necessarily related to tagging accuracy, and it is possible to improve one while degrading the other. Also, in the case of tagging,</Paragraph>
      <Paragraph position="9"> [Figure 2: ML training from various initial points (top line corresponds to N=100, bottom line to N=all).]</Paragraph>
      <Paragraph position="10"> the relations between words and tags are much more precise than the relations between phonemes and speech signals (where the correct correspondence is harder to define precisely). Some characteristics of ML training, such as the effect of smoothing probabilities, are probably more suited to speech than to tagging.</Paragraph>
    </Section>
    <Section position="3" start_page="163" end_page="165" type="sub_section">
      <SectionTitle>
7.3 Extra Experiments
</SectionTitle>
      <Paragraph position="0"> Viterbi versus ML tagging For this experiment we considered the initial model built by RF training over the whole training data and all the successive models created by the iterations of ML training. For each of these models we performed Viterbi tagging and ML tagging on the same test data, then evaluated and compared the number of tagging errors produced by these two methods. The results are shown in Table 3.</Paragraph>
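      <Paragraph> A hedged sketch of the two decoders for a biclass model: Viterbi tagging picks the single most probable tag sequence, while ML tagging (as we read the paper's earlier definition) picks, for each word, the tag with maximal posterior probability. forward_backward refers to the sketch given in Section 7.2 above; the names and the log-space formulation are our own:
import numpy as np

def viterbi_tags(obs, pi, A, B):
    """Viterbi tagging: the single most probable tag sequence, computed in log space."""
    n, T = len(obs), len(pi)
    logA = np.log(A + 1e-300); logB = np.log(B + 1e-300)
    delta = np.log(pi + 1e-300) + logB[:, obs[0]]
    back = np.zeros((n, T), dtype=int)
    for i in range(1, n):
        scores = delta[:, None] + logA          # scores[s, t]: end in tag t coming from s
        back[i] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[i]]
    tags = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):               # follow back-pointers
        tags.append(int(back[i][tags[-1]]))
    return tags[::-1]

def ml_tags(obs, pi, A, B):
    """ML tagging as we read it: for each word, the tag with maximal posterior
    probability, obtained from the forward-backward sketch of Section 7.2."""
    gamma, _, _ = forward_backward(obs, pi, A, B)
    return [int(t) for t in gamma.argmax(axis=1)]
</Paragraph>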
      <Paragraph position="1"> The models obtained at different iterations are related, so one should not draw strong conclusions about the definite superiority of one tagging procedure. However, the difference in error rate is very small, and shows that the choice of the tagging procedure is not as critical as the kind of training material.</Paragraph>
      <Paragraph position="2"> Constrained ML training Following a suggestion made by F. Jelinek, we investigated the effect of imposing constraints on the probabilities during ML training. This idea comes from the observation that the amount of training data needed to estimate the model properly increases with the number of free parameters of the model. When little training data is available, adding reasonable constraints on the form of the models being sought reduces the number of free parameters and should improve the quality of the estimates.</Paragraph>
      <Paragraph position="3">  We tried two different constraints: * The first one keeps p(t/w) fixed if w is a frequent word, in our case one of the 1,000 most frequent words. We call it tw-constraint. The rationale is that if w is frequent, the relative frequency provides a good estimate for p(t/w) and the training should not change it.</Paragraph>
      <Paragraph position="4"> * The second one keeps the marginal distribution p(t) constant and is based on a similar reasoning. We call it t-constraint.</Paragraph>
      <Paragraph position="5"> tw-constraint The tw-constrained ML training is similar to the standard ML training, except that the probabilities p(t/w) are not changed at the end of an iteration.</Paragraph>
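      <Paragraph> One possible reading of this constraint, sketched below: after each iteration, select the 1,000 most frequent words and push their frozen relative-frequency p(t/w) back into the emission parameters via Bayes' rule with the current tag marginals. The paper does not spell out this mechanism, so the conversion step and the absence of row re-normalization are our assumptions:
import numpy as np
from collections import Counter

def frequent_words(word_sequences, n=1000):
    """Ids of the n most frequent words in the training text (n = 1,000 in the paper)."""
    counts = Counter(w for sent in word_sequences for w in sent)
    return [w for w, _ in counts.most_common(n)]

def freeze_t_given_w(B, tag_marginal, frozen_cond, frozen_ids):
    """Re-impose the relative-frequency p(t|w) for frequent words after an iteration.
    B is k(w|t) as a (T,V) array, tag_marginal is a current estimate of p(t), and
    frozen_cond[w] is the frozen (T,) vector of p(.|w)."""
    for w in frozen_ids:
        p_w = float((B[:, w] * tag_marginal).sum())        # current p(w) under the model
        B[:, w] = frozen_cond[w] * p_w / np.maximum(tag_marginal, 1e-12)
    return B
</Paragraph>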
      <Paragraph position="6"> The results in Table 4 show the number of tagging errors when the model is trained with standard or tw-constrained ML training. They show that tw-constrained ML training still degrades the model obtained by RF training, but not as quickly as standard ML training does. We have not tested what happens when a smaller training set is used to build the initial model.</Paragraph>
      <Paragraph position="7"> t-constraint This constraint is more difficult to implement than the previous one because the probabilities p(t) are not the parameters of the model, but a combination of these parameters. With the help of R. Polyak we have designed an iterative procedure that allows the likelihood to be improved while preserving the values of p(t). We do not have sufficient space to describe this procedure here. Because of its greater computational complexity, we have only applied it to a biclass model, i.e. a model where p(ti/w1t1 ... wi-1ti-1) = h(ti/ti-1).</Paragraph>
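      <Paragraph> The paper does not define p(t) precisely at this point; for a biclass model one natural reading is the stationary distribution of the transition matrix h, which the t-constraint then holds fixed across iterations. A small sketch under that assumption:
import numpy as np

def tag_marginal(A, iterations=200):
    """Stationary distribution of the tag-transition matrix A (rows sum to 1),
    computed by power iteration; convergence checks are omitted in this sketch."""
    p = np.full(A.shape[0], 1.0 / A.shape[0])
    for _ in range(iterations):
        p = p @ A
        p /= p.sum()
    return p
</Paragraph>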
      <Paragraph position="8"> The initial model is estimated by relative frequency on the whole training data and Viterbi tagging is used.</Paragraph>
      <Paragraph position="9"> As in the previous experiment, the results in Table 5 show the number of tagging errors when the model is trained with standard or t-constrained ML training. They show that t-constrained ML training still degrades the model obtained by RF training, but not as quickly as standard ML training does. Again, we have not tested what happens when a smaller training set is used to build the initial model.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>