<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3235"> <Title>Error Measures and Bayes Decision Rules Revisited with Applications to POS Tagging</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Experimental Results </SectionTitle> <Paragraph position="0"> Of course, many papers on POS tagging with statistical methods have already been published. The goal of our experiments is to compare the two decision rules and to analyze the differences in their performance. As the results for the WSJ corpus will show, both the trigram method and the maximum entropy method achieve a tagging error rate of 3.0% to 3.5% and are thus comparable to the best results reported in the literature, e.g. (Ratnaparkhi, 1996).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Task and Corpus </SectionTitle> <Paragraph position="0"> The experiments are performed on the Wall Street Journal (WSJ) corpus (Table 1) and on the MTP corpus (Table 2), which was compiled at the University of Münster and contains tagged German words from articles of the newspapers Die Zeit and Frankfurter Allgemeine Zeitung (Kinscher and Steiner, 1995).</Paragraph> <Paragraph position="1"> For the corpus statistics, it is helpful to distinguish between the true words and the punctuation marks (see Table 1 and Table 2). This distinction is made for both the text and the POS corpus. In addition, the tables show the vocabulary size (number of different tokens) for the words and for the punctuation marks.</Paragraph> <Paragraph position="2"> Punctuation marks (PMs) are all tokens that contain neither letters nor digits. The total number of running tokens is indicated as Words+PMs.</Paragraph> <Paragraph position="3"> Singletons are the tokens that occur only once in the training data. 
Out-of-Vocabulary words (OOVs) are the words in the test data that did not occur in the training corpus.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 POS Tagging Results </SectionTitle> <Paragraph position="0"> The tagging experiments were performed for both types of models, each combined with both decision rules. The generative model is based on the approach described in (Sündermann and Ney, 2003). Here, the optimal value of the n-gram order is determined from the corpus statistics and has a maximum of n = 7. The experiments for the direct model were performed using the maximum entropy tagger described in (Ratnaparkhi, 1996).</Paragraph> <Paragraph position="1"> The tagging error rates are shown in Table 3 and Table 4. In addition to the overall tagging error rate (Overall), the tables show the tagging error rates for the Out-of-Vocabulary words (OOVs) and for the punctuation marks (PMs).</Paragraph> <Paragraph position="2"> For the generative model, both decision rules yield similar results. For the direct model, the overall tagging error rate increases on each of the two tasks (from 3.0 % to 3.3 % on WSJ and from 5.4 % to 5.6 % on MTP) when we use the symbol decision rule instead of the string decision rule. For OOVs in particular, the error rate increases markedly. We do not yet have a clear explanation for this difference between the generative model and the direct model. It might be related to the 'forward' structure of the direct model as opposed to the 'forward-backward' structure of the generative model. 
However, the refined bootstrap method (Bisani and Ney, 2004) has shown that these differences in the overall tagging error rate are not statistically significant.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Examples </SectionTitle> <Paragraph position="0"> A detailed analysis of the tagging results showed that, for both models, there are sentences where one decision rule performs better and sentences where the other does.</Paragraph> <Paragraph position="1"> For the generative model, these differences seem to occur at random, but for the direct model, some distinct tendencies can be observed. For example, on the WSJ corpus, the string decision rule is significantly better for the present and past tense of verbs (VBP, VBD), while the symbol decision rule is better for adverbs (RB) and past participles (VBN). Typical errors generated by the symbol decision rule are tagging present tense as infinitive (VB) and past tense as past participle (VBN); with the string decision rule, adverbs are often tagged as preposition (IN) or adjective (JJ), and past participles as past tense (VBD).</Paragraph> <Paragraph position="2"> For the German corpus, the string decision rule better handles demonstrative determiners (Rr) and subordinate conjunctions (Cs), whereas the symbol decision rule is better for definite articles (Db). The symbol decision rule typically tags demonstrative determiners as definite articles (Db) and subordinate conjunctions as interrogative adverbs (Bi), while the string decision rule tends to assign the demonstrative determiner tag to definite articles.</Paragraph> <Paragraph position="3"> These typical errors are shown in Table 5 for the symbol decision rule and in Table 6 for the string decision rule.</Paragraph> </Section> </Section> </Paper>