<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1017">
  <Title>Probabilistic and Rule-Based Tagger of an Inflective Language: A Comparison</Title>
  <Section position="4" start_page="112" end_page="113" type="metho">
    <SectionTitle>
2.1.2 CZECH TRAINING DATA
</SectionTitle>
    <Paragraph position="0"> For training, we used the corpus collected during the 1960's and 1970's in the Institute for Czech Language at the Czechoslovak Academy of Sciences.</Paragraph>
    <Paragraph position="1"> The corpus was originally hand-tagged, including the lemmatization and syntactic tags. We had to do some cleaning, which means that we have disregarded the lemmatization information and the syntactic tag, as we were interested in words and tags only. Tags used in this corpus were different from our suggested tags: number of morphological categories was higher in the original sample and the notation was also different. Thus we had to carry out conversions of the original data into the format presented above, which resulted in the so-called Czech &amp;quot;modified&amp;quot; corpus, with the following features:  V~Te used the complete &amp;quot;modified&amp;quot; corpus (621015 tokens) in the experiments No. 1, No. 3, No. 4 and a small part of this corpus in the experiment No. 2, as indicated in Table 2.4.</Paragraph>
    <Paragraph position="2"> tokens 110 874 words 22 530 tags 882 average number of tags per token 2.36 Table 2.4</Paragraph>
    <Section position="1" start_page="112" end_page="112" type="sub_section">
      <SectionTitle>
2.2 ENGLISH EXPERIMENTS
</SectionTitle>
      <Paragraph position="0"> For the tagging of English texts, we used the Penn Treebank tagset which contains 36 POS tags and 12 other tags (for punctuation and the currency symbol). A detailed description is available in (Santorini, 1990).</Paragraph>
      <Paragraph position="1">  For training in the English experiments, we used WSJ (Marcus et al., 1993). We had to change the format of WSJ to prepare it for our tagging software. V~e used a small (100k tokens) part of WSJ in the experiment No. 6 and the complete corpus (1M tokens) in the experiments No. 5, No. 7 and No. 8. Table 2.5 contains the basic characteristics of the training data.</Paragraph>
    </Section>
    <Section position="2" start_page="112" end_page="113" type="sub_section">
      <SectionTitle>
2.3 CZECH VS ENGLISH
</SectionTitle>
      <Paragraph position="0"> Differences between Czech as a morphologically ambiguous inflective language and English as language with poor inflection are also reflected in the number of tag bigrams and tag trigrams. The figures given in Table 2.6 and 2.7 were obtained from the training files.</Paragraph>
      <Paragraph position="1">  It is interesting to note the frequencies of the most ambiguous tokens encountered in the whole &amp;quot;modified&amp;quot; corpus and to compare them with the English data. Table 2.8 and Table 2.9 contain the first tokens with the highest number of possible tags in the complete Czech &amp;quot;modified&amp;quot; corpus and in the  complete WSJ.</Paragraph>
      <Paragraph position="2"> Token Frequency #tags in train, data in train, data jejich 1 087 51 jeho 1 087 46 jeho~ 163 35 jejich~ 150 25 vedoucl 193 22 Table 2.8  In the Czech &amp;quot;modified&amp;quot; corpus, the token &amp;quot;vedouc/&amp;quot; appeared 193 times and was tagged by twenty two different tags: 13 tags for adjective and 9 tags  for noun. The token &amp;quot;vedoucf' means either: &amp;quot;leading&amp;quot; (adjective) or &amp;quot;manager&amp;quot; or &amp;quot;boss&amp;quot; (noun). The following columns represent the tags for the token &amp;quot;vedouc/&amp;quot; and their frequencies in the training data; for example &amp;quot;vedoucf' was tagged twice as adjective, feminine, plural, nominative, first degree, affirmative. null  It is clear from these figures that the two languages in question have quite different properties and that nothing can be said without really going through an experiment.</Paragraph>
    </Section>
    <Section position="3" start_page="113" end_page="113" type="sub_section">
      <SectionTitle>
2.4 THE ALGORITHM
</SectionTitle>
      <Paragraph position="0"> We have used the basic source channel model (described e.g. in (Merialdo, 1992)). The tagging procedure C/ selects a sequence of tags T for the sentence</Paragraph>
      <Paragraph position="2"> Our implementation is based on generating the (W,T) pairs by means of a probabilistic model using approximations of probability distributions Pr(WIT) and Pr(T). The Pr(T) is based on tag bi-grams and trigrams, and Pr(WIT ) is approximated as the product of Pr(wi\[tl). The parameters have been estimated by the usual maximum likelihood training method, i.e. we approximated them as the relative frequencies found in the training data with smoothing based on estimated unigram probability and uniform distributions.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="113" end_page="334" type="metho">
    <SectionTitle>
2.5 THE RESULTS
</SectionTitle>
    <Paragraph position="0"> The results of the Czech experiments are displayed in Table 2.10.</Paragraph>
    <Paragraph position="1"> No. 1 No. 2 No. 3 No. 4  81.14% These results show, not surprisingly, of course, that the more data, the better (results experiments of No.2 vs. No.3), but in order to get better results for a trigram tag prediction model, we would need far more data. Clearly, if 88% trigrams occur four times or less, then the statistics is not reliable. The following tables show a detailed analysis of the errors of the trigram experiment.</Paragraph>
    <Paragraph position="3"> The letters in the first column and row denote POS classes, the interpunction (T) and the &amp;quot;unknown tag&amp;quot; (X). The numbers show how many times the tagger assigned an incorrect POS tag to a token in the test file. The total number of errors was 244. Altogether, fifty times the adjectives (A) were  tagged incorrectly, nouns (N) 93 times, numbers (C) 5 times and etc. (see the last unmarked column in Table 2.11b); to provide a better insight, we should add that in 32 cases, when the adjective was correctly tagged as an adjective, but the mistakes appeared in the assignment of morphological categories (see Table 2.12), 6 times the adjective was tagged as a noun, twice as a pronoun, 3 times as an adverb and so on (see the second row in Table 2.11a). A detailed look at Table 2.12 reveals that for 32 correctly marked adjectives the mistakes was 17 times in gender, once in number, three times in gender and case simultaneously and so on.</Paragraph>
    <Paragraph position="4">  To illustrate the results of our tagging experiments, we present here short examples taken from the test data. Cases of incorrect tag assignment are in boldface.</Paragraph>
    <Paragraph position="5"> -- Czech word\[hand tag exp. exp. exp. exp.</Paragraph>
    <Paragraph position="7"> exp. exp. exp. exp.</Paragraph>
    <Paragraph position="9"/>
  </Section>
  <Section position="6" start_page="334" end_page="334" type="metho">
    <SectionTitle>
2.6 A PROTOTYPE OF RANK XEROX
POS TAGGER FOR CZECH
</SectionTitle>
    <Paragraph position="0"> (Schiller, 1996) describes the general architecture of the tool for noun phrase mark-up based on finite-state techniques and statistical part-of-speech disambiguation for seven European languages. For Czech, we created a prototype of the first step of this process -- the part-of-speech (POS) tagger -using Rank Xerox tools (Tapanainen, 1995), (Cutting et al., 1992).</Paragraph>
    <Paragraph position="1">  The first step of POS tagging is obviously a definition of the POS tags. We performed three ex- null periments. These experiments differ in the POS tagset. During the first experiment we designed tagset which contains 47 tags. The POS tagset can be described as follows:</Paragraph>
  </Section>
  <Section position="7" start_page="334" end_page="334" type="metho">
    <SectionTitle>
2.6.2 RESULTS
</SectionTitle>
    <Paragraph position="0"> Figures representing the results of all experiments are presented in the following table. We have also included the results of English tagging using the same Xerox tools.</Paragraph>
    <Paragraph position="1"> language tags  The results show that the more radical reduction of Czech tags (from 1171 to 34) the higher accuracy of the results and the more comparable are the Czech and English results. However, the difference in the error rate is still more than visible -- here we can speculate that the reason is that Czech is &amp;quot;free&amp;quot; word order language, whereas English is not.</Paragraph>
    <Paragraph position="2">  The analysis of the results of the first experiment showed very high ambiguity between the nominative and accusative cases of nouns, adjectives, pronouns and numerals. That is why we replaced the tags for nominative and accusative of nouns, adjectives, pronouns and numerals by new tags NOUNANA, ADJANA, PRONANA and NUMANA (meaning nominative or accusative, undistinguished). The rest of the tags stayed unchanged. This led 43 POS tags. In the third experiment we deleted the morphological information for nouns and adjectives alltogether. This process resulted in the final 34 POS tags.</Paragraph>
  </Section>
  <Section position="8" start_page="334" end_page="334" type="metho">
    <SectionTitle>
3 A RULE-BASED EXPERIMENT
FOR CZECH
</SectionTitle>
    <Paragraph position="0"> A simple rule-based part of speech (RBPOS) tagger is introduced in (Brill, 1992). The accuracy of this tagger for English is comparable to a stochastic English POS tagger. From our point of view, it is very interesting to compare the results of Czech stochastic POS (SPOS) tagger and a modified RBPOS tagger for Czech.</Paragraph>
    <Section position="1" start_page="334" end_page="334" type="sub_section">
      <SectionTitle>
3.1 TRAINING DATA
</SectionTitle>
      <Paragraph position="0"> We used the same corpus used in the case of the SPOS tagger for Czech. RBPOS requires different input format; we thus converted the whole corpus into this format, preserving the original contents.</Paragraph>
    </Section>
    <Section position="2" start_page="334" end_page="334" type="sub_section">
      <SectionTitle>
3.2 LEARNING
</SectionTitle>
      <Paragraph position="0"> It is an obvious fact that the Czech tagset is totally different from the English tagset. Therefore, we had to modify the method for the initial guess. For Czech the algorithm is: &amp;quot;If the word is W_SB (sentence boundary) assign the tag T_SB, otherwise assign the tag NNSI.&amp;quot;</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="334" end_page="334" type="metho">
    <SectionTitle>
3.2.1 LEARNING RULES TO PREDICT
THE MOST LIKELY TAG FOR
UNKNOWN WORDS
</SectionTitle>
    <Paragraph position="0"> The first stage of training is learning rules to predict the most likely tag for unknown words.</Paragraph>
    <Paragraph position="1"> These rules operate on word types; for example, if  a word ends by &amp;quot;d37;, it is probably a masculine adjective. To compare the influence of the size of the training files on the accuracy of the tagger we performed two subexperiments4:  We present here an example of rules taken from LEXRULEOUTFILE from the exp. No. 1: u hassuf 1 NIS2 # change the tag to NIS2 if the suffix is &amp;quot;u&amp;quot; y hassuf 1 NFS2 # change the tag to NFS2 if the suffix is &amp;quot;y&amp;quot; ho hassuf 2 AIS21A # change the tag to AIS21A if the suffix is &amp;quot;ho&amp;quot; PSch hassuf 3 NFP6 # change the tag to NFP6 if the suffix is &amp;quot;PSch&amp;quot; nej addpref 3 O2A # change the tag to O2A if adding the prefix &amp;quot;nej&amp;quot; results in a word</Paragraph>
  </Section>
  <Section position="10" start_page="334" end_page="334" type="metho">
    <SectionTitle>
3.2.2 LEARNING CONTEXTUAL CUES
</SectionTitle>
    <Paragraph position="0"> The second stage of training is learning rules to improve tagging accuracy based on contextual cues.</Paragraph>
    <Paragraph position="1"> These rules operate on individual word tokens.</Paragraph>
    <Paragraph position="2"> 4We use the same names of files and variables as Eric Brill in the rule-based POS tagger's documentation. TAGGED-CORPUS -- manually tagged training corpus, UNTAGGED-CORPUS -- collection of all untagged texts, LEXRULEOUTFILE -- the list of transformations to determine the most likely tag for unknown words, TAGGED-CORPUS-2 -- manually tagged training corpus, TAGGED-CORPUS-ENTIRE -- Czech &amp;quot;modified&amp;quot; corpus (the entire manually tagged corpus), CONTEXT-RULEFILE -- the list of transformations to improve accuracy based on contextual cues. No. 1 No. 2</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML