<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1008">
  <Title>Tagging accurately - Don't guess if you know</Title>
  <Section position="3" start_page="47" end_page="47" type="metho">
    <SectionTitle>
2 The taggers in outline
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="47" end_page="47" type="sub_section">
      <SectionTitle>
2.1 English Constraint Grammar Parser
</SectionTitle>
      <Paragraph position="0"> The English Constraint Grammar Parser, ENGCG (Voutilainen et al., 1992; Karlsson et al., 1994), is based on Constraint Grammar, a parsing framework proposed by Fred Karlsson (1990). It was developed in 1989-1993 at the Research Unit for Computational Linguistics, University of Helsinki, by Atro Voutilainen, Juha Heikkilä and Arto Anttila; later on, Timo Järvinen has extended the syntactic description, and Pasi Tapanainen has made a new, fast implementation of the CG parsing program. ENGCG is primarily designed for the analysis of standard written English of the British and American varieties.</Paragraph>
      <Paragraph position="1"> In the development and testing of the system, over 100 million words of running text have been used.</Paragraph>
      <Paragraph position="2"> The ENGTWOL lexicon is based on the two-level model (Koskenniemi, 1983). The lexicon contains over 80,000 lexical entries, each of which represents all inflected and central derived forms of the lexemes.</Paragraph>
      <Paragraph position="3"> The lexicon also employs a collection of tags for part of speech, inflection, derivation and even syntactic category (e.g. verb classification).</Paragraph>
      <Paragraph position="4"> Usually less than 5 % of all word-form tokens in running text are not recognised by the morphological analyser. Therefore the system employs a rule-based heuristic module that provides all unknown words with one or more readings. About 99.5 % of words not recognised by the ENGTWOL analyser itself get a correct analysis from the heuristic module. The module contains a list of prefixes and suffixes, and possible analyses for matching words. For instance, words beginning with un... and ending in ...al are marked as adjectives.</Paragraph>
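The prefix/suffix heuristic described above can be sketched as follows. This is a minimal illustration, not the actual ENGTWOL module: the pattern list, the tag strings, and the open-class fallback are all assumptions for the example.

```python
# Minimal sketch of a prefix/suffix heuristic for unknown words, in the
# spirit of the ENGTWOL add-on module described above. The patterns and
# tags below are illustrative, not the module's actual rules.
HEURISTICS = [
    # (prefix, suffix, proposed readings)
    ("un", "al",   ["A"]),          # e.g. "unconditional" -> adjective
    ("",   "ness", ["N NOM SG"]),   # e.g. "fuzziness" -> noun
    ("",   "ly",   ["ADV"]),        # e.g. "swiftly" -> adverb
]

def guess_readings(word):
    """Return candidate readings for a word unknown to the lexicon."""
    readings = []
    for prefix, suffix, tags in HEURISTICS:
        if word.startswith(prefix) and word.endswith(suffix):
            readings.extend(tags)
    # fall back to an open-class guess if no pattern matched
    return readings or ["N NOM SG", "V", "A"]
```

Because the heuristic may match several patterns, it naturally produces one or more readings per unknown word, as the text describes.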
      <Paragraph position="5"> The grammar for morphological disambiguation (Voutilainen, 1994) is based on 23 linguistic generalisations about the form and function of essentially syntactic constructions, e.g. the form of the noun phrase, prepositional phrase, and finite verb chain. These generalisations are expressed as 1,100 highly reliable 'grammar-based' constraints and some 200 less reliable add-on 'heuristic' constraints, usually in a partial and negative fashion. Using the 1,100 best constraints results in a somewhat ambiguous output: usually there are about 1.04-1.07 morphological analyses per word. Usually at least 997 words out of every thousand retain the contextually appropriate morphological reading, i.e. the recall is usually at least 99.7 %. If the heuristic constraints are also used, the ambiguity rate falls to 1.02-1.04 readings per word, with an overall recall of about 99.5 %. This accuracy compares very favourably with results reported in (de Marcken, 1990; Weischedel et al., 1993; Kempe, 1994); for instance, to reach a recall of 99.3 %, the system of (Weischedel et al., 1993) has to leave as many as three readings per word in its output.</Paragraph>
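The two quality measures used above combine as in this small worked example; the counts are illustrative numbers consistent with the reported ranges, not measured data.

```python
# Worked example of the two measures quoted above for a 1,000-word text.
# The counts are illustrative, chosen to fall within the reported ranges.
words = 1000
readings_left = 1050     # total readings left in the output
correct_kept = 997       # words whose contextually correct reading survives

ambiguity_rate = readings_left / words   # readings per word -> 1.05
recall = correct_kept / words            # fraction analysed correctly -> 0.997
```

The point of the comparison in the text is that both numbers matter at once: leaving more readings per word makes a high recall easier to achieve.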
    </Section>
    <Section position="2" start_page="47" end_page="47" type="sub_section">
      <SectionTitle>
2.2 Xerox Tagger
</SectionTitle>
      <Paragraph position="0"> The Xerox Tagger 1, XT, (Cutting et al., 1992) is a statistical tagger made by Doug Cutting, Julian Kupiec, Jan Pedersen and Penelope Sibun at Xerox PARC. It was trained on the untagged Brown Corpus (Francis and Kučera, 1982).</Paragraph>
      <Paragraph position="1"> The lexicon is a word-list of 50,000 words with alternative tags. Unknown words are analysed according to their suffixes. The lexicon and suffix tables are implemented as tries. For instance, for the word live there are the following alternative analyses: JJ (adjective) and VB (uninflected verb). Unknown words not recognised by suffix tables get all tags from a specific set (called open-class).</Paragraph>
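A suffix table stored as a trie, as described above, can be sketched like this. The implementation details (reversed insertion, the `"$tags"` sentinel key, the example suffixes and Brown-style tags) are assumptions for illustration, not XT's actual data structures.

```python
# Minimal sketch of a suffix table stored as a trie, as described for XT.
# Suffixes are inserted reversed so lookup walks from the end of the word.
# The suffix/tag pairs are illustrative, not XT's actual tables.
def insert_suffix(trie, suffix, tags):
    node = trie
    for ch in reversed(suffix):
        node = node.setdefault(ch, {})
    node["$tags"] = tags   # sentinel key marking the end of a suffix

def lookup(trie, word, open_class):
    """Walk the trie from the end of the word and return the tags of the
    longest matching suffix, or the open-class set if nothing matches."""
    node, best = trie, None
    for ch in reversed(word):
        if ch not in node:
            break
        node = node[ch]
        if "$tags" in node:
            best = node["$tags"]
    return best if best is not None else open_class

trie = {}
insert_suffix(trie, "ing", ["VBG"])
insert_suffix(trie, "s",   ["NNS", "VBZ"])
```

An unknown word like live-style ambiguity then falls out of the table: a word ending in "s" gets both a plural-noun and a verb reading, and a word matching no suffix gets the whole open-class set, as the text describes.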
      <Paragraph position="2"> The tagger itself is based on the Hidden Markov Model (Baum, 1972) and word equivalence classes (Kupiec, 1989). Although the tagger is trained with the untagged Brown corpus, there are several ways to 'force' it to learn.</Paragraph>
      <Paragraph position="3"> * The symbol biases represent a kind of lexical probabilities for given word equivalence classes.</Paragraph>
      <Paragraph position="4"> * The transition biases can be used for saying that it is likely or unlikely that a tag is followed by some specific tag. The biases serve as default values for the Hidden Markov Model before the training.</Paragraph>
      <Paragraph position="5"> * Some rare readings may be removed from the lexicon to prevent the tagger from selecting them.</Paragraph>
      <Paragraph position="6"> * There are some training parameters, like the number of iterations (how many times the same block of text is used in training) and the size of the block of the text used for training.</Paragraph>
      <Paragraph position="7"> * The choice of the training corpus affects the result. The tagger is reported (Cutting et al., 1992) to have better than 96 % accuracy in the analysis of parts of the Brown Corpus. The accuracy is similar to that of other probabilistic taggers.</Paragraph>
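The role of the transition biases listed above can be sketched as seeding an HMM's transition matrix before unsupervised training. The tag set, bias values, and multiplicative-weight scheme below are assumptions for illustration, not XT's actual configuration.

```python
# Sketch of transition biases as default values for an HMM's transition
# matrix before training. Tags and bias weights are illustrative.
tags = ["DT", "NN", "VB"]

# start from uniform transition probabilities
A = {src: {dst: 1.0 / len(tags) for dst in tags} for src in tags}

# bias: a determiner is very likely followed by a noun, unlikely by a verb
biases = {("DT", "NN"): 5.0, ("DT", "VB"): 0.1}
for (src, dst), weight in biases.items():
    A[src][dst] *= weight

# renormalise each row so it remains a probability distribution
for src in tags:
    total = sum(A[src].values())
    for dst in tags:
        A[src][dst] /= total
```

Baum-Welch training would then start from this biased matrix rather than a uniform one, which is how the biases "force" learning in a particular direction without a tagged corpus.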
    </Section>
  </Section>
  <Section position="4" start_page="47" end_page="48" type="metho">
    <SectionTitle>
3 Grammatical representations of the taggers
</SectionTitle>
    <Paragraph position="0"> A major difference between a knowledge-based and a probabilistic tagger is that the knowledge-based tagger needs as much information as possible, while the probabilistic tagger requires a compact set of tags that does not make too many distinctions between similar words. The difference can be seen by comparing the Brown Corpus tag set (used by XT) with the ENGCG tag set.</Paragraph>
    <Paragraph position="1"> The ENGTWOL morphological analyser employs 139 tags. Each word usually receives several tags (see Figure 1). There are also 'auxiliary' tags for derivational and syntactic information that do not</Paragraph>
    <Paragraph position="3"> additional information for rules. If these auxiliary tags are ignored, the morphological analyser produces about 180 different tag combinations.</Paragraph>
    <Paragraph position="4"> The XT lexicon contains 94 tags for words; 15 of them are assigned unambiguously to only one word.</Paragraph>
    <Paragraph position="5"> There are 32 verb tags: 8 for have, 13 for be, 6 for do, and 5 for all other verbs. ENGCG does not distinguish in its tagset between the words have, be, do and other verbs. Figure 1 illustrates the difference from ENGCG.</Paragraph>
    <Paragraph position="6"> The ENGCG description differs from the Brown Corpus tag set in the following respects. ENGCG is more distinctive in that a part of speech distinction is spelled out (see Figure 2) in the description of  represented as ambiguous due to the subjunctive, imperative, infinitive and present tense readings.</Paragraph>
    <Paragraph position="7"> On the other hand, ENGCG does not spell out part-of-speech ambiguity in the description of  meanings of the adjective and noun readings are similar, * ambiguities due to proper nouns, common nouns and abbreviations.</Paragraph>
  </Section>
  <Section position="5" start_page="48" end_page="49" type="metho">
    <SectionTitle>
4 Combining the taggers
</SectionTitle>
    <Paragraph position="0"> In our approach we apply ENGCG and XT independently. Combining the taggers means aligning the outputs of the taggers and transforming the result of one tagger to that of the other.</Paragraph>
    <Paragraph position="1"> Aligning the output is straightforward: we only need to match the word forms in the output of the taggers. Some minor problems occur when tokenisation is done differently. For instance, XT handles words like aren't as a single token, whereas ENGCG divides it into two tokens, are and not. ENGCG also recognises some multi-word phrases like in spite of as one token, while XT handles it as three tokens.</Paragraph>
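The alignment with its two tokenisation mismatches can be sketched as below. The split/merge tables, the fixed three-token merge lookup, and the span-pair output format are assumptions for the example, not the system's actual alignment code.

```python
# Sketch of aligning the two taggers' token streams, handling the mismatches
# described above: XT keeps "aren't" as one token where ENGCG splits it, and
# ENGCG keeps "in spite of" as one token where XT splits it. The tables are
# illustrative.
SPLITS = {"aren't": ["are", "not"]}              # one XT token -> several ENGCG tokens
MERGES = {("in", "spite", "of"): "in spite of"}  # several XT tokens -> one ENGCG token

def align(xt_tokens, engcg_tokens):
    """Return (xt_span, engcg_span) pairs covering both streams."""
    i = j = 0
    pairs = []
    while i < len(xt_tokens) and j < len(engcg_tokens):
        xt, cg = xt_tokens[i], engcg_tokens[j]
        if xt == cg:
            pairs.append(([xt], [cg])); i += 1; j += 1
        elif xt in SPLITS:
            n = len(SPLITS[xt])
            pairs.append(([xt], engcg_tokens[j:j + n])); i += 1; j += n
        elif tuple(xt_tokens[i:i + 3]) in MERGES:
            pairs.append((xt_tokens[i:i + 3], [cg])); i += 3; j += 1
        else:
            raise ValueError(f"cannot align {xt!r} with {cg!r}")
    return pairs
```

Once the spans are paired, each ENGCG token has a corresponding XT tag (or span of tags) whose prediction can be consulted.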
    <Paragraph position="2"> We do not need mappings in both directions, from Brown tags to ENGCG tags and vice versa. It is enough either to transform the ENGCG tags to Brown tags and select the tag that XT has produced, or to transform the tag of XT into ENGCG tags. We do the latter because the ENGCG tags contain more information, which is likely to be desirable in the design of potential applications.</Paragraph>
    <Paragraph position="3"> There are a couple of problems in mapping: * Difference in distinctiveness. Sometimes ENGTWOL makes a distinction not made by the Brown tagset; sometimes the Brown tagset makes a distinction not made by ENGTWOL (see Figure 2).</Paragraph>
    <Paragraph position="4"> * Sometimes tags are used in a different way. A case in point is the word as. In a sample of 76 instances of as from the tagged Brown Corpus, 73 are analysed as CS, two as QL and one as IN, while in the ENGCG description the same instances of as were analysed 15 times as CS, four times as ADV, and 57 times as PREP.</Paragraph>
    <Paragraph position="5"> In ENGCG, the tag CS represents subordinating conjunctions. In the following sentences, the correct ENGCG analysis for the word as is PREP, not the CS that the Brown Corpus suggests. The city purchasing department, the jury said, is lacking in experienced clerical personnel as(CS) a result of city personnel policies. -- The petition listed the mayor's occupation as(CS) attorney and his age as(CS) 71.</Paragraph>
    <Paragraph position="6"> It listed his wife's age as(CS) 74 and place of birth as(CS) Opelika, Ala.</Paragraph>
    <Paragraph position="7"> These are the first three sentences in which the word as appears in the Brown Corpus. In the Brown Corpus, as appears over 7,000 times and is the fourteenth most common word. Because XT is trained on the Brown Corpus, this difference in usage is likely to cause problems.</Paragraph>
    <Paragraph position="8"> XT is applied independently to the text, and the tagger's prediction is consulted in the analysis of those words where ENGCG is unable to make a unique prediction. The system selects the ENGCG morphological reading that most closely corresponds to the tag proposed by XT.</Paragraph>
    <Paragraph position="9"> The mapping scheme is the following. For each Brown Corpus tag, there is a decision list for possible ENGCG tags, the most probable one first. We have computed the decision list from the part of Brown Corpus that is also manually tagged according to the ENGCG grammatical representation. The mapping can be used in two different ways.</Paragraph>
    <Paragraph position="10"> * Careful mode: An ambiguous reading in the output of ENGCG may be removed only when it is not in the decision list. In practice this leaves quite a lot of ambiguity.</Paragraph>
    <Paragraph position="11"> * Unambiguous mode: Select the reading in the output of ENGCG that comes first in the decision list 2.</Paragraph>
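The two mapping modes above can be sketched as follows. The decision lists here are illustrative stand-ins, not the ones actually computed from the ENGCG-tagged portion of the Brown Corpus, and the ENGCG reading strings are simplified.

```python
# Sketch of the two mapping modes described above. Each Brown tag maps to a
# decision list of ENGCG readings, most probable first. The lists are
# illustrative, not the ones computed from the Brown Corpus.
DECISION_LISTS = {
    "VB": ["V IMP", "V INF", "V PRES"],
    "NN": ["N NOM SG"],
}

def careful(engcg_readings, brown_tag):
    """Careful mode: remove a reading only if it is absent from the
    decision list; this may leave several readings."""
    allowed = set(DECISION_LISTS.get(brown_tag, []))
    kept = [r for r in engcg_readings if r in allowed]
    return kept or engcg_readings   # never remove the last reading

def unambiguous(engcg_readings, brown_tag):
    """Unambiguous mode: keep only the surviving reading that comes
    first in the decision list."""
    for candidate in DECISION_LISTS.get(brown_tag, []):
        if candidate in engcg_readings:
            return [candidate]
    return engcg_readings
```

The trade-off in the text falls out directly: careful mode never commits beyond what the decision list licenses, while unambiguous mode always returns a single reading when any list entry survives.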
  </Section>
</Paper>