<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0101">
  <Title>Using word class for Part-of-speech disambiguation</Title>
  <Section position="4" start_page="0" end_page="2" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In the part-of-speech hterature, whether taggers are based on a rule-based approach (Klein and Simmons, 1963), (Brill, 1992), (Voutilainen, 1993), or on a statistical one (Bahl and Mercer, 1976), (Leech et al., 1983), (Merialdo, 1994), (DeRose, 1988), (Church, 1989), (Cutting et al., 1992), there is a debate as to whether more attention should be paid to lexical probabilities rather than contextual ones. (Church, 1992) claims that part-of-speech taggers depend almost exclusively on lexical probabilities, whereas other researchers, such as Voutilainen (Karlsson et al., 1995) argue that word ambiguities vary widely in function of the specific text and genre. Indeed, part of Church's argument is relevant if a system is based on a large corpus such as the Brown corpus (Francis and Ku~era, 1982) which represents one million surface forms of morpho-syntacticaJly disambiguated words from a range of balanced texts. Consider, for example, a word like &amp;quot;cover&amp;quot; as discussed by Voutilainen (Karlsson et al., 1995): in the Brown and the LOB Corpus (Johansson, 1980), the word &amp;quot;cover&amp;quot; is a noun 40% of the occurrences and a verb 60% of the other, but in the context of a car maintenance manual, it is a noun 100~0 of the time. Since, for statistical taggers, 90% of texts can be disambiguated solely applying lexical probabilities, it is, in fact, tempting to think that with more data and more accurate lexical estimates, more text could be better disambiguated. If this hypothesis is true for English, we show that it does not hold for languages for which publicly available tagged corpora do not exist. We also argue against Church's position, supporting the claim that more attention needs to be paid to contextual information for part-of-speech disambiguation (Tzoukermann et ai., 1995).</Paragraph>
    <Paragraph position="1"> The problem tackled here is to develop an &amp;quot;efficient&amp;quot; training corpus. Unless large effort, money, and time are devoted to this project, only small corpora can be disambiguated manually.</Paragraph>
    <Paragraph position="2"> Consequently, the problem of extracting lexical probabilities from a small training corpus is twofold: first, the statistical model may not necessarily represent the use of a particular word in a particular context. In a morphologically inflected language, this argument is particularly serious since a word can be tagged with a large number of parts of speech, i.e. the ambiguity potential is high. Second, word ambiguity may vary widely depending on the particular genre of the text, and this could differ from the training corpus. When there is no equivalent for the Brown corpus in French, how should one build an adequate training corpus which reflects properly lexical probabilities? How can the numerous morphological variants that render this task even harder be handled? The next section gives examples from French and describes how morphology affects part-of-speech disambiguation and what types of ambiguities are found in the language. Section 3 examines different techniques used to obtain lexical probabilities. Given the problems created by estimating probabilities on a corpus of restricted size, we present in Section 4 a solution for coping with these difficulties. We suggest a new paradigm called genotype, derived from the concept of ambiguity class (Kupiec, 1992), which gives a more efficient representation of the data in order to achieve more accuracy in part-of-speech disambiguation. Section 5 shows how our approach differs from the approach taken by Cutting and Kupiec. The frequencies of unigram, bigram, and trigram genotypes are computed in order to further refine the disambiguation and results are provided to support our claims. The final section offers a methodology for developing an adequate training corpus.</Paragraph>
    <Paragraph position="3"> 2 French words and morphological variants To illustrate our position, we consider the case of French, a typical Romance language. French has a rich morphological system for verbs - which can have as many as 48 inflected forms - and a less rich inflectional system for nouns and adjectives, the latter varying in gender and number having up to four different forms. For example, the word &amp;quot;marine&amp;quot; shown in Table 1, can have as many as eight morphological analyses.</Paragraph>
    <Paragraph position="4"> word base form morphological analysis  verb, 1st person, singular, present, indicative vlspi verb, 1st person, singular, present, subjunctive vlsps verb, 2nd person, singular, present, imperative v2spm verb, 3rd person, singular, present, indicative v3spi verb, 3rd person, singular, present, subjunctive v3sps Table 1: Morphological analyses of the word &amp;quot;marine&amp;quot;.</Paragraph>
    <Paragraph position="5"> The same word &amp;quot;marine&amp;quot;, inflected in all forms of the three syntactic categories (adjective, noun, and verb) would have 56 morphologically distinct forms, i.e. 4 for the adjective, 2 for  each of the nouns, and 48 for the verb. At the same time, if we collapse the homographs, these 56 morphologically distinct forms get reduced to 37 homographically distinct forms and the ambiguity lies in the 19 forms which overlap across internal verb categories, but also across nouns and adjectives. Table 1 shows 5 verb ambiguities, 2 noun ambiguities, a total of 8 homographs including the adjective form.</Paragraph>
    <Paragraph position="6"> Part-of-speech Ambiguity of French words. Once morphological analysis is completed, ambiguity of words is computed in order to locate the difficulties. Figure 1 shows two corpora of different sizes and the number of words each tag contains. The figure clearly exhibits that even though Corpus 2 is twice as large as Corpus 1, the distribution of words per tags is very similar, i.e. more than 50% of the words have only one tag and are thus unambiguous, 25% of the words have two tags, 11% of the words have three tags, and about 5% of the words have from four to eight tags.</Paragraph>
  </Section>
class="xml-element"></Paper>