<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1080">
  <Title>Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset</Title>
  <Section position="3" start_page="483" end_page="484" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="483" end_page="483" type="sub_section">
      <SectionTitle>
1.1 Orthogonality of morphological categories of inflective languages
</SectionTitle>
      <Paragraph position="0"> The major obstacle in morphological¹ tagging of highly inflective languages, such as Czech or Russian, is - given the resources possibly available - the tagset size. Typically, it is on the order of thousands. This is due to the (partial) &quot;orthogonality&quot;² of simple morphological categories, which then multiply when creating a &quot;flat&quot; list of tags. However, the individual categories contain only a very small number of different values; e.g., number has five (Sg, Pl, Dual, Any, and &quot;not applicable&quot;), case nine, etc. The &quot;orthogonality&quot; should not be taken to mean complete independence, though. Inflectional languages (as opposed to agglutinative languages such as Finnish or Hungarian) typically combine several categories into one morpheme (suffix or ending). At the same time, the morphemes display a high degree of ambiguity, even across major POS categories.</Paragraph>
      <Paragraph position="1"> For example, most Czech nouns can form singular and plural forms in all seven cases, and most adjectives can (at least potentially) form all (4) genders, both numbers, all (7) cases, all (3) degrees of comparison, and can be of either positive or negative polarity. That gives 4 × 2 × 7 × 3 × 2 = 336 possibilities (for adjectives), many of them homonymous on the surface. On the other hand, pronouns and numerals do not display such an orthogonality, and even adjectives are not fully orthogonal - an ancient &quot;dual&quot; number, happily living in modern Czech in the feminine, plural and instrumental case, adds another 6 sub-orthogonal possibilities to almost every adjective. Together, we employ 3127 plausible combinations (including style and diachronic variants).</Paragraph>
      <Paragraph position="2"> ¹ This type of tagging is sometimes called morpho-syntactic tagging. However, to stress that we are not dealing with syntactic categories such as Object or Attribute (but rather with morphological categories such as Number or Case), we will use the term &quot;morphological&quot; here.</Paragraph>
      <Paragraph position="3"> ² By orthogonality we mean that all combinations of values of two (or more) categories are systematically possible, i.e. that every member of the Cartesian product of the two (or more) sets of values does appear in the language.</Paragraph>
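The figure of 336 adjective combinations quoted in this subsection can be checked by enumerating the Cartesian product of the category value sets. The following is a minimal sketch, not the authors' tooling: only the set sizes (4 genders, 2 numbers, 7 cases, 3 degrees, 2 polarities) come from the text, while the value labels and the names ADJECTIVE_CATEGORIES and flat_tags are illustrative assumptions.

```python
from itertools import product

# Only the sizes of these value sets (4, 2, 7, 3, 2) come from the text;
# the labels themselves are placeholders chosen for illustration.
ADJECTIVE_CATEGORIES = {
    "gender": ["g1", "g2", "g3", "g4"],
    "number": ["Sg", "Pl"],
    "case": [str(c) for c in range(1, 8)],
    "degree": ["1", "2", "3"],
    "polarity": ["affirmative", "negative"],
}


def flat_tags(categories):
    """Enumerate every combination of category values (a "flat" tag list)."""
    names = list(categories)
    return [dict(zip(names, values)) for values in product(*categories.values())]


tags = flat_tags(ADJECTIVE_CATEGORIES)
print(len(tags))  # 4 * 2 * 7 * 3 * 2 = 336
```

The sketch deliberately ignores the additional dual forms and the non-orthogonal behaviour of pronouns and numerals described above, so it reproduces only the fully orthogonal part of the count.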
    </Section>
    <Section position="2" start_page="483" end_page="483" type="sub_section">
      <SectionTitle>
1.2 The individual categories
</SectionTitle>
      <Paragraph position="0"> There are 13 morphological categories currently used for morphological tagging of Czech: part of speech, detailed POS (called &quot;subpart of speech&quot;), gender, number, case, possessor's gender, possessor's number, person, tense, degree of comparison, negativeness (affirmative/negative), voice (active/passive), and variant/register.</Paragraph>
      <Paragraph position="1"> The POS category contains only the major part of speech values (noun (N), verb (V), adjective (A), pronoun (P), adverb (D), numeral (C), preposition (R), conjunction (J), interjection (I), particle (T), punctuation (Z), and &quot;undefined&quot; (X)). The &quot;subpart of speech&quot; (SUBPOS) contains details about the major category and has 75 different values.</Paragraph>
      <Paragraph position="2"> For example, verbs (POS: V) are divided into simple finite form in present or future tense (B), conditional (c), infinitive (f), imperative (i), etc.³ All the categories vary in their size as well as in their unigram entropy (see Table 1), computed using the standard entropy definition</Paragraph>
      <Paragraph position="3"> $$ H_p = -\sum_{y \in Y} p(y) \log p(y) \qquad (1) $$</Paragraph>
      <Paragraph position="4"> where p is the unigram distribution estimate based on the training data, and Y is the set of possible values of the category in question. This formula can be rewritten as</Paragraph>
      <Paragraph position="5"> $$ H_p = -\frac{1}{|D|} \sum_{i=1}^{|D|} \log p(y_i) \qquad (2) $$</Paragraph>
      <Paragraph position="6"> where p is the unigram distribution, D is the data and |D| its size, and y_i is the value of the category in question at the i-th event (or position) in the data. The form (2) is usually used for cross-entropy computation on data (such as test data) different from those used for estimating p. The base of the log function is always taken to be 2.</Paragraph>
    </Section>
    <Section position="3" start_page="483" end_page="484" type="sub_section">
      <SectionTitle>
1.3 The morphological analyzer
</SectionTitle>
      <Paragraph position="0"> Given the nature of inflectional languages, which can generate many (sometimes thousands of) forms for a given lemma (or &quot;dictionary entry&quot;), it is necessary to employ morphological analysis before the tagging proper. In Czech, there are as many as 5 different lemmas (not counting underlying derivations or word senses) and up to 108 different tags for an input word form. The morphological analyzer used for this purpose (Hajič, in prep.), (Hajič, 1994) covers about 98% of running unrestricted text (newspaper, magazines, novels, etc.). It is based on a lexicon containing about 228,000 lemmas and it can analyze about 20,000,000 word forms.</Paragraph>
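The role of the analyzer, mapping each input word form to its set of candidate lemma and tag pairs before tagging, can be pictured with a toy lookup. This is only a sketch and not the analyzer cited above: TOY_LEXICON, analyze, coverage, and the lemma identifiers are all hypothetical, and the single entry merely reflects that Czech "stát" can be a noun or a verb infinitive (SUBPOS f).

```python
# Hypothetical toy lexicon: word form -> candidate (lemma, POS, SUBPOS)
# analyses. None marks a SUBPOS value this sketch does not specify.
TOY_LEXICON = {
    "stát": [("stát_noun", "N", None), ("stát_verb", "V", "f")],
}


def analyze(form, lexicon):
    """Return all candidate analyses for a word form (empty if unknown)."""
    return lexicon.get(form.lower(), [])


def coverage(tokens, lexicon):
    """Fraction of tokens that receive at least one analysis."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if analyze(token, lexicon))
    return known / len(tokens)


candidates = analyze("Stát", TOY_LEXICON)
print(len(candidates))                        # 2
print(coverage(["stát", "xyzzy"], TOY_LEXICON))  # 0.5
```

A tagger in this setting chooses among the returned candidates rather than among the whole tagset, which is why ambiguity per word form (up to 5 lemmas and 108 tags in the text) matters more than the raw tagset size.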
    </Section>
  </Section>
</Paper>