<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0744">
  <Title>Recognition and Tagging of Compound Verb Groups in Czech</Title>
  <Section position="3" start_page="219" end_page="219" type="intro">
    <SectionTitle>
2 Data Source
</SectionTitle>
    <Paragraph position="0"> DESAM (Pala et al., 1997), the annotated and fully disambiguated corpus of Czech newspaper texts, was used as the source of learning data. It contains more than 1 000 000 word positions and about 130 000 different word forms, about 65 000 of which occur more than once, and 1 665 different tags. For example, in Tab. 1 the tag k5eApFnStMmPaP of the word zúčastnila (participated) means: part of speech (k) = verb (5), person (p) = feminine (F), number (n) = singular (S) and tense (t) = past (M). Lemmata and possible tags are prefixed by &lt;l&gt; and &lt;t&gt;, respectively. As pointed out in (Pala et al., 1997; Popelínský et al., 1999), DESAM is not large enough: it does not yet contain a representative set of Czech sentences. In addition, some words are tagged incorrectly and about 1/5 of the positions are untagged.</Paragraph>
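    <Paragraph position="1"> The attribute-value reading of the example tag can be illustrated with a minimal Python sketch. It assumes that a tag is a sequence of two-character pairs, a lowercase attribute letter followed by a one-character value; this matches the example tag (read with part of speech k = 5, as the text states) but is not confirmed here for the whole DESAM tagset. The function name decode_tag is hypothetical and not part of any DESAM tool.

```python
def decode_tag(tag):
    """Split a tag into (attribute, value) pairs, two characters per pair.

    Assumption: every attribute is one lowercase letter and every value
    is one character, as in the example tag from the text.
    """
    if len(tag) % 2 != 0:
        raise ValueError("tag length must be even: " + repr(tag))
    pairs = []
    for i in range(0, len(tag), 2):
        attribute, value = tag[i], tag[i + 1]
        if not attribute.islower():
            raise ValueError("expected a lowercase attribute letter, got " + repr(attribute))
        pairs.append((attribute, value))
    return pairs


# Example tag from the text, read with part of speech k = 5.
features = dict(decode_tag("k5eApFnStMmPaP"))
print(features["k"], features["n"], features["t"])  # prints: 5 S M
```

Under these assumptions, the values for k, n and t come out as 5, S and M, which agrees with the reading verb, singular, past given in the text.</Paragraph>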
  </Section>
</Paper>