<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0744"> <Title>Recognition and Tagging of Compound Verb Groups in Czech</Title> <Section position="3" start_page="219" end_page="219" type="intro"> <SectionTitle> 2 Data Source </SectionTitle> <Paragraph position="0"> DESAM (Pala et al., 1997), the annotated and fully disambiguated corpus of Czech newspaper texts, has been used as the source of learning data. It contains more than 1 000 000 word positions and about 130 000 different word forms, about 65 000 of which occur more than once, and 1 665 different tags. For example, in Tab. 1 the tag k5eApFnStMmPaP of the word zúčastnila (participated) means: part of speech (k) = verb (5), person (p) = feminine (F), number (n) = singular (S) and tense (t) = past (M). Lemmata and possible tags are prefixed by &lt;l&gt; and &lt;t&gt;, respectively. As pointed out in (Pala et al., 1997; Popelínský et al., 1999), DESAM is not large enough: it does not yet contain a representative set of Czech sentences. In addition, some words are tagged incorrectly and about 1/5 of the positions are untagged.</Paragraph> </Section> </Paper>