<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2114">
  <Title>Linguistic Indeterminacy as a Source of Errors in Tagging</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 The Linguistic Material Used in
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
the Study
</SectionTitle>
      <Paragraph position="0"> The error analysis on which this study is based was carried out on material from the Stockholm-Umeå Corpus (SUC) of modern written Swedish (see Källgren 1990). It is a carefully composed, balanced corpus. Its composition follows the principles established by the Brown and LOB corpora, adjusted so that it covers the most common genres of 1990s Swedish. It contains newspaper texts, factual prose, and fiction on several stylistic levels. All the texts are written prose published between 1990 and 1994. No spoken language material is included in the corpus.</Paragraph>
      <Paragraph position="1"> All words in the SUC are tagged for part of speech and for inflectional features. For a description of the SUC annotation system, see Ejerhed et al. (1992). The tagged texts of the SUC are converted into SGML format, and additional tags are added in accordance with the TEI Guidelines (Sperberg-McQueen and Burnard 1993, Källgren 1995) to give the format in which the corpus will finally be distributed. Legal permissions allow the corpus to be used and distributed for non-commercial research purposes.</Paragraph>
    </Section>
  </Section>
</Paper>