<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-2020">
  <Title>Learning Information Structure in The Prague Treebank</Title>
  <Section position="2" start_page="0" end_page="115" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Information Structure (IS) is a partitioning of the content of a sentence according to its relation to the discourse context. There are numerous theoretical approaches describing IS and its semantics (Halliday, 1967; Sgall, 1967; Vallduv'i, 1990; Steedman, 2000) and the terminology used is diverse (see (Kruijff-Korbayov'a &amp; Steedman, 2003) for an overview). However, all theories consider at least one of the following two distinctions: (i) a topic/focus  distinction that divides the linguistic meaning of the sentence into parts that link the content to the context, and others that advance the discourse, i.e. add or modify information; and (ii) We use the Praguian terminology for this distinction. a background/kontrast  distinction between parts of the utterance which contribute to distinguishing its actual content from alternatives the context makes available. Existing theories, however, state their principles using carefully selected illustrative examples. Because of this, they fail to adequately explain what possibly different linguistic dimensions cooperate to realize IS and how they do it.</Paragraph>
    <Paragraph position="1"> In this paper we report the results of an experiment aimed to automatically identify aspects of IS. This effort is part of a larger investigation aimed to get a more realistic view on the realization of IS in naturally occurring texts.</Paragraph>
    <Paragraph position="2"> For such an investigation, the existence of a corpus annotated with some kind of 'informativity status' is of great importance. Fully manual annotation of such a corpus is tedious and time-consuming. Our plan is to initially annotate a small amount of data and then to build models to automatically detect IS in order to apply bootstrapping techniques to create a larger corpus.</Paragraph>
    <Paragraph position="3"> This paper describes the results of a pilot study; its aim is to check if the idea of learning IS works by trying it on an already existing corpus. For our experiments, we have used the Prague Dependency Treebank (PDT) (HajiVc, 1998), as it is the only corpus annotated with IS (following the theory of Topic-Focus Articulation). We trained three different classifiers, C4.5, Bagging and Ripper, using basic features from the treebank and derived features inspired by the annotation guidelines. We have evaluated the performance of the classifiers against a baseline that simulates the preprocessing procedure that preceded the manual annotation of PDT, and The notion 'kontrast' with a 'k' has been introduced in (Vallduv'i and Vilkuna, 1998) to replace what Steedman calls 'focus', and to avoid confusion with other definitions of focus.  against a rule-based system which we implemented following the annotation instructions.</Paragraph>
    <Paragraph position="4"> The organization of the paper is as follows. Section 2 describes the Prague Dependency Treebank, Section 3 presents the Praguian approach of Topic-Focus Articulation, from two perspectives: of the theoretical definition and of the annotation guidelines that have been followed to annotate the PDT. Section 4 presents the experimental setting, evaluation metric and results. The paper closes with conclusions and issues for future research (Section 5).</Paragraph>
  </Section>
class="xml-element"></Paper>