<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2914">
  <Title>Word Distributions for Thematic Segmentation in a Support Vector Machine Approach</Title>
  <Section position="5" start_page="0" end_page="101" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> (Todd, 2005) distinguishes between &quot;local-level topics (of sentences, utterances and short discourse segments)&quot; and &quot;discourse topics (of more extended stretches of discourse)&quot;.[1] (Todd, 2005) points out that &quot;discourse-level topics are one of the most elusive and intractable notions in semantics&quot;. Despite this difficulty in giving a rigorous definition of discourse topic, the task of discourse/dialogue segmentation into thematic episodes can be described by invoking an &quot;intuitive notion of topic&quot; (Brown and Yule, 1998). Thematic segmentation also relates to several notions such as speaker's intention, topic flow and cohesion.</Paragraph>
    <Paragraph position="1"> [1] In this paper, we make use of the term topic or theme as referring to the discourse/dialogue topic.</Paragraph>
    <Paragraph position="2"> To determine whether thematic segment identification is a feasible task, previous studies have relied on experiments in which several human subjects are asked to mark thematic segment boundaries based on their intuition and a minimal set of instructions. In this manner, previous studies, e.g. (Passonneau and Litman, 1993; Galley et al., 2003), obtained a statistically significant level of inter-annotator agreement.</Paragraph>
    <Paragraph position="3"> Automatic thematic segmentation (TS), i.e. the segmentation of a text stream into topically coherent segments, is an important component in applications dealing with large document collections such as information retrieval and document browsing. Other tasks that could benefit from the thematic textual structure include anaphora resolution, automatic summarisation and discourse understanding.</Paragraph>
    <Paragraph position="4"> The work presented here tackles the problem of TS by adopting a supervised learning approach for capturing linear document structure of non-overlapping thematic episodes. A prerequisite for the input data to our system is that texts are divided into sentences or utterances.[2] Each boundary between two consecutive utterances is a potential thematic segmentation point and therefore, we model the TS task as a binary-classification problem, where each utterance should be classified as marking the presence or the absence of a topic shift in the discourse/dialogue based only on observations of patterns in vocabulary use.</Paragraph>
    <Paragraph position="5"> [2] Occasionally within this document we employ the term utterance to denote either a sentence or an utterance in its proper sense.</Paragraph>
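The binary-classification framing described above can be illustrated with a minimal Python sketch. It is not the system presented in this paper: the function names, the window size and the single cosine-similarity feature are hypothetical choices made only to show how each gap between consecutive utterances becomes one labelled example derived from vocabulary use.

```python
from collections import Counter
import math


def word_counts(utterances):
    """Bag-of-words counts over a list of utterances (lowercased whitespace tokens)."""
    counts = Counter()
    for utterance in utterances:
        counts.update(utterance.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two word-count distributions."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def boundary_examples(utterances, boundaries, window=2):
    """Build one (features, label) pair per gap between consecutive utterances.

    boundaries is a set of indices i such that a new thematic segment starts
    at utterances[i]. The label is 1 for a topic shift at that gap, else 0.
    The only feature (an assumption of this sketch) is the lexical similarity
    of the windows before and after the gap.
    """
    examples = []
    for i in range(1, len(utterances)):
        before = word_counts(utterances[max(0, i - window):i])
        after = word_counts(utterances[i:i + window])
        features = [cosine(before, after)]
        label = 1 if i in boundaries else 0
        examples.append((features, label))
    return examples


# Toy input: the third utterance (index 2) starts a new thematic segment.
utterances = [
    "the cat sat on the mat",
    "the cat purred",
    "stock prices fell today",
    "markets fell sharply",
]
examples = boundary_examples(utterances, boundaries={2})
print([label for _, label in examples])  # [0, 1, 0]
```

Pairs of this form could then be passed to any binary classifier, such as a support vector machine, with richer vocabulary-based features in place of the single similarity score.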
    <Paragraph position="6"> The remainder of the paper is organised as follows. The next section summarises previous techniques, describes how our method relates to them and presents the motivations for a support vector approach. Sections 3 and 4 present our approach to adopting support vector learning for thematic segmentation. Section 5 outlines the empirical methodology and describes the data used in this study. Section 6 presents and discusses the evaluation results. The paper closes with Section 7, which briefly summarises this work and offers some conclusions and future directions.</Paragraph>
  </Section>
</Paper>