<?xml version="1.0" standalone="yes"?>
<Paper uid="J97-1003">
  <Title>TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages</Title>
  <Section position="5" start_page="34" end_page="38" type="metho">
    <SectionTitle>
2.1 Online Text Display and Hypertext
</SectionTitle>
    <Paragraph position="0"> Research in hypertext and text display has produced hypotheses about how textual information should be displayed to users. One study of an on-line documentation system (Girill 1991) compares display of fine-grained portions of text (i.e., sentences), full texts, and intermediate-sized units. Girill finds that divisions at the fine-grained level are less efficient to manage and less effective in delivering useful answers than intermediate-sized units of text.</Paragraph>
    <Paragraph position="1"> Girill does not make a commitment about exactly how large the desired text unit should be, but talks about "passages" and describes passages in terms of the communicative goals they accomplish (e.g., a problem statement, an illustrative example, an enumerated list). The implication is that the proper unit is the one that groups together the information that performs some communicative function; in most cases, this unit will range from one to several paragraphs. (Girill also finds that using document boundaries is more useful than ignoring document boundaries, as is done in some hypertext systems, and that premarked sectional information, if available and not too long, is an appropriate unit for display.)
Tombaugh, Lickorish, and Wright (1987) explore issues relating to ease of readability of long texts on CRT screens. Their study explores the usefulness of multiple windows for organizing the contents of long texts, hypothesizing that providing readers with spatial cues about the location of portions of previously read texts will aid in their recall of the information and their ability to quickly locate information that has already been read once. In the experiment, the text is divided using premarked sectional information, and one section is placed in each window. They conclude that segmenting the text by means of multiple windows can be very helpful if readers are familiar with the mechanisms supplied for manipulating the display.</Paragraph>
    <Paragraph position="2"> Computational Linguistics Volume 23, Number 1
Converting text to hypertext, in what is called post hoc authoring (Marchionini, Liebscher, and Lin 1991), requires division of the original text into meaningful units (a task noted by these authors to be a challenging one) as well as meaningful interconnection of the units. Automated multi-paragraph segmentation should help with the first step of this process, and is more important than ever now that pre-existing documents are being put up for display on the World Wide Web. Salton et al. (1996) have recognized the need for multi-paragraph units in the automatic creation of hypertext links as well as theme generation (this work is discussed in Section 5).</Paragraph>
    <Section position="1" start_page="35" end_page="36" type="sub_section">
      <SectionTitle>
2.2 Information Retrieval
</SectionTitle>
      <Paragraph position="0"> In the field of information retrieval, there has recently been a surge of interest in the role of passages in full text. Until very recently, most information retrieval experiments made use only of titles and abstracts, bibliographic entries, or very short newswire articles, as opposed to full text. When long texts are available, there arises the question: can retrieval results be improved if the query is compared against only a passage or subpart of the text, as opposed to the text as a whole? And if so, what size unit should be used? In this context, "passage" refers to any segment of text isolated from the full text. This includes author-determined segments, marked orthographically (paragraphs, sections, and chapters) (Hearst and Plaunt 1993; Salton, Allan, and Buckley 1993; Moffat et al. 1994) and/or automatically derived units of text, including fixed-length blocks (Hearst and Plaunt 1993; Callan 1994), segments motivated by subtopic structure (TextTiles) (Hearst and Plaunt 1993), or segments motivated by properties of the query (Mittendorf and Schäuble 1994).</Paragraph>
      <Paragraph position="1"> Hearst and Plaunt (1993), in some early passage-based retrieval experiments, report improved results using passages over full-text documents, but do not find a significant difference between using motivated subtopic segments and arbitrarily chosen block lengths that approximated the average subtopic segment length. Salton, Allan, and Buckley (1993), working with encyclopedia text, find that comparing a query against orthographically marked sections and then paragraphs is more successful than comparing against full documents alone.</Paragraph>
      <Paragraph position="2"> Moffat et al. (1994) find, somewhat surprisingly, that manually supplied sectioning information may lead to poorer retrieval results than techniques that automatically subdivide the text. They compare two methods of subdividing long texts. The first consists of using author-supplied sectioning information. The second uses a heuristic in which small numbers of paragraphs are grouped together until they exceed a size threshold. The results are that the small, artificial multi-paragraph groupings seem to perform better than the author-supplied sectioning information (which usually consisted of many more paragraphs than Moffat et al.'s subdivision algorithm or TextTiling would create). More experiments in this vein are necessary to firmly establish this result, but it does lend support to the conjecture that multi-paragraph subtopic-sized segments, such as those produced by TextTiling, are useful for similarity-based comparisons in information retrieval.</Paragraph>
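      <Paragraph> The grouping heuristic attributed to Moffat et al. can be sketched as follows. This is a minimal sketch: the character-based size measure, the threshold value, and the function name are illustrative assumptions rather than details from their paper.

```python
def group_paragraphs(paragraphs, size_threshold=1000):
    """Group consecutive paragraphs into passages, closing a passage
    as soon as its accumulated size exceeds the threshold.
    (Illustrative sketch: sizes here are character counts.)"""
    passages, current, current_size = [], [], 0
    for para in paragraphs:
        current.append(para)
        current_size += len(para)
        if current_size > size_threshold:
            passages.append(current)
            current, current_size = [], 0
    if current:
        passages.append(current)  # keep a trailing, under-threshold passage
    return passages
```

Grouping this way yields small multi-paragraph passages of roughly uniform size, which is the property the retrieval results above suggest matters more than exactly where the boundaries fall.</Paragraph>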
      <Paragraph position="3"> It will not be surprising if motivated subtopic segments are not found to perform significantly better than appropriately sized, but arbitrarily segmented, units in a coarse-grained information retrieval evaluation. At TREC, the most prominent information retrieval evaluation platform (Harman 1993), the top 1,000 documents are evaluated for each query, and the best-performing systems tend to use very simple statistical methods for ranking documents. In this kind of evaluation methodology, subtle distinctions in analysis techniques tend to be lost, whether those distinctions be how accurately words are reduced to their roots (Hull and Grefenstette 1995; Harman 1991), or exactly how passages are subdivided.</Paragraph>
    </Section>
    <Section position="2" start_page="36" end_page="38" type="sub_section">
      <SectionTitle>
Hearst TextTiling
</SectionTitle>
      <Paragraph position="0"> The results of Hearst and Plaunt (1993), Salton, Allan, and Buckley (1993), and Moffat et al. (1994) suggest that it is the nature of the intermediate size of the passages that matters.</Paragraph>
      <Paragraph position="1"> Perhaps a more appropriate use of motivated segment information is in the display of information to the user. One obvious way to use segmentation information is to have the system display the passages with the closest similarity to the query, and to display a passage-based summary of the documents' contents.</Paragraph>
      <Paragraph position="2"> As a more elaborate example of using segmentation in full-text information access, I have used the results of TextTiling in a new paradigm for display of retrieval results (Hearst 1995). This approach, called TileBars, allows the user to make informed decisions about which documents and which passages of those documents to view, based on the distributional behavior of the query terms in the documents. TileBars allows users to specify different sets of query terms, as discussed later. The goal is to simultaneously and compactly indicate:
1. the relative length of the document,
2. the frequency of the term sets in the document, and
3. the distribution of the term sets with respect to the document and to each other.</Paragraph>
      <Paragraph position="3">  TextTiling is used to partition each document, in advance, into a set of multi-paragraph subtopical segments.</Paragraph>
      <Paragraph position="4"> Figure 1 shows an example query about automated systems for medical diagnosis, run over the ZIFF portion of the TIPSTER collection (Harman 1993). Each large rectangle next to a title indicates a document, and each square within the rectangle represents a TextTile in the document. The darker the tile, the more frequent the term (white indicates 0, black indicates 8 or more hits; the frequencies of all the terms within a term set are added together). The top row of each rectangle corresponds to the hits for Term Set 1, the middle row to hits for Term Set 2, and the bottom row to hits for Term Set 3. The first column of each rectangle corresponds to the first TextTile of the document, the second column to the second TextTile, and so on. The patterns of gray level are meant to provide a compact summary of which passages of the document matched which topics of the query.</Paragraph>
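      <Paragraph> The shading rule just described might be realized as a simple mapping from hit counts to gray values. The linear interpolation between the two stated endpoints (white at zero hits, black at eight or more) is an assumption; the text fixes only the endpoints.

```python
def tile_gray_level(hit_count, max_hits=8):
    """Map a term set's summed hit count in one tile to a gray value:
    0 hits gives white (255); max_hits or more gives black (0).
    Counts in between are placed on an assumed linear scale."""
    clamped = min(hit_count, max_hits)
    return round(255 * (1 - clamped / max_hits))
```

Rendering each tile with this value reproduces the display convention above: the darker the square, the more hits the tile contains.</Paragraph>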
      <Paragraph position="5"> Users' queries are written as lists of words, where each list, or term set, is meant to correspond to a different component of the query. 2 This list of words is then translated into conjunctive normal form. For example, the query in the figure is translated by the system as: (patient OR medicine OR medical) AND (test OR scan OR cure OR diagnosis) AND (software OR program). This formulation allows the interface to reflect each conceptual part of the query: the medical terms, the diagnosis terms, and the software terms. The document whose title begins "VA automation means faster admissions" is quite likely to be relevant to the query, and has hits on all three term sets throughout the document. By contrast, the document whose title begins "It's hard to ghostbust a network ..." is about computer-aided diagnosis, but has only a passing reference to medical diagnosis, as can be seen by the graphical representation.</Paragraph>
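      <Paragraph> The conjunctive-normal-form reading of a query amounts to requiring at least one hit from every term set. A minimal sketch, using the example query from the text (the function name is hypothetical):

```python
def matches_cnf(term_sets, document_words):
    """In conjunctive normal form, the document must contain at least
    one word from every term set (OR within a set, AND across sets)."""
    words = set(document_words)
    return all(any(term in words for term in ts) for ts in term_sets)

# The example query from the text, written as three term sets.
query = [
    ["patient", "medicine", "medical"],
    ["test", "scan", "cure", "diagnosis"],
    ["software", "program"],
]
```

A document mentioning medical terms and diagnosis terms but no software terms fails the third conjunct and is not matched.</Paragraph>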
      <Paragraph position="6"> This version of the TileBars interface allows the user to filter the retrieved documents according to which aspects of the query are most important. For example, if the user decides that medical terms should be better represented, the Min Hits or Min Distribution constraint on this term set can be adjusted accordingly.</Paragraph>
      <Paragraph position="7"> Min Hits indicates the minimum number of times words from a term set must appear in the document in order for it to be displayed. Similarly, Min Distribution indicates the minimum percentage of tiles that must have a representative from the term set. The setting Min Overlap Span refers to the minimum number of tiles that must have at least one hit from each of the three term sets. In Figure 1, the user has indicated that the diagnosis aspect of the query must be strongly present in the retrieved documents, by setting the Min Distribution to 30% for the second term set. 3 When the user mouse-clicks on a square in a TileBar, the corresponding document is displayed beginning at the selected TextTile. Thus the user can also view the subtopic structure within the document itself.</Paragraph>
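      <Paragraph> The three display constraints can be read as simple predicates over per-tile hit counts. The following sketch reflects the semantics as described in the text; the function names and argument conventions are hypothetical.

```python
def passes_constraints(tile_hits, min_hits=0, min_distribution=0.0):
    """tile_hits: hit counts for one term set, one entry per tile.
    Min Hits bounds the total number of hits in the document;
    Min Distribution bounds the fraction of tiles with at least one hit."""
    total = sum(tile_hits)
    distribution = sum(1 for h in tile_hits if h > 0) / len(tile_hits)
    return total >= min_hits and distribution >= min_distribution

def meets_overlap_span(per_set_tile_hits, min_overlap_span=0):
    """per_set_tile_hits: one tile-hit list per term set.
    Min Overlap Span bounds the number of tiles that contain at
    least one hit from every term set."""
    overlapping = sum(1 for hits in zip(*per_set_tile_hits)
                      if all(h > 0 for h in hits))
    return overlapping >= min_overlap_span
```

For instance, with min_distribution set to 0.3, a document whose second term set hits 2 of its 5 tiles (40%) would be retained, in the spirit of the 30% setting in the example.</Paragraph>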
      <Paragraph position="8"> 3 Most likely this setting information is too complicated for a typical user; I have performed some experiments to determine how to set these constraints automatically (Hearst 1996) to be used in future versions of the interface.</Paragraph>
    </Section>
    <Section position="3" start_page="38" end_page="38" type="sub_section">
      <SectionTitle>
Hearst TextTiling
</SectionTitle>
      <Paragraph position="0"> This section has discussed why multi-paragraph segmentation is important and how it might be used. The next section elaborates on what is meant by multi-paragraph subtopic structure, casting the problem in terms of detection of topic or subtopic shift.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="38" end_page="39" type="metho">
    <SectionTitle>
3. Coarse-Grained Subtopic Structure
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="38" end_page="39" type="sub_section">
      <SectionTitle>
3.1 What is Subtopic Structure?
</SectionTitle>
      <Paragraph position="0"> In order to describe the detection of subtopic structure, it is important to define the phenomenon of interest. The use of the term subtopic here is meant to signify pieces of text "about" something and is not to be confused with the topic/comment distinction (Grimes 1975), also known as the given/new contrast (Kuno 1972), found within individual sentences.</Paragraph>
      <Paragraph position="1"> The difficulty of defining the notion of topic is discussed at length in Brown and Yule (1983, Section 3). They note: The notion of 'topic' is clearly an intuitively satisfactory way of describing the unifying principle which makes one stretch of discourse 'about' something and the next stretch 'about' something else, for it is appealed to very frequently in the discourse analysis literature ....</Paragraph>
      <Paragraph position="2"> Yet the basis for the identification of 'topic' is rarely made explicit. (pp. 69-70)</Paragraph>
      <Paragraph position="3"> After many pages of attempting to pin the concept down, they suggest, as one alternative, investigating topic-shift markers instead: It has been suggested ... that instead of undertaking the difficult task of attempting to define 'what a topic is', we should concentrate on describing what we recognize as topic shift. That is, between two contiguous pieces of discourse which are intuitively considered to have two different 'topics', there should be a point at which the shift from one topic to the next is marked. If we can characterize this marking of topic-shift, then we shall have found a structural basis for dividing up stretches of discourse into a series of smaller units, each on a separate topic .... The burden of analysis is consequently transferred to identifying the formal markers of topic-shift in discourse. (pp. 94-95) This notion of looking for a shift in content bears a close resemblance to Chafe's notion of The Flow Model of discourse in narrative texts (Chafe 1979), in description of which he writes: Our data ... suggest that as a speaker moves from focus to focus (or from thought to thought) there are certain points at which there may be a more or less radical change in space, time, character configuration, event structure, or, even, world .... At points where all of these change in a maximal way, an episode boundary is strongly present. But often one or another will change considerably while others will change less radically, and all kinds of varied interactions between these several factors are possible. 4 (pp. 179-80) Thus, rather than identifying topics (or subtopics) per se, several theoretical discourse analysts have suggested that changes or shifts in topic can be more readily identified and discussed. TextTiling adopts this stance. The problem remains, then, of how to detect subtopic shift. Brown and Yule (1983) consider in detail two markers: adverbial clauses and certain kinds of prosodic markers. By contrast, the next sub-section will show that lexical co-occurrence patterns can be used to identify subtopic shift.</Paragraph>
    </Section>
    <Section position="2" start_page="39" end_page="39" type="sub_section">
      <SectionTitle>
3.2 Relationship to Segmentation in Hierarchical Discourse Models
</SectionTitle>
      <Paragraph position="0"> Much of the current work in empirical discourse processing makes use of hierarchical discourse models, and several prominent theories of discourse assume a hierarchical segmentation model. Foremost among these are the attentional/intentional structure of Grosz and Sidner (1986) and the Rhetorical Structure Theory of Mann and Thompson (1987). The building blocks for these theories are phrasal or clausal units, and the targets of the analyses are usually very short texts, typically one to three paragraphs in length. 5 Many problems in discourse analysis, such as dialogue generation and turn-taking (Moore and Pollack 1992; Walker and Whittaker 1990), require fine-grained, hierarchical models that are concerned with utterance-level segmentation. Progress is being made in the automatic detection of boundaries at this level of granularity using machine learning techniques combined with a variety of well-chosen discourse cues (Litman and Passonneau 1995).</Paragraph>
      <Paragraph position="1"> In contrast, TextTiling has the goal of identifying major subtopic boundaries, attempting only a linear segmentation. We should expect to see, in grouping together paragraph-sized units instead of utterances, a decrease in the complexity of the feature set and algorithm needed. The work described here makes use only of lexical distribution information, in lieu of prosodic cues such as intonational pitch, pause, and duration (Hirschberg and Nakatani 1996), discourse markers such as oh, well, ok, however (Schiffrin 1987; Litman and Passonneau 1995), pronoun reference resolution (Passonneau and Litman 1993; Webber 1988) and tense and aspect (Webber 1987; Hwang and Schubert 1992). From a computational viewpoint, deducing textual topic structure from lexical occurrence information alone is appealing, both because it is easy to compute, and because discourse cues are sometimes misleading with respect to the topic structure (Brown and Yule 1983, Section 3).</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="39" end_page="39" type="metho">
    <SectionTitle>
4. Detecting Subtopic Change via Lexical Co-occurrence Patterns
</SectionTitle>
    <Paragraph position="0"> TextTiling assumes that a set of lexical items is in use during the course of a given subtopic discussion, and when that subtopic changes, a significant proportion of the vocabulary changes as well. The algorithm is designed to recognize episode boundaries by determining where thematic components like those listed by Chafe (1979) change in a maximal way. However, unlike other researchers who have studied setting, time, characters, and the other thematic factors that Chafe mentions, I attempt to determine where a relatively large set of active themes changes simultaneously, regardless of the type of thematic factor. This is especially important in expository text in which the subject matter tends to structure the discourse more so than characters, setting, and so on. For example, in the Stargazers text introduced in Section 1, a discussion of</Paragraph>
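    <Paragraph> The assumption above, that a subtopic shift is accompanied by a change in a significant proportion of the vocabulary, can be illustrated with a simplified sketch that scores adjacent stretches of text by vocabulary overlap. This is an illustration of the principle only, not the TextTiling algorithm itself, which compares weighted term repetitions rather than raw word sets.

```python
def block_similarities(blocks):
    """For each gap between adjacent blocks of text, compute the
    overlap of their vocabularies (Jaccard similarity). A low score
    at a gap suggests that many active themes change there at once,
    i.e., a candidate subtopic boundary."""
    scores = []
    for left, right in zip(blocks, blocks[1:]):
        a, b = set(left), set(right)
        score = len(a.intersection(b)) / len(a.union(b)) if a or b else 0.0
        scores.append(score)
    return scores
```

Gaps whose scores dip well below their neighbors' are the points where a relatively large set of active themes changes simultaneously.</Paragraph>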
  </Section>
</Paper>