<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0204">
  <Title>Texplore exploring expository texts via hierarchical representation</Title>
  <Section position="3" start_page="0" end_page="26" type="metho">
    <SectionTitle>
2 Some characteristics of expository
texts
</SectionTitle>
    <Paragraph position="0"> The following subsections consider the linguistic evidence that is needed to develop a hierarchical representation of expository texts. The next  1. First/Third person 2. Agent oriented 3. Accomplished time 4. Chronological time i F~pository 1. No necessary reference 2. Subject matter oriented 3. Time not focal 4. Logical n~kase +Projected Procedural I. Non-specific person 2. Patient oriented 3. Projected time 4. Chron. linkage Horatory 1. Second person 2. Addressee oriented 3. Mode, not time 4. Logical linlmge  subsection discusses expository text in general and its relation to other discourse types. H.ierarchical structure in discourse is discussed next, with the paragraph as its basic unit. The final subsection considers lexical cohesion as the basic technique for identifying structure.</Paragraph>
    <Section position="1" start_page="25" end_page="25" type="sub_section">
      <SectionTitle>
2.1 Expository text and other discourse types
</SectionTitle>
      <Paragraph position="0"> In order to understand the particular domain of expository text it is important to see it in the larger context of other possible discourse types. Longacre (1976) presents a 2 x 2 model of four discourse types, Narrative, Procedural (Instructive), Expository, and Hortatory (Sermon), shown in Table 1.</Paragraph>
      <Paragraph position="1"> Expository text is seen to be less modal since its discourse is determined by its subject, and the logical structure built in its exposition, rather than by who the speaker is, the audience or the temporal order of the speech acts. This is not to say that two authors are expected to produce the same text on the same subject.</Paragraph>
      <Paragraph position="2"> Personal style is a factor here as in any human writing. However, we can take advantage of the modeless character of expository text when creating a representation of its content. If the discourse relations between two segments can be assumed to be modeless we can expect these relations to be manifested, to a large extent, in their lexical context. In other words, we can expect the robust techniques of information retrieval to be useful for identifying the information structure of expository texts.</Paragraph>
    </Section>
    <Section position="2" start_page="25" end_page="25" type="sub_section">
      <SectionTitle>
2.2 The paragraph unit in hierarchical discourse structure
</SectionTitle>
      <Paragraph position="0"> Hierarchical structure is present in all types of discourse. Authors organize their works as trilogies of books, as chapters in books, sections in chapters, then subsections, subsubsections, etc. This is true for an instruction manual, the Bible, The Hitchhiker's Guide to the Galaxy, War and Peace, and, in a completely different category, this humble paper.</Paragraph>
      <Paragraph position="1"> Previous research shows that this hierarchical structure is not just an author's style but is inherent in many language phenomena.</Paragraph>
      <Paragraph position="2"> A number of rhetoric structure theories have been proposed (Meyer and Rice, 1982; Mann and Thompson, 1987) which recognize distinct rhetorical structures like problem-solution and cause-effect. Applying this model recursively forms a hierarchical structure over the text.</Paragraph>
      <Paragraph position="3"> From the cognitive aspect, Giora (1985) pro..</Paragraph>
      <Paragraph position="4"> poses a hierarchical categorial structure where the discourse topic functions as a prototype in the cognitive representation of the nnlt, i.e. a minimal gener~llzation of the propositions in the unit. FFinally, the hierarchical intention structure, proposed for a more general, multiple participants discourse, is a key part of the well-accepted discourse theory of Grosz and Sido net (1986).</Paragraph>
      <Paragraph position="5"> Hierarchical structure implies some kind of basic unit. Many researches (Longacre, 1979; Hinds, 1979; Kieras, 1982) have shown that the paragraph is a basic unit of coherency, and that it functions very slmilarly in many languages of vastly different origin (Chafe, 1979).</Paragraph>
      <Paragraph position="6"> Not only the paragraph is a basic unit of coherency, its initial position, the first one or two sentences of the paragraph, provides key information for identifying the discourse topics (Yaari et al., ). Again, as Chafe shows, this is true for many varied languages. The initial position of a paragraph is thus a key heuristic for general purpose document summarization (Paice, 1990).</Paragraph>
    </Section>
    <Section position="3" start_page="25" end_page="26" type="sub_section">
      <SectionTitle>
2.3 Cohesion
</SectionTitle>
      <Paragraph position="0"> Lexical cohesion is the most common linguistic mechanism used for discourse segmentation (Hearst, 1997; Yaari, 1997). The basic notion comes from the work of Halliday and Hasan (1976) and further developed in (HaUlday, 1994).</Paragraph>
      <Paragraph position="1"> Cohesion is defined as the non-structural mechanlam by which discourse units of different sizes can be connected across gaps of any texts. One of the forms of cohesion is lexical cohesion. In this type, cohesion is achieved by choosing words that are related in some way - null lexicaUy, semantically or collocationally. Lexical cohesion is important for a practical reason - it is relatively easy to identify it computationally. It is also important for linguistic reasons since, unlike other forms of cohesion, this form is active over large extents of text.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="26" end_page="27" type="metho">
    <SectionTitle>
3 Hierarchical representation of text
</SectionTitle>
    <Paragraph position="0"> In the previous section the hierarchical structure of a text was established as an inherent linguistic phenomena. We have also identified linguistic evidence that can be used to uncover this structure.</Paragraph>
    <Paragraph position="1"> In this section we focus on the humanmanhine interaction aspects of this form of representation. From this point of view, hierarchical representation answers two kinds of problems: how to navigate in free text, and how to effectively communicate the content of the document to the user. These two issues are discussed in the following subsections.</Paragraph>
    <Section position="1" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
3.1 Navigating in free text
</SectionTitle>
      <Paragraph position="0"> The basic approach for free, 1restructured, text navigation (and the basis for the whole internet explosion} is the hypertext method. Navigation follows what may be called a stream of associations. At any point in the text the user may hyper-jump to one out of a set of available destination sites, each determined by its association with narrow context around the link anchor. In spite of their popularity, the arbitrary hyperjumps create a serious drawback by losing the global context. Having lost the global context, the navigator is destined to wander aimlessly in maze of pages, wasting time and forgetting what he/she was looking for in the first place.</Paragraph>
      <Paragraph position="1"> The use of a static ticker frame that allows an immediate deliverance from this maze (typically placed on the left part of the browser's window) is a recognition of this drawback.</Paragraph>
      <Paragraph position="2"> Once NLP methods are applied on the text document, more sophisticated methods become possible for navigating in unstructured text. An important example is the use of lexical cohesion, implemented by measuring distance between term vectors, to decompose the text to themes (Salton et al., 1995). Themes are defined as a set of paragraphs, not necessarily adjacent, that have strong mutual cohesion between them. Navigation through such themelinked paragraphs is a step forward in effective text exploration. However, the user navigates within the context of a single theme and still loses the overall context of the full text. Because there is only one hierarchy here, the user has to go through a selected theme to its end to find out whether it provides the sought information. null The an.~wer proposed in Texplore is to discover and present the user with a hierarchical representation of the text. Hierarchical structure is oriented specifically to present complex information. Authors use it explicitly to organize large works. Scientists use it to describe complex flora and fauna. Manual writers use it to describe complex procedures. Our task here is somewhat different. We are presented with a given unstructured text and want to uncover in it some latent hierarchical structure. We claim that in so far as the text is coherent, that is, it makes sense, there is some structure in it. The more coherent the text, the more structure it has.</Paragraph>
      <Paragraph position="3"> Combining the capabilities of hypertext and hierarchical representation is particularly attractive. Together they provide two advantages not found in other access methods: .</Paragraph>
      <Paragraph position="4"> .</Paragraph>
      <Paragraph position="5"> Immediate access to the sought piece of information, or quick dismissal if none exists. In computer memory jargon we call this random access. This is the ability to access the required information in a small number of steps (bound by the maximum depth of the hierarchy).</Paragraph>
      <Paragraph position="6"> User control over the level of details. Most navigation tools provide the text as is so the user has to scan at the maximum level of details at all times. However, for expository texts beyond a couple of pages in size, the user needs the ability to skim quickly over most of the text and go deeper only at few points. There is a need, then, to have good interactive control over the level of details presented.</Paragraph>
    </Section>
    <Section position="2" start_page="26" end_page="27" type="sub_section">
      <SectionTitle>
3.2 Communicating document's
</SectionTitle>
      <Paragraph position="0"> content Document summarization systems today are concerned with extracting significant, indicative sentences or clauses, and combining them as a more-or-less coherent abstract.</Paragraph>
      <Paragraph position="1">  This static abstract should answer the question &amp;quot;what the text is about&amp;quot;. However, because of the underlying technique of sentence extraction and its static nature, the answer is too elaborated in some of the details and insufficient in others.</Paragraph>
      <Paragraph position="2"> Boguraev et al. (1998) discuss extensively this drawback of today's summarizers and conclude that good content representation requires two basic features: (a) presenting the summary extracts in their conte.~, and (b) user control over the granularity. Their solution is based on identifying primitive clauses, called capsules, resolving their anaphoric references and providing them, through a user interface, at different granularities.</Paragraph>
      <Paragraph position="3"> The expandable outline view of Texplore, built upon hierarchical representation of the text's contents, nicely meets the requirements of context and granularity, though the underlying NLP technology is completely different. In the next section we discuss the Texplore system in details, the supporting NLP tools as well as the front-end visualization system.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="27" end_page="30" type="metho">
    <SectionTitle>
4 Texplore - system description
</SectionTitle>
    <Paragraph position="0"> The overall data-flow in TexpIore is shown in Figure i. It starts with a preprocessing stage, a structure and heading analyses, leading to expandable outline display.</Paragraph>
    <Paragraph position="1"> A typical screen of Texplore is shown in Figure 2 I. It consists of three parts. The original text is shown on the right pane, the expandable outline on the upper left and the concept index on the lower left pane.</Paragraph>
    <Paragraph position="2"> The following subsections describe the different parts of the system, focusing on the visualization aspects related to content presentation.</Paragraph>
    <Section position="1" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
4.1 NLP preprocessing
</SectionTitle>
      <Paragraph position="0"> The first two preprocessing steps, sentence analysis part-of-speech (POS) analysis, are pretty standard. The result is a list of POS-tagged sentences, grouped in paragraphs.</Paragraph>
      <Paragraph position="1"> In the N-gram analysis, a repeated scan is made at each i'th stage, looking for pairs of consecutive candidates from the previous stage. We  filter each stage using mutual information measure so the complexity is practically O(N). Finally we remove those N-grams whose instances are a proper subset of some longer N-grams, and then apply part-of-speech filter on the remaining candidates leaving only noun compounds. That last step was found to be extremely useful, reducing false N-grams to practically nil.</Paragraph>
    </Section>
    <Section position="2" start_page="27" end_page="28" type="sub_section">
      <SectionTitle>
4.2 Hierarchical structure segmentation
</SectionTitle>
      <Paragraph position="0"> The core of the system is the hierarchical structure segmentation. The method used for segmentation, called hierarchical agglomerative clustering (HAC), was described in detail by Yaari (1997). In HAC the structure is discovered by repeatedly merging the closest data elements into a new element. In our application we use paragraphs as the elementary segments of discourse and apply a lexical cohesion measure as the proximity test. The lexical cohesion measure, Proximity(si, 8i/1) , iS adapted from the standard Saltoniau term vector distance. It computes the cosine between two successive segments, si and si+~.</Paragraph>
      <Paragraph position="2"> Here Wk,i is the weight of the k'th term of si, and \]lsiH is the length of the vector. The as- null upper left and the concept index on the lower left. The outline is shown collapsed except for one section sumption that only adjacent segments are compared is not necessarily the case, see (Salton et el., 1995). However, it allows us to create the more conventional 2-D structure of a table-of-contents, instead of the arbitrary graph structure that would have been formed otherwise. Another modification is the way the term weights, wk,~, are determined. We found that having the weight proportional to the term's IDF (Inverse Document Frequency, measuring its general significance) and the position of the sentence in the paragraph, improves the quality of the proximity measure, by giving higher weight to terms with higher relevance to inter-segment cohesion.</Paragraph>
      <Paragraph position="3"> The result is shown in Figure 3. Inter-segment boundaries are set at points where lexical cohesion f~lls below some specific threshold. The resulting nesting could be quite deep (in this example there are 10 levels). H-man authors, however, rarely use a hierarchy depth greater than 3 (except possibly in instructional discourse). The rather deep nesting is then smoothed, between the detected boundaries, to fit human convenience, as seen in Figure 4. This smoothed structure is superimposed over the original text, producing the expandable outline shown in the left pane of Figure 2.</Paragraph>
      <Paragraph position="5"> and thus deeper-nested, adjoin;ng segments.</Paragraph>
      <Paragraph position="6"> Vertical lines indicate inter-segment boundaries.</Paragraph>
      <Paragraph position="7"> The hierarchical structure thus discovered is certainly not the only one possible. However, experiments with human judges (Hearst, 1997) showed that segmentation based on lexical cohesion is quite accurate compared to manual ones.</Paragraph>
    </Section>
    <Section position="3" start_page="28" end_page="29" type="sub_section">
      <SectionTitle>
4.3 Heading generation
</SectionTitle>
      <Paragraph position="0"> The next step, after the hierarchical structure of the text is determined, is to compose head;ngs for the identified sections. Figure 5 shows the outline pane representing the hierarchical structure.</Paragraph>
      <Paragraph position="1"> The generated headings are, at the moment,  should not be thought of as a model for life elsewhere. It Is a ~:ii billion to one chance, they say, that the earth should have i~ received just the right blow to set in mo0on a train of events that led to the emergence and rapid development of living i~i lhings, But suppose that biilion to one chance was repeated ii~ ~reughout be universe? There could still be 200 clvtllsattons iii in our galaxy alone. But most researchers do not think this is iii the case, and ~e Hubble 'linalty may put that ~eory to rest ~ii 2~.2. lqmet~ ~  pounds (NPs) is scored for each section by their frequency, mutual information (for N-grams with N &gt; 1), and position in the section.</Paragraph>
      <Paragraph position="2"> A higher score is ~ven to NPs in initial position, that is, the first one or two sentences in the paragraph.</Paragraph>
      <Paragraph position="3"> The syntax of headings follows standard conventions for human-composed headings. The most common case is the shnple NP. Another common case is a coordination of two NPs.</Paragraph>
      <Paragraph position="4"> With these guidelines we came up with the foL lowing heuristics to compose headings:  1. Remove from the list any NP that appears in an enclosing section.</Paragraph>
      <Paragraph position="5"> 2. If the saliency of the first NP in the list is much higher than the second, use this NP by itself. Otherwise, create a coordination of the first two NPs.</Paragraph>
      <Paragraph position="6"> 3. Prefix each NP in the heading with the de- null terminer that first appeared with that NP, if any. This rule is not very successful and rJiIl be modified in the future.</Paragraph>
      <Paragraph position="7"> The first rule exemplifies the power of this kind of content representation. Once an NP appears in a heading, it is hnplied in the headings of all enclosed sections and thus should not appear there explicitly. For example, in Figure 5 the NP Moon appears in the heading of section 2. Without the first rule it would appear in a few of the enclosed subsections because of its saliency. We, as readers, would see this as redundant information.</Paragraph>
    </Section>
    <Section position="4" start_page="29" end_page="30" type="sub_section">
      <SectionTitle>
4.4 Expandable outline display
</SectionTitle>
      <Paragraph position="0"> Figure 5 also ~ustrates the importance of context and granularity, mentioned earlier as key points in dynamic content presentation. The out\]me functions as a dynamic abstract with user control over the level of granularity. The user is able to read section 2.2.1, The Hubble and the earth in its full context. He/she is not lost anymore.</Paragraph>
      <Paragraph position="1"> In fact, the two panes with the original text and its outline are synchronized so that the outline can act as a ticker-only pane viewing the text on the larger right pane, or be used as a combined text and heading pane.</Paragraph>
      <Paragraph position="2"> The outline pane also supports the standard controUed-aperture metaphor. Double-clicking on a heading alternately expands and collapses the underlying text. The user can thus easily increase the granularity of the outline to see further detail, or close the aperture to see only the general topics.</Paragraph>
      <Paragraph position="3"> The heading acts cognitively as the surrogate of its text. Thus if a heaAing of a collapsed section is selected, the full text corresponding to this heading is selected on the right. This strengthens the concept of the out\]me as a true, thought compact, representation of the original text. Figure 6 shows the same out\]me, this time with reduced granularity, highlighting the correspondence between a selected heading and its text. The concept index, shown in the lower left pane of the window, is discussed in the next section.</Paragraph>
    </Section>
    <Section position="5" start_page="30" end_page="30" type="sub_section">
      <SectionTitle>
4.5 Concept index
</SectionTitle>
      <Paragraph position="0"> The N-gram analysis, combined with part-of-speech filtering, identifies a set of noun compounds that are used as a concept index for the text. They are termed concepts because the information they carry reveals a lot about the text, much more than simple one word nouns. Consider, for example, the first three Ngrams: lunar samples, living things, and dry lava lakes. In contrast, the composh~ words of each N-gram, e.g. lava, lake, living, or things, reveal very little.</Paragraph>
      <Paragraph position="1"> The high information content of the concept index makes it a very concise representation of what the text is about, though certainly secondary to the outline. Also, having these &amp;quot;concepts&amp;quot; hot-linked to their references in the text forms a hot-llnk index of key topics of the text.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>