<?xml version="1.0" standalone="yes"?>
<Paper uid="P80-1039">
  <Title>FOR ALL CO-OCCURRENCES OF EACH WORD PAIR IN TEXT ASSOCIATION STRENGTH = SUM (SCORES)</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
REPRESENTATION OF TEXTS FOR INFORMATION RETRIEVAL
</SectionTitle>
    <Paragraph position="0"> The representation of whole texts is a major concern of the field known as information retrieval (IR), an importaunt aspect of which might more precisely be called 'document retrieval' (DR). The DR situation, with which we will be concerned, is, in general, the following: a. A user, recognizing an information need, presents to an IR mechanism (i.e., a collection of texts, with a set of associated activities for representing, storing, matching, etc.) a request, based upon that need hoping that the mechanism will be able to satisfy that need.</Paragraph>
    <Paragraph position="1"> b. The task of the IR mechanism is to present the user with the text(s) that it judges to be most likely to satisfy the user's need, based upon the request.</Paragraph>
    <Paragraph position="2"> c. The user examines the text(s) and her/his need is satisfied completely or partially or not at all.</Paragraph>
    <Paragraph position="3"> The user's judgement as to the contribution of each text in satisfying the need establishes that text's usefulness or relevance to the need.</Paragraph>
    <Paragraph position="4"> Several characteristics of the problem which DR attempts to solve make current IR systems rather different from, say, question-answering systems. One is that the needs which people bring to the system require, in general, responses consisting of documents about the topic or problem rather than specific data, facts, or inferences. Another is that these needs are typically not precisely specifiable, being expressions of an anomaly in the user's state of knowledge. A third is that this is an essentially probabilistic, rather than deterministic situation, and is likely to remain so. And finally, the corpus of documents in many such systems is in the order of millions (of, say, journal articles or abstracts), and the potential needs are, within rather broad subject constraints, unpredictable. The DR situation thus puts certain constraints upon text representation and relaxes others. The major relaxation is that it may not be necessary in such systems to produce representations which are capable of inference. A constraint, on the other hand, is that it is necessary to have representations which ca~ indicate problems that a user cannot her/himself specify, and a matching system whose strategy is to predict which documents might resolve specific anomalies. This strategy can, however, be based on probability of resolution, rat.her than certainty. Finally, because of the large amount of data,.</Paragraph>
    <Paragraph position="5"> it is desirable that the representation techniques be reasonably simple computationally.</Paragraph>
    <Paragraph position="6"> Appropriate text representations, given these con-Straints, must necessarily be of whole texts, and probably ought to be themselves whole, unitary structures, rather than lists of atomic elements, each treated separately. They must be capable of representing problems, or needs, as well as expository texts, and they ought to allow for some sort of pattern matching. An obvious general schema within these requirements is a labelled associative network.</Paragraph>
    <Paragraph position="7"> Our approach to this general problem is strictly problem-oriented. We begin with a representation scheme which we realize is oversimplified, but which stands within the constraints, and test whether it can be progressively modified in response to observed deficiencies, until either the desired level of performance in solving the problem is reached, or the approach is shown to be unworkable. We report here on some lingu/stically-derived modifications to a very simple, but nevertheless psychologically and linguistically based word-cooccurrence analysis of text \[i\] (figure I).</Paragraph>
    <Paragraph position="8">  The original analysis was applied to two kinds of texts : abstracts of articles representing documents stored by the system, and a set of 'problem statements' representing users' information needs -- their anomalous states of knowledge -- when they approach the system. The analysis produced graph-like structures, or association maps, of the abstracts and problem statements which were evaluated by the authors of the texts (Figure 2) (Figure 3).</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="147" type="metho">
    <SectionTitle>
CLUSTERING LARGE FILES OF DOCUMENTS
USING THE SINGLE-LINK METHOD
</SectionTitle>
    <Paragraph position="0"> A method for clustering large files of documents using a clustering algorithm which takes O(n**2) operations (single-link) is proposed. This method is tested on a file of i1,613 doc%unents derived from an operational system. One prop-erty of the generated cluster hierarchy (hierarchy con~ection percentage) is examined and it indicates that the hierarchy is similar to those from other test collections. A comparison of clustering times with other methods shows that large files can be cluStered by single-link in a time at least comparable to various heuristic algorithms which theoretically require  In general, the representations were seen as being accurate reflections of the author's state of knowledge or problem; however, the majority of respondents also felt that some concepts were too strongly or weakly comnected, and that important concepts were omitted (Table i).</Paragraph>
    <Paragraph position="1"> We think that at least some of these problems arise because the algorithm takes no account of discourse structure. But because the evaluations indicated that the algorithm produces reasonable representations, we ha%~ decided to amend the analytic structure, rather than abandon it completely.</Paragraph>
    <Paragraph position="2">  Our current modifications to the analysis consist primarily of methods for translating facts about discourse structure into rough equivalents within the word-cooccurrence paradigm. We choose this strategy, rather than attempting a complete and theoretically adequate discourse analysis, in order to incorporate insights about discourse without violating the cost -d volume constraints typical of DR systems. The modi~,cations are designed to recognize such aspects of discourse structure as establishment of topic; &amp;quot;setting of context; summarizing; concept foregrounding; and stylistic variation. Textual characteristics which correspond with these aspects Include discourse-initial and discoursefinal sentences; title words in the text: equivalence relations; and foregrounding devices (Figure 4).</Paragraph>
    <Paragraph position="3"> i. Repeat first and last sentences of the text.</Paragraph>
    <Paragraph position="4"> These sentences may include the more important concepts, and thus should be more heavily weighted.</Paragraph>
    <Paragraph position="5"> 2. Repeat first sentence of paragraph after the last sentence.</Paragraph>
    <Paragraph position="6"> To integrate these sentences more fully into ~he overall structure.</Paragraph>
    <Paragraph position="7"> 3. Make the title the first and last sentence of the text, or overweight the score for each cO-OCcurrence containing a title word.</Paragraph>
    <Paragraph position="8"> Concepts in the title are likely to be the most important in the text, yet are unlikely to be used often in the abstract.</Paragraph>
    <Paragraph position="9"> 4. Hyphenate phrases in the input text (phrases chosen algorithmically) and then either: a. Use the phrase only as a unit equivalent to a single word in the co-occurrence analysis ; or b. use any co-occurrence with either member of the phrase as a co-occurrence with the phrase, rather than the individual word.</Paragraph>
    <Paragraph position="10"> This is to control for conceptual units, as opposed to conceptual relations.</Paragraph>
    <Paragraph position="11"> 5. Modify original definition of adjacency, which counted stop-list words, to one which ignores stop-list words. This is to correct for the distortion caused by the distribution of function words in the recognition of multi-word concepts.</Paragraph>
    <Paragraph position="12"> Figure 4. Modifications to Text Analysis Program We have written alternative systems for each of the proposed modifications. In this experiment the original corpus of thirty abstracts (but not the prublem statements) is submitted to all versions of the analysis programs and the results co~ared to the evaluations of the original analysis and to one another. From the comparisons can be determined: the extent to which discourse theory can be translated into these terms; and the relative effectiveness of the various modifications in improving the original representations.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML