<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0405">
  <Title>Using Summaries in Document Retrieval</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Defining a Summary for News Articles
</SectionTitle>
    <Paragraph position="0"> For this investigation, the leading text of news documents is used as a basis for creating document summaries - specifically the definition of Searchable LEAD found in Wasson (1998).</Paragraph>
    <Paragraph position="1"> Brandow et al. (1995) compared summaries they created using tf-idf-based sentence extraction to fixed amounts of leading text -Philadelphia, July 2002, pp. 37-44. Association for Computational Linguistics. Proceedings of the Workshop on Automatic Summarization (including DUC 2002), approximately 60, 150 and 250 words long, in three separate trials - generated using a slightly modified version of our production Searchable LEAD text processing software. In that effort, Searchable LEAD-based extracts were judged to be acceptable as summaries for general news articles 92% of the time. This compared favorably to the 74% reported for those summaries created through sentence extraction.</Paragraph>
    <Paragraph position="2"> However, that test was limited to only 250 news articles.</Paragraph>
    <Paragraph position="3"> Wasson (1998) reported on a larger scale version of this evaluation, although in that work Searchable LEAD was used as-is. Searchable LEAD-based extracts resulted in an average compression ratio of 13% in that test.</Paragraph>
    <Paragraph position="4"> Compression ratios generally ranged between about 5-20% for most documents, depending on document length; with Searchable LEAD, the number of leading sentences and paragraphs included in the leading text field was linked to document length. For a shorter document, the Searchable LEAD might consist of only a single sentence. For long documents, Searchable LEAD might consist of the first three paragraphs or more of the document.</Paragraph>
    <Paragraph position="5"> The Searchable LEAD-based extracts were evaluated on their acceptability as summaries in more than 2,727 documents. For the 1,951 general news articles in that test corpus, Searchable LEADs were judged to be acceptable as summaries 94.1% of the time, a result that is not appreciably different from that reported by Brandow et al. (1995), especially when seven newsbrief type documents are excluded from their results. For the other types of documents in the corpus, including lists, newsbriefs and transcripts, acceptability rates were somewhat to substantially lower, as Table 1 shows.</Paragraph>
    <Paragraph position="6">  Zhou (1999) reported the results of an experiment where Searchable LEADs were compared to summaries created by two internal prototype and three commercially available sentence extraction summary generators in a document relevance judgment task, where evaluators used each summary to determine the corresponding document's relevance to a topic.</Paragraph>
    <Paragraph position="7"> The result of that evaluation showed that the top five systems, including Searchable LEAD, statistically tied in this task.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Related Work
</SectionTitle>
    <Paragraph position="0"> In addition to evaluating the value of both Searchable LEAD-based and sentence extraction-based extracts for their value as general summaries, Brandow et al. (1995) also reported on the results of a limited experiment examining the differences between summary-only versus full-text searching. In tests involving twelve Boolean queries applied to a corpus of about 20,000 documents extracted from the LexisNexis NEWS library, they found that average precision increased from 37% for searches applied to full-text to 45% for searches applied to sentence extraction-based extracts and 47% for searches applied to leading text-based extracts. This was more than offset by large drops in relative recall, 100% for full-text compared to 56% for sentence extraction-based extracts and 58% for leading text-based extracts. (Relative recall assumes that the full-text queries achieved 100% recall; due to limited resources on the project, there was no attempt to determine actual recall rates.) In addition to its limited scale, there were two key problems with this evaluation. First, although Brandow et al. (1995) correctly reported that Searchable LEAD was introduced to enhance search precision, Searchable LEAD also targeted only a subset of our customer segments, specifically those customers who wanted to retrieve only highly relevant documents (in LexisNexis-internal jargon, we refer to these as on-point or major reference documents). This point was not mentioned in Brandow et al. (1995), nor was it reflected in their search evaluation. Second, the convenience of using relative recall notwithstanding, this approach to measuring recall will generally magnify the difference in recall that should be expected when comparing full-text and summary-only search results.</Paragraph>
    <Paragraph position="1"> Sumita &amp; Iida (1997) tested both leading text-based extracts and tf-idf-based sentence extracts of up to three sentences in an experiment involving 10 queries and 600 Japanese language news articles. They reported that limiting searches to such summaries both improved the effectiveness for retrieving highly relevant documents, but also helped exclude other relevant documents with lower levels of relevance.</Paragraph>
    <Paragraph position="2"> Sakai &amp; Sparck Jones (2001) examined the value of summaries for general information retrieval and a pseudo-relevance feedback model, in their case using 30 queries applied to a nearly-39,000 document corpus derived from the TREC collection. The Okapi Basic Search System was used. Precision evaluation focused on both the top 1000 and top 10 relevance ranked documents retrieved. The authors concluded that a summary-only search may be as effective as full-text for precision-oriented searching of highly relevant documents.</Paragraph>
    <Paragraph position="3"> Incorporating both summaries and full-text documents into their pseudo-relevance feedback model was significantly more effective than using summaries only.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 User Evaluation Scopes
</SectionTitle>
    <Paragraph position="0"> Most information retrieval experiments calculate recall, precision and the corresponding f-measure from a single evaluation perspective or evaluation scope. All documents are judged to be relevant or irrelevant with respect to that one scope. However, commercial information services now report that they handle millions of searches a day for their customers. It is not reasonable to assume that all of the people using those services have the same perspective on relevance, and yet that is often how we evaluate new search aids and features.</Paragraph>
    <Paragraph position="1"> Our customers employ a variety of search strategies, depending on their topics, information interests, and the point they are at in their information seeking task. At one end, we see some customers just starting out on an information seeking task, where they typically are looking for a few highly relevant documents to help introduce themselves to the topic.</Paragraph>
    <Paragraph position="2"> Basically they are trying to provide themselves with a good starting point. At the other extreme, we see customers in public relations, competitive intelligence or in the due diligence phase of their information seeking task. These customers often want to retrieve all references to the topic, even those documents that provide even the most limited or mundane information.</Paragraph>
    <Paragraph position="3"> Although some may see this simply as the customary recall-precision trade-off, that is not the case. A document that contains a passing reference to some topic is relevant to those with the all reference evaluation scope (retrieval of that document is considered successful recall), but it is irrelevant to those with a highly relevant reference evaluation scope (retrieval of that document is considered a precision error). A document's relevance with respect to some customer's evaluation scope is what drives customer perceptions of the resulting recall and precision. Instead of a recall-precision trade-off, we have multiple evaluation scopes for which recall and precision are determined.</Paragraph>
    <Paragraph position="4"> We recognize the differences in evaluation scopes in a single user over time when proposing learning systems and personalization tools that adapt retrieval or routing results to a user's changing interests (e.g., Lam et al., 1996), but we do not recognize these differences when we use single answer key evaluations. As a result, over the years, we have seen a number of potentially useful search enhancements dismissed not because they failed to show improvement for any targeted subset of customers, but rather because they failed to show improvement when using a single general evaluation standard (Harmon, 1991; Voorhees, 1994; Sparck Jones, 1999). Query expansion functionality such as some types of morphological or synonym expansion, for example, may produce a drop in precision that offsets any improvements to recall, but we have found that customer segments who require retrieving all references to their topic are willing to put up with a lot of irrelevant information to make sure that they see everything. Of course, those customers would still like to have better precision, but they require better recall.</Paragraph>
    <Paragraph position="5"> This was also a problem with the limited retrieval experiment reported in Brandow et al.</Paragraph>
    <Paragraph position="6"> (1995). Although Searchable LEAD was introduced specifically to support the subset of users seeking only highly relevant documents, Brandow et al. (1995) did not make this distinction when evaluating their test of twelve Boolean queries.</Paragraph>
    <Paragraph position="7"> For each query evaluated in the experiment reported here, two user evaluation scopes were created. One represented Searchable LEAD's targeted customer segment and its desire to retrieve only highly relevant documents; the other represented the due diligence customer segment, which prefers to retrieve all documents that contain information about the topic regardless of how little or how much.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 The Experiment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Test Corpus
</SectionTitle>
      <Paragraph position="0"> Searchable LEAD was tested in the LexisNexis NEWS library, a commercial collection of full-text news documents from thousands of sources, including newspapers, magazines, wire services, abstract services, trade journals, transcript services and other sources. The document types in this document collection reflected these sources. Date-bounded subsets of this collection were used, with date ranges varying in length from one day (typically more than 45,000 documents searched) to two years (typically more than 32 million documents searched).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Search Topics and Topic Scope
</SectionTitle>
      <Paragraph position="0"> For this investigation thirty topics were selected and defined. The following are a few of the topics included in the set of topics:  For each of the thirty topics, two scope statements were created, where a scope statement is a description of what is considered a relevant document with respect to the topic. One scope statement, the highly relevant reference evaluation scope, defined what would constitute a highly relevant document. These scope statements typically combined quantitative measures with a number of specific pieces of information that must be present in a retrieved document for it to be considered highly relevant. Requiring some specific pieces of information to be present added objectivity to the evaluation process.</Paragraph>
      <Paragraph position="1"> The second scope statement, the all reference evaluation scope, defined the minimum information about the topic that must be present in order to consider the document relevant from that perspective. For a named entity topic, a document relevant to the all reference scope might include as little as a single occurrence of the entity's name.</Paragraph>
      <Paragraph position="2"> The highly relevant reference evaluation scope for the Office Depot query required among other things revenues, earnings (loss) information, and related per-share information. The all reference evaluation scope required at least one of the financial performance measures, with revenue typically being the one found in retrieved documents.</Paragraph>
      <Paragraph position="3"> The highly relevant reference evaluation scope for the Dallas Cowboys-Cincinnati Bengals football game query required some specific game statistics, none of which were required for the all reference evaluation scope. Thus, a pre-game story concerning whether a player might play was relevant to the all reference evaluation scope but it was irrelevant to the highly relevant reference evaluation scope. After all, articles written before the game took place obviously could not include game statistics.</Paragraph>
      <Paragraph position="4"> More than half the topics focused on named entities. This is consistent with our observations of customer search topics applied to news data, and this user behavior has also been reported elsewhere (e.g., Thompson &amp; Dozier, 1997).</Paragraph>
      <Paragraph position="5"> One effect of this was that the recall and precision rates we would observe in this experiment were higher than what is commonly reported for Boolean search results. Because many proper names are relatively unambiguous, and because articles about some named entity almost always mention the name, some of the queries had much higher accuracy rates than might otherwise be expected, and that pulled overall average accuracy rates up somewhat.</Paragraph>
      <Paragraph position="6"> The Boolean search EXXON, for example, virtually assures us of 100% recall regardless of which evaluation scope is used. Although individual Exxon service stations are mentioned periodically in the news, most news articles that mention Exxon are in fact about the major oil company, ensuring fairly high precision for the all references evaluation scope.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Queries
</SectionTitle>
      <Paragraph position="0"> Searchable LEAD was created to be used with a Boolean search engine. With 20% of news documents in our archives containing fewer than 100 words, a sizeable number of documents have one-sentence LEADs, which would be of little value to search engines that rely on term frequency.</Paragraph>
      <Paragraph position="1"> Through our own experience and routine observations of World Wide Web searchers, most customer queries are quite short, typically one or two words or phrases, perhaps connected by one Boolean operator. Similarly short queries were created for use in this evaluation, such as the following:</Paragraph>
      <Paragraph position="2"> * EXXON
* BILL MCCARTNEY
* OFFICE DEPOT AND EARNINGS
* BENGALS AND COWBOYS
* NATIONAL PARK AND OUTHOUSE</Paragraph>
    <Paragraph position="0"> In some cases, a date restriction was explicitly added to the query. In all other cases, a most recent two-year period default date restriction was used.</Paragraph>
    <Paragraph position="1"> There was no attempt to maximize the accuracy of the queries tested. Rather, the goal was to use queries that mimic typical user behavior in order to see how Searchable LEAD impacts typical users.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Testing
</SectionTitle>
      <Paragraph position="0"> Each query was applied and corresponding retrieval results evaluated in four ways, once for each evaluation scope-text scope combination:  evaluated first. Because this combination retrieves at least all the documents retrieved by any of the other search-evaluation scope combinations, it was possible to use the results of this evaluation to create an answer key that could also be used by the other evaluations in order to ensure consistency of document relevance judgments with respect to evaluation scope for all the combinations.</Paragraph>
      <Paragraph position="1"> Each test query was applied to all of the documents in date-restricted subsets of the All News (ALLNWS) file in the LexisNexis NEWS library. A date restriction was used to limit the number of documents to be examined when verifying the results. In addition to applying and evaluating the query created for a given topic, additional queries were used in order to find potential recall errors, that is, relevant documents with respect to the evaluation scope of the topic that were missed by the original query. For the Dallas Cowboys-Cincinnati Bengals football game topic, for example, in addition to the test query BENGALS AND COWBOYS, other queries used to search the date range of documents in order to identify potential recall errors included the following:  COWBOYS) (Cinergy Field is the name of the football field where the game was played) All documents retrieved by such queries were examined for their degree of relevance in order to produce more accurate recall results in this test.</Paragraph>
      <Paragraph position="2"> There was no particular attempt to match the date range exactly to a specific event, a characteristic of this test (and typical user behavior) that often contributed to the number of precision errors. For example, the Dallas Cowboys-Cincinnati Bengals football game occurred in the previous week, specifically three days earlier, but documents retrieved from the entire week were examined. Criteria for a highly relevant reference to this game included certain game statistics. Stories written before the game could not possibly include such information, so they were counted as precision errors for the highly relevant reference evaluation scope. From a customer's perspective, our routine reverse chronological presentation of retrieved documents would have effectively hidden such errors from customers until after the desired information was obtained. For evaluation purposes, however, the entire date range was evaluated.</Paragraph>
      <Paragraph position="3"> Full-text queries were limited to HEADLINE and BODY fields of documents. LEAD only queries were limited to the LEAD sub-field of the BODY field. Most news documents in the LexisNexis service also have one or more meta-data fields that may include named entity and/or topic-indicating controlled vocabulary terms, in addition to other information. Limiting queries to the HEADLINE, BODY and LEAD fields focused the evaluation on the impact of using summaries as opposed to that of using other possible editorial enhancements of the data.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.5 Evaluation
</SectionTitle>
      <Paragraph position="0"> The purpose of Searchable LEAD as a retrieval aid is to help some customer segments retrieve a highly relevant documents about some topic, and to minimize the number of irrelevant documents and documents that only contain passing references to the topic in the answer set. If Searchable LEAD works, one would expect that queries restricted to the LEAD field would result in higher precision than queries applied to the full-text would.</Paragraph>
      <Paragraph position="1"> For the all reference evaluation scope, one would expect recall to fall when shifting from full-text to LEAD. After all, a general summary like LEAD typically only includes information on major points in the document.</Paragraph>
      <Paragraph position="2"> The impact on recall for the highly relevant reference evaluation scope is less certain.</Paragraph>
      <Paragraph position="3"> Because the Searchable LEAD represents an acceptable summary in only 94% of general news articles, and a lower figure in other types of documents found in the LexisNexis NEWS library, it is also reasonable to assume that some decline in recall would also occur with this evaluation scope. Given that relevant documents with this evaluation scope must include all the targeted information, recall errors as defined by this scope may actually eliminate information redundancy, and thus are not necessarily critical to the customer. However, the way in which basic pieces of information are presented can also be revealing, so such redundant documents may still be useful.</Paragraph>
      <Paragraph position="4"> Calculating recall in these cases thus is still worthwhile.</Paragraph>
      <Paragraph position="5"> Recall and precision rates were calculated for each query for each evaluation scope-text scope combination. For each full-text/LEAD pair, recall and precision rates were compared to see how consistent increases and decreases were with respect to expectations.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>