<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1012">
  <Title>OVERVIEW OF TREC-1</Title>
  <Section position="4" start_page="61" end_page="61" type="metho">
    <SectionTitle>
2. THE TASK
2.1 Introduction
</SectionTitle>
    <Paragraph position="0"> TREC is dcsigned to encouraage research in information retrieval using large data collections. Two types of retrieval and being examined -- retrieval using an &amp;quot;adhoc&amp;quot; query such as a researcher might use in a library environment, and retrieval using a &amp;quot;routing&amp;quot; query such as a profile to filter some incoming document stream. It is assumed that potential users need the ability to do both high precision and high recall searches, and are willing to look at many documents and repeatedly modify queries in order to get high recall. Obviously they would like a system that makes this as easy as possible, but this ease should be reflected in TREC as added intelligence in the system rather than as special interfaces.</Paragraph>
    <Paragraph position="1"> Since 'IREC has been designed to evaluate system performance both in a routing (filtering or profiling) mode, and in an ad-hoc mode, both functions need to be tested.</Paragraph>
    <Paragraph position="2"> The test design was based on traditional information retrieval models, and evaluation used traditional recall and precision measures. The following diagram of the test design shows the various components of TREC (Figure 1).</Paragraph>
    <Paragraph position="3">  This diagram reflects the four data sets (2 sets of topics and 2 sets of documents) that were provided to participants. These data sets (along with a set of sample relevance judgments for the 50 training topics) were used to construct three sets of queries. Q1 is the set of queries (probably multiple sets) created to help in adjusting a system to this task, create better weighting algorithms, and in general to train the system fox testing. The results of this research were used to create Q2, the routing queries to be used against the test documents.</Paragraph>
    <Paragraph position="4"> Q3 is the set of queries created from the test topics as ad-hoc queries for searching against the combined documents (both training documents and test documents).</Paragraph>
    <Paragraph position="5"> The results from searches using Q2 and Q3 were the official test results. The queries could be constructed using one of three alternative methods. They could be constructed automatically from the topics, with no human intervention. Alternatively they could be constructed manually from the topic, but with no &amp;quot;retries&amp;quot; after looking at the results. The third method allowed &amp;quot;retries&amp;quot;, but under eonslrained conditions.</Paragraph>
    <Section position="1" start_page="61" end_page="61" type="sub_section">
      <SectionTitle>
2.2 The Participants
</SectionTitle>
      <Paragraph position="0"> There were 25 participating systems in TREC-1, using a wide range of retrieval techniques. The participants were able to choose from three levels of participation: Category A, full participation, Category B, full participation using a reduced dataset (25 topics and 1/4 of the full document set), and Category C for evaluation only (to allow commercial systems to protect proprietary algorithms). The program committee selected only twenty category A and B groups to present talks because of limited conference time, and requested that the rest of the groups present posters. All groups were asked to submit papers for the proceedings.</Paragraph>
      <Paragraph position="1"> Each group was provided the data and asked to turn in either one or two sets of results for each topic. When two sets of results were sent, they could be made using different methods of creating queries (methods 1, 2, or 3), or by using different parameter settings for one query creation method. Groups could chose to do the routing task, the adhoc task, or both, and were requested to submit the top 200 documents retrieved for each topic for evaluation.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="61" end_page="89" type="metho">
    <SectionTitle>
3. THE TEST COLLECTION
</SectionTitle>
    <Paragraph position="0"> Critical to the success of TREC was the creation of the test collection. Like most traditional retrieval collections, there are three distinct parts to this collection.</Paragraph>
    <Paragraph position="1"> The first is the documents themselves -- the training set (D1) and the test set (D2). Both were distributed as CD-ROMs with about 1 gigabyte of data each, compressed to fit. The training topics, the test topics  and the relevance judgments were supplied by email.</Paragraph>
    <Paragraph position="2"> These components of the test collection -- the documents, the topics, and the relevance judgments, are discussed in the rest of this section.</Paragraph>
    <Section position="1" start_page="62" end_page="89" type="sub_section">
      <SectionTitle>
3.1 The Documents
</SectionTitle>
      <Paragraph position="0"> The particular sources were selected because they reflected the different types of documents used in the imagined TREC application. Specifically they had a varied length, a varied writing style, a varied level of editing and a varied vocabulary. All participants were required to sign a detailed user agreement for the data in order to protect the copyrighted source material. The documents were uniformly formatted into an SGML-Iike structure, as can be seen in the following example.</Paragraph>
      <Paragraph position="1">  the first of a new generation of phone services with broad implications for computer and communications equipment markets.</Paragraph>
      <Paragraph position="2"> AT&amp;T said it is the first national long-distance cartier to announce prices for specific services under a world-wide standardization plan to upgrade phone networks. By announcing commercial services under the plan, which the industry calls the Integrated Services Digital Network, AT&amp;T will influence evolving communications standards to its advantage, consultants said, just as International Business Machines Corp. has created de facto computer standards favoring its products. null  unique DOCNO id field. Additionally other fields taken from the initial data appeared, but these varied widely across the different sources. The documents also had different amounts of errors, which were not checked or corrected. Not only would this have been an impossible task, but the errors in the data provided a better simulation of the real-world tasks. Table 1 shows some basic document collection statistics.</Paragraph>
      <Paragraph position="3">  Note that although the collection sizes are roughly equivalent in megabytes, there is a range of document lengths from very short documents (DOE) to very long (FR). Also the range of document lengths within a collection varies. For example, the documents from AP are similar in length (the median and the average length are very close), but the WSJ and ZIFF documents have a wider range of lengths. The documents from the Federal Register (FR) have a very wide range of lengths.</Paragraph>
      <Paragraph position="4"> What does this mean to the TREC task? First, a major portion of the effort for TREC-1 was spent in the system engineering necessary to handle the huge number of documents. This means that little time was left for system tuning or experimental runs, and therefore the TREC-1 results can best be viewed as a baseline for later research. The longer documents also required major adjustments to the algorithms themselves (or loss of performance). This is particularly true for the very long documents in FR. Since a relevant document might  contain only one or two relevant sentences, many algorithms needed adjustment from working with the abstract length documents found in the old collections. Additionally many documents were composite stories, with different topics, and this caused problems for most algorithms. null</Paragraph>
    </Section>
    <Section position="2" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
3.2 The Topics
</SectionTitle>
      <Paragraph position="0"> In designing the TREC task, there was a conscious decision made to provide &amp;quot;user need&amp;quot; statements rather than more traditional queries. Two major issues were involved in this decision. First there was a desire to allow a wide range of query construction methods by keeping the topic (the need statement) distinct from the query (the actual text submitted to the system). The second issue was the ability to increase the amount of information available about each topic, in particular to include with each topic a clear statement of what criteria make a document relevant. The topics were designed to mimic a real user's need, and were written by people who are actual users of a retrieval system. Although the subject domain of the topics was diverse, some consideration was given to the documents to be searched.</Paragraph>
      <Paragraph position="1"> The following is one of the topics used in TREC.</Paragraph>
      <Paragraph position="2">  &lt;top&gt; &lt;head&gt; Tipster Topic Description &lt;num&gt; Number: 066 &lt;dom&gt; Domain: Science and Technology &lt;title&gt; Topic: Natural Language Processing &lt;desc&gt; Description: Document will identify a type of natural language processing technology which is being developed or marketed in the U.S.</Paragraph>
      <Paragraph position="3"> &lt;narr&gt; Narrative: A relevant document will identify a company or institution developing or marketing a natural language processing technology, identify the technology, and identify one or more features of the company's product.</Paragraph>
      <Paragraph position="4"> &lt;con&gt; Concept(s): 1. natural language processing 2. translation, language, dictionary, font 3. software applications &lt;fac&gt; Factor(s): &lt;nat&gt; Nationality: U.S.</Paragraph>
      <Paragraph position="5"> &lt;/fac&gt; &lt;def&gt; Definition(s): &lt;/top&gt;</Paragraph>
    </Section>
    <Section position="3" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
3.3 The Relevance Judgments
</SectionTitle>
      <Paragraph position="0"> The relevance judgments are of critical importance to a test collection. For each topic it is necessary to compile a list of relevant documents; hopefully as comprehensive a list as possible. Relevance judgments were made using a sampling method, with the sample constructed by taking the top 100 documents retrieved by each system for a given topic and merging them into a pool for relevance assessment. This sampling, known as pooling, proved to be an effective method. There was little overlap among the 25 systems in their retrieved documents.</Paragraph>
      <Paragraph position="1"> For example, out of a maximum of 3300 unique documents (33 runs times 100 documents), over one-third were actually unique. This means that the different systems were finding different documents as likely relevant documents for a topic. One reason for the lack of overlap is the very large number of documents that contain many of the same keywords as the relevant documents, but probably a larger reason is the very different sets of keywords in the constructed queries. This lack of overlap should improve the coverage of the relevance set, and verifies the use of the pooling methodology to produce the sample.</Paragraph>
      <Paragraph position="2"> The merged list of results was then shown to the human assessors. Each topic was judged by a single assessor to insure the best consistency of judgment and varying numbers of documents were judged relevant to the topics (with a median of about 250 documents).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>