<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1071">
  <Title>QUERY PROCESSING FOR RETRIEVAL FROM LARGE TEXT BASES</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
QUERY PROCESSING FOR RETRIEVAL FROM
LARGE TEXT BASES
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> Natural language experiments in information retrieval have often been inconclusive due to the lack of large text bases with associated queries and relevance judgments. This paper describes experiments in incremental query processing and indexing with the INQUERY information retrieval system on the TIPSTER queries and document collection. The results measure the value of processing tailored for different query styles, use of syntactic tags to produce search phrases, recognition and application of generic concepts, and automatic concept extraction based on interword associations in a large text base.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="353" type="metho">
    <SectionTitle>
1. INTRODUCTION: TIPSTER AND
INQUERY
</SectionTitle>
    <Paragraph position="0"> Previous research has suggested that retrieval effectiveness might be enhanced by the use of multiple representations and by automated language processing techniques. Techniques include automatic or interactive introduction of synonyms \[Har88\], forms-based interfaces \[CD90\], automatic recognition of phrases \[CTL91\], and relevance feedback \[SB90\]. The recent development of the TIPSTER corpus with associated queries and relevance judgments has provided new opportunities for judging the effectiveness of these techniques on large heterogeneous document collections.</Paragraph>
    <Paragraph position="1"> 1.1. TIPSTER Text Base and Query Topics
The TIPSTER documents comprise two volumes of text, of approximately one gigabyte each, from sources such as newspaper and magazine articles and government publications (Federal Register). Accompanying the collections are two sets of fifty topics. Each topic is a full text description, in a specific format, of an information need (Figure 1).</Paragraph>
    <Paragraph position="2"> Each TIPSTER topic offers several representations of the same information need. The Topic and Description fields are similar to what might be entered as a query in a traditional information retrieval system. The Narrative field expands on the information need, giving an overview of the classes of documents which would or would not be considered satisfactory, and describes facts that must be present in relevant documents, for example, the location of the company. The Concepts field lists words and phrases which are pertinent to the query. The Factors field lists constraints on the geographic and/or time frame of the query. All of these fields offer opportunities for different kinds of natural language processing.</Paragraph>
    <Paragraph position="3"> Figure 1 (example topic): &lt;top&gt; &lt;dom&gt; Domain: International Economics &lt;Title&gt; Topic: Satellite Launch Contracts &lt;desc&gt; Description: Document will cite the signing of a contract or preliminary agreement, or the making of a tentative reservation, to launch a commercial satellite. &lt;narr&gt; Narrative: A relevant document will mention the signing of a contract or preliminary agreement, or the making of a tentative reservation, to launch a commercial satellite. &lt;con&gt; Concept(s): 1. contract, agreement 2. launch vehicle, rocket, payload, satellite 3. launch services, commercial space industry, commercial launch industry 4. Arianespace, Martin Marietta, General Dynamics, McDonnell Douglas 5. Titan, Delta II, Atlas, Ariane, Proton</Paragraph>
    <Section position="1" start_page="0" end_page="353" type="sub_section">
      <SectionTitle>
1.2. The INQUERY Information Re-
trieval System
</SectionTitle>
      <Paragraph position="0"> INQUERY is a probabilistic information retrieval system based upon a Bayesian inference network model \[TC91, Tur91\]. The object network consists of object nodes (documents) (oj's) and concept representation nodes (rm's). In a typical network information retrieval system, the text representation nodes will correspond to words extracted from the text \[SM83\], although representations based on more sophisticated language analysis are possible. The estimation of the probabilities P(rm|oj) is based on the occurrence frequencies of concepts in both individual objects and large collections of objects. In the INQUERY system, representation nodes are the word stems and numbers that occur in the text, after stopwords are discarded.</Paragraph>
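The last two sentences above can be sketched in code. The following Python fragment is a toy illustration, not INQUERY's actual implementation: the stemmer, the stopword list, and the tf.idf-style belief formula are all simplified stand-ins showing how representation nodes might be derived from text and how P(rm|oj) might be estimated from occurrence frequencies.

```python
from collections import Counter
import math, re

STOPWORDS = {"the", "of", "a", "to", "in", "and", "or", "for"}

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer (illustrative only).
    for suf in ("ing", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def representation_nodes(text):
    # Representation nodes: word stems and numbers, stopwords discarded.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

def belief(term, doc_tokens, n_docs, doc_freq):
    # P(rm|oj) estimated from within-document and collection frequencies.
    # This tf.idf-shaped estimate is a sketch, not INQUERY's exact formula.
    tf = Counter(doc_tokens)[term]
    if tf == 0:
        return 0.0
    idf = math.log((n_docs + 1) / (doc_freq.get(term, 0) + 0.5))
    return (tf / (tf + 1.0)) * idf
```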
    </Section>
  </Section>
  <Section position="4" start_page="353" end_page="355" type="metho">
    <SectionTitle>
2. QUERY PROCESSING
EXPERIMENTS
</SectionTitle>
    <Paragraph position="0"> Our current set of natural language techniques for query enhancement is:
* deletion of potentially misleading text;
* grouping of proper names and interrelated noun phrase concepts;
* automatic concept expansion;
* simple rule-based interactive query modification.
Future experiments will use more extensive automatic noun phrase processing and paragraph-level retrieval. In addition to the traditional recall/precision table, we show tables of the precision for the top n documents retrieved, for 5 values of n. The recall/precision table measures the ability of the system to retrieve all of the documents known to be relevant. The precision for the top n documents gives a better measure of what a person would experience in using the system.</Paragraph>
    <Paragraph position="1"> 2.1. Deletion processes.</Paragraph>
    <Paragraph position="2"> Table 1 illustrates an incremental query treatment. The (Words) column shows results from the unprocessed words of the query alone. (Formatting information, such as field markers, has been removed.) The first active processing (Del) removes words and phrases which refer to the information retrieval processes rather than the information need, for example, A relevant document will describe .... We further remove words and phrases which are discursive, like point of view, sort of, discuss, mention, as well as expressions which would require deep inference to process, such as effects of or purpose of (Figure 2). Some of these expressions would be useful in other retrieval contexts, and different lists would be appropriate in different domains. An interactive user is given feedback regarding deletions and could have the capability of selectively preventing deletion.</Paragraph>
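The deletion step can be sketched as pattern-based removal. The stop-phrase list below is a hypothetical, much shorter sample; the actual lists used in the experiments are not reproduced in the paper.

```python
import re

# Discourse phrases that describe the retrieval task rather than the
# information need. This list is a small illustrative sample, not the
# one used in the experiments.
DELETION_PATTERNS = [
    r"a relevant document will (describe|mention|cite)",
    r"point of view",
    r"sort of",
    r"\b(discuss|mention)\b",
    r"\b(effects|purpose) of\b",
]

def delete_discourse(query):
    # Remove each discourse pattern, then normalize whitespace.
    for pat in DELETION_PATTERNS:
        query = re.sub(pat, " ", query, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", query).strip()
```

In an interactive setting, each match would be reported to the user before removal, so that deletions can be selectively prevented.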
    <Paragraph position="3"> In the experiment in the fourth column (-Narr) the Narrative field has been deleted from each query. Since the Narrative field is usually a very abstract discussion of the criteria for document relevance, it is not well-suited to a system like INQUERY, which relies on matching words from the query to words in the document. New terms introduced by the Narrative field are rarely useful as retrieval terms (but note the small loss in precision at the very lowest level of recall).</Paragraph>
    <Section position="1" start_page="353" end_page="353" type="sub_section">
      <SectionTitle>
2.2. Grouping Noun Phrases and Recog-
nizing Concepts
</SectionTitle>
      <Paragraph position="0"> The simplest phrasing or grouping techniques are recognition of proper noun groups (Caps in Table 1) and recognition of multiple spellings for common concepts such as United States.</Paragraph>
      <Paragraph position="1"> Proximity and phrase operators for noun phrases. Simple noun phrase processing is done in two ways. Sequences of proper nouns are recognized as names and grouped as arguments to a proximity operator. The proximity operator requires that its arguments appear in strict order in a document, but allows an interword distance of three or less. Thus a query such as George Bush matches George Herbert Walker Bush in a document.</Paragraph>
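The ordered proximity operator described above can be sketched as a simple matcher over token lists (INQUERY's own operator works over its inverted index, so this is only an illustration of the semantics):

```python
def prox_match(phrase, doc, max_dist=3):
    # Ordered proximity: phrase terms must occur in strict order in doc,
    # with an interword distance of max_dist or less between matches.
    starts = [i for i, w in enumerate(doc) if w == phrase[0]]
    for start in starts:
        pos = start
        ok = True
        for term in phrase[1:]:
            nxt = None
            for j in range(pos + 1, min(pos + max_dist, len(doc) - 1) + 1):
                if doc[j] == term:
                    nxt = j
                    break
            if nxt is None:
                ok = False
                break
            pos = nxt
        if ok:
            return True
    return False
```

With max_dist=3, the query terms ["george", "bush"] match the document sequence ["george", "herbert", "walker", "bush"], exactly as in the example above.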
      <Paragraph position="2"> Secondly, the query is passed through a syntactic part of speech tagger \[Chu88\], and rules are used to identify noun phrases (Figure 2). Experiments showed that very simple noun phrase rules work better than longer, more complex, noun phrases. We believe this is because the semantic relationships expressed in associated groups of noun phrases in a query may be expressed in a document as a compound noun group, a noun phrase with prepositional phrase arguments, a complex sentence, or a sequence of sentences linked by anaphora. This hypothesis is supported by the success of the unordered text window operator used in the interactive query modification experiments (Table 4).</Paragraph>
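As an illustration of the simple-rules finding, a minimal grouper over (word, tag) pairs, assuming Penn-style tags. The actual rules applied to the tagger's output are not given in the paper; this one-rule sketch (adjectives and nouns run together, at least one noun required) only shows the shape of the approach.

```python
def noun_phrases(tagged):
    # tagged: list of (word, tag) pairs from a part-of-speech tagger.
    # One deliberately simple rule: collect runs of adjectives and nouns,
    # and keep multi-word runs that contain at least one noun.
    phrases, current = [], []

    def flush():
        words = [w for w, _ in current]
        has_noun = any(t.startswith("NN") for _, t in current)
        if has_noun and len(words) > 1:
            phrases.append(" ".join(words))
        current.clear()

    for word, tag in tagged:
        if tag == "JJ" or tag.startswith("NN"):
            current.append((word, tag))
        else:
            flush()
    flush()
    return phrases
```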
      <Paragraph position="3"> On the other hand, there are verbal &amp;quot;red herrings&amp;quot; in some query noun phrases due to overprecise expression. For example, the phrase U.S. House of Representatives would be more effective for retrieval without the U.S. component (Congress might be even nicer).</Paragraph>
    </Section>
    <Section position="2" start_page="353" end_page="355" type="sub_section">
      <SectionTitle>
2.3. Concept Recognition
</SectionTitle>
      <Paragraph position="0"> Controlled vocabulary. The INQUERY system has been designed so that it is easy to add optional object types to implement a controlled indexing vocabulary \[CCH92\]. For example, when a document refers to a company by name, the document is indexed both by the company name (words in the text) and the object type (#company). The standard INQUERY document parsers recognize the names of companies \[Rau91\], countries, and cities in the United States.</Paragraph>
      <Paragraph position="5"> With wide-ranging queries like the TIPSTER topics, we have had some success with adding #us-city (and #foreigncountry) concepts to queries that request information on the location of an event (Table 2). But the terms #company and #usa have not yet proved consistently useful. The #company concept may be used to good effect to restrict other operators. For example, looking for the terms machine, translation, and #company in an n-word text window would give good results with respect to companies working on or marketing machine translation products. But the current implementation of the #company concept recognizer has some shortcomings which are exposed by this set of queries. Our next version of the recognizer will be more precise and complete,1 and we expect significant improvement from it.</Paragraph>
      <Paragraph position="6"> The #usa term tends to have unexpected effects, because a large part of the collection consists of articles from U.S. publications. In these documents U.S. nationality is often taken for granted (term frequency of #usa=294408, #foreigncountry=472021), and it is likely that it may be mentioned explicitly only when that presupposition is violated, or when both U.S. and non-U.S. issues are being discussed together in the same document. Therefore, because focussing on the #usa concept will bring in otherwise irrelevant documents, it is more effective to put negative weight on the #foreigncountry concept where the query interest is restricted to U.S. matters. For the same reason, in a query focussed only on non-U.S. interests, we would expect the opposite: using #foreigncountry should give better performance than #NOT(#usa).</Paragraph>
      <Paragraph position="7"> 1 Ralph Weischedel's group at BBN has been generous in sharing their company database for this purpose.</Paragraph>
      <Paragraph position="9"> Research continues on the 'right' mix of concept recognizers for a document collection. In situations where text and queries are more predictable, such as commercial customer support environments, an expanded set of special terms and recognizers is appropriate. Names of products and typical operations and objects can be recognized and treated specially both at indexing and at query time. Our work in this area reveals a significant improvement due to domain-specific concept recognizers; however, standardized queries and relevance judgments are still being developed.</Paragraph>
      <Paragraph position="10"> Figure 2 (query processing example). Original: Document will cite the signing of a contract or preliminary agreement, or the making of a tentative reservation, to launch a commercial satellite. Discourse phrase and word deletion: the signing of a contract or preliminary agreement, or the making of a tentative reservation, to launch a commercial satellite. Proper noun group recognition (Concept field): ...</Paragraph>
      <Paragraph position="11"> Automatic concept expansion. We have promising preliminary results for experiments in automatic concept expansion. The Expand results in Table 3 were produced by adding five additional concepts to each query. The concepts were selected based on their preponderant association with the query terms in the text of the 1987 Wall Street Journal articles from Volume 1 of the TIPSTER corpus. The improvement is modest, and we anticipate better results from refinements in the selection techniques and a larger and more heterogeneous sample of the corpus.</Paragraph>
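The expansion step can be sketched as follows: rank candidate terms by their co-occurrence with the query terms and take the top k. The real selection used interword association statistics computed over the WSJ sample; this document-level co-occurrence count is only a toy stand-in showing the shape of the computation.

```python
from collections import Counter

def expand_query(query_terms, docs, k=5):
    # Rank candidate terms by how often they co-occur (at document level)
    # with any query term, then return the top k as expansion concepts.
    # A simplified stand-in for corpus-wide interword association scoring.
    assoc = Counter()
    qset = set(query_terms)
    for doc in docs:
        words = set(doc.lower().split())
        if qset.intersection(words):
            for w in words - qset:
                assoc[w] += 1
    return [w for w, _ in assoc.most_common(k)]
```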
      <Paragraph position="13"> 2.4. Semi-Automatic query processing.</Paragraph>
      <Paragraph position="14"> In the following experiments in interactive query processing, human intervention was used to modify the output of the best automatic query processing. The person making the modifications was permitted to:
1. Add words from the Narrative field;
2. Delete words or phrases from the query;
3. Specify a text window size for the occurrence of words or phrases in the query.</Paragraph>
      <Paragraph position="15"> The third restriction simulates a paragraph-based retrieval.</Paragraph>
      <Paragraph position="16"> Table 4 summarizes the results of the interactive query modification techniques compared with the best automatic query processing Q-1 (similar to NP in the other tables). The Q-M query-set was created with rules (1) and (2) only. The Q-O query-set used all three rules. The improvement over the results from automatically generated queries demonstrates the effectiveness of simple user modifications after automatic query processing has been performed. The most dramatic improvement comes at the top end of the recall scale, which is highly desirable behavior in an interactive system. The results also suggest that, based on the text window simulation, paragraph-based retrieval can significantly improve effectiveness. (Table 2: with the concepts #us-city and #foreigncountry; we do not yet have a #foreigncity recognizer.)</Paragraph>
    </Section>
  </Section>
</Paper>