<?xml version="1.0" standalone="yes"?>
<Paper uid="P92-1014">
  <Title>INFORMATION RETRIEVAL USING ROBUST NATURAL LANGUAGE PROCESSING</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> A typical information retrieval (IR) task is to select documents from a database in response to a user's query, and rank these documents according to relevance. This has usually been accomplished using statistical methods (often coupled with manual encoding), but it is now widely believed that these traditional methods have reached their limits. 1 These limits are particularly acute for text databases, where natural language processing (NLP) has long been considered necessary for further progress. Unfortunately, the difficulties encountered in applying computational linguistics technologies to text processing have contributed to a widespread belief that automated NLP may not be suitable for IR. These difficulties included inefficiency, limited coverage, and the prohibitive cost of the manual effort required to build lexicons and knowledge bases for each new text domain. On the other hand, while numerous experiments did not establish the usefulness of NLP, they cannot be considered conclusive because of their very limited scale.</Paragraph>
    <Paragraph position="1"> Another reason is the limited scale at which NLP was used. Syntactic parsing of the database contents, for example, has been attempted in order to extract linguistically motivated &quot;syntactic phrases&quot;, which presumably were better indicators of content than &quot;statistical phrases&quot;, where words were grouped solely on the basis of physical proximity (e.g., &quot;college junior&quot; is not the same as &quot;junior college&quot;). These intuitions, however, were not confirmed by experiments; worse still, statistical phrases regularly outperformed syntactic phrases (Fagan, 1987). Attempts to overcome the poor statistical behavior of syntactic phrases have led to various clustering techniques that grouped synonymous or near-synonymous phrases into &quot;clusters&quot; and replaced these by single &quot;metaterms&quot;. Clustering techniques were somewhat successful in upgrading overall system performance, but their effectiveness was diminished by the frequently poor quality of syntactic analysis. Since full-analysis wide-coverage syntactic parsers were either unavailable or inefficient, various partial parsing methods have been used. Partial parsing was usually fast enough, but it also generated noisy data: as many as 50% of all generated phrases could be incorrect (Lewis and Croft, 1990). Other efforts concentrated on processing of user queries (e.g., Sparck Jones and Tait, 1984; Smeaton and van Rijsbergen, 1988).</Paragraph>
    <Paragraph position="2"> Since queries were usually short and few, even relatively inefficient NLP techniques could be of benefit to the system. None of these attempts proved conclusive, and some were never properly evaluated either.</Paragraph>
    <Paragraph position="3"> † Current address: Laboratoire d'Informatique, Universite de Fribourg, ch. du Musee 3, 1700 Fribourg, Switzerland; vauthey@cfmniSl.bitnet.</Paragraph>
    <Paragraph position="4"> 1 As far as automatic document retrieval is concerned. Techniques involving various forms of relevance feedback are usually far more effective, but they require the user's manual intervention in the retrieval process. In this paper, we are concerned with fully automated retrieval only.</Paragraph>
  </Section>
</Paper>