<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1039">
  <Title>Session 5b. Information Retrieval</Title>
  <Section position="2" start_page="203" end_page="203" type="ackno">
    <SectionTitle>
2. Papers
</SectionTitle>
    <Paragraph position="0"> The first paper, &amp;quot;Information Retrieval using Robust Natural Language Processing&amp;quot; by Tomek Strzalkowski of New York University, augments a basic statistical information retrieval system with various natural language components. One of these components is the replacement of the standard morphological stemmer with a dictionary-assisted stemmer, improving average precision by 6 to 8%, even in the small test collection being used. Additionally a very fast syntactic parser is used to derive certain types of phrases from the text. These phrases, in addition to the single terms, make for a richer representation of the text, and are also used to expand the queries. The query expansion involves finding similarity relationships between terms in these phrases, and then filtering these relationships to carefully select which terms to add to the query. This filtering (which adds only 1.5% of the possible relations) enables a performance improvement in average precision of over 13%, a significant result for this small test collection. The paper therefore addresses two of the major issues in information retrieval: improving accuracy (precision) using a better stemmer, and improving completeness (recall), without losing accuracy, by adding carefully selected terms to the query.</Paragraph>
    <Paragraph position="1"> The second paper, &amp;quot;Feature Selection and Feature Extraction for Text Categorization&amp;quot; by David D. Lewis of the University of Chicago, deals with the problem of text categorization, or the assigning of texts to predefined categories using automated methods based on the text contents. Two particular areas are investigated. The first area involves finding appropriate statistical methods for assigning categories. Adaptions are made to a statistical model from text retrieval, and methods for determining actual category assignments rather than probability estimates are discussed. The second area of research examines various techniques for selecting the text features for use in this statistical method. Three types of features are tried: 1) single terms from the text, 2) simple noun phrases found using a stochastic class tagger and a simple noun phrase bracketing program, and 3) small clusters of features constructed using several methods. Additionally the effect of using smaller sets of all three types of features is investigated, and is shown to be more effective than using the full set. The problem of selecting which features of text to index is important in information retrieval, as often the terms in the queries are both inaccurate and insufficient for complete retrieval. By improving the indexing of the text, such as by adding selected phrases, clusters, or other features, these queries can be more successful. This work will continue with the larger test collections becoming available in the future.</Paragraph>
    <Paragraph position="2"> The third paper, &amp;quot;Inferencing in Information Retrieval&amp;quot; by Alexa T. McCray of the National Library of Medicine, describes an information retrieval system being designed for the biomedical domain. This system takes advantage of the complex thesaurii built and maintained by the National Library of Medicine by making use of a metathesaurus and semantic network based on these thesaurii. The system uses a syntactic parser against the queries, related text, the metathesaurus, and an online dictionary to construct noun phrases that are grouped into concepts. It then attempts to match these concepts against documents that have not only some naturally-occurring text, but also manual indexing terms based on the thesaurii. The paper discusses the problems found in mapping the language of the queries to the language of the relevant documents, a major difficulty for all information retrieval systems. In this case, as opposed to the earlier papers, the features of the text that are indexed are fixed, and the issue is how to properly construct queries, or properly map natural language queries, into structures that will match the text features.</Paragraph>
    <Paragraph position="3"> The fourth paper, &amp;quot;Classifying Texts using Relevancy Signatures&amp;quot; by Ellen Riloff and Wendy Lehnert of the University of Massachusetts, investigates feature selection for text classification, as did the second paper. The application here, however, is not how to route text into multiple predefined categories, but how to separate articles into only two sets: those relevant to a specific but complex topic, and those not relevant. This is used as a filtering or text skimming preprocessor to text extraction. The paper describes the design of an algorithm that will locate linguistic expressions that are reliable clues to document relevancy. These expressions are found by parsing the training set as input to the algorithm, and then automatically selecting the expressions or features that occur in the relevant and non-relevant documents.</Paragraph>
    <Paragraph position="4"> These features can then be used for later classification in new collections. As contrasted to the second paper, the techniques rely on analysis of the training collection to locate features, rather than on trying to identify more general methods of constructing features from the text.</Paragraph>
  </Section>
class="xml-element"></Paper>