File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/92/h92-1042_intro.xml
Size: 6,209 bytes
Last Modified: 2025-10-06 14:05:16
<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1042"> <Title>INFERENCING IN INFORMATION RETRIEVAL</Title> <Section position="3" start_page="0" end_page="218" type="intro"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> Retrieval of information from computerized databases is a complex process whose success depends heavily on the user's knowledge of the structure and logic of the particular database being searched. Many databases have an associated controlled indexing vocabulary, or thesaurus, which is the primary access point to the material at search time. For example, the National Library of Medicine's MeSH® thesaurus includes some 16,000 headings that are available for indexing and searching the biomedical literature stored in MEDLINE®, NLM's bibliographic database. The major retrieval strategy is to coordinate MeSH terms with Boolean operators, although limited text word searching of titles and abstracts is also possible.</Paragraph> <Paragraph position="1"> Several years ago NLM launched its Unified Medical Language System™ (UMLS™) project. This is a major research initiative whose goal is to facilitate retrieval and integration of information from multiple disparate biomedical databases.</Paragraph> <Paragraph position="2"> NLM itself has developed and maintains over 40 databases, and there are many other sources of computerized information in the biomedical sciences. These include factual databases of various kinds, diagnostic expert systems, and clinical information systems, as well as bibliographic databases. The UMLS project is attempting to develop methods that provide access to these different systems, with their different vocabularies, in a way that allows the user to navigate among them with relative ease. 
Recent results of the project include an Information Sources Map of biomedical databases, a Metathesaurus™ of biomedical vocabularies, and a Semantic Network of high-level biomedical concepts [1,2]. The first release of the Information Sources Map contains a description of the scope, content, and access conditions for approximately fifty biomedical databases.</Paragraph> <Paragraph position="3"> The Metathesaurus includes over 67,000 biomedical concepts from a variety of controlled vocabularies. It provides definitions, lexical category information, hierarchical contexts, and interrelationships among many of the terms found in its constituent vocabularies. Each concept in the Metathesaurus has been assigned to at least one of the 131 semantic types in the Semantic Network. The Network has top-level nodes for organisms, anatomical structures, biologic function and dysfunction, chemicals, events, and concepts. The Network defines these types and establishes a set of 35 potential relationships between them. These include physical, temporal, functional, and conceptual links, e.g., part of, co-occurs with, causes, measures. The Network and the Metathesaurus together form a rich knowledge source of biomedical concepts.</Paragraph> <Paragraph position="4"> The knowledge sources will continue to be augmented and refined based on experimentation in a variety of applications, including our own.</Paragraph> <Paragraph position="5"> Our work is motivated by an interest in the development and testing of natural language processing techniques for improved methods of information retrieval. Document retrieval systems, in particular, are &quot;language-rich&quot; and afford the opportunity to conduct basic research in processing complex natural language text. The focus of our work is the development of SPECIALIST, an experimental NLP system for the biomedical domain [3,4,5]. 
The system includes a broad-coverage parser (see footnote 1) supported by a large lexicon, a module that accesses the UMLS knowledge sources, and a retrieval module. SPECIALIST runs on Sun workstations and is implemented in Quintus Prolog, with some support modules written in C.</Paragraph> <Paragraph position="6"> Footnote 1: During the academic year 1988-1989 we awarded a research contract to the Paoli Research Center of the Unisys Corporation. As a result of this successful collaboration between our two research groups, the syntactic component of the system is extremely robust. See [6,7] for a description of the Paoli system.</Paragraph> <Paragraph position="7"> We have recently conducted experiments using a test collection of user queries and MEDLINE citation records retrieved for those queries. The data for the test collection were selected from 2,000 search request forms submitted by health professionals to the NIH and NLM libraries. A total of 155 queries were chosen, approximately 50 in each of the three major areas covered by MEDLINE: clinical medicine research, basic science research, and health services research. Searches were conducted by an expert NLM searcher, and the approximately 3,000 citations retrieved were evaluated for relevance by a subject matter expert [8]. Each citation record in the collection includes a title and an author-prepared abstract.</Paragraph> <Paragraph position="8"> We parsed the queries, titles, and selected portions of abstracts in the test collection. For all successful parses, noun phrases were extracted, and whatever synonyms could be found in the Metathesaurus and in our online version of Dorland's Illustrated Medical Dictionary [5] were added to the noun phrases to form a concept group. We then attempted to match the concepts in the queries with the concepts in relevant citations. We found that the mapping involves a wide range of inferences. Only in rare cases do concepts map directly from queries to documents. 
More commonly, several inferences are necessary in order to determine that a citation is in fact relevant to a request.</Paragraph> <Paragraph position="9"> The remainder of this paper begins with a discussion of some of the salient issues in information retrieval. This is followed by a brief description of the major components of the SPECIALIST system, and the paper ends with an account of our recent investigations in mapping queries to relevant documents.</Paragraph> </Section> </Paper>