<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1036">
  <Title>Linguistic Knowledge can Improve Information Retrieval</Title>
  <Section position="3" start_page="0" end_page="262" type="intro">
    <SectionTitle>
2 Conceptual Indexing
</SectionTitle>
    <Paragraph position="0"> The conceptual indexing and retrieval system used for these experiments automatically extracts words and phrases from unrestricted text and organizes them into a semantic network that integrates syntactic, semantic, and morphological relationships.</Paragraph>
    <Paragraph position="1"> The resulting conceptual taxonomy (Woods, 1997) is used by a specific passage-retrieval algorithm to deal with many paraphrase relationships and to find specific passages of text where the information sought is likely to occur. The system uses a lexicon containing syntactic, semantic, and morphological information about words, word senses, and phrases to provide a base source of semantic and morphological relationships that are used to organize the taxonomy. It also uses an extensive system of knowledge-based morphological rules and functions to analyze words that are not already in its lexicon, in order to construct new lexical entries for previously unknown words (Woods, 2000). Besides rules for handling derived and inflected forms of known words, the system includes rules for lexical compounds and rules that are capable of making reasonable guesses for totally unknown words.</Paragraph>
    <Paragraph position="2"> A pilot version of this indexing and retrieval system, implemented in Lisp, uses a collection of approximately 1200 knowledge-based morphological rules to extend a core lexicon of approximately 39,000 words to give coverage that exceeds that of an English lexicon of more than 80,000 base forms (or 150,000 base plus inflected forms). Later versions of the conceptual indexing and retrieval system, implemented in C++, use a lexicon of approximately 150,000 word forms that is automatically generated by the Lisp-based morphological analysis from its core lexicon and an input word list. The base lexicon is extended further by an extensive name dictionary and by further morphological analysis of unknown words at indexing time. This paper will describe some experiments using several versions of this system. In particular, it will focus on the role that the linguistic knowledge sources play in its operation.</Paragraph>
    <Paragraph position="3"> The lexicon used by the conceptual indexing system contains syntactic information that can be used for the analysis of phrases, as well as morphological and semantic information that is used to relate more specific concepts to more general concepts in the conceptual taxonomy. This information is integrated into the conceptual taxonomy by considering base forms of words to subsume their derived and inflected forms (&quot;root subsumption&quot;) and more general terms to subsume more specific terms. The system uses these relationships as the basis for inferring subsumption relationships between more general phrases and more specific phrases according to the intensional subsumption logic of Woods (Woods, 1991).</Paragraph>
    <Paragraph position="4"> The largest base lexicon used by this system currently contains semantic subsumption information for more than 15,000 words. This information consists of basic &quot;kind of&quot; and &quot;instance of&quot; information such as the fact that book is a kind of document and washing is a kind of cleaning. The lexicon also records morphological roots and affixes for words that are derived or inflected forms of other words, and information about different word senses and their interrelationships. For example, the conceptual indexing system is able to categorize becomes black as a kind of color change because becomes is an inflected form of become, become is a kind of change, and black is a color. Similarly, color disruption is recognized as a kind of color change, because the system recognizes disruption as a derived form of disrupt, which is known in the lexicon to be a kind of damage, which is known to be a kind of change.</Paragraph>
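The chain of inferences in the two examples above can be sketched in a few lines of Python. This is only an illustrative toy, not the system's actual implementation: the tables IS_A and ROOT_OF, the function names, and the greedy word-matching strategy are all assumptions made for the sketch, and the matching is a crude stand-in for the intensional subsumption logic cited in the text.

```python
# Hypothetical toy lexicon; the entries mirror only the examples in the text.
IS_A = {
    "become": "change",
    "disrupt": "damage",
    "damage": "change",
    "black": "color",
    "book": "document",
    "washing": "cleaning",
}

# Hypothetical root table: inflected or derived form mapped to its base form.
ROOT_OF = {
    "becomes": "become",
    "disruption": "disrupt",
}


def generalizations(word):
    """Return the word, its root (if any), and every term that subsumes it."""
    chain = [word]
    current = ROOT_OF.get(word, word)
    if current != word:
        chain.append(current)  # root subsumption step
    while current in IS_A:
        current = IS_A[current]  # "kind of" or "instance of" step
        chain.append(current)
    return chain


def phrase_subsumes(general_phrase, specific_phrase):
    """Greedy check: each general word must subsume a distinct specific word.

    Word order is ignored, which is a simplification made for this sketch.
    """
    unused = specific_phrase.split()
    for general_word in general_phrase.split():
        match = None
        for candidate in unused:
            if general_word in generalizations(candidate):
                match = candidate
                break
        if match is None:
            return False
        unused.remove(match)
    return True
```

Under these assumptions, phrase_subsumes("color change", "becomes black") and phrase_subsumes("color change", "color disruption") both succeed, because each word of the general phrase is reached by following root and kind-of links from a word of the specific phrase.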
    <Paragraph position="5"> When using root subsumption as a technique for information retrieval, it is important to have a core lexicon that knows the correct morphological analyses for words that the rules would otherwise analyze incorrectly. The following are examples of words that could be analyzed incorrectly if the correct interpretations were not specified in the lexicon: delegate (de + leg + ate), take the legs from; caress (car + ess), female car; cashier (cashy + er), more wealthy; daredevil (dared + evil), serious risk; lacerate (lace + rate), speed of tatting; pantry (pant + ry), heavy breathing; pigeon (pig + eon), the age of peccaries; ratify (rat + ify), infest with rodents; infantry (infant + ry), childish behavior. Although they are not always as humorous as the above examples, there are over 3,000 words in the core lexicon of 39,000 English words that would receive false morphological analyses like these if the words were not already in the lexicon.</Paragraph>
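The role of the core lexicon as a guard against false analyses can be illustrated with a minimal Python sketch. Everything here is hypothetical: the suffix list, the word sets, and the function names are invented for the illustration, and the naive suffix stripping is far simpler than the knowledge-based morphological rules the paper describes.

```python
# Hypothetical suffix rules, chosen only to reproduce examples from the text.
SUFFIXES = ("ess", "ry", "eon", "ify")

# Hypothetical lexicon fragments.
BASE_WORDS = {"car", "pant", "pig", "rat", "infant", "lion"}
EXCEPTIONS = {"caress", "pantry", "pigeon", "ratify", "infantry"}


def rule_analysis(word, known_words):
    """Naive rule: split off a suffix when the remaining stem is a known word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in known_words:
                return (stem, suffix)
    return (word,)  # no rule applies; leave the word unanalyzed


def analyze(word, lexicon):
    """Consult the lexicon first; fall back to the rules only for unknown words."""
    if word in lexicon:
        return (word,)  # a listed word keeps its own entry
    return rule_analysis(word, lexicon)


FULL_LEXICON = BASE_WORDS | EXCEPTIONS
```

With the full lexicon, analyze("caress", FULL_LEXICON) leaves the word intact, while analyze("caress", BASE_WORDS) yields the false split into car and ess. A word that is absent from the lexicon but genuinely derived, such as lioness in this toy setup, is still analyzed by the rules.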
  </Section>
</Paper>