<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1062"> <Title>INTERPRETATION OF PROPER NOUNS FOR INFORMATION RETRIEVAL</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> Most of the unknown words that degrade the performance of natural language processing systems are proper nouns. At the same time, proper nouns are recognized as a crucial source of information for identifying the topic of a text, extracting its content, or detecting relevant documents in information retrieval (Rau, 1991).</Paragraph> <Paragraph position="1"> In information retrieval, proper nouns in queries frequently serve as the most important key terms for identifying relevant documents in a database.</Paragraph> <Paragraph position="2"> Furthermore, common nouns (e.g. 'developing countries') or group proper nouns (e.g. 'U.S.</Paragraph> <Paragraph position="3"> government') in queries sometimes need to be expanded to their constituent set of proper nouns in order to serve as useful retrieval terms. We have implemented two solutions to this problem. The first expands a query term such as 'U.S. government' to all possible names and variants of United States government entities. The second assigns categories from a proper noun classification scheme to every proper noun in both documents and queries, permitting proper noun matching at the category level.</Paragraph> <Paragraph position="4"> Category matching is more efficient than keyword matching when the request is for an entity of a particular type.
For example, queries about government regulation of the use of agrochemicals on imported produce require the presence of the following proper noun categories: government agency, chemical, and foreign country.</Paragraph> <Paragraph position="5"> Our proper noun classification scheme, which was developed through corpus analysis of newspaper texts, is organized as a hierarchy of 9 branching nodes and 30 terminal nodes. Currently, we use only the terminal nodes to assign categories to proper nouns in texts. Based on an analysis of 588 proper nouns from a set of randomly selected Wall Street Journal documents, we found that our 29 meaningful categories correctly accounted for 89% of all proper nouns in the texts. We reserve the remaining terminal node as a miscellaneous category. Figure 1 shows a hierarchical view of our proper noun categorization scheme.</Paragraph> </Section> </Paper>