File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/98/w98-1219_intro.xml
Size: 3,360 bytes
Last Modified: 2025-10-06 14:06:46
<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1219"> <Title>Proper Name Classification in an Information Extraction Toolset</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Understanding human languages on any sort of scale is a knowledge-intensive task. This paper describes a corpus-based approach to gathering language data in the shallow parts of the NLP pond.</Paragraph> <Paragraph position="1"> Information retrieval is a popular application for researchers interested in applied NLP, but the problem of improving retrieval effectiveness appears to be intractable (Smeaton, 1992; Wallis, 1995). One helpful technique is tagging the proper names in text. Tagging and classifying the named entities in text (e.g. is &quot;Washington&quot; a place or a person?), along with co-references to them (she, he, the company), is also a primary concern in systems for information extraction (DARPA, 1995).</Paragraph> <Paragraph position="2"> Information extraction (IE) is a well-defined task: the aim is to extract data from free text and put it in a more structured format. The IE task is not only well defined; it also has practical application and is hence often seen as a prime example of language engineering, where the aim is to explicitly solve a problem rather than to understand the nature of language.</Paragraph> <Paragraph position="3"> IE systems have typically been successful only in narrow domains, with significant effort required to move an existing information extraction system to a new problem domain. One approach is to use tools in a development environment that helps the language engineer create a new information extraction system from pre-existing components.</Paragraph> <Paragraph position="4"> The DSTO Fact Extractor Workbench provides the tools to create re-usable text skimming components, called fact extractors, that perform IE on a (very) limited domain. 
These components can be used directly to find things like dates and the names of companies, including co-references, or they can be assembled to create larger fact extractors that skim text for more abstract entities such as company mergers.</Paragraph> <Paragraph position="5"> The workbench provides different views of the domain text to assist in the development process. As an example, the language engineer might be interested in seeing how the word &quot;bought&quot; is used in the domain of interest. A &quot;grep&quot;-like tool allows him or her to view all and only those sentences containing &quot;bought&quot;. Naturally, more complex patterns are possible, incorporating previously developed fact extractors.</Paragraph> <Paragraph position="6"> This paper discusses an extension to the corpus viewing tool set that helps the language engineer find words, called selector terms, that may aid in the classification of proper nouns and the determination of possible co-references for those nouns. First, we describe the domain in which we are applying our fact extractors. Next, we introduce our method of measuring the suitability of words as selector terms.</Paragraph> <Paragraph position="7"> Lastly, we discuss how this data is collected and presented in the fact extractor workbench.</Paragraph> </Section> </Paper>