<?xml version="1.0" standalone="yes"?>
<Paper uid="X93-1022">
  <Title>UMASS/HUGHES: DESCRIPTION OF THE CIRCUS SYSTEM USED FOR TIPSTER TEXT</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> The primary goal of our effort is the development of robust and portable language processing capabilities for information extraction applications. The system under evaluation here is based on language processing components that have demonstrated strong performance in previous evaluations \[Lehnert et al. 1992a\]. Having demonstrated the general viability of these techniques, we are now concentrating on the practicality of our technology by creating trainable system components to replace hand-coded data and manually-engineered software.</Paragraph>
    <Paragraph position="1"> Our general strategy is to automate the construction of domain-specific dictionaries and other language-related resources so that information extraction can be customized for specific applications with a minimal amount of human assistance. We employ a hybrid system architecture that combines selective concept extraction \[Lehnert 1991\] technologies developed at UMass with trainable classifier technologies developed at Hughes \[Dolan et al. 1991\]. Our Tipster system incorporates seven trainable language components to handle (1) lexical recognition and part-of-speech tagging, (2) knowledge of semantic/syntactic interactions, (3) semantic feature tagging, (4) noun phrase analysis, (5) limited coreference resolution, (6) domain object recognition, and (7) relational link recognition. These trainable components have been designed so that domain experts with no background in natural language processing or machine learning can train individual system components in the space of a few hours.</Paragraph>
    <Paragraph position="2"> Many critical aspects of a complete information extraction system are not appropriate for customization or trainable knowledge acquisition. For example, our system uses low-level text specialists designed to recognize dates, locations, revenue objects, and other common constructions that involve knowledge of conventional language. Resources of this type are portable across domains (although not all domains require all specialists) and should be developed as sharable language resources. The UMass/Hughes focus has been on other aspects of information extraction that can benefit from corpus-based knowledge acquisition. For example, in any given information extraction application, some sentences are more important than others, and within a single sentence some phrases are more important than others. When a dictionary is customized for a specific application, vocabulary coverage can exploit the fact that many words contribute little or no information to the final extraction task: full dictionary coverage is not needed for information extraction applications.</Paragraph>
    <Paragraph position="3"> In this paper we give an overview of our hybrid architecture and trainable system components. We examine examples taken from our official test runs, discuss the results obtained in our official and optional test runs, and identify promising opportunities for additional research.</Paragraph>
  </Section>
</Paper>