<?xml version="1.0" standalone="yes"?>
<Paper uid="M92-1029">
  <Title>MCDONNELL DOUGLAS ELECTRONIC SYSTEMS COMPANY : DESCRIPTION OF THE TEXUS SYSTEM AS USED FOR MUC-4 Amnon Meyers and David de Hilster</Title>
  <Section position="2" start_page="0" end_page="210" type="intro">
    <SectionTitle>
INTRODUCTION AND APPROACH
</SectionTitle>
    <Paragraph position="0"> Unlike most natural language processing (NLP) systems, TexUS (Text Understanding System) is being developed as a domain-independent shell, to facilitate the application of language analysis to a variety of tasks and domains of discourse. Our work not only develops robust and generic language analysis capabilities, but also elaborates knowledge representations, knowledge engineering methods, and convenient interfaces for knowledge acquisition.</Paragraph>
    <Paragraph position="1"> TexUS builds on INLET (Interactive Natural Language Engineering Tool) [1][2], which was used for MUC3. Both descend from VOX (Vocabulary Extension System) [3][4], which was developed from 1983-87 at UCI and 1988-90 at MDESC. Many analysis, knowledge representation, and knowledge acquisition ideas from VOX have evolved in constructing TexUS. In particular, TexUS (1) implements completely new, robust, and tailorable analysis algorithms; (2) embeds analyzer data structures within the knowledge representation; (3) supports interactive graphical knowledge engineering; (4) runs in C on Sun workstations, to improve portability and speed; and (5) employs a pragmatic and domain-independent knowledge representation framework. TexUS and INLET differ primarily in the strength of the analysis capability. Figure 1 exemplifies the graphical utilities available in TexUS.</Paragraph>
    <Paragraph position="2"> The system we used at MUC3 was new (under development for 9 months) and incomplete. In essence, a sophisticated skimming system carried the analysis burden. Since that time, we have implemented a comprehensive analyzer that includes the following phases: (1) key pattern search, which locates potentially interesting texts and text fragments; (2) pre-processing, where idiomatic constructs are collapsed in bottom-up fashion; (3) syntactic analysis, which uses primarily top-down mechanisms to successively segment text into finer-grained units, i.e., paragraphs, sentences, clauses, and components; (4) semantic analysis, which extracts and standardizes information found in the parse tree; (5) discourse analysis, which traverses the semantic representation to establish relationships between events, actors, objects, and other cases; and (6) back-end processing, which converts the internal representation produced by the analyzer to task-specific output. Table 1 highlights differences between the TexUS and INLET systems.</Paragraph>
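The six-phase organization described above can be sketched as a simple function pipeline. This is a minimal illustrative sketch, not the TexUS implementation: all function names, keywords, and the toy event logic are assumptions, and the discourse and back-end phases are omitted.

```python
def key_pattern_search(text):
    """Phase 1: keep only fragments containing a keyword of interest."""
    keywords = {"kidnapped", "bombing", "attack"}
    return [s for s in text.split(". ") if any(k in s.lower() for k in keywords)]

def preprocess(fragments):
    """Phase 2: collapse idiomatic constructs bottom-up (toy example)."""
    return [f.replace("in spite of", "despite") for f in fragments]

def syntactic_analysis(fragments):
    """Phase 3: top-down segmentation into successively finer units."""
    return [{"sentence": f, "tokens": f.split()} for f in fragments]

def semantic_analysis(parses):
    """Phase 4: extract and standardize information from the parse."""
    return [{"event": "KIDNAPPING"} if "kidnapped" in p["sentence"].lower()
            else {"event": "OTHER"} for p in parses]

def analyze(text):
    """Run phases 1-4 in order; discourse and back-end phases omitted."""
    return semantic_analysis(syntactic_analysis(preprocess(key_pattern_search(text))))

events = analyze("Rebels kidnapped the mayor. The weather was mild.")
```

The key design point is that each phase consumes the previous phase's output, so phases can be replaced or reordered independently.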
    <Paragraph position="3"> The algorithms provided with TexUS support truly robust analysis. Construction of useful parse trees does not depend on complete syntactic characterization of the text, and succeeds even in the presence of ungrammatical, terse, and garbled text abounding in unknown words and phrases. This is accomplished by using grammar rules that successively relax constraints. For example, one hierarchy of grammar rules is dedicated to segmenting clauses into components (e.g., noun and verb phrases). Rules that represent the strongest confidence (such as "det quan adj noun" with no optional parts) are applied first. If these fail, rules with optionality are applied next (e.g., with "det" missing). Next, rules containing wildcard elements (e.g., a wildcard "noun" element) but allowing no optionality are applied, followed by rules with wildcards and optionality. In this way, we attempt to match the text to rules with highest confidence first (the ordering interleaves, e.g., rules for noun phrases and verb phrases, to maximize confidence). Using rules with wildcards and little optionality allows the analyzer to characterize fragments of text containing unknown words with reasonable confidence in many cases. Such rules also support automated knowledge acquisition (see Section 2.2).</Paragraph>
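The constraint-relaxation strategy above can be sketched as an ordered rule list tried strongest-confidence first. The rule encoding, tag names, and matching scheme here are illustrative assumptions, not the actual TexUS grammar machinery.

```python
# Rules ordered from strictest (all parts present) to loosest (wildcard),
# mirroring the confidence ordering described in the text.
RULES = [
    ["det", "quan", "adj", "noun"],   # strictest: no optional parts
    ["det", "adj", "noun"],           # "quan" dropped (optionality)
    ["det", "noun"],                  # "quan" and "adj" dropped
    ["det", "quan", "WILD", "noun"],  # wildcard absorbs an unknown word
]

def match(rule, tags):
    """A rule matches when each element equals the tag or is a wildcard."""
    return len(rule) == len(tags) and all(
        r == "WILD" or r == t for r, t in zip(rule, tags))

def segment_np(tags):
    """Return the first (highest-confidence) rule that matches, if any."""
    for rule in RULES:
        if match(rule, tags):
            return rule
    return None

# "the two rebel positions" with "rebel" unknown falls through to the
# wildcard rule, so a parse is still produced.
best = segment_np(["det", "quan", "UNKNOWN", "noun"])
```

Because the loosest rules sit last, a fully characterized phrase never pays the cost in confidence of a wildcard match.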
    <Paragraph position="4"> The parsing mechanisms are driven by hierarchies of grammar rules. Much of our analyzer development thus consists of refining these rule sets, rather than building code. Further, we have developed generic analysis mechanisms that serve multiple tasks, depending on the rule sets they are given as parameters. Building analyzers for new tasks is substantially reduced to selecting and using existing analysis mechanisms and creating rule sets when needed for new domains. We are currently developing an interactive graphical analyzer tool to simplify enhancement of the analyzer algorithms.</Paragraph>
    <Paragraph position="5"> In order to exercise the NLP shell approach taken in TexUS, we are developing analyzers for Army SALUTE messages and texts provided by Federal Express. The most important validation of our approach was provided at MUC3. With only 2 man-months of customization, we achieved performance comparable to sites that devoted about one man-year of effort. Similarly, 3.5 man-months of customization for MUC4 have brought us to the same level as before, but with much greater potential for enhanced performance in the near term.</Paragraph>
    <Section position="1" start_page="0" end_page="209" type="sub_section">
      <SectionTitle>
Analysis System
</SectionTitle>
      <Paragraph position="0"> TexUS provides tailorable analysis capabilities. Rather than providing a monolithic analyzer that must be applied to all tasks, we have provided a set of functions that can be mixed and matched to easily construct an analyzer for a new task. Analysis functions perform tasks such as: (1) tokenize the input text, (2) perform morphological analysis and lexical lookup, (3) locate keywords and key phrases within a text, (4) apply rules to segment text, (5) apply rules to match segments of a text, and (6) perform semantic analysis on parse trees. Functions can apply top down or bottom up, a single or multiple times per node of the parse tree, recursively or not, and so on. Functions typically apply a hierarchy of grammar rules to the parse tree, so that augmenting the analyzer often consists of modifying or adding grammar rules, rather than code. Modifying the analyzer code most often consists of adding calls to existing functions, rather than implementing new functions.</Paragraph>
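The mix-and-match construction described above can be sketched as composing a task-specific analyzer from an ordered selection of analysis functions. The function names, lexicon contents, and state dictionary are illustrative assumptions, not TexUS internals.

```python
def tokenize(state):
    """Analysis function (1): split the input text into tokens."""
    state["tokens"] = state["text"].split()
    return state

def lexical_lookup(state):
    """Analysis function (2): tag each token from a toy lexicon."""
    lexicon = {"rebels": "noun", "attacked": "verb", "the": "det", "town": "noun"}
    state["tags"] = [lexicon.get(t.lower(), "UNKNOWN") for t in state["tokens"]]
    return state

def build_analyzer(*functions):
    """Compose selected analysis functions into one analyzer: adding a
    capability means adding a call to an existing function, not new code."""
    def analyzer(text):
        state = {"text": text}
        for f in functions:
            state = f(state)
        return state
    return analyzer

analyzer = build_analyzer(tokenize, lexical_lookup)
result = analyzer("Rebels attacked the town")
```

A different task would simply pass a different selection or ordering of functions to `build_analyzer`.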
      <Paragraph position="1"> The analysis algorithms are robust because they don't depend on complete characterization of the text in order to produce parse trees. When syntactic grammar rules apply, they are used in building the parse tree, but even when the text is ill-formed or the syntax knowledge of the system is incomplete, the analyzer produces parse trees from which useful information can be extracted.</Paragraph>
      <Paragraph position="2"> Semantic analysis extracts and standardizes information by traversing the parse tree produced by the grammar-based analyzer. We have implemented semantic analysis capabilities to locate events, actors, objects, instruments, time, and location within a text. We are currently developing deep analysis capabilities that intelligently resolve discourse phenomena (e.g., whether two sentences describe the same or different events). The semantic analyzer employs domain-specific rules whenever possible, and uses more generic knowledge when necessary, in order to extract relevant information.</Paragraph>
      <Paragraph position="3"> The semantic analyzer constructs an internal representation of the objects and relationships between them. The discourse analyzer traverses the internal semantic representation to establish links between events, actors, objects, locations, dates, instruments, and other information extracted from text. The internal representation then serves as input to a task-specific conversion process that produces the desired output. We are investigating interactive tools to specify the output format. Figure 3 depicts the analysis passes implemented in TexUS.</Paragraph>
    </Section>
    <Section position="2" start_page="209" end_page="210" type="sub_section">
      <SectionTitle>
Knowledge Acquisition Capability
</SectionTitle>
      <Paragraph position="0"> The knowledge acquisition system provides a set of interactive graphic tools, including a hierarchy-based knowledge editor, a grammar rule editor, a vocabulary addition tool, and a dictionary tool. These tools allow a user to add lexical, syntactic, semantic, and domain knowledge to the system. The user can also build hierarchies that drive the analysis functions described earlier.</Paragraph>
      <Paragraph position="1"> We have implemented automated knowledge acquisition capabilities that apply during analysis. For example, the analyzer applies patterns with one-word wildcards to categorize unknown words. A pattern such as "det quan WILD noun", when it matches a piece of text such as "the two rebel positions", leads the analyzer to hypothesize that "rebel" is an adjective or noun. Other mechanisms use morphological evidence and multi-word wildcards to characterize words and phrases. Expanding the automated knowledge acquisition capability is part of our ongoing research effort.</Paragraph>
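The wildcard-driven acquisition step above can be sketched as matching a pattern slot-by-slot against a phrase and recording category hypotheses for the word under the wildcard. The lexicon contents and the hypothesis set are illustrative assumptions taken from the "the two rebel positions" example in the text.

```python
# Toy lexicon: "rebel" is deliberately absent (an unknown word).
LEXICON = {"the": "det", "two": "quan", "positions": "noun"}

def hypothesize(pattern, words):
    """Return {word: hypothesized categories} for unknown words that
    fall under a WILD slot of the matched pattern."""
    hypotheses = {}
    for slot, word in zip(pattern, words):
        if slot == "WILD" and word not in LEXICON:
            # The wildcard sits between a quantifier and a noun, so the
            # unknown word is plausibly an adjective or a noun.
            hypotheses[word] = {"adj", "noun"}
    return hypotheses

hyp = hypothesize(["det", "quan", "WILD", "noun"],
                  ["the", "two", "rebel", "positions"])
```

Accumulating such hypotheses over many texts is what lets the analyzer acquire vocabulary without manual intervention.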
      <Paragraph position="2"> Batch knowledge acquisition tools have also been implemented, for example, to incorporate personal and geographic names into the knowledge base. We are investigating the use of on-line dictionaries to enlarge the knowledge base. We are already employing the Collins English Dictionary (CED) provided by the ACL Data Collection Initiative for syntax class information, and are investigating the extraction of semantic information as well.</Paragraph>
      <Paragraph position="3"> Knowledge Base
We have implemented a domain-independent knowledge management framework that improves the Conceptual Grammar framework of the predecessor VOX system. The knowledge base represents a variety of linguistic and conceptual knowledge, as well as housing the analyzer data structures and internal meaning representation data structures. Figure 4 exemplifies the knowledge representation for kidnapping concepts and their associated words. The knowledge representation elements are concepts and grammar rules that are analogous to Lisp symbols and lists. Grammar rules represent any kind of sequential information; we use them for syntax rules, idiomatic phrases, attributes of concepts, patterns with wildcards, logical expressions, and so on. Attributes specify relationships between concepts, such as the parent-child relationship in a hierarchy.</Paragraph>
      <Paragraph position="4"> The system's knowledge is stored in several forms. The raw database consists of knowledge in a form directly accessed and updated by TexUS. A second form of the knowledge consists of a set of files containing primitive knowledge addition commands (e.g., a command to add a node to a phrase). Executing the commands in this file system rebuilds the entire knowledge base from scratch. A third form of the knowledge consists of a file system of high level knowledge addition commands (e.g., a command to add a noun to the system). Each form of the knowledge provides a greater degree of independence from the system internals, and each is successively more human readable. The multiple layers of knowledge storage also provide extra knowledge protection, in case one layer is corrupted.</Paragraph>
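The rebuild-from-commands layer described above can be sketched as replaying a sequence of primitive knowledge-addition commands into a fresh knowledge base. The command names and knowledge-base shape here are illustrative assumptions, not the TexUS command vocabulary.

```python
def rebuild(commands):
    """Replay primitive knowledge-addition commands into a fresh KB,
    so the whole knowledge base can be reconstructed from scratch."""
    kb = {"nodes": set(), "edges": set()}
    for cmd, *args in commands:
        if cmd == "add_node":
            kb["nodes"].add(args[0])
        elif cmd == "add_child":
            # Parent-child attribute link in a concept hierarchy.
            parent, child = args
            kb["nodes"].update({parent, child})
            kb["edges"].add((parent, child))
    return kb

kb = rebuild([
    ("add_node", "event"),
    ("add_child", "event", "kidnapping"),
])
```

Because the command file is the source of truth, a corrupted raw database can always be regenerated by replaying it, which is the protection property the text describes.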
    </Section>
  </Section>
</Paper>