<?xml version="1.0" standalone="yes"?>
<Paper uid="M93-1020">
  <Title>TRW: DESCRIPTION OF THE DEFT SYSTEM AS USED FOR MUC-5</Title>
  <Section position="2" start_page="0" end_page="237" type="metho">
    <SectionTitle>
BACKGROUND
</SectionTitle>
    <Paragraph position="0"> For the past three years, TRW has been developing a text analysis tool called DEFT (Data Extraction from Text). Based on the Fast Data Finder (FDF), DEFT processes large volumes of text at very high speeds, identifying patterns which serve as indicators for the presence of relevant objects, relationships, or concepts in the data. These indicators are processed by a series of system-supplied utilities or custom-written functions which refine the data and reformulate it into frames which can be presented to a user for review, editing, and submission to a downstream application or database.</Paragraph>
    <Paragraph position="1"> Superficially, DEFT resembles a Natural Language Understanding (NLU) system; however, there are key differences. DEFT entertains very limited goals in the processing of natural language input. Although DEFT processes unconstrained input, it is looking for textual entities which are tightly constrained and presented to the system as a list of expressions or in a powerful pattern specification language. It exploits expectations about how a small set of entities will be expressed to reduce the amount of computation required to locate those, and only those, entities. The broader question of the &quot;meaning&quot; of the text in the document is bypassed in favor of rapid, robust processing that can be readily moved from domain to domain. As long as the input for a particular domain is sufficiently predictable, data extraction with a satisfactory level of recall and precision for many applications can be achieved. We are currently installing three DEFT systems for a United States government agency; initial reviews have been highly favorable.</Paragraph>
    <Paragraph position="2"> Our involvement in MUC-5 derives from a request by the government to turn DEFT into a COTS product, with the intent of having a fully supported version of the system by the end of the year. An analysis of the broader commercial and government market for text extraction suggested that the scope of problems that DEFT should be able to address needed to be expanded; however, it was established that replication of the ongoing research and development work in the NLU community was an inappropriate role for our development group. Rather, we wanted DEFT to be able to integrate with systems already developed or in development for functionality which falls outside the narrow boundaries of DEFT's pattern-based capabilities. At the same time, DEFT's ability to express patterns needed to be extended from its current, highly effective means for defining &quot;atomic&quot; patterns to the definition of patterns in relationship to each other, permitting simple syntactic information to be added to DEFT's lexical knowledge. Thus, DEFT would have the potential to find entities not expressly defined in a lexicon, improve its ability to correctly determine the relation between entities, and decrease the overgeneration that tends to be associated with approaches that rely exclusively on pattern matching.</Paragraph>
    <Paragraph position="3"> A mechanism was selected for enhancing pattern specification which was felt to be compatible with the notion of integrating DEFT with third-party systems. As will be described in some detail, DEFT is intrinsically an engineering shell which is intended to facilitate such integration while making its rapid pattern-matching services available to the other system components. Unfortunately, the software implementing this concept was not available at the time of the final MUC-5 evaluation, the results of which therefore serve only to confirm our expectations that the recognition of &quot;simple&quot; (i.e., isolated) patterns is woefully insufficient for complex data extraction problems.</Paragraph>
    <Paragraph position="4"> While we regret that the capabilities of the extended version of DEFT could not be demonstrated for MUC-5, we feel that the outcomes justify our belief that real-world message understanding problems necessitate an engineering solution that can pit a choice of technologies against the specific problem at hand, different technologies being optimum for different tasks. We believe that DEFT's success in handling simple data extraction problems can be extended, and that DEFT is well-suited to a role as an integrator of text analysis capabilities. It is toward this end that we are focusing our ongoing productization efforts.</Paragraph>
  </Section>
  <Section position="3" start_page="237" end_page="244" type="metho">
    <SectionTitle>
SYSTEM DESCRIPTION
</SectionTitle>
    <Paragraph position="0"> It is convenient to envision DEFT as a pipeline, as shown in Figure 1. At the head is a standardized document interface to message handling systems. At the tail is a process which generates frames and distributes these to the appropriate destinations on the basis of content. In between is a series of text analysis &quot;filters&quot; which apply DEFT lexicons (pattern searches) against the text (using the FDF) and call specific extraction functions to process the textual fragments located by the lexicons. All processes are controlled by means of external configuration files and a &quot;workbench&quot; which contains tools for interacting with DEFT and the data DEFT extracts. We will describe each of these major components in turn.</Paragraph>
    <Paragraph position="1"> The Document Interface: Message Queuing. It is assumed that DEFT will be embedded in an existing automated message handling (AMH) system. DEFT's interface with these systems is called Message Queuing (MQ). Text is typically disseminated to MQ (e.g., by a messaging system like TRW's ELCSS or KOALA that receives government cables, wire service input, etc.) on the basis of subject matter, source, structure, or other characteristic with salience for how the message's language will be analyzed.</Paragraph>
    <Paragraph position="2"> MQ can also accommodate documents loaded from other sources, such as native wire services, an existing full-text database, CD-ROM, OCR, and so on. Text is assumed to be in ASCII or extended ASCII; in the near future, DEFT will build on work currently underway to allow the FDF to accommodate Unicode for foreign character sets, such as Japanese. Structural features, such as document boundaries, sentence boundaries, paragraphs, tabularization, encoded tags (such as SGML), embedded non-textual media, etc., can be defined for a particular document class using DEFT specification files.</Paragraph>
    <Paragraph position="3"> MQ utilizes a configuration file to assign a processing thread tailored to the problem domain to each category of document classified by the dissemination system or by whatever means (including manual) is used to route documents to DEFT. Documents are associated with a processing thread by placing them in a particular MQ &quot;in-basket&quot; (a standard Unix directory). Each in-basket is polled periodically, using a set of criteria (time and number of messages since the last processing thread was initiated) defined in the configuration file.</Paragraph>
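    <Paragraph> The polling criteria just described can be sketched as a small decision function. This is a hypothetical illustration only: the names BasketPolicy and should_start_thread, the two thresholds, and the choice that either criterion alone triggers a thread are assumptions, not details of the MQ configuration file.

```python
from dataclasses import dataclass


@dataclass
class BasketPolicy:
    """Hypothetical per-in-basket settings (names are assumptions)."""
    max_wait_seconds: float   # longest time waiting messages may sit
    max_pending: int          # batch size that triggers a thread at once


def should_start_thread(policy, seconds_since_last, pending):
    """Decide whether to start a processing thread for one in-basket."""
    if pending == 0:
        return False          # nothing to process
    # Assumption: either criterion on its own is enough to start a thread.
    return (pending >= policy.max_pending
            or seconds_since_last >= policy.max_wait_seconds)


policy = BasketPolicy(max_wait_seconds=300.0, max_pending=50)
```

Raising max_pending favors larger batches, while lowering max_wait_seconds favors prompt processing of small batches.</Paragraph>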
    <Paragraph position="4"> Extracting Data: Text Analysis Filters. When MQ assigns a document to a processing thread, it is subjected to a sequence of procedures which operate on the text to locate patterns of interest and use these patterns as a guide to extract the data required for a particular problem domain. This sequence of processes determines what is extracted and how it is extracted. The sequence is defined as an ordered list of &quot;extraction phases&quot; in a configuration file. This list can be changed at any time to substitute or add new extraction phases to refine a text processing thread. New threads can be modeled on existing ones, facilitating transitions to new problem areas.</Paragraph>
    <Paragraph position="5"> Each extraction phase is an executable program. The behavior of a phase is dependent on the order in which it is called (i.e., its relationship to the phases that have been executed before it) and on parameters which are supplied in the configuration file. In this way, a generalized extraction phase can be configured for a specific analytic objective. DEFT has a library of extraction phases that perform the most elementary analytic processes; new phases are written on a problem-specific basis. DEFT provides an application programming interface (API) in the form of a library of utilities which allows a custom extraction phase to interact with the data structure which is common to all extraction phases, and which is used to communicate between phases. This structure is the DEFT &quot;Tag File.&quot; The Tag File is a cumulative record of the processing performed by each extraction phase. Each phase receives the Tag File from the preceding phase, and passes it to the next. A &quot;tag&quot; represents a textual pattern identified by DEFT in the text or data created by an extraction function.</Paragraph>
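    <Paragraph> To make the phase-to-phase handoff concrete, the following is a minimal sketch of an ordered list of phases sharing one cumulative record. DEFT phases are separate executable programs using a C-level API; here they are modeled as Python functions purely for illustration, and every name (Tag, TagFile, find_designators, build_names, run_thread) and every field is an assumption rather than the actual Tag File layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tag:
    """One entry in the (hypothetical) Tag File."""
    phase: str   # phase that produced the tag
    kind: str    # illustrative tag type
    start: int   # offset of the first character
    end: int     # offset one past the last character
    value: str


@dataclass
class TagFile:
    """Cumulative record handed from each phase to the next."""
    text: str
    tags: List[Tag] = field(default_factory=list)


def find_designators(tag_file, params):
    """Tag every occurrence of the configured corporate designators."""
    for designator in params["designators"]:
        pos = tag_file.text.find(designator)
        while pos != -1:
            tag_file.tags.append(Tag("find_designators", "designator",
                                     pos, pos + len(designator), designator))
            pos = tag_file.text.find(designator, pos + 1)
    return tag_file


def build_names(tag_file, params):
    """Rebuild names from designator tags left by an earlier phase."""
    designators = [t for t in tag_file.tags if t.kind == "designator"]
    for tag in designators:
        name_words = []
        # Collect the capitalized words immediately before the designator.
        for word in reversed(tag_file.text[:tag.start].split()):
            if not word[:1].isupper():
                break
            name_words.insert(0, word)
        if name_words:
            start = tag_file.text.rfind(name_words[0], 0, tag.start)
            name = " ".join(name_words + [tag.value])
            tag_file.tags.append(Tag("build_names", "corporate_name",
                                     start, tag.end, name))
    return tag_file


def run_thread(text, phases):
    """Run the ordered (phase, parameters) list over one document."""
    tag_file = TagFile(text)
    for phase, params in phases:
        tag_file = phase(tag_file, params)
    return tag_file


# Stand-in for the ordered list of phases in a configuration file.
THREAD = [
    (find_designators, {"designators": ["Inc.", "S.A."]}),
    (build_names, {}),
]
result = run_thread("Talks between Acme Widgets Inc. and Globex S.A. ended.", THREAD)
names = [t.value for t in result.tags if t.kind == "corporate_name"]
```

Reordering THREAD changes the outcome: build_names finds nothing if it runs before find_designators, which mirrors the dependence of a phase on the phases executed before it.</Paragraph>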
    <Paragraph position="6"> Much of the power of DEFT comes from the ability to apply a mixture of extraction phases that is optimally suited for a given class of document and extraction problem. For example, one extraction phase might reason about the relative time of occurrence of events located in the text, basing its analysis on the occurrence of various forms of date/time indicators as well as the presence of such modifiers as &quot;last week&quot; or &quot;three years ago.&quot; Another phase might construct corporate names on the basis of the occurrence of a known name or the presence of a designator (e.g.,</Paragraph>
    <Paragraph position="7"> &quot;Inc.&quot; or &quot;S.A.&quot;). Yet another phase might act upon these names to reason about their potential relationship in a joint venture.</Paragraph>
    <Paragraph position="8"> Locating Data: DEFT Lexicons. The patterns that DEFT uses to locate data of interest in the text are contained in DEFT's lexicons. Lexicons serve various purposes: to identify potential frames, to determine the &quot;scope&quot; of a frame in the text (i.e., the boundaries to be used to find data to fill the frame slots), to find the contents for a slot in a frame, to determine structural elements (e.g., sentences, paragraphs, header information), and to set the attributes of a text object (e.g., classification level). Lexicons are of two types: list and pattern. The list lexicon associates a set of synonyms (or spelling variants) with a given object. It is useful when the complete set of strings associated with an object can be specified. The pattern lexicon is used when the textual variations associated with an object cannot be specified. For example, all possible monetary values cannot be conveniently enumerated, but a single pattern describing monetary values in terms of digits, punctuation, and denomination strings can be constructed.</Paragraph>
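    <Paragraph> The difference between the two lexicon types can be illustrated with a short sketch. It uses Python regular expressions as a stand-in for the FDF pattern language, which is not shown in this paper; the lexicon contents, the MONEY pattern, and the function names are all assumptions made for the example.

```python
import re

# List lexicon: each object is associated with an enumerated set of strings.
LIST_LEXICON = {
    "United Kingdom": ["United Kingdom", "Great Britain", "England"],
    "Japan": ["Japan", "Nippon"],
}

# Pattern lexicon: monetary values cannot be enumerated, so one pattern
# describes them in terms of digits, punctuation, and denomination strings.
AMOUNT = r"\d[\d,]*(?:\.\d+)?(?: (?:million|billion))?"
MONEY = re.compile(r"\$" + AMOUNT + "|" + AMOUNT + r" (?:dollars|yen|pounds)")


def search_list_lexicon(text):
    """Return (matched string, object) pairs for every synonym in the text."""
    return [(s, obj) for obj, syns in LIST_LEXICON.items()
            for s in syns if s in text]


def search_pattern_lexicon(text):
    """Return every substring that fits the monetary pattern."""
    return MONEY.findall(text)


TEXT = ("The venture, capitalized at $3.5 million, will pay "
        "200,000,000 yen to a partner in England in 1993.")
list_hits = search_list_lexicon(TEXT)
money_hits = search_pattern_lexicon(TEXT)
```

The list hit carries its object directly, whereas each pattern hit is only a string whose value still has to be interpreted after the match.</Paragraph>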
    <Paragraph position="9"> Associated with lexicon entries are attributes, representing the semantics of the problem domain. An attribute is a characteristic of the object represented in the text by its synonym list or pattern. It might be the normalized form of a name or other data about an object which is useful to map into a frame, such as the country associated with a corporate name. In a list lexicon, these attributes are known explicitly when an entry is created; they are not inferred from the text. In a pattern lexicon, however, the attributes cannot be known in advance because it is not known what exact value will hit against the pattern. For this reason, attributes must be extracted for a pattern lexicon. Attribute extraction is handled by a C or C++ program referred to as an &quot;extraction function.&quot; For example, given the location of a corporate designator, a function might reconstruct the corporate name.</Paragraph>
    <Paragraph position="10"> The success of a data extraction system that relies on pattern matching and string finding depends on how exhaustively it can search for the variations expected in input language. DEFT has proved successful in its current applications in part because its lexicons can be extremely large, thanks to the capabilities (in terms of both functionality and performance) of the FDF.</Paragraph>
    <Paragraph position="11"> Searching Text for Lexicon Entries: The FDF. DEFT uses the TRW-developed Fast Data Finder to rapidly locate instances of a potentially enormous set of patterns in the input text. The power of the FDF derives from two sources: the hardware architecture and the expressiveness of its Pattern Specification Language (PSL).</Paragraph>
    <Paragraph position="12"> The current generation FDF-3, now a COTS product manufactured by Paracel, Inc., uses a massively parallel architecture to stream text past a search pattern at disk speeds (currently 3.5 million characters/second using a standard SCSI disk).</Paragraph>
    <Paragraph position="13"> Searches are compiled into microcode for a proprietary chip set which can accommodate up to 3,600 simultaneous character searches or Boolean operations.</Paragraph>
    <Paragraph position="14"> Lexicons are broken into &quot;pipelines&quot; which fully fill the chip set; each pipeline is run against all of the text in the set of documents currently being processed. MQ batches messages as they come in so as to optimize the use of the FDF: larger message sets are processed more efficiently than several smaller ones. The tradeoff between batching and &quot;real-time&quot; processing can be independently balanced in the MQ configuration file for each in-basket and processing thread.</Paragraph>
    <Paragraph position="15"> Search patterns are specified in PSL. Because the FDF uses a streaming approach, PSL is not dependent on word boundaries. Extremely complex patterns can be expressed, which can include such features as error tolerance, sliding windows, multiple wildcard options, nested macros, character masking, ranging, and the usual Boolean operations. Features that support &quot;fuzzy matching,&quot; like error tolerance, are extremely important for handling &quot;noisy&quot; input.</Paragraph>
    <Paragraph position="16"> Output Generation: Frame Assembly and Routing. When the filters that make up a processing sequence have executed, the Tag File is passed to the &quot;Frame Assembly and Region Routing&quot; (FARR) module. This program, which constitutes the &quot;tail&quot; of the DEFT pipeline, assembles the data elements generated during the analysis thread into frames based on an external definition file. This file specifies which slots are associated with which frames, how to transform a data value for display to the user (e.g., normalize &quot;England&quot; to &quot;United Kingdom&quot;), how to transform a value for storage in a downstream database (e.g., abbreviate &quot;England&quot; as &quot;UK&quot;), how to validate a data value, whether a data type can be multiply occurring, and so on.</Paragraph>
    <Paragraph position="17"> One issue that arises during frame assembly is when to associate a data value with an instance of the frame class for which it is defined. In DEFT, this operation is associated with &quot;scoping.&quot; Scoping is the process of determining the extent in the text of a concept associated with a pattern. For example, if a pattern of words indicative of a joint venture is found, the scope of the &quot;tie-up&quot; frame might be taken to be the location of the pattern plus or minus two sentences. The unit of scoping (in this case, the sentence) need not be a syntactic unit; it can be any pattern stored in a special type of lexicon used exclusively for determining frame scope. The unit of scoping and its extent (e.g., &quot;plus or minus n&quot;) can be determined independently for each frame class.</Paragraph>
    <Paragraph position="18"> When a pattern that gives rise to a slot value of a type defined for a given frame class is found in the text, the slot is automatically mapped by FARR to any frame whose scope encompasses the location of the pattern. Thus, if the name of a corporation were to occur within the two-sentence range of the tie-up frame in our example, it would appear in that frame. Of course, this may not be accurate: DEFT has a tendency to overgenerate slots through bogus associations that arise because of this weak scoping mechanism.</Paragraph>
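    <Paragraph> The proximity-based mapping described above can be sketched in a few lines. This is an illustrative model only, assuming sentences are numbered from zero and that every slot hit is of a type defined for the frame class; Frame, scope, and assemble are hypothetical names, not FARR interfaces.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Frame:
    frame_class: str
    hit_unit: int   # index of the scoping unit (here, sentence) of the hit
    extent: int     # scope is hit_unit plus or minus this many units
    slots: List[Tuple[str, str]] = field(default_factory=list)

    def scope(self):
        """Scoping units covered by this frame, clipped at the document start."""
        return range(max(0, self.hit_unit - self.extent),
                     self.hit_unit + self.extent + 1)


def assemble(frames, slot_hits):
    """Attach each (slot, value, unit) hit to every frame whose scope covers it."""
    for slot, value, unit in slot_hits:
        for frame in frames:
            if unit in frame.scope():
                frame.slots.append((slot, value))
    return frames


tie_up = Frame("tie-up", hit_unit=3, extent=2)
hits = [
    ("entity", "Acme Inc.", 1),
    ("entity", "Unrelated Corp.", 4),   # nearby but not part of the venture
    ("entity", "Globex S.A.", 5),
    ("entity", "Initech", 6),           # a true partner mentioned too late
]
assemble([tie_up], hits)
```

In this made-up example the frame picks up the unrelated company in sentence 4 and misses the partner in sentence 6, which is the kind of bogus or lost association that proximity alone cannot prevent.</Paragraph>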
    <Paragraph position="19"> Another issue that is encountered is overlapping frames. The &quot;best available&quot; resolution can be specified in the frame definition file. One alternative is simply to accept both frames, since they may be describing separate concepts. If the frames are of different classes, FARR supports the attribution of a priority to each class, and only the frame with the highest priority need be retained. If the frames are of the same class, FARR supports a &quot;non-multiply occurring&quot; attribute, which optionally suppresses all but one of the frames. Unfortunately, the action taken is generalized to all situations; the specifics of a given case cannot be taken into account. Thus, DEFT tends to either overgenerate or lose frames.</Paragraph>
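    <Paragraph> The three resolution options can be expressed as one small routine. The sketch below is a hypothetical reading of the text: the representation of scope as a start and end unit, the rule that the first frame of a non-multiply occurring class survives, and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    frame_class: str
    start: int   # first scoping unit covered
    end: int     # last scoping unit covered (inclusive)


def overlaps(a, b):
    """True when the two scopes share at least one unit."""
    return min(a.end, b.end) - max(a.start, b.start) >= 0


def resolve(frames, priority, non_multiple):
    """Apply class-level overlap rules; frames matching no rule are all kept."""
    dropped = set()
    for i, a in enumerate(frames):
        for j in range(i + 1, len(frames)):
            b = frames[j]
            if i in dropped or j in dropped or not overlaps(a, b):
                continue
            if a.frame_class == b.frame_class:
                if a.frame_class in non_multiple:
                    dropped.add(j)   # assumption: the earlier frame survives
            elif a.frame_class in priority and b.frame_class in priority:
                pa, pb = priority[a.frame_class], priority[b.frame_class]
                if pa != pb:
                    dropped.add(j if pa > pb else i)
    return [f for k, f in enumerate(frames) if k not in dropped]


frames = [
    Frame("tie-up", 0, 4),
    Frame("tie-up", 3, 6),
    Frame("entity-note", 4, 5),
    Frame("tie-up", 10, 12),
]
kept = resolve(frames, {"tie-up": 2, "entity-note": 1}, {"tie-up"})
```

Because the rules are keyed on frame class alone, the same decision is applied to every overlap of that kind, whatever the surrounding text says.</Paragraph>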
    <Paragraph position="20"> When a message's frames have been generated and ambiguities resolved (to the extent that DEFT can resolve them), the frames (and the message) are routed to a destination directory on the basis of their content. Routing instructions are defined in a rule base using a normalized conjunctive form of field-value pairs. It should be kept in mind that although DEFT's primary mission is extraction, not dissemination, the routing capability (since it is based on knowledge representation) provides a sensitive mechanism for determining the destination of a message and the structured representation of its contents.</Paragraph>
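    <Paragraph> A rule base of this kind can be modeled as a list of conjunctions, each paired with a destination. The rule syntax of DEFT is not given in this paper, so the data layout, the field names, and the directory paths below are all hypothetical.

```python
from typing import Dict, List, Tuple

# Each rule: (field-value pairs that must ALL hold, destination directory).
Rule = Tuple[Dict[str, str], str]

RULES: List[Rule] = [
    ({"frame_class": "tie-up", "country": "Japan"}, "/deft/out/japan_ventures"),
    ({"frame_class": "tie-up"}, "/deft/out/ventures"),
]


def route(frame, rules):
    """Return every destination whose conditions all hold for the frame."""
    return [dest for conditions, dest in rules
            if all(frame.get(name) == value
                   for name, value in conditions.items())]


destinations = route(
    {"frame_class": "tie-up", "country": "Japan", "partner": "Acme Inc."}, RULES)
```

A frame may satisfy several rules and so reach several destinations; whether DEFT itself delivers to all of them or only the first is not stated in the source.</Paragraph>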
    <Paragraph position="21"> Controlling the System: DEFT Tools and Specification Management. In order to make DEFT portable to different computing environments and problem domains, the definition of user-modifiable system characteristics has been exported to a set of external specification files. These files govern the interface with the surrounding message handling system, the output data model, FDF configuration, and other &quot;housekeeping&quot; functions. Specification files are maintained using any convenient text editor.</Paragraph>
    <Paragraph position="22"> The most important system specifications from the standpoint of the end user are the lexicons and the frame routing rules. To facilitate lexicon development and maintenance, a lexicon editor is bundled with DEFT that provides a graphic user interface (under X/Motif) for interactively defining lexicons and entering/editing lexicon entries. Lexicons can also be created/updated from databases or external machine-readable files (e.g., gazetteers, corporate name lists) using a batch load protocol.</Paragraph>
    <Paragraph position="23"> Like the lexicon editor, the routing rule manager provides a GUI for maintaining routing rules. It uses a spreadsheet metaphor to minimize the user's exposure to the potentially complex Boolean logic that the rules can involve. Menus of valid values and condition tests are automatically provided.</Paragraph>
    <Paragraph position="24"> Another important DEFT tool is frame review. DEFT was developed under the assumption that a user would always be in the loop; it was not intended to run autonomously. This package therefore supports simultaneous display of messages and the frames derived from them, providing highlights that show where slot values were extracted. Menus of valid values drawn from the lexicons assist the user in filling slots that were omitted by DEFT. Features for selectively deleting superfluous slots and frames are particularly important, since DEFT (like other pattern-based approaches to text analysis) tends to overgenerate data. A mechanism is also provided to facilitate manually linking frames of different classes into higher-level logical aggregations, since DEFT was not originally designed with an automated linking capability. Clearly, these two design assumptions (human interaction and manual frame linking) had an impact on working with the MUC-5 data.</Paragraph>
    <Paragraph position="25"> DEFT as an Engineering Shell. This description of the DEFT system has emphasized that analysis threads are composed of independent components which communicate through a common data structure using a library of utilities that constitute an API. It is our contention that DEFT's strengths are: * A powerful pattern searching capability, which we are extending.</Paragraph>
    <Paragraph position="26"> * The ability to integrate COTS, GOTS, and custom-written programs within the DEFT architecture.</Paragraph>
    <Paragraph position="27"> We believe that there will probably not be a single text analysis or NLU system that meets the requirements of all conceivable applications. There will always be a tradeoff between such factors as speed, depth of analysis, breadth of coverage, portability, robustness, and analysis methodology that will favor one technology over another for a particular problem. The real question is not &quot;What is the best system?&quot;, but &quot;What is the best system at this moment?&quot; Our current development work on DEFT is chiefly targeted at its usefulness as an integration tool. DEFT provides a high-speed pattern searching capability which can successfully extract data from structured or tightly constrained textual inputs, while providing pre-processing services (e.g., tagging words with part of speech or semantic attributes) for third-party software which performs more extensive natural language processing for unconstrained textual inputs. This approach should be especially efficient for applications in which messages are mixed (formatted and unformatted), text analysis tasks are varied in complexity, and throughput is a major consideration.</Paragraph>
    <Paragraph position="28"> Inherent Limitations in DEFT's Pattern-Matching Approach. Because DEFT was not originally intended for problems of the scope of MUC-5, its simplistic approach posed some major problems. Among the most fundamental were: Syntactical Patterns. DEFT has very powerful mechanisms for specifying &quot;atomic&quot; patterns: a corporate name, a place name, a set of words that indicate a joint venture, etc. DEFT was not designed to have the capability of expressing relationships among the patterns in its lexicons and providing for the assignment of values defined with respect to these patterns to variables. These are essential capabilities for the implementation of the most rudimentary semantic grammar. For example, DEFT had no way to express: &quot;Look for a corporate name followed by a joint venture formation phrase and take the following corporate name as the partner in the joint venture.&quot; Frame Scoping. DEFT was designed to interpret the scope of a frame as a function of proximity to the &quot;hit location&quot; of the pattern that resulted in a frame's instantiation. The boundaries are determined by a pre-defined number of repetitions of a pattern contained in a scope lexicon. An upper ceiling determined by a fixed number of characters can also be specified, in case the scoping pattern is not detected within a &quot;reasonable&quot; distance from the site of the hit. All occurrences of slots defined for a frame within these boundaries are automatically included in the frame when it is assembled by FARR.</Paragraph>
    <Paragraph position="29"> For highly formatted text (e.g., messages in Military Text Format), such a mechanism is adequate. For free text, it is not. In the MUC-5 evaluation, DEFT failed to report valid objects that it located (notably entities) because they were not within the scope of a tie-up, as DEFT measured scope.</Paragraph>
    <Paragraph position="30"> Frame Linking. The original DEFT design assumed that a human operator would perform this task. Automated linking is obviously needed for &quot;unattended&quot; operation and is clearly useful even if there is a human in the loop.</Paragraph>
    <Section position="1" start_page="243" end_page="244" type="sub_section">
      <SectionTitle>
Solutions
</SectionTitle>
      <Paragraph position="0"> Current internal research and development work aimed at resolving each of these problems for the eventual DEFT product adheres to the constraint that architectural extensions must be philosophically compatible with the pattern-based approach, while avoiding significant overlap with NLU (which we prefer to view as an integratable component in a complex system). As noted earlier, key software being developed under IR&amp;D was not available for the MUC-5 final evaluation; however, work continues and will be tested on the MUC-5 corpus in the near future to validate the approach.</Paragraph>
      <Paragraph position="1"> Syntactical Patterns. This is the specific area that was not developed in time for the evaluation; unfortunately, it is also the most critical for dealing with even the simple aspects of the MUC problem. The approach we selected is intended to be compatible with the integration of more powerful text understanding components in the future, while extending the range of problems DEFT can solve by itself. It exploits DEFT's atomic pattern-recognition capabilities while separating the definition of a semantic grammar into an independent extraction phase. This phase could easily be replaced (or supplemented) with an NLU system which can optionally take advantage of the DEFT lexical pre-processing while performing deep syntactic and semantic analyses.</Paragraph>
      <Paragraph position="2"> This separation is in part intended to provide an initial test of our belief that the integration of DEFT with an NLU component creates a symbiotic association with better performance characteristics than either system by itself.</Paragraph>
      <Paragraph position="3"> To stay within the (admittedly loosely defined) bounds of pattern matching, our approach to exploiting syntax consists of providing DEFT with a simple mechanism for expressing &quot;meta-patterns&quot;, that is, patterns whose components may be the atomic patterns (and, by reference, their attributes) located by the DEFT lexicons. We decided to use a BNF specification to define a semantic grammar based on a combination of literal strings and DEFT-identified tokens.</Paragraph>
      <Paragraph position="4"> The key issue was how to pass the results of DEFT pattern-matching to the parser. An integrated NLU component within the DEFT shell could interface directly with the DEFT Tag File through the API; the component could also interface with the frames generated by DEFT, providing a preliminary level of analysis on which to build. For our prototype, however, we chose to mark terms in the text with SGML-like tags to indicate their properties. The grammar directly references these tags, and routines were provided within the parser for assigning text strings to slots by extracting DEFT lexicon attributes (e.g., normalized values or semantic characteristics) or collecting words intervening between two tags (of the same or different class). Additional primitives for manipulating the strings prior to slot assignment were also built into the parser infrastructure to control frame generation and the assignment of slots (including pointers to other frames) to frames. This significantly improves on the primitive scoping capability provided with the current version of DEFT.</Paragraph>
      <Paragraph position="5"> The approach selected thus provides a vocabulary for expressing both the expected contents of documents and the rules for instantiating and linking templates. At the same time, its intermediate product is human-readable (and, in fact, could be used as a general-purpose &quot;tagger&quot;) and easily interpreted by other programs.</Paragraph>
      <Paragraph position="6"> Frame Scoping. Fundamental changes in the DEFT frame-scoping mechanism are planned which will exploit domain knowledge as well as limited syntactic (from the meta-patterns) and semantic (from lexicon attributes) data. For MUC-5, the basic DEFT mechanism was retained, with its inherent limitations.</Paragraph>
      <Paragraph position="7"> Frame Linking. A primitive frame linking capability was added to DEFT. It was based on frame scoping, however, and therefore suffered from the same limitations. The DEFT frame definition file format was extended to accommodate hierarchical relationships; any frame defined as a child of another frame had its generated frame ID automatically included as a slot value in the parent frame if its &quot;hit location&quot; fell within the scope of the parent frame. Of course, multiple and spurious associations are easily generated in this way. In the future, frame linking will be improved by combining syntactic and domain knowledge in a final extraction phase to resolve inter-object relations.</Paragraph>
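      <Paragraph> The scope-based linking rule can be sketched as follows. The sketch assumes scoping units are numbered sentences and that the hierarchy is a simple child-class to parent-class table; Frame, covers, link, and PARENT_OF are hypothetical names, not the frame definition file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Frame:
    frame_id: str
    frame_class: str
    hit: int          # scoping-unit index of the hit location
    extent: int = 0   # scope is hit plus or minus extent
    slots: Dict[str, List[str]] = field(default_factory=dict)

    def covers(self, unit):
        """True when the unit lies inside this frame's scope."""
        return self.extent - abs(unit - self.hit) >= 0


# Hypothetical hierarchy: child class mapped to its parent class.
PARENT_OF = {"entity": "tie-up"}


def link(frames, parent_of):
    """Record each child frame ID in every parent whose scope covers its hit."""
    for child in frames:
        parent_class = parent_of.get(child.frame_class)
        if parent_class is None:
            continue
        for parent in frames:
            if parent.frame_class == parent_class and parent.covers(child.hit):
                parent.slots.setdefault(child.frame_class, []).append(child.frame_id)
    return frames


t1 = Frame("T1", "tie-up", hit=3, extent=2)
t2 = Frame("T2", "tie-up", hit=6, extent=2)
e1 = Frame("E1", "entity", hit=5)
e2 = Frame("E2", "entity", hit=9)
link([t1, t2, e1, e2], PARENT_OF)
```

Here E1 falls inside both tie-up scopes and is linked to both, while E2 falls inside neither and is linked to nothing, which illustrates the multiple and missed associations noted above.</Paragraph>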
    </Section>
  </Section>
</Paper>