<?xml version="1.0" standalone="yes"?> <Paper uid="M92-1033"> <Title>NO INJURY : &quot;LEADER&quot; NO INJURY : &quot;JUSTICE &quot; NO INJURY : &quot;DRIVER &quot;NO INJURY : &quot;BODYGUARDS &quot; DEATH: &quot;SALVADORAN PRESIDENT-ELECT &quot; NO INJURY : &quot;ROBERTO GARCIA ALVARADO&quot;DEATH: &quot;ALFREDO CRISTIANI &quot;</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> PARAMAX SYSTEMS CORPORATION: DESCRIPTION OF THE PARAMAX SYSTEM USED FOR MUC-4 </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> This paper describes the Paramax MUC-4 data extraction system, a system with an analysis component implemented in CLIPS that provides a level of text understanding that falls somewhere between what is possible with conventional information retrieval techniques and what is possible with deep linguistic analysis.</Paragraph> <Paragraph position="1"> The Paramax MUC-4 system is a reimplementation of ideas originally introduced in the KBIRD system, which was evaluated in MUC-3 [4, 5]. There are plans to incorporate linguistic analysis techniques into the system, but the current effort is focused on maximizing the data extraction capabilities of simpler methods. The Paramax MUC-4 system is depicted in Figure 1.</Paragraph> </Section> <Section position="3" start_page="0" end_page="245" type="metho"> <SectionTitle> APPROACH AND SYSTEM DESCRIPTION </SectionTitle> <Paragraph position="0"> The Paramax MUC-4 system's architecture consists of three main processing components. An initial preprocessing component generates a set of CLIPS facts in which words, sentences, and paragraphs are identified. This component also computes the likelihood of relevant types of events being described in a given text. 
A core, rule-based analysis component implemented in CLIPS extracts whatever information can be inferred on the basis of the CLIPS facts generated by the preprocessor and the system's application-specific rule base. Finally, a third, template generation component sorts through all of the facts that have been inferred, merging descriptions of identical events and generating collections of templates in an appropriate format.</Paragraph> <Section position="1" start_page="0" end_page="245" type="sub_section"> <SectionTitle> The Preprocessing Component </SectionTitle> <Paragraph position="0"> The preprocessing component is implemented in a programming language specifically designed for processing textual data called PERL.2 During preprocessing, words, sentences, paragraphs and punctuation marks are delimited. In addition, every word in the text is checked against a set of keyword profiles.</Paragraph> <Paragraph position="1"> These profiles are pairs of words and conditional probabilities representing types of events: arsons, attacks, bombings, kidnappings, robberies, and murders. Although murders are not treated as a basic type of event in the MUC-4 task, the system nevertheless attempts to identify them in order to predict significant deaths. If a word occurring in a text also occurs in one of the event profiles with a probability greater than 55%, then the text is asserted to be likely to contain descriptions of instances of that type of event. Because the MUC-4 training sample is small, the list of words determined to be highly predictive of a given type of event is hand-edited to eliminate entries that have been assigned high scores because of sparse data.</Paragraph> <Paragraph position="2"> The preprocessor's average elapsed processing time per text in the MUC-4 TST3 data set of 100 texts is less than 3 seconds. Total elapsed preprocessing time for the data set is 4 minutes, 8 seconds. 
It took approximately three days to write the preprocessor, and another three or four days to create and tune the event profiles it uses.</Paragraph> <Paragraph position="3"> 1 Barry Silk is a U.S. government employee on sabbatical at Paramax.</Paragraph> <Paragraph position="4"> A CLIPS-Based Analysis System (CBAS) After the text preprocessor has tokenized a text and has predicted which types of events are likely to be described, the task of extracting information about instances of the predicted event types falls to a CLIPS-based analysis system called CBAS. CLIPS is a forward-chaining system developed at NASA's Johnson Space Center that has received a lot of attention lately because of its speed and low cost. Rule-based systems similar to CLIPS have been used before to implement data extraction systems; two well-known implementations of this sort are the Carnegie Group's Text Categorization Shell [3] and the ADS Rubric system, which is a subcomponent of the Codex system evaluated at MUC-3 [1].</Paragraph> <Paragraph position="5"> CBAS rule format. An important feature of CBAS rules is that the facts which the rules infer are associated with specific regions of text in very much the same way that edges in a parser's well-formed substring table are assigned to specific regions of an input string. In the KBIRD system referred to earlier, the region assigned to an inferred fact is automatically computed to be the maximal cumulative span of the regions of text associated with each expression in the antecedent of the rule making the inference. In CBAS, this region must be explicitly specified. Giving up the automated computation of the region provides more flexibility.</Paragraph> <Paragraph position="6"> Unlike typical parsers, which contain an implicit constraint that adjacent constituents in a rule must be realized by contiguous strings of text in the input, CBAS requires all constraints to be explicitly encoded. 
In the earlier KBIRD system, constraints were realized using operators expressing relationships among the text regions associated with facts. In CBAS, relationships are expressed via constraints on attributes of facts rather than with operators taking facts as arguments.</Paragraph> <Paragraph position="7"> The following example of a CBAS rule states that if the text preprocessor has predicted the likely occurrence of a bombing event and if the lexical item dynamited occurs in the text being analysed, then a bombing event instance is to be predicted at the same location as the occurrence of the lexical item:</Paragraph> <Paragraph position="9"> The following two CBAS rules illustrate how rudimentary phrases can be constructed. In this case, proper names are recognized through examination of a database of known names, and proper names that are next to one another are concatenated together to form compound phrases.</Paragraph> <Paragraph position="11"> Processing Phases/Rule modules. Following standard practice in forward-chaining system development, the antecedent portions of CBAS rules include references to control fact statements. These control facts are asserted and retracted during the processing of a text to enable or disable portions of the Rete network constructed out of the system's rule base.3 Through the use of control facts, different inferencing phases can be defined and the rules associated with these phases constitute separate modules, some of which may be used in other domains and/or applications. Three different types of rule modules arise: Event attribute rule modules. Modules of this sort consist of rules which infer facts describing possible properties of events without associating those properties with any specific event instance. For example, a rule module exists whose only goal is to identify proper name expressions. 
Similarly, a rule module exists that identifies possible physical targets.</Paragraph> <Paragraph position="12"> Event instance rule modules. Modules of this sort consist of rules which locate instances of types of events. There is only one module of this sort in the MUC-4 implementation, but multiple modules of this sort could exist. For example, it might be desirable to have separate modules for each event type.</Paragraph> <Paragraph position="13"> Slot value rule modules. Modules of this sort consist of rules which infer possible slot values for templates describing event instances. Facts inferred by rules in this type of module are weighted.</Paragraph> <Paragraph position="14"> Key CBAS features. The most significant features of CBAS are the following: * CBAS is implemented in CLIPS, which can be acquired at little or no cost.4 * The implementation is reasonably fast, with an average processing time per text in the MUC-4 TST3 data set of 1 minute, 46 seconds. Total processing time for all 100 messages in the data set is 2 hours, 57 minutes.5 * The implementation was built quickly, in about 2 months with less than 4 person-months of effort.6 * The CBAS rule formalism is easy to learn and requires no linguistic expertise. Most of the rules written for the MUC-4 prototype were written by someone with no background in linguistics.</Paragraph> <Paragraph position="15"> A Template Generator After the CBAS component has inferred whatever information can be extracted from a given text, a third module is responsible for generating templates based on the factbase that has been created. Three tasks are performed by the template generator: 1. It selects the actual templates that should be produced as output.</Paragraph> <Paragraph position="16"> 2. It chooses among candidate slot fillers if more than one filler has been found.</Paragraph> <Paragraph position="17"> 3. 
It prints the actual templates in the proper format.</Paragraph> <Paragraph position="18"> 3 A Rete network is a data structure commonly used to encode information in forward-chaining systems. See Forgy [2] for an explanation of Rete networks.</Paragraph> <Paragraph position="19"> 4 It can be acquired at no cost for NASA and USAF projects and at a marginal cost for all other uses ($490.00, including a three volume set of documentation).</Paragraph> <Paragraph position="20"> 5 Total processing time including non-CLIPS processing (PERL preprocessing and PROLOG template generation) is 3 2 hours for the TST3 data set.</Paragraph> <Paragraph position="21"> 6 We did not remember to actually write down the date we started. We have estimated two months because we know that CLIPS was installed at our site on 7 April 92 and we know that we stopped development on 30 May 1992. We had access to some old rulebases that were created for the MUC-3 KBIRD system, but these were expressed in a sufficiently different rule formalism to require a total rewrite. We estimate less than 4 person-months of effort because only two individuals were involved, and both individuals were required to perform other tasks in addition to their work on MUC-4. Template Selection. The process of determining which template structures to build out of the facts inferred by CBAS begins by determining if any events at all have been predicted. If no event has been predicted, then an &quot;irrelevant template&quot; is created. If several events of the same type have been created, the template generator will attempt to merge them using a set of heuristics which hypothesize that two event descriptions refer to the same event. Some of the general heuristics used for merging events of the same class are the following: The template generator is fast, with an average processing time per text in the MUC-4 TST3 data set of 11 seconds. 
Total processing time for all 100 texts in the MUC-4 TST3 data set was 17 minutes, 44 seconds.</Paragraph> </Section> </Section> <Section position="4" start_page="245" end_page="249" type="metho"> <SectionTitle> AN EXTENDED EXAMPLE </SectionTitle> <Paragraph position="0"> In this section, we illustrate in a more concrete fashion how the Paramax MUC-4 system goes about processing messages by examining in detail what happens during the processing of a specific text, message TST2-MUC4-0048, in the MUC-4 corpus.7 Our discussion will proceed through the three stages of text processing that have been identified.</Paragraph> <Paragraph position="1"> First Stage: Text Preprocessing Figure 2 contains a sampling of the CLIPS text facts created during the preprocessing stage for TST2-MUC4-0048. The msg.location, msg.date, and msg.src facts are extracted from the text dateline. The word facts identify the locations of lexical items, and the ss and pp facts identify the locations of sentence and paragraph boundaries, respectively. For this message, three types of events were predicted: attacks, bombings, and murders.</Paragraph> <Paragraph position="2"> Second Stage: CLIPS-Based Analysis At the start of the second stage of processing, the set of CLIPS facts generated during the preprocessing stage are asserted to the CBAS factbase. Once this is done, the forward-chaining engine is invoked to extract information.</Paragraph> <Paragraph position="3"> 7 Appendix F contains the text and answer key templates for TST2-MUC4-0048.</Paragraph> <Paragraph position="5"> Since linguistic analysis techniques are not incorporated into the system, information about subconstituent structure is limited. Constraints of this sort on the system's data extraction capabilities are inherent constraints. 
The system also contains accidental constraints, which are caused by engineering flaws in the system and not by inherent limitations in the extraction methods being employed.</Paragraph> <Paragraph position="6"> It is often not obvious whether a data extraction problem is a consequence of an inherent constraint or an accidental one, or of some combination of inherent and accidental constraints.</Paragraph> <Paragraph position="7"> From the perspective of the MUC-4 conference, detecting inherent constraints is more interesting than detecting accidental constraints. Keeping this in mind, performance characteristics of the CBAS analysis component will now be described by examining some of the slot values extracted by the system for message</Paragraph> <Paragraph position="9"> the system is primed to look for instances of attacks, murders, and bombings in TST2-MUC4-0048. In the CBAS system, each reference to a given type of event detected in a text is treated as if it is referring to a different event instance. It is up to the template generation component to determine that the templates describing different event instances are in fact describing the same event. A key problem with the MUC-4 implementation is that this identification process is not working well, leading to an excessive number of spurious values.</Paragraph> <Paragraph position="10"> For TST2-MUC4-0048, the MUC-4 CBAS system generated 4 response templates. The first template describes a terrorist act identified as an attack, but for which a bomb is correctly identified as an instrument. 
Given the presence of a bomb as an instrument, the failure to identify the event as a bombing is clearly the consequence of an inadequately developed rule base and not the result of any inherent constraint in the methodology being used.</Paragraph> <Paragraph position="11"> It is important to keep in mind that many event instances are created during the data extraction process, and that the goal is to identify instances of the same actual event and merge their template structures. A number of bombing event instances were generated for this message, as expected given the number of references to bombs and explosions. However, since the event instances leading to the generation of the first response template were identified as attacks, their template structures were not merged with those of bombing instances. A total of three other spurious templates, all of which were identified as descriptions of bombings, were generated for the sample text.</Paragraph> <Paragraph position="13"> the clause &quot;AN INDIVIDUAL PLACED A BOMB ON THE ROOF OF THE ARMORED VEHICLE&quot;.</Paragraph> <Paragraph position="14"> The correct string fill, &quot;URBAN GUERRILLAS&quot;, was also extracted by CBAS, but the former value was assigned a higher score. The following ground clauses were generated by CBAS to represent these strings: potential_ind_perpetrator('TERRORISTS',&quot;'URBAN GUERRILLAS&quot;).</Paragraph> <Paragraph position="15"> potential_ind_perpetrator('UNKNOWN',&quot; 'AN INDIVIDUAL&quot;).</Paragraph> <Paragraph position="16"> Unfortunately, the type specification for &quot;URBAN GUERRILLAS&quot; should have been output as 'TERRORIST', not 'TERRORISTS'. 
Had this been done, then the following rule would have assigned a higher score to &quot;URBAN GUERRILLAS&quot;.</Paragraph> <Paragraph position="17"> This type of error in the rule base could have been caught with a bit more staffing and/or development time and is characteristic of relatively simple bugs that have a significant cumulative impact on data extraction performance.</Paragraph> <Paragraph position="18"> Extracting physical targets. Two separate problems are manifested in the slot values proposed for physical targets in the first response template generated for TST2-MUC4-0048.</Paragraph> <Paragraph position="19"> The first problem is evidenced by the failure of the system to determine that &quot;CAR&quot; and &quot;ARMORED VEHICLE&quot; are co-referential. The ability to establish co-referentiality is minimal in the CBAS system, and this weakness is probably inherent in the methodology. However, in this particular case, the CBAS rule base could be expanded to treat &quot;CAR&quot; and &quot;ARMORED VEHICLE&quot; as synonyms and therefore submit only one of them as a reference to a physical target. Were this done, the problem that would then arise is the need to recognize when two separate objects are in fact being talked about.</Paragraph> <Paragraph position="20"> The second problem evidenced in the extraction of physical targets is the presence of &quot;HOME&quot; as a possible value, along with &quot;CAR&quot; and &quot;ARMORED VEHICLE&quot;. It turns out that the CBAS rule base proposed two separate attack event instances, one for the bomb attack on Alvarado's car, and one for the bomb attack on Merino's home. However, the template generator component, which is responsible for merging templates describing the same event, erroneously merged the templates describing these two separate attack instances, combining their physical target slot values. 
Not enough time was available to develop template merging heuristics.</Paragraph> <Paragraph position="21"> Extracting Human Targets. A number of spurious human target values are proposed in the first response template. Many of these spurious values have arisen because of a poorly tuned rule base and not because of any inherent constraints in the analysis techniques employed. If finer-grained rule weightings were to be employed so that strings describing human targets which are located relatively near a region of text triggering the detection of an event instance are preferred over others, then many of these spurious values would not be present.</Paragraph> <Paragraph position="22"> The human targets &quot;RICARDO VALDIVIESO&quot;, &quot;ROBERTO GARCIA ALVARADO&quot;, &quot;LEADER&quot;, &quot;JUSTICE&quot;, &quot;DRIVER&quot;, and &quot;BODYGUARDS&quot; were all proposed with equal likelihood as possible slot values for an event instance detected through the occurrence of the word &quot;KILLED&quot; in the phrase</Paragraph> </Section> <Section position="5" start_page="249" end_page="249" type="metho"> <SectionTitle> &quot;VIOLENCE THAT KILLED ATTORNEY GENERAL GARCIA.&quot; If a higher score were assigned to </SectionTitle> <Paragraph position="0"> human targets immediately following &quot;KILLED&quot;, then the other values could be eliminated. This could most certainly be done, and in fact was done in the KBIRD implementation described earlier. For this type of finer-grained score assignment, it would make sense to employ a part-of-speech tagger that is able to distinguish simple past tense forms from past participles used in passive constructions.8 It is less clear how to reintroduce &quot;DRIVER&quot; and &quot;BODYGUARDS&quot; as slot values, given that they will not be introduced if the suggested improvements in rule scoring are made. 
It is in the following two sentences that it is made clear that the driver and bodyguards were also targets: &quot;ACCORDING TO</Paragraph> </Section> <Section position="6" start_page="249" end_page="250" type="metho"> <SectionTitle> THE POLICE GARCIA ALVARADO'S DRIVER, WHO ESCAPED UNSCATHED, THE ATTORNEY GENERAL WAS TRAVELING WITH TWO BODYGUARDS. ONE OF THEM WAS INJURED.&quot; Rules </SectionTitle> <Paragraph position="0"> could certainly be written to capture these additional human targets. For example, rules could be sensitive to the existence of individuals who &quot;ESCAPE UNSCATHED&quot; and to groups of individuals, some of whom are injured and that are &quot;TRAVELING WITH&quot; a highly likely human target. But it is not clear if one would ever be able to write enough rules in a given domain to reach closure in extracting all human targets that are referred to in this sort of indirect fashion.</Paragraph> <Paragraph position="1"> The string &quot;URBAN TERRORISTS VICE PRESIDENT-ELECT&quot; arises as a reference to a human target because of a bug in one of the CBAS rules that builds up compound common noun phrases. In this case, the rule was not written to pay attention to sentence boundaries, and consequently the common noun &quot;URBAN TERRORISTS&quot; occurring at the end of one sentence and &quot;VICE PRESIDENT-ELECT&quot; occurring at the beginning of the next sentence were incorrectly identified as a single common noun expression. Correcting this type of error is easy.</Paragraph> <Paragraph position="2"> The string &quot;SALVADORAN PRESIDENT-ELECT&quot; should have been recognized as a title of Alfredo Cristiani in the phrase &quot;SALVADORAN PRESIDENT-ELECT ALFREDO CRISTIANI&quot;. A bug in one of the CBAS rules that recognizes titles next to proper names was to blame. If a common noun expression is recognized as a title, it is not proposed as a separate reference to an individual. 
The presence of Alfredo Cristiani as a possible victim could have been eliminated by the improvement in scoring techniques mentioned earlier.</Paragraph> <Paragraph position="3"> Finally, the string &quot;INDIVIDUAL&quot; arises as a reference to a human target. This error could have been eliminated by improved scoring techniques. Also, since &quot;AN INDIVIDUAL&quot; occurs as a perpetrator, improved template merging heuristics should have eliminated one of the values as a candidate filler.</Paragraph> <Paragraph position="4"> 8 Unfortunately, we have been told that taggers generally perform poorly in discriminating passive past participles from simple past tense forms and participles used in past perfect constructions.</Paragraph> <Paragraph position="5"> Third Stage: Template Generation Time and funding constraints made it impossible to implement a satisfactory template generation component for this system.</Paragraph> <Paragraph position="6"> In the previous discussion of the analysis component's performance, references were made to cases in which the template generator merged template descriptions when it shouldn't have. In many cases, the opposite is true: it fails to merge descriptions when it should. For example, two of the four templates generated for TST2-MUC4-0048 are exact duplicates of one another. Moreover, the duplicates are relatively impoverished, containing no physical targets, and only one reference to &quot;BODY&quot; as a human target descriptor. With a little cooperation from the rule base to eliminate &quot;BODY&quot; as a human target, the generator should be able to justify not printing the underpopulated templates, and in any case, an existing heuristic for not printing exact duplicates of templates needs to be fixed.</Paragraph> </Section></Paper>