<?xml version="1.0" standalone="yes"?>
<Paper uid="M93-1021">
  <Title>UNISYS: Description of the CBAS System Used for MUC-5</Title>
  <Section position="3" start_page="0" end_page="255" type="metho">
    <SectionTitle>
APPROACH AND SYSTEM DESCRIPTION
</SectionTitle>
    <Paragraph position="0"> The data extraction process performed by CBAS takes place in three processing phases. An initial tokenization phase generates a set of primitive facts. A second, intensional reasoning phase uses forward-chaining rules to infer information about possible events and their component objects and attributes from the basic facts generated in the initial phase. A third and final phase involves extensional reasoning, in which actual events and their component objects and attributes are inferred from the set of possible entities introduced during the intensional reasoning phase.</Paragraph>
    <Paragraph position="1"> Tokenization The initial tokenization phase consists of a collection of processors, each of which contributes what it can to the set of primitive facts which form the basis for higher-level reasoning. In the MUC-5 version of CBAS, three different tokenization processors were used, all of which were integrated using PERL, a programming language specifically designed for manipulating textual data.</Paragraph>
    <Paragraph position="2"> The most basic of the three processors in the MUC-5 implementation is used to do text zoning, which is the detection of regions of text corresponding to words, sentences, paragraphs, punctuation, and other regions which frequently arise in newswire text, such as date, source, and title headers, and remarks about the location of graphic images. A text zoning processor must be able to recognize the types of documents it is processing in order to properly identify regions of text, since the conventions and/or reliable clues for delimiting zones vary across document types.</Paragraph>
    <Paragraph position="3"> The two other tokenization processors in the MUC-5 implementation require the output of the text zoning processor to perform their tasks; however, they may do their processing asynchronously with respect to one another. The first of these processors determines the part-of-speech of word tokens. Currently a tagger developed by Eric Brill is being used.2 The second of the processors searches the word tokens which have been delimited for combinations which possibly correspond to company names.</Paragraph>
    <Paragraph position="4"> 1 CBAS (pronounced &quot;Sea Bass&quot;) is an acronym for Concept-Based Analysis System. For additional information on the system, including its availability, contact Carl Weir, 215-648-2369, weir@vfl.paramax.com.</Paragraph>
    <Paragraph position="9"> Two types of problems were encountered in using the part-of-speech tagger for the MUC-5 task. First, in some cases the tagger did not make sufficiently fine-grained distinctions. A good example of this type of case is the lack of a class distinction between the definite article the and the indefinite article a, both of which are assigned the tag DT. Second, in cases where the accuracy of a given tag was crucial, the tagger was often not accurate enough. This latter type of problem has arisen in rules which depend on the identification of possessive &quot;s&quot; tokens. In general, part-of-speech tagging did not play as significant a role as it was anticipated to and, given that it consumed 25% of the time required to process a message, was more trouble than it was worth in the MUC-5 task.</Paragraph>
    <Paragraph position="10"> The company name parser used in CBAS was a major success. The parser, which is implemented in C, is fast, taking on average about 4 seconds per text to do its job. The parser incorporates three procedures for detecting company names. First, it searches for known company names, looking for matches of token sequences against a company name database in Unix DBM format. The matches are not required to be exact; for example, trailing designators don't need to match, so token sequences that differ only in a trailing designator would all be matched against the &quot;Ford Motor&quot; entry in the DBM database. When looking for matches, lowercase names are not recognized, and names preceded by the preposition &quot;in&quot; are not recognized, since so many company names are also the names of places. DBM databases are capable of containing large quantities of data, as many as a billion blocks. (Currently the company name database contains about 8 MB of entries.) Moreover, DBM databases can be accessed very quickly, making them especially attractive in a data extraction task.3 In addition to the search for word token sequences corresponding to known company names, the company name parser also searches for sequences of capitalized words. This procedure does not attempt to detect sequences which start at the beginning of a sentence, or to detect sequences in &quot;all caps&quot; text. Also, sequences of tokens which correspond to place names or months are not recognized as possible company names.</Paragraph>
    <Paragraph position="11"> A third and final procedure used by the company name parser is to look for sequences of tokens which end in company designators. The basic strategy here is to first locate a company designator and then work backwards until the sequence meets one or more delimiting criteria, including the presence of a sentence boundary, a punctuation marker, a preposition, another company designator, or something in lower case (other than &quot;and&quot;).</Paragraph>
    <Paragraph position="12"> The CBAS company name parser is a good example of the sort of processor which one wants to develop in a data extraction system: the procedures it embodies are simple; the facts it extracts have a consistent level of reliability; it relies minimally on other processors (just the text zoner) to perform its task; it performs its task quickly; and finally, there are many domains for which the detection of company names is required, and so it will be a useful preprocessor in many applications.</Paragraph>
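The backward scan from a designator can be sketched in a few lines of Python. This is only an illustration of the strategy described above, not the CBAS parser (which is written in C); the word lists and the function names are assumptions made for the example, and the start of the token list stands in for the sentence boundary.

```python
# Illustrative word lists; the actual sets used by CBAS are not given in the text.
DESIGNATORS = {"CO.", "CORP.", "INC.", "LTD."}
PREPOSITIONS = {"of", "in", "with", "by", "for", "to", "from"}
PUNCTUATION = {",", ";", ":", "(", ")"}


def is_designator(token):
    """True if the token is one of the (assumed) company designators."""
    return token.upper() in DESIGNATORS


def scan_back_from_designator(tokens, end):
    """Return the candidate name that ends at tokens[end], a designator.

    tokens holds the words of one sentence, so reaching index 0 plays the
    role of hitting the sentence boundary.
    """
    start = end
    while start > 0:
        prev = tokens[start - 1]
        if prev in PUNCTUATION or is_designator(prev):
            break
        if prev.lower() in PREPOSITIONS:
            break
        # Anything in lower case other than "and" ends the name.
        if prev.islower() and prev != "and":
            break
        start -= 1
    return tokens[start:end + 1]


def find_designator_names(tokens):
    """Collect one candidate name per designator found in the sentence."""
    return [scan_back_from_designator(tokens, i)
            for i, tok in enumerate(tokens) if is_designator(tok)]
```

For the token list of "a venture with Union Precision Casting Co. of Taiwan", the scan starts at "Co." and stops at the preposition "with", leaving the four-token candidate "Union Precision Casting Co.".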
    <Paragraph position="13"> During the early stages of developing the MUC-5 version of CBAS, an effort was made to incorporate an NLP parser as yet another sensor invoked during tokenization. Tomek Strzalkowski's Tagged Text Parser (... MIT) was acquired for this purpose [4]. However, after the parser was integrated into the system, it was determined that the structures returned by the parser did not preserve enough information about the regions of actual text corresponding to recognized syntactic structures to be useful, and that modifying the parser to return suitable output structures would not be possible, given the staffing resources available for the MUC-5 effort.4 Aside from the fundamental problem with output structures inappropriate for the MUC-5 task, the TTP parser, despite its speed compared to other parsers examined, was nevertheless more than doubling the amount of time required to process a text. Consequently, the effort to incorporate an NLP parser into CBAS for the MUC-5 evaluation was abandoned.5
Unlike the situation in canonical NLP systems, the tokenization phase in the CBAS architecture involves a great deal of processing. Indeed, in CBAS any processors incorporating linguistic analysis techniques are viewed as components of the tokenization phase. What counts as a &quot;primitive fact&quot; versus a &quot;derived fact&quot; is a fairly arbitrary decision, and similarly what counts as a tokenization phase component versus a component of some higher-level processing phase is also arbitrary. However, the direction which is being taken in CBAS, pushing more and more analysis &quot;up front&quot; in the form of multiple, specialized, relatively asynchronous processors, is one which other research groups are also finding to be advantageous.6 We believe there is a trend underway in which NLP systems applied to information extraction tasks are beginning to look more and more like standard multisensor data fusion engines.
3 ... issue in performing well on the evaluations, since a rapid rule development cycle is needed for development purposes; a failure to consider the need for a rapid rule development cycle is one of the more common errors among less experienced participants in such efforts. Government sponsors have also begun to realize that a data extraction system which can process, say, 100 messages in 15 minutes is useful as an interactive analysis tool, which is a very desirable attribute. A few extraction systems are capable of this level of performance; systems relying heavily on linguistic analysis techniques take much longer, in the neighborhood of 8-10 hours. Typical extraction systems which do not rely heavily on linguistic analysis techniques require 2-5 hours (1-3 minutes per text) to process 100 messages, depending on the texts being processed. However, no existing data extraction system is truly interactive in the sense that extraction queries can be formulated &quot;on the fly&quot;; all implementations of existing extraction architectures are custom-built to answer a single query.</Paragraph>
    <Paragraph position="14"> Intensional Reasoning After the tokenization phase has generated a collection of primitive facts, &quot;higher-level&quot; processing phases of the CBAS architecture are invoked to derive additional information. Two such phases exist in the current implementation of CBAS, and the first of these involves intensional reasoning, so named because the general idea at this stage is to detect possible events being referred to, along with their component objects and attributes, without firmly committing to their existence.</Paragraph>
    <Paragraph position="15"> Both of the higher-level processing phases are realized as collections of forward-chaining rules. The decision to use forward-chaining as the default reasoning method was motivated by an overall desire in CBAS to maintain as asynchronous a reasoning process as possible, imposing control only when necessary. CLIPS, a popular forward-chaining system, was used to implement the higher-level phases.7 It is easy in CLIPS to incorporate calls to external programs via C procedures, and this capability makes it possible to escape from the default forward-chaining reasoning method whenever it is desirable to engage in a different style of analysis. In CBAS, calls are made within CLIPS images to external UNIX DBM databases, which are used to store static knowledge (just as the company name parser stores relatively static knowledge about known companies). This use of DBM databases greatly reduces the size of internal CLIPS factbases without a penalty in access time.</Paragraph>
    <Paragraph position="16"> A number of other MUC-5 systems have architectures similar to that of CBAS in that pattern-matching plays a key role in their reasoning phases.8 However, CBAS is distinguished from these systems in that the pattern-matching process in CBAS is implemented using general-purpose expert system software whereas the other systems rely on custom-built code, and in most cases the custom-built code involves the use of a formalism which is less familiar to ordinary users than standard production rules.</Paragraph>
    <Paragraph position="17"> 4 In MUC evaluation tasks, there is a need to supply the actual text substrings corresponding to an analysis structure when instantiating output data structures (templates), and it has been our experience that the representations generated by some linguistic analysis components (of which TTP is just one example) do not provide a straightforward means of satisfying this requirement.</Paragraph>
    <Paragraph position="18"> 5 Independent of speed and the accessibility of data in the output structures generated by linguistic analysis components, another problem which may be lurking about is a highly inconsistent level of reliability: it could be that the accuracy of results is so unpredictable that incorporating linguistic analysis results in the contexts of intensional and extensional reasoning is too much of a rule-writing burden to be manageable.</Paragraph>
    <Paragraph position="19"> 6 Lisa Rau (GE) has expressed this view in discussions.</Paragraph>
    <Paragraph position="20"> 7 CLIPS is a &quot;GOTS&quot; product developed and maintained at NASA's Johnson Space Center. Rule-based systems similar to CLIPS have been used before to implement data extraction systems; two well-known implementations of this sort are the Carnegie Group's Text Categorization Shell [3] and the ADS Rubric system, which is a subcomponent of the Codex system evaluated at MUC-3 [1].</Paragraph>
    <Paragraph position="21"> 8 A distinction is being made here between pattern-matching and various forms of NLP-based syntactic analysis, including systems which don't make a strong attempt to derive full sentential parses.</Paragraph>
    <Paragraph position="22"> A fundamental feature of the forward-chaining rules used in CBAS is that the facts which the rules infer are associated with specific regions of text in very much the same way that edges in a parser's well-formed substring table are assigned to specific regions of an input string. However, unlike typical parsers, which contain an implicit constraint that adjacent constituents in a rule must be realized by contiguous strings of text in the input, all constraints in CBAS inference rules are explicitly encoded via attributes of facts; contiguity is not assumed.</Paragraph>
    <Paragraph position="23"> A brief digression is needed at this point to provide a basic understanding of the structure of a CLIPS forward-chaining rule. First, any forward-chaining system, CLIPS included, has two basic data types: facts and rules. Facts represent what is already known, and rules describe how to infer new facts, given whatever facts currently exist. Forward-chaining rules have a &quot;left-hand side&quot; (LHS) and a &quot;right-hand side&quot; (RHS), which are delimited from each other by an arrow symbol, =&gt;. The LHS of a rule consists primarily of patterns that facts in the factbase might satisfy, and the RHS of a rule consists of actions to be performed if all the expressions constituting the LHS of the rule match existing facts. A common action performed on the RHS of a rule is to assert new facts and/or to remove existing facts which match the patterns on the rule's LHS. Pattern-matching never occurs on the RHS of a rule, only actions. In CLIPS, rules are defined using a defrule construct, which is fairly transparent in format. It is easier to grasp the nature of a forward-chaining rule by looking at concrete examples. The following CBAS rule used in the intensional reasoning phase states that if a company name has been predicted by the company name parser, and if this company name consists of one word token whose part-of-speech category is PP$, VB, RB, IN, or CC, then the predicted company name is not really describing a company object and should be eliminated from consideration.9</Paragraph>
    <Paragraph position="25"> Note in this example how the &quot;l&quot; (left) and &quot;r&quot; (right) attributes, whose values are pointers to locations in the text, are used to capture the fact that the company-name &quot;concept&quot; and the word token span the same region of text. Typically in forward-chaining formalisms an expression beginning with a question mark is a variable to be instantiated by a value in an actual fact in the factbase. Note that for the &quot;cat&quot; attribute, alternative literal string values are provided; a given actual fact would need to have a value for its &quot;cat&quot; attribute which corresponds to one of the literal strings. The CLIPS facts used in CBAS are defined to be &quot;template&quot; structures, which means that the order in which attributes are specified is irrelevant, and templates will match a pattern on the LHS of a rule even if the template has attributes not specified in the pattern; the only requirement is that attributes explicitly mentioned in the pattern match the template.10 Finally, the ?A &lt;- notation is used to provide a way of pointing to the fact instantiating a given pattern on the LHS so that on the RHS the fact can be modified or deleted.</Paragraph>
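The logic of the filtering rule described above can be sketched outside CLIPS. The Python below is a rough analogue, not the CBAS rule: facts are modelled as dictionaries, the function name is hypothetical, and joining a company name to a token on identical "l" and "r" values is how the sketch expresses "consists of one word token".

```python
# Part-of-speech tags that disqualify a one-word company name (from the text).
BAD_TAGS = {"PP$", "VB", "RB", "IN", "CC"}


def filter_company_names(company_names, txt_tokens):
    """Drop predicted names that span exactly one token with a disqualifying tag.

    Both fact types are dicts carrying "l" and "r" text pointers; token facts
    also carry a part-of-speech tag under "cat".
    """
    # Index tokens by the region they cover so a name can be joined on (l, r).
    tag_by_span = {(t["l"], t["r"]): t["cat"] for t in txt_tokens}
    kept = []
    for name in company_names:
        tag = tag_by_span.get((name["l"], name["r"]))
        if tag in BAD_TAGS:
            continue  # plays the role of retracting the company-name fact
        kept.append(name)
    return kept
```

A name covering several tokens never shares its (l, r) pair with a single token, so it passes through untouched, which mirrors the way the pattern in a forward-chaining rule simply fails to match.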
    <Paragraph position="26"> In the following rule, the l(eft) and r(ight) attributes of txt_token facts are used to require two word tokens to be contiguous. This rule illustrates a rudimentary form of syntactic analysis in which words in domain-specific classes are combined to infer constituent structures. Constraining the tokens to specific word classes is done by unifying the reg attributes of txt_token and word facts, where word facts encode the class information. In this particular case, the only words of type &quot;joint&quot; are joint, co-operative, and new, and the only words of type &quot;venture&quot; are venture, project, plan, deal, firm, concern, and development.</Paragraph>
    <Paragraph position="27"> The set of possible phrases recognized by this rule is the Cartesian product of these two word classes.</Paragraph>
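To make the size of that Cartesian product concrete, the two word classes listed above can be enumerated directly. The sketch below is Python rather than CLIPS; the matcher function is hypothetical, and treating the "r" pointer of one token being equal to the "l" pointer of the next as contiguity is an assumption, since the text does not spell out how the pointers are encoded.

```python
from itertools import product

# Word classes as listed in the text.
JOINT_WORDS = ["joint", "co-operative", "new"]
VENTURE_WORDS = ["venture", "project", "plan", "deal",
                 "firm", "concern", "development"]

# Every phrase the rule can recognize: 3 x 7 = 21 combinations.
PHRASES = [" ".join(pair) for pair in product(JOINT_WORDS, VENTURE_WORDS)]


def match_joint_venture(tokens):
    """Return (l, r) regions where a "joint" word directly precedes a "venture" word.

    Each token is a dict with pointers "l" and "r" and a regularized
    (lowercase) form "reg".
    """
    by_left = {t["l"]: t for t in tokens}
    regions = []
    for first in tokens:
        second = by_left.get(first["r"])  # the token that starts where this one ends
        if (second is not None
                and first["reg"] in JOINT_WORDS
                and second["reg"] in VENTURE_WORDS):
            regions.append((first["l"], second["r"]))
    return regions
```

The product contains phrases such as "new deal" alongside "joint venture", which shows how permissive a purely class-based rule of this kind is.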
    <Paragraph position="28"> 9 TreeBank part-of-speech labels are assigned by the tagger used in CBAS.
10 Do not confuse the use of the term &quot;template structure&quot; in CLIPS with the use of the same term in MUC applications; in the latter case, it refers to output structures which are intended to represent generalized database records.</Paragraph>
    <Paragraph position="29"> Surely the above rule represents the sort of formalism that gives linguists nightmares: subconstituents are domain-specific, not embodying any linguistic generalizations.11 Nevertheless, such rules are much simpler to compose and maintain, despite their superficially complex appearance, than standard collections of grammar rules for large-scale systems. Moreover, they are much more robust (grammar rules are so interdependent that robustness is a chronic problem), and they are much faster to execute, simply because they do not constitute an effort to reach a complete constituent analysis.</Paragraph>
    <Paragraph position="30"> In the following forward-chaining rule a distinction is made between definite and indefinite references to joint ventures. In this case, explicit strings corresponding to definite and indefinite articles must be accessed, since no part-of-speech distinction is available between definite and indefinite determiners. In the MUC-5 version of CBAS only non-definite references to joint ventures permit the inference of a joint venture.
In the above rule, the word represented by the first txt_token fact is not required to be contiguous with the word represented by the second txt_token fact. However, the first word is required to be to the left of the second word. The negated pattern ensures that any words occurring between the first and second words must be adjectives or numeric expressions, i.e., modifying expressions. The &amp;: notation introduces &quot;in-line&quot; functional constraints on variables in patterns. It should be possible to hide a great deal of the explicit encoding of constraints on location pointers by introducing a slightly higher-level formalism which expands to the explicit notation currently being used. The primary reason this has not already been done is that while encoding the constraints may look complicated, it is actually a fairly straightforward task, and taking the time out to develop the higher-level formalism has not been justifiable.</Paragraph>
    <Paragraph position="31"> A significant feature of the above rule is the use on the right-hand side of the function get-region-string.</Paragraph>
    <Paragraph position="32"> This function invokes a remote C procedure which accesses DBM databases. In this rule, the procedure is used to access regions of text both in their citation forms and in a regularized form (all lowercase).</Paragraph>
    <Paragraph position="33"> The ability to compute arbitrary regions of text in this fashion greatly simplifies the writing of CBAS forward-chaining rules, since it bypasses the need to do explicit pattern-matches on the left-hand side of the rule to determine the strings corresponding to word tokens, a particularly problematic situation, given that in this particular case, the distance between the determiner and the &quot;venture&quot; constituent is arbitrary. This is a good example of when bypassing a default reasoning method is desirable.</Paragraph>
    <Paragraph position="34"> 11 To be fair, it has been our experience that &quot;industrial-strength&quot; grammars tend to be very domain-specific as well, requiring a high overhead for rule maintenance.</Paragraph>
    <Paragraph position="35"> Following standard practice in forward-chaining system development, the antecedent portions of CBAS forward-chaining rules include references to &quot;control fact&quot; statements (see the above rules for examples). These control facts are asserted and retracted during the processing of a text to enable or disable portions of the Rete network constructed out of the system's factbase.12 The use of control facts is dependent on the ability to set the salience of a given forward-chaining rule. The salience of a rule determines its position on the agenda CLIPS maintains of all rules whose left-hand side patterns have been satisfied.</Paragraph>
    <Paragraph position="36"> Below, for example, is a rule which retracts a control fact of the form (control-fact (phase const)) and asserts a fact of the form (control-fact (phase et)). All rules whose LHS contains the pattern (control-fact (phase const)) and which have a higher salience value than -500 will be activated before this rule has a chance to retract the fact, after which those rules will no longer be able to fire.13 Each rule retracting a control fact generally asserts a new control fact in order to activate another portion of the Rete network.</Paragraph>
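As a rough illustration of the mechanism (not the CLIPS rule itself), the interplay of control facts and salience can be simulated with a toy agenda in Python. The engine, the rule names, and the "build-output" rule are all hypothetical; only the phase names const and et and the salience value -500 come from the text, and letting each rule fire at most once is a simplification of how CLIPS avoids re-firing on the same facts.

```python
def run_agenda(rules, facts):
    """Fire activated rules in salience order until the agenda is empty.

    A rule is activated when its "phase" equals the current control fact.
    Returns the names of the rules in the order they fired.
    """
    fired = []
    while True:
        agenda = [rule for rule in rules
                  if rule["phase"] == facts["phase"] and rule["name"] not in fired]
        if not agenda:
            return fired
        rule = max(agenda, key=lambda r: r["salience"])  # highest salience first
        rule["action"](facts)
        fired.append(rule["name"])


def find_events(facts):
    facts["events"].append("possible-joint-venture")


def switch_to_et(facts):
    # Stands in for retracting (phase const) and asserting (phase et).
    facts["phase"] = "et"


def build_output(facts):
    facts["output"] = list(facts["events"])


RULES = [
    {"name": "end-const-phase", "salience": -500, "phase": "const", "action": switch_to_et},
    {"name": "find-events", "salience": 0, "phase": "const", "action": find_events},
    {"name": "build-output", "salience": 0, "phase": "et", "action": build_output},
]
```

Because the phase-switching rule has the lowest salience in its module, every other rule that depends on the const control fact gets its turn first; once the switch fires, those rules drop off the agenda and the et module becomes active.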
    <Paragraph position="38"> The rules which are associated with a given portion of a Rete network which is activated or deactivated by a given control fact constitute a rule module. Three different types of rule modules arise in the intensional reasoning phase:
* Modules which consist of rules for locating possible references to events. There is only one module of this sort in the MUC-5 implementation of CBAS, since only one type of event is of interest, but multiple modules of this sort could exist. (In the MUC-4 terrorist domain, for example, different types of terrorist acts needed to be distinguished.)
* Modules which consist of rules for inferring facts describing possible objects and attributes of events. For example, a rule module exists which &quot;promotes&quot; predicted company names to the status of being denotations of company entities.</Paragraph>
    <Paragraph position="39"> * Modules which consist of rules for associating possible objects and attributes of events with specific possible events. For example, modules exist for determining the roles played by objects associated with a given possible event.</Paragraph>
    <Paragraph position="40"> During the intensional reasoning phase, data correlation is done across objects, but not across events. In the MUC-5 joint venture domain, this activity primarily involves reference resolution among company entities. The rules used to perform this task in CBAS are fairly primitive; the following rule does most of the work by ensuring that if two company entities exist and one has a &quot;reg&quot; value which is a substring of the other, then the &quot;cite&quot; and &quot;reg&quot; attribute values of the entity with the shorter reg value are made the same as the longer cite and reg values. It also ensures that both entities have the short cite value as an &quot;alias&quot;, which is a requirement in the MUC-5 task.</Paragraph>
    <Paragraph position="41"> 12 A Rete network is a data structure commonly used to encode information in forward-chaining systems. See Forgy [2] for an explanation of Rete networks.</Paragraph>
    <Paragraph position="42"> 13 Activation of each rule will, of course, also depend on all other LHS patterns matching facts in the factbase as well.</Paragraph>
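The substring-based resolution just described can be sketched procedurally. This Python version is an approximation under stated assumptions: entities are dictionaries with "cite", "reg", and an "aliases" set (hypothetical field layout), and each shorter name is matched against the longest containing name in a single pass, whereas a forward-chaining rule would reach the same state by firing repeatedly.

```python
def merge_aliases(entities):
    """Unify company entities whose regularized name is a substring of another's.

    The entity with the shorter "reg" takes over the longer "cite" and "reg",
    and both entities record the short citation form as an alias.
    Entities are updated in place and also returned.
    """
    ordered = sorted(entities, key=lambda e: len(e["reg"]))
    for i, short in enumerate(ordered):
        # Try the longest candidates first so chains resolve in one step.
        for longer in reversed(ordered[i + 1:]):
            if short["reg"] != longer["reg"] and short["reg"] in longer["reg"]:
                alias = short["cite"]
                short["aliases"].add(alias)
                longer["aliases"].add(alias)
                short["cite"], short["reg"] = longer["cite"], longer["reg"]
                break
    return entities
```

A plain substring test is as blunt as the text suggests: it would also link a parent name to a longer subsidiary name that happens to contain it, which is one reason the rule is called primitive.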
    <Paragraph position="44"> Determining coreference relations is a critical issue in data extraction technology. Unfortunately, the majority of work done by linguists in this area involves pronominal coreference, whereas in the data extraction tasks which have been examined in MUC conferences, coreference among common noun descriptions is a more significant issue.14
The Extensional Reasoning Phase The second &quot;higher-level&quot; processing phase in CBAS is called the extensional reasoning phase. The general purpose of this phase is to take the information about possible events and their component objects and attributes contributed by the intensional reasoning phase and to identify on the basis of this information a collection of actual event instances to be represented as database objects. In practice, rules in the intensional reasoning component have been responsible for data correlation at the object level, and rules in the extensional reasoning component have been responsible for data correlation at the event level. For the MUC-5 version of CBAS there was not enough time to develop a set of rules for correlating descriptions of events. The most significant inference made during this phase is the elimination of joint venture event descriptions from consideration if the descriptions include references to fewer than two non-coreferential partners. One would expect that a failure to correlate event descriptions should result in a higher number of spurious actual events being reported. Fortunately, however, the generation of spurious events was not a serious problem in the MUC-5 task.15 The majority of rules constituting the extensional reasoning phase actually have very little to do with inferring information conveyed in an input text. Instead, the purpose of most rules in this phase is to generate the database objects which are to be returned as the system's output. From a knowledge-engineering perspective, this task is not terribly interesting, but it nevertheless takes a significant amount of effort to implement.16</Paragraph>
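The partner-count filter that dominates this phase is simple enough to state in a few lines. The Python below is a minimal sketch with an assumed data layout: each possible event is a dictionary whose "partners" list holds entity identifiers, and mentions already resolved as coreferential are assumed to share an identifier.

```python
def actual_events(possible_events):
    """Keep only events that mention at least two non-coreferential partners."""
    return [event for event in possible_events
            if len(set(event["partners"])) >= 2]
```

An event whose only partners are two mentions of the same company is therefore discarded, while an event naming two distinct companies is instantiated.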
  </Section>
  <Section position="4" start_page="255" end_page="255" type="metho">
    <SectionTitle>
AN EXTENDED EXAMPLE
</SectionTitle>
    <Paragraph position="0"> In this section, we illustrate in a more concrete fashion how the MUC-5 version of CBAS goes about extracting information by examining in detail what happens during the processing of a specific text in the MUC-5 corpus. Our discussion will proceed through the three processing phases which have been identified. Figure 2 contains the sample message upon which the discussion is based.</Paragraph>
    <Paragraph position="1"> 14 A commonly appearing form of coreference is &quot;part-whole&quot; reference. Here is an example from a MUC evaluation text:</Paragraph>
  </Section>
  <Section position="5" start_page="255" end_page="256" type="metho">
    <SectionTitle>
WE HAVE ALSO LEARNED THAT TWO VEHICLES OF THE SALVADORAN RED CROSS HAVE ALSO
BEEN ATTACKED . ONE OF THEM WAS TOTALLY DESTROYED BY FIRE IN THE MEJICANOS SEC -
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="6" start_page="256" end_page="257" type="metho">
    <SectionTitle>
BRIDGESTONE SPORTS HAS SO FAR BEEN ENTRUSTING PRODUCTION OF GOLF CLUB PARTS
WITH UNION PRECISION CASTING AND OTHER TAIWAN COMPANIES .
WITH THE ESTABLISHMENT OF THE TAIWAN UNIT, THE JAPANESE SPORTS GOODS
MAKER PLANS TO INCREASE PRODUCTION OF LUXURY CLUBS IN JAPAN .
</SectionTitle>
    <Paragraph position="0"> Tokenization Figure 3 contains a sampling of the basic facts created during the tokenization stage for the example message. These facts are in a format appropriate for processing by CLIPS. The txt_token facts identify the locations of lexical items, and the sentence and paragraph facts identify the locations of sentence and paragraph boundaries, respectively. The part-of-speech tagger is invoked during the delimitation of word tokens, and part-of-speech categories returned by the tagger are added to the other information collected in the tokenization process.17 The company name parser, which is responsible for generating the company_name facts illustrated in Figure 3, relies upon the presence of word and sentence boundaries. (All of the company_name facts generated for this example are listed.)</Paragraph>
    <Paragraph position="1"> Intensional Reasoning At the start of the second stage of processing, the set of basic facts detected during tokenization is asserted to the factbase of the CLIPS-based intensional reasoning component. Once this is done, the forward-chaining engine is invoked to infer information about possible events, objects, and attributes. In the example text, only one reference to a joint venture event is detected: in the first sentence, the phrase SET UP A JOINT VENTURE triggers the inference that an event reference has occurred. The phrase THE JOINT VENTURE does not trigger an event reference because it is recognized as a definite reference; however, this definite reference is recorded.</Paragraph>
    <Paragraph position="2"> The company name parser invoked in the tokenization phase has detected the presence of several possible company name references. Based on testing of the company name parser, it is known that whenever the metric it assigns to a possible name is less than 1.0, the likelihood that an actual company name is present is relatively low, and consequently, any possible company names scoring less than 1.0 are thrown out. This heuristic generally works very well (as a heuristic should), but in this example a company name is excluded that it would have been better to keep: BRIDGESTONE SPORTS CO. Because of this error, CBAS misses the identification of one of the parents of the detected joint venture. The heuristic also fails to rule out TRADING HOUSE as a plausible company name and consequently it is incorrectly inferred to be a reference to a parent company. The other two parents in the joint venture, UNION</Paragraph>
  </Section>
  <Section position="7" start_page="257" end_page="260" type="metho">
    <SectionTitle>
PRECISION CASTING CO. and TAGA CO., are correctly identified.
</SectionTitle>
    <Paragraph position="0"> Rules for determining the roles played by companies typically involve the detection of a company name in a syntactic context within which a relationship of a certain type is likely to be mentioned. For example, definite references to joint ventures followed by a comma followed by a company name typically signal that the company name denotes a company in a child role. It is for this reason that BRIDGESTONE SPORTS TAIWAN CO. in the context THE JOINT VENTURE, BRIDGESTONE SPORTS TAIWAN CO. is inferred to be referring to a child.</Paragraph>
    <Section position="1" start_page="257" end_page="260" type="sub_section">
      <SectionTitle>
Extensional Reasoning
</SectionTitle>
      <Paragraph position="0"> The extensional reasoning phase is implemented as a completely separate CLIPS process. During this processing phase, decisions are first made about which events are actual and which events are spurious.</Paragraph>
      <Paragraph position="1"> No effort is made in the MUC-5 version of CBAS to correlate events. The primary processing strategy is a simple one: do not instantiate events which do not have two or more non-coreferential partners. In the sample text, only one event is inferred, and since it has two or more partners, it is instantiated. The template generated by CBAS for this example is given in Figure 4.18
17 In general, we have found it advantageous (both in terms of rule-writing convenience and processing speed) to have &quot;fat&quot; facts, that is, to maintain as much information in a single clause as is reasonable instead of distributing information across clauses. For this reason, the sentence and paragraph facts are actually used very little; instead, information about sentence and paragraph membership is built into the txt_token facts. (In the example rules and facts shown in this paper, a number of features irrelevant to the discussion have been eliminated to make the presentation more concise and lucid.)
18 An important strategy which we employed in MUC-5 was simply not to try to extract every possible detail specified in the</Paragraph>
    </Section>
  </Section>
</Paper>