<?xml version="1.0" standalone="yes"?>
<Paper uid="M95-1016">
  <Title>Person Organization Responsibility Period of Participatio Ave. % of Time Approx. No. of n Devoted to Project Person- Months</Title>
  <Section position="3" start_page="193" end_page="200" type="metho">
    <SectionTitle>
SYSTEM DESCRIPTION
</SectionTitle>
    <Paragraph position="0"> In this section we summarize the elements of our design objectives and provide extensive details of the general system implementation.</Paragraph>
    <Section position="1" start_page="193" end_page="200" type="sub_section">
      <SectionTitle>
Design Approach
</SectionTitle>
      <Paragraph position="0"> Two sets of considerations determined our selected approach to the data-extraction problem: our set of performance objectives, and the results of our analyses of data-classes. Concerning performance, we wanted a system that would most of all be highly accurate -- less than 5% misses and false alarms -- yet robust in the sense of failing seldom, failing soft, and restoring easily. We also wanted a system that was easily extensible, serviceable by programmers, supportive of informed guessing (but giving confidence and basis), and eventually capable of learning extensions. Extremely rapid processing is not a requirement, in the belief that achieving the quality goals first could be followed by subsequent speed-up enhancements (e.g., parallelization).</Paragraph>
      <Paragraph position="1"> Our analysis of a wide variety of data-classes indicates that the majority of classes, and of instances, do not require sentence-level linguistic information for their detection and identification. That is, it appears that having a good sentence parse would only infrequently be of value in determining the data-classes embedded in the sentences. The needed information seems to be primarily available within the data-class phrase itself, in the form of obligatory and optional elements. For the majority of &quot;difficult&quot; cases, it would appear that the semantic features of local adjacent contexts plus some global context (e.g., the type of document) could resolve the identifications.</Paragraph>
      <Paragraph position="2"> These considerations motivated the three-stage strategy we adopted for the DX project. The first stage is unabashedly a brute-force method and is to identify by look-up: for those data-classes like person-names and organizations that are amenable to this approach, get as many instances as possible, enter them in the knowledge-bank, and then check for them in any new input. Secondly, identify by pattern, using a language specially developed to have all the functionality useful for this approach, especially very powerful pattern-matching capability coupled with the facility of placing arbitrary constraints on the patterns or local contexts. The final step is to identify using semantics-based contextual reasoning, wherein potential remaining data-class targets are fuzzily classified as one of several related data-classes (e.g., proper names) on the basis of suggestive cues (e.g., capitalization). Then, reasoning heuristics (e.g., &quot;names of people often have appositives whose heads or modifiers are marked semantically as being characteristic of people&quot;) are used to direct the search for discriminating contextual clues. (Footnote: This conversion activity is part of the Navy program to develop the support technologies for producing and using Interactive Electronic Technical Manuals (IETM) for new and existing systems, as governed by the MIL-M-87268 standard. In turn, DXL is now being used as the specification language for recognizing various components of existing Navy tech manuals in the SAIC IETM project.)</Paragraph>
      <Paragraph position="3"> This three-fold approach arguably meets most of the performance goals and certainly fits the finding of data-classes being pattern-based. However, it is an empirical question whether the desired accuracy levels -- particularly &lt;5% misses -- can truly be achieved without at least a reasonable identification of the case-roles of the surface constituents. For this reason, an additional design requirement for the new pattern language was that it would have the capability to represent very powerful sentence-level parsing grammars (e.g., context-sensitive rules, unrestricted look-ahead, easy representation of discontinuous constituents).</Paragraph>
      <Paragraph position="4"> Implementation Features System-level Overview.</Paragraph>
      <Paragraph position="5"> A much-simplified view of the DX system is given in Figure 1. The first step is to convert the character-stream input into a stream of &quot;tokens&quot;, essentially words, based primarily on the presence of blanks between character-strings. Upon identification, each token is annotated with a set of &quot;prime&quot; attribute-values including: the most important attribute &quot;type&quot; (&quot;word&quot;, &quot;num&quot;, &quot;sgml&quot;, &quot;punc&quot;, or &quot;mix&quot;), start/stop character positions, &quot;chars&quot; (the token's character string), capitalization (&quot;capital&quot;, &quot;upper&quot;, &quot;lower&quot;, or &quot;mixed&quot;), and token number. The value of the &quot;chars&quot; attribute is that string which has all beginning brackets (or parentheses or braces) removed as well as all ending brackets and punctuation of all kinds; appropriate attributes are set to indicate what was removed. In addition, the singular possessive marking &quot;'s&quot; is also removed and a possessive attribute set. In this way, the token &quot;chars&quot; value can be looked up in the knowledge-bank without worrying about punctuation of any kind. All of the subsequent system processing is accomplished with specific pattern-rules and involves moving up and down the token stream, checking tokens for particular attribute values, and, when successful, either changing the attribute-values of single tokens or replacing several tokens with a new one (with a new &quot;type&quot; attribute, whose &quot;chars&quot; attribute is the concatenation of those replaced). For any particular rule, the whole document is searched from the first token to the last.</Paragraph>
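      <Paragraph> The tokenization step described above can be sketched roughly as follows. This is a minimal Python illustration, not the DX implementation (which is written in Scheme); the function names, the dictionary keys other than &quot;type&quot; and &quot;chars&quot;, and the exact classification tests are assumptions made for the example, and the &quot;sgml&quot; type is omitted.

```python
import re
import string

OPENERS = "([{"  # beginning brackets, parentheses, braces


def classify_type(chars):
    """Assign the prime "type" attribute (the "sgml" type is not handled here)."""
    if chars.isalpha():
        return "word"
    if re.fullmatch(r"[0-9][0-9.,]*", chars):
        return "num"
    if not any(c.isalnum() for c in chars):
        return "punc"
    return "mix"


def classify_caps(chars):
    """Assign the capitalization attribute; None when there are no letters."""
    if chars.islower():
        return "lower"
    if chars.isupper() and len(chars) > 1:
        return "upper"
    if chars[:1].isupper() and (len(chars) == 1 or chars[1:].islower()):
        return "capital"
    if any(c.isalpha() for c in chars):
        return "mixed"
    return None


def tokenize(text):
    """Split on blanks and annotate each token with prime attribute-values."""
    tokens = []
    for number, match in enumerate(re.finditer(r"\S+", text), start=1):
        raw = match.group()
        body = raw.lstrip(OPENERS)
        core = body.rstrip(string.punctuation)
        if core:
            removed_before = raw[:len(raw) - len(body)]
            removed_after = body[len(core):]
        else:
            # Pure punctuation: keep the raw string as the token's chars.
            core, removed_before, removed_after = raw, "", ""
        possessive = core.endswith("'s")
        if possessive:
            core = core[:-2]
        tokens.append({
            "number": number,
            "type": classify_type(core),
            "chars": core,
            "start": match.start(),  # assumed to span the raw string
            "stop": match.end(),
            "caps": classify_caps(core),
            "removed_before": removed_before,
            "removed_after": removed_after,
            "possessive": possessive,
        })
    return tokens
```

 Calling tokenize on the input &quot;Dr. Joyce Smith lost money.&quot; yields five tokens whose &quot;chars&quot; values carry no punctuation, so each can be used directly as a knowledge-bank look-up key.</Paragraph>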
      <Paragraph position="6"> The second major processing step is to check the knowledge-bank to see whether the &quot;chars&quot; values of tokens are known as instances of a data-class of interest (e.g., known locations or organizations). If so, the tokens corresponding to the known data-class entry are replaced with a single token whose &quot;type&quot; is set to the name of the data-class. In addition, new attributes are added, including part-of-speech, the word-stem, and number (singular or plural). Token &quot;chars&quot; not recognized as data-classes are analyzed morphologically to determine their part-of-speech, word-stem, and number.</Paragraph>
      <Paragraph position="7"> In the third step, the major data-class recognition rules are applied. The knowledge-bank is again a major source of information but this time primarily for &quot;signal-words&quot; which usually preface or terminate data-class patterns. For example, a number of organizations fit the canonical form of a &quot;prefix organizational title&quot; followed by &quot;of&quot; or &quot;for&quot; followed by a location or unknown capitalized words (e.g., &quot;Bank of New York&quot;). Prefix titles like &quot;bank&quot; are characterized in the knowledge-bank by a number of</Paragraph>
      <Paragraph position="9"> attributes, such as being capitalized, not a name, referring to an organization, whose activity is commercial. A token whose &quot;chars&quot; value is &quot;bank&quot; will not have these attribute-values already associated with it. Rather, only when some rule asks whether the token has the attribute values associated with an organizational title will the knowledge-bank be checked for these. As a result of checking for such attributes, the values returned will be set on the token. This process of adding values to a token only when a rule has inquired of the attributes is known as &quot;lazy annotation&quot; and is much more economical than attaching all possible attribute-values slavishly to all known knowledge-bank entries.</Paragraph>
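      <Paragraph> Lazy annotation as described above might look something like the following Python sketch. The Token class, the get method, the look-up counter, and the attribute names and values in the toy knowledge-bank (other than &quot;title_typ&quot; and &quot;c1&quot;, which appear in this paper) are hypothetical and serve only to show the mechanism.

```python
# Toy knowledge-bank keyed by lower-case look-up strings (illustrative values).
KNOWLEDGE_BANK = {
    "bank": {
        "title_typ": "pref",
        "c1": "org",
        "name": False,
        "capitalized": True,
        "activity": "commercial",
    },
}


class Token:
    """A token that starts with prime attributes only."""

    def __init__(self, chars):
        self.chars = chars
        self.attrs = {"chars": chars}
        self.lookups = 0  # counts knowledge-bank consultations

    def get(self, attribute):
        """Return an attribute value, consulting the knowledge-bank on demand."""
        if attribute not in self.attrs:
            entry = KNOWLEDGE_BANK.get(self.chars.lower(), {})
            self.lookups += 1
            # Set every returned value on the token so later rules reuse it.
            for key, value in entry.items():
                self.attrs.setdefault(key, value)
            # Remember a miss too, so the bank is not queried again for it.
            self.attrs.setdefault(attribute, None)
        return self.attrs[attribute]
```

 A rule that asks Token(&quot;Bank&quot;).get(&quot;activity&quot;) triggers one look-up; a later query for &quot;title_typ&quot; on the same token is answered from the values already set on it.</Paragraph>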
      <Paragraph position="10"> An example of these processing steps is given in Figure 2 for the input &quot;Dr. Joyce Smith lost money.&quot; The tokenization and associated attribute-annotation for the first two words is shown as lists of attribute-value pairs in the LISP-list format used by the Scheme programming language in which the system is written (supplemented by a few C and C++ modules). The next line on Figure 2 shows what the results would be were the knowledge-bank looked-up for instances of various kinds of titles.3 In this case the character-string &quot;dr&quot; is associated with two types of titles, one a prefix for person-names and one a suffix for in-city street references; both values are returned for the attribute title_typ. For a person-name recognition rule such as is shown in Figure 3 below, the likely result would be the replacement of the first three tokens (&quot;dr&quot;, &quot;Joyce&quot;, and &quot;smith&quot;) with a new single token of type &quot;person&quot;, with start/stop values reflecting the span of coverage in the input, and with a new attribute giving the name of the rule that was successfully applied (&quot;cap-person&quot;). Outputs include both a tagging of the target in the input and the inclusion of the target characters in a list of other such person targets.</Paragraph>
      <Paragraph position="11"> The Data-Extraction Language, DXL
 Five of the major features of DXL are: (1) extremely flexible pattern specification, (2) unrestricted rewrite capability, (3) the capability to put constraints on pattern elements or to invoke global constraints, (4) the capability to invoke, anywhere, the full power of the Scheme programming language, and (5) the ability to expand complex &quot;chars&quot; strings.</Paragraph>
      <Paragraph position="12"> The first feature is illustrated by Figure 3, which shows a DXL rule for one of the PERSON canonical forms. DXL rules have three components: a left-hand-side (LHS) specifying the total pattern that is to be matched, a right-pointing arrow indicating that the LHS is to be rewritten, and a right-hand side (RHS) indicating what is to be substituted for the LHS, with what actions. In Figure 3, the LHS has four elements: an optional pattern followed by an obligatory one, followed by two optional patterns. This rule would match instances such as &quot;Dr. Harry Morgan Raffler, Jr., Vice President&quot; and &quot;Frank&quot;.4 The meaning of the prefix symbols is given in the following table:
 * An optional operator, zero or more occurrences, match the minimum
 ? An optional operator, zero or one occurrence, match the minimum
 + Obligatory element, at least one, possibly more
 When placed after the above, forces a match of the maximum number of specified patterns
 @ Indicates a defined pattern
 The LHS therefore involves four defined patterns: 0 or more prefix titles for people (such as &quot;Prof.&quot;), at least one capitalized word (with no comma following it), an optional family title such as &quot;Jr.&quot;, and an optional suffix title, such as &quot;Director&quot;. Two of the four BNF pattern definitions are given at the bottom of the Figure. That for title_pref, for example, specifies that the &quot;chars&quot; value should have an initial capitalized letter, that the type of the title ought to be &quot;pref&quot; (for prefix), and that the value of c1, the second hierarchical knowledge classification value (c0 is the first), should be &quot;person&quot;.</Paragraph>
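      <Paragraph> To make the operator semantics concrete, here is a small Python sketch of a backtracking matcher for a four-element LHS of the kind just described. It is not DXL and does not reproduce the rule of Figure 3: each pattern element is encoded as a hypothetical tuple (minimum count, maximum count or None, maximal-match flag, predicate), and the predicates, the &quot;family&quot; and &quot;suf&quot; values, and the &quot;comma_after&quot; key are assumptions made for the example.

```python
def tok(chars, **attrs):
    """Build a toy token; capitalization is derived from the first letter."""
    token = {"chars": chars,
             "caps": "capital" if chars[:1].isupper() else "lower",
             "comma_after": False}
    token.update(attrs)
    return token


def is_title_pref(t):
    return (t["caps"] == "capital" and t.get("title_typ") == "pref"
            and t.get("c1") == "person")


def is_cap_word(t):
    return (t["caps"] == "capital" and not t["comma_after"]
            and "title_typ" not in t)


def is_family_title(t):
    return t.get("title_typ") == "family"


def is_title_suf(t):
    return t.get("title_typ") == "suf"


def match(pattern, tokens, start=0):
    """Return the index just past the match beginning at start, or None."""
    if not pattern:
        return start
    (low, high, maximal, test), rest = pattern[0], pattern[1:]
    # Count how many consecutive tokens this element could absorb.
    run = 0
    while (start + run != len(tokens)
           and (high is None or run != high)
           and test(tokens[start + run])):
        run += 1
    if low > run:
        return None
    counts = range(low, run + 1)
    # Minimal matching tries the fewest tokens first; maximal tries the most.
    for count in (reversed(counts) if maximal else counts):
        end = match(rest, tokens, start + count)
        if end is not None:
            return end
    return None


# Zero or more prefix titles, one or more capitalized words,
# an optional family title, an optional suffix title (all maximal here).
PERSON = [(0, None, True, is_title_pref), (1, None, True, is_cap_word),
          (0, 1, True, is_family_title), (0, 1, True, is_title_suf)]
```

 With tokens for &quot;Dr Joyce Smith lost money&quot;, match(PERSON, tokens) consumes the first three tokens, which a rewrite step could then replace with a single token of type &quot;person&quot;.</Paragraph>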
      <Paragraph position="13"> 3 This would not actually be a good way of approaching recognition of person-name instances, but it does illustrate what additional attributes would be added as a result of consulting the knowledge-bank.</Paragraph>
      <Paragraph position="14"> 4 Such a rule is only illustrative and would actually not be used because so many optional elements beg for over-generalizations.</Paragraph>
      <Paragraph position="16"> Discontinuous constituents are easily specified using the &quot;*&quot; (Kleene star) operator, as in @first_pattern *.any @second_pattern where &quot;.any&quot; means a token with any &quot;type&quot; value, and the star indicates zero or arbitrarily many intervening tokens between the two patterns of interest. To rewrite these into new patterns requires use of the dynamic variable-assignment facility which takes whatever tokens were matched by the pattern and assigns them, as a list, to the variable indicated, as shown by</Paragraph>
      <Paragraph position="18"> On the LHS, variables v1-v3 are assigned to the three pattern elements; those binding to v1 and v3 should, in this example, be understood as binding to single tokens, and these are referenced on the RHS, via the &quot;$&quot; operator. Single tokens bound to v1 and v3 are modified by attribute-changing actions, and all the bound tokens are finally reinserted back where they were in the token stream.</Paragraph>
      <Paragraph position="19"> The second, rewrite, feature of DXL is illustrated by the abstract rule below, which, for clarity's sake, omits the needed variable-assignments:
 \ A \ B [X] C D (Y) /E/ =&gt; C F B
 This rule specifies a left-context of A after which three obligatory patterns are sought, B-D, where B has some additional condition X placed on it. There is also a global test Y which has to succeed, all this in the right context of E. If all this LHS succeeds, then B and C are to be permuted, D is to be deleted, and F is to be added (inserted between C and B), with neither A nor E being further involved. With LHS context-sensitivity and the ability to permute, add, and delete LHS elements -- and also unrestricted look-ahead -- the rewrite power is essentially unconstrained: it goes beyond that of context-sensitive grammars and has the power of a Turing machine.</Paragraph>
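      <Paragraph> The effect of the abstract rule above can be imitated in a few lines of Python. This is only a sketch under simplifying assumptions: tokens are plain one-letter strings, the local condition X and the global test Y are passed in as hypothetical predicate arguments, and the function name rewrite is invented for the example.

```python
def rewrite(stream, condition_x=lambda b: True, global_y=lambda tokens: True):
    """Apply  \\ A \\ B [X] C D (Y) /E/ => C F B  across a token stream.

    A is the left context and E the right context; neither is changed.
    B and C are permuted, D is deleted, and F is inserted between C and B.
    """
    out = list(stream)
    i = 0
    while len(out) - i >= 5:
        a, b, c, d, e = out[i:i + 5]
        if ((a, b, c, d, e) == ("A", "B", "C", "D", "E")
                and condition_x(b) and global_y(out)):
            out[i + 1:i + 4] = [c, "F", b]
            i += 4  # resume at E, which may serve as context for a later match
        else:
            i += 1
    return out
```

 For example, rewrite([&quot;A&quot;, &quot;B&quot;, &quot;C&quot;, &quot;D&quot;, &quot;E&quot;]) returns the stream A C F B E, while a stream lacking the left context A is returned unchanged.</Paragraph>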
      <Paragraph position="20"> Local and global constraints (the third feature) have been illustrated, and they are often expressed in practice by dropping into full Scheme code (the fourth feature). The fifth feature, expansion, is very useful when the &quot;chars&quot; attribute is a mixture of letters/numbers/symbols. Expansion permits the components of a complex &quot;chars&quot; string to be broken apart and analyzed using the same rule formalisms employed for multiple tokens. For example, the following rule expands all tokens of &quot;type&quot; &quot;mix&quot;:
 type:&quot;mix&quot; -&gt;v1 =&gt; (expand v1)
 The result of expanding a token whose &quot;chars&quot; are &quot;45lbs./sq.in.&quot; is to replace this token in the token stream with the following ones:
 .sot 45 lbs. / sq. in. .eot
 The first and last tokens have special type values, &quot;start of (expanded) token&quot; and &quot;end of (expanded) token&quot;, but they have no start or stop attributes, being &quot;dimensionless&quot; with respect to characters. The inserted tokens can now be recognized as a measure by a rule such as the following:
 .sot type:&quot;num&quot; @unit &quot;/&quot; @unit .eot =&gt; measure
 where the pattern @unit is appropriately defined as a single or multi-word reference to a unit of measure.
 The Knowledge-Bank
 The knowledge-bank has five major types of entries (with the approximate quantities presently being added given in parentheses): (1) words or phrases which are given a part-of-speech and perhaps a word-stem attribute, but little else (64K); (2) words/phrases which are full instances of data-classes (such as person first-names, cities, organizations) (6M); (3) &quot;signal-words&quot;, usually called &quot;titles&quot;, which indicate the likely presence of data-classes (such as &quot;Ave.&quot;, &quot;p.m.&quot;, &quot;Bank&quot;, &quot;Mlle.&quot;) (1K); (4) &quot;clusters&quot; of ordinary words which share significant semantic features (such as &quot;communication acts&quot;) (2K); and (5) isolated ordinary words which have particular significance of one kind or another (1K). The first type of entry, mostly common words, facilitates sentence parsing when needed. The second type implements the brute-force &quot;identify by look-up&quot; principle of the DX system. The third type contributes the most in supporting the pattern-based identification principle of the system, as facilitated by the fourth and fifth types.</Paragraph>
      <Paragraph position="21"> The last four types have a part-of-speech attribute plus several classification attributes plus a number of other features. It is this rich set of features that is used as conditions on the DXL pattern elements and that largely underlies the potential for very high accuracy in target detection. A sampling of knowledge-bank entries illustrating these features is given in Table 1.</Paragraph>
      <Paragraph position="22"> In the first row of Table 1 the first &quot;key&quot; column indicates the lower-case look-up characters that will be matched against the &quot;chars&quot; values from the input text. The rest of the column headings in the first row are various attributes which are relevant to the examples given.5 The entry for &quot;susan&quot; shows that its highest knowledge-classification feature, c0, is &quot;hument&quot; (&quot;human entity&quot;), sub-categorized by the next classification feature, c1, as &quot;person&quot;, in contrast to the second entry, &quot;ibm&quot;, which is an organization. These two entries also are coded as being names and capitalized. The third entry, &quot;mr&quot;, is also categorized as referring to a person and capitalized but is not a name, rather a title, of type &quot;pref&quot; (for &quot;prefix&quot;). The fourth entry, &quot;militant&quot;, again refers to a person, but is neither a name nor a title (nor capitalized) but has a role value of either &quot;political&quot; or &quot;terrorist&quot;. The string &quot;texas&quot; is identified as a place, sub-category &quot;center&quot; and sub-sub-category &quot;state&quot;, and it is both a name and capitalized. The last two examples both have the knowledge-classification features meas-time-date, but &quot;christmas&quot; is also categorized as a &quot;holiday&quot; and is both capitalized and a name; the phrase &quot;paid holiday&quot; is neither capitalized nor a name but fills the semantic role of referring to a &quot;holiday&quot; entity.
These examples suffice to illustrate the rich set of syntactic and semantic knowledge-bank features that are available for use in the DXL rules to identify and discriminate among even very similar data-classes.6 These features would also provide the basis for a neural-net or deterministic classification approach (e.g., C4.5) for a learning capability to be developed later.7</Paragraph>
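      <Paragraph> As an illustration of how such features could serve as rule conditions, the Python sketch below encodes a few of the entries discussed above in a sparse form (absent attributes are simply omitted) and tests them. The storage format is an assumption; as a footnote in this paper notes, the actual knowledge-bank representation is different. The attribute names &quot;name&quot;, &quot;capitalized&quot;, and &quot;role&quot;, the value &quot;org&quot;, and the helper functions are likewise hypothetical.

```python
# Sparse toy entries keyed by lower-case look-up strings.
KNOWLEDGE_BANK = {
    "susan": {"c0": "hument", "c1": "person", "name": True, "capitalized": True},
    "ibm": {"c1": "org", "name": True, "capitalized": True},
    "mr": {"c1": "person", "title_typ": "pref", "capitalized": True},
    "militant": {"c1": "person", "role": {"political", "terrorist"}},
    "texas": {"c0": "place", "c1": "center", "c2": "state",
              "name": True, "capitalized": True},
}


def classification(key):
    """Return the chain of classification values c0, c1, ... for an entry."""
    entry = KNOWLEDGE_BANK.get(key.lower(), {})
    levels = []
    depth = 0
    while "c%d" % depth in entry:
        levels.append(entry["c%d" % depth])
        depth += 1
    return tuple(levels)


def satisfies(key, **conditions):
    """Check attribute conditions; missing attributes count as False."""
    entry = KNOWLEDGE_BANK.get(key.lower(), {})
    for attribute, wanted in conditions.items():
        value = entry.get(attribute, False)
        if isinstance(value, set):
            # Multi-valued attribute: the wanted value must be one of them.
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True
```

 A prefix-title condition such as satisfies(&quot;mr&quot;, c1=&quot;person&quot;, title_typ=&quot;pref&quot;) succeeds, whereas the same test on &quot;susan&quot; fails because that entry is a name rather than a title.</Paragraph>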
    </Section>
  </Section>
  <Section position="4" start_page="200" end_page="201" type="metho">
    <SectionTitle>
SYSTEM PERFORMANCE
</SectionTitle>
    <Paragraph position="0"> 5 The Table 1 attribute-by-entry structure is convenient for exposition, but given there are now well over a hundred total possible attributes, such a structure would be very space-inefficient. The knowledge-bank actually has a much different representation. Also, not all relevant attributes are shown. For example &quot;susan&quot; also has a name_type attribute with the multiple values of &quot;given | female&quot;.</Paragraph>
    <Paragraph position="1"> 6 For example, the existing features can discriminate among five types of capitalized references relating to people, as indicated by the following: &quot;reagan&quot;, &quot;american&quot;, &quot;christian&quot;, &quot;irish&quot;, and &quot;mr. president&quot;.</Paragraph>
    <Paragraph position="2"> 7 We envision examples of new data-classes first being clustered by hand into high-similarity groups, but this may not be necessary.</Paragraph>
    <Paragraph position="3"> Apply DXL rules for data-classes which are non-interacting and do not require other data-classes for their identification (also parallelizable).</Paragraph>
    <Paragraph position="4"> These were: time, date, percent, money (some references to time involved single-word places, e.g., &quot;4 p.m. Chicago Time&quot;).</Paragraph>
  </Section>
  <Section position="5" start_page="201" end_page="205" type="metho">
    <SectionTitle>
6 Level 2
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="201" end_page="205" type="sub_section">
      <SectionTitle>
Data-Classes
</SectionTitle>
      <Paragraph position="0"> Apply DXL rules for more complex interacting rule-sets in the following order: place1, place2, org1, org2, person1, person2, person3. The numeric suffix on the rule-sets indicates that data-class rules were clustered into groups of canonical forms that were increasingly complex, as indicated by the number-value.</Paragraph>
      <Paragraph position="1"> 7 Last-Pass A set of DXL rules which checked the token stream for any remaining tokens which could be part of place, org, or person.</Paragraph>
      <Paragraph position="2"> 8 Adjust A set of DXL rules which adjusted the start/stop positions of the targets to include or exclude punctuation according to MUC-6 rules.</Paragraph>
      <Paragraph position="3"> 9 Tag Insert the appropriate SGML tags around the targets in the input text as calculated by the Adjust rules.</Paragraph>
      <Paragraph position="4"> In developing the rules for a particular data-class, the following five-step strategy was employed: (1) a large number of examples of the data-class were collected, and these were clustered into groups having high internal similarity; (2) a canonical pattern-sequence of obligatory and optional elements (plus signifying contexts) was developed for each cluster; (3) DXL rules were developed for each canonical form; (4) the rules were ordered in a sequence which causes those with the largest number of fixed patterns to be run before those with fewer, and in the case of a tie, with those accounting for the largest number of instances to be run first; and (5) the rules were broken into two groups with the simpler reliable ones grouped in one rule-set, e.g., place1 for the location rules, with the complex rules placed in a second rule-set, e.g., place2. Some experimentation was performed to determine the final grouping of the rule-sets into the Level 1 and Level 2 sets, but that indicated in the table led to the best results. In the Last-Pass step, there were several rules which attempted to use context-inferences and other heuristics to identify token sequences which were likely place, org, or person instances. One such was to see whether a promising token (a capitalized unaccounted-for one, for example) was an element of any of the instances of the three types of data-classes that had previously been identified or was an acronym thereof. Thus, both &quot;Consuela Washington&quot; and &quot;Ms. Washington&quot; were recognized by straightforward person rules (the first by one which looks for known first-names; the second by one which uses prefix titles). The subsequent reference simply to &quot;Washington&quot; was correctly identified by the previously-seen rule by being a sub-string of, in this case, the closest prior reference, that of &quot;Ms.
Roughly a third of the 28 person-month effort was devoted to design and implementation of the data extraction language, DXL; another third went for overall system and knowledge-bank development; and the last third was focused on development of general and MUC-specific data-class recognition rules using DXL. Not counting the peculiarities of MUC requirements for tagging the identified data-classes, no part of this effort has been spent on non-reusable MUC-specific activities.</Paragraph>
      <Paragraph position="5"> There were a number of factors which made the timing of the MUC-6 contest inconvenient relative to the external factors determining the pacing of development of the DX system:
 * The knowledge-bank, upon which all processing relies, was designed and populated only in the August-September period; its implementation needed to be radically modified in September to permit handling of massive numbers of entries, and it is still not yet fully reliable
 * The rule writers and knowledge-base/system administrator were effectively not available until early September
 * The language, DXL, while possessing all the needed functionality, had limited high-level function libraries which are only being developed gradually with experience in rule-writing
 * The system processing stages (esp. tokenization) are still under development, and the system still has quite limited debugging facilities
 * The late release of MUC-6 task information and materials in August precluded advance study of the complex scoring, recognition, and tagging criteria
 Training
 The primary method of obtaining training materials was to extract a very large number of instances of each of the seven Named Entity data-classes from the provided test materials and use these files for development of the canonical groups and rules. An enormous amount of time was spent learning how to use DXL (and, as the rule writers did not know UNIX, learning the Linux system). An equal amount of time was spent in learning effective strategies for writing reliable rules -- especially, learning to avoid the temptation of trying to recognize too many variations of data-class instances with a single rule. It is only in the past month, after the contest, that the grammar writers have learned the &quot;right&quot; level of ambition for writing a rule, testing it with the limited debugging capabilities, and revising it modestly.
We note that the Wall Street Journal style guide was very useful in reducing the number of training examples needed.</Paragraph>
      <Paragraph position="6"> Performance on the October 5 MUC-6 Tests
 We did not do well. It did not magically all come together at the last moment. Our three-week flurry of rule-writing simply didn't cope.</Paragraph>
      <Paragraph position="7"> In addition to the developmental pressures noted above, the difficulties of learning to program in a completely new language in a few weeks, and the agonies of understanding MUC identification and tagging criteria, we also made the mistake of attempting a much too complex approach to identification of the proper name classes: using a &quot;fuzzy logic&quot; approach in which capitalized words had fuzzy membership in the three person/location/organization classes which were gradually and simultaneously to be resolved. It didn't work.</Paragraph>
      <Paragraph position="8"> Performance One-Month Later!
 We started over, and performance now is very high. On a recent test involving over a hundred test-cases per class, we logged on average 2% misses, 6% correctly identified but incorrectly tagged, and 0 false alarms for the four simpler classes of time, date, percentage, and money. For locations and organizations, the numbers are 4%, 9%, and 3 respectively; the person rules are still in flux but will be in this ballpark by the time of the conference.</Paragraph>
      <Paragraph position="9"> Things That Didn't Work
 Aside from the fuzzy-logic approach, it took us a long time (several weeks at least) to learn that, for pattern-based rules, simple is terribly much more effective than comprehensive. We also suffered from the fact that, since the language DXL has only just been developed, it quite reasonably still has a few bugs; these just happened to be difficult ones: e.g., coding that worked perfectly well in the main patterns of a rule LHS did not do so in the left-context or in the pattern definitions (e.g., &quot;@title_suf&quot;); the knowledge-bank could get corrupted just a bit without failing in the main; etc. And we had endless problems with the fact that the knowledge-bank is incomplete: the fact that it has so very many entries usually meant that problems were with our rules, not with the knowledge-bank, but there were many times when expected entries were simply not there or had been miscoded in some way. Finally, the absence of user function libraries to provide very high-level scotch-guarded functions for easy use in the rules meant that too often we had to use very low-level programming functions or drop into Scheme, neither a forte of the rule-writers.</Paragraph>
      <Paragraph position="10"> Concerning specific target difficulties, we perhaps had the most trouble with organizations. Single-word organizations or ones without a prefix or suffix title (e.g., &quot;Pilsbury&quot;, &quot;Birds Eye&quot;) required context-sensitive semantics to pull in. Names with commas in the middle, commonly law firms (e.g., L. F. Rothschild, Unterberg, Towbin), were difficult because we used the commas to suggest phrase boundaries. And organizations with &quot;and&quot; (in contrast to &quot;&amp;&quot;) as part of their name were difficult to discriminate from conjoined organizations (e.g., &quot;Hollis and Pergamon Holdings, Ltd.&quot;, where &quot;Hollis&quot; is a reference to a prior-mentioned company).</Paragraph>
      <Paragraph position="11"> Person targets were often difficult, but they tended to be the default case: we had already done our best with our second set of place and organization rules, and what remained was most likely a person. The high-frequency presence of appositives with person-names provides one powerful source of semantic context resolution.</Paragraph>
      <Paragraph position="12"> Things That Worked Well
 The knowledge-classification feature hierarchy works tremendously well in supporting identification and discrimination of data-classes (e.g., people having the main branches of HUMENT-PERSON, while cities have the main branches of PLACE-CENTER-CITY, and signal words associated with time references have the branches of MEAS-TIME-DATE-HOUR). The DXL language, now that we understand it, is wonderfully powerful and flexible. And the brute-force look-up approach handles over 40% of our instance recognition and is easily extensible. We are also quite encouraged by the success of some of the Last-Pass stray-pickup rules based on semantic contexts and other heuristics.
 The Thing That Most Needs Reworking
 The leading candidate for this prize is the user function library, to keep the rule-writer out of the bowels of programming. But, of course, nominations for the library can only come with experience which is only now maturing.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="205" end_page="206" type="metho">
    <SectionTitle>
LESSONS LEARNED
</SectionTitle>
    <Paragraph position="0"> The outstanding lesson for this project as a result of the MUC is that the DX system will in fact be able to meet its performance objectives in the near future, and that the three-part design basis (of look-up, pattern-match, and semantically resolve) is sound.</Paragraph>
    <Paragraph position="1"> Lesson two is: start early on the MUCs.</Paragraph>
  </Section>
</Paper>