<?xml version="1.0" standalone="yes"?>
<Paper uid="M93-1016">
  <Title>New York University: Description of the Proteus System as Used for MUC-5</Title>
  <Section position="1" start_page="0" end_page="182" type="metho">
    <SectionTitle>
NEW YORK UNIVERSITY:
DESCRIPTION OF THE PROTEUS SYSTEM
AS USED FOR MUC-5
</SectionTitle>
    <Paragraph position="0"> The Proteus system that we used for MUC-5 is largely unchanged from the one used for MUC-3 and MUC-4. It has three main components: a syntactic analyzer, a semantic analyzer, and a template generator.</Paragraph>
    <Paragraph position="1"> The Proteus syntactic analyzer was developed starting in the fall of 1984 as a common base for all the applications of the Proteus Project. Many aspects of its design reflect its heritage in the Linguistic String Parser, previously developed and still in use at New York University.</Paragraph>
    <Paragraph position="2"> The current system, including the Restriction Language compiler, the lexical analyzer, and the parser proper, comprises approximately 4500 lines of Common Lisp.</Paragraph>
    <Paragraph position="3"> The semantic analyzer was initially developed in 1987 for the MUCK-I (RAINFORMs) application, extended for the MUCK-II (OPREPS) application, and has been incrementally revised since. It currently consists of about 3000 lines of Common Lisp (excluding the domain-specific information).</Paragraph>
    <Paragraph position="4"> The template generator was written from scratch for the MUC-5 joint venture task; it is about 1200 lines of Common Lisp.</Paragraph>
    <Paragraph position="5"> Stages of processing. The text goes through five major stages of processing: lexical analysis, syntactic analysis, semantic analysis, reference resolution, and template generation (see figure 1). In addition, some restructuring of the logical form is performed both after semantic analysis and after reference resolution (only the restructuring after reference resolution is shown in figure 1). Processing is basically sequential: each sentence goes through lexical, syntactic, and semantic analysis and reference resolution; the logical form for the entire message is then fed to template generation. However, semantic (selectional) checking is performed during syntactic analysis, employing essentially the same code later used for semantic analysis.</Paragraph>
    <Section position="1" start_page="182" end_page="182" type="sub_section">
      <SectionTitle>
Lexical Analysis
Dictionary Format
</SectionTitle>
      <Paragraph position="0"> Our dictionaries contain only syntactic information: the parts of speech for each word, information about the complement structure of verbs, distributional information (e.g., for adjectives and adverbs), etc. We follow closely the set of syntactic features established for the NYU Linguistic String Parser. This information is entered in LISP form using noun, verb, adjective, and adverb macros for the open-class words, and a word macro for other parts of speech:</Paragraph>
    </Section>
  </Section>
  <Section position="2" start_page="182" end_page="183" type="metho">
    <SectionTitle>
(ADVERB &quot;ABRUPTLY&quot; :ATTRIBUTES (DSA))
(ADJECTIVE &quot;ABRUPT&quot;)
(NOUN :ROOT &quot;ABSCESS&quot; :ATTRIBUTES (NCOUNT))
(VERB :ROOT &quot;ABSCOND&quot; :OBJLIST (NULLOBJ PN (PVAL (FROM WITH))))
</SectionTitle>
    <Paragraph position="0"> The noun and verb macros automatically generate the regular inflectional forms.</Paragraph>
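    <Paragraph> As a rough illustration of what generating the regular inflectional forms involves, the following Python sketch derives regular forms from an upper-case root. It is not the Proteus macro code (which is Common Lisp); the function names and the spelling rules chosen here are assumptions made for illustration, and consonant doubling and irregular forms are deliberately ignored.

```python
# Hypothetical sketch: names and rules are illustrative assumptions,
# not the Proteus NOUN and VERB macros themselves.

VOWELS = "AEIOU"


def noun_plural(root):
    """Return the regular plural of an upper-case noun root."""
    if root.endswith(("S", "X", "Z", "CH", "SH")):
        return root + "ES"
    if root.endswith("Y") and len(root) > 1 and root[-2] not in VOWELS:
        return root[:-1] + "IES"
    return root + "S"


def verb_forms(root):
    """Return regular third-singular, past, and -ing forms of a verb root."""
    third = noun_plural(root)  # same spelling rules as the noun plural
    if root.endswith("E"):
        past, ing = root + "D", root[:-1] + "ING"
    elif root.endswith("Y") and len(root) > 1 and root[-2] not in VOWELS:
        past, ing = root[:-1] + "IED", root + "ING"
    else:
        past, ing = root + "ED", root + "ING"
    return {"third_singular": third, "past": past, "present_participle": ing}
```

Under these assumed rules, the entries shown above for ABSCESS and ABSCOND would yield ABSCESSES and ABSCONDS, ABSCONDED, ABSCONDING.</Paragraph>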
    <Section position="1" start_page="182" end_page="183" type="sub_section">
      <SectionTitle>
Dictionary
Files
</SectionTitle>
      <Paragraph position="0"> The primary source of our dictionary information about open-class words (nouns, verbs, adjectives, and adverbs) is the machine-readable version of the Oxford Advanced Learner's Dictionary (&quot;OALD&quot;). We have written programs which take the SGML (Standard Generalized Markup Language) version of the dictionary, extract information on inflections, parts of speech, and verb subcategorization (including information on adverbial particles and prepositions gleaned from the examples), and generate the LISP-ified form shown above. This is supplemented by a manually coded dictionary (about 1500 lines, 900 entries) for closed-class words, words not adequately defined in the OALD, and a few very common words. In addition, we used several specialized dictionaries for MUC-5, including a location dictionary (with all countries, continents, and major cities (CITY1 or PORT1 in the gazetteer)), a dictionary of corporate designators, a dictionary of job titles, and a dictionary of currencies.</Paragraph>
      <Paragraph position="1"> Lookup. The text reader splits the input text into tokens and then attempts to assign to each token (or sequence of tokens, in the case of an idiom) a definition (part of speech and syntactic attributes). The matching process proceeds in five steps: dictionary lookup, lexical pattern matching, spelling correction, prefix stripping, and default definition assignment. Dictionary lookup immediately retrieves definitions assigned by any of the dictionaries (including inflected forms). The specialized dictionaries are stored in memory, while the main dictionary is accessed from disk (using hashed index random access).</Paragraph>
      <Paragraph position="2"> Lexical pattern matching is used to identify a variety of specialized patterns, such as numbers, dates, times, and possessive forms. The set of lexical patterns was substantially expanded for MUC-5 to include various forms of people's names, company names, locations, and currencies.</Paragraph>
      <Paragraph position="3"> The lexical patterns are further discussed below, in the &quot;What's new for MUC-5&quot; section.</Paragraph>
      <Paragraph position="4"> If neither dictionary lookup nor lexical pattern matching is successful, spelling correction and prefix stripping are attempted. For words of any length, we identify an input token as a misspelled form of a dictionary entry if one of the two has a single instance of a letter while the other has a doubled instance of the letter (e.g., &quot;mispelled&quot; and &quot;misspelled&quot;). The prefix stripper attempts to identify the token as a combination of a prefix (e.g., &quot;un&quot;) and a word defined in the dictionary. If all of these procedures fail, we assign a default definition. In mixed-case text, undefined capitalized words are tagged as proper nouns; undefined lower-case words are tagged as common nouns. In monocase text, all undefined words are tagged as proper nouns.</Paragraph>
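      <Paragraph> The fallback steps described above (spelling correction, prefix stripping, default assignment) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the PREFIXES list (the text names only "un"), the tag strings, and the linear scan over the dictionary are all hypothetical, and lexical pattern matching is assumed to have already failed.

```python
# Hypothetical sketch of the lookup fallbacks; not the Proteus implementation.

PREFIXES = ("un", "re", "non")  # assumed list; the text names only "un"


def differs_by_doubled_letter(a, b):
    """True if one word has a doubled letter where the other has a single one."""
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    if len(longer) - len(shorter) != 1:
        return False
    for i in range(1, len(longer)):
        # Drop the second letter of a doubled pair and compare.
        if longer[i] == longer[i - 1] and longer[:i] + longer[i + 1:] == shorter:
            return True
    return False


def define(token, dictionary, mixed_case=True):
    """Return (step, definition) for a token that no lexical pattern matched."""
    key = token.lower()
    if key in dictionary:
        return ("lookup", dictionary[key])
    for entry in dictionary:  # linear scan for clarity only
        if differs_by_doubled_letter(key, entry):
            return ("spelling", dictionary[entry])
    for prefix in PREFIXES:
        stem = key[len(prefix):]
        if key.startswith(prefix) and stem in dictionary:
            return ("prefix", dictionary[stem])
    # Default definition: lower-case words in mixed-case text are common nouns;
    # everything else (capitalized words, all monocase words) is a proper noun.
    if mixed_case and not token[:1].isupper():
        return ("default", "common-noun")
    return ("default", "proper-noun")
```

A real system would index the dictionary rather than scan it; the scan is kept here only to make the doubled-letter test easy to follow.</Paragraph>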
    </Section>
    <Section position="2" start_page="183" end_page="183" type="sub_section">
      <SectionTitle>
Syntactic Analysis
</SectionTitle>
      <Paragraph position="0"> Syntactic analysis involves two stages of processing: parsing and syntactic regularization. At the core of the system is an active chart parser. The grammar is an augmented context-free grammar, consisting of BNF rules plus procedural restrictions which check grammatical constraints not easily captured in the BNF rules. Most restrictions are stated in PROTEUS Restriction Language (a variant of the language developed for the Linguistic String Parser) and translated into LISP; a few are coded directly in LISP [1]. For example, the count noun restriction (that singular countable nouns have a determiner) is stated as</Paragraph>
      <Paragraph position="2"/>
    </Section>
  </Section>
  <Section position="3" start_page="183" end_page="183" type="metho">
    <SectionTitle>
IF BOTH CORE Xcore IS NCOUNT AND Xcore IS SINGULAR
THEN IN LN, TPOS IS NOT EMPTY.
</SectionTitle>
    <Paragraph position="0"> Associated with each BNF rule is a regularization rule, which computes the regularized form of each node in the parse tree from the regularized forms of its immediate constituents. These regularization rules are based on lambda-reduction, as in GPSG. The primary function of syntactic regularization is to reduce all clauses to a standard form consisting of aspect and tense markers, the operator (verb or adjective), and syntactically marked cases. For example, the definition of assertion, the basic S structure in our grammar, is &lt;assertion&gt; &lt;sa&gt; &lt;subject&gt; &lt;sa&gt; &lt;verb&gt; &lt;sa&gt; &lt;object&gt; &lt;sa&gt; :(s !(&lt;object&gt; &lt;subject&gt; &lt;verb&gt; &lt;sa*&gt;)).</Paragraph>
    <Paragraph position="1"> Here the portion after the single colon defines the regularized structure. Coordinate conjunction is introduced by a metarule (as in GPSG), which is applied to the context-free components of the grammar prior to parsing. The regularization procedure expands any conjunction into a conjunction of clauses or of noun phrases. The output of the parser for the first sentence of 0592, &quot;BRIDGESTONE SPORTS CO.</Paragraph>
  </Section>
  <Section position="4" start_page="183" end_page="185" type="metho">
    <SectionTitle>
SAID FRIDAY IT HAS SET UP A JOINT VENTURE IN TAIWAN WITH A LOCAL CONCERN AND A
JAPANESE TRADING HOUSE TO PRODUCE GOLF CLUBS TO BE SHIPPED TO JAPAN.&quot; is
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> and the corresponding regularized structure is</Paragraph>
  </Section>
  <Section position="5" start_page="185" end_page="189" type="metho">
    <SectionTitle>
(TIMEPREP (NP FRIDAY SINGULAR (SN NP157))))
</SectionTitle>
    <Paragraph position="0"> The system uses a chart parser operating top-down, left-to-right. As edges are completed (i.e., as nodes of the parse tree are built), restrictions associated with those productions are invoked to assign and test features of the parse tree nodes.</Paragraph>
    <Paragraph position="1"> If a restriction fails, that edge is not added to the chart. When certain levels of the tree are complete (those producing noun phrase and clause structures), the regularization rules are invoked to compute a regularized structure for the partial parse, and selection is invoked to verify the semantic well-formedness of the structure (as noted earlier, selection uses the same &quot;semantic analysis&quot; code subsequently employed to translate the tree into logical form).</Paragraph>
    <Paragraph position="2"> One unusual feature of the parser is its weighting capability. Restrictions may assign scores to nodes; the parser will perform a best-first search for the parse tree with the highest score. This scoring is used to implement various preference mechanisms:
* closest attachment of modifiers (we penalize each modifier by the number of words separating it from its head)
* preferred narrow conjoining for clauses (we penalize a conjoined clause structure by the number of words it subsumes)
* preference semantics (selection does not reject a structure, but imposes a heavy penalty if the structure does not match any lexico-semantic model, and a lesser penalty if the structure matches a model but with some operands or modifiers left over) [2,3]
* relaxation of certain syntactic constraints, such as the count noun constraint, adverb position constraints, and comma constraints
* disfavoring (penalizing) headless noun phrases and headless relatives (this is important for parsing efficiency)
The grammar is based on Harris's Linguistic String Theory and adapted from the larger Linguistic String Project (LSP) grammar developed by Naomi Sager at NYU [4]. The grammar is gradually being enlarged to cover more of the LSP grammar. The current grammar is 1600 lines of BNF and Restriction Language plus 300 lines of Lisp; it includes 186 non-terminals, 464 productions, and 132 restrictions. Over the course of the MUCs we have added several mechanisms for recovering from sentences the grammar cannot fully parse.
For MUC-5, we found that the most effective was our &quot;fitted parse&quot; mechanism, which attempts to cover the sentence with noun phrases and clauses, preferring the longest noun phrases or clauses which can be identified.
Semantic Analysis and Reference Resolution. The output of syntactic analysis goes through semantic analysis and reference resolution and is then added to the accumulating logical form for the message. Following both semantic analysis and reference resolution, certain transformations are performed to simplify the logical form. All of this processing makes use of a concept hierarchy which captures the class/subclass/instance relations in the domain.</Paragraph>
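    <Paragraph> The penalty-based preferences listed above can be illustrated with a small Python sketch. It only ranks complete candidate analyses, whereas the parser described here scores nodes incrementally during a best-first search; the function names, the dictionary layout of a candidate, and the numeric penalty values are assumptions made for illustration.

```python
# Hypothetical sketch of penalty scoring; not the Proteus parser itself.

def attachment_penalty(modifier_index, head_index):
    """Penalty for a modifier: the number of words separating it from its head."""
    return max(abs(modifier_index - head_index) - 1, 0)


def score(attachments, selection_penalty=0):
    """Score an analysis; higher is better, so penalties are subtracted."""
    total = sum(attachment_penalty(m, h) for m, h in attachments)
    return -(total + selection_penalty)


def best_analysis(candidates):
    """Pick the candidate analysis with the highest score."""
    return max(
        candidates,
        key=lambda c: score(c["attachments"], c.get("selection_penalty", 0)),
    )
```

Here each candidate lists its (modifier position, head position) pairs, and an optional selection_penalty stands in for the heavier or lighter penalties imposed by preference semantics.</Paragraph>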
    <Paragraph position="3"> Semantic analysis uses a set of lexico-semantic models to map the regularized syntactic analysis into a semantic representation. Each model specifies a class of verbs, adjectives, or nouns and a set of operands; for each operand it indicates the possible syntactic case markers, the semantic class of the operand, whether or not the operand is required, and the semantic case to be assigned to the operand in the output representation. For example, the model for &quot;&lt;entity&gt; forms a joint venture with  The models are arranged in a shallow hierarchy with inheritance, so that arguments and modifiers which are shared by a class of verbs need only be stated once. The model above inherits only from the most general clause model, clause-any, which includes general clausal modifiers such as negation, time, tense, modality, etc. The MUC-5 system has 61 clause models, 2 nominalization models, and 45 other noun phrase models, a total of about 1700 lines. The class C-muc5-entity in the clause model refers to the concept in the concept hierarchy, whose entries have the form:</Paragraph>
    <Paragraph position="5"> There are currently a total of 154 concepts in the hierarchy.</Paragraph>
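    <Paragraph> To make the idea of operand specifications and model inheritance concrete, here is a minimal Python sketch. Only the names clause-any and C-muc5-entity come from the text; the model name form-joint-venture, the field names, the case markers, and the C-time class are hypothetical, and the actual Proteus model syntax is not reproduced here.

```python
# Hypothetical sketch of lexico-semantic models with inheritance.

MODELS = {
    "clause-any": {
        "parent": None,
        "operands": {
            "time": {
                "markers": ["timeprep"],
                "class": "C-time",
                "required": False,
                "case": ":time",
            },
        },
    },
    "form-joint-venture": {
        "parent": "clause-any",
        "operands": {
            "subject": {
                "markers": ["subject"],
                "class": "C-muc5-entity",
                "required": True,
                "case": ":agent",
            },
        },
    },
}


def operands(name, models=MODELS):
    """Collect a model's operands: inherited entries first, local ones override."""
    chain = []
    while name is not None:
        chain.append(name)
        name = models[name]["parent"]
    merged = {}
    for model_name in reversed(chain):
        merged.update(models[model_name]["operands"])
    return merged
```

Stating the time modifier once on the general clause model and inheriting it is the point of the shallow hierarchy: each specific model lists only what is particular to it.</Paragraph>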
    <Paragraph position="6"> The output of semantic analysis is a nested set of entity and event structures, with arguments labeled by keywords primarily designating semantic roles.</Paragraph>
    <Paragraph position="7"> For the first sentence of 0593, the output is  Reference resolution. Reference resolution is applied to the output of semantic analysis in order to replace anaphoric noun phrases (representing either events or entities) by appropriate antecedents. Each potential anaphor is compared to prior entities or events, looking for a suitable antecedent such that the class of the anaphor (in the concept hierarchy) is equal to or more general than that of the antecedent, the anaphor and antecedent match in number, the restrictive modifiers in the anaphor have corresponding arguments in the antecedent, and the non-restrictive modifiers (e.g., apposition) of the anaphor are not inconsistent with those of the antecedent. Special tests are provided for names, since people and companies may be referred to by a subset of their full names.</Paragraph>
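    <Paragraph> The antecedent tests just described can be sketched in Python as follows. This is an illustrative sketch, not the Proteus code: the two-entry hierarchy, the field names, and the choice of the most recent compatible antecedent are assumptions, and the non-restrictive modifier check and the special name tests are omitted.

```python
# Hypothetical sketch of antecedent matching against a concept hierarchy.

PARENT = {"C-company": "C-muc5-entity", "C-muc5-entity": None}  # child to parent


def is_same_or_more_general(general, specific, parent=PARENT):
    """True if `general` equals `specific` or is one of its ancestors."""
    while specific is not None:
        if specific == general:
            return True
        specific = parent.get(specific)
    return False


def compatible(anaphor, antecedent):
    """Apply the class, number, and restrictive-modifier tests."""
    if not is_same_or_more_general(anaphor["class"], antecedent["class"]):
        return False
    if anaphor["number"] != antecedent["number"]:
        return False
    for key, value in anaphor.get("restrictive", {}).items():
        if antecedent.get("args", {}).get(key) != value:
            return False
    return True


def resolve(anaphor, prior):
    """Return the most recent compatible prior entity, or None (an assumed policy)."""
    for candidate in reversed(prior):
        if compatible(anaphor, candidate):
            return candidate
    return None
```

With this layout, a general anaphor with a restrictive modifier resolves only to an earlier entity whose class is the same or more specific and whose arguments contain that modifier.</Paragraph>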
    <Paragraph position="8"> Logical form transformations. The transformations which are applied after semantic analysis and after reference resolution simplify and regularize the logical form in various ways. The transformations after semantic analysis primarily standardize the attribute structure of entities so that reference resolution will work properly. The transformations after reference resolution simplify the task of template generation by casting the events in a more uniform framework and performing a limited number of inferences. For example, we show here a rule which transforms the logical form produced from &quot;X formed a joint venture with Y&quot; into the equivalent for &quot;X and Y formed a joint venture&quot;:  ((modify 1 (list :agent (conjoin-entities '?agent '?company-list-2))) (modify 2 '(:agent nil :tied-up t))))  There are currently 32 such rules. These transformations are written as productions and applied using a simple data-driven production system interpreter which is part of the Proteus system.</Paragraph>
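    <Paragraph> The effect of such a rule, and of applying rules as productions, might be sketched in Python as below. This does not reproduce the Proteus rule language or its interpreter: the dictionary representation of an event, the predicate name c-form, the co-agent key, and the apply-until-stable loop are all assumptions made for illustration.

```python
# Hypothetical sketch of a logical form transformation and a rule loop.

def conjoin_agents(event):
    """Rewrite 'X formed a joint venture with Y' as 'X and Y formed a joint venture'."""
    if event.get("predicate") != "c-form":
        return event
    if "agent" not in event or "co-agent" not in event:
        return event
    rewritten = {k: v for k, v in event.items() if k != "co-agent"}
    rewritten["agent"] = [event["agent"], event["co-agent"]]
    return rewritten


def apply_rules(events, rules):
    """Apply every rule to every event, repeating until nothing changes."""
    changed = True
    while changed:
        changed = False
        new_events = []
        for event in events:
            for rule in rules:
                result = rule(event)
                if result != event:
                    event, changed = result, True
            new_events.append(event)
        events = new_events
    return events
```

The loop terminates for this rule because the rewritten event no longer has a co-agent, so the rule does not fire a second time; a rule set without that property would need an explicit iteration limit.</Paragraph>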
    <Paragraph position="9"> Template generator. Once all the sentences in an article have been processed through syntactic analysis, semantic analysis, and the logical form transformations, the resulting logical forms are sent to the template generator. The logical form events and entities produced by the transformations are in close correspondence to the template objects needed for MUC-5, so the template generation is fairly straightforward. The greatest complexity was involved in the procedures for accessing the two large data bases, the gazetteer (for normalizing locations) and the Standard Industrial Classification (for  Precision remained within a fairly narrow range, from 47 to 63, throughout the testing. Five months were available for development (March to July). One person was assigned full-time for the entire period; a second person assisted, approximately 2/3 time, for the last three months, for a total of about 7 person-months of effort (this excludes time in August preparing for the conference). March and April were devoted to getting an initial understanding of the fill rules, making minimal lexical scanner additions so that we could parse the text, developing input code to handle the different article formats, and developing some routines for larger-scale pattern matching (which were eventually not used). System integration and integrated system testing did not begin until mid-May, a couple of weeks before the dry run. Daily system testing began with a set of 25 articles, but shifted after the dry run to the first 100 dry-run messages (with the second 100 dry-run messages being used on occasion as a blind test).</Paragraph>
    <Paragraph position="10"> Compared with earlier MUCs, the overhead of getting started (understanding the fill rules, handling the different article formats, generating the more complex templates, and using the various data bases: gazetteer, SIC, currency table, corporate designator table) was much greater, while the manpower we had for the project was in fact somewhat less. In consequence, our system is relatively less developed than our MUC-3 system, for example. In particular, the attribute structure for the principal entity types (for MUC-5, companies) was less developed; this adversely affected the performance of our reference resolution component and hence our event merging.</Paragraph>
    <Paragraph position="11"> This impact was evident in our performance on the walkthrough message, 0593. We identified the primary constituent events (the joint venture and the associated ownership relations), but we failed to identify several of the co-reference relations, because of
* a bug in the handling of appositional names followed by relative clauses
* failure to do spelling correction on names (we only correct spellings to match dictionary entries)
* shortcomings in the attribute structure of company entities
Because of these problems and a weak event merging rule (compared to the more detailed rules developed for MUC-4, for example), we generated two separate tie-ups for the article, instead of one.</Paragraph>
    <Paragraph position="12"> The system was also not tuned to any significant degree to take advantage of the MUC-5 scoring rules. Based on a suggestion by Boyan Onyshkevych, we conducted a small experiment after the conference. Because one is told in advance that almost every article in the corpus will have a reportable event, we modified the system to generate a tie-up between a Japanese company and an Indonesian company (the two most frequent nationalities in the training corpus) whenever the text analysis components were not able to find a tie-up. This simple strategy reduced our error rate on the test corpus by 2%.</Paragraph>
  </Section>
  <Section position="6" start_page="189" end_page="192" type="metho">
    <SectionTitle>
WHAT'S NEW FOR MUC-5
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="189" end_page="192" type="sub_section">
      <SectionTitle>
Lexical Analyzer
</SectionTitle>
      <Paragraph position="0"> The Proteus system has a pattern matcher based on regular expressions with provision for procedural tests, which is intended for identifying lexical units before parsing.</Paragraph>
      <Paragraph position="1"> Prior to MUC-5, the system employed a small number of patterns, for structures such as dates, times, and numbers.</Paragraph>
      <Paragraph position="2"> The set of patterns was substantially enlarged for MUC-5, to include patterns for different types of currencies, for company names, for people's names, for locations, and for names of indeterminate type. In mixed-case text, we used capitalization as the primary indication of the beginning of a name; in monocase text, we employed BBN's part-of-speech tagger and looked for proper noun tags.</Paragraph>
      <Paragraph position="3"> The lexical scanner and the constraints of the lexico-semantic models acted in concert to classify names. If there was a clear lexical clue (a corporate designator at the end of a name, a title (&quot;Mr.&quot;, &quot;President&quot;, ...) or middle initial in a personal name), the type was assigned by the lexical scanner. If the type of a name could not be determined by the scanner, but the name occurred in a context where only one type was allowed (e.g., as the object of &quot;own&quot;), the type would be assigned as a side effect of applying the lexico-semantic model.</Paragraph>
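      <Paragraph> The two-step name classification described above can be sketched in Python as follows. The designator and title sets are small assumed samples (the text mentions only "Mr." and "President", and the first sentence of the sample message supplies "CO."), and the function names and the context_types argument are hypothetical stand-ins for the lexical scanner and the constraint imposed by a lexico-semantic model.

```python
# Hypothetical sketch of name typing by lexical clue, then by context.

DESIGNATORS = {"CO.", "CORP.", "INC.", "LTD."}  # assumed sample
TITLES = {"MR.", "MS.", "DR.", "PRESIDENT"}     # assumed sample


def classify_by_scanner(tokens):
    """Return 'company' or 'person' if a clear lexical clue exists, else None."""
    upper = [t.upper() for t in tokens]
    if upper and upper[-1] in DESIGNATORS:
        return "company"
    if upper and upper[0] in TITLES:
        return "person"
    # Middle initial: a single letter plus period between first and last tokens.
    for tok in upper[1:-1]:
        if len(tok) == 2 and tok[0].isalpha() and tok[1] == ".":
            return "person"
    return None


def classify(tokens, context_types=None):
    """Use the scanner first; otherwise accept a context allowing exactly one type."""
    found = classify_by_scanner(tokens)
    if found is not None:
        return found
    if context_types is not None and len(context_types) == 1:
        return next(iter(context_types))
    return None
```

A name with no lexical clue stays untyped unless the surrounding model permits exactly one type, which mirrors the side-effect typing described in the text.</Paragraph>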
      <Paragraph position="4"> Semantic Pattern and Similarity Acquisition. We have spent considerable time over the last two years building tools to acquire semantic patterns and semantic word similarities from corpora [5, 6], and we had hoped that these would be of significant benefit in our MUC-5 efforts, particularly in broadening our system's coverage. However, we did not have much opportunity to use these tools, since so much of our time was consumed in building an initial system at some minimal performance level.</Paragraph>
      <Paragraph position="5"> The lexico-semantic models as used previously specified a single level in the regularized parse tree structure: either a clause with its arguments and modifiers, or an NP with its modifiers. We have found it increasingly valuable, however, to be able to specify larger patterns which involve several parse tree levels, such as &quot;X signed an agreement with Y to do Z&quot;, or &quot;X formed a joint venture with Y to do Z&quot;. We have therefore extended our system to allow for such larger patterns, and to permit the developer to specify the predicate structure into which this larger pattern should be mapped.</Paragraph>
      <Paragraph position="6"> Model Builder. Once we began to allow these larger patterns, we found that the task of writing such patterns correctly became quite challenging. Our long-term goal is to enable a user to add such patterns, but we seemed (with the added complexity) to be moving further from this goal. We therefore implemented a &quot;model builder&quot; interface which allows the developer to enter a prototype sentence and the corresponding predicate structure which should be produced. The interface then creates the required lexico-semantic patterns and mapping rules.</Paragraph>
      <Paragraph position="7"> For example, to handle constructs of the form &quot;company signed an agreement with company to ...&quot;, the developer would enter the sentence company1 signed {an agreement with company2 to act3}.</Paragraph>
      <Paragraph position="8"> (where the braces, which are optional, indicate the NP bracketing) and would give the corresponding predicate (c-agree :agent company1 :co-agent company2 :event act3). The system would then create models and mapping rules appropriate to a sentence such as &quot;IBM signed an agreement with Apple to form a joint venture.&quot; Since these rules apply to the syntactically analyzed sentence, they would also handle syntactic variants such as &quot;The agreement to create the new venture was signed last week by IBM and Ford.&quot;</Paragraph>
    </Section>
  </Section>
</Paper>