File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/93/m93-1016_metho.xml
Size: 24,200 bytes
Last Modified: 2025-10-06 14:13:30
<?xml version="1.0" standalone="yes"?> <Paper uid="M93-1016"> <Title>TOP-LEVEL-FLAG NIL :PREDICATE C-FORM :TENSE PRESENT :ASPECT PERF :JOINT-VENTURE (ENTITY</Title> <Section position="1" start_page="0" end_page="182" type="metho"> <SectionTitle> NEW YORK UNIVERSITY: DESCRIPTION OF THE PROTEUS SYSTEM AS USED FOR MUC-5 </SectionTitle> <Paragraph position="0"> The Proteus system which we have used for MUC-5 is largely unchanged from that used for MUC-3 and MUC-4. It has three main components: a syntactic analyzer, a semantic analyzer, and a template generator.</Paragraph> <Paragraph position="1"> The Proteus syntactic analyzer was developed starting in the fall of 1984 as a common base for all the applications of the Proteus Project. Many aspects of its design reflect its heritage in the Linguistic String Parser, previously developed and still in use at New York University.</Paragraph> <Paragraph position="2"> The current system, including the Restriction Language compiler, the lexical analyzer, and the parser proper, comprises approximately 4500 lines of Common Lisp.</Paragraph> <Paragraph position="3"> The semantic analyzer was initially developed in 1987 for the MUCK-I (RAINFORMs) application, extended for the MUCK-II (OPREPS) application, and has been incrementally revised since. It currently consists of about 3000 lines of Common Lisp (excluding the domain-specific information).</Paragraph> <Paragraph position="4"> The template generator was written from scratch for the MUC-5 joint venture task; it is about 1200 lines of Common Lisp.</Paragraph> <Paragraph position="5"> Stages of processing The text goes through five major stages of processing: lexical analysis, syntactic analysis, semantic analysis, reference resolution, and template generation (see figure 1).
In addition, some restructuring of the logical form is performed both after semantic analysis and after reference resolution (only the restructuring after reference resolution is shown in figure 1). Processing is basically sequential: each sentence goes through lexical, syntactic, and semantic analysis and reference resolution; the logical form for the entire message is then fed to template generation. However, semantic (selectional) checking is performed during syntactic analysis, employing essentially the same code later used for semantic analysis.</Paragraph> <Section position="1" start_page="182" end_page="182" type="sub_section"> <SectionTitle> Lexical Analysis Dictionary Format </SectionTitle> <Paragraph position="0"> Our dictionaries contain only syntactic information: the parts of speech for each word, information about the complement structure of verbs, distributional information (e.g., for adjectives and adverbs), etc. We follow closely the set of syntactic features established for the NYU Linguistic String Parser.
This information is entered in LISP form using noun, verb, adjective, and adverb macros for the open-class words, and a word macro for other parts of speech:</Paragraph> </Section> </Section> <Section position="2" start_page="182" end_page="183" type="metho"> <SectionTitle>
(ADVERB &quot;ABRUPTLY&quot; :ATTRIBUTES (DSA))
(ADJECTIVE &quot;ABRUPT&quot;)
(NOUN :ROOT &quot;ABSCESS&quot; :ATTRIBUTES (NCOUNT))
(VERB :ROOT &quot;ABSCOND&quot; :OBJLIST (NULLOBJ PN (PVAL (FROM WITH))))
</SectionTitle> <Paragraph position="0"> The noun and verb macros automatically generate the regular inflectional forms.</Paragraph> <Section position="1" start_page="182" end_page="183" type="sub_section"> <SectionTitle> Dictionary Files </SectionTitle> <Paragraph position="0"> The primary source of our dictionary information about open-class words (nouns, verbs, adjectives, and adverbs) is the machine-readable version of the Oxford Advanced Learner's Dictionary (&quot;OALD&quot;). We have written programs which take the SGML (Standard Generalized Markup Language) version of the dictionary, extract information on inflections, parts of speech, and verb subcategorization (including information on adverbial particles and prepositions gleaned from the examples), and generate the LISP-ified form shown above. This is supplemented by a manually-coded dictionary (about 1500 lines, 900 entries) for closed-class words, words not adequately defined in the OALD, and a few very common words.
In addition, we used several specialized dictionaries for MUC-5, including a location dictionary (with all countries, continents, and major cities (CITY1 or PORT1 in the gazetteer)), a dictionary of corporate designators, a dictionary of job titles, and a dictionary of currencies.</Paragraph> <Paragraph position="1"> Lookup The text reader splits the input text into tokens and then attempts to assign to each token (or sequence of tokens, in the case of an idiom) a definition (part of speech and syntactic attributes). The matching process proceeds in five steps: dictionary lookup, lexical pattern matching, spelling correction, prefix stripping, and default definition assignment. Dictionary lookup immediately retrieves definitions assigned by any of the dictionaries (including inflected forms). The specialized dictionaries are stored in memory, while the main dictionary is accessed from disk (using hashed index random access).</Paragraph> <Paragraph position="2"> Lexical pattern matching is used to identify a variety of specialized patterns, such as numbers, dates, times, and possessive forms. The set of lexical patterns was substantially expanded for MUC-5 to include various forms of people's names, company names, locations, and currencies.</Paragraph> <Paragraph position="3"> The lexical patterns are further discussed below, in the &quot;What's new for MUC-5&quot; section.</Paragraph> <Paragraph position="4"> If neither dictionary lookup nor lexical pattern matching is successful, spelling correction and prefix stripping are attempted. For words of any length, we identify an input token as a misspelled form of a dictionary entry if one of the two has a single instance of a letter while the other has a doubled instance of the letter (e.g., &quot;mispelled&quot; and &quot;misspelled&quot;). The prefix stripper attempts to identify the token as a combination of a prefix (e.g., &quot;un&quot;) and a word defined in the dictionary.
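</Paragraph> <Paragraph position="4"> The doubled-letter test and the prefix stripper just described can be illustrated with a minimal sketch. It is written in Python rather than the Common Lisp of the actual system, and it is not the Proteus code: the names DICTIONARY, PREFIXES, doubled_letter_variants, correct_spelling, and strip_prefix, as well as the tiny word list, are assumptions made only for illustration.

```python
# Illustrative sketch only; the word list and prefix list are hypothetical.
DICTIONARY = {"misspelled", "happy", "abscond"}
PREFIXES = ("un", "re", "non")


def doubled_letter_variants(token):
    """Yield strings that differ from token by doubling one single letter
    or by reducing one doubled letter to a single letter."""
    for i, ch in enumerate(token):
        prev_ch = token[i - 1] if i else ""
        next_ch = token[i + 1:i + 2]
        after_next = token[i + 2:i + 3]
        if prev_ch != ch and next_ch != ch:
            # Single instance of the letter: try the doubled form.
            yield token[:i] + ch + token[i:]
        if prev_ch != ch and next_ch == ch and after_next != ch:
            # Doubled instance of the letter: try the single form.
            yield token[:i] + token[i + 1:]


def correct_spelling(token, dictionary=DICTIONARY):
    """Return a dictionary entry that matches token under the
    single-versus-doubled-letter rule, or None if there is none."""
    for candidate in doubled_letter_variants(token):
        if candidate in dictionary:
            return candidate
    return None


def strip_prefix(token, dictionary=DICTIONARY, prefixes=PREFIXES):
    """Return (prefix, word) if token is a listed prefix followed by a
    dictionary word, or None otherwise."""
    for prefix in prefixes:
        rest = token[len(prefix):]
        if token.startswith(prefix) and rest in dictionary:
            return prefix, rest
    return None
```

Under these assumptions, correct_spelling(&quot;mispelled&quot;) returns &quot;misspelled&quot; and strip_prefix(&quot;unhappy&quot;) returns the pair (&quot;un&quot;, &quot;happy&quot;); a token that matches neither rule yields None from both functions. </Paragraph> <Paragraph position="4">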
If all of these procedures fail, we assign a default definition. In mixed-case text, undefined capitalized words are tagged as proper nouns; undefined lowercase words are tagged as common nouns. In monocase text, all undefined words are tagged as proper nouns.</Paragraph> </Section> <Section position="2" start_page="183" end_page="183" type="sub_section"> <SectionTitle> Syntactic Analysis </SectionTitle> <Paragraph position="0"> Syntactic analysis involves two stages of processing: parsing and syntactic regularization. At the core of the system is an active chart parser. The grammar is an augmented context-free grammar, consisting of BNF rules plus procedural restrictions which check grammatical constraints not easily captured in the BNF rules. Most restrictions are stated in PROTEUS Restriction Language (a variant of the language developed for the Linguistic String Parser) and translated into LISP; a few are coded directly in LISP [1]. For example, the count noun restriction (that singular countable nouns have a determiner) is stated as</Paragraph> <Paragraph position="2"/> </Section> </Section> <Section position="3" start_page="183" end_page="183" type="metho"> <SectionTitle> IF BOTH CORE Xcore IS NCOUNT AND Xcore IS SINGULAR THEN IN LN, TPOS IS NOT EMPTY. </SectionTitle> <Paragraph position="0"> Associated with each BNF rule is a regularization rule, which computes the regularized form of each node in the parse tree from the regularized forms of its immediate constituents. These regularization rules are based on lambda-reduction, as in GPSG. The primary function of syntactic regularization is to reduce all clauses to a standard form consisting of aspect and tense markers, the operator (verb or adjective), and syntactically marked cases.
For example, the definition of assertion, the basic S structure in our grammar, is &lt;assertion&gt; ::= &lt;sa&gt; &lt;subject&gt; &lt;sa&gt; &lt;verb&gt; &lt;sa&gt; &lt;object&gt; &lt;sa&gt; :(s !(&lt;object&gt; &lt;subject&gt; &lt;verb&gt; &lt;sa*&gt;)).</Paragraph> <Paragraph position="1"> Here the portion after the single colon defines the regularized structure. Coordinate conjunction is introduced by a metarule (as in GPSG), which is applied to the context-free components of the grammar prior to parsing. The regularization procedure expands any conjunction into a conjunction of clauses or of noun phrases. The output of the parser for the first sentence of 0592, &quot;BRIDGESTONE SPORTS CO.</Paragraph> </Section> <Section position="4" start_page="183" end_page="185" type="metho"> <SectionTitle> SAID FRIDAY IT HAS SET UP A JOINT VENTURE IN TAIWAN WITH A LOCAL CONCERN AND A JAPANESE TRADING HOUSE TO PRODUCE GOLF CLUBS TO BE SHIPPED TO JAPAN.&quot; is </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> and the corresponding regularized structure is</Paragraph> </Section> <Section position="5" start_page="185" end_page="189" type="metho"> <SectionTitle> (TIMEPREP (NP FRIDAY SINGULAR (SN NP157))) ) </SectionTitle> <Paragraph position="0"> The system uses a chart parser operating top-down, left-to-right. As edges are completed (i.e., as nodes of the parse tree are built), restrictions associated with those productions are invoked to assign and test features of the parse tree nodes.</Paragraph> <Paragraph position="1"> If a restriction fails, that edge is not added to the chart.
When certain levels of the tree are complete (those producing noun phrase and clause structures), the regularization rules are invoked to compute a regularized structure for the partial parse, and selection is invoked to verify the semantic well-formedness of the structure (as noted earlier, selection uses the same &quot;semantic analysis&quot; code subsequently employed to translate the tree into logical form).</Paragraph> <Paragraph position="2"> One unusual feature of the parser is its weighting capability. Restrictions may assign scores to nodes; the parser will perform a best-first search for the parse tree with the highest score. This scoring is used to implement various preference mechanisms:
* closest attachment of modifiers (we penalize each modifier by the number of words separating it from its head)
* preferred narrow conjoining for clauses (we penalize a conjoined clause structure by the number of words it subsumes)
* preference semantics (selection does not reject a structure, but imposes a heavy penalty if the structure does not match any lexico-semantic model, and a lesser penalty if the structure matches a model but with some operands or modifiers left over) [2,3]
* relaxation of certain syntactic constraints, such as the count noun constraint, adverb position constraints, and comma constraints
* disfavoring (penalizing) headless noun phrases and headless relatives (this is important for parsing efficiency)
The grammar is based on Harris's Linguistic String Theory and adapted from the larger Linguistic String Project (LSP) grammar developed by Naomi Sager at NYU [4]. The grammar is gradually being enlarged to cover more of the LSP grammar. The current grammar is 1600 lines of BNF and Restriction Language plus 300 lines of Lisp; it includes 186 non-terminals, 464 productions, and 132 restrictions. Over the course of the MUCs we have added several mechanisms for recovering from sentences the grammar cannot fully parse.
For MUC-5, we found that the most effective was our &quot;fitted parse&quot; mechanism, which attempts to cover the sentence with noun phrases and clauses, preferring the longest noun phrases or clauses which can be identified. Semantic Analysis And Reference Resolution The output of syntactic analysis goes through semantic analysis and reference resolution and is then added to the accumulating logical form for the message. Following both semantic analysis and reference resolution certain transformations are performed to simplify the logical form. All of this processing makes use of a concept hierarchy which captures the class/subclass/instance relations in the domain.</Paragraph> <Paragraph position="3"> Semantic analysis uses a set of lexico-semantic models to map the regularized syntactic analysis into a semantic representation. Each model specifies a class of verbs, adjectives, or nouns and a set of operands; for each operand it indicates the possible syntactic case markers, the semantic class of the operand, whether or not the operand is required, and the semantic case to be assigned to the operand in the output representation. For example, the model for &quot;&lt;entity&gt; forms a joint venture with The models are arranged in a shallow hierarchy with inheritance, so that arguments and modifiers which are shared by a class of verbs need only be stated once. The model above inherits only from the most general clause model, clause-any, which includes general clausal modifiers such as negation, time, tense, modality, etc. The MUC-5 system has 61 clause models, 2 nominalization models, and 45 other noun phrase models, a total of about 1700 lines.
The class C-muc5-entity in the clause model refers to the concept in the concept hierarchy, whose entries have the form:</Paragraph> <Paragraph position="5"> There are currently a total of 154 concepts in the hierarchy.</Paragraph> <Paragraph position="6"> The output of semantic analysis is a nested set of entity and event structures, with arguments labeled by keywords primarily designating semantic roles.</Paragraph> <Paragraph position="7"> For the first sentence of 0593, the output is Reference resolution Reference resolution is applied to the output of semantic analysis in order to replace anaphoric noun phrases (representing either events or entities) by appropriate antecedents. Each potential anaphor is compared to prior entities or events, looking for a suitable antecedent such that the class of the anaphor (in the concept hierarchy) is equal to or more general than that of the antecedent, the anaphor and antecedent match in number, the restrictive modifiers in the anaphor have corresponding arguments in the antecedent, and the non-restrictive modifiers (e.g., apposition) of the anaphor are not inconsistent with those of the antecedent. Special tests are provided for names, since people and companies may be referred to by a subset of their full names.</Paragraph> <Paragraph position="8"> Logical form transformations The transformations which are applied after semantic analysis and after reference resolution simplify and regularize the logical form in various ways. The transformations after semantic analysis primarily standardize the attribute structure of entities so that reference resolution will work properly. The transformations after reference resolution simplify the task of template generation by casting the events in a more uniform framework and performing a limited number of inferences.
For example, we show here a rule which transforms the logical form produced from &quot;X formed a joint venture with Y&quot; into the equivalent for &quot;X and Y formed a joint venture&quot;: ((modify 1 (list :agent (conjoin-entities '?agent '?company-list-2))) (modify 2 '(:agent nil :tied-up t)))) There are currently 32 such rules. These transformations are written as productions and applied using a simple data-driven production system interpreter which is part of the Proteus system.</Paragraph> <Paragraph position="9"> Template generator Once all the sentences in an article have been processed through syntactic analysis, semantic analysis, and the logical form transformations, the resulting logical forms are sent to the template generator. The logical form events and entities produced by the transformations are in close correspondence to the template objects needed for MUC-5, so the template generation is fairly straightforward. The greatest complexity was involved in the procedures for accessing the two large data bases, the gazetteer (for normalizing locations) and the Standard Industrial Classification (for Precision remained within a fairly narrow range, from 47 to 63, throughout the testing. Five months were available for development (March through July). One person was assigned full-time for the entire period; a second person assisted, approximately 2/3 time, for the last three months, for a total of about 7 person-months of effort (this excludes time in August preparing for the conference). March and April were devoted to getting an initial understanding of the fill rules, making minimal lexical scanner additions so that we could parse the text, developing input code to handle the different article formats, and developing some routines for larger-scale pattern matching (which were eventually not used). System integration and integrated system testing did not begin until mid-May, a couple of weeks before the dry run.
Daily system testing began with a set of 25 articles, but shifted after the dry run to the first 100 dry-run messages (with the second 100 dry-run messages being used on occasion as a blind test).</Paragraph> <Paragraph position="10"> The overhead of getting started (understanding the fill rules, handling the different article formats, generating the more complex templates, and using the various data bases: gazetteer, SIC, currency table, corporate designator table) was much greater than for prior MUCs, while the manpower we had for the project was in fact somewhat less. In consequence, our system is relatively less developed than our MUC-3 system, for example. In particular, the attribute structure for the principal entity types (for MUC-5, companies) was less developed; this adversely impacted the performance of our reference resolution component and hence our event merging.</Paragraph> <Paragraph position="11"> This impact was evident in our performance on the walkthrough message, 0593. We identified the primary constituent events (the joint venture and the associated ownership relations), but we failed to identify several of the co-reference relations, because of
* a bug in the handling of appositional names followed by relative clauses
* failure to do spelling correction on names (we only correct spellings to match dictionary entries)
* shortcomings in the attribute structure of company entities
Because of these problems and a weak event merging rule (compared to the more detailed rules developed for MUC-4, for example), we generated two separate tie-ups for the article, instead of one.</Paragraph> <Paragraph position="12"> The system was also not tuned to any significant degree to take advantage of the MUC-5 scoring rules. Based on a suggestion by Boyan Onyshkevych, we conducted a small experiment after the conference.
Because one is told in advance that almost every article in the corpus will have a reportable event, we modified the system to generate a tie-up between a Japanese company and an Indonesian company (the two most frequent nationalities in the training corpus) whenever the text analysis components were not able to find a tie-up. This simple strategy reduced our error rate on the test corpus by 2%.</Paragraph> </Section> <Section position="6" start_page="189" end_page="192" type="metho"> <SectionTitle> WHAT'S NEW FOR MUC-5 </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="189" end_page="192" type="sub_section"> <SectionTitle> Lexical Analyzer </SectionTitle> <Paragraph position="0"> The Proteus system has a pattern matcher based on regular expressions with provision for procedural tests, which is intended for identifying lexical units before parsing.</Paragraph> <Paragraph position="1"> Prior to MUC-5, the system employed a small number of patterns, for structures such as dates, times, and numbers.</Paragraph> <Paragraph position="2"> The set of patterns was substantially enlarged for MUC-5, to include patterns for different types of currencies, for company names, for people's names, for locations, and for names of indeterminate type. In mixed-case text, we used capitalization as the primary indication of the beginning of a name; in monocase text, we employed BBN's part-of-speech tagger and looked for proper noun tags.</Paragraph> <Paragraph position="3"> The lexical scanner and the constraints of the lexico-semantic models acted in concert to classify names. If there was a clear lexical clue (a corporate designator at the end of a name, a title (&quot;Mr.&quot;, &quot;President&quot;, ...) or middle initial in a personal name), the type was assigned by the lexical scanner.
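</Paragraph> <Paragraph position="3"> The lexical clues just listed can be illustrated with a minimal sketch. It is written in Python rather than the Common Lisp and regular-expression patterns of the actual scanner, and it is not the Proteus code: the designator and title lists and the names is_middle_initial and classify_name are assumptions made only for illustration.

```python
# Illustrative sketch only: the designator and title lists are hypothetical,
# not the dictionaries used by Proteus.
CORPORATE_DESIGNATORS = {"CO.", "CORP.", "INC.", "LTD."}
TITLES = {"MR.", "MS.", "DR.", "PRESIDENT"}


def is_middle_initial(token):
    """True for a single letter followed by a period, e.g. "K."."""
    return len(token) == 2 and token[0].isalpha() and token[1] == "."


def classify_name(tokens):
    """Return "company" or "person" when a clear lexical clue is present,
    and None when the tokens alone do not determine the type."""
    upper = [t.upper() for t in tokens]
    if upper and upper[-1] in CORPORATE_DESIGNATORS:
        return "company"   # corporate designator at the end of the name
    if upper and upper[0] in TITLES:
        return "person"    # title at the start of the name
    if any(is_middle_initial(t) for t in upper[1:-1]):
        return "person"    # initial between the first and last tokens
    return None            # no lexical clue found
```

Under these assumptions, classify_name([&quot;Bridgestone&quot;, &quot;Sports&quot;, &quot;Co.&quot;]) returns &quot;company&quot;, classify_name([&quot;John&quot;, &quot;K.&quot;, &quot;Smith&quot;]) returns &quot;person&quot;, and a name with none of these clues returns None. </Paragraph> <Paragraph position="3">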
If the type of a name could not be determined by the scanner, but the name occurred in a context where only one type was allowed (e.g., as the object of &quot;own&quot;), the type would be assigned as a side effect of applying the lexico-semantic model.</Paragraph> <Paragraph position="4"> Semantic Pattern and Similarity Acquisition We have spent considerable time over the last two years building tools to acquire semantic patterns and semantic word similarities from corpora [5, 6], and we had hoped that these would be of significant benefit in our MUC-5 efforts, particularly in broadening our system's coverage. However, we did not have much opportunity to use these tools, since so much of our time was consumed in building an initial system at some minimal performance level.</Paragraph> <Paragraph position="5"> The lexico-semantic models as used previously specified a single level in the regularized parse tree structure: either a clause with its arguments and modifiers, or an NP with its modifiers. We have found it increasingly valuable, however, to be able to specify larger patterns which involve several parse tree levels, such as &quot;X signed an agreement with Y to do Z&quot;, or &quot;X formed a joint venture with Y to do Z&quot;. We have therefore extended our system to allow such larger patterns and to permit the developer to specify the predicate structure into which this larger pattern should be mapped.</Paragraph> <Paragraph position="6"> Model Builder Once we began to allow these larger patterns, we found that the task of writing such patterns correctly became quite challenging. Our long-term goal is to enable a user to add such patterns, but we seemed (with the added complexity) to be moving further from this goal. We therefore implemented a &quot;model builder&quot; interface which allows the developer to enter a prototype sentence and the corresponding predicate structure which should be produced.
The interface then creates the required lexico-semantic patterns and mapping rules.</Paragraph> <Paragraph position="7"> For example, to handle constructs of the form &quot;company signed an agreement with company to ...&quot;, the developer would enter the sentence company1 signed {an agreement with company2 to act3}.</Paragraph> <Paragraph position="8"> (where the braces, which are optional, indicate the NP bracketing) and would give the corresponding predicate (c-agree :agent company1 :co-agent company2 :event act3) The system would then create models and mapping rules appropriate to a sentence such as &quot;IBM signed an agreement with Apple to form a joint venture.&quot; Since these rules apply to the syntactically analyzed sentence, they would also handle syntactic variants such as &quot;The agreement to create the new venture was signed last week by IBM and Ford.&quot;</Paragraph> </Section> </Section> </Paper>