<?xml version="1.0" standalone="yes"?>
<Paper uid="C88-1072">
  <Title>A News Analysis System</Title>
  <Section position="4" start_page="351" end_page="354" type="intro">
    <SectionTitle>
3.3 The Parser
</SectionTitle>
    <Paragraph position="0"> Central to NAS is the parser which provides syntactic structures that are eventually mapped onto concepts resulting in an index for a story.</Paragraph>
    <Paragraph position="1"> The parser is a principle-based GB parser and is a substantially revised version of /Kuhns 1986/.</Paragraph>
    <Paragraph position="2"> (See /Abney 1986/, /Berwick 1987/, /Kashket 1987/, /Thiersch 1987/, and /Wehrli 1987/ for descriptions of principle-based parsers.) The parsing strategy is deterministic in that no temporary structures are built or information deleted during the course of a parse (/Berwick 1987/ and /Marcus 1980/). It should be noted that in connection with this type of application, speed is crucial, and although a deterministic parser is strict in that it cannot backtrack or produce alternate parses in ambiguous sentences, its speed of approximately 100 words/second in linear time is essential.</Paragraph>
    <Paragraph position="3"> The parser has two main subsystems, viz., the set of interacting GB-modules and the lexicon.</Paragraph>
    <Paragraph position="4"> These modules include principles and constraints from Case and bounding theories and, especially, X-bar, thematic or θ, trace, binding, and control theories. These latter subsystems have a particularly prominent role for the parser.</Paragraph>
    <Paragraph position="5"> Predicate-argument relations or θ-role assignments to arguments of predicates are determined by θ-theory. In the case where movement has occurred, trace theory will relate an argument which now must reside in a position which cannot receive a θ-role with its empty category or trace in a θ-marked position from which the constituent has moved.</Paragraph>
    <Paragraph position="6"> This enables the θ-role of the argument to be determined. Possible coreferential relations for pronominals and anaphors are identified with principles of binding and control theory.</Paragraph>
    <Paragraph position="7"> Moreover, the Extended Projection Principle, θ-Criterion, and Case filter are observed by the parser. (For a full discussion of these modules and principles see /Chomsky 1981/.) The primary output of the parser is a set of licensing relations. &quot;Licensing&quot; is a cover term for any of a number of possible relations between projections. Nonmaximal projections are licensed by maximal projections via X-bar theory and these maximal projections are licensed by an argument or a trace of an argument, a predicate, or an operator. Specifically, a predicate licenses its internal arguments or complements and its external argument or its subject. (Again, for a more detailed discussion of these aspects of GB theory see /Chomsky 1986/.) In that the goal of the parser is to license projections of each element of a sentence, it can perform two basic operations. It can construct a projection of a lexical item in direct use of X-bar theory or it can establish or assert a licensing relation between two maximal projections with respect to other constraints of GB theory. The parser proceeds by first building a maximal projection and then attempting to license it to another maximal projection or vice versa, i.e., another projection to it.</Paragraph>
    <Paragraph position="8"> Upon encountering a lexical item, the parser creates a maximal projection consisting of a set of features. Each node receives a type in terms of X-bar primitives (±N, ±V), an index, and the lexical item from which it has projected.</Paragraph>
    <Paragraph position="9"> Relevant GB systems are invoked during the parse to determine binding relations and θ-role assignments. The proper index to encode binding or coreference will be incorporated in the projection, and co-indexed projections share all of their features. However, it is not always possible to assign an index or θ-role at the inception of a projection because of inadequate information. The parser will not commit itself and will only include the syntactic structure that it can derive at that stage of the parse. When the relevant information is available, the parser will incorporate it in the incomplete node, which preserves the monotonicity of parsing information. This process is constrained to the current cyclic node, which is the left bounded context of the parser. (/Kuhns in preparation/ will discuss the specifics of this parser.) The parser produces a list of licensing relations for each sentence of a news story. In turn it outputs an ordered list of the relations corresponding to the sentences of a news report.</Paragraph>
    <Paragraph position="10"> This set is then passed to the semantic processor.</Paragraph>
    <Paragraph position="11"> The other component of the linguistic processor of NAS is the lexicon, which contains words and distinguished strings, together with their syntactic and subcategorization features including X-bar primitives (±N, ±V), number, name or referential expressions, complement types, control features (for interpreting empty subjects (PRO) of infinitival complements), and θ-grids or θ-role assignments for predicates. An ambiguous lexical entry has features for all of its potential types associated with that item, and lexical ambiguity resolution procedures choose the appropriate features during the parse (/Milne 1983/ and /Milne 1986/).</Paragraph>
    <Paragraph position="12"> Morphology is minimal, reflecting only relations between roots and their derivational forms and associations between words and affixes.</Paragraph>
    <Paragraph position="13"> Lexical redundancy rules for specifying correspondences between sets of features have been implemented. Since news reports frequently have abbreviations, lexical entries which have an abbreviated form will be marked as such, and when the abbreviation appears in a story, the lexical scanner retrieves the lexical information of the unabbreviated form. Relationships between lexical items and their extragrammatical features will be discussed below (Section 3.4).</Paragraph>
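    <Paragraph> The abbreviation handling described above can be pictured with a small sketch. It is only an illustration under stated assumptions, not the NAS lexicon: the dictionary layout, the feature names, the abbreviation_of key, and the lookup function are all hypothetical.

```python
# Hypothetical lexicon layout; entry names and features are illustrative only.
LEXICON = {
    "corporation": {"category": "N", "number": "singular"},
    # An abbreviated entry is marked as such and points at its full form.
    "corp.": {"abbreviation_of": "corporation"},
}


def lookup(word, lexicon=LEXICON):
    """Return the features of a word, following an abbreviation link if present."""
    entry = lexicon.get(word.lower())
    if entry is None:
        return None  # unknown word; a real scanner would hand off to error handling
    full_form = entry.get("abbreviation_of")
    if full_form is not None:
        # Retrieve the lexical information of the unabbreviated form.
        return lexicon[full_form]
    return entry
```

    Storing a pointer instead of a copy of the features means that a change to the full entry is seen by its abbreviation as well, which is one plausible reason for marking abbreviations in this way.</Paragraph>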
    <Paragraph position="14"> The lexicon consists of fewer than 15,000 members and in building the lexicon the emphasis has been on the inclusion of verbs, adjectives, and prepositions. Names, especially of individuals, corporations, and geographical locations, not present in the lexicon are found in news reports regularly. While many familiar names are in the lexicon, unfamiliar nouns are handled by the error handling routines (Section 5.0).</Paragraph>
    <Paragraph position="15"> While the lexicon is updated as needed, the way it was originally constructed was to collect distinct &quot;words&quot; from stories received from a satellite feed. Numbers were disregarded but names and abbreviations were included. During several non-continuous weeks of scanning the stream for new words, the task of assigning syntactic features to each valid item began. While this is a laborious and time-consuming process, it was aided by a menu-driven facility for feature assignment where typing was minimized and much time saved.</Paragraph>
    <Paragraph position="16"> Also, during the time that previously unknown words were being &quot;collected,&quot; a counter was indicating the number of current words in increments of 100. When the list was slightly over 7,000, the number of new words being added to it slowed. Furthermore, a point of convergence seemed to occur under 9,500 items. At this stage of lexicon development, a comparison of the existing words against a sample of over 50 words (mainly verbs and adjectives) taken from another news service suggested that the present list was sufficient in that it contained every word taken from the news stories. This is significant because a system which is to parse sentences within a story must have the capability of recognizing each word.</Paragraph>
    <Paragraph position="17"> Since it appears that the vocabulary of reports is bounded (with the exception of names), rapid linguistic processing of news is realizable with respect to lexical recognition.</Paragraph>
    <Paragraph position="18"> 3.4 The Semantic Processor
    The semantic processor is an automatic pattern matcher which incorporates world knowledge that is used to determine the &quot;meaning&quot; of its linguistic input with respect to a set of topics and designators in its concept base. The term concept refers to a general notion such as merger/acquisition, terrorism, currency report, or strikes and lockouts. Designators are subtopics which provide detail to an index. A story categorized as a merger/acquisition could be further characterized by designators indicating specific companies involved or by the industries impacted. The existing system has the capability of processing over 200 concepts and designators.</Paragraph>
    <Paragraph position="19"> The output of this processor (and NAS) is a classification or index of a story consisting of one or more general concepts and their designators. If no general concept is found, the system may still assign designators. In other words, a story may be about Air France while the general classification is unknown.</Paragraph>
    <Paragraph position="20"> Structurally, the processor can be viewed as having a concept base and a θ-relation interpreter which takes as input the predicate-argument structures denoted by θ-relations and attempts to find matches with elements in the concept base.</Paragraph>
    <Paragraph position="21"> The concept base itself possesses an internal structure consisting of several levels of abstraction. The most concrete level consists of names which enter into an index whenever present in a story. This level primarily contains names of corporations, industries, corporate executives, government officials, and geographical locations.</Paragraph>
    <Paragraph position="22"> In order to keep the linguistic and the application-dependent concepts independent, pointers between the lowest level of the concept base and the lexicon are used. A change to the concept base or substitution of a new one will not affect the linguistic component.</Paragraph>
    <Paragraph position="23"> Representations at the next level reflect commonality which the elements at the first level share, and together they provide designators for a story. The objects at this more abstract level are called entity types and they further characterize the members of the first level. Two common entity types are industry type and company. The semantic processor can assign an industry designator to a story if either the industry is explicitly mentioned in the story or if companies or individuals mentioned in the story are related to a particular industry. So a news item about Swiss Air will have both the name Swiss Air and its associated industry, viz., Airline Industry, assigned to its index.</Paragraph>
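    <Paragraph> As a rough sketch of how the two lower levels of the concept base could yield designators, consider the following. The NAMES table, its keys, and the designators function are hypothetical and stand in for whatever representation NAS actually uses; only the Swiss Air and Air Canada examples come from the text.

```python
# Hypothetical first-level entries: name, its entity type, and a related industry.
NAMES = {
    "Swiss Air": {"entity_type": "company", "industry": "Airline Industry"},
    "Air Canada": {"entity_type": "company", "industry": "Airline Industry"},
}


def designators(mentioned_names, names=NAMES):
    """Collect name and industry designators for the names found in a story."""
    result = []
    for name in mentioned_names:
        entry = names.get(name)
        if entry is None:
            continue  # names outside the concept base contribute nothing here
        # The name itself enters the index, followed by its associated industry.
        for item in (name, entry.get("industry")):
            if item is not None and item not in result:
                result.append(item)
    return result
```

    Under these assumptions a story mentioning only Swiss Air receives the designators Swiss Air and Airline Industry, independent of whether any general concept is later assigned.</Paragraph>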
    <Paragraph position="24"> The last and most abstract level is that of a general concept such as merger/acquisition, currency report, strikes and lockouts, and terrorism. These are represented by frames where there is one action slot and at least one entity type slot (determined from the previous level).</Paragraph>
    <Paragraph position="25"> Moreover, one concept may have several different representations. The action slot is a list of one or more synonymous words or phrases that denote an action or the &quot;doing&quot; component of a concept. The members of the action slot are not semantic primitives but are actual words. Furthermore, they are word stems and not all of their morphological variants. The entity type slots contain types of entities which are found in the previously discussed level of the concept base. For example, a partial representation for merger/acquisition is:
      (1) Merger/Acquisition
          Action: buy, take over
          Agent: company
          Object: company
    where buy or take over is the action and the entity type slots are labeled agent and object and their members must be of the type company. Details of this formalism are discussed below in connection with the θ-relation interpreter.</Paragraph>
    <Paragraph position="26"> The other module of the semantic processor is a θ-relation interpreter which maps θ-relations of each sentence of a news story into the concept base, or, in other words, onto specific concepts and designators. This mapping is executed as follows. First, recall that the parser returns a set of licensing relations including θ-relations for each story. Each member of this set is a list of the relations for a sentence of the story. In examining the θ-relations for a sentence, the interpreter attempts to establish general concepts by pairing the predicate and arguments of a θ-relation with the action and entity type slots of a concept, respectively. For example, consider a merger/acquisition frame (1) and a θ-relation which has bought as a predicate with its agent being Acme Corp. and its object as Software Inc.</Paragraph>
    <Paragraph position="27"> The θ-relation interpreter first determines that bought is related to buy and that buy is a member of the action slot. Since this comparison is successful, the interpreter then derives the entity types of Acme Corp. and Software Inc. from the abbreviations Corp. and Inc. Both have an entity type of company, and the interpreter can match the argument structure of the θ-relation with the entity type slots of (1), resulting in a merger/acquisition classification being assigned to the story.</Paragraph>
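    <Paragraph> The matching step just described can be sketched as follows. This is a minimal illustration and not the NAS implementation: the Frame class, the STEMS table, the suffix-based entity_type helper, and the assumption that the merger/acquisition frame lists the stems buy and take over with company-typed agent and object slots are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    concept: str
    actions: tuple      # word stems, not morphological variants
    agent_type: str     # entity type required of the agent
    object_type: str    # entity type required of the object


# Assumed form of the merger/acquisition frame discussed in the text.
MERGER = Frame("Merger/Acquisition", ("buy", "take over"), "company", "company")

# Hypothetical stand-ins for the morphology and for entity typing by abbreviation.
STEMS = {"bought": "buy", "buys": "buy", "took over": "take over"}
ENTITY_SUFFIXES = {"Corp.": "company", "Inc.": "company"}


def entity_type(name):
    """Derive an entity type from a trailing abbreviation, if there is one."""
    for suffix, etype in ENTITY_SUFFIXES.items():
        if name.endswith(suffix):
            return etype
    return None


def match(frame, predicate, agent, obj):
    """Pair a theta-relation with a frame; return an index entry or None."""
    stem = STEMS.get(predicate, predicate)
    if stem not in frame.actions:
        return None  # the predicate is not a member of the action slot
    if entity_type(agent) != frame.agent_type:
        return None
    if entity_type(obj) != frame.object_type:
        return None
    # Roles are kept so the index can distinguish buyer from acquired.
    return {"concept": frame.concept, "agent": agent, "object": obj}
```

    For instance, match(MERGER, "bought", "Acme Corp.", "Software Inc.") succeeds under these assumptions, while a predicate outside the action slot yields no general concept.</Paragraph>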
    <Paragraph position="28"> In attempting to determine a general categorization, the interpreter is encountering specific company names and, perhaps, their associated industry names. If these are contained in the concept base, they are also entered into the index. In this hypothetical example, if Software Inc. was listed in the concept base and related to the computer industry, then independent of the general classification, the final index would contain both the name of the company and its industry. In this way, a user can specify a particular company and receive all stories mentioning it, although there may not be any further index.</Paragraph>
    <Paragraph position="29"> Since the mapping of the interpreter between the θ-relations of the parser and the concepts in the concept base is structure preserving, the items within indexes can also exhibit certain relationships. Arguments which are either an agent or object in a θ-relation will correspond to entity slots marked agent and object in a concept, respectively. Thus, the index will reflect the roles in which the participants are engaged, e.g., in a merger/acquisition the buyer and the acquired could be distinguished.</Paragraph>
    <Paragraph position="30"> The next section provides several examples.</Paragraph>
    <Section position="1" start_page="353" end_page="353" type="sub_section">
      <SectionTitle>
4.0 Examples
</SectionTitle>
      <Paragraph position="0"> This section illustrates the type of indexes which NAS produces. The stories are from Reuters and the results are actual outputs from NAS.</Paragraph>
      <Paragraph position="1"> Story 1: Montreal, Nov 3 - Air Canada's 8,500 groundworkers plan rotating strikes in the next few days following a collapse in contract talks with the government-owned airline earlier today, a union spokesman said.</Paragraph>
      <Paragraph position="2"> Chief union negotiator Ron Fontaine said the workers will give 24 hours notice of a walkout but only two hours notice of which airports or maintenance centres they will strike.</Paragraph>
      <Paragraph position="3"> The airline has warned that it will lock out any workers participating in rotating strikes until a new contract agreement is reached. The union last went on strike in 1978, shutting down the airline for two weeks.</Paragraph>
      <Paragraph position="4"> Indexes:</Paragraph>
    </Section>
    <Section position="2" start_page="353" end_page="353" type="sub_section">
      <SectionTitle>
Strikes and Lockouts
Industry - Airlines
</SectionTitle>
      <Paragraph position="0"> The system has the concepts of strikes and lockouts and airlines industry in its concept base.</Paragraph>
      <Paragraph position="1"> The designator Airlines Industry is arrived at by a relation between Air Canada and its industry. The more general notion of Strikes and Lockouts appears as a frame in the concept base of the form:
        (2) Strikes and Lockouts
            Action: plan, participate
            Agent: employee
            Object: strike
      where the action slot consists of plan and participate and the agent slot is of type employee, of which groundworkers is so marked. The word strike is simply marked as strike. The parser returns a θ-relation for the first sentence with plan as a predicate, groundworkers as the agent, and strikes as the object. The θ-relation interpreter operates as described in the previous section and the Strikes and Lockouts frame is satisfied. Other typical results of processing by NAS are stories 2 and 3. Only the first sentence of each is provided since the remaining sentences of these news reports did not add any new information to the index.</Paragraph>
      <Paragraph position="2"> Story 2: Valley Forge, Pa, November 3 - Alco Standard Corp. said it sold two of its gift and glassware companies for an undisclosed amount of cash to management groups in leveraged buyouts.</Paragraph>
    </Section>
    <Section position="3" start_page="353" end_page="354" type="sub_section">
      <SectionTitle>
Instrument - Bombings
</SectionTitle>
      <Paragraph position="0"> Since the details of indexing are identical to those above, they will be omitted here. However, it is worth noting that the word divestment does not appear anywhere in Story 2.</Paragraph>
      <Paragraph position="1"> (Clearly, the verb sold alone could not trigger a divestment.) Similarly, in Story 3 terrorism is never used, yet NAS correctly indexes the story and also identifies the location and the weapon or instrument used.</Paragraph>
      <Paragraph position="2"> 5.0 Error Handling
      There are several ways in which NAS can fail to perform an analysis. If the scanner finds an unknown word, it will trigger procedures in an attempt to infer its category. For instance, it will look ahead for abbreviations such as inc., corp., or co., and if any of the strings are present, the scanner will assign name features to the immediately preceding unidentified words. (Ideally, in a fully deployed application, NAS would have interfaces to specialized databases of names, say, of companies.) Also, the lexical scanner, in failing to find a word in the lexicon and in the absence of certain triggers (e.g., inc.), will label the unknown word a noun and pass the word to the parser in the sentence. This method for handling unknown words works well only if verbs, adjectives, and prepositions used in news reports are nearly exhaustively contained in the lexicon, and NAS has been extremely successful by using this technique.</Paragraph>
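      <Paragraph> The scanner's treatment of unknown words can be sketched as below. This is an illustration under assumptions rather than the NAS scanner: the function names, the token list input, the set-based lexicon, and the choice to return a mapping from token position to category are hypothetical. Only the trigger abbreviations and the noun default come from the text.

```python
NAME_TRIGGERS = {"inc.", "corp.", "co."}  # abbreviations named in the text


def is_unknown(token, lexicon):
    """A token is unknown if it is neither in the lexicon nor a trigger."""
    t = token.lower()
    return t not in lexicon and t not in NAME_TRIGGERS


def tag_unknown(tokens, lexicon):
    """Map the position of each unknown token to 'name' or 'noun'."""
    tags = {}
    for i, token in enumerate(tokens):
        if not is_unknown(token, lexicon):
            continue
        # Look ahead past the run of unidentified words for a trigger.
        j = i + 1
        while j != len(tokens) and is_unknown(tokens[j], lexicon):
            j += 1
        followed_by_trigger = j != len(tokens) and tokens[j].lower() in NAME_TRIGGERS
        # Without a trigger, the unknown word is labelled a noun by default.
        tags[i] = "name" if followed_by_trigger else "noun"
    return tags
```

      The noun default is only safe when the lexicon covers verbs, adjectives, and prepositions almost exhaustively, which is the condition the text itself places on this technique.</Paragraph>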
      <Paragraph position="3"> Another potential problem for NAS is an incomplete or incorrect parse. Both cases often indicate insufficient information of a lexical item. However, during execution of NAS, if the parser cannot find a licensing relationship for a projection of an item in its input stream, it will move to the next word. This projection will remain unlicensed or uninterpreted. If the word has a semantic mark that may trigger a designator, the semantic processor will use it for constructing an index. For example, Yen is a low-level designator word and it is also semantically marked as currency. If the parser cannot license a projection containing this word to a verb or a preposition or, perhaps, misassigns a relation, the index will still contain Yen and currency report. What may be missing is a general categorization. In addition to extending and enhancing the components of the semantic processor and parser, the near term efforts will focus on establishing quantitative benchmarks for both speed and accuracy using stories from an active newswire. While a pre-prototype of NAS with a different and less-sophisticated scanner and less-developed parser and semantic processor relied on stories from floppy disks or manual entry, the current version is linked to a live feed. A rough performance measure of the pre-prototype on a very small sample of fewer than 50 stories showed that it was completely correct for over 70% of the stories. The present semantic processor operates on a much larger conceptual base and while it is premature to make assessments, the system has indexed one day of news stories from Reuters and the results were independently examined by a group of professional indexers. The indexers who had manually indexed the stories supplied over 400 topics for inclusion in the concept base of NAS, some of which were not relevant to any of the stories. There was no communication with these indexers before or during the process and while there were no formal criteria previously specified, the indexers found the results very promising.</Paragraph>
      <Paragraph position="4"> Currently, a precise evaluation metric for NAS is being formulated with these indexers.</Paragraph>
      <Paragraph position="5"> Long-term work will include enhancement to the semantic processor and a refinement of its classification scheme. Inferencing across classified stories is also an option as well as the capability of allowing the user to query those processed stories (using the same parser). Automatic summarization of stories is also a future possibility.</Paragraph>
      <Paragraph position="6"> 7.0 Acknowledgements
      Steve Gushing made valuable comments on an earlier draft of this paper. Dan Sullivan was a co-developer and implementer of the pre-prototype. On the present version of NAS, Steve Gander has made significant contributions to its design and implementation.</Paragraph>
    </Section>
  </Section>
</Paper>