File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/c96-1071_metho.xml
Size: 16,238 bytes
Last Modified: 2025-10-06 14:14:13
<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1071"> <Title>Evaluation of an Algorithm for the Recognition and Classification of Proper Names</Title> <Section position="3" start_page="418" end_page="421" type="metho"> <SectionTitle> 2 LaSIE system overview </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> LaSIE has been designed as a general purpose IE research system, initially geared towards, but not solely restricted to, carrying out the tasks specified by the sixth Message Understanding Conference: named entity recognition, coreference resolution, template element filling, and scenario template filling tasks (see (MUC6, 1995) for further details of the task descriptions). In addition, the system can generate a brief natural language summary of the scenario it has detected in the text.</Paragraph> <Paragraph position="3"> All of these tasks are carried out by building a single rich model of the text, the discourse model, from which the various results are read off.</Paragraph> <Paragraph position="4"> The high level structure of LaSIE is illustrated in Figure 1. The system is a pipelined architecture which processes a text sentence-at-a-time and consists of three principal processing stages: lexical preprocessing, parsing plus semantic interpretation, and discourse interpretation. 
The overall contributions of these stages may be briefly described as follows: * lexical preprocessing reads and tokenises the raw input text, tags the tokens with parts-of-speech, performs morphological analysis, performs phrasal matching against lists of proper names, and builds lexical and phrasal chart edges in a feature-based formalism for hand-over to the parser; * parsing does two pass parsing, pass one with a special proper name grammar, pass two with a general grammar and, after selecting a 'best parse', passes on a semantic representation of the current sentence which includes name class information; * discourse interpretation adds the information in its input semantic representation to a hierarchically structured semantic net which encodes the system's world model, adds additional information presupposed by the input to the world model, performs coreference resolution between new instances added and others already in the world model, and adds information consequent upon the addition of the input to the world model.</Paragraph> <Paragraph position="5"> For further details of the system see (Gaizauskas et al, 1995).</Paragraph> <Paragraph position="6"> 3 How proper names are recognised and classified As indicated in section 1, our approach is a heterogeneous one in which the system makes use of graphological, syntactic, semantic, world knowledge, and discourse level information for the recognition and classification of proper names. The system utilises both the information which comes from the name itself (internal evidence in McDonald's sense (McDonald, 1993)) as well as the information which comes from outside the name, from its context in the text (external evidence). 
In what follows we describe how proper names are recognised and classified in LaSIE by considering the contribution of each system component.</Paragraph> <Section position="1" start_page="418" end_page="419" type="sub_section"> <SectionTitle> 3.1 Lexical preprocessing </SectionTitle> <Paragraph position="0"> The input text is first tokenised and then each token is tagged with a part-of-speech tag from the Penn Treebank tagset (Marcus et al, 1993) using a slightly customised 1 version of Brill's tagger (Brill, 1994). The tagset contains two tags for proper nouns: NNP for singular proper nouns and NNPS for plurals. The tagger tags a word as a proper noun as follows: if the word is found in the tagger's lexicon and listed as a proper noun then tag it as such; otherwise, if the word is not found in the lexicon and is uppercase initial then tag it as a proper noun. Thus, capitalised unknown tokens are tagged as proper nouns by default.</Paragraph> <Paragraph position="1"> Before parsing an attempt is made to identify proper name phrases (sequences of proper names) and to classify them. This is done by matching the input against pre-stored lists of proper names. These lists are compiled via a flex program into a finite state recogniser. 
Each sentence is fed to the recogniser and all single and multi-word matches are tagged with special tags which indicate the name class.</Paragraph> <Paragraph position="2"> Lists of names used include:</Paragraph> <Paragraph position="4"> governmental institution names based on an organisation name list which was semi-automatically collected from the MUC-5 answer keys and training corpus (Wall Street</Paragraph> <Paragraph position="6"> province, state, and city names derived from a gazetteer list of about 150,000 place names; * person: about 500 given names taken from a list of given names in the Oxford Advanced Learner's Dictionary (Hornby, 1980); * company designator: 94 designators (e.g.</Paragraph> <Paragraph position="7"> 'Co.', 'PLC'), based on the company designator list provided in the MUC6 reference resources; * human titles: about 160 titles (e.g. 'President', 'Mr.'), manually collected. As well as name phrase matching, another technique is applied at this point. Inside multi-word proper names, certain words may function as trigger words. A trigger word indicates that the tokens surrounding it are probably a proper name and may reliably permit the class or even sub-class 2 of the proper name to be determined. For example, 'Wing and Prayer Airlines' is almost certainly a company, given the presence of the word 'Airlines'. Trigger words are detected by matching against lists of such words and are then specially tagged. Subsequently these tags are used by the proper name parser to build complex proper name constituents.</Paragraph> <Paragraph position="8"> The lists of trigger words are: * Airline company: 3 trigger words for finding airline company names, e.g. 'Airlines'; * Governmental institutions: 7 trigger words for governmental institutions, e.g. 'Ministry'; for word classes such as days of the week and months. 
2 company and governmental institution are sub-classes of the class organisation; airline is a sub-class of company.</Paragraph> <Paragraph position="9"> * Location: 8 trigger words for location names, e.g. 'Gulf'; * Organisation: 135 trigger words for organisation names, e.g. 'Association'.</Paragraph> <Paragraph position="10"> These lists of trigger words were produced by hand, though the organisation trigger word lists were generated semi-automatically by looking at organisation names in the MUC-6 training texts and applying certain heuristics. So, for example, words were collected which come immediately before 'of' in those organisation names which contain 'of', e.g. 'Association' in 'Association of Air Flight Attendants'; the last words of organisation names which do not contain 'of' were examined to find trigger words like 'International'.</Paragraph> </Section> <Section position="2" start_page="419" end_page="421" type="sub_section"> <SectionTitle> 3.2 Grammar rules for proper names </SectionTitle> <Paragraph position="0"> The LaSIE parser is a simple bottom-up chart parser implemented in Prolog. The grammars it processes are unification-style feature-based context free grammars. During parsing, semantic representations of constituents are constructed using Prolog term unification. When parsing ceases, i.e.</Paragraph> <Paragraph position="1"> when the parser can generate no further edges, a 'best parse selection' algorithm is run on the final chart to choose a single analysis. The semantics are then extracted from this analysis and passed on to the discourse interpreter.</Paragraph> <Paragraph position="2"> Parsing takes place in two passes, each using a separate grammar. 
In the first pass a special grammar is used to identify proper names.</Paragraph> <Paragraph position="3"> These constituents are then treated as unanalysable units during the second pass which employs a more general 'sentence' grammar.</Paragraph> <Paragraph position="4"> Proper Name Grammar The grammar rules for proper names constitute a subset of the system's noun phrase (NP) rules. All the rules were produced by hand. There are 177 such rules in total of which 94 are for organisation, 54 for person, 11 for location, and 18 for time expressions. Here are some examples of the proper name grammar rules:</Paragraph> <Paragraph position="6"> The non-terminals LIST_LOC_NP, LIST_ORGAN_NP and CDG_NP are tags assigned to one or more input tokens in the name phrase tagging stage of lexical preprocessing. The non-terminal NNP is the tag for proper name assigned to a single token by the Brill tagger.</Paragraph> <Paragraph position="7"> The rule ORGAN_NP --> NAMES_NP '&amp;' NAMES_NP means that if an as yet unclassified or ambiguous proper name (NAMES_NP) is followed by '&amp;' and another ambiguous proper name, then it is an organisation name. So, for example, 'Marks &amp; Spencer' and 'American Telephone &amp; Telegraph' will be classified as organisation names by this rule. Nearly half of the proper name rules are for organisation names because they may contain further proper names (e.g. person or location names) as well as normal nouns, and their combinations.</Paragraph> <Paragraph position="8"> There are also a good number of rules for person names since care must be taken with given names, family names, titles (e.g. 'Mr.', 'President'), and special lexical items such as 'de' (as in 'J. 
Ignacio Lopez de Arriortua') and 'Jr.', 'II', etc.</Paragraph> <Paragraph position="9"> There are fewer rules for location names, as they are identified mainly in the previous preprocessing stage by look-up in the mini-gazetteer.</Paragraph> <Paragraph position="10"> Sentence Grammar Rules The grammar used for parsing at the sentence level contains approximately 110 rules and was derived automatically from the Penn TreeBank-II (PTB-II) (Marcus et al, 1993), (Marcus et al, 1995). When parsing for a sentence is complete the resultant chart is analysed to identify the 'best parse'. From the best parse the associated semantics are extracted to be passed on to the discourse interpreter. Rules for compositionally constructing semantic representations were assigned by hand to the grammar rules. For simple verbs and nouns the morphological root is used as a predicate name in the semantics, and tense and number features are translated directly into the semantic representation where appropriate. For named entities a token of the most specific type possible (e.g. company or perhaps only object) is created and a name attribute is associated with the entity, the attribute's value being the surface string form of the name. So, for example, assuming 'Ford Motor Co.' has already been classified as a company name, its semantic representation will be something like company(e23) &amp; name(e23,'Ford Motor Co.').</Paragraph> <Paragraph position="11"> 3.3 Discourse interpretation The discourse interpreter module performs two activities that contribute to proper name classification (no further recognition of proper names goes on at this point, only a refining of their classification). 
The first activity is coreference resolution: an unclassified name may be coreferred with a previously classified one, by virtue of which the class of the unclassified name becomes known. The second activity, which is arguably not properly 'discourse interpretation' but nevertheless takes place in this module, is to perform inferences about the semantic types of arguments in certain relations; for example, in compound nominals such as 'Erikson stocks' our semantic interpreter will tell us that there is a qualifier relation between 'Erikson' and 'stocks' and since the system stores the fact that named entities qualifying things of type stock are of type company it can classify the proper name 'Erikson' as a company.</Paragraph> <Paragraph position="12"> Note that both of these techniques make use of external evidence, i.e. rely on information supplied by the context beyond the words in the instance of the proper name being classified.</Paragraph> <Paragraph position="13"> 3.3.1 Proper name coreference Coreference resolution for proper names is carried out in order to recognise alternative forms, especially of organisation names. For example, 'Ford Motor Co.' might be used in a text when the company is first mentioned, but subsequent references are likely to be to 'Ford'. Similarly, 'Creative Artists Agency' might be abbreviated to 'CAA' later on in the same text. Such shortened forms must be resolved as names of the same organisation.</Paragraph> <Paragraph position="14"> In order to determine whether given two proper names match, various heuristics are used. For example, given two names, Name1 and Name2: * if Name2 consists of an initial subsequence of the words in Name1 then Name2 matches Name1, e.g. 'American Airlines Co.' 
and</Paragraph> <Paragraph position="16"> either the first, the family, or both names of Name1, then Name2 matches Name1 e.g.</Paragraph> <Paragraph position="17"> 'John J. Major Jr.' and 'John Major'.</Paragraph> <Paragraph position="18"> There are 31 such heuristic rules for matching organisation names, 11 heuristics for person names, and 3 rules for location names.</Paragraph> <Paragraph position="19"> When an unclassified proper noun is matched with a previously classified proper name in the text, it is marked as a proper name of the class of the known proper name. Thus, when we know 'Ford Motor Co.' is an organisation name but have not classified 'Ford' in the same text, coreference resolution determines 'Ford' to be an organisation name.</Paragraph> <Paragraph position="20"> In the following contexts, semantic type information about the types of arguments in certain relations is used to drive inferences permitting the classification of proper names. The system uses these techniques in a fairly limited and experimental way at present, and there is much room for their extension.</Paragraph> <Paragraph position="21"> * noun-noun qualification: when an unclassified proper name qualifies an organisation-related thing then the name is classified as an organisation; e.g. in 'Erickson stocks' since 'stock' is semantically typed as an organisation-related thing, 'Erickson' gets classified as an organisation name.</Paragraph> <Paragraph position="22"> * possessives: when an unclassified proper name stands in a possessive relation to an organisation post, then the name is classified as an organisation; e.g. 
'vice president of ABC', 'ABC's vice president'.</Paragraph> <Paragraph position="23"> * apposition: when an unclassified proper name is apposed with a known location name, the former name is also classified as a location; e.g. given 'Fort Lauderdale, Fla.' if we know 'Fla.' is a location name, 'Fort Lauderdale' is also classified as a location name.</Paragraph> <Paragraph position="24"> * verbal arguments: when an unclassified proper name names an entity playing a role in a verbal frame where the semantic type of the argument position is known, then the name is classified accordingly; e.g. in 'Smith retired from his position as ...' we can infer that 'Smith' is a person name since the semantic type of the logical subject of 'retire' (in this sense) is person.</Paragraph> </Section> </Section> </Paper>
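<!--
The four semantic type inference contexts listed at the end of section 3.3 (noun-noun qualification, possessives, apposition, verbal arguments) can be sketched roughly as below. This is only an illustrative Python sketch, not the LaSIE implementation (the LaSIE parser is described as implemented in Prolog). The relation labels, the lookup tables, and the function name infer_class are all hypothetical; only the example words ('stock', 'vice president', 'Fla.', 'retire') come from the text.

```python
# Hypothetical knowledge tables; the entries are only the examples from the text.
ORG_RELATED_THINGS = {"stock"}            # things an organisation name may qualify
ORG_POSTS = {"vice president"}            # posts held within an organisation
VERB_ARG_TYPES = {("retire", "subject"): "person"}  # semantic type per verb role


def infer_class(relation, other, known_classes=None):
    """Return the inferred class of an unclassified proper name, or None.

    relation      : hypothetical label for the link between the name and `other`
    other         : the noun, post, name, or (verb, role) pair at the other end
    known_classes : classes already assigned to other names in the text
    """
    known_classes = known_classes or {}
    if relation == "qualifies" and other in ORG_RELATED_THINGS:
        return "organisation"             # noun-noun qualification
    if relation == "possessor_of" and other in ORG_POSTS:
        return "organisation"             # possessives
    if relation == "apposed_to" and known_classes.get(other) == "location":
        return "location"                 # apposition
    if relation == "argument_of":
        return VERB_ARG_TYPES.get(other)  # verbal arguments
    return None


if __name__ == "__main__":
    print(infer_class("qualifies", "stock"))
    print(infer_class("apposed_to", "Fla.", {"Fla.": "location"}))
    print(infer_class("argument_of", ("retire", "subject")))
```

Each rule returns a class only when the context supplies the required evidence; a name with no matching context stays unclassified (None), which mirrors the limited and experimental use of these techniques described in the text.
-->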