<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0105"> <Title>Identifying Unknown Proper Names in Newswire Text</Title> <Section position="2" start_page="0" end_page="45" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The identification of unknown proper names in text is a significant challenge for NLP systems operating on unrestricted text. A system which indexes documents according to name references can be useful for information retrieval or as a pre-processor for more knowledge-intensive tasks such as database extraction. With the growing use of tagged corpora in a variety of language-related research areas, being able to reliably tag proper names is an obvious advantage. In addition, the development of practical techniques for name identification helps to shed light on the various uses of proper names in text.</Paragraph> <Paragraph position="1"> Traditional approaches to unknown proper name identification involve, broadly speaking, the lexical lookup of names or name fragments in a name database. For example, approaches such as [Aone et al., 92], [Aberdeen et al., 92], and [Cowie et al., 92] identify person names by marking off phrases which contain unknown words close to known name elements like first or last names, and (in [Cowie et al., 92]) unknown words close to specific title-words. As the above studies show, name databases, such as cross-cultural listings of common first and last names as well as existing geographical gazetteers, are helpful in name recognition. However, approaches based exclusively on unknown words and known name elements can be confused by known common nouns (or other parts of speech) which occur in proper names, even person names. More importantly, such approaches require an initial name element database. Creating such databases can be a labor-intensive task. 
Furthermore, no matter how large a database one constructs manually, the problem still arises of identifying names which don't happen to be present in any given name database. The fact that proper names form, lexically speaking, an open class whose elements grow far more rapidly than those of other open classes, and the fact that they often contain other open-class elements, make the incompleteness of such databases an obvious problem.</Paragraph> <Paragraph position="2"> Our approach aims at deriving proper names and their semantic attributes automatically from large corpora, without relying on any listing of name elements. The overall approach is based on two main ideas. Firstly, we hypothesize that for certain genres of text (for example, Wall Street Journal news stories), new references are introduced by information occurring in the immediate syntactic environment of the proper name. (What the precise set of such genres is remains to be determined, but our initial set includes the most common forms of news stories and excludes literary narratives.) Many of these local contextual clues reflect felicity conventions for introducing new names. New names of people (as well as organization names, and to some extent location names) are generally accompanied by honorifics and various appositive phrases which help anchor the new name reference to mutually assumed knowledge. Further contextual clues come from selectional restrictions; for example, given &quot;Kambomambo murdered Zombaluma&quot; (from [Radford, 88]), the verb is the main clue to the hypothesis that the two names are those of people.</Paragraph> <Paragraph position="3"> Although the idea of exploiting local context to identify semantic attributes in new names is in itself not new (e.g. [Coates-Stephens, 91], [Paik et al., 93]), little attention has been paid in name identification work to the discourse properties of names. 
Our second, and more general, idea is to view proper names as linguistic expressions whose interpretation often depends on the discourse context. For example, in the discourse &quot;U.S. President Bill Clinton....Clinton....Mr. Clinton....President Clinton&quot;, the interpretations of &quot;Clinton&quot;, &quot;Mr. Clinton&quot; and &quot;President Clinton&quot; are dependent on the prior reference to &quot;U.S. President Bill Clinton&quot;, much as &quot;the president&quot;, &quot;he&quot; and &quot;himself&quot; are dependent on prior context in the discourse &quot;U.S. President Bill Clinton_i .... the president_i .... he_i .... himself_i&quot;. The need for text-driven extraction of names presupposes in turn a computational model of discourse which identifies individuals based on the way they are described in the text, instead of relying on their description in a pre-existing knowledge base. The overall discourse representation framework which we use is Luperfoy's three-tiered model [Luperfoy, 91], which in turn is a computational adaptation of Landman's pegs model of NP semantics [Landman, 86].</Paragraph> <Paragraph position="4"> The idea of the three-tiered model is that there are three significant levels of representation: linguistic expressions, Discourse Pegs, and knowledge base objects. A distinctive feature of Discourse Pegs (hereafter referred to as Pegs) as opposed to similar constructs in the literature, like File Cards ([Heim, 81]), Database Objects ([Sidner, 79]), Discourse Referents ([Karttunen, 68]), and Discourse Entities ([Webber, 78], [Dahl and Ball, 90]), is that they describe unique objects with respect to the current discourse, rather than with respect to the underlying belief system or world model. Thus, in an article mentioning Bill Clinton there may be two guises in which he may appear, as Governor Clinton and President Clinton; these would correspond to two distinct pegs. 
It is important to stress that pegs, as a result, do not correspond to equivalence classes of coreferential mentions; rather, there is one peg for each distinct object under discussion, irrespective of the number of entities in the world of reference. Objects which are distinct in the text may still need to be related to each other for their interpretation; for example, in the discourse &quot;President Bill Clinton... the Clintons....Hilary&quot;, the expressions &quot;President Bill Clinton&quot;, &quot;the Clintons&quot; and &quot;Hilary&quot; each introduce new pegs, but these pegs are each linked, as &quot;partial dependents&quot;, to the previous one. An interesting subcase of this involves name mergers, e.g. an article describing a joint venture between two companies may use the two individual company names followed by a merged name for the joint venture.</Paragraph> <Paragraph position="5"> In applying this framework to the unknown name problem, we first distinguish three types of entities: (i) Mentions - these are text segments which are tokens of proper names in text; (ii) Contexts - these are text segments which provide information about syntactic and semantic properties associated with a name; and (iii) Hypotheses - these are hypotheses about individuals and their semantic attributes, associated with a Mention.</Paragraph> <Paragraph position="6"> Given this framework, the goal of unknown name identification is to use the text itself to generate Hypotheses about possible individuals distinguished by a Mention. In a given text context, descriptions from earlier Mentions of a name may be further specified by new information associated with subsequent Mentions of the name (which may take a somewhat different form from previous Mentions). In general, two Hypotheses, each associated with a different Mention, are linked together (by means of a common Peg) whenever they are mutually compatible. 
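As a rough illustration of the linking rule just stated, the following Python sketch anchors each new Mention to every existing Peg whose accumulated description is compatible with the Mention's Hypothesis, and allocates a new Peg otherwise. The attribute-dictionary representation, the names Mention, Peg, compatible and coreference, and the compatibility test itself are illustrative assumptions introduced here; they are not the procedure described in Section 4.

```python
from dataclasses import dataclass, field


@dataclass
class Mention:
    """A token of a proper name in the text (hypothetical representation)."""
    text: str
    # Hypothesis about the individual: attribute name -> value (assumed encoding).
    hypothesis: dict = field(default_factory=dict)


@dataclass
class Peg:
    """A discourse-level object to which compatible mentions are anchored."""
    attributes: dict = field(default_factory=dict)
    mentions: list = field(default_factory=list)


def compatible(a: dict, b: dict) -> bool:
    """Assumed test: no attribute present in both descriptions conflicts."""
    return all(b[key] == value for key, value in a.items() if key in b)


def coreference(mention: Mention, pegs: list) -> list:
    """Anchor the mention to every compatible peg, or allocate a new peg."""
    anchored = [peg for peg in pegs if compatible(peg.attributes, mention.hypothesis)]
    if not anchored:
        new_peg = Peg()
        pegs.append(new_peg)
        anchored = [new_peg]
    for peg in anchored:
        # A later mention may further specify the description held by the peg.
        peg.attributes.update(mention.hypothesis)
        peg.mentions.append(mention)
    return anchored


# Usage: the two guises of Clinton conflict on "title" and so get distinct pegs.
pegs = []
coreference(Mention("Governor Clinton",
                    {"type": "person", "surname": "Clinton", "title": "Governor"}), pegs)
coreference(Mention("President Clinton",
                    {"type": "person", "surname": "Clinton", "title": "President"}), pegs)
# An underspecified mention is compatible with both existing pegs.
anchors = coreference(Mention("Mr. Clinton",
                              {"type": "person", "surname": "Clinton"}), pegs)
```

Under these assumptions the underspecified mention is coanchored to both existing pegs instead of being forced onto one of them, which mirrors the possibility of coanchoring a mention to one or more pegs.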
Thus, two Mentions, Mention 1 and Mention 2, can be considered to be indirectly anchored together to a common Peg whenever hypothetical information associated with each is mutually compatible. For ease of presentation, we may speak of these coanchored mentions as &quot;coreferential&quot; (when what we really mean is this more specific sense of coanchoring); also, we will use the capitalized word &quot;Coreference&quot; for the process of computing pegs for a mention, a process which may result in either the coanchoring of the mention to one or more existing pegs, or the allocation of a new peg.</Paragraph> <Paragraph position="7"> We describe the Coreference process in more detail in Section 4.</Paragraph> </Section> </Paper>