File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/99/w99-0202_abstr.xml

Size: 6,148 bytes

Last Modified: 2025-10-06 13:49:50

<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0202">
  <Title>Is Hillary Rodham Clinton the President? Disambiguating Names across Documents Yael RAVIN</Title>
  <Section position="2" start_page="0" end_page="9" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> A number of research and software development groups have developed name identification technology, but few have addressed the issue of cross-document coreference, or identifying the same named entities across documents. In a collection of documents, where there are multiple discourse contexts, there exists a many-to-many correspondence between names and entities, making it a challenge to automatically map them correctly. Recently, Bagga and Baldwin proposed a method for determining whether two names refer to the same entity by measuring the similarity between the document contexts in which they appear. Inspired by their approach, we have revisited our current cross-document coreference heuristics that make relatively simple decisions based on matching strings and entity types. We have devised an improved and promising algorithm, which we discuss in this paper.</Paragraph>
    <Paragraph position="1"> Introduction The need to identify and extract important concepts in online text documents is by now commonly acknowledged by researchers and practitioners in the fields of information retrieval, knowledge management and digital libraries. It is a necessary first step towards achieving a reduction in the ever-increasing volumes of online text. In this paper we focus on the identification of one kind of concept - names and the entities they refer to.</Paragraph>
    <Paragraph position="2"> There are several challenging aspects to the identification of names: identifying the text strings (words or phrases) that express names; relating names to the entities discussed in the document; and relating named entities across documents. In relating names to entities, the main difficulty is the many-to-many mapping between them. A single entity can be referred to by several name variants: Ford Motor Company, Ford Motor Co., or simply Ford. A single variant often names several entities: Ford refers to the car company, but also to a place (Ford, Michigan) as well as to several people: President Gerald Ford, Senator Wendell Ford, and others.</Paragraph>
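The many-to-many mapping described above can be made concrete with two indexes, one from variants to entities and one from entities to variants. The sketch below is only an illustration built from the Ford examples in the text; the entity labels (such as "ORG:Ford Motor Company") and the function name build_indexes are hypothetical and do not come from the paper.

```python
from collections import defaultdict

# Illustrative (variant, entity) pairs taken from the examples in the text.
# The entity labels are hypothetical identifiers, not the authors' data.
PAIRS = [
    ("Ford Motor Company", "ORG:Ford Motor Company"),
    ("Ford Motor Co.", "ORG:Ford Motor Company"),
    ("Ford", "ORG:Ford Motor Company"),
    ("Ford", "PLACE:Ford, Michigan"),
    ("Ford", "PERSON:Gerald Ford"),
    ("Ford", "PERSON:Wendell Ford"),
]


def build_indexes(pairs):
    """Return (variant -> set of entities, entity -> set of variants)."""
    variant_to_entities = defaultdict(set)
    entity_to_variants = defaultdict(set)
    for variant, entity in pairs:
        variant_to_entities[variant].add(entity)
        entity_to_variants[entity].add(variant)
    return variant_to_entities, entity_to_variants


variant_to_entities, entity_to_variants = build_indexes(PAIRS)
```

Looking up "Ford" in the first index returns every candidate entity, which is exactly the ambiguity a cross-document step has to resolve; looking up the company in the second index returns its three variants.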
    <Paragraph position="3"> Context is crucial in identifying the intended mapping. A document usually defines a single context, in which it is quite unlikely to find several entities corresponding to the same variant. For example, if the document talks about the car company, it is unlikely to also discuss Gerald Ford. Thus, within documents, the problem is usually reduced to a many-to-one mapping between several variants and a single entity. In the few cases where multiple entities in the document may potentially share a name variant, the problem is addressed by careful editors, who refrain from using ambiguous variants. If Henry Ford, for example, is mentioned in the context of the car company, he will most likely be referred to by the unambiguous Mr. Ford.</Paragraph>
    <Paragraph position="4"> Much recent work has been devoted to the identification of names within documents and to linking names to entities within the document.</Paragraph>
    <Paragraph position="5"> Several research groups [DAR95, DAR98], as well as a few commercial software packages [NetOwl97], have developed name identification technology (among them our own research group, whose technology is now embedded in IBM's Intelligent Miner for Text [IBM99]). In contrast, few have investigated named entities across documents.</Paragraph>
    <Paragraph position="6"> In a collection of documents, there are multiple contexts; variants may or may not refer to the same entity; and ambiguity is a much greater problem.</Paragraph>
    <Paragraph position="7"> Cross-document coreference was briefly considered as a task for the Sixth Message Understanding Conference but then discarded as being too difficult [DAR95].</Paragraph>
    <Paragraph position="8"> Recently, Bagga and Baldwin [BB98] proposed a method for determining whether two names (mostly of people) or events refer to the same entity by measuring the similarity between the document contexts in which they appear.</Paragraph>
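As a rough illustration of what measuring the similarity between document contexts can look like, the sketch below compares two contexts as bags of words using cosine similarity. This is an assumed stand-in, since the text above does not specify the similarity measure; the tokenizer, the function names, and the 0.3 threshold are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_entity(context_a, context_b, threshold=0.3):
    """Guess coreference when the contexts are similar enough.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    score = cosine_similarity(bag_of_words(context_a), bag_of_words(context_b))
    return score >= threshold
```

With this sketch, two short contexts about Ford trucks and auto sales score above the threshold, while a context about Senator Ford and a budget bill scores below it against either of them. A real system would at least remove stop words and weight terms, which this example omits for brevity.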
    <Paragraph position="9"> Inspired by their approach, we have revisited our current cross-document coreference heuristics and have devised an improved algorithm that seems promising. In contrast to the approach in [BB98], our algorithm capitalizes on the careful intra-document name recognition we have developed. To minimize the processing cost involved in comparing contexts, we define compatible names -- groups of names that are good candidates for coreference -- and compare their internal structures first, to decide whether they corefer. Only then, if needed, do we apply our own version of context comparisons, reusing a tool -- the Context Thesaurus -- which we have developed independently, as part of an application to assist users in querying a collection of documents.</Paragraph>
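The ordering of the steps in the preceding paragraph (compatibility check, then internal structure, then context comparison only if needed) can be sketched as a small decision cascade. Everything concrete here is an assumption made for illustration: the compatibility test (same final token), the structural rules, and the compare_contexts callback are hypothetical and are not the criteria defined later in the paper.

```python
def name_tokens(name):
    """Split a name into lowercase tokens, ignoring periods (e.g. 'Co.')."""
    return name.replace(".", "").lower().split()


def compatible(name_a, name_b):
    """Hypothetical compatibility test: both names end in the same token."""
    a, b = name_tokens(name_a), name_tokens(name_b)
    if not a or not b:
        return False
    return a[-1] == b[-1]


def structural_decision(name_a, name_b):
    """Return True or False when structure alone decides, else None.

    Assumed rules: identical token lists corefer; equally long but
    different token lists (e.g. different first names) do not; a shorter
    versus a longer name is left undecided.
    """
    a, b = name_tokens(name_a), name_tokens(name_b)
    if a == b:
        return True
    if len(a) == len(b):
        return False
    return None


def corefer(name_a, name_b, context_a, context_b, compare_contexts):
    """Cheap checks first; call the costly context comparison only if needed."""
    if not compatible(name_a, name_b):
        return False
    decision = structural_decision(name_a, name_b)
    if decision is not None:
        return decision
    return compare_contexts(context_a, context_b)
```

Under these assumed rules, "Gerald Ford" and "Wendell Ford" are separated without looking at any context, while "Ford" and "Gerald Ford" fall through to compare_contexts, which is where a context-based comparison would be plugged in.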
    <Paragraph position="10"> Cross-document coreference depends heavily on the results of intra-document coreference, a process which we describe in Section 1. In Section 2 we discuss our current cross-document coreference. One of our challenges is to recognize that some &quot;names&quot; we identify are not valid, in that they do not have a single referent. Rather, they form combinations of component names. In Section 3 we describe our algorithm for splitting these combinations. Another cross-document challenge is to merge different names. Our intra-document analysis stipulates more names than there are entities mentioned in the collection. In Sections 4-5 we discuss how we merge these distinct but coreferent names across documents. Section 4 defines compatible names and how their internal structure determines coreference. Section 5 describes the Context Thesaurus and its use to compare contexts in which names occur. Section 6 describes preliminary results and future work.</Paragraph>
  </Section>
</Paper>