File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/w06-3801_intro.xml

Size: 6,045 bytes

Last Modified: 2025-10-06 14:04:15

<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3801">
  <Title>A Graphical Framework for Contextual Search and Name Disambiguation in Email</Title>
  <Section position="2" start_page="0" end_page="1" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Many tasks in information retrieval can be performed by clever application of textual similarity metrics. In particular, the canonical IR problem of ad hoc retrieval is often formulated as the task of finding documents &quot;similar to&quot; a query. In modern IR settings, however, documents are usually not isolated objects: instead, they are frequently connected to other objects, via hyperlinks or meta-data. (An email message, for instance, is connected via header information to other emails in the same thread and also to the recipient's social network.) Thus it is important to understand how text-based document similarity measures can be extended to documents embedded in complex structural settings.</Paragraph>
    <Paragraph position="1"> Our similarity metric is based on a lazy graph walk, and is closely related to the well-known PageRank algorithm (Page et al., 1998). PageRank and its variants are based on a graph walk of infinite length with random resets. In a lazy graph walk, there is a fixed probability of halting the walk at each step. In previous work (Toutanova et al., 2004), lazy walks over graphs were used for estimating word dependency distributions: in this case, the graph was one constructed especially for this task, and the edges in the graph represented different flavors of word-to-word similarity. Other recent papers have also used walks over graphs for query expansion (Xi et al., 2005; Collins-Thompson and Callan, 2005).</Paragraph>
    <Paragraph position="2"> In these tasks, the walk propagates similarity to a start node through edges in the graph--incidentally accumulating evidence of similarity over multiple connecting paths.</Paragraph>
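    <Paragraph> The lazy walk described above can be illustrated with a minimal sketch. It is not the formulation used in this paper: it assumes uniform transition probabilities over outgoing edges, a truncation after a fixed number of steps, and a toy graph, and the names lazy_walk, halt_prob and max_steps are hypothetical. It shows how a node reachable over several paths accumulates probability mass from each of them.

```python
from collections import defaultdict


def lazy_walk(edges, start, halt_prob=0.5, max_steps=10):
    """Return the distribution over nodes at which a walk from `start` halts.

    edges maps each node to a list of neighbours; transitions are uniform
    over that list (an assumption made only for this sketch).
    """
    walking = {start: 1.0}          # probability mass that is still moving
    halted = defaultdict(float)     # probability mass that stopped at a node
    for _ in range(max_steps):
        moved = defaultdict(float)
        for node, mass in walking.items():
            # At every step the walk halts with a fixed probability.
            halted[node] += halt_prob * mass
            neighbours = edges.get(node, [])
            if not neighbours:
                # Dead end: the remaining mass has nowhere to go and halts.
                halted[node] += (1.0 - halt_prob) * mass
                continue
            share = (1.0 - halt_prob) * mass / len(neighbours)
            for neighbour in neighbours:
                moved[neighbour] += share
        walking = moved
    # Truncation: any mass still moving after max_steps halts where it is.
    for node, mass in walking.items():
        halted[node] += mass
    return dict(halted)


# Toy graph (hypothetical): node "d" is reachable from "a" via "b" and via "c".
toy_edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
scores = lazy_walk(toy_edges, "a")
```

In this toy graph, the score of "d" is the sum of the mass arriving through "b" and through "c", which is the sense in which evidence accumulates over multiple connecting paths.</Paragraph>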
    <Paragraph position="3"> In contrast to this previous work, we consider schemes for propagating similarity across a graph that naturally models a structured dataset like an email corpus: entities correspond to objects including email addresses and dates (as well as the usual types of documents and terms), and edges correspond to relations like sent-by. We view the similarity metric as a tool for performing search across this structured dataset, in which related entities that are not directly similar to a query can be reached via a multi-step graph walk.</Paragraph>
    <Paragraph position="4"> In this paper, we formulate and evaluate this extended similarity metric. The principal problem we consider is disambiguating personal names in email, which we formulate as the task of retrieving the person most related to a particular name mention. We show that for this task, the graph-based approach improves substantially over plausible baselines. After retrieval, learning can be used to adjust the ranking of retrieved names based on the edges in the paths traversed to find these names, which leads to an additional performance improvement. Name disambiguation is a particular application of the suggested general framework, which is also applicable to any real-world setting in which structural data is available as well as text.</Paragraph>
    <Paragraph position="5"> This paper proceeds as follows. Sections 2 and 3 formalize the general framework and its instantiation for email. Section 4 gives a short summary of the learning approach. Section 5 includes experimental evaluation, describing the corpora and results for the person name disambiguation task. The paper concludes with a review of related work, summary and future directions.</Paragraph>
    <Paragraph position="6"> 2 Email as a Graph
A graph G consists of a set of nodes and a set of labeled directed edges. Nodes will be denoted by letters like x, y, or z, and we will denote an edge from x to y with label $\ell$ as $x \xrightarrow{\ell} y$. Every node x has a type, denoted T(x), and we will assume that there is a fixed set of possible types. We will assume for convenience that there are no edges from a node to itself (this assumption can be easily relaxed).
We will use these graphs to represent real-world data. Each node represents some real-world entity, and each edge $x \xrightarrow{\ell} y$ asserts that some binary relation $\ell(x, y)$ holds. The entity types used here to represent an email corpus are shown in the left-most column of Table 1. They include the traditional types in information retrieval systems, namely file and term. In addition, however, they include the types person, email-address and date. These entities are constructed from a collection of email messages in the obvious way: for example, a recipient of &quot;Einat Minkov &lt;einat@cs.cmu.edu&gt;&quot; indicates the existence of a person node &quot;Einat Minkov&quot; and an email-address node &quot;einat@cs.cmu.edu&quot;. (We assume here that person names are unique identifiers.)
The graph edges are directed. We will assume that edge labels determine the source and target node types: i.e., if $x \xrightarrow{\ell} z$ and $w \xrightarrow{\ell} y$ then T(w) = T(x) and T(y) = T(z). However, multiple relations can hold between any particular pair of node types: for instance, it could be that $x \xrightarrow{\ell} y$ or $x \xrightarrow{\ell'} y$, where $\ell \neq \ell'$. (For instance, an email message x could be sent-from y, or sent-to y.) Note also that edges need not denote functional relations: for a given x and $\ell$, there may be many distinct nodes y such that $x \xrightarrow{\ell} y$. For instance, for a file x, there are many distinct terms y such that $x \xrightarrow{\textit{has-term}} y$ holds.
In representing email, we also create an inverse label $\ell^{-1}$ for each edge label (relation) $\ell$. Note that this means that the graph will definitely be cyclic.</Paragraph>
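    <Paragraph> The graph definition above can be sketched as a small data structure. This is an illustration under stated assumptions, not the implementation used in this work: the class TypedGraph, the helper add_message, the "-inv" suffix for inverse labels, and the edge label has-address are all hypothetical, while the node types and the labels sent-to and has-term come from the text. Recipients are passed as (name, address) pairs, so no header parsing is shown.

```python
from collections import defaultdict


class TypedGraph:
    """Directed graph with typed nodes and labeled edges (illustrative only)."""

    def __init__(self):
        self.node_type = {}                  # node -> its type T(x)
        self.signature = {}                  # label -> (source type, target type)
        self.out_edges = defaultdict(set)    # (source, label) -> target nodes

    def add_node(self, node, node_type):
        self.node_type[node] = node_type

    def add_edge(self, source, label, target):
        if source == target:
            raise ValueError("self-loops are excluded by assumption")
        sig = (self.node_type[source], self.node_type[target])
        # An edge label must always connect the same pair of node types.
        if self.signature.setdefault(label, sig) != sig:
            raise TypeError("label %r used with inconsistent node types" % label)
        self.out_edges[(source, label)].add(target)
        # Every relation also gets an inverse edge, so the graph has cycles.
        inverse = label + "-inv"             # hypothetical naming scheme
        self.signature[inverse] = (sig[1], sig[0])
        self.out_edges[(target, inverse)].add(source)

    def targets(self, source, label):
        """All nodes y such that an edge with this label runs from source to y."""
        return self.out_edges.get((source, label), set())


def add_message(graph, file_id, recipients, terms):
    """Add one message; recipients is a list of (name, address) pairs."""
    graph.add_node(file_id, "file")
    for name, address in recipients:
        graph.add_node(name, "person")
        graph.add_node(address, "email-address")
        graph.add_edge(file_id, "sent-to", name)
        graph.add_edge(name, "has-address", address)   # hypothetical label
    for term in terms:
        graph.add_node(term, "term")
        graph.add_edge(file_id, "has-term", term)


graph = TypedGraph()
add_message(graph, "msg1", [("Einat Minkov", "einat@cs.cmu.edu")], ["graph", "walk"])
```

Here graph.targets("msg1", "has-term") returns both terms, which illustrates a non-functional relation, and graph.targets("Einat Minkov", "sent-to-inv") leads back to the message through the inverse edge.</Paragraph>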
    <Paragraph position="7"> Table 1 gives the full set of relations used in our email representation scheme.</Paragraph>
  </Section>
</Paper>