<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3801">
  <Title>A Graphical Framework for Contextual Search and Name Disambiguation in Email</Title>
  <Section position="5" start_page="3" end_page="6" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> We experiment with three separate corpora.</Paragraph>
    <Paragraph position="1"> The Cspace corpus contains email messages collected from a management course conducted at Carnegie Mellon University in 1997 (Minkov et al., 2005). In this course, MBA students, organized in teams of four to six members, ran simulated companies in different market scenarios. The corpus we used here includes the emails of all teams over a period of four days. The Enron corpus is a collection of mail from the Enron corpus that has been made available for the research community (Klimt and Yang, 2004). Here, we used the saved email of two different users.2 To eliminate spam and news postings we removed email files sent from email addresses with suffix &amp;quot;.com&amp;quot; that are not Enron's; widely distributed email files (sent from &amp;quot;enron.announcement&amp;quot;, to &amp;quot;all.employees@enron.com&amp;quot; etc.). Text from forwarded messages, or replied-to messages were also removed from the corpus.</Paragraph>
    <Paragraph position="2"> Table 2 gives the size of each processed corpus, and the number of nodes in the graph representation of it. In deriving terms for the graph, terms were Porter-stemmed and stop words were removed. The processed Enron corpora are available from the first author's home page.</Paragraph>
    <Section position="1" start_page="3" end_page="6" type="sub_section">
      <SectionTitle>
5.1 Person Name Disambiguation
</SectionTitle>
      <Paragraph position="0"> Consider an email message containing a common name like &amp;quot;Andrew&amp;quot;. Ideally an intelligent mailer would, like the user, understand which person &amp;quot;Andrew&amp;quot; refers to, and would rapidly perform tasks like retrieving Andrew's prefered email address or home page. Resolving the referent of a person name is also an important complement to the ability to perform named entity extraction for tasks like social network analysis or studies of social interaction in email.</Paragraph>
      <Paragraph position="1"> 2Specifially, we used the &amp;quot;all documents&amp;quot; folder, including both incoming and outgoing files.</Paragraph>
      <Paragraph position="2">  However, although the referent of the name is unambiguous to the recipient of the email, it can be non-trivial for an automated system to find out which &amp;quot;Andrew&amp;quot; is indicated. Automatically determining that &amp;quot;Andrew&amp;quot; refers to &amp;quot;Andrew Y. Ng&amp;quot; and not &amp;quot;Andrew McCallum&amp;quot; (for instance) is especially difficult when an informal nickname is used, or when the mentioned person does not appear in the email header. As noted above, we model this problem as a search task: based on a name-mention in an email message m, we formulate query distribution Vq, and then retrieve a ranked list of person nodes.</Paragraph>
      <Paragraph position="3">  Unfortunately, building a corpus for evaluating this task is non-trivial, because (if trivial cases are eliminated) determining a name's referent is often non-trivial for a human other than the intended recipient. We evaluated this task using three labeled datasets, as detailed in Table 2.</Paragraph>
      <Paragraph position="4"> The Cspace corpus has been manually annotated with personal names (Minkov et al., 2005). Additionally, with the corpus, there is a great deal of information available about the composition of the individual teams, the way the teams interact, and the full names of the team members. Using this extra information it is possible to manually resolve name mentions. We collected 106 cases in which single-token names were mentioned in the the body of a message but did not match any name from the header. Instances for which there was not sufficient information to determine a unique person entity were excluded from the example set. In addition to names that refer to people that are simply not in the header, the names in this corpus include people that are in the email header, but cannot be matched because they are referred to using: initials-this is commonly done in the sign-off to an email; nicknames, including common nicknames (e.g., &amp;quot;Dave&amp;quot; for &amp;quot;David&amp;quot;), unusual nicknames (e.g., &amp;quot;Kai&amp;quot; for &amp;quot;Keiko&amp;quot;); or American names adopted in place of a foreign name (e.g., &amp;quot;Jenny&amp;quot; for &amp;quot;Qing&amp;quot;). For Enron, two datasets were generated automatically. We collected name mentions which correspond uniquely a names that is in the email &amp;quot;Cc&amp;quot; header line; then, to simulate a non-trivial matching task, we eliminate the collected person name from the email header. We also used a small dictionary of 16 common American nicknames to identify nicknames that mapped uniquely to full person names on the &amp;quot;Cc&amp;quot; header line.</Paragraph>
      <Paragraph position="5"> For each dataset, some examples were picked randomly and set aside for learning and evaluation purposes. null initials nicknames other  All of the methods applied generate a ranked list of person nodes, and there is exactly one correct answer per example.3 Figure 1 gives results4 for two of the datasets as a function of recall at rank k, up to rank 10. Table 4 shows the mean average precision (MAP) of the ranked lists as well as accuracy, which we define as the percentage of correct answers at rank 1 (i.e., precision at rank 1.)  To our knowledge, there are no previously reported experiments for this task on email data. As a baseline, we apply a reasonably sophisticated string matching method (Cohen et al., 2003). Each name mention in question is matched against all of the per-son names in the corpus. The similarity score between the name term and a person name is calculated as the maximal Jaro similarity score (Cohen et al., 2003) between the term and any single token of the personal name (ranging between 0 to 1). In addition, we incorporate a nickname dictionary5, such that if the name term is a known nickname of a name, the similarity score of that pair is set to 1.</Paragraph>
      <Paragraph position="6"> The results are shown in Figure 1 and Table 4. As can be seen, the baseline approach is substantially less effective for the more informal Cspace dataset.</Paragraph>
      <Paragraph position="7"> Recall that the Cspace corpus includes many cases such as initials, and also nicknames that have no literal resemblance to the person's name (section  5.1.2), which are not handled well by the string similarity approach. For the Enron datasets, the base-line approach perfoms generally better (Table 4). In all the corpora there are many ambiguous instances, e.g., common names like &amp;quot;Dave&amp;quot; or &amp;quot;Andy&amp;quot; that match many people with equal strength.</Paragraph>
      <Paragraph position="8">  We perform two variants of graph walk, corresponding to different methods of forming the query distribution Vq. Unless otherwise stated, we will use a uniform weighting of labels--i.e., thlscript,T = 1/ST ; g = 1/2; and a walk of length 2.</Paragraph>
      <Paragraph position="9"> In the first variant, we concentrate all the probability in the query distribution on the name term. The column labeled term gives the results of the graph walk from this probability vector. Intuitively, using this variant, the name term propagates its weight to the files in which it appears. Then, weight is propagated to person nodes which co-occur frequently with these files. Note that in our graph scheme there is a direct path between terms to per-son names, so that they recieve weight as well.</Paragraph>
      <Paragraph position="10"> As can be seen in the results, this leads to very effective performance: e.g., it leads to 61.3% vs.</Paragraph>
      <Paragraph position="11"> 41.3% accuracy for the baseline approach on the CSpace dataset. However, it does not handle ambiguous terms as well as one would like, as the query does not include any information of the context in which the name occurred: the top-ranked answer for ambiguous name terms (e.g., &amp;quot;Dave&amp;quot;) will always be the same person. To solve this problem, we also used a file+term walk, in which the query Vq gives equal weight to the name term node and the file in which it appears.</Paragraph>
      <Paragraph position="12"> We found that adding the file node to Vq provides useful context for ambiguous instances--e.g., the correct &amp;quot;David&amp;quot; would in general be ranked higher than other persons with this same name. On the other hand, though, adding the file node reduces the the contribution of the term node. Although the MAP and accuracy are decreased, file+term has better performance than term at higher recall levels, as can be seen in Figure 1.</Paragraph>
      <Paragraph position="13"> 5.2.4 Reranking the output of a walk We now examine reranking as a technique for improving the results. After some preliminary experimentation, we adopted the following types of features f for a node x. The set of features are fairly generic. Edge unigram features indicate, for each edge label lscript, whether lscript was used in reaching x from Vq. Edge bigram features indicate, for each pair of edge labels lscript1, lscript2, whether lscript1 and lscript2 were used (in that order) in reaching x from Vq. Top edge bigram features are similar but indicate if lscript1,lscript2 were used in one of the two highest-scoring paths between Vq and x (where the &amp;quot;score&amp;quot; of a path is the product of Pr(y lscript[?]- z) for all edges in the path.) We believe that these features could all be computed using dynamic programming methods. Currently, however, we compute features by using a method we call path unfolding, which is similar to the back-propagation through time algorithm (Haykin, 1994; Diligenti et al., 2005) used in training recurrent neural networks. Graph unfolding is based on a backward breadth-first visit of the graph, starting at the target node at time step k, and expanding the unfolded paths by one layer per each time step. This procedure is more expensive, but offers more flexibility in choosing alternative features, and was useful in determining an optimal feature set.</Paragraph>
      <Paragraph position="14"> In addition, we used for this task some additional problem-specific features. One feature indicates whether the set of paths leading to a node originate from one or two nodes in Vq. (We conjecture that in the file+term walk, nodes are connected to both the source term and file nodes are more relevant comparing to nodes that are reached from the file node or term node only.) We also form features that indicate whether the given term is a nickname of the person name, per the nicknames dictionary; and whether the Jaro similarity score between the term and the person name is above 0.8. This information is similar to that used by the baseline method.</Paragraph>
      <Paragraph position="15"> The results (for the test set, after training on the train set) are shown in Table 4 and (for two representative cases) Figure 1. In each case the top 10 nodes were reranked. Reranking substantially improves performance, especially for the file+term walk. The accuracy rate is higher than 75% across all datasets.</Paragraph>
      <Paragraph position="16"> The features that were assigned the highest weights by the re-ranker were the literal similarity features and the source count feature.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>