File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/05/h05-1056_intro.xml
Size: 2,820 bytes
Last Modified: 2025-10-06 14:02:52
<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1056"> <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 443-450, Vancouver, October 2005. ©2005 Association for Computational Linguistics. Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text</Title> <Section position="3" start_page="443" end_page="443" type="intro"> <SectionTitle> 2 Corpora </SectionTitle> <Paragraph position="0"> Two email corpora used in our experiments were extracted from the CSpace email corpus (Kraut et al., 2004), which contains email messages collected from a management course conducted at Carnegie Mellon University in 1997. In this course, MBA students, organized in teams of four to six members, ran simulated companies in different market scenarios. We believe this corpus to be quite similar to the work-oriented mail of employees of a small or medium-sized company. This text corpus contains three header fields: &quot;From&quot;, &quot;Subject&quot;, and &quot;Time&quot;. Mgmt-Game is a subcorpus consisting of all emails written over a five-day period. In the experiments, the first day's worth of email was used as a training set, the fourth day for tuning, and the fifth day as a test set. Mgmt-Teams forms another split of this data, in which the training set contains messages between teams different from those in the test set; hence, in Mgmt-Teams, the person names appearing in the test set generally differ from those appearing in the training set.</Paragraph> <Paragraph position="1"> The next two collections of email were extracted from the Enron corpus (Klimt and Yang, 2004). The first subset, Enron-Meetings, consists of messages in folders named &quot;meetings&quot; or &quot;calendar&quot;, with two exceptions: (a) six very large files were removed, and (b) one very large &quot;calendar&quot; folder was excluded. Most but not all of these messages are meeting-related. 
The second subset, Enron-Random, was formed by repeatedly sampling a user name (uniformly at random among 158 users), and then sampling an email from that user (uniformly at random).</Paragraph> <Paragraph position="2"> Annotators were instructed to include nicknames and misspelled names, but to exclude person names that are part of an email address and names that are part of a larger entity name, such as an organization or location (e.g., &quot;David Tepper School of Business&quot;). The sizes of the corpora are given in Table 1. We kept the training sets relatively small, reflecting a real-world scenario.</Paragraph> <Paragraph position="3"> The numbers of words and names refer to the whole annotated corpora.</Paragraph> </Section> </Paper>