<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1056">
<Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 443-450, Vancouver, October 2005. ©2005 Association for Computational Linguistics. Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text</Title>
<Section position="7" start_page="449" end_page="449" type="concl">
<SectionTitle> 6 Conclusion </SectionTitle>
<Paragraph position="0"> This work applies recently developed sequential learning methods to the task of extracting named entities from email. This problem is of interest as an example of NER from informal text: text that has been prepared quickly for a narrow audience.</Paragraph>
<Paragraph position="1"> We showed that informal text has characteristics that differ from those of formal text such as newswire. Analysis of the highly weighted features selected by the learners showed that names in informal text have different (and less informative) types of contextual evidence. However, email also has structural regularities that make it easier to extract personal names. We presented a detailed description of a set of features that address these regularities and significantly improve extraction performance on email.</Paragraph>
<Paragraph position="2"> In the second part of this paper, we analyzed the way in which names repeat in different types of corpora. We showed that repetitions within a single document are more common in newswire text, and that repetitions spanning multiple documents are more common in email corpora.
Additional analysis confirms that the potential gains in recall from exploiting multiple-document repetition are much higher than the potential gains from exploiting single-document repetition.</Paragraph>
<Paragraph position="3"> Based on this insight, we introduced a simple and effective method for exploiting multiple-document repetition to improve an extractor. One drawback of this recall-enhancing approach is that it requires the entire test set to be available; however, our test sets are of only moderate size (83 to 264 documents), and a similarly sized sample of unlabeled data would likely be available in many practical applications. The approach substantially improves recall and often improves F1 performance; furthermore, it can easily be used with any NER method.</Paragraph>
<Paragraph position="4"> Taken together, these techniques substantially improve extraction performance. The improvements appear strongest for email corpora collected from closely interacting groups. On the Mgmt-Teams dataset, which was designed to reduce the value of memorizing specific names that appear in the training set, F1 performance improves from 68.1% for the out-of-the-box system (or 82.0% for the dictionary-augmented system) to 91.3%. On the less difficult Mgmt-Game dataset, F1 performance improves from 79.2% for an out-of-the-box CRF-based NER system (or 90.7% for a CRF-based system that uses several large dictionaries) to 95.4%.</Paragraph>
<Paragraph position="5"> As future work, experiments should be expanded to include additional entity types and other types of informal text, such as blogs and forum postings.</Paragraph>
</Section>
</Paper>
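Editor's note: the recall-enhancing pass described in the conclusion (propagating names found in one document to their unlabeled occurrences elsewhere in the test set) can be sketched as follows. This is a minimal illustration of the general idea only, not the paper's implementation; all function and variable names here are hypothetical.

```python
def recall_enhancing_pass(docs, predictions):
    """Sketch of a multiple-document repetition pass.

    docs: list of documents, each a list of tokens.
    predictions: parallel per-token labels ("NAME" or "O") from a
    base NER system. Returns relabeled predictions in which any span
    predicted as a name anywhere is also marked at its other
    occurrences across the test set.
    """
    # Step 1: collect every token sequence the base extractor
    # labeled as a name, case-normalized.
    known_names = set()
    for tokens, labels in zip(docs, predictions):
        i = 0
        while i < len(tokens):
            if labels[i] == "NAME":
                j = i
                while j < len(tokens) and labels[j] == "NAME":
                    j += 1
                known_names.add(tuple(t.lower() for t in tokens[i:j]))
                i = j
            else:
                i += 1

    # Step 2: relabel unmarked occurrences of any known name,
    # preferring the longest match at each position.
    max_len = max((len(n) for n in known_names), default=0)
    enhanced = [list(labels) for labels in predictions]
    for d, tokens in enumerate(docs):
        for i in range(len(tokens)):
            for n in range(max_len, 0, -1):
                span = tuple(t.lower() for t in tokens[i:i + n])
                if len(span) == n and span in known_names:
                    for k in range(i, i + n):
                        enhanced[d][k] = "NAME"
                    break
    return enhanced
```

As the conclusion notes, a pass of this shape only adds name labels, which is why it improves recall (and often F1) but requires the whole test collection, or a comparable unlabeled sample, to be available at once.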