<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2919"> <Title>A Context Pattern Induction Method for Named Entity Extraction</Title> <Section position="3" start_page="0" end_page="141" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Partial entity lists and massive amounts of unlabeled data are becoming available with the growth of the Web as well as the increased availability of specialized corpora and entity lists. For example, the primary public resource for biomedical research, MEDLINE, contains over 13 million entries and is growing at an accelerating rate. Combined with these large corpora, the recent availability of entity lists in those domains has opened up interesting opportunities and challenges. Such lists are never complete and suffer from sampling biases, but we would like to exploit them, in combination with large unlabeled corpora, to speed up the creation of information extraction systems for different domains and languages. In this paper, we concentrate on exploring the utility of such resources for named entity extraction. Currently available entity lists contain only a small fraction of named entities; orders of magnitude more are present in the unlabeled data. In this paper, we test the following hypotheses: i. Starting with a few seed entities, it is possible to induce high-precision context patterns by exploiting entity context redundancy.</Paragraph> <Paragraph position="1"> ii. New entity instances of the same category can be extracted from unlabeled data with the induced patterns to create high-precision extensions of the seed lists.</Paragraph> <Paragraph position="2"> iii. Features derived from token membership in the extended lists improve the accuracy of learned named-entity taggers.</Paragraph> <Paragraph position="3"> Previous approaches to context pattern induction were described by Riloff and Jones (1999), Agichtein and Gravano (2000), Thelen and Riloff (2002), Lin et al.
(2003), and Etzioni et al. (2005), among others. The main advance in the present method is the combination of grammatical induction and statistical techniques to create high-precision patterns.</Paragraph> <Paragraph position="4"> The paper is organized as follows. Section 2 describes our pattern induction algorithm. Section 3 shows how to extend seed sets with entities extracted by the patterns from unlabeled data. Section 4 gives experimental results, and Section 5 compares our method with previous work.</Paragraph> </Section> </Paper>