<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1060"> <Title>Multi-Field Information Extraction and Cross-Document Fusion</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Much recent statistical information extraction research has applied graphical models to extract information from one particular document after training on a large corpus of annotated data (Leek, 1997; Freitag and McCallum, 1999).1 Such systems are widely applicable, yet there remain many information extraction tasks that are not readily amenable to these methods. Annotated data required for training statistical extraction systems is sometimes unavailable, while there are examples of the desired information. Further, the goal may be to find a few inter-related pieces of information that are stated multiple times in a set of documents.</Paragraph> <Paragraph position="1"> Here, we investigate one task that meets the above criteria. Given the name of a celebrity such as 1Alternatively, Riloff (1996) trains on in-domain and out-of-domain texts and then has a human filtering step.</Paragraph> <Paragraph position="2"> Huffman (1995) proposes a method to train a different type of extraction system by example.</Paragraph> <Paragraph position="3"> &quot;Frank Zappa&quot;, our goal is to extract a set of biographic facts (e.g., birthdate, birth place and occupation) about that person from documents on the Web.</Paragraph> <Paragraph position="4"> First, we describe a general method of automatic annotation for training from positive and negative examples and use the method to train Rote, Na&quot;ive Bayes, and Conditional Random Field models (Section 2). We then examine how multiple extractions can be combined to form one consensus answer (Section 3). We compare fusion methods and show that frequency voting outperforms the single highest confidence answer by an average of 11% across the various extractors. Increasing the number of retrieved documents boosts the overall system accuracy as additional documents which mention the individual in question lead to higher recall. This improved recall more than compensates for a loss in per-extraction precision from these additional documents. Next, we present a method for cross-field bootstrapping (Section 4) which improves per-field accuracy by 7%. We demonstrate that a small training set with only the most relevant documents can be as effective as a larger training set with additional, less relevant documents (Section 5).</Paragraph> </Section> class="xml-element"></Paper>