<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1671">
  <Title>Learning Field Compatibilities to Extract Database Records from Unstructured Text</Title>
  <Section position="3" start_page="0" end_page="603" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Information extraction (IE) algorithms populate a database with facts discovered from unstructured text. This database is often used by higher-level tasks such as question answering or knowledge discovery. The richer the structure of the database, the more useful it is to higher-level tasks.</Paragraph>
    <Paragraph position="1"> A common IE task is named-entity recognition (NER), the problem of locating mentions of entities in text, such as people, places, and organizations. NER techniques range from regular expressions to finite-state sequence models (Bikel et al., 1999; Grishman, 1997; Sutton and McCallum, 2006). NER can be viewed as a method of populating a database with single-tuple records, e.g. PERSON=Cecil Conner or ORGANIZATION=IBM.</Paragraph>
    <Paragraph position="2"> We can add richer structure to these single-tuple records by extracting the associations among entities. For example, we can populate multi-field records such as a contact record [PERSON=Steve Jobs, JOBTITLE=CEO, COMPANY=Apple, CITY=Cupertino, STATE=CA]. The relational information in these types of records presents a greater opportunity for text analysis.</Paragraph>
    <Paragraph position="3"> The task of associating entities is often framed as a binary relation extraction task: given a pair of entities, label the relation between them (e.g. Steve Jobs LOCATED-IN Cupertino). Common approaches to relation extraction include pattern matching (Brin, 1998; Agichtein and Gravano, 2000) and classification (Zelenko et al., 2003; Kambhatla, 2004).</Paragraph>
    <Paragraph position="4"> However, binary relation extraction alone is not well suited to the contact record example above, which requires associating many fields into one record. We refer to this task of piecing together many fields into a single record as record extraction.</Paragraph>
    <Paragraph position="5"> Consider the task of extracting contact records from personal homepages. An NER system may label all mentions of cities, people, organizations, phone numbers, job titles, etc. on a page, from both semi-structured and unstructured text. Even with a highly accurate NER system, it is not obvious which fields belong to the same record. For example, a single document could contain five names, three phone numbers and only one email.</Paragraph>
    <Paragraph position="6"> Additionally, the layout of certain fields may be convoluted or vary across documents.</Paragraph>
    <Paragraph position="7"> Intuitively, we would like to learn the compatibility among fields, for example the likelihood that the organization University of North Dakota is located in the state North Dakota, or that phone numbers with area code 212 co-occur with the city New York. Additionally, the system should take into account page layout information, so that nearby fields are more likely to be grouped into the same record.</Paragraph>
    <Paragraph position="8"> In this paper, we describe a method to induce a probabilistic compatibility function between sets of fields. Embedding this compatibility function within a graph partitioning method, we describe how to cluster highly compatible fields into records.</Paragraph>
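    <Paragraph> As an illustration only, the idea of clustering compatible fields into records can be sketched in a few lines of Python. Everything in this sketch is an assumption made for the example: the field mentions, the hand-written compatibility function (a toy stand-in for a learned probabilistic one), the 0.2 threshold, and the greedy agglomerative merge, which is a simple substitute for a graph partitioning method and is not claimed to be the algorithm used in this paper.

```python
from itertools import combinations

# Hypothetical field mentions: (field type, value, line number on the page).
FIELDS = [
    ("PERSON", "Steve Jobs", 1),
    ("COMPANY", "Apple", 2),
    ("CITY", "Cupertino", 3),
    ("PERSON", "Cecil Conner", 20),
    ("COMPANY", "IBM", 21),
]


def compatibility(cluster):
    """Toy stand-in for a learned compatibility function over a set of fields.

    Returns a score in [0, 1] from hand-written rules (illustrative only).
    """
    types = [f[0] for f in cluster]
    # Assumption: a contact record holds at most one PERSON.
    if types.count("PERSON") > 1:
        return 0.0
    # Layout cue: fields close together on the page are more compatible.
    lines = [f[2] for f in cluster]
    spread = max(lines) - min(lines)
    return 1.0 / (1.0 + spread)


def partition(fields, threshold=0.2):
    """Greedily merge the pair of clusters whose union scores highest."""
    clusters = [[f] for f in fields]
    while len(clusters) > 1:
        best_score, best_pair = max(
            ((compatibility(a + b), (i, j))
             for (i, a), (j, b) in combinations(enumerate(clusters), 2)),
            key=lambda item: item[0],
        )
        # Stop once even the best merge falls below the threshold.
        if threshold > best_score:
            break
        i, j = best_pair
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


records = partition(FIELDS)
```

    With these toy inputs, the two people are never merged into one record, and each company or city attaches to the person nearest to it on the page. A learned function would replace the hand-written rules with probabilities estimated from annotated records.</Paragraph>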
    <Paragraph position="9"> We evaluate our approach on personal homepages that have been manually annotated with contact record information, and demonstrate a 53% error reduction over baseline methods.</Paragraph>
  </Section>
</Paper>