<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1671"> <Title>Learning Field Compatibilities to Extract Database Records from Unstructured Text</Title> <Section position="4" start_page="603" end_page="603" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> McDonald et al. (2005) present clustering techniques to extract complex relations, i.e. relations with more than two arguments. Record extraction can be viewed as an instance of complex relation extraction. We build upon this work in three ways: (1) our system learns the compatibility between sets of fields, rather than just pairs of fields; (2) our system is not restricted to relations between entities in the same sentence; and (3) our problem domain has a varying number of fields per record, as opposed to the fixed schema in McDonald et al.</Paragraph> <Paragraph position="1"> (2005).</Paragraph> <Paragraph position="2"> Bansal et al. (2004) present algorithms for the related task of correlational clustering: finding an optimal clustering from a matrix of pairwise compatibility scores. The correlational clustering approach does not handle compatibility scores calculated over sets of nodes, which we address in this paper.</Paragraph> <Paragraph position="3"> McCallum and Wellner (2005) discriminatively train a model to learn binary coreference decisions, then perform joint inference using graph partitioning. This is analogous to our work, with two distinctions. First, instead of binary coreference decisions, our model makes binary compatibility decisions, reflecting whether a set of fields belongs together in the same record. Second, whereas McCallum and Wellner (2005) factor the coreference decisions into pairs of vertices, our compatibility decisions are made between sets of vertices. As we show in our experiments, factoring decisions into sets of vertices enables more powerful features that can improve performance. 
These higher-order features have also recently been investigated in other models of coreference, both discriminative (Culotta and McCallum, 2006) and generative (Milch et al., 2005).</Paragraph> <Paragraph position="4"> Viola and Narasimhan (2005) present a probabilistic grammar to parse contact information blocks. While this model is capable of learning long-distance compatibilities (such as City and State relations), features to enable this are not explored. Additionally, their work focuses on labeling fields in documents that have been pre-segmented into records. This record segmentation is precisely what we address in this paper.</Paragraph> <Paragraph position="5"> Borkar et al. (2001) and Kristjannson et al.</Paragraph> <Paragraph position="6"> (2004) also label contact address blocks, but ignore the problem of clustering fields into records. Also, Culotta et al. (2004) automatically extract contact records from web pages, but use heuristics to cluster fields into records.</Paragraph> <Paragraph position="7"> Embley et al. (1999) provide heuristics to detect record boundaries in highly structured web documents, such as classified ads, and Embley and Xu (2000) improve upon these heuristics for slightly more ambiguous domains using a vector space model. Both of these techniques apply to data for which the records are highly contiguous and have a distinctive separator between records.</Paragraph> <Paragraph position="8"> These heuristic approaches are unlikely to be successful in the unstructured text domain we address in this paper.</Paragraph> <Paragraph position="9"> Most other work on relation extraction focuses only on binary relations (Zelenko et al., 2003; Miller et al., 2000; Agichtein and Gravano, 2000; Culotta and Sorensen, 2004). 
A serious difficulty in applying binary relation extractors to the record extraction task is that, rather than enumerating over all pairs of entities, the system must enumerate over all subsets of entities, up to subsets of size k, the maximum number of fields per record. We address this difficulty by employing two sampling methods: one that samples uniformly, and another that samples on a focused subset of the combinatorial space.</Paragraph> </Section> </Paper>