File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/w06-1671_metho.xml

Size: 21,623 bytes

Last Modified: 2025-10-06 14:10:46

<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1671">
  <Title>Learning Field Compatibilities to Extract Database Records from Unstructured Text</Title>
  <Section position="5" start_page="603" end_page="606" type="metho">
    <SectionTitle>
3 From Fields to Records
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="603" end_page="604" type="sub_section">
      <SectionTitle>
3.1 Problem Definition
</SectionTitle>
      <Paragraph position="0"> LetafieldF beapair&lt;a,v&gt; ,whereaisanattribute (column label) and v is a value, e.g. Fi = &lt;CITY, San Francisco&gt; . Let record R be a set of fields,</Paragraph>
      <Paragraph position="2"> tiple fields with the same attribute but different values (e.g. a person may have multiple job titles). Assume we are given the output of a named- null entity recognizer, which labels tokens in a document with their attribute type (e.g. NAME or CITY). Thus, a document initially contains a set of fields, {F1 ...Fm}.</Paragraph>
      <Paragraph position="3"> The task is to partition the fields in each annotated document into a set of records {R1 ...Rk} such that each record Ri contains exactly the set of fields pertinent to that record. In this paper, we assume each field belongs to exactly one record.</Paragraph>
    </Section>
    <Section position="2" start_page="604" end_page="604" type="sub_section">
      <SectionTitle>
3.2 Solution Overview
</SectionTitle>
      <Paragraph position="0"> For each document, we construct a fully-connected weighted graph G = (V,E), with vertices V and weighted edges E. Each field in the document is represented by a vertex in V, and the edges are weighted by the compatibility of adjacent fields, i.e. a measure of how likely it is that Fi and Fj belong to the same record.</Paragraph>
      <Paragraph position="1"> Partitioning V into k disjoint clusters uniquely maps the set of fields to a set of k records. Below, we provide more detail on the two principal steps in our solution: (1) estimating the compatibility function and (2) partitioning V into disjoint clusters.</Paragraph>
    </Section>
    <Section position="3" start_page="604" end_page="604" type="sub_section">
      <SectionTitle>
3.3 Learning field compatibility
</SectionTitle>
      <Paragraph position="0"> Let F be a candidate cluster of fields forming a partial record. We construct a compatibility function C that maps two sets of fields to a real value, i.e. C : Fi x Fj - R. We abbreviate the value C(Fi,Fj) as Cij. The higher the value of Cij the more likely it is that Fi and Fj belong to the same record.</Paragraph>
      <Paragraph position="1"> For example, in the contact record domain, Cij can reflect whether a city and state should cooccur, or how likely a company is to have a certain job title.</Paragraph>
      <Paragraph position="2"> We represent Cij by a maximum-entropy classifier over the binary variable Sij, which is true if and only if field set Fi belongs to the same record as field set Fj. Thus, we model the conditional distribution</Paragraph>
      <Paragraph position="4"> parenrightBigg where fk is a binary feature function that computes attributes over the field sets, and L = {lk} is the set of real-valued weights that are the parameters of the maximum-entropy model. We set Cij = PL(Sij =true|Fi,Fj). This approach can be viewed as a logistic regression model for field compatibility.</Paragraph>
      <Paragraph position="5"> Examples of feature functions include formatting evidence (Fi appears at the top of the document, Fj at the bottom), conflicting value information (Fi and Fj contain conflicting values for the state field), or other measures of compatibility (a city value in Fi is known to exist in a state in Fj). A feature may involve more than one field, for example, if a name, title and university occurs consecutively in some order. We give a more detailed description of the feature functions in Section 4.3.</Paragraph>
      <Paragraph position="6"> We propose learning the L weights for each of these features using supervised machine learning. Given a set of documents D for which the true mapping from fields to set of records is known, we wish to estimate P(Sij|Fi,Fj) for all pairs of field sets Fi,Fj.</Paragraph>
      <Paragraph position="7"> Enumerating all positive and negative pairs of field sets is computationally infeasible for large datasets, so we instead propose two sampling methods to generate training examples. The first simply samples pairs of field sets uniformly from the training data. For example, given a document D containing true records {R1 ...Rk}, we sample positive and negative examples of field sets of varying sizes from {Ri ...Rj}. The second samplingmethodfirsttrainsthemodelusingtheexam- null ples generated by uniform sampling. This model is then used to cluster the training data. Additional training examples are created during the clustering process and are used to retrain the model parameters. Thissecondsamplingmethodisanattemptto more closely align the characteristics of the training and testing examples.</Paragraph>
      <Paragraph position="8"> Given a sample of labeled training data, we set the parameters of the maximum-entropy classifier in standard maximum-likelihood fashion, performing gradient ascent on the log-likelihood of the training data. The resulting weights indicate how important each feature is in determining whether two sets of fields belong to the same record.</Paragraph>
    </Section>
    <Section position="4" start_page="604" end_page="605" type="sub_section">
      <SectionTitle>
3.4 Partitioning Fields into Records
</SectionTitle>
      <Paragraph position="0"> One could employ the estimated classifier to convert fields into records as follows: Classify each pair of fields as positive or negative, and perform transitive closure to enforce transitivity of decisions. That is, if the classifier determines that A and B belong to the same record and that B and C belong to the same record, then by transitivity  A and C must belong to the same record. The drawback of this approach is that the compatibility between A and C is ignored. In cases where the classifier determines that A and C are highly incompatible, transitive closure can lead to poor precision. McCallum and Wellner (2005) explore this issue in depth for the related task of noun coreference resolution.</Paragraph>
      <Paragraph position="1"> With this in mind, we choose to avoid transitive closure, and instead employ a graph partitioning method to make record merging decisions jointly.</Paragraph>
      <Paragraph position="2"> Given a document D with fields {F1 ...Fn}, we construct a fully connected graph G = (V,E), with edge weights determined by the learned compatibility functionC. We wish to partition vertices V into clusters with high intra-cluster compatibility. null One approach is to simply use greedy agglomerative clustering: initialize each vertex to its own cluster, then iteratively merge clusters with the highest inter-cluster edge weights. The compatibility between two clusters can be measured using single-link or average-link clustering. The clustering algorithm converges when the inter-cluster edge weight between any pair of clusters is below a specified threshold.</Paragraph>
      <Paragraph position="3"> We propose a modification to this approach.</Paragraph>
      <Paragraph position="4"> Since the compatibility function we have described maps two sets of vertices to a real value, we can use this directly to calculate the compatibilitybetweentwoclusters, ratherthanperforming average or single link clustering.</Paragraph>
      <Paragraph position="5"> Wenowdescribethealgorithmmoreconcretely.</Paragraph>
      <Paragraph position="6">  * Input: (1) Graph G = (V,E), where each vertexvi representsafieldFi. (2)Athreshold value t.</Paragraph>
      <Paragraph position="7"> * Initialization: Placeeachvertexvi initsown cluster ^Ri. (The hat notation indicates that this cluster represents a possible record.) * Iterate: Re-calculate the compatibility functionCij between each pair of clusters. Merge the two most compatible clusters, ^R[?]i, ^R[?]j. * Termination: If there does not exist a pair of clusters ^Ri, ^Rj such that Cij &gt; t, the algo null rithm terminates and returns the current set of clusters.</Paragraph>
      <Paragraph position="8"> A natural threshold value is t = 0.5, since this is the point at which the binary compatibility classifier predicts that the fields belong to different records. In Section 4.4, we examine how performance varies with t.</Paragraph>
    </Section>
    <Section position="5" start_page="605" end_page="606" type="sub_section">
      <SectionTitle>
3.5 Representational power of cluster compatibility functions
</SectionTitle>
      <Paragraph position="0"> compatibility functions Most previous work on inducing compatibility functionslearnsthecompatibilitybetweenpairsof vertices, not clusters of vertices. In this section, we provide intuition to explain why directly modeling the compatibility of clusters of vertices may be advantageous. We refer to the cluster compatibility function as Cij, and the pairwise (binary) compatibility function as Bij.</Paragraph>
      <Paragraph position="1"> First, we note that Cij is a generalization of single-link and average-link clustering methods that use Bij, since the output of these methods can simply be included as features in Cij. For example, given two clusters ^Ri = {v1,v2,v3} and</Paragraph>
      <Paragraph position="3"> SAL( ^Ri, ^Rj) can be included as a feature for the compatibility function Cij, with an associated weight estimated from training data.</Paragraph>
      <Paragraph position="4"> Second, there may exist phenomena of the data that can only be captured by a classifier that considers &amp;quot;higher-order&amp;quot; features. Below we describe two such cases.</Paragraph>
      <Paragraph position="5"> In the first example, consider three vertices of mild compatibility, as in Figure 1(a). (For these examples, let Bij,Cij [?] [0,1].) Suppose that these three phone numbers occur nearby in a document. Since it is not uncommon for a person to havetwophonenumberswithdifferentareacodes, the pairwise compatibility function may score any pair of nearby phone numbers as relatively compatible. However, since it is fairly uncommon for a person to have three phone numbers with three different area codes, we would not like all three numbers to be merged into the same record.</Paragraph>
      <Paragraph position="6"> Assume an average-link clustering algorithm.</Paragraph>
      <Paragraph position="7"> After merging together the 333 and 444 numbers, Bij will recompute the new inter-cluster compatibility as 0.51, the average of the inter-cluster edges. In contrast, the cluster compatibility function Cij can represent the fact that three numbers withdifferentareacodesaretobemerged, andcan penalize their compatibility accordingly. Thus, in  higher representational power than the pairwise compatibility measure (B). In (a), the pairwise measure over-estimates the inter-cluster compatibility when there exist higher-order features such as A person is unlikely to have phone numbers with three different area codes. In (b), the pairwise measure underestimates inter-cluster compatibility when weak features like string comparisons can be combined into a more powerful feature by examining multiple field values.</Paragraph>
      <Paragraph position="8"> this example, the pairwise compatibility function over-estimates the true compatibility.</Paragraph>
      <Paragraph position="9"> In the second example (Figure 1(b)), we consider the opposite case. Consider three edges, two of which have weak compatibility, and one of which has high compatibility. For example, perhaps the system has access to a list of city-state pairs, and can reliably conclude that Pleasantville is a city in the state North Dakota.</Paragraph>
      <Paragraph position="10"> Deciding that Univ of North Dakota, Pleasantville belongs in the same record as North Dakota and Pleasantville is a bit more difficult.</Paragraph>
      <Paragraph position="11"> Suppose a feature function measures the string similarity between the city field Pleasantville and the company field Univ of North Dakota, Pleasantville. Alone, this string similarity might not be very strong, and so the pairwise compatibility is low. However, after Pleasantville and North Dakota are merged together, the cluster compatibility function can compute the string similarity of the concatenation of the city and state fields, resulting in a higher compatibility. In this example, the pairwise compatibility function underestimates the true compatibility.</Paragraph>
      <Paragraph position="12"> These two examples show that the cluster compatibility score can have more representational power than the average of pairwise compatibility scores.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="606" end_page="609" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="606" end_page="606" type="sub_section">
      <SectionTitle>
4.1 Data
</SectionTitle>
      <Paragraph position="0"> We hand-labeled a subset of faculty and student homepages from the WebKB dataset1. Each page was labeled with the 25 fields listed in Table 1.</Paragraph>
      <Paragraph position="1"> In addition, we labeled the records to which each field belonged. For example, in Figure 2, we labeled the contact information for Professor Smith into a separate record from that of her administrative assistant. There are 252 labeled pages in total, containing8996fieldsand16679wordtokens. We perform ten random samples of 70-30 splits of the data for all experiments.</Paragraph>
    </Section>
    <Section position="2" start_page="606" end_page="607" type="sub_section">
      <SectionTitle>
4.2 Systems
</SectionTitle>
      <Paragraph position="0"> We evaluate five different record extraction systems. With the exception of Transitive Closure, all methods employ the agglomerative clustering  both from an address block and from free text, and that Record 2 must be separated from Record 1 even though fields from each may be nearby in the text.</Paragraph>
      <Paragraph position="1"> algorithm described previously. The difference is inhowtheinter-clustercompatibilityiscalculated.</Paragraph>
      <Paragraph position="2"> * Transitive Closure: The method described in the beginning of Section 3.4, where hard classification decisions are made, and transi- null tivity is enforced.</Paragraph>
      <Paragraph position="3"> * Pairwise Compatibility: In this approach, the compatibility function only estimates the compatibility between pairs of fields, not sets of fields. To compute inter-cluster compatibility, the mean of the edges between the clusters is calculated.</Paragraph>
      <Paragraph position="4"> * McDonald: This method uses the pairwise  compatibility function, but instead of calculating the mean of inter-cluster edges, it calculates the geometric mean of all pairs of edges in the potential new cluster. That is, to calculate the compatibility of records Ri and Rj, we construct a new record Rij that contains all fields of Ri and Rj, then calculate the geometric mean of all pairs of fields in Rij. This is analogous to the method used in McDonald et al. (2005) for relation extraction. null * Cluster Compatibility (uniform): Inter-cluster compatibility is calculated directly by the cluster compatibility function. This is the method we advocate in Section 3. Training examplesaresampleduniformlyasdescribed in Section 3.3.</Paragraph>
      <Paragraph position="5"> * Cluster Compatibility (iterative): Same as above, but training examples are sampled using the iterative method described in Section 3.3.</Paragraph>
    </Section>
    <Section position="3" start_page="607" end_page="607" type="sub_section">
      <SectionTitle>
4.3 Features
</SectionTitle>
      <Paragraph position="0"> For the pairwise compatibility classifier, we exploit various formatting as well as knowledge-based features. Formatting features include the number of hard returns between fields, whether the fields occur on the same line, and whether the fields occur consecutively. Knowledge-based features include a mapping we compiled of cities and states in the United States and Canada. Additionally, weusedcompatibilityfeatures, suchaswhich fields are of the same type but have different values. null In building the cluster compatibility classifier, we use many of the same features as in the binary classifier, but cast them as first-order existential features that are generated if the feature exists between any pair of fields in the two clusters. Additionally, we are able to exploit more powerful compatibility and knowledge-base features. For example, we examine if a title, a first name and a last name occur consecutively (i.e., no other fields occur in-between them). Also, we examine multiple telephone numbers to ensure that they have the same area codes. Additionally, we employ count features that indicate if a certain field occurs more than a given threshold.</Paragraph>
    </Section>
    <Section position="4" start_page="607" end_page="609" type="sub_section">
      <SectionTitle>
4.4 Results
</SectionTitle>
      <Paragraph position="0"> For these experiments, we compare performance on the true record for each page. That is, we calculate how often each system returns a complete and accurate extraction of the contact record pertaining to the owner of the webpage. We refer to  this record as the canonical record and measure performance in terms of precision, recall and F1 for each field in the canonical record.</Paragraph>
      <Paragraph position="1"> Table2comparesprecision,recallandF1across the various systems. The cluster compatibility method with iterative sampling has the highest F1, demonstrating a 14% error reduction over the next best method and a 53% error reduction over the transitive closure baseline.</Paragraph>
      <Paragraph position="2"> Transitive closure has the highest recall, but it comes at the expense of precision, and hence obtains lower F1 scores than more conservative compatibility methods. The McDonald method also has high recall, but drastically improves precision over the transitivity method by taking into consideration all edge weights.</Paragraph>
      <Paragraph position="3"> The pairwise measure yields a slightly higher F1 score than McDonald mostly due to precision improvements. Because the McDonald method calculates the mean of all edge weights rather than just the inter-cluster edge weights, inter-cluster weights are often outweighed by intra-cluster weights. This can cause two denselyconnected clusters to be merged despite low inter-cluster edge weights.</Paragraph>
      <Paragraph position="4"> To further investigate performance differences, we perform three additional experiments. The first measures how sensitive the algorithms are to the threshold value t. Figure 3 plots the precision-recall curve obtained by varyingt from 1.0 to 0.1. As expected, high values of t result in low recall but high precision, since the algorithms halt with a large number of small clusters. The highlighted points correspond to t = 0.5. These results indicate that setting t to 0.5 is near optimal, and that the cluster compatibility method outperforms the pairwise across a wide range of values for t.</Paragraph>
      <Paragraph position="5"> In the second experiment, we plot F1 versus the size of the canonical record. Figure 4 indicates that most of the performance gain occurs in smaller canonical records (containing between 6 and 12 fields). Small canonical records are most susceptible to precision errors simply because there are more extraneous fields that may be incorrectly assigned to them. These precision errors are often addressed by the cluster compatibility method, as shown in Table 2.</Paragraph>
      <Paragraph position="6"> In the final experiment, we plot F1 versus the total number of fields on the page. Figure 5 indicates that the cluster compatibility method is best athandlingdocumentswithlargenumberoffields.</Paragraph>
      <Paragraph position="7">  wise, and mcdonald. The graph is obtained by varying the stopping threshold t from 1.0 to 0.1.</Paragraph>
      <Paragraph position="8"> The highlighted points correspond to t = 0.5.</Paragraph>
      <Paragraph position="9"> When there are over 80 fields in the document, the  performanceofthepairwisemethoddropsdramatically, while cluster compatibility only declines slightly. We believe the improved precision of the cluster compatibility method explains this trend as well.</Paragraph>
      <Paragraph position="10">  Wealsoexaminedocumentswhereclustercompatibility outperforms the pairwise methods. Typically, these documents contain interleaving contact records. Often, it is the case that a single pair of fields is sufficient to determine whether a cluster should not be merged. For example, the cluster classifier can directly model the fact that a contact record should not have multiple first or last names. It can also associate a weight with the fact that several fields overlap (e.g., the chances that a cluster has two first names, two last names and two cities). In contrast, the binary classifier only examines pairs of fields in isolation and averages these probabilities with other edges. This averaging can dilute the evidence from a single pair of fields. Embarrassing errors may result, such as a contact record with two first names or two last  the document increases. This figure suggests that  clustercompatibilityismosthelpfulwhenthedocument has more than 80 fields. names. These errors are particularly prevalent in interleaving contact records since adjacent fields often belong to the same record.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>