<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1038"> <Title>Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text</Title> <Section position="4" start_page="296" end_page="297" type="metho"> <SectionTitle> 3 Relation Extraction as Sequence Labeling </SectionTitle> <Paragraph position="0"> Relation extraction is the task of discovering semantic connections between entities. In text, this usually amounts to examining pairs of entities in a document and determining (from local language cues) whether a relation exists between them. Common approaches to this problem include pattern matching (Brin, 1998; Agichtein and Gravano, 2000), kernel methods (Zelenko et al., 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2006), logistic regression (Kambhatla, 2004), and augmented parsing (Miller et al., 2000).</Paragraph> <Paragraph position="1"> The pairwise classification approach of kernel methods and logistic regression is commonly a two-phase method: first the entities in a document are identified, then a relation type is predicted for each pair of entities. This approach presents at least two difficulties: (1) enumerating all pairs of entities, even when restricted to pairs within a sentence, results in a low density of positive relation examples; and (2) errors in the entity recognition phase can propagate to errors in the relation classification stage. As an example of the latter difficulty, if a person is mislabeled as a company, then the relation classifier will be unable to find a brother relation, despite local evidence.</Paragraph> <Paragraph position="2"> We avoid these difficulties by restricting our investigation to biographical texts, e.g. encyclopedia articles. A biographical text mostly discusses one entity, which we refer to as the principal entity. We refer to other mentioned entities as secondary entities. For each secondary entity, our goal is to predict what relation, if any, it has to the principal entity.</Paragraph> <Paragraph position="3"> This formulation allows us to treat relation extraction as a sequence labeling task such as named-entity recognition or part-of-speech tagging, and we can now apply models that have been successful on those tasks. By anchoring one argument of relations to be the principal entity, we alleviate the difficulty of enumerating all pairs of entities in a document.</Paragraph> <Paragraph position="4"> By converting to a sequence labeling task, we fold the entity recognition step into the relation extraction task. There is no initial pass to label each entity as a person or company. Instead, an entity's label is its relation to the principal entity. Below is an example of a labeled article: George W. Bush: George is the son of George H. W. Bush [father] and Barbara Bush [mother].</Paragraph> <Paragraph position="5"> Additionally, by using a sequence model we can capture the dependence between adjacent labels. For example, in our data it is common to see phrases such as &quot;son of the Republican president George H. W. Bush&quot; for which the labels politicalParty, jobTitle, and father occur consecutively. Sequence models are specifically designed to handle these kinds of dependencies. 
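To make the labeling scheme concrete, here is a minimal sketch of how the example sentence above might be encoded as a token/label sequence. The tokenization and the use of an O tag for tokens that carry no relation are illustrative assumptions, not the paper's exact annotation format.

```python
# Illustrative token/label encoding of the labeled article above.
# Each token of a secondary entity receives its relation to the principal
# entity (George W. Bush); all other tokens receive the "outside" tag O.
tokens = "George is the son of George H. W. Bush and Barbara Bush".split()
labels = ["O", "O", "O", "O", "O",                 # George is the son of
          "father", "father", "father", "father",  # George H. W. Bush
          "O",                                     # and
          "mother", "mother"]                      # Barbara Bush

assert len(tokens) == len(labels)
for token, label in zip(tokens, labels):
    print(f"{token:10s} {label}")
```

A linear-chain sequence model can then be trained over such sequences much as one would train a named-entity tagger.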
We now discuss the details of our extraction model.</Paragraph> <Section position="2" start_page="297" end_page="297" type="sub_section"> <SectionTitle> 3.1 Conditional Random Fields </SectionTitle> <Paragraph position="0"> We build a model to extract relations using linear-chain conditional random fields (CRFs) (Lafferty et al., 2001; Sutton and McCallum, 2006). CRFs are undirected graphical models (i.e. Markov networks) that are discriminatively trained to maximize the conditional probability of a set of output variables y given a set of input variables x. This conditional distribution has the form</Paragraph> <Paragraph position="2"> $p_\Lambda(y \mid x) = \frac{1}{Z_x} \prod_{c \in C} \phi(y_c, x_c)$, where $\phi$ are potential functions parameterized by $\Lambda$ and $Z_x = \sum_{y} \prod_{c \in C} \phi(y_c, x_c)$ is a normalization factor. Assuming $\phi_c$ factorizes as a log-linear combination of arbitrary features computed over clique $c$, then $\phi_c(y_c, x_c; \Lambda) = \exp\left(\sum_k \lambda_k f_k(y_c, x_c)\right)$, where $f$ is a set of arbitrary feature functions over the input, each of which has an associated model parameter $\lambda_k$. Parameters $\Lambda = \{\lambda_k\}$ are a set of real-valued weights typically estimated from labeled training data by maximizing the data likelihood function using gradient ascent.</Paragraph> <Paragraph position="3"> In these experiments, we make a first-order Markov assumption on the dependencies among y, resulting in a linear-chain CRF.</Paragraph> </Section> </Section> <Section position="5" start_page="297" end_page="298" type="metho"> <SectionTitle> 4 Relational Patterns </SectionTitle> <Paragraph position="0"> The modeling flexibility of CRFs permits the feature functions to be complex, overlapping features of the input without requiring additional assumptions on their inter-dependencies. In addition to common language features (e.g. neighboring words and syntactic information), in this work we explore features that cull relational patterns from a database of entities. As described in the introductory example (Figure 1), context alone is often insufficient to extract relations. Even in simpler examples, it may be the case that modeling relational patterns can improve extraction accuracy.</Paragraph> <Paragraph position="1"> To capture this evidence, we compute features from a database to indicate relational connections between entities, similar to the relational path-finding performed in Richards and Mooney (1992). Imagine that the four-sentence example about the Bush family is included in a training set, and the entities are labeled with their correct relations. In this case, the cousin relation in sentence 4 would also be labeled. From this data, we can create a relational database that contains the relations in Figure 1.</Paragraph> <Paragraph position="2"> Assume sentence 4 comes from a biography about John Ellis. We calculate a feature for the entity George W. Bush that indicates the path from John Ellis to George W. Bush in the database, annotating each edge in the path with its relation label; i.e. father-sibling-son. By abstracting away the actual entity names, we have created a cousin template feature, as shown in Figure 2.</Paragraph> <Paragraph position="3"> By adding these relational paths as features to the model, we can learn interesting relational patterns that may have low precision (e.g. &quot;people are likely to be friends with their classmates&quot;) without hampering extraction performance. 
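The relational path feature can be illustrated with a small sketch. Assuming a toy in-memory database (the node "Ellis Sr." is a placeholder name for John Ellis's father, and the function is illustrative rather than the paper's implementation), a breadth-first search finds the path from the principal entity to the secondary entity and concatenates the relation labels along it:

```python
from collections import deque

# Toy relational database over the Bush-family example. Edges are
# (neighbor, relation-label) pairs as they might appear after labeling.
db = {
    "John Ellis":        [("Ellis Sr.", "father")],
    "Ellis Sr.":         [("George H. W. Bush", "sibling")],
    "George H. W. Bush": [("George W. Bush", "son")],
}

def relation_path(db, source, target, max_len=4):
    """Breadth-first search for a labeled path from source to target;
    the edge labels, joined in order, form the feature string."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        entity, path = queue.popleft()
        if entity == target:
            return "-".join(path)
        if len(path) >= max_len:
            continue
        for neighbor, label in db.get(entity, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [label]))
    return None

# Feature for the secondary entity George W. Bush in a biography whose
# principal entity is John Ellis:
print(relation_path(db, "John Ellis", "George W. Bush"))  # father-sibling-son
```

Because the entity names are discarded and only the label sequence is kept, the same father-sibling-son string fires for any pair of cousins in the database, giving the cousin template feature that the CRF can weight.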
This is in contrast to the system described in Nahm and Mooney (2000), in which patterns are induced from a noisy database and then applied directly to extraction. In our system, since each learned path has an associated weight, it is simply another piece of evidence to help the extractor. Low precision patterns may have lower weights than high precision patterns, but they will still influence the extractor.</Paragraph> <Paragraph position="4"> A nice property of this approach is that examining highly weighted patterns can provide insight into regularities of the data.</Paragraph> <Section position="1" start_page="297" end_page="298" type="sub_section"> <SectionTitle> 4.1 Feature Induction </SectionTitle> <Paragraph position="0"> During CRF training, weights are learned for each relational pattern. Patterns that increase extraction performance will receive higher weights, while patterns that have little effect on performance will receive low weights.</Paragraph> <Paragraph position="1"> We can explore the space of possible conjunctions of these patterns using feature induction for CRFs, as described in McCallum (2003). Search through the large space of possible conjunctions is guided by adding features that are estimated to increase the likelihood function most.</Paragraph> <Paragraph position="2"> When feature induction is used with relational patterns, we can view this as a type of data mining, in which patterns are created based on their influence on an extraction model. This is similar to work by Dehaspe (1997), where inductive logic programming is embedded as a feature induction technique for a maximum entropy classifier. Our work restricts induced features to conjunctions of base features, rather than using first-order clauses. However, the patterns we learn are based on information extracted from natural language.</Paragraph> </Section> <Section position="2" start_page="298" end_page="298" type="sub_section"> <SectionTitle> 4.2 Iterative Database Construction </SectionTitle> <Paragraph position="0"> The top-down knowledge provided by data mining algorithms has the potential to improve the performance of information extraction systems. Conversely, bottom-up knowledge generated by extraction systems can be used to populate a large database, from which more top-down knowledge can be discovered. By carefully communicating the uncertainty between these systems, we hope to iteratively expand a knowledge base, while minimizing fallacious inferences.</Paragraph> <Paragraph position="1"> In this work, the top-down knowledge consists of relational patterns describing the database path between entities in text. The uncertainty of this knowledge is handled by associating a real-valued CRF weight with each pattern, which increases when the pattern is predictive of other relations. Thus, the extraction model can adapt to noise in these patterns.</Paragraph> <Paragraph position="2"> Since we also desire to extract relations between entities that appear in text but not in the database, we first populate the database with relations extracted by a CRF that does not use relational patterns. We then do further extraction with a CRF that incorporates the relational patterns found in this automatically generated database. In this manner, we create a closed-loop system that alternates between bottom-up extraction and top-down pattern discovery. 
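A minimal sketch of this closed loop follows. The callables train_crf, extract_with_confidence, and mine_path_patterns are hypothetical stand-ins for the components described in this section, and the confidence threshold is an assumed value; the confidence-based pruning itself is discussed next.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# (principal entity, secondary entity, relation label, confidence)
Extraction = Tuple[str, str, str, float]

def iterative_extraction(
    corpus: List[str],
    train_crf: Callable[[List[str], List[str]], object],
    extract_with_confidence: Callable[[object, List[str]], Iterable[Extraction]],
    mine_path_patterns: Callable[[Dict[Tuple[str, str], str]], List[str]],
    num_rounds: int = 2,
    confidence_threshold: float = 0.9,
) -> Tuple[Dict[Tuple[str, str], str], List[str]]:
    """Alternate between bottom-up extraction and top-down pattern mining.
    The helper callables are placeholders for the CRF trainer, the
    confidence-scored extractor, and the pattern miner described above."""
    database: Dict[Tuple[str, str], str] = {}
    patterns: List[str] = []
    for _ in range(num_rounds):
        # Bottom-up: train the extractor with whatever relational-path
        # features have been mined so far, then extract relations.
        crf = train_crf(corpus, patterns)
        for principal, secondary, label, conf in extract_with_confidence(crf, corpus):
            # Keep only confident extractions to limit noise in the database.
            if conf >= confidence_threshold:
                database[(principal, secondary)] = label
        # Top-down: mine relation-label paths (e.g. father-sibling-son)
        # from the growing database and feed them back as CRF features.
        patterns = mine_path_patterns(database)
    return database, patterns
```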
This approach can be viewed as a type of alternating optimization, with analogies to formal methods such as expectation-maximization.</Paragraph> <Paragraph position="3"> The uncertainty in the bottom-up extraction step is handled by estimating the confidence of each extraction and pruning the database to remove entries with low confidence. One of the benefits of a probabilistic extraction model is that confidence estimates can be straightforwardly obtained. Culotta and McCallum (2004) describe the constrained forward-backward algorithm to efficiently estimate the conditional probability that a segment of text is correctly extracted by a CRF.</Paragraph> <Paragraph position="4"> Using this algorithm, we associate a confidence value with each relation extracted by the CRF. This confidence value is then used to limit the noise introduced by incorrect extractions. This differs from Nahm and Mooney (2000) and Mooney and Bunescu (2005), in which standard decision tree rule learners are applied to the unfiltered output of extraction.</Paragraph> </Section> <Section position="3" start_page="298" end_page="298" type="sub_section"> <SectionTitle> 4.3 Extracting Implicit Relations </SectionTitle> <Paragraph position="0"> An implicit relation is one that does not have direct contextual evidence, for example the cousin relation in our initial example. Implicit relations generally require some background knowledge to be detected, such as relational patterns (e.g. rules about familial relations). These are the sorts of relations on which current extraction models perform most poorly.</Paragraph> <Paragraph position="1"> Notably, these are exactly the sorts of relations that are likely to have the biggest impact on information access. A system that can accurately discover knowledge that is only implied by the text will dramatically increase the amount of information a user can uncover, effectively providing access to the implications of a corpus.</Paragraph> <Paragraph position="2"> We argue that integrating the top-down and bottom-up knowledge discovery algorithms discussed in Section 4.2 can enable this technology. By performing pattern discovery in conjunction with information extraction, we can collate facts from multiple sources to infer new relations. This is an example of cross-document fusion or cross-document information extraction, a growing area of research transforming raw extractions into usable knowledge bases (Mann and Yarowsky, 2005; Masterson and Kushmerick, 2003).</Paragraph> </Section> </Section> </Paper>