<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3802"> <Title>Graph Based Semi-Supervised Approach for Information Extraction</Title>
<Section position="4" start_page="9" end_page="9" type="metho"> <SectionTitle> 3 Background </SectionTitle>
<Paragraph position="0"> In graph theory, a graph is a set of objects called vertices joined by links called edges. A bipartite graph, also called a bigraph, is a special graph whose vertices can be divided into two disjoint sets such that no two vertices of the same set share an edge.</Paragraph>
<Paragraph position="1"> The Hypertext Induced Topic Selection (HITS) algorithm is an algorithm for rating, and therefore ranking, web pages. HITS makes use of the following observation: when a page (hub) links to another page (authority), the former confers authority on the latter. HITS maintains two values for each page, an &quot;authority value&quot; and a &quot;hub value&quot;, which are defined in terms of one another in a mutual recursion: the authority value of a page is the sum of the scaled hub values of the pages that point to it, and the hub value of a page is the sum of the scaled authority values of the pages it points to.</Paragraph>
<Paragraph position="2"> A template, as we define it for this work, is a sequence of generic forms that generalizes over the given training instances. An example template is:

COUNTRY NOUN_PHRASE PERSON VERB_PHRASE

This template could represent the sentence &quot;American vice President Al Gore visited ...&quot;. The template is derived from the Named Entity tags, Part-of-Speech (POS) tags, and semantic tags of the sentence. The choice of template representation here is for illustration purposes only; any combination of tags, representations, and tagging styles might be used.</Paragraph>
<Paragraph position="3"> A pattern is more specific than a template: a pattern additionally specifies the role played by each tag (first entity, second entity, or relation). An example of a pattern is:

COUNTRY (Entity 2) NOUN_PHRASE (Relation) PERSON (Entity 1) VERB_PHRASE

This pattern indicates that the word(s) tagged COUNTRY in the sentence represent the second entity (Entity 2) in the relation, while the word(s) tagged PERSON represent the first entity (Entity 1). Finally, the word(s) tagged NOUN_PHRASE represent the relation between the two entities.</Paragraph>
<Paragraph position="4"> A tuple, in our notation throughout this paper, is the result of applying a pattern to unstructured text. In the above example, one result of applying the pattern to raw text is the following tuple:

Entity 1: Al Gore; Entity 2: American; Relation: vice President</Paragraph> </Section>
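To make the template / pattern / tuple distinction concrete, here is a minimal Python sketch of one possible representation. The data structures and the apply_pattern helper are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch (assumed representation) of the template / pattern / tuple
# notions defined above.
from dataclasses import dataclass

# A template is just a sequence of generic tags.
TEMPLATE = ["COUNTRY", "NOUN_PHRASE", "PERSON", "VERB_PHRASE"]

# A pattern additionally assigns roles to (some of) the tags.
PATTERN_ROLES = {"COUNTRY": "entity2", "NOUN_PHRASE": "relation", "PERSON": "entity1"}

@dataclass
class ExtractedTuple:
    entity1: str
    entity2: str
    relation: str

def apply_pattern(tagged_words, template=TEMPLATE, roles=PATTERN_ROLES):
    """Return an ExtractedTuple if the tag sequence matches the template, else None."""
    tags = [tag for _, tag in tagged_words]
    if tags != template:
        return None
    slots = {roles[tag]: word for word, tag in tagged_words if tag in roles}
    return ExtractedTuple(slots["entity1"], slots["entity2"], slots["relation"])

# "American vice President Al Gore visited ..."
sentence = [("American", "COUNTRY"), ("vice President", "NOUN_PHRASE"),
            ("Al Gore", "PERSON"), ("visited", "VERB_PHRASE")]
print(apply_pattern(sentence))
# ExtractedTuple(entity1='Al Gore', entity2='American', relation='vice President')
```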
<Section position="6" start_page="9" end_page="11" type="metho"> <SectionTitle> 4 The Approach </SectionTitle>
<Paragraph position="0"> The semi-supervised graph-based approach we propose depends on the construction of generalized extraction patterns that can match many training instances. The patterns are then weighted according to their importance by deploying graph-based mutual reinforcement techniques. Patterns derived from the supervised training instances should have a stronger effect in the reinforcement weighting process. The relation between patterns and tuples is inherently dual: a pattern can match different tuples, and a tuple can in turn be matched by different patterns. The proposed approach is composed of two main steps, namely pattern extraction and pattern weighting, or induction. Both steps are detailed in the next subsections.</Paragraph>
<Section position="1" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 4.1 Pattern Extraction </SectionTitle>
<Paragraph position="0"> As shown in Figure 1, several syntactic, lexical, and semantic analyzers can be applied to the training instances. The resulting analyses are employed in the construction of extraction patterns. Any extraction pattern can match different relations and hence produce several tuples.</Paragraph>
<Paragraph position="1"> As an example, consider the pattern depicted in Figure 1, which has the form introduced in Section 3. Applied to the example sentence, this pattern extracts a tuple whose relation is &quot;vice President&quot;; applied to a different sentence, the same pattern could extract a tuple whose relation is &quot;Prime Minister&quot;. On the other hand, many other patterns could extract the same tuple from different contexts. It is worth mentioning that the proposed approach is general enough to accommodate any pattern design; the introduced pattern design is for illustration purposes only. To further increase the number of patterns that can match a single tuple, the tuple space may be reduced, e.g., by grouping tuples conveying the same information content into a single tuple (see the sketch after this subsection). This is detailed further in the experimental setup section.</Paragraph> </Section>
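The pattern-tuple matches collected in this step define the bipartite structure that the induction step (Section 4.2) operates on. The following sketch shows one way to accumulate it; the string pattern encoding and the example tuples are assumptions for illustration, not the authors' code:

```python
# Sketch: accumulate the dual pattern/tuple match structure produced by
# pattern extraction (representation assumed).
from collections import defaultdict

pattern_to_tuples = defaultdict(set)   # T(p): tuples matched by pattern p
tuple_to_patterns = defaultdict(set)   # P(t): patterns matching tuple t

def record_match(pattern, extracted_tuple):
    """Register that `pattern` extracted `extracted_tuple` somewhere in the data."""
    pattern_to_tuples[pattern].add(extracted_tuple)
    tuple_to_patterns[extracted_tuple].add(pattern)

# A pattern can match different tuples ...
record_match("COUNTRY NOUN_PHRASE PERSON VERB_PHRASE",
             ("Al Gore", "American", "vice President"))
record_match("COUNTRY NOUN_PHRASE PERSON VERB_PHRASE",
             ("Tony Blair", "British", "Prime Minister"))   # invented example tuple
# ... and a tuple can in turn be matched by different patterns.
record_match("COUNTRY NOUN_PHRASE PERSON VERB_PHRASE VERB_PHRASE",
             ("Al Gore", "American", "vice President"))
```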
<Section position="2" start_page="10" end_page="11" type="sub_section"> <SectionTitle> 4.2 Pattern Induction </SectionTitle>
<Paragraph position="0"> The inherent duality of the pattern-tuple relation suggests that the problem can be interpreted as a hub-authority problem, which can be solved by applying the HITS algorithm to iteratively assign authority and hub scores to patterns and tuples, respectively.</Paragraph>
<Paragraph position="1"> Patterns and tuples are represented by a bipartite graph, as illustrated in Figure 2. Each pattern or tuple is represented by a node in the graph, and edges represent matches between patterns and tuples. The pattern induction problem can be formulated as follows: given a very large data set D containing a large set of patterns P which match a large set of tuples T, identify $\tilde{P}$, the set of patterns that match the set of most correct tuples $\tilde{T}$. The intuition is that tuples matched by many different patterns tend to be correct, and patterns matching many different tuples tend to be good patterns. In other words, we want to choose, among the large space of patterns in the data, the most informative, highest-confidence patterns that identify correct tuples, i.e., the most &quot;authoritative&quot; patterns, in analogy with the hub-authority problem. However, both $\tilde{P}$ and $\tilde{T}$ are unknown. The induction process proceeds as follows: each pattern p in P is associated with a numerical authority weight $a_p$, which expresses how many tuples match that pattern, and each tuple t in T has a numerical hub weight $h_t$, which expresses how many patterns match that tuple. The weights are calculated iteratively as follows:

$a^{(i+1)}(p) = \frac{1}{H^{(i)}} \sum_{t \in T(p)} h^{(i)}(t)$   (1)

$h^{(i+1)}(t) = \frac{1}{A^{(i)}} \sum_{p \in P(t)} a^{(i)}(p)$   (2)

where T(p) is the set of tuples matched by p, P(t) is the set of patterns matching t, $a^{(i+1)}(p)$ is the authority weight of pattern p at iteration i+1, and $h^{(i+1)}(t)$ is the hub weight of tuple t at iteration i+1. $H^{(i)}$ and $A^{(i)}$ are normalization factors defined as:

$H^{(i)} = \sum_{p \in P} \sum_{t \in T(p)} h^{(i)}(t)$   (3)

$A^{(i)} = \sum_{t \in T} \sum_{p \in P(t)} a^{(i)}(p)$   (4)

Patterns with weights lower than a predefined threshold are rejected, and examples associated with highly ranked patterns are then used in unsupervised training.</Paragraph>
<Paragraph position="2"> It is worth mentioning that both T and P contain supervised and unsupervised examples; the proposed method can assign weights to the correct examples (tuples and patterns) in a completely unsupervised setup. In the semi-supervised setup, some supervised examples are provided, which are in turn associated with tuples and patterns.</Paragraph>
<Paragraph position="3"> We adopt the extension of HITS with priors introduced in (White and Smyth, 2003) and, by analogy, handle the supervised examples as priors to the HITS induction algorithm. A prior-probabilities vector $pr = \{pr_1, \ldots, pr_n\}$ is defined such that the probabilities sum to 1, where $pr_v$ denotes the relative importance (or &quot;prior bias&quot;) we attach to node v. A pattern $P_i$ is assigned a prior $pr_i = 1/n$ if it matches a supervised tuple, and $pr_i = 0$ otherwise, where n is the total number of patterns that have a supervised match. We also define a &quot;back probability&quot; $\beta$, $0 \le \beta \le 1$, which determines how strongly we bias the supervised nodes:

$a^{(i+1)}(p) = (1-\beta)\,\frac{1}{H^{(i)}} \sum_{t \in T(p)} h^{(i)}(t) + \beta\, pr_p$   (5)

$h^{(i+1)}(t) = (1-\beta)\,\frac{1}{A^{(i)}} \sum_{p \in P(t)} a^{(i)}(p) + \beta\, pr_t$   (6)

where T(p) is the set of tuples matched by p, P(t) is the set of patterns matching t, and $H^{(i)}$ and $A^{(i)}$ are the normalization factors defined in equations (3) and (4). Thus each node in the graph (pattern or tuple) has an associated prior weight depending on its supervised data, and the induction process proceeds to iteratively assign weights to the patterns and tuples. In the current work we used $\beta = 0.5$.</Paragraph> </Section> </Section>
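A compact sketch of this induction loop follows, under my reading of equations (1)-(6); the matrix encoding of the bipartite graph and the toy data are assumptions for illustration:

```python
# Minimal sketch of HITS with priors as reconstructed in equations (5)-(6).
import numpy as np

def hits_with_priors(M, pr_p, pr_t, beta=0.5, iters=50):
    """M[p, t] = 1 iff pattern p matches tuple t; returns (authority, hub) weights."""
    n_p, n_t = M.shape
    a = np.full(n_p, 1.0 / n_p)          # authority weights of patterns
    h = np.full(n_t, 1.0 / n_t)          # hub weights of tuples
    for _ in range(iters):
        a_raw = M @ h                     # sum_{t in T(p)} h(t)
        h_raw = M.T @ a                   # sum_{p in P(t)} a(p)
        a = (1 - beta) * a_raw / a_raw.sum() + beta * pr_p   # eq. (5)
        h = (1 - beta) * h_raw / h_raw.sum() + beta * pr_t   # eq. (6)
    return a, h

# Toy bipartite graph: 3 patterns x 4 tuples; pattern 0 matches a supervised tuple.
M = np.array([[1., 1., 1., 0.],
              [1., 0., 1., 1.],
              [0., 0., 0., 1.]])
pr_p = np.array([1., 0., 0.])             # prior 1/n over patterns with a supervised match (n = 1)
pr_t = np.array([1., 0., 0., 0.])         # prior over tuples with supervised support
a, h = hits_with_priors(M, pr_p, pr_t)
print(a.round(3), h.round(3))             # patterns/tuples with many matches score higher
```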
<Section position="7" start_page="11" end_page="1411" type="metho"> <SectionTitle> 5 Experimental Setup </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="11" end_page="1411" type="sub_section"> <SectionTitle> 5.1 ACE Relation Detection and Characterization </SectionTitle>
<Paragraph position="0"> In this section, we describe Automatic Content Extraction (ACE). ACE is an evaluation conducted by NIST to measure Entity Detection and Tracking (EDT) and Relation Detection and Characterization (RDC). The EDT task is concerned with detecting mentions of entities and grouping them together by identifying their coreference. The RDC task detects relations between entities identified by the EDT task. We chose the RDC task to demonstrate the performance of the graph-based semi-supervised information extraction approach we propose. To this end, we need to introduce the notions of mentions and entities. Mentions are instances of textual references to objects such as people, organizations, geo-political entities (countries, cities, etc.), locations, or facilities. Entities, on the other hand, are objects containing all mentions of the same object.</Paragraph>
<Paragraph position="1"> Several relation types are defined for the ACE RDC task. Here, we present an example of those relations: Spain's Interior Minister announced this evening the arrest of separatist organization Eta's presumed leader Ignacio Garcia Arregui. Arregui, who is considered to be the Eta organization's top man, was arrested at 17h45 Greenwich. The Spanish judiciary suspects Arregui of ordering a failed attack on King Juan Carlos in 1995.</Paragraph>
<Paragraph position="2"> In this fragment, all the underlined phrases are mentions of the Eta organization or of &quot;Garcia Arregui&quot;. There is a management relation between &quot;leader&quot;, which refers to &quot;Garcia Arregui&quot;, and &quot;Eta&quot;.</Paragraph> </Section>
<Section position="2" start_page="1411" end_page="1411" type="sub_section"> <SectionTitle> 5.2 Baseline System </SectionTitle>
<Paragraph position="0"> The baseline system uses a Maximum Entropy model that combines diverse lexical, syntactic, and semantic features derived from the text, similar to the system described in (Nanda, 2004). The system was trained on the ACE training data provided by the LDC. The training set contained 145K words and 4764 relation instances; the number of instances of each relation type is shown in Table 1. The test set contained around 41K words and 1097 relation instances. The system was evaluated using the standard ACE evaluation procedure, which assigns the system an ACE value for each relation type and a total ACE value. The ACE value is a standard NIST metric for evaluating relation extraction; the reader is referred to the ACE web site (ACE, 2004) for more details.</Paragraph> </Section>
<Section position="3" start_page="1411" end_page="1411" type="sub_section"> <SectionTitle> 5.3 Pattern Construction </SectionTitle>
<Paragraph position="0"> We used the baseline system described in the previous section to label a large amount of unsupervised data. The data comes from the Agence France Press English Service (AFE) portion of the LDC English Gigaword corpus. It contains around 3M words, from which 80K relation instances have been extracted.</Paragraph>
<Paragraph position="1"> We start by extracting a set of patterns that represent the supervised and unsupervised data. We consider each relation type separately and extract a pattern for each instance of the selected relation. The patterns we use mix part-of-speech (POS) tags and mention tags for the words in the training instance: for each word, we use the mention tag if one exists; otherwise we use the POS tag. The resulting patterns have the same general form as the example shown in Section 3.</Paragraph> </Section>
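A small sketch of this pattern construction step; the token representation (word, POS tag, optional mention tag) and the example analysis are assumptions, since the paper does not publish code:

```python
# Sketch of the Section 5.3 pattern construction: mention tag if present, else POS tag.
def build_pattern(tokens):
    """tokens: list of (word, pos_tag, mention_tag_or_None) triples."""
    return " ".join(mention if mention else pos for _, pos, mention in tokens)

# Hypothetical analyzed fragment of "Spain's Interior Minister announced ...":
tokens = [("Spain's", "NNP", "GPE"),
          ("Interior Minister", "NP", None),
          ("announced", "VBD", None)]
print(build_pattern(tokens))   # -> "GPE NP VBD"
```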
<Section position="4" start_page="1411" end_page="1411" type="sub_section"> <SectionTitle> 5.4 Tuple Clustering </SectionTitle>
<Paragraph position="0"> As discussed in the previous section, the tuple space should be reduced to allow more matching between pattern-tuple pairs. This space reduction can be accomplished by defining a tuple similarity measure and constructing a weighted undirected graph of tuples, in which two tuples are linked by an edge if their similarity exceeds a certain threshold. Graph clustering algorithms can then partition the graph into a set of homogeneous communities, or clusters, of similar tuples. Using WordNet, we can measure the semantic similarity or relatedness between a pair of concepts (word senses) and, by extension, between a pair of sentences.</Paragraph>
<Paragraph position="1"> We use the similarity measure described in (Wu and Palmer, 1994), which is based on the path length to the root node from the least common subsumer (LCS) of the two word senses, i.e., the most specific word sense they share as an ancestor. The similarity score of two tuples, $S_T$, is calculated as follows:

$S_T = \frac{S_{E1} + S_{E2}}{2}$   (7)

where $S_{E1}$ and $S_{E2}$ are the similarity scores of the two tuples' first entities and second entities, respectively.</Paragraph>
<Paragraph position="2"> The tuple matching procedure assigns a similarity score to each pair of tuples in the dataset. Using this measure, we construct an undirected graph G whose vertices are the tuples; two vertices are connected by an edge if the similarity between their underlying tuples exceeds a certain threshold. We noticed that the constructed graph consists of a set of semi-isolated groups, as shown in Figure 3. These groups have a very large number of intra-group edges and a rather small number of inter-group edges. This implies that a graph clustering algorithm can eliminate the weak inter-group edges and produce separate groups, or clusters, representing similar tuples. We used the Markov Cluster Algorithm (MCL) for graph clustering (Dongen, 2000). MCL is a fast and scalable unsupervised clustering algorithm for graphs based on simulation of stochastic flow.</Paragraph>
<Paragraph position="3"> A bipartite graph of patterns and tuple clusters is then constructed, and weights are assigned to patterns and tuple clusters by iteratively applying the HITS-with-priors algorithm. Instances associated with highly ranked patterns are added to the training data and the model is retrained. Samples of highly ranked patterns and the corresponding matching text are shown in Table 2.</Paragraph> </Section> </Section> </Paper>
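A sketch of the tuple-similarity computation using NLTK's WordNet interface; taking the maximum Wu-Palmer score over noun sense pairs is my assumption, since the paper does not specify how senses are selected:

```python
# Sketch: Wu-Palmer tuple similarity with NLTK (run nltk.download("wordnet") once).
from nltk.corpus import wordnet as wn

def word_similarity(w1, w2):
    """Best Wu-Palmer similarity over all noun sense pairs of the two words."""
    best = 0.0
    for s1 in wn.synsets(w1, pos=wn.NOUN):
        for s2 in wn.synsets(w2, pos=wn.NOUN):
            best = max(best, s1.wup_similarity(s2) or 0.0)
    return best

def tuple_similarity(t1, t2):
    """S_T = (S_E1 + S_E2) / 2, per equation (7)."""
    s_e1 = word_similarity(t1["entity1"], t2["entity1"])
    s_e2 = word_similarity(t1["entity2"], t2["entity2"])
    return (s_e1 + s_e2) / 2.0

t1 = {"entity1": "president", "entity2": "country", "relation": "leader of"}
t2 = {"entity1": "minister", "entity2": "nation", "relation": "official of"}
print(tuple_similarity(t1, t2))   # similarity in [0, 1]
```

Pairs scoring above the threshold would then become edges of G, and a graph clustering algorithm such as MCL would partition the result into the tuple clusters used in the induction step.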