<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1037">
  <Title>Exploring Syntactic Features for Relation Extraction using a Convolution Tree Kernel</Title>
  <Section position="3" start_page="0" end_page="288" type="metho">
    <SectionTitle>
5 ACE relation types.
</SectionTitle>
    <Paragraph position="0"> The rest of the paper is organized as follows. In Section 2, we review the previous work. Section 3 discusses our tree kernel based learning algorithm.</Paragraph>
    <Paragraph position="1">  Section 4 shows the experimental results and compares our work with the related work. We conclude our work in Section 5.</Paragraph>
  </Section>
  <Section position="4" start_page="288" end_page="289" type="metho">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> The task of relation extraction was introduced as a part of the Template Element task in MUC6 and formulated as the Template Relation task in MUC7 (MUC, 1987-1998).</Paragraph>
    <Paragraph position="1"> Miller et al. (2000) address the task of relation extraction from the statistical parsing viewpoint. They integrate various tasks such as POS tagging, NE tagging, template extraction and relation extraction into a generative model. Their results essentially depend on the entire full parse tree.</Paragraph>
    <Paragraph position="2"> Kambhatla (2004) employs Maximum Entropy models to combine diverse lexical, syntactic and semantic features derived from the text for relation extraction. Zhou et al. (2005) explore various features in relation extraction using SVM. They conduct exhaustive experiments to investigate the incorporation and the individual contribution of diverse features. They report that chunking information contributes to most of the performance improvement from the syntactic aspect.</Paragraph>
    <Paragraph position="3"> The features used in Kambhatla (2004) and Zhou et al. (2005) have to be selected and carefully calibrated manually. Kambhatla (2004) use the path of non-terminals connecting two mentions in a parse tree as the parse tree features. Besides, Zhou et al. (2005) introduce additional chunking features to enhance the parse tree features. However, the hierarchical structured information in the parse trees is not well preserved in their parse tree-related features.</Paragraph>
    <Paragraph position="4"> As an alternative to the feature-based methods, kernel methods (Haussler, 1999) have been proposed to implicitly explore features in a high dimensional space by employing a kernel function to calculate the similarity between two objects directly. In particular, the kernel methods could be very effective at reducing the burden of feature engineering for structured objects in NLP research (Culotta and Sorensen, 2004). This is because a kernel can measure the similarity between two discrete structured objects directly using the original representation of the objects instead of explicitly enumerating their features.</Paragraph>
    <Paragraph position="5"> Zelenko et al. (2003) develop a tree kernel for relation extraction. Their tree kernel is recursively defined in a top-down manner, matching nodes from roots to leaf nodes. For each pair of matching nodes, a subsequence kernel on their child nodes is invoked, which matches either contiguous or sparse subsequences of node. Culotta and Sorensen (2004) generalize this kernel to estimate similarity between dependency trees. One may note that their tree kernel requires the matchable nodes must be at the same depth counting from the root node. This is a strong constraint on the matching of syntax so it is not surprising that the model has good precision but very low recall on the ACE corpus (Zhao and Grishman, 2005). In addition, according to the top-down node matching mechanism of the kernel, once a node is not matchable with any node in the same layer in another tree, all the sub-trees below this node are discarded even if some of them are matchable to their counterparts in another tree.</Paragraph>
    <Paragraph position="6"> Bunescu and Mooney (2005) propose a shortest path dependency kernel for relation extraction.</Paragraph>
    <Paragraph position="7"> They argue that the information to model a relationship between entities is typically captured by the shortest path between the two entities in the dependency graph. Their kernel is very straightforward. It just sums up the number of common word classes at each position in the two paths. We notice that one issue of this kernel is that they limit the two paths must have the same length, otherwise the kernel similarity score is zero. Therefore, although this kernel shows non-trivial performance improvement than that of Culotta and Sorensen (2004), the constraint makes the two dependency kernels share the similar behavior: good precision but much lower recall on the ACE corpus.</Paragraph>
    <Paragraph position="8"> Zhao and Grishman (2005) define a feature-based composite kernel to integrate diverse features. Their kernel displays very good performance on the 2004 version of ACE corpus. Since this is a feature-based kernel, all the features used in the kernel have to be explicitly enumerated. Similar with the feature-based method, they also represent the tree feature as a link path between two entities.</Paragraph>
    <Paragraph position="9"> Therefore, we wonder whether their performance improvement is mainly due to the explicitly incorporation of diverse linguistic features instead of the kernel method itself.</Paragraph>
    <Paragraph position="10"> The above discussion suggests that the syntactic features in a parse tree may not be fully utilized in the previous work, whether feature-based or kernel-based. We believe that the syntactic tree features could play a more important role than that  reported in the previous work. Since convolution kernels aim to capture structural information in terms of sub-structures, which providing a viable alternative to flat features, in this paper, we propose to use a convolution tree kernel to explore syntactic features for relation extraction. To our knowledge, convolution kernels have not been explored for relation extraction</Paragraph>
    <Paragraph position="12"/>
  </Section>
  <Section position="5" start_page="289" end_page="289" type="metho">
    <SectionTitle>
3 Tree Kernels for Relation Extraction
</SectionTitle>
    <Paragraph position="0"> In this section, we discuss the convolution tree kernel associated with different relation feature spaces. In Subsection 3.1, we define seven different relation feature spaces over parse trees. In Sub-section 3.2, we introduce a convolution tree kernel for relation extraction. Finally we compare our method with the previous work in Subsection 3.3.</Paragraph>
    <Section position="1" start_page="289" end_page="289" type="sub_section">
      <SectionTitle>
3.1 Relation Feature Spaces
</SectionTitle>
      <Paragraph position="0"> In order to study which relation feature spaces (i.e., which portion of parse trees) are optimal for relation extraction, we define seven different relation feature spaces as follows (as shown in Figure 1):</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="289" end_page="291" type="metho">
    <SectionTitle>
(1) Minimum Complete Tree (MCT):
</SectionTitle>
    <Paragraph position="0"> It is the complete sub-tree rooted by the node of the nearest common ancestor of the two entities under consideration.</Paragraph>
    <Paragraph position="1"> (2) Path-enclosed Tree (PT): It is the smallest common sub-tree including the two entities. In other words, the sub-tree is enclosed by the shortest path linking the two entities in the parse tree (this path is also typically used as the path tree features in the feature-based methods). null  (3) Chunking Tree (CT):  It is the base phrase list extracted from the PT. We prune out all the internal structures of the PT and only keep the root node and the base phrase list for generating the chunking tree.</Paragraph>
    <Paragraph position="2">  Convolution kernels were proposed as a concept of kernels for a discrete structure by Haussler (1999) in machine learning study. This framework defines a kernel between input objects by applying convolution &amp;quot;sub-kernels&amp;quot; that are the kernels for the decompositions (parts) of the objects. Convolution kernels are abstract concepts, and the instances of them are determined by the definition of &amp;quot;sub-kernels&amp;quot;. The Tree Kernel (Collins and Duffy, 2001), String Subsequence Kernel (SSK) (Lodhi et al., 2002) and Graph Kernel (HDAG Kernel) (Suzuki et al., 2003) are examples of convolution kernels instances in the NLP field.</Paragraph>
    <Paragraph position="3">  (4) Context-Sensitive Path Tree (CPT):  It is the PT extending with the 1 st left sibling of the node of entity 1 and the 1 st right sibling of the node of entity 2. If the sibling is unavailable, then we move to the parent of current node and repeat the same process until the sibling is available or the root is reached.</Paragraph>
    <Paragraph position="4">  (5) Context-Sensitive Chunking Tree (CCT):  It is the CT extending with the 1 st left sibling of the node of entity 1 and the 1 st right sibling of the node of entity 2. If the sibling is unavailable, the same process as generating the CPT is applied. Then we do a further pruning process to guarantee that the context structures of the CCT is still a list  of base phrases.</Paragraph>
    <Paragraph position="5"> (6) Flattened PT (FPT):  We define two criteria to flatten the PT in order to generate the Flattened Parse tree: if the in and out arcs of a non-terminal node (except POS node) are both single, the node is to be removed; if a node has the same phrase type with its father node, the node is also to be removed.</Paragraph>
    <Paragraph position="6"> (7) Flattened CPT (FCPT): We use the above two criteria to flatten the CPT tree to generate the Flattened CPT.</Paragraph>
    <Paragraph position="7"> Figure 1 in the next page illustrates the different sub-tree structures for a relation instance in sentence &amp;quot;Akyetsu testified he was powerless to stop the merger of an estimated 2000 ethnic Tutsi's in the district of Tawba.&amp;quot;. The relation instance is an example excerpted from the ACE corpus, where an ACE-defined relation &amp;quot;AT.LOCATED&amp;quot; exists between the entities &amp;quot;Tutsi's&amp;quot; (PER) and &amp;quot;district&amp;quot; (GPE).</Paragraph>
    <Paragraph position="8"> We use Charniak's parser (Charniak, 2001) to parse the example sentence. Due to space limitation, we do not show the whole parse tree of the entire sentence here. Tree T  in Figure 1 is the MCT of the relation instance example, where the sub-structure circled by a dashed line is the PT. For clarity, we re-draw the PT as in T  . The only difference between the MCT and the PT lies in that the MCT does not allow the partial production rules. For instance, the most-left two-layer sub-tree  we can test whether the sub-structures with partial production rules as in T  will decrease performance. T  is the CT. By comparing the performance of T  and T  , we want to study whether the chunking information or the parse tree is more effective  for relation extraction. T  is the CPT, where the two structures circled by dashed lines are the so-called context structures. T  is the CCT, where the additional context structures are also circled by dashed lines. We want to study if the limited context information in the CPT and the CCT can help boost performance. Moreover, we illustrate the other two flattened trees in T  and T  . The two circled nodes in T  are removed in the flattened trees. We want to study if the eliminated small structures are noisy features for relation extraction.</Paragraph>
    <Section position="1" start_page="290" end_page="291" type="sub_section">
      <SectionTitle>
3.2 The Convolution Tree Kernel
</SectionTitle>
      <Paragraph position="0"> Given the relation instances defined in the previous section, we use the same convolution tree kernel as the parse tree kernel (Collins and Duffy, 2001) and the semantic kernel (Moschitti, 2004). Generally, we can represent a parse tree T by a vector of integer counts of each sub-tree type (regardless of its ancestors): ()Tph =(# of sub-trees of type 1, ..., # of sub-trees of type i, ..., # of sub-trees of type n) This results in a very high dimensionality since the number of different sub-trees is exponential in its size. Thus it is computational infeasible to directly use the feature vector ()Tph . To solve the compu- null 2000 ethnic Tutsi's in the district of Tawba.&amp;quot;, where the phrase type &amp;quot;E1-O-PER&amp;quot; denotes that the current phrase is the 1 st entity, its entity type is &amp;quot;PERSON&amp;quot; and its mention level is &amp;quot;NOMIAL&amp;quot;, and likewise for the other two phrase types &amp;quot;E2-O-GPE&amp;quot; and &amp;quot;E-N-GPE&amp;quot;.  tational issue, we introduce the tree kernel function which is able to calculate the dot product between the above high dimensional vectors efficiently. The kernel function is defined as follows:</Paragraph>
      <Paragraph position="2"/>
      <Paragraph position="4"> (n) is the indicator function that is 1 iff a sub-tree of type i occurs with root at node n and zero otherwise. Collins and Duffy (2002) show that  (, )KT T is an instance of convolution kernels over tree structures, and which can be computed in  child of node n and l ( 01l&lt;&lt;) is the decay factor in order to make the kernel value less variable with respect to the tree sizes.</Paragraph>
    </Section>
    <Section position="2" start_page="291" end_page="291" type="sub_section">
      <SectionTitle>
3.3 Comparison with Previous Work
</SectionTitle>
      <Paragraph position="0"> It would be interesting to review the differences between our method and the feature-based methods. The basic difference between them lies in the relation instance representation and the similarity calculation mechanism. A relation instance in our method is represented as a parse tree while it is represented as a vector of features in the feature-based methods. Our method estimates the similarity between two relation instances by only counting the number of sub-structures that are in common while the feature methods calculate the dot-product between the feature vectors directly.</Paragraph>
      <Paragraph position="1"> The main difference between them is the different feature spaces. By the kernel method, we implicitly represent a parse tree by a vector of integer counts of each sub-structure type. That is to say, we consider the entire sub-structure types and their occurring frequencies. In this way, on the one hand, the parse tree-related features in the flat feature set  are embedded in the feature space of our method: &amp;quot;Base Phrase Chunking&amp;quot; and &amp;quot;Parse Tree&amp;quot; features explicitly appear as substructures of a parse tree. A few of entity-related features in the flat feature set are also captured by our feature space: &amp;quot;entity type&amp;quot; and &amp;quot;mention level&amp;quot; explicitly appear as phrase types in a parse tree. On the other hand, the other features in the flat feature set, such as &amp;quot;word features&amp;quot;, &amp;quot;bigram word features&amp;quot;, &amp;quot;overlap&amp;quot; and &amp;quot;dependency tree&amp;quot; are not contained in our feature space. From the syntactic viewpoint, the tree representation in our feature space is more robust than &amp;quot;Parse Tree Path&amp;quot; feature in the flat feature set since the path feature is very sensitive to the small changes of parse trees (Moschitti, 2004) and it also does not maintain the hierarchical information of a parse tree. Due to the extensive exploration of syntactic features by kernel, our method is expected to show better performance than the previous feature-based methods.</Paragraph>
      <Paragraph position="2"> It is also worth comparing our method with the previous relation kernels. Since our method only counts the occurrence of each sub-tree without considering its ancestors, our method is not limited by the constraints in Culotta and Sorensen (2004) and that in Bunescu and Mooney (2005) as discussed in Section 2. Compared with Zhao and Grishman's kernel, our method directly uses the original representation of a parse tree while they flatten a parse tree into a link and a path. Given the above improvements, our method is expected to outperform the previous relation kernels.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="291" end_page="293" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> The aim of our experiment is to verify the effectiveness of using richer syntactic structures and the convolution tree kernel for relation extraction.</Paragraph>
    <Section position="1" start_page="291" end_page="292" type="sub_section">
      <SectionTitle>
4.1 Experimental Setting
</SectionTitle>
      <Paragraph position="0"> Corpus: we use the official ACE corpus for 2003 evaluation from LDC as our test corpus. The ACE corpus is gathered from various newspaper, news-wire and broadcasts. The same as previous work  For the convenience of discussion, without losing generality, we call the features used in Zhou et al. (2005) and Kambhatla (2004) flat feature set.</Paragraph>
      <Paragraph position="1">  (Zhou et al., 2005), our experiments are carried out on explicit relations due to the poor inter-annotator agreement in annotation of implicit relations and their limited numbers. The training set consists of 674 annotated text documents and 9683 relation instances. The test set consists of 97 documents and 1386 relation instances. The 2003 evaluation defined 5 types of entities: Persons, Organizations, Locations, Facilities and GPE. Each mention of an entity is associated with a mention type: proper name, nominal or pronoun. They further defined 5 major relation types and 24 subtypes: AT (Base-In, Located...), NEAR (Relative-Location), PART (Part-of, Subsidiary ...), ROLE (Member, Owner ...) and SOCIAL (Associate, Parent...). As previous work, we explicitly model the argument order of the two mentions involved. We thus model relation extraction as a multi-class classification problem with 10 classes on the major types (2 for each relation major type and a &amp;quot;NONE&amp;quot; class for non-relation (except 1 symmetric type)) and 43 classes on the subtypes (2 for each relation subtype and a &amp;quot;NONE&amp;quot; class for non-relation (except 6 symmetric subtypes)). In this paper, we only measure the performance of relation extraction models on &amp;quot;true&amp;quot; mentions with &amp;quot;true&amp;quot; chaining of coreference (i.e. as annotated by LDC annotators).</Paragraph>
      <Paragraph position="2"> Classifier: we select SVM as the classifier used in this paper since SVM can naturally work with kernel methods and it also represents the state-of-the-art machine learning algorithm. We adopt the one vs. others strategy and select the one with largest margin as the final answer. The training parameters are chosen using cross-validation (C=2.4 (SVM); l =0.4(tree kernel)). In our implementation, we use the binary SVMLight developed by Joachims (1998) and Tree Kernel Toolkits developed by Moschitti (2004).</Paragraph>
      <Paragraph position="3"> Kernel Normalization: since the size of a parse tree is not constant, we normalize  Charniak parser and iterate over all pair of mentions occurring in the same sentence to generate potential instances. We find the negative samples are 10 times more than the positive samples. Thus data imbalance and sparseness are potential problems. Recall (R), Precision (P) and F-measure (F) are adopted as the performance measure.</Paragraph>
    </Section>
    <Section position="2" start_page="292" end_page="293" type="sub_section">
      <SectionTitle>
4.2 Experimental Results
</SectionTitle>
      <Paragraph position="0"> In order to study the impact of the sole syntactic structure information embedded in parse trees on relation extraction, we remove the entity information from parse trees by replacing the entity-related phrase type (&amp;quot;E1-O-PER&amp;quot;, etc., in Figure 1) with &amp;quot;NP&amp;quot;. Then we carry out a couple of preliminary experiments on the test set using parse trees regardless of entity information.</Paragraph>
      <Paragraph position="1">  spaces over the 5 ACE major types using parse tree information only Table 1 reports the performance of our defined seven relation feature spaces over the 5 ACE major types using parse tree information regardless of any entity information. This preliminary experiments show that: * Overall the tree kernel over different relation feature spaces is effective for relation extraction since we use the parse tree information only. We will report the detailed performance comparison results between our method and previous work later in this section.</Paragraph>
      <Paragraph position="2"> * Using the PTs achieves the best performance.</Paragraph>
      <Paragraph position="3"> This means the portion of a parse tree enclosed by the shortest path between entities can model relations better than other sub-trees.</Paragraph>
      <Paragraph position="4"> * Using the MCTs get the worst performance.</Paragraph>
      <Paragraph position="5"> This is because the MCTs introduce too much left and right context information, which may be noisy features, as shown in Figure 1. It suggests that only allowing complete (not partial) production rules in the MCTs does harm performance. * The performance of using CTs drops by 5 in F-measure compared with that of using the PTs.</Paragraph>
      <Paragraph position="6"> This suggests that the middle and high-level structures beyond chunking is also very useful for relation extraction.</Paragraph>
      <Paragraph position="7">  * The context-sensitive trees show lower performance than the corresponding original PTs and CTs. In some cases (e.g. in sentence &amp;quot;the merge of company A and company B....&amp;quot;, &amp;quot;merge&amp;quot; is the context word), the context information is helpful. However the effective scope of context is hard to determine.</Paragraph>
      <Paragraph position="8"> * The two flattened trees perform worse than the original trees, but better than the corresponding context-sensitive trees. This suggests that the removed structures by the flattened trees contribute non-trivial performance improvement.</Paragraph>
      <Paragraph position="9"> In the above experiments, the path-enclosed tree displays the best performance among the seven feature spaces when using the parse tree structural information only. In the following incremental experiments, we incorporate more features into the path-enclosed parse trees and it shows significant performance improvement.</Paragraph>
      <Paragraph position="10">  major types using Path-enclosed trees enhanced with more features in nodes. The 1 st row is the baseline performance using structural information only. We then integrate entity information, including Entity type and Mention level features, into the corresponding nodes as shown in Figure 1. The 2 nd row in Table 2 reports the performance of this setup. Besides the entity information, we further incorporate the semantic features used in Zhou et al. (2005) into the corresponding leaf nodes. The  rd row in Table 2 reports the performance of this setup. Please note that in the 2 nd and 3 rd setups, we still use the same tree kernel function with slight modification on the rule (2) in calculating  (, )nn[?] (see subsection 3.2) to make it consider more features associated with each individual node:  (, ) n n feature weight l[?]= x. From Table 2, we can see that the basic feature of entity information is quite useful, which largely boosts performance by 7 in F-measure. The final performance of our tree kernel method for relation extraction is 76.32/62.99/69.02 in precision/recall/F-measure over the 5 ACE major types.  parentheses report the performance over the 24 ACE subtypes while the numbers outside parentheses is for the 5 ACE major types Table 3 compares the performance of different methods on the ACE corpus  . It shows that our method achieves the best-reported performance on both the 24 ACE subtypes and the 5 ACE major types. It also shows that our tree kernel method significantly outperform the previous two dependency kernel algorithms by 16 in F-measure on the</Paragraph>
    <Paragraph position="0"> . This may be due to two reasons: one reason is that the dependency tree lacks the hierarchical syntactic information, and another reason is due to the two constraints of the two dependency kernels as discussed in Section 2 and Subsection 3.3. The performance improvement by our method suggests that the convolution tree kernel can explore the syntactic features (e.g. parse tree structures and entity information) very effectively and the syntactic features are also particu- null Zhao and Grishman (2005) also evaluated their algorithm on the ACE corpus and got good performance. But their experimental data is for 2004 evaluation, which defined 7 entity types with 44 entity subtypes, and 7 relation major types with 27 subtypes, so we are not ready to compare with each other.  Bunescu and Mooney (2005) used the ACE 2002 corpus, including 422 documents, which is known to have many inconsistencies than the 2003 version. Culotta and Sorensen (2004) used an ACE corpus including about 800 documents, and they did not specify the corpus version. Since the testing corpora are in different sizes and versions, strictly speaking, it is not ready to compare these methods exactly and fairly. Thus Table 3 is only for reference purpose. We just hope that we can get a few clues from this table.</Paragraph>
    <Paragraph position="1">  larly effective for the task of relation extraction. In addition, we observe from Table 1 that the feature space selection (the effective portion of a parse tree) is also critical to relation extraction.</Paragraph>
    <Paragraph position="2">  Finally, Table 4 reports the error distribution in the case of the 3 rd experiment in Table 2. It shows that 85.9% (587/684) of the errors result from relation detection and only 14.1% (97/684) of the errors result from relation characterization. This is mainly due to the imbalance of the positive/negative instances and the sparseness of some relation types on the ACE corpus.</Paragraph>
  </Section>
class="xml-element"></Paper>