<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1016">
  <Title>Modeling Commonality among Related Classes in Relation Extraction</Title>
  <Section position="5" start_page="121" end_page="122" type="metho">
    <SectionTitle>
ACE RDC 2003 corpus
</SectionTitle>
    <Paragraph position="0"> Bunescu and Mooney (2005a) proposed a shortest path dependency kernel. They argued that the information needed to model a relationship between two entities can typically be captured by the shortest path between them in the dependency graph. It achieved an F-measure of 52.5 on the 5 relation types in the ACE RDC 2003 corpus. Bunescu and Mooney (2005b) proposed a subsequence kernel and applied it in protein interaction and ACE relation extraction tasks. (The ACE RDC 2003 corpus defines 5/24 relation types/subtypes between 4 entity types.)</Paragraph>
    <Paragraph position="1"> Zhang et al. (2005) adopted clustering algorithms in unsupervised relation extraction using tree kernels. To overcome the data sparseness problem, various scales of sub-trees are applied in the tree kernel computation. Although tree kernel-based approaches are able to explore the huge implicit feature space without much feature engineering, further research is necessary to make them effective and efficient. In comparison, feature-based approaches have achieved much success recently. Roth and Yih (2002) used the SNoW classifier to incorporate various features such as word, part-of-speech and semantic information from WordNet, and proposed a probabilistic reasoning approach to integrate named entity recognition and relation extraction. Kambhatla (2004) employed maximum entropy models with features derived from word, entity type, mention level, overlap, dependency tree and parse tree, and achieved an F-measure of 52.8 on the 24 relation subtypes in the ACE RDC 2003 corpus. Zhao and Grishman (2005) combined various kinds of knowledge from tokenization, sentence parsing and deep dependency analysis through support vector machines and achieved an F-measure of 70.1 on the 7 relation types of the ACE RDC 2004 corpus.</Paragraph>
    <Paragraph position="2"> Zhou et al. (2005) further systematically explored diverse lexical, syntactic and semantic features through support vector machines and achieved F-measures of 68.1 and 55.5 on the 5 relation types and the 24 relation subtypes in the ACE RDC 2003 corpus respectively. To overcome the data sparseness problem, feature-based approaches normally incorporate various scales of contexts into the feature vector extensively. These approaches then depend on the adopted learning algorithm to weight and combine the features effectively. For example, an exponential model and a linear model are applied in maximum entropy models and support vector machines respectively to combine the features via the learned weight vector.</Paragraph>
    <Paragraph position="3"> In summary, although various approaches have been employed in relation extraction, they only implicitly attack the data sparseness problem, either by using features of different contexts in feature-based approaches or by including different sub-structures in kernel-based approaches. (Here, we classify Zhao and Grishman (2005) as a feature-based approach since the feature space in their kernels can be easily represented by an explicit feature vector.)</Paragraph>
    <Paragraph position="4"> Until now, there has been no explicit way to capture the hierarchical topology in relation extraction. All current approaches apply a flat learning strategy, which treats training examples in different relations equally and independently and ignores the commonality among different relations. This paper proposes a novel hierarchical learning strategy to resolve this problem by considering the relatedness among different relations and capturing the commonality among related relations. By doing so, the data sparseness problem can be better dealt with and much better performance can be achieved, especially for those relations with only small amounts of annotated examples.</Paragraph>
  </Section>
  <Section position="6" start_page="122" end_page="124" type="metho">
    <SectionTitle>
3 Hierarchical Learning Strategy
</SectionTitle>
    <Paragraph position="0"> Traditional classifier learning approaches apply the flat learning strategy. That is, they treat training examples in different classes equally and independently and ignore the commonality among related classes. The flat strategy causes no problem when there is a large number of training examples for each class, since, in this case, a classifier learning approach can always learn a nearly optimal discriminative function for each class against the remaining classes. However, such a flat strategy may cause serious problems when only a small number of training examples are available for some of the classes. In this case, a classifier learning approach may fail to learn a reliable (or nearly optimal) discriminative function for a class with few training examples and, as a result, may significantly degrade the performance on that class or even the overall performance.</Paragraph>
    <Paragraph position="1"> To overcome the inherent problems of the flat strategy, this paper proposes a hierarchical learning strategy which explores the inherent commonality among related classes through a class hierarchy. In this way, the training examples of related classes can help in learning a reliable discriminative function for a class with only a small amount of training examples. To reduce computation time and memory requirements, we consider only linear classifiers and apply the simple and widely used perceptron algorithm for this purpose, leaving more options open for future research. In the following, we first introduce the perceptron algorithm for linear classifier learning, followed by the hierarchical learning strategy using the perceptron algorithm. Finally, we consider several ways of building the class hierarchy.</Paragraph>
    <Section position="1" start_page="122" end_page="123" type="sub_section">
      <SectionTitle>
3.1 Perceptron Algorithm
</SectionTitle>
      <Paragraph position="0"> Figure 1: The perceptron algorithm. 1. Receive the instance $x_t \in R^n$. 2. Compute the output $o_t = w_t \cdot x_t$. 3. Give the prediction $\hat{y}_t = sign(o_t)$. 4. Receive the desired label $y_t \in \{-1, +1\}$. 5. Update the hypothesis according to $w_{t+1} = w_t + y_t x_t$ if $\hat{y}_t \neq y_t$, and $w_{t+1} = w_t$ otherwise. (1)</Paragraph>
      <Paragraph position="2"> This section first deals with binary classification using linear classifiers. Assume an instance space $R^n$ and the label set $\{-1, +1\}$; linear classifier learning attempts to find a weight vector $w$ that achieves a positive margin on as many training examples as possible.</Paragraph>
      <Paragraph position="3">  The training example sequence is fed N times for better performance. Moreover, this number controls the maximal effect a single training example can have. This is similar to the regularization parameter C in SVMs, which controls the trade-off between complexity and the proportion of non-separable examples. As a result, it can be used to control over-fitting and robustness.</Paragraph>
      <Paragraph position="4">  The well-known perceptron algorithm, as shown in Figure 1, belongs to online learning of linear classifiers, where the learning algorithm represents its t-th hypothesis by a weight vector $w_t \in R^n$. At trial t it receives an instance $x_t \in R^n$, computes the output $o_t = w_t \cdot x_t$, gives the prediction $\hat{y}_t = sign(o_t)$ and then receives the desired label $y_t$. What distinguishes different online algorithms is how they update $w_t$ into $w_{t+1}$ based on the example $(x_t, y_t)$ received at trial t. In particular, the perceptron algorithm updates the hypothesis by adding a scalar multiple of the instance, as shown in Equation 1 of Figure 1, when there is an error. Normally, the traditional perceptron algorithm initializes the hypothesis as the zero vector $w_1 = 0$. This is usually the most natural choice, lacking any other preference.</Paragraph>
      <Paragraph position="10"> Smoothing. In order to further improve the performance, we iteratively feed the training examples to obtain a possibly better discriminative function. In this paper, we set the maximal iteration number to 10 for both efficiency and stable performance, and the final weight vector of the discriminative function is averaged over those of the discriminative functions in the last few iterations (e.g. the last 5 in this paper).</Paragraph>
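The online update in Figure 1, combined with the iterative feeding and averaging ("smoothing") step just described, can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the data layout (a list of (x, y) pairs with y in {-1, +1}) are assumptions.

```python
import numpy as np

def train_averaged_perceptron(examples, dim, n_iter=10, avg_last=5):
    """Minimal sketch: the perceptron of Figure 1 plus averaging over the
    weight vectors of the last few iterations, as described in the text."""
    w = np.zeros(dim)                         # hypothesis initialized as the zero vector
    snapshots = []
    for _ in range(n_iter):                   # feed the training sequence N times
        for x, y in examples:
            o = np.dot(w, x)                  # compute the output o_t = w_t . x_t
            y_hat = 1.0 if o >= 0 else -1.0   # prediction sign(o_t)
            if y_hat != y:                    # on error, add a scalar multiple
                w = w + y * x                 # of the instance (Equation 1)
        snapshots.append(w.copy())
    return np.mean(snapshots[-avg_last:], axis=0)   # averaged final weight vector
```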
      <Paragraph position="11"> Bagging One more problem with any online classifier learning algorithm, including the perceptron algorithm, is that the learned discriminative function somewhat depends on the feeding order of the training examples. In order to eliminate such dependence and further improve the performance, an ensemble technique, called bagging (Breiman 1996), is applied in this paper. In bagging, the bootstrap technique is first used to build M (e.g. 10 in this paper) replicate sample sets by randomly re-sampling with replacement from the given training set repeatedly. Then, each training sample set is used to train a certain discriminative function. Finally, the final weight vector in the discriminative function is averaged over those of the M discriminative functions in the ensemble.</Paragraph>
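A sketch of the bagging step under the same assumptions; the training routine is passed in as a parameter (for instance the averaged-perceptron sketch above), and the bag count of 10 follows the text.

```python
import random
import numpy as np

def bagged_weights(examples, dim, train_fn, n_bags=10, seed=0):
    """Sketch of bagging (Breiman 1996): build M bootstrap replicates of the
    training set by re-sampling with replacement, train one discriminative
    function on each, and average the resulting weight vectors."""
    rng = random.Random(seed)
    weights = []
    for _ in range(n_bags):
        sample = [rng.choice(examples) for _ in range(len(examples))]  # bootstrap replicate
        weights.append(train_fn(sample, dim))
    return np.mean(weights, axis=0)           # final weight vector: ensemble average
```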
    </Section>
    <Section position="2" start_page="123" end_page="123" type="sub_section">
      <SectionTitle>
Multi-Class Classification
</SectionTitle>
      <Paragraph position="0"> Basically, the perceptron algorithm is only for binary classification. Therefore, we must extend it to multi-class classification, such as the ACE RDC task. For efficiency, we apply the one vs. others strategy, which builds K classifiers so as to separate one class from all the others. However, the outputs of the perceptron algorithms for different classes may not be directly comparable, since any positive scalar multiple of the weight vector does not affect the actual prediction of a perceptron algorithm. For comparability, we map the perceptron algorithm output into a probability by using an additional sigmoid model:</Paragraph>
      <Paragraph position="1"> $p = \frac{1}{1 + \exp(Af + B)}$</Paragraph>
      <Paragraph position="2"> where $f = w \cdot x$ is the output of a perceptron algorithm and the coefficients A and B are trained using the model-trust algorithm described in Platt (1999). The final decision for an instance in multi-class classification is determined by the class which has the maximal probability from the corresponding perceptron algorithm.</Paragraph>
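Under these definitions, the decision rule can be sketched as below; the per-class sigmoid coefficients A and B are assumed to have been fitted already with Platt's model-trust procedure (not reproduced here), and the container layout is an illustrative assumption.

```python
import math
import numpy as np

def predict_class(x, classifiers):
    """Sketch of the one vs. others decision: `classifiers` maps each class
    label to (w, A, B).  The perceptron output f = w . x is mapped to the
    probability 1 / (1 + exp(A*f + B)) and the class with the maximal
    probability wins."""
    best_label, best_prob = None, -1.0
    for label, (w, A, B) in classifiers.items():
        f = float(np.dot(w, x))
        prob = 1.0 / (1.0 + math.exp(A * f + B))
        if prob > best_prob:
            best_label, best_prob = label, prob
    return best_label
```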
    </Section>
    <Section position="3" start_page="123" end_page="124" type="sub_section">
      <SectionTitle>
3.2 Hierarchical Learning Strategy using the
Perceptron Algorithm
</SectionTitle>
      <Paragraph position="0"> Assume we have a class hierarchy for a task, e.g.</Paragraph>
      <Paragraph position="1"> the one in the ACE RDC 2003 corpus as shown in Table 1 of Section 4.1. The hierarchical learning strategy explores the inherent commonality among related classes in a top-down way. For each class in the hierarchy, a linear discriminative function is determined top-down, with the lower-level weight vector derived from the upper-level weight vector iteratively. This is done by initializing the weight vector used in training the linear discriminative function of a lower-level class to that of its upper-level class. That is, the lower-level discriminative function has a preference toward the discriminative function of its upper-level class. As an example, consider the training of the "Located" relation subtype in the class hierarchy shown in Table 1: 1) Train the weight vector of the linear discriminative function for the "YES" relation vs. the "NON" relation, with the weight vector initialized as the zero vector.</Paragraph>
      <Paragraph position="2"> 2) Train the weight vector of the linear discriminative function for the "AT" relation type vs. all the remaining relation types (including the "NON" relation), with the weight vector initialized as the one learned for the "YES" relation vs. the "NON" relation. 3) Train the weight vector of the linear discriminative function for the "Located" relation subtype vs. all the remaining relation subtypes under all the relation types (including the "NON" relation), with the weight vector initialized as the one learned for the "AT" relation type vs. all the remaining relation types.</Paragraph>
      <Paragraph position="3"> 4) Return the above trained weight vector as the discriminative function for the "Located" relation subtype.</Paragraph>
      <Paragraph position="4"> In this way, the training examples in different classes are not treated independently any more, and the commonality among related classes can be captured via the hierarchical learning strategy. The intuition behind this strategy is that the upper-level class normally has more positive training examples than the lower-level class so that the corresponding linear discriminative function can be determined more reliably. In this way, the training examples of related classes can help in learning a reliable discriminative function for a class with only a small amount of training examples in a top-down way and thus alleviate its data sparseness problem.</Paragraph>
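The top-down initialization described above can be sketched as follows. The hierarchy layout, the hypothetical examples_for helper that selects the one-vs.-rest training pairs for a class (including the "NON" relation), and the simple un-averaged perceptron loop are all illustrative assumptions.

```python
import numpy as np

def train_hierarchy(class_tree, examples_for, dim, root="YES", n_iter=10):
    """Sketch of the hierarchical learning strategy: walk the class hierarchy
    top-down and initialize each class's weight vector with the one learned
    for its parent.  `class_tree` maps a class to its children, e.g.
    {"YES": ["AT", ...], "AT": ["Located", ...]}; `examples_for(c)` returns
    (x, y) pairs for class c vs. all remaining classes."""
    weights = {}

    def train_node(cls, init_w):
        w = init_w.copy()                  # preference toward the upper-level function
        for _ in range(n_iter):
            for x, y in examples_for(cls):
                if (1.0 if np.dot(w, x) >= 0 else -1.0) != y:
                    w = w + y * x          # standard perceptron update on error
        weights[cls] = w
        for child in class_tree.get(cls, []):
            train_node(child, w)           # lower level starts from the upper level

    train_node(root, np.zeros(dim))        # "YES" vs. "NON" starts from the zero vector
    return weights
```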
    </Section>
    <Section position="4" start_page="124" end_page="124" type="sub_section">
      <SectionTitle>
3.3 Building the Class Hierarchy
</SectionTitle>
      <Paragraph position="0"> We have just described the hierarchical learning strategy using a given class hierarchy. Normally, a rough class hierarchy can be given manually according to human intuition, such as the one in the ACE RDC 2003 corpus. In order to explore more commonality among sibling classes, we make use of binary hierarchical clustering of sibling classes, either at the lowest level only or at all levels. This is done by first using the flat learning strategy to learn the discriminative functions for the individual classes and then iteratively merging the two most related classes, as measured by the cosine similarity between their weight vectors, in a bottom-up way. The intuition is that related classes should have similar hyper-planes separating them from the other classes and thus similar weight vectors.</Paragraph>
      <Paragraph position="1"> * Lowest-level hybrid: Binary hierarchical clustering is only done at the lowest level while keeping the upper-level class hierarchy. That is, only sibling classes at the lowest level are hierarchically clustered.</Paragraph>
      <Paragraph position="2"> * All-level hybrid: Binary hierarchical clustering is done at all levels in a bottom-up way. That is, sibling classes at the lowest level are hierarchically clustered first, and then sibling classes at the upper levels. In this way, a binary class hierarchy can be built iteratively in a bottom-up way.</Paragraph>
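A minimal sketch of the bottom-up grouping by cosine similarity of weight vectors; the flat-strategy weight vectors are assumed to be given, and representing a merged cluster by the mean of its members' weight vectors is an assumption, not a detail specified in the paper.

```python
import numpy as np

def cluster_siblings(weight_vectors):
    """Sketch of binary hierarchical clustering of sibling classes: repeatedly
    merge the two clusters whose (averaged) flat-strategy weight vectors have
    the highest cosine similarity.  Returns a nested tuple describing the
    resulting binary hierarchy."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    clusters = {label: (label, np.asarray(w, dtype=float)) for label, w in weight_vectors.items()}
    while len(clusters) > 1:
        labels = list(clusters)
        # pick the most related pair of sibling clusters
        a, b = max(((p, q) for i, p in enumerate(labels) for q in labels[i + 1:]),
                   key=lambda pair: cosine(clusters[pair[0]][1], clusters[pair[1]][1]))
        tree_a, w_a = clusters.pop(a)
        tree_b, w_b = clusters.pop(b)
        clusters[f"({a}+{b})"] = ((tree_a, tree_b), (w_a + w_b) / 2.0)  # assumed merge representation
    return next(iter(clusters.values()))[0]
```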
    </Section>
  </Section>
</Paper>