<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1002">
  <Title>Verb subcategorization kernels for automatic semantic labeling</Title>
  <Section position="4" start_page="10" end_page="10" type="metho">
    <SectionTitle>
2 Parsing of Semantic Roles and Semantic
Arguments
</SectionTitle>
    <Paragraph position="0"> There are two main resources that relate to predicate argument structures: PropBank (PB) and FrameNet (FN). PB is a 300,000 word corpus annotated with predicative information on top of the Penn Treebank 2 Wall Street Journal texts.</Paragraph>
  </Section>
  <Section position="5" start_page="10" end_page="12" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> For any given predicate, the expected arguments are labeled sequentially from Arg0 to Arg9, ArgA and ArgM. Figure 1 shows an example of the PB predicate annotation. Predicates in PB are only embodied by verbs, whereas most of the time Arg0 is the subject, Arg1 is the direct object and ArgM may indicate locations, as in our example.</Paragraph>
    <Paragraph position="1"> FrameNet also describes predicate/argument structures, but for this purpose it uses richer semantic structures called frames. The latter are schematic representations of situations involving various participants, properties and roles, in which a word may typically be used. Frame elements or semantic roles are arguments of target words, which can be verbs, nouns or adjectives. In FrameNet, the argument names are local to the target frames.</Paragraph>
    <Paragraph position="2"> For example, assuming that attach is the target word and Attaching is the target frame, a typical sentence annotation is the following.</Paragraph>
    <Paragraph position="3"> [Agent They] attachTgt [Item themselves] [Connector with their mouthparts] and then release a digestive enzyme secretion which eats into the skin.</Paragraph>
    <Paragraph position="4"> Several machine learning approaches for argument identification and classification have been developed, e.g. (Gildea and Jurafsky, 2002; Gildea and Palmer; Gildea and Hockenmaier, 2003; Pradhan et al., 2004). Their common characteristic is the adoption of feature spaces that model predicate-argument structures in a flat feature representation. In the next section we present the common parse tree-based approach to this problem.</Paragraph>
    <Section position="1" start_page="10" end_page="11" type="sub_section">
      <SectionTitle>
2.1 Predicate Argument Extraction
</SectionTitle>
      <Paragraph position="0"> Given a sentence in natural language, all the predicates associated with the verbs have to be identified along with their arguments. This problem can be divided into two subtasks: (a) the detection of the target argument boundaries, i.e. all the words that compose it, and (b) the classification of the argument type, e.g. Arg0 or ArgM in PropBank or Agent and Goal in FrameNet.</Paragraph>
      <Paragraph position="1"> The standard approach to learning both the detection and the classification of predicate arguments is summarized by the following steps:
 1. given a sentence from the training set, generate a full syntactic parse tree;
 2. let P and A be the set of predicates and the set of parse-tree nodes (i.e. the potential arguments), respectively;
 3. for each pair &lt;p,a&gt; ∈ P × A:
 * extract the feature representation set, Fp,a;
 * if the subtree rooted in a covers exactly the words of one argument of p, put Fp,a in T+ (positive examples), otherwise put it in T− (negative examples).</Paragraph>
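The steps above can be sketched in a few lines of Python. The tree encoding (a node as a dict with a label and a children list) and the helper names are our own illustration, not part of any of the original systems:

```python
# Sketch of step 3 of the extraction algorithm (hypothetical encoding).
# A node is {"label": str, "children": [nodes]}; leaves carry the word as label.
# A node yields a positive example when its leaves match one argument span.

def leaves(node):
    """Collect the words dominated by a parse-tree node."""
    if not node["children"]:
        return [node["label"]]
    words = []
    for child in node["children"]:
        words.extend(leaves(child))
    return words

def collect_examples(predicates, nodes, arguments):
    """Split the F_{p,a} representations into T+ and T-.

    `arguments[p]` maps a predicate p to the word spans of its gold arguments.
    Here F_{p,a} is reduced to the pair (p, covered words) for illustration.
    """
    t_pos, t_neg = [], []
    for p in predicates:
        gold_spans = [tuple(span) for span in arguments[p]]
        for a in nodes:
            f_pa = (p, tuple(leaves(a)))
            if tuple(leaves(a)) in gold_spans:
                t_pos.append(f_pa)
            else:
                t_neg.append(f_pa)
    return t_pos, t_neg
```

In a real system F_{p,a} would hold the full feature set rather than the word span, but the positive/negative split works the same way.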
      <Paragraph position="2"> For instance, in Figure 1, for each combination of the predicate rent with the nodes N, S, VP, V, NP, PP, D or IN, the instances Frent,a are generated. In case the node a exactly covers "Paul", "a room" or "in Boston", it will be a positive instance, otherwise it will be a negative one, e.g. Frent,IN.</Paragraph>
      <Paragraph position="3"> The T+ and T− sets can be re-organized as positive T+argi and negative T−argi examples for each argument i. In this way, an individual ONE-vs-ALL classifier for each argument i can be trained. We adopted this solution as it is simple and effective (Pradhan et al., 2004). In the classification phase, given a sentence of the test set, all its Fp,a are generated and classified by each individual classifier Ci. As a final decision, we select the argument associated with the maximum value among the scores provided by the individual classifiers.</Paragraph>
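The ONE-vs-ALL decision just described is simply an argmax over classifier scores. In this minimal sketch the per-label scoring functions stand in for the trained classifiers Ci:

```python
# ONE-vs-ALL decision: each argument label i has a scorer C_i, and the label
# whose classifier assigns F_{p,a} the highest score wins. The scorers below
# are hypothetical stand-ins for trained SVMs.

def classify_argument(f_pa, classifiers):
    """Return the argument label whose classifier scores F_{p,a} highest."""
    return max(classifiers, key=lambda label: classifiers[label](f_pa))
```

For example, with scorers `{"Arg0": 0.9, "Arg1": 0.2, "ArgM": -0.5}` on some instance, the decision is Arg0.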
    </Section>
    <Section position="2" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
2.2 Standard feature space
</SectionTitle>
      <Paragraph position="0"> The discovery of relevant features is, as usual, a complex task; nevertheless, there is a common consensus on the basic features that should be adopted.</Paragraph>
      <Paragraph position="1"> These standard features, first proposed in (Gildea and Jurafsky, 2002), encode flat information derived from parse trees, i.e. Phrase Type, Predicate Word, Head Word, Governing Category, Position and Voice. For example, the Phrase Type indicates the syntactic type of the phrase labeled as a predicate argument, e.g. NP for Arg1 in Figure 1. The Parse Tree Path contains the path in the parse tree between the predicate and the argument phrase, expressed as a sequence of non-terminal labels linked by direction (up or down) symbols, e.g. V ↑ VP ↓ NP for Arg1 in Figure 1. The Predicate Word is the surface form of the verbal predicate, e.g. rent for all arguments.</Paragraph>
      <Paragraph position="2"> In the next section we describe the SVM approach and the basic kernel theory for the predicate argument classification.</Paragraph>
    </Section>
    <Section position="3" start_page="11" end_page="12" type="sub_section">
      <SectionTitle>
2.3 Learning with Support Vector Machines
</SectionTitle>
      <Paragraph position="0"> Given a vector space in ℝ^n and a set of positive and negative points, SVMs classify vectors according to a separating hyperplane, H(x) = w · x + b = 0, where w ∈ ℝ^n and b ∈ ℝ are learned by applying the Structural Risk Minimization principle (Vapnik, 1995).</Paragraph>
      <Paragraph position="1"> To apply the SVM algorithm to predicate argument classification, we need a function φ : F → ℝ^n to map our feature space F = {f1, .., f|F|} and our predicate/argument pair representation, Fp,a = Fz, into ℝ^n, such that Fz → φ(Fz) = (φ1(Fz), .., φn(Fz)).</Paragraph>
      <Paragraph position="3"> From kernel theory we have that H(Fz) = Σ_{i=1..l} yi αi φ(Fi) · φ(Fz) + b,</Paragraph>
      <Paragraph position="6"> where Fi, ∀i ∈ {1,..,l}, are the training instances and the product K(Fi,Fz) = φ(Fi) · φ(Fz) is the kernel function associated with the mapping φ. The simplest mapping that we can apply is φ(Fz) = z = (z1,...,zn), where zi = 1 if fi ∈ Fz and zi = 0 otherwise, i.e. the characteristic vector of the set Fz with respect to F. If we choose the scalar product as kernel function, we obtain the linear kernel KL(Fx,Fz) = x · z.</Paragraph>
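Under the characteristic-vector mapping, the linear kernel reduces to counting the features the two sets share. A minimal sketch, with illustrative feature names:

```python
# Characteristic-vector linear kernel: z_i = 1 if f_i is in F_z, else 0, so
# the scalar product of the two binary vectors counts the shared features.

def linear_kernel(f_x, f_z):
    """K_L(F_x, F_z) = x . z = number of features common to F_x and F_z."""
    return len(set(f_x).intersection(f_z))
```

No explicit vector of dimension |F| is ever built; the set intersection plays the role of the scalar product.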
      <Paragraph position="7"> Another function that has shown high accuracy for predicate argument classification (Pradhan et al., 2004) is the polynomial kernel KPoly(Fx,Fz) = (c + x · z)^d, where c is a constant and d is the degree of the polynomial. [Figure: the sentence parse tree with the structures Ftook and Fread and their Arg. 0 and Arg. 1 labels; Figure 3 shows the fragments of the arguments of Ftook of Figure 2.]</Paragraph>
      <Paragraph position="9"> The interesting property is that we do not need to evaluate the φ function to compute the above kernels; only the K(x,z) values are required. This allows us to define efficient classifiers over a huge (possibly infinite) feature set, provided that the kernel can be processed efficiently. In the next section, we introduce the convolution kernel that we used to represent subcategorization structures.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="12" end_page="13" type="metho">
    <SectionTitle>
3 Subcategorization Frame Kernel (SK)
</SectionTitle>
    <Paragraph position="0"> The convolution kernel that we experimented with was devised in (Moschitti, 2004) and is characterized by two aspects: the semantic space of the subcategorization structures and the kernel function that measures their similarity.</Paragraph>
    <Section position="1" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
3.1 Subcategorization Frame Structure (SCFS)
</SectionTitle>
      <Paragraph position="0"> We consider the predicate argument structures annotated in PropBank or FrameNet as our semantic space. As we assume that semantic structures are correlated with syntactic structures, we used a kernel that selects semantic information according to the syntactic structure of a predicate. The subparse tree which describes the subcategorization frame of the target verbal predicate defines the target Subcategorization Frame Structure (SCFS). For example, Figure 2 shows the parse tree of the sentence "John took the book and read its title" together with two SCFS structures, Ftook and Fread, associated with the two predicates took and read, respectively. Note that the SCFS also includes the external argument (i.e. the subject), although some linguistic theories do not consider it part of the SCF.</Paragraph>
      <Paragraph position="1"> Once the semantic representation is defined, we need to design a tree kernel function to estimate the similarity between our objects.</Paragraph>
    </Section>
    <Section position="2" start_page="12" end_page="13" type="sub_section">
      <SectionTitle>
3.2 The tree kernel function
</SectionTitle>
      <Paragraph position="0"> The main idea of tree kernels is to model a K(T1,T2) function which computes the number of common substructures between two trees T1 and T2. For example, Figure 3 shows all the fragments of the argument structure Ftook (see Figure 2) which will be matched against the fragments of another SCFS.</Paragraph>
      <Paragraph position="1"> Given the set of fragments {f1, f2, ..} = F extracted from all SCFSs of the training set, we define the indicator function Ii(n), which is equal to 1 if the target fragment fi is rooted at node n and 0 otherwise. It follows that:</Paragraph>
      <Paragraph position="3"> K(T1,T2) = Σ_{n1 ∈ NT1} Σ_{n2 ∈ NT2} Δ(n1,n2), where NT1 and NT2 are the sets of T1's and T2's nodes, respectively, and Δ(n1,n2) = Σ_{i=1..|F|} Ii(n1) Ii(n2). The latter is equal to the number of common fragments rooted at the nodes n1 and n2. We can compute Δ as follows:
 1. if the productions at n1 and n2 are different then Δ(n1,n2) = 0;
 2. if the productions at n1 and n2 are the same, and n1 and n2 have only leaf children (i.e. they are pre-terminal symbols) then Δ(n1,n2) = 1;
 3. if the productions at n1 and n2 are the same, and n1 and n2 are not pre-terminals then Δ(n1,n2) = Π_{j=1..nc(n1)} (σ + Δ(cj^{n1}, cj^{n2})),</Paragraph>
      <Paragraph position="5"> where σ ∈ {0,1}, nc(n1) is the number of children of n1 and cj^n is the j-th child of the node n.</Paragraph>
      <Paragraph position="6"> Note that, as the productions are the same, nc(n1) = nc(n2).</Paragraph>
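The three-case recursion can be sketched as follows, assuming a simple tuple encoding (label, children) for parse-tree nodes, with leaf words encoded as (word, []). This is an illustration of the formulas above, not the authors' implementation:

```python
# Tree kernel via the Delta recursion. A node is (label, [children]);
# a leaf word is (word, []). sigma plays the role of the parameter in case 3.

def production(node):
    """A node's production: its label plus the labels of its children."""
    label, children = node
    return (label, tuple(child[0] for child in children))

def is_preterminal(node):
    """True when every child is a leaf word (and there is at least one)."""
    _, children = node
    return children != [] and all(child[1] == [] for child in children)

def delta(n1, n2, sigma=1):
    # Case 1: different productions share no fragments.
    if production(n1) != production(n2):
        return 0
    # Case 2: identical pre-terminal productions share exactly one fragment.
    if is_preterminal(n1):
        return 1
    # Case 3: recurse over the aligned children (nc(n1) == nc(n2) here).
    prod = 1
    for c1, c2 in zip(n1[1], n2[1]):
        prod *= sigma + delta(c1, c2, sigma)
    return prod

def tree_kernel(t1, t2, sigma=1):
    """K(T1, T2): sum Delta over all pairs of internal nodes of the two trees."""
    def nodes(t):
        result = [t] if t[1] else []
        for child in t[1]:
            result.extend(nodes(child))
        return result
    return sum(delta(n1, n2, sigma) for n1 in nodes(t1) for n2 in nodes(t2))
```

For the fragment VP(V(took) NP(D(the) N(book))), delta of the VP node against itself with sigma = 1 counts the ten fragments rooted there, matching the style of the count in Figure 3.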
      <Paragraph position="7"> The above kernel has the drawback of assigning higher weights to larger structures. To overcome this problem we can scale the relative importance of the tree fragments using a parameter λ in conditions 2 and 3 as follows: Δ(nx,nz) = λ in condition 2 and Δ(nx,nz) = λ Π_{j=1..nc(nx)} (σ + Δ(cj^{nx}, cj^{nz})) in condition 3.</Paragraph>
      <Paragraph position="9"> The set of fragments that belong to SCFs is derived by human annotators according to semantic considerations; thus they generate a semantic subcategorization frame kernel (SK). We also note that SK estimates the similarity between two SCFSs by counting the number of fragments they have in common. For example, in Figure 2, KT(Ftook,Fread) is quite high (i.e. 6 out of 10 substructures) as the two verbs have the same syntactic realization.</Paragraph>
      <Paragraph position="10"> In other words, the fragments encode semantic information which is measured by SK. This provides the argument classifiers with important clues about the possible set of arguments suited for a target verbal predicate. To support this hypothesis, the next section presents the experiments on predicate argument classification in FrameNet and PropBank.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="13" end_page="14" type="metho">
    <SectionTitle>
4 The Experiments
</SectionTitle>
    <Paragraph position="0"> A clustering algorithm which uses SK would group together verbs that show a similar syntactic structure. To study the properties of such clusters, we experimented with SK in combination with the traditional kernel used for predicate argument classification.</Paragraph>
    <Paragraph position="1"> As the polynomial kernel with degree 3 was shown to be the most accurate for argument classification (Pradhan et al., 2004; Moschitti, 2004), we use it to build two kernel combinations: Poly + SK, the sum of the normalized polynomial kernel and SK, and Poly × SK, the normalized product between the polynomial kernel and SK. (With a similar aim, and to have a similarity score between 0 and 1, we also apply normalization in the kernel space, i.e. K'(x,z) = K(x,z)/√(K(x,x) K(z,z)).)</Paragraph>
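The kernel-space normalization mentioned in the footnote, together with the two combinations, can be sketched as below; `poly` and `sk` stand for any two kernel functions, and the wrapper names are our own:

```python
# Kernel normalization keeps every similarity in [0, 1]:
# K'(x, z) = K(x, z) / sqrt(K(x, x) * K(z, z)).
import math

def normalize(kernel):
    """Wrap a kernel function in the kernel-space normalization above."""
    def normalized(x, z):
        return kernel(x, z) / math.sqrt(kernel(x, x) * kernel(z, z))
    return normalized

def sum_combination(poly, sk):
    """Poly + SK: the sum of the two normalized kernels."""
    p, s = normalize(poly), normalize(sk)
    return lambda x, z: p(x, z) + s(x, z)

def product_combination(poly, sk):
    """Poly x SK: the normalized product of the two kernels."""
    return normalize(lambda x, z: poly(x, z) * sk(x, z))
```

Both combinations are again valid kernels, since sums and products of kernels preserve positive semi-definiteness.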
    <Paragraph position="3"> The first corpus was PropBank, used together with the Penn TreeBank 2 (www.cis.upenn.edu/~treebank) (Marcus et al., 1993). This corpus contains about 53,700 sentences and a fixed split between training and testing which has been used in other studies, e.g. (Pradhan et al., 2004; Gildea and Palmer). In this split, sections from 02 to 21 are used for training, section 23 for testing and sections 1 and 22 as development set. We considered all 12 arguments from Arg0 to Arg9, ArgA and ArgM, for a total of 123,918 and 7,426 arguments in the training and test sets, respectively. It is worth noting that in the experiments we used the gold standard parsing from the Penn TreeBank, thus our kernel structures are derived with high precision.</Paragraph>
    <Paragraph position="4"> The second corpus was obtained by extracting from FrameNet (www.icsi.berkeley.edu/~framenet/) all 24,558 sentences from the 40 frames of the Senseval 3 (http://www.senseval.org) Automatic Labeling of Semantic Roles task. We considered the 18 most frequent roles, for a total of 37,948 arguments. Only verbs are selected to be predicates in our evaluations. Moreover, as there is no fixed split between training and testing, we randomly selected 30% of the sentences for testing and 30% for the validation set. Both training and testing sentences were processed using Collins' parser (Collins, 1997) to generate parse trees automatically. This means that our shallow semantic parser for FrameNet is fully automated.</Paragraph>
    <Section position="1" start_page="13" end_page="14" type="sub_section">
      <SectionTitle>
4.1 The Classification set-up
</SectionTitle>
      <Paragraph position="0"> The evaluations were carried out with the SVM-light-TK software, which encodes the tree kernels in the SVM-light package (Joachims, 1999).</Paragraph>
      <Paragraph position="3"> The classification performance was measured using the F1 measure for the individual arguments and the accuracy of the final multi-class classifier. The latter choice allows us to compare the results with previous work in the literature, e.g. (Gildea and Jurafsky, 2002; Pradhan et al., 2004; Gildea and Palmer).</Paragraph>
      <Paragraph position="4"> For the evaluation of SVMs, we used the default regularization parameter (e.g., C = 1 for normalized kernels) and we tried a few cost-factor values (i.e., j ∈ {1, 2, 3, 5, 7, 10, 100}) to adjust the rate between Precision and Recall. We chose the parameters by evaluating the SVMs with the KPoly kernel (degree = 3) over the validation set. Both the λ (see Section 3.2) and γ parameters were evaluated in a similar way, by maximizing the performance of the SVM using Poly + SK. We found that the best values were 0.4 and 0.3, respectively.</Paragraph>
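The validation-based parameter choice described above amounts to a small grid search. In this sketch, `evaluate` is a hypothetical stand-in for training an SVM with Poly + SK and scoring it on the validation set:

```python
# Pick the (lambda, gamma) pair with the highest validation accuracy.
# `evaluate(lam, gam)` is assumed to return the validation accuracy of an
# SVM trained with those parameter values.

def select_parameters(evaluate, lambdas, gammas):
    """Return the (lambda, gamma) pair maximizing validation accuracy."""
    grid = [(lam, gam) for lam in lambdas for gam in gammas]
    return max(grid, key=lambda pair: evaluate(pair[0], pair[1]))
```

With the values reported in the text, such a search would settle on λ = 0.4 and γ = 0.3.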
    </Section>
    <Section position="2" start_page="14" end_page="14" type="sub_section">
      <SectionTitle>
4.2 Comparative results
</SectionTitle>
      <Paragraph position="0"> To study the impact of the subcategorization frame kernel, we experimented with the three models Poly, Poly + SK and Poly × SK under different training conditions.</Paragraph>
      <Paragraph position="1"> First, we ran the above models using all the verbal predicates available in the training and test sets. Tables 1 and 2 report the F1 measure and the global accuracy for PB and FN, respectively. Column 2 shows the accuracy of Poly (90.5%), which is substantially equal to the accuracy obtained in (Pradhan et al., 2004) on the same training and test sets with the same SVM model. Columns 3 and 4 show that the kernel combinations Poly + SK and Poly × SK remarkably improve the Poly accuracy on PB, i.e. by 2.7% (93.2% vs. 90.5%), whereas on FN only Poly + SK produces a small accuracy increase, i.e. 0.7% (86.2% vs. 85.5%).</Paragraph>
      <Paragraph position="3"> The improvement on FN is lower since its classification requires dealing with a higher variability of semantic roles. For example, in PropBank, Arg0 and Arg1 correspond most of the time to the logical subject and the logical direct object, respectively. On the contrary, the FN Cause and Agent roles are often both associated with the logical subject and share similar syntactic realizations, making SCFSs less effective at distinguishing between them. Moreover, the training data available for FrameNet is smaller than that used for PropBank; thus, the tree kernel may not have enough examples to generalize correctly.</Paragraph>
      <Paragraph position="4"> Second, we carried out other experiments using a subset of the total verbs for training and another, disjoint subset for testing. In these conditions, the impact of SK is amplified: on PB, SK × Poly outperforms Poly by 4.8% (86.9% vs. 82.1%), whereas on FN, SK increases Poly by about 2%, i.e. 74.6% vs. 72.8%. These results suggest that (a) when test-set verbs are not observed during training, the classification task is harder, e.g. 82.1% vs. 90.5% on PB, and (b) the syntactic structures of the verbs, i.e. the SCFSs, allow the SVMs to better generalize on unseen verbs.</Paragraph>
      <Paragraph position="5"> To verify that the kernel representation is superior to the traditional representation, we carried out an experiment using a flat feature representation of the SCFs, i.e. we used the syntactic frame feature described in (Xue and Palmer, 2004) in place of SK.</Paragraph>
      <Paragraph position="6"> The result, in line with other findings in the literature, e.g. (Pradhan et al., 2004), shows an improvement on PB of only about 0.7%. Evidently, flat features cannot derive the same information as a convolution kernel. Finally, to study how verb complexity impacts the usefulness of SK, we carried out additional experiments with different verb sets. One dimension of complexity is the frequency of the verbs in the target corpus. Infrequent verbs are associated with predicate argument structures poorly represented in the training set, thus they are more difficult to classify. Another dimension of verb complexity is the number of different SCFs that verbs show in different contexts. Intuitively, the higher this number, the more complex the verb. Figure 4.a reports the trend line plot of Poly and SK + Poly according to subsets of different verb frequency. For example, the label 1-5 refers to the class of verbal predicates whose frequency ranges from 1 to 5. The associated accuracy is evaluated on the portions of the training and test sets which contain only the verbs in such a class. We note that SK improves Poly for any verb frequency. The improvement decreases when the frequency becomes very high, i.e. when there are many training instances that can suggest the correct classification. A similar behavior is shown in Figure 4.b, where the F1 measure for Arg0 of PB is reported.</Paragraph>
      <Paragraph position="7"> Figures 4.c and 4.d illustrate the accuracy and the F1 measure for all arguments and for Arg0 of PB according to the number of SCF types, respectively.</Paragraph>
      <Paragraph position="8"> We observe that the semantic kernel does not produce any improvement on the verbs which are syntactically expressed by only one type of SCF. As the number of SCF types increases (&gt; 1), Poly + SK outperforms Poly for any verb class, i.e. when the verb is complex enough, SK always produces useful information independently of the number of training set instances. On the one hand, a high number of verb instances reduces the complexity of the classification task. On the other hand, as the number of SCF types increases, the complexity of the task increases as well.</Paragraph>
      <Paragraph position="9"> A similar behavior can be noted on the FN data (Figure 4.e), even if the looser correlation between syntax and semantics prevents SK from producing high improvements. Figure 4.f shows the impact of SK on the Agent role. We note that the F1 increases more than the global accuracy (Figure 4.e), as the Agent most of the time corresponds to Arg0. This is confirmed by Table 2, which shows an improvement for the Agent of up to 2% when SK is used along with the polynomial kernel.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="14" end_page="16" type="metho">
    <SectionTitle>
5 Conclusive Remarks
</SectionTitle>
    <Paragraph position="0"> In this article, we used Support Vector Machines (SVMs) to analyze in depth the role of the subcategorization frame kernel (SK) in the automatic predicate argument classification of the PropBank and FrameNet corpora. To study SK's verb classification properties, we combined it with the polynomial kernel on standard flat features.</Paragraph>
    <Paragraph position="1"> We ran the SVMs on diverse levels of task complexity. The results show that: (1) in general, SK remarkably improves the classification accuracy. (2) When there are no training instances of the test-set verbs, the improvement produced by SK is almost double. This suggests that tree kernels automatically derive features which also support a sort of back-off estimation in the case of unseen verbs. (3) In all complexity conditions, the structural features are very robust, maintaining a high improvement over the basic accuracy. (4) The semantic role classification in FrameNet is affected by noisier data, as it is based on the output of a statistical parser; as a consequence, the improvement is lower. Nevertheless, the systematic superiority of SK suggests that it is less sensitive to parsing errors than other models. This opens a promising direction for a more weakly supervised application of statistical semantic tagging supported by SK.</Paragraph>
    <Paragraph position="2"> In summary, the extensive experimentation has shown that the SK provides information robust with respect to the complexity of the task, i.e. verbs with richer syntactic structures and sparse training data.</Paragraph>
    <Paragraph position="3"> An important observation on the use of tree kernels has been pointed out in (Cumby and Roth, 2003): both computational efficiency and classification accuracy can often be superior if we select the most informative tree fragments and carry out the learning in the feature space. Nevertheless, the case studied in this paper is well suited for using kernels, as: (1) it is difficult to guess which fragments of an SCF should be retained and which should be discarded; (2) it may be the case that all fragments are useful, since SCFs are small structures and all their substructures may serve as different back-off levels; and (3) small structures do not heavily penalize efficiency.</Paragraph>
    <Paragraph position="4"> Future research may address (a) the use of the SK kernel to explicitly generate verb clusters and (b) the use of convolution kernels to study other linguistic phenomena: we can use tree kernels to investigate which syntactic features are suited for an unknown phenomenon.</Paragraph>
  </Section>
</Paper>