<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2403"> <Title>A Semantic Kernel for Predicate Argument Classification</Title> <Section position="3" start_page="1" end_page="1" type="metho"> <SectionTitle> 2 Automatic Predicate-Argument Extraction </SectionTitle> <Paragraph position="0"> Given a sentence in natural language, all the predicates associated with its verbs have to be identified, along with their arguments. This problem can be divided into two subtasks: (a) detection of the target argument boundaries, i.e. all its compounding words, and (b) classification of the argument type, e.g. Arg0 or ArgM.</Paragraph> <Paragraph position="1"> A direct approach to learning both detection and classification of predicate arguments is summarized by the following steps: 1. Given a sentence from the training-set, generate a full syntactic parse-tree; 2. let P and A be the set of predicates and the set of parse-tree nodes (i.e. the potential arguments), respectively; 3. for each pair $\langle p,a \rangle \in P \times A$: + extract the feature representation set $F_{p,a}$; + if the subtree rooted in a covers exactly the words of one argument of p, put $F_{p,a}$ in $T^+$ (positive examples), otherwise put it in $T^-$ (negative examples).</Paragraph> <Paragraph position="2"> For example, in Figure 1, for each combination of the predicate give with the nodes N, S, VP, V, NP, PP, D or IN, the instances $F_{give,a}$ are generated. In case the node a exactly covers "Paul", "a lecture" or "in Rome", it will be a positive instance, otherwise it will be a negative one, e.g. $F_{give,IN}$.</Paragraph> <Paragraph position="3"> The above $T^+$ and $T^-$ sets can be re-organized as positive $T^+_{arg_i}$ and negative $T^-_{arg_i}$ examples for each argument i. In this way, an individual ONE-vs-ALL classifier for each argument i can be trained. We adopted this solution as it is simple and effective (Pradhan et al., 2003). In the classification phase, given a sentence of the test-set, all its $F_{p,a}$ are generated and classified by each individual classifier. As a final decision, we select the argument associated with the maximum value among the scores provided by the SVMs, i.e. $\arg\max_{i \in S} C_i$, where S is the target set of arguments. (Alternatively, SVMs can be extended also into a multi-class categorization problem; several optimizations have been proposed, e.g. (Goh et al., 2001).)</Paragraph>
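The generation procedure above maps directly onto a short program. The following is a minimal sketch, assuming NLTK-style parse trees; `gold_spans` (a map from each predicate to the word spans of its annotated arguments) is a hypothetical stand-in for the PropBank annotation, not an API from the paper.

```python
# Minimal sketch of the instance-generation steps above.
from nltk.tree import Tree

def build_training_sets(parse, predicates, gold_spans, extract_features):
    """Step 3: pair every predicate p with every parse-tree node a."""
    T_pos, T_neg = [], []
    for p in predicates:                       # the set P
        for a in parse.subtrees():             # the set A of potential args
            F_pa = extract_features(parse, p, a)
            # positive iff the subtree rooted in a covers exactly the
            # words of one argument of p
            if tuple(a.leaves()) in gold_spans[p]:
                T_pos.append(F_pa)
            else:
                T_neg.append(F_pa)
    return T_pos, T_neg

def select_argument(svm_scores):
    """Final decision: argmax over the ONE-vs-ALL SVM scores C_i."""
    return max(svm_scores, key=svm_scores.get)
```

T_pos would then be split per argument type to build the $T^+_{arg_i}$ sets used to train the individual ONE-vs-ALL classifiers.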
<Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.1 Standard feature space </SectionTitle> <Paragraph position="0"> The discovery of relevant features is, as usual, a complex task; nevertheless, there is a common consensus on the basic features that should be adopted. These standard features, first proposed in (Gildea and Jurafsky, 2002), refer to flat information derived from parse trees, i.e. Phrase Type, Predicate Word, Head Word, Governing Category, Position and Voice. Table 1 presents the standard features and exemplifies how they are extracted from a given parse tree.</Paragraph> <Paragraph position="1"> - Phrase Type: This feature indicates the syntactic type of the phrase labeled as a predicate argument, e.g. NP for Arg1 in Figure 1.</Paragraph> <Paragraph position="2"> - Parse Tree Path: This feature contains the path in the parse tree between the predicate and the argument phrase, expressed as a sequence of nonterminal labels linked by direction (up or down) symbols, e.g. V↑VP↓NP for Arg1 in Figure 1.</Paragraph> <Paragraph position="3"> - Position: Indicates whether the constituent, i.e. the potential argument, appears before or after the predicate in the sentence, e.g. after for Arg1 and before for Arg0 (see Figure 1).</Paragraph> <Paragraph position="4"> - Voice: This feature distinguishes between active and passive voice for the predicate phrase, e.g. active for every argument (see Figure 1).</Paragraph> <Paragraph position="5"> - Head Word: This feature contains the head word of the evaluated phrase. Case and morphological information are preserved, e.g. lecture for Arg1 (see Figure 1).</Paragraph> <Paragraph position="6"> - Governing Category: This feature applies to noun phrases only, and it indicates whether the NP is dominated by a sentence phrase (typical for subject arguments with active-voice predicates) or by a verb phrase (typical for object arguments), e.g. the NP associated with Arg1 is dominated by a verb phrase VP (see Figure 1).</Paragraph> <Paragraph position="7"> - Predicate Word: In our implementation this feature consists of two components: (1) the word itself, with case and morphological information preserved, e.g. gives for all arguments; and (2) the lemma, which represents the verb normalized to lower case and infinitive form, e.g. give for all arguments (see Figure 1).</Paragraph> <Paragraph position="8"> For example, the Parse Tree Path feature represents the path in the parse-tree between a predicate node and one of its argument nodes. It is expressed as a sequence of non-terminal labels linked by direction symbols (up or down), e.g. in Figure 1, V↑VP↓NP is the path between the predicate to give and the argument 1, a lecture. If two pairs $\langle p_1,a_1 \rangle$ and $\langle p_2,a_2 \rangle$ have Paths that differ even by one character (e.g. one node in the parse-tree), the match fails, preventing the learning algorithm from generalizing well on unseen data. In order to address this problem, the next section describes a novel kernel space for predicate argument classification.</Paragraph> </Section> </Section>
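Two of these features lend themselves to a compact illustration. The sketch below (ours, not from the paper) computes the Parse Tree Path and the Position over an NLTK-style tree, using `^` for an up step and `!` for a down step in place of the arrows above; the tree positions of the predicate's pre-terminal and of the candidate node are assumed to be given.

```python
# Illustrative sketch of two standard features over an NLTK-style tree.
from nltk.tree import Tree

def parse_tree_path(tree, pred_pos, arg_pos):
    """Parse Tree Path between the predicate node and the argument node."""
    i = 0  # longest common treeposition prefix = lowest common ancestor
    while i < min(len(pred_pos), len(arg_pos)) and pred_pos[i] == arg_pos[i]:
        i += 1
    up = [tree[pred_pos[:j]].label() for j in range(len(pred_pos), i, -1)]
    down = [tree[arg_pos[:j]].label() for j in range(i + 1, len(arg_pos) + 1)]
    return "^".join(up + [tree[pred_pos[:i]].label()]) + "!" + "!".join(down)

def position(pred_pos, arg_pos):
    """Position: treepositions of disjoint nodes order left-to-right."""
    return "before" if arg_pos < pred_pos else "after"

t = Tree.fromstring(
    "(S (N Paul) (VP (V gives) (NP (D a) (N lecture)) (PP (IN in) (N Rome))))")
print(parse_tree_path(t, (1, 0), (1, 1)))  # V^VP!NP, i.e. V-up-VP-down-NP
print(position((1, 0), (0,)))              # before: Arg0 precedes the verb
```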
<Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 A semantic kernel for argument classification </SectionTitle> <Paragraph position="0"> We consider the predicate argument structures annotated in PropBank as our semantic space. Many semantic structures may constitute the objects of our space. Some possibilities are: (a) the selection of the whole sentence parse-tree in which the target predicate is contained, or (b) the selection of the sub-tree that encloses the whole predicate annotation (i.e. all its arguments). However, both choices would cause an exponential explosion in the number of potential sub-parse-trees that have to be classified during the testing phase. In fact, during this phase we do not know which are the arguments associated with a predicate; thus, we would need to build all the possible structures containing groups of potential arguments for the target predicate. More in detail, assuming that S is the set of PropBank argument types and m is the maximum number of arguments that the target predicate can have, we would have to evaluate $\binom{|S|}{m}$ argument combinations for each target predicate.</Paragraph> <Paragraph position="1"> In order to define an efficient semantic space, we select as objects only the minimal sub-structures that include one predicate with only one of its arguments. For example, Figure 2 illustrates the parse-tree of the sentence "Paul delivers a talk in formal style". The circled substructures in (a), (b) and (c) are our semantic objects, associated with the three arguments of the verb to deliver, i.e. <deliver, Arg0>, <deliver, Arg1> and <deliver, ArgM>. In this formulation, exactly one of the above structures is associated with each predicate/argument pair, i.e. $F_{p,a}$ contains only one of the circled sub-trees.</Paragraph> <Paragraph position="2"> We note that our approach has the following properties: + The overall semantic feature space F contains sub-structures composed of syntactic information, embodied by parse-tree dependencies, and semantic information, in the form of the predicate/argument annotation. + This solution is efficient, as we have to classify at most |A| nodes for each predicate, i.e. the set of the parse-tree nodes of a testing sentence. + A constituent cannot be part of two different arguments of the target predicate, i.e. there is no overlap between the words of two arguments. Thus, two semantic structures $F_{p_1,a_1}$ and $F_{p_2,a_2}$, associated with two different arguments, cannot be included one in the other. (Recall that $F_{p,a}$ was defined as the set of features of the object <p,a>; since in our kernel $F_{p,a}$ contains only one element, with an abuse of notation we use it to indicate the object itself.) This property is important because a convolution kernel would not be effective in distinguishing between an object and its sub-parts.</Paragraph> <Paragraph position="3"> Once our semantic space is defined, we need to design a kernel function to measure the similarity between two objects. These may still be seen as described by complex features, but such a similarity is computed avoiding the explicit feature extraction. For this purpose we define a mapping $\phi : F \rightarrow F'$ that allows us to design an efficient semantic kernel $K(\vec{x},\vec{z}) = \langle \phi(\vec{x}) \cdot \phi(\vec{z}) \rangle$.</Paragraph>
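A possible rendering of this object-selection step is sketched below: starting from the lowest common ancestor of the predicate node and the argument node, every branch that leads to neither of them is pruned away. The helper and the bracketing of the example sentence are our assumptions, not code or structures taken from the paper; on this input the helper reproduces the circled structures of Figure 2.

```python
# Sketch of building the semantic object F_{p,a}: the minimal subtree
# spanning the predicate and one argument node, other branches pruned.
from nltk.tree import Tree

def minimal_structure(tree, pred_pos, arg_pos):
    i = 0  # lowest common ancestor = longest shared treeposition prefix
    while i < min(len(pred_pos), len(arg_pos)) and pred_pos[i] == arg_pos[i]:
        i += 1

    def prune(pos):
        node = tree[pos]
        if pos in (pred_pos, arg_pos):
            return node.copy(deep=True)        # keep this subtree whole
        kept = [prune(pos + (j,)) for j in range(len(node))
                if pred_pos[:len(pos) + 1] == pos + (j,)
                or arg_pos[:len(pos) + 1] == pos + (j,)]
        return Tree(node.label(), kept)

    return prune(pred_pos[:i])

t = Tree.fromstring("(S (N Paul) (VP (V delivers) (NP (D a) (N talk)) "
                    "(PP (IN in) (NP (ADJ formal) (N style)))))")
print(minimal_structure(t, (1, 0), (1, 1)))  # (VP (V delivers) (NP (D a) (N talk)))
print(minimal_structure(t, (1, 0), (0,)))    # (S (N Paul) (VP (V delivers)))
```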
<Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.1 The Semantic Kernel (SK) </SectionTitle> <Paragraph position="0"> Given the semantic objects defined in the previous section, we design a convolution kernel in a way similar to the parse-tree kernel proposed in (Collins and Duffy, 2002). Our feature set $F'$ is the set of all possible sub-structures (enumerated from 1 to $|F'|$) of the semantic objects extracted from PropBank. For example, Figure 3 illustrates all valid fragments of the semantic structure $F_{deliver,Arg1}$ (see also Figure 2). It is worth noting that the allowed sub-trees contain entire (not partial) production rules. For instance, the sub-tree [NP [D a]] is excluded from the set of Figure 3, since only a part of the production NP → D N is used in its generation.</Paragraph> <Paragraph position="1"> However, this constraint does not rule out the fragment [VP [V NP]] with respect to the sentence production VP → V NP PP, as the subtree [VP [PP [...]]] is not considered part of the semantic structure, whose own production is VP → V NP.</Paragraph> <Paragraph position="2"> Even if the cardinality of $F'$ is very large, the evaluation of the kernel function is polynomial in the number of parse-tree nodes.</Paragraph> <Paragraph position="3"> More precisely, a semantic structure $\vec{x}$ is mapped into $\phi(\vec{x}) = (h_1(\vec{x}), h_2(\vec{x}), \ldots)$, where the feature function $h_i(\vec{x})$ counts the number of times that the i-th sub-structure of the training data appears in $\vec{x}$. [Figure 3: All 17 valid fragments of the semantic structure associated with Arg1 (see Figure 2).] Let $I_i(n)$ be the indicator function: 1 if the sub-structure i is rooted at node n, and 0 otherwise. It follows that $$K(\vec{x}, \vec{z}) = \sum_{n_x \in N_x} \sum_{n_z \in N_z} \Delta(n_x, n_z) \quad (1)$$ where $N_x$ and $N_z$ are the nodes in x and z, respectively. In (Collins and Duffy, 2002), it has been shown that Eq. 1 can be computed in $O(|N_x| \times |N_z|)$ by evaluating $\Delta(n_x, n_z) = \sum_i I_i(n_x) I_i(n_z)$ with the following recursive equations: + if the productions at $n_x$ and $n_z$ are different, then $\Delta(n_x, n_z) = 0$; + if the productions at $n_x$ and $n_z$ are the same, and $n_x$ and $n_z$ are pre-terminals, then $$\Delta(n_x, n_z) = 1 \quad (2)$$ + otherwise $$\Delta(n_x, n_z) = \prod_{j=1}^{nc(n_x)} (1 + \Delta(ch(n_x, j), ch(n_z, j))) \quad (3)$$ where $nc(n_x)$ is the number of children of $n_x$ and $ch(n, j)$ is the j-th child of the node n. Note that, as the productions are the same, $ch(n_x, j) = ch(n_z, j)$. (Additionally, we carried out the normalization in the kernel space, thus the final kernel is $K'(\vec{x},\vec{z}) = K(\vec{x},\vec{z}) / \sqrt{K(\vec{x},\vec{x}) \times K(\vec{z},\vec{z})}$.)</Paragraph> <Paragraph position="4"> This kind of kernel has the drawback of assigning more weight to larger structures, while the argument type does not depend at all on the size of its structure. In fact, two sentences such as: (1) [Arg0 Paul] [predicate delivers] [Arg1 a talk] and (2) [Arg0 Paul] [predicate delivers] [Arg1 a plan on the detection of terrorist groups active in North Iraq] have the same argument type with a very different size.</Paragraph> <Paragraph position="5"> To overcome this problem, we can scale the relative importance of the tree fragments with their size. For this purpose, a parameter λ is introduced into Equations 2 and 3, obtaining: $$\Delta(n_x, n_z) = \lambda \quad (4)$$ $$\Delta(n_x, n_z) = \lambda \prod_{j=1}^{nc(n_x)} (1 + \Delta(ch(n_x, j), ch(n_z, j))) \quad (5)$$</Paragraph> <Paragraph position="6"> It is worth noting that, even if the above equations define a kernel function similar to the one proposed in (Collins and Duffy, 2002), the substructures on which SK operates are different from those of the parse-tree kernel. For example, Figure 3 shows that structures such as [VP [V] [NP]], [VP [V delivers] [NP]] and [VP [V] [NP [DT N]]] are valid features, but these fragments (and many others) are not generated by the complete sentence production, i.e. VP → V NP PP. As a consequence, they are not included in the parse-tree kernel representation of the sentence.</Paragraph> </Section>
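The recursion in Equations 1-5 is short enough to sketch directly. The code below is our rendering over NLTK-style trees, not the authors' implementation: `delta` covers the three cases with the λ decay of Equations 4 and 5, and `normalized_kernel` applies the normalization $K'$ mentioned above.

```python
# Sketch of the semantic kernel of Eqs. 1-5 over NLTK-style trees.
from math import prod
from nltk.tree import Tree

def delta(nx, nz, lam):
    """Delta(n_x, n_z) of Collins and Duffy (2002) with the lambda decay."""
    def production(n):
        return (n.label(),
                tuple(c.label() if isinstance(c, Tree) else c for c in n))
    if production(nx) != production(nz):
        return 0.0                                   # different productions
    if all(not isinstance(c, Tree) for c in nx):     # pre-terminal nodes
        return lam                                   # Eq. 4
    return lam * prod(1 + delta(cx, cz, lam)         # Eq. 5
                      for cx, cz in zip(nx, nz))

def kernel(x, z, lam=0.4):
    """Eq. 1: K(x, z) as the sum of Delta over all node pairs."""
    return sum(delta(nx, nz, lam)
               for nx in x.subtrees() for nz in z.subtrees())

def normalized_kernel(x, z, lam=0.4):
    """K'(x, z) = K(x, z) / sqrt(K(x, x) * K(z, z))."""
    return kernel(x, z, lam) / (kernel(x, x, lam) * kernel(z, z, lam)) ** 0.5
```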
<Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.2 Comparison with Standard Features </SectionTitle> <Paragraph position="0"> We have synthesized the comparison between the standard features and the SK representation in the following points. First, SK estimates the similarity between two semantic structures by counting the number of sub-structures that they have in common. As an example, the similarity between the two structures in Figure 2, $F_{delivers,Arg0}$ and $F_{delivers,Arg1}$, is equal to 1, since they have in common only the [V delivers] substructure. Such a low value reflects the fact that different argument types tend to appear in different structures.</Paragraph> <Paragraph position="1"> On the contrary, if two structures differ only in a few nodes (especially terminal or near-terminal nodes), their similarity remains quite high. For example, if we change the tense of the verb to deliver (Figure 2) into delivered, the [VP [V delivers] NP] subtree is transformed into [VP [VBD delivered] NP], where the NP is unchanged. Thus, the similarity with the previous structure is still quite high, as: (1) the NP with all its sub-parts is matched and (2) the small difference does not highly affect the kernel norm and, consequently, the final score.</Paragraph> <Paragraph position="2"> This conservative property does not apply to the Parse Tree Path feature, which is very sensitive to small changes in the tree structure, e.g. two predicates expressed in different tenses generate two different Path features.</Paragraph> <Paragraph position="3"> Second, some information contained in the standard features is embedded in SK: Phrase Type, Predicate Word and Head Word explicitly appear as structure fragments. For example, Figure 3 shows fragments such as [NP [DT] [N]] or [NP [DT a] [N talk]], which explicitly encode the Phrase Type feature NP for Arg1 in Figure 2.b. The Predicate Word is represented by the fragment [V delivers] and the Head Word is present as [N talk].</Paragraph> <Paragraph position="4"> Finally, Governing Category, Position and Voice cannot be expressed by SK. This suggests that a combination of the flat features (especially the named entity class (Surdeanu et al., 2003)) with SK could further improve the predicate argument representation.</Paragraph> </Section> </Section>
<Section position="5" start_page="1" end_page="1" type="metho"> <SectionTitle> 4 The Experiments </SectionTitle> <Paragraph position="0"> For the experiments, we used PropBank (www.cis.upenn.edu/~ace) along with Penn TreeBank 2 (www.cis.upenn.edu/~treebank) (Marcus et al., 1993). This corpus contains about 53,700 sentences and a fixed split between training and testing which has been used in other research (Gildea and Jurafsky, 2002; Surdeanu et al., 2003; Hacioglu et al., 2003; Chen and Rambow, 2003; Gildea and Hockenmaier, 2003; Gildea and Palmer, 2002; Pradhan et al., 2003). In this split, Sections from 02 to 21 are used for training, Section 23 for testing, and Sections 01 and 22 as the development set. We considered all PropBank arguments from Arg0 to Arg9, plus ArgA and ArgM, even if only Arg0 to Arg4 and ArgM contain enough training/testing data to affect the global performance. Table 2 reports some characteristics of the corpus used in our experiments. (We point out that we removed from the Penn TreeBank the special tags of noun phrases, like Subj and TMP, as parsers are usually not able to provide this information.)</Paragraph> [Table 2: number of arguments in the training-set and test-set, and number of unique standard features.] <Paragraph position="1"> The classifier evaluations were carried out using the SVM-light software (Joachims, 1999), available at http://svmlight.joachims.org/, with the default linear kernel for the standard feature evaluations. For processing our semantic structures, we implemented our own kernel and used it inside SVM-light.</Paragraph> <Paragraph position="2"> The classification performance was evaluated using the $f_1$ measure for the single arguments, as each of them has a different Precision and Recall, and using the accuracy for the final multi-class classifier, as in this case the global Precision = Recall = accuracy. The latter measure allows us to compare our results with previous work, e.g. (Gildea and Palmer, 2002; Surdeanu et al., 2003; Hacioglu et al., 2003; Chen and Rambow, 2003; Gildea and Hockenmaier, 2003).</Paragraph> <Paragraph position="3"> To evaluate the effectiveness of our new kernel, we divided the experiments into three steps: + the evaluation of SVMs trained with the standard features in a linear kernel, for comparison purposes; + the estimation of the λ parameter (Equations 4 and 5) for SK on the validation-set; + the performance measurement of SVMs using SK along with the λ computed in the previous step. Additionally, both the linear kernel and SK were evaluated using different percentages of training data, to compare the gradients of their learning curves.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.1 SVM performance on Linear and Semantic Kernel </SectionTitle> <Paragraph position="0"> The evaluation of SVMs using a linear kernel on the standard features did not raise particular problems. We used the default regularization parameter (i.e., C = 1 for normalized kernels) and we tried a few cost-factor values (i.e., $j \in \{0.1, 1, 2, 3, 4, 5\}$) to adjust the ratio between Precision and Recall. Given the huge amount of training data, we used only 30% of the training-set in these validation experiments. Once the parameters were derived, we learned 6 different classifiers (one for each role) and measured their performance on the test-set.</Paragraph> <Paragraph position="1"> For SVMs using the Semantic Kernel, we derived that a good λ value on the validation-set is 0.4. Figure 4 reports the curves, $f_1$ as a function of λ, for the three largest (in terms of training examples) arguments on the test-set. [Figure 4: $f_1$ of the three largest arguments according to different λ values.]</Paragraph> <Paragraph position="2"> We note that the maximal value on the validation-set is also the maximal value on the test-set for every argument. This suggests that: (a) it is easy to detect an optimal parameter, and (b) there is a common (to all arguments) λ value which defines how much the size of two structures impacts their similarity. Moreover, some experiments using λ greater than 1 showed a remarkable decrease in performance, i.e. a correct λ seems to be essential to obtain a good generalization from the training-set. (For example, λ = 1 would generate low kernel values between small and large structures. This is in contrast with the observation in Section 3.1, i.e. the argument type is independent of the size of its constituent.)</Paragraph>
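A hypothetical sketch of this estimation step is given below: $f_1$ is measured on the validation-set over a small λ grid and the best value is kept (0.4 in our setting). `train_svm` and `evaluate_f1` are assumed placeholder helpers, not an API of the paper or of SVM-light; `normalized_kernel` is the sketch from Section 3.1.

```python
# Hypothetical grid search for the lambda parameter of Eqs. 4 and 5.
def select_lambda(train, validation, grid=(0.2, 0.3, 0.4, 0.5, 0.6)):
    best_lam, best_f1 = None, -1.0
    for lam in grid:
        # bind lam now so a stored kernel does not see a later grid value
        model = train_svm(
            train, kernel=lambda x, z, lam=lam: normalized_kernel(x, z, lam))
        score = evaluate_f1(model, validation)
        if score > best_f1:
            best_lam, best_f1 = lam, score
    return best_lam
```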
<Paragraph position="3"> Table 3 reports the performance of SVMs trained with the standard features (STD column) and with the Semantic Kernel (SK column). The Prob. and C4.5 columns report the argument classification results achieved in (Gildea and Palmer, 2002) and (Surdeanu et al., 2003); the latter used the C4.5 model on the standard feature set (STD sub-column) and on an extended feature set (EXT sub-column). We note that: (a) SVM performs better than the probabilistic approach and the C4.5 learning model, independently of the adopted features, and (b) the Semantic Kernel considerably improves on the standard feature set.</Paragraph> <Paragraph position="4"> In order to investigate whether SK generalizes better than the linear kernel, we measured the performance with different percentages of training data. Figure 5 shows the curves for the three roles Arg0, Arg1 and ArgM, respectively for the linear and the semantic kernel, whereas Figure 6 reports the accuracy of the multi-class classifier. We note that not only does SK produce higher accuracy, but the gradient of its learning curve is also higher: for example, Figure 6 shows that with only 20% of the training data, SVM using SK approaches the accuracy of SVM trained with all data on the standard features.</Paragraph> <Paragraph position="5"> Additionally, we carried out some preliminary experiments on argument identification (boundary detection), but the learning algorithm was not able to converge. In fact, for this task the non-inclusion property (discussed in Section 3) does not hold: a constituent $a_i$ which has incorrect boundaries can include, or be included in, the correct argument $a_c$. Thus, the similarity $K(a_i, a_c)$ between $a_i$ and $a_c$ is quite high, preventing the algorithm from learning the structures of the correct arguments.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.2 Discussion and Related Work </SectionTitle> <Paragraph position="0"> The material of the previous sections requires a discussion of the following points. Firstly, in Section 3.2 we noted that some standard features are explicitly coded in SK, but Governing Category, Position and Voice are not expressible as a single fragment of a semantic structure. For example, deriving the Position of an argument relative to the target predicate requires a visit of the tree: no parse-tree information, i.e. node tags or edges, explicitly indicates this feature. A similar rationale applies to Governing Category and Voice, even if for the latter some tree fragments may encode the verb to be.</Paragraph> <Paragraph position="1"> Since these three features have been proved important for role classification, we argue that either (a) SK implicitly produces this kind of information, or (b) SK provides different but equally effective information which allows it to perform better than the standard features. In the latter case, it would be interesting to study which features can be back-ported from SK to the linear kernel, to obtain a fast and improved system (Cumby and Roth, 2003). As an example, the fragment [VP [V NP]] defines a sort of sub-categorization frame that may be used to cluster together syntactically similar verbs.</Paragraph> <Paragraph position="2"> Secondly, it is worth noting that we compared SK against a linear kernel on the standard features. A recent study (Pradhan et al., 2003) has suggested that a polynomial kernel of degree 2 performs better than the linear one. Using such a kernel, the authors obtained 88% in classification, but we should take into account that they also used a larger set of flat features, i.e. sub-categorization information (e.g. VP → V NP PP for the tree in Figure 1), Named Entity classes and a Partial Path feature.</Paragraph> <Paragraph position="3"> Thirdly, this is one of the first massive uses of convolution kernels for Natural Language Processing tasks: we trained SK and tested it on 123,918 and 7,426 arguments, respectively. Training each large argument (in terms of instances) required more than 1.5 billion kernel iterations. This was rather time consuming (about a couple of days for each argument on an Intel Pentium 4, 1.70 GHz, 512 MB of RAM), as the SK computation complexity is quadratic in the number of semantic structure nodes (more precisely, it is $O(|F_{p,a}|^2)$, where $F_{p,a}$ is the largest semantic structure of the training data). This prevented us from carrying out cross-fold validation.</Paragraph>
<Paragraph position="4"> An important aspect is that a recent paper (Vishwanathan and Smola, 2002) shows that the tree-kernel complexity can be reduced to a linear one; this would make our approach largely applicable.</Paragraph> <Paragraph position="5"> Finally, there is considerable work on Natural Language Processing oriented kernels (Collins and Duffy, 2002; Lodhi et al., 2000; Gärtner, 2003; Cumby and Roth, 2003; Zelenko et al., 2003), covering string, parse-tree, graph and relational kernels, but, to our knowledge, none of them has been used to derive semantic information in the form of predicate argument structures. In particular, (Cristianini et al., 2001; Kandola et al., 2003) address the problem of the semantic similarity between two terms by using, respectively, document sets as term contexts and latent semantic indexing. Both techniques attempt to cluster together terms that express the same meaning. This is quite different, in means and purpose, from our approach, which derives more specific semantic information expressed as argument/predicate relations.</Paragraph> </Section> </Section> </Paper>