<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1006"> <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. Kernel-Based Pronoun Resolution with Structured Syntactic Knowledge</Title> <Section position="5" start_page="41" end_page="42" type="metho"> <SectionTitle> 3 The Resolution Framework </SectionTitle> <Paragraph position="0"> Our pronoun resolution system adopts a learning-based framework similar to those of Soon et al. (2001) and Ng and Cardie (2002).</Paragraph> <Paragraph position="1"> In this framework, a training or testing instance is formed by a pronoun and one of its antecedent candidates. During training, for each pronominal anaphor encountered, a positive instance is created by pairing the anaphor with its closest antecedent. A set of negative instances is also formed by pairing the anaphor with each of the non-coreferential candidates. From these training instances, a binary classifier is generated using a particular learning algorithm. During resolution, a pronominal anaphor to be resolved is paired in turn with each preceding antecedent candidate to form a testing instance. This instance is presented to the classifier, which returns a class label with a confidence value indicating the likelihood that the candidate is the antecedent. The candidate with the highest confidence value is selected as the antecedent of the pronominal anaphor.</Paragraph> <Section position="1" start_page="42" end_page="42" type="sub_section"> <SectionTitle> 3.1 Feature Space </SectionTitle> <Paragraph position="0"> As with many other learning-based approaches, the knowledge for reference determination is represented as a set of features associated with the training or test instances. In our baseline system, the adopted features include lexical properties, morphological type, distance, salience, parallelism, grammatical role, and so on. 
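The instance creation and best-candidate selection scheme described at the start of this section can be sketched as follows; all names here are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Sketch of the learning-based resolution framework: pair the anaphor
# with its closest antecedent (positive) and with non-coreferential
# candidates (negative); at test time, pick the candidate with the
# highest classifier confidence. Names are hypothetical.

def make_training_instances(anaphor, candidates, antecedents):
    """Build labelled (anaphor, candidate) pairs for one anaphor.
    `antecedents` is assumed ordered with the closest antecedent first."""
    closest = antecedents[0]
    instances = [((anaphor, closest), +1)]
    instances += [((anaphor, c), -1) for c in candidates if c not in antecedents]
    return instances

def resolve(anaphor, candidates, classifier):
    """Return the candidate whose test instance receives the highest
    confidence value from the classifier."""
    return max(candidates, key=lambda c: classifier.confidence((anaphor, c)))
```

Here `classifier.confidence` stands in for whatever scoring interface the learned binary classifier exposes.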
Listed in Table 1, all these features have been shown to be effective for pronoun resolution in previous work.</Paragraph> </Section> <Section position="2" start_page="42" end_page="42" type="sub_section"> <SectionTitle> 3.2 Support Vector Machine </SectionTitle> <Paragraph position="0"> In theory, any discriminative learning algorithm could be applied to learn the classifier for pronoun resolution. In our study, we use Support Vector Machines (Vapnik, 1995) to allow the use of kernels to incorporate the structured feature.</Paragraph> <Paragraph position="1"> Suppose the training set S consists of labelled vectors {(xi, yi)}, where xi is the feature vector of a training instance and yi is its class label. The classifier learned by SVM is f(x) = SUM_i ai yi (xi . x) + b (1)</Paragraph> <Paragraph position="3"> where ai is the learned parameter for a support vector xi. An instance x is classified as positive if f(x) > 0.</Paragraph> <Paragraph position="5"> One advantage of SVM is that we can use kernel methods to map the feature space into a high-dimensional space, in case the problem cannot be separated linearly in the original space.</Paragraph> <Paragraph position="6"> In this case the dot-product x1 . x2 is replaced by a kernel function (or kernel) between the two vectors, that is, K(x1, x2). For learning with the normal features listed in Table 1, we can simply employ the well-known polynomial or radial basis kernels, which can be computed efficiently. 
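The kernelized SVM decision function described above can be sketched as follows; the support vectors, labels, and weights would come from actual SVM training (e.g., with SVM-Light), so this toy code only illustrates the scoring step.

```python
# Minimal sketch of the SVM decision function with a polynomial
# kernel: f(x) = sum_i alpha_i * y_i * K(x_i, x) + b. The sign gives
# the class; the value serves as the candidate's confidence.

def poly_kernel(x1, x2, degree=2, c=1.0):
    """Polynomial kernel K(x1, x2) = (x1 . x2 + c) ** degree."""
    dot = sum(a * b for a, b in zip(x1, x2))
    return (dot + c) ** degree

def decision(x, support_vectors, labels, alphas, b, kernel=poly_kernel):
    """Evaluate f(x) over the support vectors with the given kernel."""
    return sum(a * y * kernel(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b
```

Swapping `poly_kernel` for a tree kernel is what lets the same machinery consume structured features.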
(Footnote 1: For our task, the result of f(x) is used as the confidence value of the candidate being the antecedent of the pronoun described by x.)</Paragraph> <Paragraph position="7"> In the next section we will discuss how to use kernels to incorporate the more complex structured feature.</Paragraph> </Section> </Section> <Section position="6" start_page="42" end_page="44" type="metho"> <SectionTitle> 4 Incorporating Structured Syntactic Information </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="42" end_page="42" type="sub_section"> <SectionTitle> 4.1 Main Idea </SectionTitle> <Paragraph position="0"> A parse tree that covers a pronoun and its antecedent candidate can provide much syntactic information about the pair. The syntactic knowledge commonly used for pronoun resolution, such as grammatical roles or governing relations, can be directly described by the tree structure. Other syntactic knowledge that may be helpful for resolution could also be implicitly represented in the tree. Therefore, by comparing the common substructures of two trees we can find out to what degree they contain similar syntactic information, which can be done using a convolution tree kernel.</Paragraph> <Paragraph position="1"> The value returned by the tree kernel reflects the syntactic similarity between two instances.</Paragraph> <Paragraph position="2"> Such syntactic similarity can be further combined with other knowledge to compute the overall similarity between two instances, through a composite kernel. Thus an SVM classifier can be learned and then used for resolution. 
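The subtree-counting idea behind the convolution tree kernel mentioned above can be sketched as follows; the nested-tuple tree encoding and this minimal Collins-and-Duffy-style recursion (with a decay factor `lam`) are assumptions of the sketch, not the authors' implementation.

```python
# Toy convolution tree kernel: count common subtrees of two trees.
# A tree is (label, child1, child2, ...); a leaf is a plain string.

def production(node):
    """The grammar production at a node: (label, child labels)."""
    return (node[0], tuple(ch if isinstance(ch, str) else ch[0]
                           for ch in node[1:]))

def c_delta(n1, n2, lam=1.0):
    """Number of common subtrees rooted at n1 and n2 (with decay lam)."""
    if production(n1) != production(n2):
        return 0.0
    score = lam
    for ch1, ch2 in zip(n1[1:], n2[1:]):
        if not isinstance(ch1, str):      # recurse on internal children
            score *= 1.0 + c_delta(ch1, ch2, lam)
    return score

def internal_nodes(tree):
    if isinstance(tree, str):
        return []
    nodes = [tree]
    for child in tree[1:]:
        nodes += internal_nodes(child)
    return nodes

def tree_kernel(t1, t2, lam=1.0):
    """K(t1, t2) = sum over all node pairs of c_delta."""
    return sum(c_delta(n1, n2, lam)
               for n1 in internal_nodes(t1) for n2 in internal_nodes(t2))
```

For example, `("NP", ("DT", "the"), ("NN", "man"))` compared with itself yields six common subtrees; a dynamic-programming version makes this polynomial-time in practice, as the paper notes.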
This is the main idea of our approach.</Paragraph> </Section> <Section position="2" start_page="42" end_page="43" type="sub_section"> <SectionTitle> 4.2 Structured Syntactic Feature </SectionTitle> <Paragraph position="0"> Normally, parsing is done at the sentence level.</Paragraph> <Paragraph position="1"> However, in many cases a pronoun and an antecedent candidate do not occur in the same sentence. To present their syntactic properties and relations in a single tree structure, we construct a syntax tree for an entire text by attaching the parse trees of all its sentences to an upper node.</Paragraph> <Paragraph position="2"> Having obtained the parse tree of a text, we shall consider how to select the appropriate portion of the tree as the structured feature for a given instance. As each instance is related to a pronoun and a candidate, the structured feature should at least cover both of these two expressions. Generally, the more of the tree's substructure is included, the more syntactic information is provided, but at the same time the more noise from parsing errors is likely to be introduced. In our study, we examine three possible structured features that contain different substructures of the parse tree: Min-Expansion This feature records the minimal structure covering both the pronoun and the candidate in the parse tree. It only includes the nodes occurring in the shortest path connecting the pronoun and the candidate, via the nearest commonly commanding node. 
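The Min-Expansion extraction just described can be sketched as follows; the child-to-parent dictionary encoding of the parse tree is an assumption of this sketch.

```python
# Sketch of Min-Expansion: keep the nodes on the shortest path from
# the pronoun to the candidate through their nearest commonly
# commanding node (the lowest common ancestor in the tree).

def path_to_root(node, parent):
    """Nodes from `node` up to the root, following parent links."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def min_expansion(pronoun, candidate, parent):
    p_path = path_to_root(pronoun, parent)
    c_path = path_to_root(candidate, parent)
    ancestors = set(p_path)
    # lowest common ancestor: first node on the candidate's root path
    # that also lies on the pronoun's root path
    lca = next(n for n in c_path if n in ancestors)
    return (set(p_path[:p_path.index(lca) + 1])
            | set(c_path[:c_path.index(lca) + 1]))
```

Nodes are represented here by their (already marked-up) labels; a real implementation would operate on parse-tree node objects instead.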
For example, considering the sentence &quot;The man in the room saw him.&quot;, the structured feature for the instance i{&quot;him&quot;, &quot;the man&quot;} is circled with dashed lines in the leftmost picture of Figure 1.</Paragraph> <Paragraph position="3"> Simple-Expansion Min-Expansion can, to some degree, describe the syntactic relationship between the candidate and the pronoun.</Paragraph> <Paragraph position="4"> However, it is incapable of capturing the syntactic properties of the candidate or the pronoun themselves, because the tree structure surrounding each expression is not taken into consideration. To incorporate such information, feature Simple-Expansion not only contains all the nodes in Min-Expansion, but also includes the first-level children of these nodes. The middle of Figure 1 shows such a feature for i{&quot;him&quot;, &quot;the man&quot;}. We can see that the nodes &quot;PP&quot; (for &quot;in the room&quot;) and &quot;VB&quot; (for &quot;saw&quot;) are included in the feature, which provides clues that the candidate is modified by a prepositional phrase and the pronoun is the object of a verb.</Paragraph> <Paragraph position="5"> Full-Expansion This feature focuses on the whole tree structure between the candidate and pronoun pair. In addition to the nodes in Simple-Expansion, it includes all the nodes (beneath the nearest commonly commanding node) that cover the words between the candidate and the pronoun. Such a feature keeps the most information related to the pair, beyond the nodes where the pronoun and the candidate occur. The rightmost picture of Figure 1 shows the structure for feature Full-Expansion of i{&quot;him&quot;, &quot;the man&quot;}. 
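The Simple-Expansion step described a few sentences above (Min-Expansion plus the first-level children of its nodes) can be sketched as follows; the label-to-children dictionary encoding is an assumption of this sketch.

```python
# Sketch of Simple-Expansion: take a Min-Expansion node set and add
# the first-level children of every node already kept.

def simple_expansion(min_nodes, children):
    expanded = set(min_nodes)
    for node in min_nodes:
        expanded.update(children.get(node, []))
    return expanded
```

On the running example, this is how the "PP" (for "in the room") and "VB" (for "saw") nodes enter the feature.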
As illustrated, unlike in Simple-Expansion, the subtree of &quot;PP&quot; (for &quot;in the room&quot;) is fully expanded and all its child nodes are included in Full-Expansion.</Paragraph> <Paragraph position="6"> Note that, to distinguish them from other words, we explicitly mark up in the structured feature the pronoun and the antecedent candidate under consideration, by appending the string tags &quot;ANA&quot; and &quot;CANDI&quot; to their respective nodes (e.g., &quot;NN-CANDI&quot; for &quot;man&quot; and &quot;PRP-ANA&quot; for &quot;him&quot;, as shown in Figure 1).</Paragraph> </Section> <Section position="3" start_page="43" end_page="44" type="sub_section"> <SectionTitle> 4.3 Structural Kernel and Composite Kernel </SectionTitle> <Paragraph position="0"> To calculate the similarity between two structured features, we use the convolution tree kernel defined by Collins and Duffy (2002) and Moschitti (2004). Given two trees, the kernel enumerates all their subtrees and uses the number of common subtrees as the measure of similarity between the trees. As has been proved, the convolution kernel can be computed efficiently in polynomial time.</Paragraph> <Paragraph position="1"> The above tree kernel applies only to the structured feature. We also need a composite kernel to combine the structured feature with the normal features described in Section 3.1. In our study we define the composite kernel as follows: Kc(x1, x2) = (Kt(x1, x2) / |Kt|) * (Kn(x1, x2) / |Kn|) (2)</Paragraph> <Paragraph position="3"> where Kt is the convolution tree kernel defined on the structured feature, and Kn is the kernel applied on the normal features. Both kernels are divided by their respective lengths for normalization, that is, each kernel value K(x1, x2) is divided by sqrt(K(x1, x1) * K(x2, x2)). 
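The normalized-product composite kernel can be sketched as follows; the stand-in kernels in the usage note are plain dot products, not the actual tree or polynomial kernels, so this only illustrates the combination scheme.

```python
# Sketch of the composite kernel: the product of the tree kernel K_t
# and the flat-feature kernel K_n, each normalized by its "length"
# so each factor lies in [-1, 1] (and in [0, 1] for counting kernels).

def normalize(k, x1, x2):
    """Cosine-style normalization: K(x1,x2) / sqrt(K(x1,x1) * K(x2,x2))."""
    denom = (k(x1, x1) * k(x2, x2)) ** 0.5
    return k(x1, x2) / denom if denom else 0.0

def composite_kernel(kt, kn, t1, t2, v1, v2):
    """K_c = normalized K_t on trees (t1, t2) times normalized K_n on
    normal-feature vectors (v1, v2)."""
    return normalize(kt, t1, t2) * normalize(kn, v1, v2)
```

Because the result is a product, it is close to 1 only when both factors are, matching the interpretation given in the text.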
The new composite kernel Kc, defined as the product of the normalized Kt and Kn, will return a value close to 1 only if both the structured features and the normal features of the two vectors have high similarity under their respective kernels.</Paragraph> </Section> </Section> <Section position="7" start_page="44" end_page="46" type="metho"> <SectionTitle> 5 Experiments and Discussions </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="44" end_page="44" type="sub_section"> <SectionTitle> 5.1 Experimental Setup </SectionTitle> <Paragraph position="0"> In our study we focused on third-person pronominal anaphora resolution. All the experiments were done on the ACE-2 V1.0 corpus (NIST, 2003), which contains two data sets, training and devtest, used for training and testing respectively. Each of these sets is further divided into three domains: newswire (NWire), newspaper (NPaper), and broadcast news (BNews).</Paragraph> <Paragraph position="1"> An input raw text was preprocessed automatically by a pipeline of NLP components, including sentence boundary detection, POS-tagging, text chunking and named-entity recognition. The texts were parsed using the maximum-entropy-based Charniak parser (Charniak, 2000), based on which the structured features were computed automatically. For learning, the SVM-Light software (Joachims, 1999) was employed with the convolution tree kernel implemented by Moschitti (2004). All classifiers were trained with default learning parameters.</Paragraph> <Paragraph position="2"> The performance was evaluated based on the metric success, the ratio of the number of correctly resolved anaphors over the number of all anaphors. For each anaphor, the NPs occurring within the current and previous two sentences were taken as the initial antecedent candidates.</Paragraph> <Paragraph position="3"> Candidates with mismatched number or gender agreement were filtered from the candidate set. 
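The agreement-based candidate filtering just described can be sketched as follows; the attribute names on the mention records are assumptions of this sketch, not the actual preprocessing output.

```python
# Sketch of candidate filtering: keep only candidates that agree with
# the anaphor in number, and in gender where the gender is known.

def filter_candidates(anaphor, candidates):
    kept = []
    for c in candidates:
        if c["number"] != anaphor["number"]:
            continue
        if c["gender"] not in (anaphor["gender"], "unknown"):
            continue
        kept.append(c)
    return kept
```

A person-agreement check for pronouns and named entities, as mentioned next, would be one more clause of the same shape.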
Also, pronouns or NEs that disagreed in person with the anaphor were removed in advance. For training, there were 1207, 1440, and 1260 pronouns with a non-empty candidate set found in the three domains respectively, while for testing, the numbers were 313, 399 and 271. On average, a pronoun anaphor had 6-9 antecedent candidates ahead. In total, we obtained around 10k, 13k and 8k training instances for the three domains.</Paragraph> </Section> <Section position="2" start_page="44" end_page="44" type="sub_section"> <SectionTitle> 5.2 Baseline Systems </SectionTitle> <Paragraph position="0"> Table 2 lists the performance of different systems.</Paragraph> <Paragraph position="1"> We first tested Hobbs' algorithm (Hobbs, 1978).</Paragraph> <Paragraph position="2"> As described in Section 2, the algorithm uses heuristic rules to search the parse tree for the antecedent, and acts as a good baseline against which to compare the learning-based approach with the structured feature. As shown in the first line of Table 2, Hobbs' algorithm obtains 66%-72% success rates on the three domains.</Paragraph> <Paragraph position="3"> The second block of Table 2 shows the baseline system (NORM) that uses only the normal features listed in Table 1. Throughout our experiments, we applied the polynomial kernel on the normal features to learn the SVM classifiers. In the table we also compare the SVM-based results with those using other learning algorithms, i.e., Maximum Entropy (Maxent) and the C5 decision tree, which are more commonly used in the anaphora resolution task.</Paragraph> <Paragraph position="4"> As shown in the table, the system with normal features (NORM) obtains 74%-77% success rates for the three domains. The performance is similar to other published results like those by Keller and Lapata (2003), who adopted a similar feature set and reported around 75% success rates on the ACE data set. 
The comparison between different learning algorithms indicates that SVM can work as well as or even better than Maxent.</Paragraph> <Paragraph position="6"/> </Section> <Section position="3" start_page="44" end_page="46" type="sub_section"> <SectionTitle> 5.3 Systems with Structured Features </SectionTitle> <Paragraph position="0"> The last two blocks of Table 2 summarize the results using the three syntactic structured features, i.e., Min Expansion (S MIN), Simple Expansion (S SIMPLE) and Full Expansion (S FULL). Between them, the third block is for the systems using each individual structured feature alone. We can see that all three structured features perform better than the normal features for NPaper (up to 5.3% success) and BNews (up to 8.1% success), or equally well (within 1-2% in success) for NWire. When used together with the normal features, as shown in the last block, the three structured features all outperform the baselines. In particular, the combinations NORM+S SIMPLE and NORM+S FULL achieve significantly6 better results than NORM, with the success rate increasing by (4.8%, 5.3% and 8.1%) and (7.1%, 5.8%, 7.2%) respectively. All these results demonstrate that the structured syntactic feature is effective for pronoun resolution.</Paragraph> <Paragraph position="1"> We further compare the performance of the three different structured features. As shown in Table 2, when used together with the normal features, Full Expansion gives the highest success rates in NWire and NPaper, but nevertheless the lowest in BNews. This is likely because feature Full-Expansion captures a larger portion of the parse trees, and thus can provide more syntactic information than Min Expansion or Simple Expansion. However, if the texts are less formally structured, as in BNews, Full-Expansion would inevitably introduce more noise and thus adversely affect the resolution performance. 
By contrast, feature Simple Expansion achieves a balance between the information gained and the noise introduced: from Table 2 we can find that, compared with the other two features, Simple Expansion produces average results across all three domains. (Footnote 6: p < 0.05 by a 2-tailed t test.)</Paragraph> <Paragraph position="2"> For this reason, our subsequent reports will focus on Simple Expansion, unless otherwise specified.</Paragraph> <Paragraph position="3"> As described, to compute the structured feature, the parse trees of different sentences are connected to form a large tree for the text. It would be interesting to find out how the structured feature works for pronouns whose antecedents reside in different sentences. For this purpose we tested the success rates for pronouns with the closest antecedent occurring in the same sentence, one sentence apart, and two sentences apart. Table 3 compares the learning systems with and without the structured feature. From the table, for all the systems, the success rates drop as the distance between the pronoun and the antecedent increases. However, in most cases, adding the structured feature brings consistent improvement over the baselines regardless of the sentence distance. This observation suggests that the structured syntactic information is helpful for both intra-sentential and inter-sentential pronoun resolution.</Paragraph> <Paragraph position="4"> We were also concerned with how the structured feature works for different types of pronouns. Table 4 lists the resolution results for two types of pronouns: personal pronouns (i.e., &quot;he&quot;, &quot;she&quot;) and neuter-gender pronouns (i.e., &quot;it&quot; and &quot;they&quot;). 
As shown, with the structured feature incorporated, the system NORM+S Simple significantly boosts the performance of the baseline (NORM), for both personal pronoun and neuter-gender pronoun resolution.</Paragraph> </Section> <Section position="4" start_page="46" end_page="46" type="sub_section"> <SectionTitle> 5.4 Learning Curves </SectionTitle> <Paragraph position="0"> Figure 2 plots the learning curves for the systems with three feature sets, i.e., normal features (NORM), structured feature alone (S Simple), and combined features (NORM+S Simple). We trained each system with different numbers of instances, from 1k, 2k, 3k, and so on up to the full size. Each point in the figures is the average over two trials with instances selected forwards and backwards respectively. From the figures we can find that (1) used in combination (NORM+S Simple), the structured feature shows superiority over NORM, achieving consistently better results than the normal features (NORM) alone in all three domains.</Paragraph> <Paragraph position="1"> (2) With training instances above 3k, the structured feature, used either in isolation (S Simple) or in combination (NORM+S Simple), leads to a steady increase in the success rates and exhibits smoother learning curves than the normal features (NORM). These observations further demonstrate the reliability of the structured feature in pronoun resolution.</Paragraph> </Section> </Section> <Section position="8" start_page="46" end_page="47" type="metho"> <SectionTitle> 5.5 Feature Analysis </SectionTitle> <Paragraph position="0"> In our experiment we were also interested in comparing the structured feature with the normal flat features extracted from the parse tree, like the features Subject and Object. For this purpose we removed these two grammatical features from the normal feature set, and then trained the systems again. 
As shown in Table 5, the two grammatical-role features are important for pronoun resolution: removing these features results in up to a 5.7% (NWire) decrease in success. However, when the structured feature is included, the loss in success is reduced to 1.9% and 1.1% for NWire and BNews, and a slight improvement can even be achieved for NPaper. This indicates that the structured feature can effectively provide the syntactic information important for pronoun resolution.</Paragraph> <Paragraph position="1"> We also tested the flat syntactic feature set proposed in the work of Luo and Zitouni (2005). As described in Section 2, the feature set is inspired by the binding theory, including features such as whether the candidate c-commands the pronoun, and the counts of &quot;NP&quot;, &quot;VP&quot;, &quot;S&quot; nodes in the commanding path. The last line of Table 5 shows the results of adding these features to the normal feature set. In line with the reports in (Luo and Zitouni, 2005), we do observe a performance improvement over the baseline (NORM) for all the domains. However, the increase in the success rates (up to 1.3%) is not as large as that obtained by adding the structured feature (NORM+S Simple) instead.</Paragraph> </Section> <Section position="1" start_page="46" end_page="47" type="sub_section"> <SectionTitle> 5.6 Comparison with Different Parsers </SectionTitle> <Paragraph position="0"> As mentioned, the above reported results were based on Charniak (2000)'s parser. It would be interesting to examine the influence of different parsers on the resolution performance. For this purpose, we also tried the parser by Collins (1999) (Model II), and the results are shown in Table 6. We can see that Charniak (2000)'s parser leads to higher success rates for NPaper and BNews, while Collins (1999)'s achieves better results for NWire. 
However, the difference between the results of the two parsers is not significant (less than 2% success) for the three domains, no matter whether the structured feature is used alone or in combination.</Paragraph> </Section> </Section> </Paper>