<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1008">
  <Title>Combining Sample Selection and Error-Driven Pruning for Machine Learning of Coreference Rules</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Noun phrase coreference resolution refers to the problem of determining which noun phrases (NPs) refer to each real-world entity mentioned in a document. Machine learning approaches to this problem have been reasonably successful, operating primarily by recasting the problem as a classification task (e.g. Aone and Bennett (1995), McCarthy and Lehnert (1995), Soon et al. (2001)). Specifically, an inductive learning algorithm is used to train a classifier that decides whether or not two NPs in a document are coreferent. Training data are typically created by relying on coreference chains from the training documents: training instances are generated by pairing each NP with each of its preceding NPs; instances are labeled as positive if the two NPs are in the same coreference chain, and labeled as negative otherwise.1 A separate clustering mechanism then coordinates the possibly contradictory pairwise coreference classification decisions and constructs a partition on the set of NPs with one cluster for each set of coreferent NPs. Although, in principle, any clustering algorithm can be used, most previous work uses a single-link clustering algorithm to impose coreference partitions.2 An implicit assumption in the choice of the single-link clustering algorithm is that coreference resolution is viewed as anaphora resolution, i.e. the goal during clustering is to find an antecedent for each anaphoric NP in a document.3 Three intrinsic properties of coreference4, however, make the formulation of the problem as a classification-based single-link clustering task potentially problematic.</Paragraph>
    <Paragraph position="0.5"> 3Here, an NP is anaphoric if it is part of a coreference chain but is not the head of the chain.</Paragraph>
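The instance-generation scheme described above can be sketched as follows. The function name and the representation of chains as sets of NP identifiers are illustrative choices for exposition, not the actual implementation used in the paper.

```python
from itertools import combinations

def make_training_instances(nps, chains):
    """Pair each NP with every preceding NP; label a pair positive
    exactly when both NPs belong to the same coreference chain.

    nps    -- list of NP identifiers in document order
    chains -- list of sets, one set per coreference chain
    """
    # Map each NP to the index of its chain (singletons stay unmapped).
    chain_of = {np: i for i, chain in enumerate(chains) for np in chain}
    instances = []
    for i, j in combinations(range(len(nps)), 2):  # i < j: NP_i precedes NP_j
        # Distinct defaults keep two unchained NPs from matching each other.
        label = chain_of.get(nps[i], -1) == chain_of.get(nps[j], -2)
        instances.append((nps[i], nps[j], label))
    return instances
```

For a document with n NPs this produces all n(n-1)/2 ordered pairs, which is exactly why the positive class becomes so rare as documents grow.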
    <Paragraph position="1"> 4Here, we use the term coreference loosely to refer to either the problem or the binary relation defined on a set of NPs. The particular choice should be clear from the context.</Paragraph>
    <Paragraph position="2"> Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, July 2002, pp. 55-62. Association for Computational Linguistics.</Paragraph>
    <Paragraph position="3"> Consequently, generating training instances by pairing each NP with each of its preceding NPs creates highly skewed class distributions, in which the number of positive instances is overwhelmed by the number of negative instances. For example, the standard MUC-6 and MUC-7 (1995; 1998) coreference data sets contain only 2% positive instances. Unfortunately, learning in the presence of such skewed class distributions remains an open area of research in the machine learning community (e.g. Pazzani et al. (1994), Fawcett (1996), Cardie and Howe (1997), Kubat and Matwin (1997)).</Paragraph>
    <Paragraph position="4"> Coreference is a discourse-level problem with different solutions for different types of NPs. The interpretation of a pronoun, for example, may be dependent only on its closest antecedent and not on the rest of the members of the same coreference chain.</Paragraph>
    <Paragraph position="5"> Proper name resolution, on the other hand, may be better served by ignoring locality constraints altogether and relying on string-matching or more sophisticated aliasing techniques. Consequently, generating positive instances from all pairs of NPs from the same coreference chain can potentially make the learning task harder: all but a few coreference links derived from any chain might be hard to identify based on the available contextual cues.</Paragraph>
    <Paragraph position="6"> Coreference is an equivalence relation. Recasting the problem as a classification task precludes enforcement of the transitivity constraint. After training, for example, the classifier might determine that A is coreferent with B, and B with C, but that A and C are not coreferent. Hence, the clustering mechanism is needed to coordinate these possibly contradictory pairwise classifications. In addition, because the coreference classifiers are trained independently of the clustering algorithm to be used, improvements in classification accuracy do not guarantee corresponding improvements in clustering-level accuracy, i.e. overall performance on the coreference resolution task might not improve.</Paragraph>
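Single-link clustering restores transitivity by merging any two clusters that share a positively classified pair. A minimal sketch using a union-find structure (an assumed implementation detail; the paper does not specify its data structures):

```python
def single_link_partition(nps, coreferent_pairs):
    """Merge NPs into clusters so that transitivity holds even when the
    pairwise classifier's decisions are contradictory (e.g. A-B and B-C
    positive but A-C negative still yields one cluster {A, B, C})."""
    parent = {np: np for np in nps}

    def find(x):
        # Path-halving find: follow parent links to the cluster root.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in coreferent_pairs:
        parent[find(a)] = find(b)  # union the two clusters

    clusters = {}
    for np in nps:
        clusters.setdefault(find(np), set()).add(np)
    return list(clusters.values())
```

Note that the negative A-C decision is simply overridden: the pairwise classifier has no way to veto a merge forced by transitivity.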
    <Paragraph position="7"> This paper examines each of the above issues.</Paragraph>
    <Paragraph position="8"> First, to address the problem of skewed class distributions, we apply a technique for negative instance selection similar to that proposed in Soon et al. (2001). In contrast to results reported there, however, we show empirically that system performance increases noticeably in response to negative example selection, with increases in F-measure of 3-5%.</Paragraph>
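Soon et al.'s negative selection keeps, for each anaphoric NP, only the negative instances formed with the NPs that intervene between it and its closest preceding antecedent; everything farther back is discarded. A sketch under the assumption that gold chains are given as a mapping from NP to chain id (the function and argument names are illustrative):

```python
def soon_style_instances(nps, chain_of):
    """For each anaphoric NP, emit one positive instance with its
    closest preceding antecedent and negative instances only with the
    NPs lying between the two, trimming the negative class.

    nps      -- list of NP identifiers in document order
    chain_of -- dict mapping an NP to its gold chain id (absent = singleton)
    """
    instances = []
    for j, npj in enumerate(nps):
        # Scan right to left for the closest NP in the same chain.
        ante = None
        for i in range(j - 1, -1, -1):
            ci = chain_of.get(nps[i])
            if ci is not None and ci == chain_of.get(npj):
                ante = i
                break
        if ante is None:
            continue  # non-anaphoric NP: no instances generated
        instances.append((nps[ante], npj, True))
        for k in range(ante + 1, j):
            instances.append((nps[k], npj, False))
    return instances
```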
    <Paragraph position="9"> Second, in an attempt to avoid the inclusion of &quot;hard&quot; training instances, we present a corpus-based method for implicit selection of positive instances.</Paragraph>
    <Paragraph position="10"> The approach is a fully automated variant of the example selection algorithm introduced in Harabagiu et al. (2001). With positive example selection, system performance (F-measure) again increases, by 12-14%.</Paragraph>
    <Paragraph position="11"> Finally, to more tightly tie the classification- and clustering-level coreference decisions, we propose an error-driven rule pruning algorithm that optimizes the coreference classifier ruleset with respect to the clustering-level coreference scoring function.</Paragraph>
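The pruning idea can be illustrated with a greedy loop that drops a rule whenever doing so raises the clustering-level score. Here `score` is a stand-in for running the full classify-then-cluster pipeline on held-out texts and returning its F-measure; this greedy variant is only an approximation for exposition, not the paper's exact algorithm:

```python
def prune_ruleset(ruleset, score):
    """Error-driven pruning sketch: repeatedly remove the first rule
    whose removal improves the clustering-level coreference score,
    stopping when no single removal helps."""
    best = score(ruleset)
    improved = True
    while improved and ruleset:
        improved = False
        for i in range(len(ruleset)):
            trial = ruleset[:i] + ruleset[i + 1:]
            s = score(trial)
            if s > best:  # keep the removal only if the end-to-end score rises
                best, ruleset, improved = s, trial, True
                break
    return ruleset, best
```

The key point is that the objective is the clustering-level scoring function, so a rule with high pairwise accuracy can still be pruned if it hurts the final partitions.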
    <Paragraph position="12"> Overall, the use of pruning boosts system performance from an F-measure of 69.3 to 69.5, and from 57.2 to 63.4 for the MUC-6 and MUC-7 data sets, respectively, enabling the system to achieve performance that surpasses that of the best MUC coreference systems by 4.6% and 1.6%. In particular, the system outperforms the best-performing learning-based coreference system (Soon et al., 2001) by 6.9% and 3.0%.</Paragraph>
    <Paragraph position="13"> The remainder of the paper is organized as follows. In sections 2 and 3, we present the machine learning framework underlying the baseline coreference system and examine the effect of negative sample selection. Section 4 presents our corpus-based algorithm for selection of positive instances. Section 5 describes and evaluates the error-driven pruning algorithm. We conclude with future work in section 6.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Coreference Resolution
</SectionTitle>
      <Paragraph position="0"> Our machine learning framework for coreference resolution is a standard combination of classification and clustering, as described above.</Paragraph>
      <Paragraph position="1"> Creating an instance. An instance in our machine learning framework is a description of two NPs in a document. More formally, let NP_i denote the i-th NP in document d. An instance formed from NP_i and NP_j is denoted by i(NP_i, NP_j). A valid instance is an instance i(NP_i, NP_j) such that NP_i precedes NP_j.5 Following previous work (Aone and Bennett (1995), 5By definition, exactly n(n-1)/2 valid instances can be created from n NPs in a given document.</Paragraph>
      <Paragraph position="2"> Soon et al. (2001)), we assume throughout the paper that only valid instances will be generated and used for training and testing. Each instance consists of 25 features, which are described in Table 1.6 The classification associated with a training instance is one of COREFERENT or NOT COREFERENT depending on whether the NPs co-refer in the associated training text.7 Building an NP coreference classifier. We use RIPPER (Cohen, 1995), an information gain-based propositional rule learning system, to train a classifier that, given a test instance i(NP_i, NP_j), decides whether or not NP_i and NP_j are coreferent. Specifically, RIPPER sequentially covers the positive training instances and induces a ruleset that determines when two NPs are coreferent. When none of the rules in the ruleset is applicable to a given NP pair, a default rule that classifies the pair as not coreferent is automatically invoked. The output of the classifier is either COREFERENT or NOT COREFERENT along with a number between 0 and 1 that indicates the confidence of the classification.</Paragraph>
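The ordered-ruleset behavior just described, including the automatic default rule, might be mimicked as follows. Representing each rule as a (condition, confidence) pair and giving the default rule confidence 1.0 are illustrative assumptions, not RIPPER's actual internals:

```python
def classify_pair(features, ruleset):
    """Apply an ordered ruleset to one NP-pair instance.  Each rule is
    a (condition, confidence) pair; the first condition that fires
    classifies the pair COREFERENT with that confidence.  When no rule
    applies, the default rule labels the pair NOT COREFERENT."""
    for condition, confidence in ruleset:
        if condition(features):
            return "COREFERENT", confidence
    return "NOT COREFERENT", 1.0  # assumed default-rule confidence
```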
      <Paragraph position="3"> Applying the classifier to create coreference chains. After training, the resulting ruleset is used by a best-first clustering algorithm to impose a partitioning on all NPs in the test texts, creating one cluster for each set of coreferent NPs. Texts are processed from left to right. Each NP encountered, NP_j, is compared in turn to each preceding NP, NP_i, from right to left. For each pair, a test instance is created as during training and is presented to the coreference classifier. The preceding NP classified as coreferent with NP_j with the highest confidence value is selected as the antecedent of NP_j; if no preceding NP is classified as coreferent with NP_j, no antecedent is selected.</Paragraph>
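The best-first antecedent selection just described can be sketched as follows; the `classify` argument stands in for the trained pairwise classifier, and the function name is only illustrative:

```python
def best_first_resolve(nps, classify):
    """Left-to-right best-first resolution: for each NP, test every
    preceding NP and link to the one classified COREFERENT with the
    highest confidence; otherwise leave the NP without an antecedent.

    classify(a, b) -- returns (label, confidence) for the pair (a, b)
    """
    antecedent = {}
    for j in range(len(nps)):
        best_conf, best_i = 0.0, None
        for i in range(j - 1, -1, -1):  # scan preceding NPs right to left
            label, conf = classify(nps[i], nps[j])
            if label == "COREFERENT" and conf > best_conf:
                best_conf, best_i = conf, i
        if best_i is not None:
            antecedent[nps[j]] = nps[best_i]
    return antecedent
```

The returned antecedent links, closed under transitivity, induce the final coreference partition.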
    </Section>
  </Section>
</Paper>