<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0634">
<Title>Corpus-Based Learning for Noun Phrase Coreference Resolution</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Coreference resolution refers to the process of determining whether two expressions in natural language refer to the same entity in the world.</Paragraph>
<Paragraph position="1"> It is an important subtask in natural language processing systems. In particular, information extraction (IE) systems like those built in the DARPA Message Understanding Conferences (Chinchor, 1998; Sundheim, 1995) have revealed that coreference resolution is such a critical component of IE systems that a separate coreference subtask has been defined and evaluated since MUC-6 (Committee, 1995).</Paragraph>
<Paragraph position="2"> In this paper, we focus on the task of determining coreference relations as defined in MUC-6 (Committee, 1995). Specifically, a coreference relation denotes an identity of reference and holds between two textual elements known as markables, which are nouns, noun phrases, or pronouns. Thus, our coreference task resolves general noun phrases and not just pronouns, unlike some previous work on anaphora resolution. The ability to link co-referring noun phrases both within and across sentences is critical to discourse analysis and language understanding in general.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="285" type="metho">
<SectionTitle> 2 A Learning Approach for Coreference Resolution </SectionTitle>
<Paragraph position="0"> We adopt a corpus-based learning approach for noun phrase coreference resolution. In this approach, we need a relatively small corpus of training documents that have been annotated with coreference chains of noun phrases. All possible markables in a training document are determined by a pipeline of language processing modules, and training examples in the form of feature vectors are generated for appropriate pairs of markables. These training examples are then given to a learning algorithm to build a classifier. To determine the coreference chains in a new document, all markables are determined and potential pairs of co-referring markables are presented to the classifier, which decides whether the two markables actually co-refer. We give the details of these steps in the following subsections.</Paragraph>
<Section position="1" start_page="0" end_page="285" type="sub_section">
<SectionTitle> 2.1 Determination of Markables </SectionTitle>
<Paragraph position="0"> A prerequisite for coreference resolution is to obtain most, if not all, of the possible markables in a raw input text. To determine the markables, a pipeline of natural language processing (NLP) modules is used. They consist of sentence segmentation, tokenization, morphological analysis, part-of-speech tagging, noun phrase identification, named entity recognition, and semantic class determination. As far as coreference resolution is concerned, the goal of these NLP modules is to determine the boundaries of the markables and to provide the necessary information about each markable for subsequent generation of features in the training examples.</Paragraph>
<Paragraph position="1"> Our part-of-speech tagger is a standard statistical bigram tagger based on the Hidden Markov Model (HMM) (Church, 1988).
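For readers unfamiliar with such taggers, the following is a minimal, illustrative sketch of Viterbi decoding for a generic bigram HMM tagger. It is not the authors' implementation: the tag set, the toy probability tables, and the function name are invented for illustration, and in practice the transition and emission probabilities would be estimated from a part-of-speech-tagged corpus.

    # Illustrative Viterbi decoding for a bigram HMM part-of-speech tagger.
    # Probability tables are assumed to be estimated from a tagged corpus;
    # the tiny tables below are placeholders for illustration only.
    def viterbi(words, tags, trans, emit, start):
        """Return the most probable tag sequence for `words`.

        trans[t1][t2] = P(t2 | t1), emit[t][w] = P(w | t), start[t] = P(t | <s>).
        """
        # Probability of the best path ending in each tag at position 0.
        V = [{t: start.get(t, 1e-12) * emit[t].get(words[0], 1e-12) for t in tags}]
        back = [{}]
        for i in range(1, len(words)):
            V.append({})
            back.append({})
            for t in tags:
                # Best previous tag for reaching tag t at position i.
                best_prev, best_score = max(
                    ((p, V[i - 1][p] * trans[p].get(t, 1e-12)) for p in tags),
                    key=lambda x: x[1])
                V[i][t] = best_score * emit[t].get(words[i], 1e-12)
                back[i][t] = best_prev
        # Follow back-pointers from the best final tag.
        last = max(V[-1], key=V[-1].get)
        seq = [last]
        for i in range(len(words) - 1, 0, -1):
            seq.append(back[i][seq[-1]])
        return list(reversed(seq))

    # Toy example with made-up probabilities.
    tags = ["DT", "NN", "VB"]
    start = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
    trans = {"DT": {"NN": 0.9, "VB": 0.05, "DT": 0.05},
             "NN": {"VB": 0.6, "NN": 0.3, "DT": 0.1},
             "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1}}
    emit = {"DT": {"the": 0.7},
            "NN": {"dog": 0.4, "barks": 0.1},
            "VB": {"barks": 0.5}}
    print(viterbi(["the", "dog", "barks"], tags, trans, emit, start))
    # prints ['DT', 'NN', 'VB']
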
Similarly, we built a statistical HMM-based noun phrase identification module where the noun phrase boundaries are determined solely based on the part-of-speech tags assigned to the words in a sentence. We also implemented a module that recognizes MUC-style named entities, i.e., organization, person, location, date, time, money, and percent. Our named entity recognition module uses the HMM approach of (Bikel et al., 1999; Bikel et al., 1997), which learns from a tagged corpus of named entities. That is, our part-of-speech tagger, noun phrase identification module, and named entity recognition module are all based on HMMs and learn from corpora tagged with parts-of-speech, noun phrases, and named entities, respectively. The markables needed for coreference resolution are the union of the noun phrases and named entities found.</Paragraph>
<Paragraph position="2"> To achieve high accuracy in coreference resolution, it is most critical that the eligible candidates for coreference are identified correctly in the first place. In order to test the effectiveness of our system in determining the markables, we attempted to match the markables generated by our system against those appearing in the coreference chains annotated in 100 SGML documents, a subset of the documents available in MUC-6. We found that our system is able to correctly identify about 85% of the noun phrases appearing in coreference chains in the 100 annotated SGML documents. Most of the unmatched noun phrases are of the following types: (1) Our system generated a head noun that is only a subset of the noun phrase in the annotated corpus. For example, &quot;Saudi Arabia, the cartel's biggest producer,&quot; was annotated as a markable but our system generated only &quot;Saudi Arabia&quot;. (2) Our system extracted a sequence of words that cannot be considered as a markable. (3) The notion of what constitutes a markable is unclear. For example, &quot;wage reductions&quot; was annotated, but &quot;selective wage reductions&quot; was identified by our system instead.</Paragraph>
</Section>
<Section position="2" start_page="285" end_page="285" type="sub_section">
<SectionTitle> 2.2 Determination of Feature Vectors </SectionTitle>
<Paragraph position="0"> Feature vectors are required for training and testing the coreference engine. A feature vector consists of 10 features described below, and is derived based on two extracted markables, i and j, where i is the antecedent and j is the anaphor. Information needed to derive the feature vectors is provided by the pipeline of language modules prior to the coreference engine. (An illustrative sketch of how such a feature vector can be assembled is given after the feature descriptions.)</Paragraph>
<Paragraph position="1"> 1. Distance Feature Its possible values are 0, 1, 2, 3, .... This feature captures the distance between i and j. If i and j are in the same sentence, the value is 0; if they are 1 sentence apart, the value is 1; and so on.</Paragraph>
<Paragraph position="2"> 2. Pronoun Feature Its possible values are true or false. If j is a pronoun, return true; else return false. Pronouns include reflexive pronouns (himself, herself), personal pronouns (he, him, you), and possessive pronouns (hers, her).</Paragraph>
<Paragraph position="3"> 3. String Match Feature Its possible values are true or false. If the string of i matches the string of j, return true; else return false. 4. Definite Noun Phrase Feature Its possible values are true or false. In our definition, a definite noun phrase is a noun phrase that starts with the word &quot;the&quot;.
For example, &quot;the car&quot; is a definite noun phrase. If j is a definite noun phrase, return true; else return false.</Paragraph>
</Section>
</Section>
<Section position="5" start_page="285" end_page="285" type="metho">
<SectionTitle> 5. Demonstrative Noun Phrase Feature </SectionTitle>
<Paragraph position="0"> Its possible values are true or false. A demonstrative noun phrase is one that starts with the word &quot;this&quot;, &quot;that&quot;, &quot;these&quot;, or &quot;those&quot;. If j is a demonstrative noun phrase, then return true; else return false.</Paragraph>
<Paragraph position="1"> 6. Number Agreement Feature Its possible values are true or false. If i and j agree in number, i.e., they are both singular or both plural, the value is true; otherwise, it is false. Pronouns such as &quot;they&quot;, &quot;them&quot;, etc., are plural, while &quot;it&quot;, &quot;him&quot;, etc., are singular. The morphological root of a noun is used to determine whether it is singular or plural if the noun is not a pronoun.</Paragraph>
</Section>
<Section position="6" start_page="285" end_page="287" type="metho">
<SectionTitle> 7. Semantic Class Agreement Feature </SectionTitle>
<Paragraph position="0"> Its possible values are true, false, or unknown. In our system, we defined the following semantic classes: &quot;female&quot;, &quot;male&quot;, &quot;person&quot;, &quot;organization&quot;, &quot;location&quot;, &quot;date&quot;, &quot;time&quot;, &quot;money&quot;, &quot;percent&quot;, and &quot;object&quot;. These semantic classes are arranged in a simple ISA hierarchy. Each of the &quot;female&quot; and &quot;male&quot; semantic classes is a subclass of the semantic class &quot;person&quot;, while each of the semantic classes &quot;organization&quot;, &quot;location&quot;, &quot;date&quot;, &quot;time&quot;, &quot;money&quot;, and &quot;percent&quot; is a subclass of the semantic class &quot;object&quot;. Each of these defined semantic classes is then mapped to a WORDNET synset (Miller, 1990). For example, &quot;male&quot; is mapped to sense 2 of the noun &quot;male&quot; in WORDNET, &quot;location&quot; is mapped to sense 1 of the noun &quot;location&quot;, etc.</Paragraph>
<Paragraph position="1"> The semantic class determination module assumes that the semantic class for every markable extracted is the first sense of the head noun of the markable. Since WORDNET orders the senses of a noun by their frequency, this is equivalent to choosing the most frequent sense as the semantic class for each noun. If the selected semantic class of a markable is a subclass of one of our defined semantic classes C, then the semantic class of the markable is C; else its semantic class is &quot;unknown&quot;.</Paragraph>
<Paragraph position="2"> The semantic classes of markables i and j are in agreement if one is the parent of the other (e.g., &quot;chairman&quot; with semantic class &quot;person&quot; and &quot;Mr. Lim&quot; with semantic class &quot;male&quot;), or both of them are the same (e.g., &quot;Mr. Lim&quot; and &quot;he&quot;, both of semantic class &quot;male&quot;). The value returned for such cases is true. If the semantic classes of i and j are not the same (e.g., &quot;IBM&quot; with semantic class &quot;organization&quot; and &quot;Mr. Lim&quot; with semantic class &quot;male&quot;), return false.</Paragraph>
<Paragraph position="3"> If either semantic class is &quot;unknown&quot;, then the head nouns of both markables are compared.
If they are the same, return true; else return unknown.</Paragraph>
<Paragraph position="4"> 8. Gender Agreement Feature Its possible values are true, false, or unknown. The gender of a markable is determined in several ways. Designators and pronouns such as &quot;Mr.&quot;, &quot;Mrs.&quot;, &quot;she&quot;, &quot;he&quot;, etc., can determine the gender. For a markable that is a person's name such as &quot;Peter H. Diller&quot;, the gender cannot be determined by the above method. In our system, the gender of such a markable can only be determined if there are markables found later in the document that refer to &quot;Peter H. Diller&quot; by using the designator form of the name, such as &quot;Mr. Diller&quot;. The gender of a markable will be unknown for noun phrases such as &quot;the president&quot;, &quot;chief executive officer&quot;, etc. If the gender of either markable i or j is unknown, then the gender agreement feature value is unknown; else if i and j agree in gender, then the feature value is true; otherwise its value is false.</Paragraph>
<Paragraph position="6"> 9. Proper Name Feature Its possible values are true or false. A proper name is determined based on capitalization. Prepositions appearing in the name, such as &quot;of&quot;, &quot;and&quot;, etc., need not be in upper case. If i and j are both proper names, return true; else return false.</Paragraph>
<Paragraph position="7"> 10. Alias Feature Its possible values are true or false. If i is an alias of j or vice versa, return true; else return false. That is, this feature value is true if i and j are proper names that refer to the same entity. For example, the pairs &quot;Mr. Simpson&quot; and &quot;Bent Simpson&quot;, &quot;IBM&quot; and &quot;International Business Machines Corp.&quot;, &quot;SEC&quot; and &quot;the Securities and Exchange Commission&quot;, and &quot;Mr. Dingell&quot; and &quot;Chairman John Dingell&quot; are aliases. However, the pairs &quot;Mrs. Washington&quot; and &quot;her&quot;, and &quot;the talk&quot; and &quot;the meeting&quot;, are not aliases.</Paragraph>
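To make the feature definitions above concrete, here is a minimal sketch of how the 10-feature vector for a markable pair i (antecedent) and j (anaphor) might be assembled. It is illustrative rather than the authors' implementation: the Markable fields, the pronoun list, and the helper tests are simplifying assumptions, and the semantic class, gender, and alias features are only stubbed, since they depend on the WORDNET mapping, designator heuristics, and alias heuristics described above.

    # Illustrative sketch of assembling the 10-feature vector for a markable
    # pair (i = candidate antecedent, j = anaphor).  The Markable fields and
    # helper tests are simplified stand-ins for the language-module output
    # described in Section 2.2; semantic-class, gender, and alias features
    # are stubbed because they need WORDNET and the heuristics in the text.
    from dataclasses import dataclass

    PRONOUNS = {"he", "him", "his", "himself", "she", "her", "hers", "herself",
                "it", "its", "itself", "they", "them", "their", "you"}
    DEMONSTRATIVES = {"this", "that", "these", "those"}

    @dataclass
    class Markable:
        tokens: list          # lower-cased tokens of the markable
        sentence: int         # index of the sentence containing it
        number: str           # "singular" or "plural" (from morphology)
        is_proper: bool       # capitalised proper name?

    def features(i: Markable, j: Markable) -> dict:
        same_string = " ".join(i.tokens) == " ".join(j.tokens)
        return {
            "distance":        j.sentence - i.sentence,              # feature 1
            "j_pronoun":       len(j.tokens) == 1 and j.tokens[0] in PRONOUNS,  # 2
            "string_match":    same_string,                          # feature 3
            "j_definite":      j.tokens[0] == "the",                 # feature 4
            "j_demonstrative": j.tokens[0] in DEMONSTRATIVES,        # feature 5
            "number_agree":    i.number == j.number,                 # feature 6
            "semclass_agree":  "unknown",  # feature 7: needs WORDNET mapping
            "gender_agree":    "unknown",  # feature 8: needs designator heuristics
            "both_proper":     i.is_proper and j.is_proper,          # feature 9
            "alias":           False,      # feature 10: needs alias heuristics
        }

    # Toy usage: "the chairman" ... "he", one sentence apart.
    ante = Markable(["the", "chairman"], sentence=0, number="singular", is_proper=False)
    ana = Markable(["he"], sentence=1, number="singular", is_proper=False)
    print(features(ante, ana))

In the full system, such vectors would be labelled positive or negative according to the pairing scheme described in Section 2.3 below and passed to the learning algorithm.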
<Section position="1" start_page="286" end_page="287" type="sub_section">
<SectionTitle> 2.3 Generating Training Examples </SectionTitle>
<Paragraph position="0"> Consider a coreference chain A1 - A2 - A3 - A4 found in an annotated training document. Only pairs of noun phrases in the chain that are immediately adjacent (i.e., A1 - A2, A2 - A3, and A3 - A4) are used to generate positive training examples. The first noun phrase in a pair is always considered the antecedent while the second is the anaphor. On the other hand, negative training examples are extracted as follows.</Paragraph>
<Paragraph position="3"> For each antecedent-anaphor pair, first obtain all markables between the antecedent and the anaphor. These markables either are not found in any coreference chain or appear in other chains. Each of them is then paired with the anaphor to form a negative example. For example, if markables a, b, B1 appear between A1 and A2, then the negative examples are a - A2, b - A2, and B1 - A2. Note that a and b do not appear in any coreference chain, while B1 appears in another coreference chain.</Paragraph>
<Paragraph position="4"> For an annotated noun phrase in a coreference chain in a training document, the same noun phrase must be identified as a markable by our pipeline of language processing modules before this noun phrase can be used to form a feature vector for use as a training example. This is because the information necessary to derive a feature vector, such as semantic class and gender, is computed by the language modules. If an annotated noun phrase is not identified as a markable, it will not contribute any training example. Note that the language modules are also needed to identify markables not already annotated in the training document so that they can be used for generating the negative examples.</Paragraph>
</Section>
<Section position="2" start_page="287" end_page="287" type="sub_section">
<SectionTitle> 2.4 Building a Classifier </SectionTitle>
<Paragraph position="0"> The next step is to use a machine learning algorithm to learn a classifier based on the feature vectors generated from the training documents.</Paragraph>
<Paragraph position="1"> The learning algorithm used in our coreference engine is C4.5 (Quinlan, 1993). C4.5 is a commonly used machine learning algorithm and thus may be considered a baseline method against which other learning algorithms can be compared.</Paragraph>
</Section>
<Section position="3" start_page="287" end_page="287" type="sub_section">
<SectionTitle> 2.5 Generating Coreference Chains for Test Documents </SectionTitle>
<Paragraph position="0"> Before determining the coreference chains for a test document, all possible markables need to be extracted from the document. Every markable is a possible anaphor, and every markable before the anaphor in document order is a possible antecedent of the anaphor, except when the anaphor is nested. If the anaphor is a child or nested markable, then its possible antecedents must not be any markable with the same root markable as the current anaphor.</Paragraph>
<Paragraph position="1"> However, the possible antecedents can be other root markables and their children that are before the anaphor in document order. For example, consider the two root markables &quot;Mr. Tom's daughter&quot; and &quot;His daughter's eyes&quot;, appearing in that order in a test document. The possible antecedents of &quot;His&quot; cannot be &quot;His daughter&quot; or &quot;His daughter's eyes&quot;, but can be &quot;Mr. Tom&quot; or &quot;Mr. Tom's daughter&quot;.</Paragraph>
<Paragraph position="2"> The coreference resolution algorithm considers every markable j, starting from the second markable in the document, to be a potential anaphor. For each j, the algorithm considers every markable before j as a potential antecedent. For each pair i and j, a feature vector is generated and given to the decision tree classifier. A co-referring antecedent is found if the classifier returns true. The algorithm starts from the immediately preceding markable and proceeds backwards, in the reverse order of the markables in the document, until there is no more markable to test or an antecedent is found.</Paragraph>
</Section>
</Section>
</Paper>