<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1002"> <Title>Cross-Document Coreference on a Large Scale Corpus</Title> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. Corpora </SectionTitle> <Paragraph position="0"> To evaluate the effectiveness of our various techniques for cross-document coreference, we use the same &quot;John Smith&quot; corpus created by Bagga and Baldwin (1998). In addition, we created a larger, richer, and highly ambiguous corpus that we call the &quot;Person-x corpus.&quot;</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 John Smith Corpus </SectionTitle> <Paragraph position="0"> Bagga and Baldwin tested their coreference algorithm against a set of 197 articles from 1996 and 1997 editions of the New York Times, all of which refer to &quot;John Smith&quot;. All articles either contain the name John Smith or some variation with a middle name or initial. There are 35 different John Smiths mentioned in the articles. Of these, 24 refer to a unique John Smith entity that is not mentioned in any of the other 173 articles (197 minus 24).</Paragraph> <Paragraph position="1"> We present results on this corpus for comparison with past work, to show that our approximation of those algorithms is roughly as effective as the originals. The corpus also permits us to show how our additional algorithms compare on that data. However, our primary evaluation corpus is the larger corpus that we now discuss.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Person-x Corpus </SectionTitle> <Paragraph position="0"> Since the task of annotating documents is time-consuming, we used a different technique to create a large test set with different entities of the same name.</Paragraph> <Paragraph position="1"> The technique used to construct this corpus is similar to the well-known technique of creating artificial sense-tagged corpora, which are used to evaluate word sense disambiguation algorithms and are created by adding ambiguity to a corpus. In the same way, we treat the task of coreferencing multiple occurrences of &quot;John Smith&quot; as equivalent to coreferencing multiple occurrences of &quot;person-x&quot;, where the occurrences of &quot;person-x&quot; disguise multiple named entities such as &quot;George Bush&quot; and &quot;Saddam Hussein&quot;. This approach avoids the difficulty of finding a large collection of &quot;John Smith&quot; articles and of obtaining the actual coreference links between the many &quot;John Smith&quot; entities. It also allows us to create a vastly larger corpus of documents mentioning the &quot;same person.&quot; For each of the following subjects, we first obtained from 10,000 to 50,000 unique documents from the TREC 1, 2, and 3 volumes using the Inquery search engine from UMass Amherst: art, business, education, government, healthcare, movies, music, politics, religion, science, and sports.</Paragraph> <Paragraph position="2"> Then, we ran the documents through Identifinder, a named entity extraction system developed by BBN, to tag the named entities in the documents.</Paragraph> <Paragraph position="3"> Next, we selected one person entity randomly from each document.
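To make the selection step concrete, here is a minimal sketch; this is not the actual tooling, and the enamex-style tag pattern and function name are our own assumptions:

import random
import re

# Assumed Identifinder-style person markup (cf. the example below).
PERSON_TAG = re.compile(r'<enamex type="person">(.*?)</enamex>',
                        re.IGNORECASE | re.DOTALL)

def select_random_person(doc_text):
    # Collect every tagged person mention, then pick one at random.
    people = [m.strip() for m in PERSON_TAG.findall(doc_text)]
    return random.choice(people) if people else None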
Since Identifinder occasionally tags an entity incorrectly, we manually went through each selection to filter out entities that were not people's names. We also manually filtered out cases where the tagged entity is only one word (e.g., John, Alex, etc.). We then replaced the occurrences of the selected entity in each document with &quot;person-x&quot;, as in the following before and after example: In the late 1970s, the company hired producers <enamex type=&quot;person&quot;>jon peters</enamex> and <enamex type=&quot;person&quot;>peter guber</enamex> to start a studio from scratch.</Paragraph> <Paragraph position="4"> In the late 1970s, the company hired producers <enamex type=&quot;person&quot;>jon peters</enamex> and <enamex type=&quot;person&quot;>person-x</enamex> to start a studio from scratch.</Paragraph> <Paragraph position="5"> We also replaced with &quot;person-x&quot; all additional occurrences of the same name in that document, as well as names that matched except for a middle initial or middle name. For example, in the case above, other occurrences of Peter Guber, or names such as Peter X. Guber, would also be replaced by &quot;person-x&quot;.</Paragraph> <Paragraph position="6"> We now have a large set of documents, each containing a reference to &quot;Person X&quot;, and we know for each document which person entity it refers to. We then verified that identical names actually referred to the same entity; with the large number of entities, that task was potentially overwhelming. However, since the entities are categorized by domain (through the query that retrieved the document), determining the actual coreference links becomes significantly easier. In an article discussing sports, the multiple occurrences of the name &quot;Michael Chang&quot; are most likely to refer to the tennis player, and to the same tennis player.</Paragraph> <Paragraph position="7"> These mappings from &quot;Person X&quot; to the true names serve as our evaluation (true) coreference chains.</Paragraph> <Paragraph position="8"> Since we know the name that &quot;Person X&quot; replaced, we assume that if those names are identical, they refer to the same person. So all references to &quot;Person X&quot; that correspond to, say, &quot;Bill Clinton&quot; will be put into the same coreference chain.</Paragraph> <Paragraph position="9"> We manually removed documents whose Person X entity pointed to a different person than the person in its corresponding chain. Consider a scenario with four documents, three of which contain Person X entities pointing to John Smith (president of General Electric Corporation) and the fourth pointing to John Smith (the character in Pocahontas). The last document would be removed from the chain and from the corpus entirely. The final Person X corpus contains 34,404 unique documents. Hence, there are 34,404 &quot;Person X&quot;s in the corpus, and they point to 14,767 different actual entities. 15.24% of the entities occur in more than one domain subject.</Paragraph>
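Under the identical-names assumption above, the evaluation chains can be built by simple grouping. A minimal sketch, with hypothetical variable names:

from collections import defaultdict

def build_truth_chains(replaced_names):
    # replaced_names: dict mapping doc_id -> the name that "person-x" disguised.
    # Documents whose replaced names are identical form one truth chain.
    chains = defaultdict(list)
    for doc_id, name in replaced_names.items():
        chains[name.lower()].append(doc_id)
    return chains  # e.g., chains['bill clinton'] lists every Bill Clinton document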
<Paragraph position="10"> Table 1 displays the distribution of entities versus their occurrences in our corpus. Slightly over 46% of the entities occur only once among the 34,404 &quot;Person X&quot; mentions; that compares to about 12% in the John Smith corpus. Of the total of 315,415 unique entities that Identifinder recognized in the entire corpus, just under 49% occurred precisely once, so our sample appears to be representative of the larger corpus even if it does not represent how &quot;John Smith&quot; appears.</Paragraph> <Paragraph position="11"> A potential shortcoming is that name variations such as &quot;Bob Matthews&quot; versus &quot;Robert Matthews&quot; may have been missed during the construction of this corpus. However, this problem did not show up in a set of entities randomly sampled for analysis.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5. Methodology </SectionTitle> <Paragraph position="0"> In all cases, we represented the mention of an entity (i.e., an occurrence of &quot;John Smith&quot; or &quot;Person-x&quot;, depending on the corpus used) by the words around all occurrences of the entity in a document. Based on exploratory work on training data, we chose a window of 55 words centered on each mention, merged those windows, and called the result a &quot;snippet.&quot; (In many cases the snippet incorporates text from only a single occurrence of the entity, but some documents contain two or three &quot;person-x&quot; instances, and those windows are merged together.) We then employ three different methods for comparing snippets to determine whether their corresponding mentions refer to the same entity. In the remainder of this section, we describe the three methods: incremental vector space, KL divergence, and agglomerative vector space.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Incremental vector space </SectionTitle> <Paragraph position="0"> Our intent with the incremental vector space model is to approximate the work reported by Bagga and Baldwin (1998). Their system takes as input properly formatted documents and uses the University of Pennsylvania's CAMP system to perform within-document coreference resolution, doing more careful work to find additional mentions of the entity in the document. It then extracts all sentences that are relevant for each entity of interest, based on the within-document coreference chains produced by CAMP. The extracted sentences form a summary that represents the entity (in contrast to our 55-word snippets). The system then computes the similarity of that summary with each of the other summaries using the vector space model. If the computed similarity is above a predefined threshold, then the two summaries are considered to be coreferent.</Paragraph> <Paragraph position="1"> Each of the summaries was stored as a vector of terms. The similarity between two summaries S1 and S2 is computed as the cosine of the angle between their corresponding vectors. Terms are weighted by a tf-idf weight, tf*log(N/df), where tf is the number of times that a term occurs in the summary, N is the total number of documents in the collection, and df is the number of documents that contain the term.</Paragraph> <Paragraph position="2"> Because we did not have the same within-document coreference tools, we opted for a simpler variation on Bagga and Baldwin's approach. In our implementation, we represent the snippets (merged 55-word sections of text) as vectors and use this model to represent each entity. We calculated term weights and similarity in the same way, however; the only difference is the text used to represent each mention.</Paragraph>
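As a sketch of this computation (df is a precomputed document-frequency table, n_docs is the collection size N, and the helper names are our own):

import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    # Weight each term by tf * log(N / df), as described above.
    tf = Counter(tokens)
    return {t: f * math.log(n_docs / df[t]) for t, f in tf.items() if t in df}

def cosine(u, v):
    # Cosine of the angle between two sparse term vectors.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

Two snippet vectors are treated as coreferent when their cosine exceeds the predefined threshold.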
<Paragraph position="3"> For both cases, the system operates incrementally on the list of entities as follows. We first create one coreference chain containing a single entity mention (one vector). We then take the next entity vector and compare it against the entity in that chain. If the two vectors have a similarity above a predefined threshold, then they are regarded as referring to the same entity, and the new entity is added to the same chain.</Paragraph> <Paragraph position="4"> Otherwise, a new coreference chain is created for the entity.</Paragraph> <Paragraph position="5"> We continue creating chains in this incremental fashion until all of the entities have been clustered. At each step, a new entity is compared against all existing coreference chains and is added to the chain with the highest average similarity, provided that similarity is above the predefined threshold. Our implementation differs from that of Bagga and Baldwin in the following ways: Bagga and Baldwin use a single-link technique to compare an entity with the entities in a coreference chain, meaning that they include an entity in a chain as soon as they find one pairwise entity-to-entity comparison above the predefined threshold. We instead use an average-link comparison: we compare an entity to every entity in a coreference chain and use the average similarity to determine whether the entity should be included in the chain.</Paragraph> <Paragraph position="6"> They utilized the CAMP system developed by the University of Pennsylvania to resolve within-document coreference and to extract a summary for each entity. In our system, we simply extract the snippets for each entity and do not depend on within-document coreference resolution.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 KL Divergence </SectionTitle> <Paragraph position="0"> The second technique that we implemented for entity disambiguation is based on Kullback-Leibler divergence. For this technique, we represent the snippets as probability distributions over words, creating a so-called entity language model (Allan and Raghavan, 2002). The KL divergence is a classic measure of the &quot;distance&quot; between two probability distributions: the more dissimilar the distributions are, the higher the KL divergence. It is given by the equation:</Paragraph> <Paragraph position="1"> D(q || r) = Σ_x q(x) log ( q(x) / r(x) ) </Paragraph> <Paragraph position="2"> where x ranges over the entire vocabulary. The smaller the distance calculated by KL divergence, the more similar one distribution is to the other; if the distance is 0, the two distributions are identical. To deal with zero probabilities, we need some type of smoothing, and we chose the asymmetric skew divergence, which mixes one distribution with the other as determined by a parameter α (Lee, 2001): D(r || αq + (1 - α)r). Skew divergence best approximates KL divergence when α is set to a value close to 1. In our experiments, we let α = 0.9.</Paragraph> <Paragraph position="3"> We used the incremental approach of Section 5.1, but with probability distributions. Each of the distributions created (from a snippet) is evaluated against the distributions for existing coreference chains. Smaller distances computed through skew divergence indicate that the entity is similar to the entities in the chain. If the computed distance is smaller than a predefined threshold, then the new entity is added to the coreference chain and the probability distribution of the coreference chain's model is updated accordingly.</Paragraph>
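A minimal sketch of this comparison (r and q are unigram word distributions represented as dicts from word to probability; the helper name is ours):

import math

def skew_divergence(r, q, alpha=0.9):
    # D(r || alpha*q + (1 - alpha)*r); smaller values mean more similar.
    total = 0.0
    for x, rx in r.items():
        if rx > 0.0:
            mix = alpha * q.get(x, 0.0) + (1.0 - alpha) * rx
            total += rx * math.log(rx / mix)
    return total

A snippet joins the chain with the smallest divergence, provided that divergence falls below the predefined threshold.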
<Paragraph position="4"> We start with one entity in one coreference chain and continue comparing, inserting, and creating coreference chains until all of the entities have been resolved.</Paragraph> <Paragraph position="5"> Note that the KL divergence approach is modeled directly after the incremental vector space approach.</Paragraph> <Paragraph position="6"> The difference is that the vector is replaced by a probability distribution and the comparison uses divergence rather than cosine similarity.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Agglomerative vector space </SectionTitle> <Paragraph position="0"> In our explorations with the previous algorithms, we noticed that if early coreference chains contained misplaced entities, those entities attracted other entities with high similarity and &quot;polluted&quot; the coreference chain with entities that are not part of the truth chain.</Paragraph> <Paragraph position="1"> We therefore switched to an agglomerative approach that builds up the clusters in a way that is order independent. This approach is commonly known as bottom-up agglomerative clustering. It also operates in the vector space model, so we again represent the snippets as vectors.</Paragraph> <Paragraph position="2"> We first create a coreference chain containing one entity for every entity to be resolved. For each coreference chain, we then find its nearest neighbor by computing the similarity of the chain against all other chains, using the technique described in Section 5.1. If the highest similarity computed is above a predefined threshold, then we merge those two chains. If any merging was performed in an iteration, we repeat the whole process of looking for the most similar pair and merging them in the next iteration. We continue until no more merging can be done, i.e., until the highest similarity is below the threshold. A sketch of this loop appears below.</Paragraph> <Paragraph position="3"> Compared with the incremental approach of the previous sections, the agglomerative technique requires more comparisons and takes more time. On the other hand, it minimizes the problems caused by a single spurious match, and it is order independent.</Paragraph>
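A sketch of the agglomerative loop, assuming a chain_similarity function such as the average-link cosine comparison of Section 5.1 (all names here are ours):

def agglomerate(chains, chain_similarity, threshold):
    # Repeatedly merge the most similar pair of chains; stop when no
    # pair's similarity is above the threshold. Order independent.
    while True:
        best_sim, best_pair = threshold, None
        for i in range(len(chains)):
            for j in range(i + 1, len(chains)):
                sim = chain_similarity(chains[i], chains[j])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        if best_pair is None:
            return chains
        i, j = best_pair
        chains[i] = chains[i] + chains[j]  # merge the two chains into one
        del chains[j]

</Section> </Section> </Paper>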