<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1002"> <Title>Cross-Document Coreference on a Large Scale Corpus</Title> <Section position="8" start_page="0" end_page="40" type="concl"> <SectionTitle> 7. Further Exploration </SectionTitle> <Paragraph position="0"> We conducted additional analysis to explore the issues surrounding cross-document coreferencing. We ran experiments on the John Smith corpus to explore how the amount of text used to represent an entity affects the effectiveness of the model.</Paragraph> <Section position="1" start_page="0" end_page="40" type="sub_section"> <SectionTitle> 7.1 Window size and recall/precision </SectionTitle> <Paragraph position="0"> Allan and Raghavan (2002) showed that the size of a snippet correlates inversely with the &quot;clarity&quot; (non-ambiguity) of the model: as the size of the snippet increases, the ambiguity of the model increases, presumably because the snippet is more likely to include extraneous information from the surrounding text.</Paragraph> <Paragraph position="1"> In our experiment with the John Smith corpus, we used the incremental vector space approach with a threshold of 0.1 and evaluated precision/recall using various window sizes for the snippets. Figure 6 shows the variation. The F-Measure peaks at 84.3% with a window size of 55 words; this is the window size we used for all of our other experiments.</Paragraph> </Section> <Section position="2" start_page="40" end_page="40" type="sub_section"> <SectionTitle> 7.2 Domain-specific sub-corpora </SectionTitle> <Paragraph position="0"> The person-x corpus may appear to be biased by the manner of its construction. Since the documents were selected by subject, one may argue that clustering entities is much easier when the entities clearly come from different genres. 
However, if this is true, it may explain why about 85% of the entities in the person-x corpus occur in only one domain subject.</Paragraph> <Paragraph position="1"> We hypothesized that coreferencing entities within the same genre domain is harder in terms of achieving high precision: the consistency of content across documents in the same genre domain makes it significantly harder to create a unique model for each entity that can distinguish one entity from another.</Paragraph> <Paragraph position="2"> To see how our techniques measure up against this, we reevaluated the effectiveness of our methods of cross-document coreference resolution on a modified version of the person-x corpus. We clustered the documents into their original genre domains (recall that they were created using simple information retrieval queries). We then evaluated precision/recall for each cluster and averaged the results to obtain a final precision/recall score. This eliminates the potential bias that clustering entities becomes easier when the entities are clearly from different genres. Hypothetically, it also makes the task of cross-document coreferencing more challenging than it would be on actual corpora, which are not clustered by genre. Table 2 shows the breakdown of documents and entities in each genre.</Paragraph> <Paragraph position="3"> The results of the experiments show that clustering documents by domain-specific attributes such as genre hurts cross-document coreferencing.</Paragraph> <Paragraph position="4"> The highest F-Measure achieved with the agglomerative vector space approach dropped 6% to 77%, and the incremental approach dropped a similar 5%. 
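The averaging step described above (scoring each genre cluster separately, then combining the per-cluster precision/recall into a single final score) amounts to a macro average. The sketch below is a hypothetical illustration of that step, not the authors' code; the `macro_average` function and the genre names are assumptions, and the per-genre precision/recall pairs are taken as already computed:

```python
def macro_average(genre_scores):
    """Macro-average per-genre (precision, recall) pairs into one
    precision/recall/F-measure triple, mirroring the averaging step
    described above. `genre_scores` maps genre name -> (precision, recall)."""
    n = len(genre_scores)
    precision = sum(p for p, _ in genre_scores.values()) / n
    recall = sum(r for _, r in genre_scores.values()) / n
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical per-genre scores, for illustration only.
scores = {"politics": (0.80, 0.60), "sports": (0.60, 0.80)}
p, r, f = macro_average(scores)
```

Note that this weights every genre equally regardless of how many documents or entities it contains; a document-weighted (micro-style) average would favor the larger clusters instead.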
The KL divergence approach, on the other hand, showed a modest increase of 3% to 77%, equaling the agglomerative approach.</Paragraph> <Paragraph position="5"> This may be because KL divergence relies more on global properties of the corpus, and the approach is more effective when the nearest-neighbor computation is degraded by the consistency of word distributions between documents in the same genre domain.</Paragraph> <Paragraph position="6"> 7.3 Runtime comparison.</Paragraph> <Paragraph position="7"> An important observation in our comparison of the algorithms is running time. While we have shown that the agglomerative vector space approach produced the best results in our experiments, it is also important to note that it was noticeably slower. The estimated running time for the agglomerative vector space experiment on the large corpus was approximately 3 times longer than that of the incremental vector space and KL divergence approaches. The runtimes of the incremental approaches are linear, whereas the runtime of our agglomerative vector space approach is O(n^2).</Paragraph> <Paragraph position="8"> Is the improvement in our results worth the difference in runtime? The noticeable runtime difference in our experiment is caused by the need to cluster a large number of person-x entities (34,404 entities). In reality, it would be rare to find such a large number of same-named entities across documents. In our analysis of this reasonably large corpus, fewer than 16% of entities occur more than 10 times. If the mean number of entities to be disambiguated is relatively small, there will not be a significant degradation in runtime for the agglomerative approach.</Paragraph> <Paragraph position="9"> Thus, 
our conclusion is that the tradeoff between coreference quality and runtime in our agglomerative approach is worthwhile when the number of same-named entities to be disambiguated is relatively small.</Paragraph> <Paragraph position="10"> 8. Conclusion and Future Work We were able to compare and contrast our results directly with the previous work of Bagga and Baldwin by using the same corpus and evaluation technique. To perform a careful excursion into the limited work on cross-document coreferencing, we deployed different information retrieval techniques for entity disambiguation and clustering. Our experiments show that the agglomerative vector space clustering algorithm consistently yields better precision and recall throughout most of the tests. It outperforms the incremental vector space disambiguation model and is much more stable with respect to the decision threshold. Both vector space approaches outperform KL divergence except when the entities to be clustered belong to the same genre.</Paragraph> <Paragraph position="11"> We are pleased that our snippet approach worked well on the task of cross-document coreferencing, since it was easier than running a within-document coreference analyzer first. It was also interesting to discover that previous techniques that worked well on a smaller corpus did not show the same promising recall and precision tradeoff on a larger corpus.</Paragraph> <Paragraph position="12"> We are interested in continuing these evaluations in two ways. First, colleagues of ours are working on a more realistic corpus that is not just large but also contains a much richer set of marked-up entities. We look forward to trying our techniques on that data when it becomes available. Second, we intend to extend our work to include new comparison and clustering approaches. It appears that sentence-based snippets and within-document coreference information may provide a small gain. 
The subject information also appears to have value in some cases, so we hope to determine how to use it more broadly.</Paragraph> </Section> </Section> </Paper>