<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1002"> <Title>Cross-Document Coreference on a Large Scale Corpus</Title> <Section position="7" start_page="0" end_page="0" type="evalu"> <SectionTitle> 6. Experiments and Results </SectionTitle> <Paragraph position="0"> To evaluate our various techniques for the task of cross-document coreferencing, we used the two test corpora mentioned in Section 4 and the three coreference approaches described in Section 5. The coreference chains are then evaluated using the B-CUBED algorithm to measure precision and recall as described in Section 2. We present the results by corpus.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 John Smith Corpus Results </SectionTitle> <Paragraph position="0"> Our main goal for the John Smith corpus is to demonstrate that we have successfully approximated the algorithm of Bagga and Baldwin (1998). Figure 1 shows how recall and precision trade off against each other as the decision threshold (should a name be put into a chain) varies in the incremental vector space approach. This graph is nearly identical to the tradeoff curves shown by Bagga and Baldwin, so we believe our variation on their approach is sufficiently accurate to draw conclusions. A key point to note about the graph is that although there is an excellent recall/precision tradeoff point, the results are not stable around that threshold. If the threshold is shifted slightly higher, recall plummets; if it is lowered slightly, precision drops off rapidly.</Paragraph> <Paragraph position="1"> Figure 2 provides an alternative view of the same information, and overlays the other algorithms on it. In this case we show a recall/precision tradeoff curve.</Paragraph> <Paragraph position="2"> Again, in all cases the tradeoff drops off rapidly, though the agglomerative vector space approach takes longer to fall from high accuracy.</Paragraph> <Paragraph position="3"> Figure 3 provides another comparison of the three approaches by highlighting how the F-measure varies algorithms on the John Smith Corpus. Results from Baldwin and Bagga (1998) are estimated and overlaid onto the graph.</Paragraph> <Paragraph position="4"> with the threshold. Note that the agglomerative vector space approach has the highest measure and has a substantially less &quot;pointed&quot; curve: it is much less sensitive to threshold selection and therefore more stable.</Paragraph> <Paragraph position="5"> The agglomerative vector space achieved a peak F measure of 88.2% in comparison to the incremental approach that peaked at 84.3% (comparable to Bagga and Baldwin's reported 84.6%). We also created a single-link version of our incremental algorithm. It achieved a peak F measure of only 81.4%, showing the advance of average link (when compared to our approach) and the advantage of using within-document coreference to find related sentences (when compared to their work).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Person-x Results </SectionTitle> <Paragraph position="0"> We next evaluated the same three algorithms on the much larger Person X corpus. The recall/precision graph in Figure 4, when compared to that in Figure 2, clearly demonstrates that the larger corpus has made the task much harder. 
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 6.2 Person-x Results </SectionTitle>
<Paragraph position="0"> We next evaluated the same three algorithms on the much larger Person-x corpus. The recall/precision graph in Figure 4, when compared to that in Figure 2, clearly demonstrates that the larger corpus has made the task much harder. However, the agglomerative vector space approach has been impacted the least and maintains excellent performance.</Paragraph>
<Paragraph position="1"> Figure 5 shows the F-measure graph. In comparison to Figure 3, all of the techniques are less sensitive to threshold selection, but the two vector space approaches are less sensitive than the KL divergence approach. It is unclear why this is, though it may reflect problems with using the skewed divergence for smoothing.</Paragraph>
</Section>
</Section>
</Paper>
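[Editorial note on the smoothing remark in 6.2: the skewed divergence is presumably Lee's skew divergence, which replaces KL(r || q) with KL(r || alpha*q + (1 - alpha)*r) so that events absent from q do not produce infinite divergence. This section does not spell out the exact formulation, so the sketch below is an assumption for illustration only; the names and the alpha value are illustrative.]

# Illustrative sketch: skew divergence as a smoothed stand-in for KL divergence.
import math

def skew_divergence(q, r, alpha=0.99):
    """Skew divergence s_alpha(q, r) = KL(r || alpha*q + (1 - alpha)*r).
    q and r are discrete distributions given as {event: probability} dicts;
    mixing a little of r into q keeps every needed probability nonzero."""
    total = 0.0
    for event, r_p in r.items():
        if r_p == 0.0:
            continue
        mixed = alpha * q.get(event, 0.0) + (1.0 - alpha) * r_p
        total += r_p * math.log(r_p / mixed)
    return total

# Tiny example: two unigram distributions over partially overlapping vocabularies.
p_doc = {"smith": 0.5, "pipe": 0.3, "plumber": 0.2}
p_chain = {"smith": 0.4, "senator": 0.4, "vote": 0.2}
print(skew_divergence(p_doc, p_chain))   # finite even though the vocabularies differ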