<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1079">
  <Title>Generating Overview Summaries of Ongoing Email Thread Discussions</Title>
  <Section position="7" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
6 Evaluation of Issue Detection Algorithms
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 The Test Data
</SectionTitle>
      <Paragraph position="0"> The test data used was a portion of the Columbia ACM Student Chapter corpus. This corpus included a total of 300 threads which were constructed using message-ID information found in the header. On average, there were 190 words per thread and 6.9 sentences in the first email.</Paragraph>
      <Paragraph position="1"> Threads longer than two emails2 were categorized manually. We identified discussions that supported a decision-making process. For these, we manually annotated the issue of the thread and the responses to the issue. Although we do not currently use this information, we also classified the responses as being either in agreement or disagreement. According to the assumptions listed in Section 4, we discarded those threads in which the issue was not found in the first email. In total, we identified 37 discussion 2 Longer threads offered a great chance of identifying a discussion.</Paragraph>
      <Paragraph position="2"> threads, each of which forms a test case. A manual annotation of the discussion issues was done by following the instruction: Select the sentence from the first email that subsequent emails are responding to.&amp;quot; These annotated issue sentences formed our gold standard.</Paragraph>
      <Paragraph position="3"> Our approach was designed to operate on the new textual contributions of each participant.</Paragraph>
      <Paragraph position="4"> Thus, the emails underwent a limited preprocessing stage. Email headers, automatically embedded &amp;quot;reply context&amp;quot; text and static signatures were ignored.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Evaluation Framework and Results
</SectionTitle>
      <Paragraph position="0"> The evaluation was designed to test if our methods which use dialogue structure improve sentence extraction results. We used the recall-precision metric to compare the results of a system with the manually annotated gold standard. In total, we tested 6 variations of our issue detection algorithms. These included the Centroid method, the SVD Centroid method and the SVD Key Sentence method and the 3 oracles.</Paragraph>
      <Paragraph position="1"> For each test case, the approach being tested was used to extract one or more sentences corresponding to the issue of the discussion, which was then compared to the gold standard. The baseline used was the first n sentences of the first email as a summary, where n ranged from 1 to 3 sentences.</Paragraph>
      <Paragraph position="2"> The recall-precision results of the evaluation are presented in Table 1. On average, the chance of correctly choosing the correct sentence randomly in a test set was 21.9%.</Paragraph>
      <Paragraph position="3"> We used an ANOVA to test whether there was an overall effect between the various methods for recall and precision. We rejected the null hypothesis, that is, the choice of method does affect recall and precision (a=0.05, dfnumerator= 8, dfdenoinator= 324).</Paragraph>
      <Paragraph position="4"> To determine if our techniques were statistically significant compared to the baselines, we ran pair-wise two-tailed student t-tests to compare the three methods and the first oracle to the n=1 baseline since these all returned a single sentence. The results are presented in Table 2. Similarly, Table 3 shows the t-test comparisons for the oracle and oracle baseline against the n=3 baseline.</Paragraph>
      <Paragraph position="5"> Except for the SVD Key Sentence method, all the methods were significantly better than the n=1 baseline. However, a useful recall score was only obtained using the oracle methods. When comparing the oracle methods which returned more than one sentence against the n=3 baseline, we found no significant difference in recall.</Paragraph>
      <Paragraph position="6"> However, when comparing precision performance we found that the difference between the precision of Centroid method and the three oracles were significantly different compared to the baseline.  method to the n=1 baseline (df = 36). The values show the probability of the obtained t value.</Paragraph>
      <Paragraph position="7">  method to the n=3 baseline (df = 36). The values show the probability of the obtained t value.</Paragraph>
      <Paragraph position="8"> The recall and precision statistics for the Centroid method was the most impressive of the three methods proposed, far outperforming the baseline. The results of comparisons involving the oracles, which combine the three methods, showed improved performance, suggesting that such techniques might potentially be useful in an email thread summary. Whilst there was little difference between the recall values of the three oracles and the baselines, the benefit of using a more involved approach such as ours is demonstrated clearly by the gain in precision performance which will impact the usefulness of such a summary. It is also interesting to note that the performance of the oracles was achieved by simply using simple rules without any corpus training.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>