<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0206">
  <Title>Data Selection in Semi-supervised Learning for Name Tagging</Title>
  <Section position="8" start_page="52" end_page="53" type="evalu">
    <SectionTitle>
6 Evaluation Results and Discussions
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="52" end_page="52" type="sub_section">
      <SectionTitle>
6.1 Data
</SectionTitle>
      <Paragraph position="0"> We evaluated our system on two languages: English and Chinese. Table 2 shows the data used in our experiments.</Paragraph>
      <Paragraph position="1">  For the experiments reported here, sentences were selected if AveCoref &gt; 3.1 (or 3.1xnumber of documents for cross-document coreference) or the sentence margin exceeded the margin threshold.</Paragraph>
      <Paragraph position="2"> We present in section 6.2 - 6.4 the overall performance of precision (P), recall (R) and F-measure (F) for both languages, and also some diagnostic experiment results. For significance testing (using the sign test), we split the test set into 5 folders, 20 texts in each folder of English, and 18 texts in each folder of Chinese.</Paragraph>
    </Section>
    <Section position="2" start_page="52" end_page="53" type="sub_section">
      <SectionTitle>
6.2 Overall Performance
</SectionTitle>
      <Paragraph position="0"> by applying the two semi-supervised learning methods, separately and in combination, to our baseline name tagger.</Paragraph>
      <Paragraph position="1">  For English, the overall system achieves a 13.4% relative reduction on the spurious and incorrect tags, and 12.9% reduction in the missing rate. For Chinese, it achieves a 16.9% relative reduction on the spurious and incorrect tags, and 16.9% reduction in the missing rate.</Paragraph>
      <Paragraph position="2">  For each of the five folders, we found that both bootstrapping and self-training produced an improvement in F score for each folder, and the combination of two methods is always better than each method alone. This allows us to reject the hypothesis that these  Only names which exactly match the key in both extent and type are counted as correct; unlike MUC scoring, no partial credit is given.</Paragraph>
      <Paragraph position="3">  The performance achieved should be considered in light of human performance on this task. The ACE keys used for the evaluations were obtained by dual annotation and adjudication. A single annotator, evaluated against the key, scored F=93.6% to 94.1% for English and 92.5% to 92.7% for Chinese. A second key, created independently by dual annotation and adjudication for a small amount of the English data, scored F=96.5% against the original key.  improvements were random at a 95% confidence level.</Paragraph>
    </Section>
    <Section position="3" start_page="53" end_page="53" type="sub_section">
      <SectionTitle>
6.3 Analysis of Bootstrapping
6.3.1 Impact of Data Size
</SectionTitle>
      <Paragraph position="0"> We can see some flattening of the gain at the end, particularly for the larger English corpus, and that some segments do not help to boost the performance (reflected as dips in the Dev Set curve and gaps in the Test Set curve).</Paragraph>
    </Section>
    <Section position="4" start_page="53" end_page="53" type="sub_section">
      <SectionTitle>
6.3.2 Impact of Data Selection
</SectionTitle>
      <Paragraph position="0"> In order to investigate the contribution of document selection in bootstrapping, we performed diagnostic experiments for Chinese, whose results are shown in Table 5. All the bootstrapping tests (rows 2 - 4) use margin for sentence selection; row 4 augments this with the selection methods described in sections 5.4.2 and 5.4.3.</Paragraph>
      <Paragraph position="1">  Comparing row 2 with row 3, we find that not using document selection, even though it multiplies the size of the corpus, results in 0.3% lower performance (0.3-0.4% loss for each folder). This leads us to conclude that simply relying upon large corpora is not in itself sufficient. Effective use of large corpora demands good confidence measures for document selection to remove off-topic material. By adding sentence selection (results in row 4) the system obtained 0.5% further improvement in F-Measure (0.4-0.7% for each folder). All improvements are statistically significant at the 95% confidence level.</Paragraph>
    </Section>
    <Section position="5" start_page="53" end_page="53" type="sub_section">
      <SectionTitle>
6.4 Analysis of Self-training
</SectionTitle>
      <Paragraph position="0"> We have applied and evaluated different measures to extract high-confidence sentences in selftraining. The contributions of these confidence measures to F-Measure are presented in Table 6.</Paragraph>
      <Paragraph position="1">  It shows that Chinese benefits more from adding name coreference, mainly because there are more coreference links between name abbreviations and full names. And we also can see that the margin is an important measure for both languages. All differences are statistically significant at the 95% confidence level except for the gain using cross-document information for the Chinese name tagging.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>