File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/06/w06-0115_evalu.xml
Size: 7,065 bytes
Last Modified: 2025-10-06 13:59:44
<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0115">
<Title>The Third International Chinese Language Processing Bakeoff: Word Segmentation and Named Entity Recognition</Title>
<Section position="6" start_page="109" end_page="115" type="evalu">
<SectionTitle> 4 Results & Discussion </SectionTitle>
<Paragraph position="0"> We report results below, first for word segmentation and second for named entity recognition.</Paragraph>
<Section position="1" start_page="109" end_page="109" type="sub_section">
<SectionTitle> 4.1 Word Segmentation Results </SectionTitle>
<Paragraph position="0"> To provide a basis for comparison, we computed baseline and topline scores for each of the corpora. The baseline was constructed by left-to-right maximal match, implemented as a Python script, using the training-corpus vocabulary. The topline employed the same procedure, but used the test-corpus vocabulary instead. These results are shown in Tables 3 and 4.</Paragraph>
<Paragraph position="1"> For the WS task, we computed the following measures using the score (Sproat and Emerson, 2003) program developed for the previous bakeoffs: recall (R), precision (P), equally weighted F-measure (F = 2PR/(P+R)), the rate of out-of-vocabulary words (OOV rate) in the test corpus, recall on OOV words (Roov), and recall on in-vocabulary words (Riv). In-vocabulary and out-of-vocabulary status are defined relative to the training corpus. Following previous bakeoffs, we employ the Central Limit Theorem for Bernoulli trials (Grinstead and Snell, 1997) to compute a 95% confidence interval as ±2√(p(1-p)/n), assuming the binomial distribution is appropriate. For the recall interval, Cr, we set p to the recall value, treating recall as the probability of correct word identification; symmetrically, for the precision interval, Cp, we set p to the precision value. Two systems may then be viewed as significantly different at the 95% confidence level if their confidence intervals do not overlap.</Paragraph>
<Paragraph position="2"> Word segmentation results for all runs grouped by corpus and track appear in Tables 5-12; all tables are sorted by F-score.</Paragraph>
</Section>
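To make the baseline and the significance test concrete, here is a minimal sketch of a left-to-right maximal-match segmenter and the Bernoulli confidence-interval half-width described above. The function names, the maximum word length, and the toy vocabulary are illustrative assumptions; this is not the organizers' actual scoring or baseline script.

# Sketch of the left-to-right maximal-match baseline and the 95% confidence
# interval of Section 4.1. Names, I/O conventions, and limits are assumptions.
from math import sqrt

def max_match(sentence, vocab, max_word_len=10):
    """Greedy left-to-right maximal match: at each position take the longest
    vocabulary word starting there; fall back to a single character."""
    words = []
    i = 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + max_word_len), i, -1):
            if sentence[i:j] in vocab or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words

def confidence_interval(p, n):
    """Half-width of the 95% interval, +-2*sqrt(p*(1-p)/n), for a proportion p
    estimated from n Bernoulli trials (here, recall or precision)."""
    return 2 * sqrt(p * (1 - p) / n)

# Toy usage: the baseline uses the training vocabulary, the topline the test vocabulary.
vocab = {"研究", "研究生", "生命", "起源"}      # toy training vocabulary
print(max_match("研究生命起源", vocab))         # greedy output: ['研究生', '命', '起源']
print(confidence_interval(0.95, 100000))        # half-width for p = 0.95, n = 100000

The toy sentence also illustrates why greedy matching falls short of the topline: the longest-match choice of 研究生 blocks the intended segmentation 研究 / 生命 / 起源.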
<Section position="2" start_page="109" end_page="113" type="sub_section">
<SectionTitle> 4.2 Word Segmentation Discussion </SectionTitle>
<Paragraph position="0"> Across all corpora, the best F-score, 0.979, was achieved on the MSRA Open Track. Overall, as would be expected, the best Open Track runs achieved higher F-scores than the best Closed Track runs on the same corpora. Likewise, the OOV recall rates of the best Open Track systems exceeded those of the best Closed Track runs on comparable corpora by exploiting outside information. Unfortunately, few sites submitted runs in both conditions, making strong direct comparisons difficult.</Paragraph>
<Paragraph position="1"> Many systems strongly outperformed the baseline runs, though none achieved the topline. The closest approach to the topline was on the CITYU corpus, where the best-performing runs achieved 99% of the topline F-score.</Paragraph>
<Paragraph position="2"> It is also informative to observe the rather wide variation in scores across the test corpora. The maximum scores were achieved on the MSRA corpus, closely followed by the CITYU corpus. The best score achieved in the UPUC Open Track condition, however, was lower than all scores but one on the MSRA Open Track. A comparison of the baselines, toplines, and especially the OOV rates may shed some light on this disparity. The UPUC training corpus was only about one-third the size of the MSRA corpus, and the OOV rate for UPUC was more than double that of any of the other corpora, yielding a challenging task, especially on the Closed Track. The high OOV rate may also be attributed to a change in register: the training data for UPUC was drawn exclusively from the Chinese Treebank, while the test data also included material from other newswire and broadcast news sources. In contrast, the MSRA corpus had both the highest baseline and the highest topline scores, possibly indicating an easier corpus in some sense. The differences in topline also suggest a greater degree of variance in the UPUC corpus, and in fact in all other corpora, relative to the MSRA corpus. These differences highlight the continuing challenges of handling out-of-vocabulary words and performing segmentation across different registers.</Paragraph>
</Section>
<Section position="3" start_page="113" end_page="113" type="sub_section">
<SectionTitle> 4.3 Named Entity Results </SectionTitle>
<Paragraph position="0"> We employed a slightly modified version of the CoNLL 2002 scoring script to evaluate NER task submissions. For each submission, we compute overall phrase precision (P), recall (R), and balanced F-measure (F), as well as the F-measure for each entity type (PER-F, ORG-F, LOC-F, GPE-F).</Paragraph>
<Paragraph position="1"> For each corpus, we compute a baseline performance level as follows. In a left-to-right pass over the test data, we assign a named entity tag to a span of characters if that span was tagged with a single unique NE tag (PER/LOC/ORG/GPE) in the training data.3 In the case of overlapping spans, we tag the maximal span. Baseline scores for all NER corpora appear in Table 13. (Footnote 3: If the span was a single character and also appeared untagged in the training corpus, we exclude it. Longer spans are retained for tagging even if they appear both tagged and untagged in the training corpus.)</Paragraph>
</Section>
<Section position="4" start_page="113" end_page="115" type="sub_section">
<SectionTitle> 4.4 Named Entity Discussion </SectionTitle>
<Paragraph position="0"> Though fewer sites participated in the NER task, overall performance was very strong, with only two runs scoring below baseline. The best F-score overall, 0.912, was achieved on the MSRA Open Track, with ten other scores on the MSRA and CITYU Open Tracks above 0.85. Only two sites submitted runs in both the Open and Closed Track conditions, and few Open Track runs were submitted at all, again limiting comparability. For MSRA, the only corpus with substantial numbers of both Open and Closed Track runs, the top three Open Track runs outperformed all Closed Track runs.</Paragraph>
<Paragraph position="1"> System scores and baselines were much higher for the CITYU and MSRA corpora than for the LDC corpus. This disparity can, in part, be attributed to the substantially smaller training corpus for the LDC collection. The presence of an additional category, geo-political entity (GPE), which can be confused with either location or organization, also increases the difficulty of this corpus. Training data requirements, variation across corpora, and more extensive tag sets will continue to pose challenges for named entity recognition.</Paragraph>
<Paragraph position="2"> Named entity recognition results for all runs grouped by corpus and track appear in Tables 14-</Paragraph>
</Section>
</Section>
</Paper>
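As an illustration of the NER baseline procedure described in Section 4.3, the sketch below is one plausible reading of it, not the organizers' implementation: spans that received a single unique NE tag in the training data are collected into a lexicon (excluding single characters that also occur untagged), and the test text is then tagged in a left-to-right pass that prefers maximal spans. All names, the input formats, and the span-length limit are assumptions.

# Sketch of the Section 4.3 NER baseline under the assumptions stated above.
from collections import defaultdict

def build_ne_lexicon(training_entities, untagged_chars):
    """training_entities: iterable of (span_text, tag) pairs observed in training.
    untagged_chars: set of single characters that also occur untagged."""
    tags = defaultdict(set)
    for span, tag in training_entities:
        tags[span].add(tag)
    lexicon = {}
    for span, tagset in tags.items():
        if len(tagset) != 1:
            continue                      # spans with more than one NE tag are not used
        if len(span) == 1 and span in untagged_chars:
            continue                      # single characters seen untagged are excluded
        lexicon[span] = next(iter(tagset))
    return lexicon

def tag_baseline(text, lexicon, max_span_len=10):
    """Left-to-right pass; at each position tag the longest span found in the lexicon."""
    entities = []                         # list of (start, end, tag) triples
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_span_len), i, -1):
            if text[i:j] in lexicon:
                entities.append((i, j, lexicon[text[i:j]]))
                i = j
                break
        else:
            i += 1                        # no entity starts at this position
    return entities

Preferring the maximal span in the left-to-right pass resolves overlapping candidate spans in the way the text describes, at the cost of missing shorter entities nested inside longer lexicon entries.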