<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0115">
  <Title>The Third International Chinese Language Processing Bakeoff: Word Segmentation and Named Entity Recognition</Title>
  <Section position="4" start_page="0" end_page="109" type="metho">
    <SectionTitle>
2 Details of the Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="108" type="sub_section">
      <SectionTitle>
2.1 Corpora
</SectionTitle>
      <Paragraph position="0"> Five corpora were provided for the evaluation: three in Simplified characters and two in traditional characters. The Simplified character corpora were provided by Microsoft Research Asia (MSRA) for WS and NER, by University of Pennsylvania/University of Colorado (UPUC) for WS, and by the Linguistic Data Consortium (LDC) for NER. The Traditional character corpora were provided by City University of Hong Kong (CITYU) for WS and NER and by the Chinese Knowledge Information Processing Laboratory (CKIP) of the Academia Sinica, Taiwan for WS. Each data provider offered separate training and test corpora.</Paragraph>
      <Paragraph position="1"> General information for each corpus appears in  All data providers were requested to supply the training and test corpora in both the standard local encoding and in Unicode (UTF-8) in a standard XML format with sentence and word tags, and named entity tags if appropriate. For  all providers except the LDC, missing encodings were transcoded by the organizers using the appropriate Python CJK codecs.</Paragraph>
      <Paragraph position="2"> Primary training and truth data for word segmentation were generated by the organizers via a Python script by replacing sentence end tags with newlines and word end tags with a single whitespace character, deleting all other tags and associated newlines. For test data, end of sentence tags were replaced with newlines and all other tags removed. Since the UPUC truth corpus was only provided in white-space separated form, test data was created by automatically deleting line-internal whitespace.</Paragraph>
      <Paragraph position="3"> Primary training and truth data for named entity recognition were converted from the provided XML format to a two-column format similar to that used in the CoNLL 2002 NER task(Sang, 2002) adapted for Chinese, where the first column is the current character and the second column the corresponding tag. Format details may be found at the bakeoff website (http://www.sighan.org/bakeoff2006/).</Paragraph>
      <Paragraph position="4"> For consistency, we tagged only &amp;quot;&lt;NAMEX&gt;&amp;quot; mentions, of either (PER)SON, (LOC)ATION, (ORG)ANIZATION, or (G)EO-(P)OLITICAL (E)NTITY as annotated in the corpora.1 Test was generated as above.</Paragraph>
      <Paragraph position="5"> The LDC required sites to download training data from their website directly in the ACE2 evaluation format, restricted to &amp;quot;NAM&amp;quot; mentions. The organizers provided the sites with a Python script to convert the LDC data to the CoNLL format above, and the same script was used to create the truth data. Test data was created by splitting on newlines or Chinese period characters.</Paragraph>
      <Paragraph position="6"> Comparable XML format data was also provided for all corpora and both tasks.</Paragraph>
      <Paragraph position="7"> The segmentation and NER annotation standard, as appropriate, for each corpus was made  available on the bakeoff website. As observed in previous evaluations, these documents varied widely in length, detail, and presentation language. null Except as noted above, no additional changes were made to the data furnished by the providers.</Paragraph>
    </Section>
    <Section position="2" start_page="108" end_page="109" type="sub_section">
      <SectionTitle>
2.2 Rules and Procedures
</SectionTitle>
      <Paragraph position="0"> The Third Bakeoff followed the structure of the first two word segmentation bakeoffs. Participating groups (&amp;quot;sites&amp;quot;) registered by email form; only the primary contact was required to register, identifying the corpora and tasks of interest. Training data was released for download from the websites (both SIGHAN and LDC) on April 17, 2006.</Paragraph>
      <Paragraph position="1"> Test data was released on May 15, 2006 and results were due 14:00 GMT on May 17. Scores for all submitted runs were emailed to the individual groups by May 19, and were made available to all groups on a web page a few days later.</Paragraph>
      <Paragraph position="2"> Groups could participate in either or both of two tracks for each task and corpus: * In the open track, participants could use any external data they chose in addition to the provided training data. Such data could include external lexica, name lists, gazetteers, part-of-speech taggers, etc. Groups were required to specify this information in their system descriptions.</Paragraph>
      <Paragraph position="3"> * In the closed track, participants could only use information found in the provided training data. Information such as externally obtained word counts, part of speech information, or name lists was excluded.</Paragraph>
      <Paragraph position="4"> Groups were required to submit fully automatic runs and were prohibited from testing on corpora which they had previously used.</Paragraph>
      <Paragraph position="5"> Scoring was performed automatically using a combination of Python and Perl scripts, facilitated by stringent file naming conventions. In cases  where naming errors or minor divergences from required file formats arose, a mix of manual intervention and automatic conversion was employed to enable scoring. The primary scoring scripts were made available to participants for followup experiments.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="109" end_page="109" type="metho">
    <SectionTitle>
3 Participating Sites
</SectionTitle>
    <Paragraph position="0"> A total of 36 sites registered, and 29 submitted results for scoring. The greatest number of participants came from the People's Republic of China (11), followed by Taiwan (7), the United States (5), Japan (2), with one team each from Singapore, Korea, Hong Kong, and Canada. A summary of participating groups with task and track information appears in Table 2. A total of 144 official runs were scored: 101 for word segmentation and 43 for named entity recognition.</Paragraph>
  </Section>
class="xml-element"></Paper>