File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/i05-3017_metho.xml
Size: 8,053 bytes
Last Modified: 2025-10-06 14:09:37
<?xml version="1.0" standalone="yes"?> <Paper uid="I05-3017"> <Title>The Second International Chinese Word Segmentation Bakeoff</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. The Academia Sinica corpus, provided </SectionTitle> <Paragraph position="0"> in Unicode (UTF-16), contained characters found in Big Five Plus that are not found in Microsoft's CP950 or standard Big Five. It also contained compatibility characters that led to transcoding errors when converting from Unicode to Big Five Plus. A detailed description of these issues can be found on the Bakeoff 2005</Paragraph> </Section> <Section position="5" start_page="0" end_page="123" type="metho"> <SectionTitle> 1 A fifth (Simplified Chinese) corpus was provided by the University of Pennsylvania, but for numerous technical reasons it </SectionTitle> <Paragraph position="0"> was not used in the evaluation. However, it has been made available (both training and truth data) on the SIGHAN website along with the other corpora.</Paragraph> <Paragraph position="1"> pages on the SIGHAN website. The data also included 11 instances of an invalid character that could not be converted to</Paragraph> </Section> <Section position="6" start_page="123" end_page="126" type="metho"> <SectionTitle> Big Five Plus. 2. The City University of Hong Kong data </SectionTitle> <Paragraph position="0"> was initially supplied in Big Five/ HKSCS. We initially converted this to Unicode but found that there were characters appearing in Unicode Ideograph Extension B, which many systems are unable to handle. City University was gracious enough to provide Unicode versions for their files with all characters in the Unicode BMP. Specific details can be found on the Bakeoff 2005 pages of the SIGHAN website.</Paragraph> <Paragraph position="1"> The truth data was provided in segmented and unsegmented form by all of the providers except Academia Sinica, who only provided the segmented truth files. These were converted to unsegmented form using a simple Perl script. Unfortunately this script also removed spaces separating non-Chinese (i.e., English) tokens. We had no expectation of correct segmentation on non-Chinese text, so the spaces were manually removed between non-Chinese text in the truth data prior to scoring.</Paragraph> <Paragraph position="2"> The Academia Sinica data separated tokens in both the training and truth data using a full-width space instead of one or more half-width (i.e., ASCII) spaces. The scoring script was modified to ignore the type of space used so that teams would not be penalized during scoring for using a different separator.</Paragraph> <Paragraph position="3"> The segmentation standard used by each provider were made available to the participants, though late in the training period. These standards are either extremely terse (MSR), verbose but in Chinese only (PKU, AS), or are verbose and moderately bilingual. The PKU corpus uses a standard derived from GB 13715, the Chinese government standard for text segmentation in computer applications. Similarly AS uses a Taiwanese national standard for segmentation in computer applications. The CityU data was segmented using the LIVAC corpus standard, and the MSR data to Microsoft's internal standard.</Paragraph> <Paragraph position="4"> The standards are available on the bakeoff web site.</Paragraph> <Paragraph position="5"> The PKU data was edited by the organizers to remove a numeric identifier from the start of each line. 
<Paragraph position="5"> The segmentation standards used by each provider were made available to the participants, though late in the training period. These standards are either extremely terse (MSR), verbose but in Chinese only (PKU, AS), or verbose and moderately bilingual. The PKU corpus uses a standard derived from GB 13715, the Chinese government standard for text segmentation in computer applications. Similarly, AS uses a Taiwanese national standard for segmentation in computer applications. The CityU data was segmented using the LIVAC corpus standard, and the MSR data according to Microsoft's internal standard. The standards are available on the bakeoff website.</Paragraph>
<Paragraph position="6"> The PKU data was edited by the organizers to remove a numeric identifier from the start of each line. Unless otherwise noted in this paper, no changes beyond transcoding were made to the data furnished by the contributors.</Paragraph> </Section>
<Section position="5" start_page="123" end_page="126" type="metho"> <SectionTitle> 2.2 Rules and Procedures </SectionTitle>
<Paragraph position="0"> The bakeoff was run almost identically to the first, described in Sproat and Emerson (2003): the detailed instructions provided to the participants are available on the bakeoff website at http://www.sighan.org/bakeoff2005/ .</Paragraph>
<Paragraph position="1"> Groups (or &quot;sites&quot;, as they were also called) interested in participating in the competition registered on the SIGHAN website. Only the primary researcher for each group was asked to register. Registration opened on June 1, 2005 and remained open until the training data was released on July 11. When a site registered, it selected which corpus or corpora it was interested in using, and whether it would take part in the open or closed tracks (described below). On July 11 the training data was made available on the Bakeoff website for downloading: the same data was used regardless of the tracks the sites registered for. The website did not allow a participant to add a corpus to the set they initially selected, though at least one site asked us via email to add one, and this was done manually. Groups were given until July 27 to train their systems, when the testing data was released on the website. They then had two days to process the test corpora and return them to the organizer via email on July 29 for scoring. Each participant's results were posted to their section of the website on August 6, and the summary results for all participants were made available to all groups on August 12.</Paragraph>
<Paragraph position="2"> Two tracks were available for each corpus, open and closed: * In the open tests participants could use any external data in addition to the training corpus to train their system. This included, but was not limited to, external lexica, character-set knowledge, part-of-speech information, etc. Sites participating in an open test were required to describe this external data in their system description.</Paragraph>
<Paragraph position="3"> * In the closed tests, participants were allowed to use only information found in the training data. Absolutely no other data or information could be used beyond that in the training documents. This included knowledge of character sets, punctuation characters, etc. These seemingly artificial restrictions (when compared to &quot;real world&quot; systems) were formulated to study exactly how far one can get without supplemental information.</Paragraph>
<Paragraph position="4"> Other obvious restrictions applied: groups could not participate using corpora that they or their organization had provided, or that they had used before or otherwise seen.</Paragraph>
<Paragraph position="5"> Sites were allowed to submit multiple runs within a track, allowing them to compare various approaches.</Paragraph>
<Paragraph position="6"> Scoring was done automatically using a combination of Perl and shell scripts. Participants were asked to submit their data using very strict naming conventions to facilitate this: in only a couple of instances were these not followed and human intervention was required. After the scoring was done, the script mailed the detailed results to the participant. The scripts used for scoring can be downloaded from the Bakeoff 2005 website. They were provided to the participants to aid in their data analysis.</Paragraph>
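What the scoring scripts compute can be summarized in a few lines: word-level precision, recall, and F-measure, where a word counts as correct only if both of its boundaries match the gold standard. The Python below is a simplified reconstruction, not the actual Perl scorer (which also produced per-line differences and other statistics); names such as `spans` and `score` are illustrative.

    def spans(words: list[str]) -> set[tuple[int, int]]:
        # Map a segmented line to the set of (start, end) character offsets.
        out, pos = set(), 0
        for w in words:
            out.add((pos, pos + len(w)))
            pos += len(w)
        return out

    def score(gold_lines: list[str], test_lines: list[str]) -> tuple[float, float, float]:
        n_gold = n_test = n_hit = 0
        for g, t in zip(gold_lines, test_lines):
            gold, test = spans(g.split()), spans(t.split())
            n_gold += len(gold)
            n_test += len(test)
            n_hit += len(gold & test)  # a word is correct iff its span matches
        recall = n_hit / n_gold
        precision = n_hit / n_test
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    # Example: splitting one gold word in two costs both precision and recall.
    p, r, f = score(["上海 浦东 开发"], ["上海 浦 东 开发"])
    assert (round(p, 3), round(r, 3)) == (0.5, 0.667)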
<Paragraph position="7"> As noted above, some of the training/truth data used a full-width space to separate tokens: the scoring script was modified to ignore the difference between full-width and half-width spaces. This is the only case where the half-width/full-width distinction was ignored: a system that converted tokens from full-width to half-width was penalized by the script.</Paragraph>
<Paragraph position="8"> Thirty-six sites representing 10 countries initially signed up for the bakeoff. The People's Republic of China had the greatest number with 17, followed by the United States (6), Hong Kong (5), Taiwan (3), and six other countries with one each. Of these, 23 submitted results for scoring and subsequently submitted a paper for these proceedings. A summary of the participating groups and the tracks for which they submitted results can be found in Table 2 on the preceding page. Altogether, 130 runs were submitted for scoring.</Paragraph> </Section> </Paper>