File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/05/i05-3017_intro.xml

Size: 1,258 bytes

Last Modified: 2025-10-06 14:02:58

<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3017">
  <Title>The Second International Chinese Word Segmentation Bakeoff</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Details of the Contest
2.1 The Corpora
</SectionTitle>
    <Paragraph position="0"> Four corpora were used in the evaluation, two each using Simplified and Traditional Chinese characters.1 The Simplified Chinese corpora were provided by Beijing University and Microsoft Research Beijing. The Traditional Chinese corpora were provided by Academia Sinica in Taiwan and the City University of Hong Kong.</Paragraph>
    <Paragraph position="1"> Each provider supplied separate training and truth data sets. Details on each corpus are provided in Table 1.</Paragraph>
    <Paragraph position="2"> With one exception, all of the corpora were provided in a single character encoding. We decided to provide all of the data in both Unicode (UTF-8 encoding) and the standard encoding used in each locale. This would allow systems that use one or the other encoding to choose appropriately while ensuring consistent transcoding across all sites. This conversion was problematic in two cases:</Paragraph>
  </Section>
</Paper>