<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3005">
  <Title>Morphological features help POS tagging of unknown words across language varieties</Title>
  <Section position="4" start_page="32" end_page="32" type="intro">
    <SectionTitle>
2 Data
</SectionTitle>
    <Paragraph position="0"> Chinese Treebank 5.0 (CTB) contains 500K words of newspaper and magazine articles annotated with segmentation, part-of-speech, and syntactic constituency information. It includes data from three major media sources: XH from the PRC, HKSAR from Hong Kong, and SM from Taiwan. In terms of genre, both XH and HKSAR focus on political and economic issues, while SM focuses more on topics such as culture, health, education, and travel. All of the files in CTB are encoded in Guo Biao (GB) and use simplified characters.</Paragraph>
    <Paragraph position="1"> We cleaned up character encoding errors in CTB before running our experiments. Taiwan and Hong Kong still use the traditional forms of characters, while the PRC mainland has adopted simplified forms of many characters, which also collapse some distinctions between characters. Additionally, a different character set encoding is standard in each region. The articles in HKSAR and SM originally used traditional characters and Big 5 encoding, but prior to inclusion in the CTB corpus they were converted into simplified characters and GB. Some errors seem to have crept into this conversion process, accidentally leaving traditional characters in place of their simplified forms, for example in the characters for "after", "for", and "what", all of which we fixed. We also normalized half-width digits, letters, and punctuation to full width. Finally, we removed the -NONE- traces left over from the CTB parse trees.</Paragraph>
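    <Paragraph position="2"> The last two cleanup steps described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names to_full_width and drop_none_traces are hypothetical, the input is assumed to be a list of (word, tag) pairs, and the empty-category tag is assumed to be spelled -NONE- as in Penn-style treebanks. The width mapping relies on the fact that the full-width forms of the printable ASCII characters U+0021 to U+007E sit at a fixed offset of 0xFEE0 in Unicode; the space character is deliberately left untouched here.

```python
# Hypothetical helpers for illustration only; names are not from the paper.

# Printable ASCII (U+0021..U+007E) maps to the full-width block
# (U+FF01..U+FF5E) by adding a constant offset of 0xFEE0.
HALF_TO_FULL = {code: code + 0xFEE0 for code in range(0x21, 0x7F)}


def to_full_width(text: str) -> str:
    """Replace half-width digits, letters, and punctuation with full-width forms."""
    return text.translate(HALF_TO_FULL)


def drop_none_traces(tagged_tokens):
    """Drop (word, tag) pairs whose tag marks an empty category (-NONE-)."""
    return [(word, tag) for word, tag in tagged_tokens if tag != "-NONE-"]


if __name__ == "__main__":
    tokens = [("2005", "CD"), ("*pro*", "-NONE-"), ("年", "NT")]
    # Remove traces first, then normalize only the word forms, not the tags.
    cleaned = [(to_full_width(word), tag) for word, tag in drop_none_traces(tokens)]
    print(cleaned)
```

Characters outside the ASCII range, including Chinese characters, pass through to_full_width unchanged, so the function can be applied to every token. Repairing leftover traditional characters is not covered by this sketch, since it depends on a character mapping that the text does not list in full.</Paragraph>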
  </Section>
</Paper>