<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1801">
  <Title>Extraction of Translation Unit from Chinese-English Parallel Corpora</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Preprocessing the corpus
</SectionTitle>
    <Paragraph position="0"> The Hong Kong legal documents were collected from the Internet. The corpus is composed of laws and amendments issued by the Hong Kong Special Administrative Region (HKSAR) during.</Paragraph>
    <Paragraph position="1"> All the texts are available in both Chinese and English. We selected about 6 million words in each language (6,833,762 Chinese words and 6,391,919 English words).</Paragraph>
    <Paragraph position="2"> All the Chinese texts in the corpus are encoded in Big-5. Since our Chinese tools can only handle GB-encoded text, we first converted all the Chinese texts from Big-5 to GB. The corpus was then aligned with a length-based sentence aligner. Because the legal documents are already well organized section by section, sentence alignment is much easier and its precision is high. The Chinese texts were then segmented and POS-tagged with a program developed by the Institute of Computational Linguistics, Peking University. The English texts were tokenized, lemmatized, and POS-tagged with a freely available tree-based tagger. Two tag sets were used: the ICL/PKU tag set for the Chinese texts and the UPENN tag set for the English texts. Figure 2 shows a sample of the corpus after preprocessing.</Paragraph>
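    The encoding conversion step described above can be illustrated with a minimal Python sketch. It is not the converter actually used for this corpus: the function name big5_to_gb is hypothetical, and GB18030 is assumed as the target because the text does not say which GB standard was used. The sketch only transcodes bytes; it does not map traditional character forms to simplified ones, which the original tools may have required.

```python
def big5_to_gb(data: bytes, target: str = "gb18030") -> bytes:
    """Transcode Big-5 bytes to a GB-family encoding (hypothetical helper)."""
    # Decode the Big-5 bytes into Unicode text first.
    text = data.decode("big5")
    # Re-encode in the target GB encoding. GB18030 is an assumption: it can
    # represent traditional characters, whereas plain GB2312 cannot encode
    # many of them and would raise UnicodeEncodeError.
    return text.encode(target)


if __name__ == "__main__":
    sample = "中文".encode("big5")
    print(big5_to_gb(sample).decode("gb18030"))
```

    Passing target="gb2312" would make the function fail loudly on any character outside that smaller set, which may be a useful check when the downstream tools accept only GB2312.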
    <Paragraph position="3"> In Figure 2, both corpora are arranged one token per line. The XML-like tag &lt;s&gt; marks the start of a sentence. The single-letter tags to the right of the Chinese tokens are their part-of-speech tags. The two columns to the right of the English tokens are the part-of-speech tags and lemmas.</Paragraph>
  </Section>
</Paper>