<?xml version="1.0" standalone="yes"?>
<Paper uid="W95-0114">
  <Title>Compiling Bilingual Lexicon Entries From a Non-Parallel English-Chinese Corpus</Title>
  <Section position="3" start_page="173" end_page="173" type="intro">
    <SectionTitle>
2 A Non-parallel Corpus of Chinese and English
</SectionTitle>
    <Paragraph position="0"> We use parts of the HKUST English-Chinese Bilingual Corpora for our experiments (Wu 1994), which consist of transcriptions of the Hong Kong Legislative Council debates in both English and Chinese. We use the data from 1988-1992, taking the first 73618 sentences from the English text and the next 73618 sentences from the Chinese text. No sentences overlap between the two texts. The topics of these debates vary but are to some extent confined to the same domain, namely the political and social issues of Hong Kong. Although we select the same number of sentences from each language, there are 22147 unique words in the English text and only 7942 unique words in the Chinese text.</Paragraph>
  </Section>
</Paper>