<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1211">
  <Title>Constructing of a Large-Scale Chinese-English Parallel Corpus</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Resource Collection
</SectionTitle>
    <Paragraph position="0"> Unlike monolingual resources, parallel resources for any particular language pair are limited. Although Chinese and English are both among the most widely used languages in the world, we still encountered considerable difficulty in obtaining parallel corpus resources from the Internet, for the following reasons: (1) few web sites in China provide the same content in both English and Chinese pages; (2) English news on the web is translated freely rather than literally, with much content omitted; (3) some bilingual texts are restricted to members only.</Paragraph>
    <Paragraph position="1"> After two years of effort, we have collected about 16,000 KB of untagged Chinese-English parallel text in total. The genres of the collected resources are shown in Table 3.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 General Principles
</SectionTitle>
      <Paragraph position="0"> The coding of the parallel corpus is in broad agreement with the TEI Guidelines for electronic texts. The Extensible Markup Language (XML) is used for the text coding. Textual features are marked by tags enclosed within angle brackets.</Paragraph>
      <Paragraph position="1"> For example, a title is marked by a start tag &lt;title&gt; and an end tag &lt;/title&gt;. Every element has attributes that identify it.</Paragraph>
      <Paragraph position="2"> The document type definition (DTD) for the texts in the corpus may differ in some respects from the TEI model. The general principles for coding are based on the following considerations: (1) comply with the TEI guidelines on the whole; (2) define tags with clear meanings that are familiar to most users in China; (3) use only attributes that can be obtained easily and automatically from the source texts, except for the alignment link, which is the key attribute in this corpus and for which several steps are used to maintain high precision (see Section 4 for details); (4) keep all interim resources in hand in case information is lost, for example the title tag in HTML files.</Paragraph>
      <Paragraph position="3"> The overall structure of a text in the Chinese-English parallel corpus is shown by this example:  There are two main parts in a text: a header and the main text. Every text has a unique identifier, the article ID, in this case UH001 (indicating text 001 of the Unix Handbook).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 The header
</SectionTitle>
      <Paragraph position="0"> Each text is described by a header, which has four parts in accordance with the TEI guidelines: a file description, an encoding description, a profile description, and a revision description.</Paragraph>
      <Paragraph position="1"> The file description gives bibliographical information on the source text. Its elements include the title, the author, the WWW address (if the text was obtained from the Internet), etc. The encoding description in our corpus is very brief: only the project name and the DTD file name are listed.</Paragraph>
      <Paragraph position="2"> The country or region where the language is used is indicated in the profile description. The description under &lt;language&gt; in our corpus uses labels such as Mainland Chinese (MaC), Hong Kong Chinese (HKC), Taiwan Chinese (TwC), Singapore Chinese (SiC), American English (AmE), British English (BrE), Canadian English (CaE), etc.</Paragraph>
      <Paragraph position="3"> Another tag used in the profile description is &lt;textclass&gt;. Based on the parallel resources in hand, the texts are grouped into four genres (as shown in Table 3): News, Literature, Science &amp; Technical, and Government Report.</Paragraph>
      <Paragraph position="4"> The revision description lists a series of changes, specifying for each one the date of the change, the person responsible for it, and its nature.</Paragraph>
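      <Paragraph position="5"> As an illustration only, the four-part header described above could be assembled along the following lines in Python. This is a minimal sketch under stated assumptions: the function build_header is hypothetical, the element names fileDesc, encodingDesc, profileDesc and revisionDesc are borrowed from the TEI header and may not match the corpus DTD, and the project, DTD and author values are placeholders. Only the &lt;language&gt; and &lt;textclass&gt; tags and the MaC label come from the text.

```python
import xml.etree.ElementTree as ET


def build_header(title, author, language, text_class):
    """Build a TEI-style header with the four parts named in the text.

    All element names except 'language' and 'textclass' are assumptions.
    """
    header = ET.Element("header")

    # File description: bibliographical information on the source text.
    file_desc = ET.SubElement(header, "fileDesc")
    ET.SubElement(file_desc, "title").text = title
    ET.SubElement(file_desc, "author").text = author

    # Encoding description: only project name and DTD file name (placeholders).
    encoding_desc = ET.SubElement(header, "encodingDesc")
    ET.SubElement(encoding_desc, "project").text = "placeholder project name"
    ET.SubElement(encoding_desc, "dtd").text = "placeholder.dtd"

    # Profile description: language variety label and genre.
    profile_desc = ET.SubElement(header, "profileDesc")
    ET.SubElement(profile_desc, "language").text = language
    ET.SubElement(profile_desc, "textclass").text = text_class

    # Revision description: left empty here; change records would be appended.
    ET.SubElement(header, "revisionDesc")
    return header


header = build_header("Unix Handbook", "placeholder author", "MaC", "News")
print(ET.tostring(header, encoding="unicode"))
```
</Paragraph>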
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Text Units and Alignment Unit
</SectionTitle>
      <Paragraph position="0"> The corpus texts are segmented according to natural units such as chapter, paragraph, sentence (S-unit), and word. English words are simply delimited by spaces, as in ordinary written text. Chinese words are not delimited by spaces, in order to avoid segmentation errors.</Paragraph>
      <Paragraph position="1"> An ID is given to every paragraph to indicate its relative position in the whole chapter. A sentence is called an S-unit, following Johansson, Ebeling and Oksefjell (1999), to underline that these units are not necessarily sentences in a grammatical sense. The sentence alignment type between a Chinese S-unit and an English S-unit may be 1:1, 2:1, 3:1, 1:2, 1:3, 2:2, 3:2, or 2:3. Links between parallel texts are shown by attributes of S-Alignment.</Paragraph>
      <Paragraph position="2"> One Chinese alignment unit (which may span more than one S-unit) is linked with the corresponding English alignment unit.</Paragraph>
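      <Paragraph position="3"> One possible in-memory representation of such a link is sketched below in Python. It is an assumption-laden illustration rather than the corpus format: the class AlignmentUnit, its field names, and the S-unit identifiers are hypothetical, and only the set of alignment types is taken from the text.

```python
from dataclasses import dataclass

# Alignment types listed in the text, as (Chinese S-units, English S-units).
ALLOWED_TYPES = {(1, 1), (2, 1), (3, 1), (1, 2), (1, 3), (2, 2), (3, 2), (2, 3)}


@dataclass(frozen=True)
class AlignmentUnit:
    """A link between Chinese S-units and the corresponding English S-units."""
    chinese_ids: tuple
    english_ids: tuple

    @property
    def link_type(self):
        # Number of S-units on each side, e.g. (2, 1) for a 2:1 link.
        return (len(self.chinese_ids), len(self.english_ids))

    def is_valid(self):
        # True only for the alignment types enumerated in the text.
        return self.link_type in ALLOWED_TYPES


# Hypothetical identifiers: two Chinese S-units linked to one English S-unit.
unit = AlignmentUnit(chinese_ids=("c.p3.s1", "c.p3.s2"), english_ids=("e.p3.s1",))
print(unit.link_type, unit.is_valid())
```
</Paragraph>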
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Sample Text
</SectionTitle>
      <Paragraph position="0"> A sample text of our Chinese-English parallel corpus is shown in Figure 1.</Paragraph>
      <Paragraph position="1"> [Figure 1: Sample Text]</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Sentence Alignment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Algorithm Overview
</SectionTitle>
      <Paragraph position="0"> The key attribute in this corpus is the alignment link, which connects one or more Chinese sentences with one or more corresponding English sentences.</Paragraph>
      <Paragraph position="1"> To maintain high precision in sentence alignment, several steps are used that combine human and computer effort.</Paragraph>
      <Paragraph position="2"> The first step in extracting structural information from a parallel corpus is paragraph alignment and sentence alignment, that is, noting which paragraph and sentence in one language correspond to which paragraph and sentence in the other language.</Paragraph>
      <Paragraph position="3"> This problem has been studied by many researchers, and a number of quite encouraging results have been reported. However, almost all bilingual corpora used in this research are clean (nearly without sentence omission or insertion) and literally translated. Performance tends to deteriorate significantly when these approaches are applied to noisy, complex corpora (with sentence omissions or insertions and less literal translation).</Paragraph>
      <Paragraph position="4"> There are basically three kinds of approaches to sentence alignment: the length-based approach (Gale &amp; Church 1991 and Brown et al. 1991), the lexical approach (Kay &amp; Roscheisen 1993), and combinations of the two (Chen 1993, Wu 1994, Langlais 1998, etc.).</Paragraph>
      <Paragraph position="5"> The first published algorithms for aligning sentences in parallel texts were the length-based approaches proposed by Gale &amp; Church (1991) and Brown et al. (1991). Based on the observation that short sentences tend to be translated as short sentences and long sentences as long sentences, they calculate the most likely sentence correspondences as a function of the relative lengths of the candidates. The basic approach of Brown et al. is similar to that of Gale and Church, but compares sentence lengths in words rather than characters. Although the idea is simple, the models can be quite effective when applied to clean, literally translated corpora. However, once the algorithm has accidentally misaligned a pair of sentences, it tends to be unable to correct itself and get back on track before the end of the paragraph. Used alone, length-based alignment algorithms are therefore neither very robust nor reliable.</Paragraph>
      <Paragraph position="6"> Kay &amp; Roscheisen (1993) use a partial alignment of lexical items to induce a maximum-likelihood alignment at the sentence level. The method is reliable but time-consuming.</Paragraph>
      <Paragraph position="7"> Chen (1993) combines the length-based and lexicon-based approaches.</Paragraph>
      <Paragraph position="8"> A translation model is used to estimate the cost of a given alignment, and the best alignment is found by dynamic programming, as in the length-based method. The method is robust, fast enough to be practical, and more accurate than previous methods.</Paragraph>
      <Paragraph position="9"> The first sentence alignment model applied to English-Chinese bilingual texts was proposed by Wu (1994). Because English and Chinese lack cognates, he used lexical cues to increase the robustness of his model.</Paragraph>
      <Paragraph position="10"> All of these methods were tested on nearly clean, literally translated bilingual corpora.</Paragraph>
      <Paragraph position="11"> Few papers address paragraph alignment. Most researchers believe that paragraph alignment is an easier task than sentence alignment. Gale &amp; Church (1991) suggest that the same length-based algorithm can also be used to align paragraphs.</Paragraph>
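      <Paragraph position="12"> The length-based idea surveyed in this section can be illustrated with a short Python sketch. It is a simplified dynamic-programming aligner in the spirit of Gale &amp; Church (1991), not the aligner used to build this corpus. The bead priors, the length ratio C and the variance S2 are illustrative defaults modelled on the values usually quoted for their European-language experiments and would need re-estimation for Chinese-English; the function names and the sentence lengths in the usage lines are invented.

```python
import math

# Bead types (source sentences, target sentences) with assumed prior probabilities.
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C = 1.0    # assumed expected number of target characters per source character
S2 = 6.8   # assumed variance of the length difference per character


def bead_cost(len_src, len_tgt, prior):
    """Negative log-probability of aligning two spans with the given lengths."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-tailed probability of a length deviation at least this large.
    prob = max(math.erfc(abs(delta) / math.sqrt(2.0)), 1e-300)
    return -math.log(prob) - math.log(prior)


def align(src_lens, tgt_lens):
    """Return the minimum-cost list of (source indices, target indices) beads."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), prior in PRIORS.items():
                if di > i or dj > j or cost[i - di][j - dj] == inf:
                    continue
                candidate = cost[i - di][j - dj] + bead_cost(
                    sum(src_lens[i - di:i]), sum(tgt_lens[j - dj:j]), prior)
                if cost[i][j] > candidate:
                    cost[i][j] = candidate
                    back[i][j] = (di, dj)
    # Follow the back-pointers from the end to recover the bead sequence.
    beads, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return beads[::-1]


# Invented sentence lengths in characters: the second source sentence is
# about as long as the second and third target sentences together.
beads = align([20, 45, 12], [21, 22, 23, 12])
print(beads)
```

 Because each bead is scored from lengths alone, a single wrong bead shifts every later comparison, which is the weakness of purely length-based alignment noted above.</Paragraph>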
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 The Alignment Steps
</SectionTitle>
      <Paragraph position="0"> The sentence alignment algorithm of our system can be outlined as follows:</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Computer-Aided Checking
</SectionTitle>
      <Paragraph position="0"> It is obviously difficult to greatly increase the accuracy and robustness of sentence alignment using the length-based approach alone, so a lexicon checking process is added to our system. The alignment results obtained by the length-based approach are checked against an English-Chinese lexicon. A score S_A is given to every aligned sentence pair. S_A is calculated as twice the number of correctly matched English-Chinese word pairs divided by the sum of the numbers of English and Chinese words. Figure 2 shows the interface for human checking, which is used to process noisy Chinese-English parallel resources.</Paragraph>
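      <Paragraph position="1"> Read as a formula, the score described above is S_A = 2 * M / (N_E + N_C), where M is the number of matched word pairs and N_E and N_C are the numbers of English and Chinese words. The following Python sketch shows one way this could be computed. It rests on assumptions not stated in the text: the Chinese side is taken to be already split into words for scoring, a match means that some lexicon translation of an English word occurs among the Chinese words, each Chinese word may be matched only once, and the function name and toy lexicon are hypothetical.

```python
def alignment_score(english_words, chinese_words, lexicon):
    """Return 2 * matched / (number of English words + number of Chinese words)."""
    remaining = list(chinese_words)  # Chinese words not yet matched
    matched = 0
    for word in english_words:
        for translation in lexicon.get(word.lower(), ()):
            if translation in remaining:
                remaining.remove(translation)  # use each Chinese word once
                matched += 1
                break
    total = len(english_words) + len(chinese_words)
    return 2.0 * matched / total if total else 0.0


# Toy lexicon and sentence pair, invented for illustration.
lexicon = {"open": ["打开"], "file": ["文件"]}
score = alignment_score(["Open", "the", "file"], ["打开", "文件"], lexicon)
print(score)
```

 Under these assumptions, a pair whose score falls below some threshold could be flagged for the human checking step; the text does not state which threshold, if any, is used.</Paragraph>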
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Experiment Results
</SectionTitle>
      <Paragraph position="0"> We tested our alignment algorithm on part of a computer handbook (the SCO Unix handbook). After noisy figures and tables were filtered out, this part contains about 4,681 English sentences and 4,430 Chinese sentences. The detailed results of automatic sentence alignment are shown in Table 4. The overall precision is about 95%.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Bilingual Concordance Design
</SectionTitle>
    <Paragraph position="0"> We also designed a bilingual concordance tool for discovering facts about translation between Chinese and English. Besides listing keywords with the contexts in which they appear, the tool also presents the corresponding translation sentence. The options may include bilingual concordances, sorting in a variety of orders, and producing basic text statistics. The intended interface is shown in</Paragraph>
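    <Paragraph position="1"> The core of such a concordance can be sketched in a few lines of Python. This is a hypothetical illustration, not the designed tool: the function name, the context width, the choice to sort by right-hand context, and the sample sentence pairs are all assumptions, and the keyword is matched as a plain case-insensitive substring of the English side.

```python
def concordance(aligned_pairs, keyword, width=20):
    """List keyword hits with left and right context and the aligned translation."""
    hits = []
    needle = keyword.lower()
    for english, chinese in aligned_pairs:
        lowered = english.lower()
        start = lowered.find(needle)
        while start != -1:
            end = start + len(needle)
            left = english[max(0, start - width):start]
            right = english[end:end + width]
            hits.append((left, english[start:end], right, chinese))
            start = lowered.find(needle, end)
    # One of many possible orders: sort by the context to the right of the keyword.
    return sorted(hits, key=lambda hit: hit[2].lower())


# Invented aligned sentence pairs for illustration.
pairs = [
    ("Open the file before editing.", "编辑之前先打开文件。"),
    ("The file system is mounted at boot.", "文件系统在启动时挂载。"),
]
for left, key, right, translation in concordance(pairs, "file"):
    print(left.rjust(20), key, right.ljust(20), translation)
```
</Paragraph>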
  </Section>
</Paper>