<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1314">
  <Title>Word Alignment of English-Chinese Bilingual Corpus Based on Chunks</Title>
  <Section position="3" start_page="110" end_page="110" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> There are basically two kinds of approaches on word alignment: the statistical-based approaches (Brown et. al., 1990; Gale &amp; Church, 1991; Dagan et. al. 1993; Chang, 1994), and the lexicon-based approaches (Ker &amp; Chang, 1997; Wang et. al., 1999).</Paragraph>
    <Paragraph position="1"> Several translation models based on word alignment are built by Brown et al. (1990) in order to implement the English-French statistical machine translation. The probabilities, such as translation probability, fertility probability, distortion probability, are estimated by EM algorithm. The Z 2 measure is used by Gale &amp; Church (1991) to align partial words.</Paragraph>
    <Paragraph position="2"> Dagan (1993) uses an improved Brown model to align the words for texts including OCR noise.</Paragraph>
    <Paragraph position="3"> They first align word partially by character string matching. Then use the translation model to align words. Chang (1994) uses the POS probability rather than translation probability in Brown model to align the English-Chinese POS tagged corpus. Ker &amp; Chang (1997) propose an approach to align Chinese English corpus based on semantic class. There are two semantic classes are used in their model. One is the semantic class of Longman lexicon of contemporary English, the other is synonymy Chinese dictionary. The semantic class rules of translation between Chinese and English are extracted from large-scale training corpus. Then Chinese and English words are aligned by these rules. Wang (1999) also uses the lexicons to align the Chinese English bilingual corpus. His model is based on bilingual lexicon, sense similarity and location distortion probability.</Paragraph>
    <Paragraph position="4"> The statistical-based approaches need complex training and are sensitive to training data. It's a pity that almost no linguistic knowledge is used in these approaches. The lexicon-based approaches seem simplify the word alignment problem and can't obtain much translation information above word level. To combine these two approaches in a better way is the direction in near future. In this paper we proposed a method to align the bilingual corpus base on chunks. The linguistic knowledge such as POS tag and Chunk tag are used in a simply statistical model.</Paragraph>
  </Section>
class="xml-element"></Paper>