Partitioning Parallel Documents Using Binary Segmentation

1 Introduction

Current statistical machine translation systems use bilingual sentences to train the parameters of the translation models. Exploiting more bilingual sentences automatically and accurately, and making use of these data under limited computational resources, have become crucial problems.

The conventional method for producing parallel sentences is to break the documents into sentences and to align these sentences using dynamic programming. Previous investigations can be found in (Gale and Church, 1993) and (Ma, 2006). A disadvantage is that only monotone sentence alignments are allowed.

Another approach is the binary segmentation method described in (Simard and Langlais, 2003), (Xu et al., 2005) and (Deng et al., 2006), which recursively separates a long sentence pair into two sub-pairs (a toy sketch of this procedure is given below). Binary reordering in the alignment is allowed, but the segmentation decision is only locally optimal in each recursion step.

Hence, a combination of both methods is expected to produce a more satisfying result. (Deng et al., 2006) perform a two-stage procedure: the documents are first aligned at the sentence level using dynamic programming, and the initial alignments are then refined into shorter segments using binary segmentation. On the Chinese-English FBIS training corpus, however, the alignment accuracy and recall are lower than with Champollion (Ma, 2006).

We refine the model of (Xu et al., 2005) using a log-linear combination of different feature functions and combine it with the approach of (Ma, 2006). The corpora produced by the two approaches are concatenated, and each corpus is assigned a weight. During the training of the word alignment models, the counts of the lexicon entries are linearly interpolated using these corpus weights. In experiments on the Chinese-English FBIS corpus, the translation performance improves by 0.4% BLEU compared to using Champollion alone.
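As a concrete illustration of the recursive splitting described above, here is a minimal sketch in Python. It is not the implementation used in this work; the scoring function split_score (for instance, an IBM-1 based segment score) and the length threshold max_len are assumptions introduced for the example.

    def binary_segment(src, tgt, split_score, max_len=25):
        """Recursively split a sentence pair (src, tgt) into sub-pairs.

        src, tgt: lists of tokens. Returns a list of (src_seg, tgt_seg) pairs.
        split_score: hypothetical function scoring a candidate segment pair.
        """
        # Stop when both sides are short enough or no further split is possible.
        if (len(src) <= max_len and len(tgt) <= max_len) \
                or len(src) < 2 or len(tgt) < 2:
            return [(src, tgt)]
        best = None
        # Consider every split point on both sides, in monotone and
        # inverted (reordered) orientation.
        for j in range(1, len(src)):
            for i in range(1, len(tgt)):
                for inverted in (False, True):
                    t1, t2 = (tgt[i:], tgt[:i]) if inverted else (tgt[:i], tgt[i:])
                    score = split_score(src[:j], t1) + split_score(src[j:], t2)
                    if best is None or score > best[0]:
                        best = (score, src[:j], t1, src[j:], t2)
        _, s1, t1, s2, t2 = best
        # As noted above, the decision is only locally optimal: we commit
        # to the best split at this level and recurse into both halves.
        return (binary_segment(s1, t1, split_score, max_len)
                + binary_segment(s2, t2, split_score, max_len))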
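The corpus weighting just described can be sketched in the same spirit. The count tables and function names below are hypothetical and only illustrate the linear interpolation of lexicon counts with corpus weights; they do not reflect the internals of the actual alignment training.

    from collections import defaultdict

    def interpolate_counts(counts_a, counts_b, weight_a, weight_b):
        """Linearly interpolate two lexicon count tables.

        counts_a, counts_b: dicts mapping (f, e) pairs to fractional counts,
        collected from the two corpora. Returns the weighted sum
        weight_a * counts_a + weight_b * counts_b.
        """
        combined = defaultdict(float)
        for (f, e), c in counts_a.items():
            combined[(f, e)] += weight_a * c
        for (f, e), c in counts_b.items():
            combined[(f, e)] += weight_b * c
        return combined

    def normalize(counts):
        """Re-estimate p(f|e) by normalizing the combined counts over each e."""
        totals = defaultdict(float)
        for (f, e), c in counts.items():
            totals[e] += c
        return {(f, e): c / totals[e] for (f, e), c in counts.items()}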
The remainder of this paper is structured as follows: First, we briefly review the baseline statistical machine translation system in Section 2. Then, in Section 3, we describe the refined binary segmentation method. In Section 4.1, we introduce the methods to extract bilingual sentences from document-aligned texts. The experimental results are presented in Section 4.

2 Review of the Baseline Statistical Machine Translation System

In this section, we briefly review our translation system and introduce the word alignment models.

In statistical machine translation, we are given a source language sentence $f_1^J = f_1 \ldots f_j \ldots f_J$, which is to be translated into a target language sentence $e_1^I = e_1 \ldots e_i \ldots e_I$. Among all possible target language sentences, we choose the sentence with the highest probability:

$$\hat{e}_1^I = \operatorname*{argmax}_{e_1^I} \Pr(e_1^I \mid f_1^J) = \operatorname*{argmax}_{e_1^I} \left\{ \Pr(e_1^I) \cdot \Pr(f_1^J \mid e_1^I) \right\} \qquad (1)$$

The decomposition into two knowledge sources in Equation 1 allows independent modeling of the target language model $\Pr(e_1^I)$ and the translation model $\Pr(f_1^J \mid e_1^I)$.[1] The translation model can be further extended to a statistical alignment model with the following equation:

$$\Pr(f_1^J \mid e_1^I) = \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_1^I) \qquad (2)$$

The alignment model $\Pr(f_1^J, a_1^J \mid e_1^I)$ introduces a 'hidden' word alignment $a = a_1^J$, which describes a mapping from a source position $j$ to a target position $a_j$.

[1] The notational convention is as follows: we use the symbol $\Pr(\cdot)$ to denote general probability distributions with (nearly) no specific assumptions. In contrast, for model-based probability distributions, we use the generic symbol $p(\cdot)$.

The IBM model 1 (IBM-1) (Brown et al., 1993) assumes that all alignments have the same probability by using a uniform distribution:

$$p(f_1^J \mid e_1^I) = \frac{1}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} p(f_j \mid e_i) \qquad (3)$$

We use the IBM-1 to train the lexicon parameters $p(f \mid e)$; the training software is GIZA++ (Och and Ney, 2003). A toy sketch of this estimation is given at the end of this section.

To incorporate context into the translation model, the phrase-based translation approach (Zens et al., 2005) is applied: pairs of source and target language phrases are extracted from the bilingual training corpus, and a beam search algorithm is used to generate the translation hypothesis with maximum probability.
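To make the phrase extraction step concrete, the following is a minimal sketch of the widely used procedure that extracts all phrase pairs consistent with a word alignment. The function name and the max_len limit are our assumptions, and extensions to unaligned boundary words are omitted; (Zens et al., 2005) describe the actual approach.

    def extract_phrases(src, tgt, alignment, max_len=7):
        """Extract phrase pairs consistent with a word alignment.

        alignment: set of (j, i) links between source position j and
        target position i (0-based). Returns a set of phrase pairs.
        """
        phrases = set()
        for j1 in range(len(src)):
            for j2 in range(j1, min(j1 + max_len, len(src))):
                # Target positions linked to the source span [j1, j2].
                linked = [i for (j, i) in alignment if j1 <= j <= j2]
                if not linked:
                    continue
                i1, i2 = min(linked), max(linked)
                if i2 - i1 + 1 > max_len:
                    continue
                # Consistency: no link from inside the target span may
                # point outside the source span.
                if any(i1 <= i <= i2 and not (j1 <= j <= j2)
                       for (j, i) in alignment):
                    continue
                phrases.add((tuple(src[j1:j2 + 1]), tuple(tgt[i1:i2 + 1])))
        return phrases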
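Finally, the toy sketch announced above: a bare-bones EM estimation of the IBM-1 lexicon probabilities in Equation 3. GIZA++ implements this training far more efficiently and with additional models; this fragment only illustrates how posterior link counts are collected and renormalized.

    from collections import defaultdict

    def train_ibm1(corpus, iterations=5):
        """Toy EM estimation of IBM-1 lexicon probabilities p(f|e).

        corpus: list of (src_tokens, tgt_tokens) sentence pairs. A NULL
        target word is added so that source words may stay unaligned.
        """
        t = defaultdict(lambda: 1.0)  # effectively uniform at the start
        for _ in range(iterations):
            counts = defaultdict(float)
            totals = defaultdict(float)
            for src, tgt in corpus:
                tgt = ["NULL"] + list(tgt)
                for f in src:
                    norm = sum(t[(f, e)] for e in tgt)
                    for e in tgt:
                        c = t[(f, e)] / norm  # posterior of the link f-e
                        counts[(f, e)] += c
                        totals[e] += c
            # M-step: renormalize the collected counts per target word.
            t = defaultdict(float, {(f, e): c / totals[e]
                                    for (f, e), c in counts.items()})
        return t

Under these assumptions, the returned table t plays the role of $p(f \mid e)$ in Equation 3; the count interpolation sketched in Section 1 would be applied to the counts collected inside this loop when two corpora with different weights are combined.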