<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1017"> <Title>Statistical Phrase-Based Translation</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Evaluation Framework </SectionTitle> <Paragraph position="0"> In order to compare different phrase extraction methods, we designed a uniform framework. We present a phrase translation model and decoder that works with any phrase translation table.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Model </SectionTitle> <Paragraph position="0"> The phrase translation model is based on the noisy channel model. We use Bayes rule to reformulate the translation probability for translating a foreign sentence a0 into During decoding, the foreign input sentence a0 is segmented into a sequence of a0 phrases a1a2a4a3a5 . We assume a uniform probability distribution over all possible segmentations. null Each foreign phrase a1a2a7a6 in a1a2a4a3a5 is translated into an English phrase</Paragraph> <Paragraph position="2"> direction is inverted from a modeling standpoint.</Paragraph> <Paragraph position="3"> Reordering of the English output phrases is modeled by a relative distortion probability distribution a10 a5a12a11 a6a14a13 a6 denotes the start position of the foreign phrase that was translated into the a19 th English phrase, and a15a16a6a18a17 a5 denotes the end position of the foreign phrase translated into the a5a18a19 a13a21a20 a10 th English phrase. In all our experiments, the distortion probability distribution a10 a5a23a22 a10 is trained using a joint probability model (see Section 3.3). Alternatively, we could also use a simpler distortion model a10 a5a12a11 a6 a13 a15a16a6a18a17 a5 a10 a12a25a24a27a26 a28a16a29</Paragraph> <Paragraph position="5"> appropriate value for the parameter a24 .</Paragraph> <Paragraph position="6"> In order to calibrate the output length, we introduce a factor a35 for each generated English word in addition to the trigram language model a3 LM. This is a simple means to optimize performance. Usually, this factor is larger than 1, biasing longer output.</Paragraph> <Paragraph position="7"> In summary, the best English output sentence a1 best given a foreign input sentence a0 according to our model</Paragraph> <Paragraph position="9"> For all our experiments we use the same training data, trigram language model [Seymore and Rosenfeld, 1997], and a specialized decoder.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Decoder </SectionTitle> <Paragraph position="0"> The phrase-based decoder we developed for purpose of comparing different phrase-based translation models employs a beam search algorithm, similar to the one by Jelinek [1998]. The English output sentence is generated left to right in form of partial translations (or hypotheses). null We start with an initial empty hypothesis. A new hypothesis is expanded from an existing hypothesis by the translation of a phrase as follows: A sequence of untranslated foreign words and a possible English phrase translation for them is selected. The English phrase is attached to the existing English output sequence. 
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.2 Decoder </SectionTitle>
<Paragraph position="0"> The phrase-based decoder we developed for the purpose of comparing different phrase-based translation models employs a beam search algorithm, similar to the one by Jelinek [1998]. The English output sentence is generated left to right in the form of partial translations (or hypotheses).</Paragraph>
<Paragraph position="1"> We start with an initial empty hypothesis. A new hypothesis is expanded from an existing hypothesis by the translation of a phrase as follows: a sequence of untranslated foreign words and a possible English phrase translation for them is selected. The English phrase is attached to the existing English output sequence, the foreign words are marked as translated, and the probability cost of the hypothesis is updated.</Paragraph>
<Paragraph position="2"> The cheapest (highest probability) final hypothesis with no untranslated foreign words is the output of the search.</Paragraph>
<Paragraph position="3"> The hypotheses are stored in stacks. The stack $s_m$ contains all hypotheses in which $m$ foreign words have been translated. We recombine search hypotheses as done by Och et al. [2001]. While this somewhat reduces the number of hypotheses stored in each stack, stack size is still exponential with respect to input sentence length. This makes an exhaustive search impractical.</Paragraph>
<Paragraph position="4"> Thus, we prune out weak hypotheses based on the cost they have incurred so far and a future cost estimate. For each stack, we only keep a beam of the best $n$ hypotheses. Since the future cost estimate is not perfect, this leads to search errors. Our future cost estimate takes into account the estimated phrase translation cost, but not the expected distortion cost.</Paragraph>
<Paragraph position="5"> We compute this estimate as follows: for each possible phrase translation anywhere in the sentence (we call it a translation option), we multiply its phrase translation probability with the language model probability for the generated English phrase. As the language model probability we use the unigram probability for the first word, the bigram probability for the second, and the trigram probability for all following words.</Paragraph>
<Paragraph position="6"> Given the costs for the translation options, we can compute the estimated future cost for any sequence of consecutive foreign words by dynamic programming. Note that this is only possible because we ignore distortion costs. Since there are only $n(n+1)/2$ such sequences for a foreign input sentence of length $n$, we can pre-compute these cost estimates beforehand and store them in a table. During translation, future costs for uncovered foreign words can be quickly computed by consulting this table.</Paragraph>
<Paragraph position="7"> If a hypothesis has broken sequences of untranslated foreign words, we look up the cost for each sequence and take the product of their costs.</Paragraph>
<Paragraph position="8"> The beam size, i.e., the maximum number of hypotheses in each stack, is fixed to a certain number. The number of translation options is linear in the sentence length. Hence, the time complexity of the beam search is quadratic in sentence length and linear in beam size.</Paragraph>
<Paragraph position="9"> Since the beam size limits the search space and therefore search quality, we have to find the proper trade-off between speed (low beam size) and performance (high beam size). For our experiments, a beam size of only 100 proved to be sufficient. With larger beam sizes, only a few sentences are translated differently. With our decoder, translating 1755 sentences of length 5-15 words takes about 10 minutes on a 2 GHz Linux system. In other words, we achieved fast decoding while ensuring high quality.</Paragraph>
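<Paragraph position="10"> As an illustration of the future cost table (not part of the original paper), the Python sketch below pre-computes the best estimated log score for every span of consecutive foreign words by dynamic programming, which is valid precisely because distortion is ignored. The function name, the span-indexed options dictionary, and the toy scores are our own assumptions, not the paper's implementation; higher log scores correspond to cheaper hypotheses.

import math

def future_cost_table(n, options):
    """Pre-compute estimated future log scores for all spans of a foreign sentence.

    n: foreign sentence length.
    options: dict mapping (start, end) spans (0-based, end exclusive) to the best
             log score of any translation option covering exactly that span
             (phrase translation probability times a crude language model estimate).
    Returns cost[start][end], the best estimated log score for words start..end-1.
    """
    NEG_INF = float("-inf")
    cost = [[NEG_INF] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):          # shorter spans first, so sub-spans are ready
        for start in range(n - length + 1):
            end = start + length
            best = options.get((start, end), NEG_INF)
            for mid in range(start + 1, end):
                # Split the span into two already-solved sub-spans; ignoring
                # distortion is what makes this simple recurrence possible.
                best = max(best, cost[start][mid] + cost[mid][end])
            cost[start][end] = best
    return cost

# Toy example: 3 foreign words with single-word options and one two-word phrase option.
opts = {
    (0, 1): math.log(0.5),
    (1, 2): math.log(0.4),
    (2, 3): math.log(0.6),
    (1, 3): math.log(0.3),
}
table = future_cost_table(3, opts)
print(table[0][3])  # estimated future cost of translating the whole sentence

During search, a hypothesis would then be pruned using its accumulated cost combined with the table entries for its uncovered spans, multiplying the spans' probabilities (i.e., summing their log scores), as described above.</Paragraph>
</Section>
</Section>
</Paper>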