PORTAGE: A Phrase-based Machine Translation System

... Section 4 concludes and gives pointers to future work.

2 Portage

Portage operates in three main phases: preprocessing of raw data into tokens, with translation suggestions for some words or phrases generated by rules; decoding to produce one or more translation hypotheses; and error-driven rescoring to choose the best final hypothesis. (A fourth postprocessing phase was not needed for the shared task.)

2.1 Preprocessing

Preprocessing is a necessary first step that converts raw texts in both the source and target languages into a format suitable for model training and decoding (Foster et al., 2003). For the supplied Europarl corpora, we relied on the existing segmentation and tokenization, except for French, which we adjusted slightly to bring it into line with our existing conventions (e.g., converting "l ' an" into "l' an"). For the Hansard corpus used to supplement our French-English resources (described in section 3 below), we used our own segmentation and tokenization procedures, with sentence alignment based on Moore's algorithm (Moore, 2002).

Languages with rich morphology are often problematic for statistical machine translation because the available data does not contain enough instances of all possible forms of a word to train a translation system effectively. In a language like German, new words can be formed by compounding (writing two or more words together without a space or a hyphen in between), so segmentation is a crucial preprocessing step for texts in languages such as German and Finnish. In addition to these simple operations, we also developed a rule-based component to detect numbers and dates in the source text and identify their translations in the target text. This component was developed on the Hansard corpus, and applied to the French-English texts (i.e., Europarl and Hansard), to the development data in both languages, and to the test data.

2.2 Decoding

Decoding is the central phase in SMT: a search for the hypotheses t that have the highest probability of being translations of the current source sentence s, according to a model for P(t|s). Our model for P(t|s) is a log-linear combination of four main components: one or more trigram language models, one or more phrase translation models, a distortion model, and a word-length feature. The trigram language model is implemented in the SRILM toolkit (Stolcke, 2002). The phrase-based translation model is similar to the one described in (Koehn, 2004), and relies on symmetrized IBM model 2 word alignments for phrase pair induction. The distortion model is also very similar to Koehn's, with the exception of a final cost to account for sentence endings.
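To make the combination concrete, the sketch below scores a single hypothesis under such a log-linear model. The feature names, log-domain values, and weights are invented for illustration; this is a minimal sketch of the scoring rule, not the actual Portage implementation.

def loglinear_score(log_features, weights):
    # Log-linear model: score(t | s) = sum_i lambda_i * f_i(s, t),
    # where each f_i is a log-domain feature value. The decoder's
    # beam search keeps the hypotheses that maximize this sum.
    return sum(weights[name] * log_features[name] for name in weights)

# Hypothetical log-domain feature values for one translation hypothesis.
log_features = {
    "lm": -12.4,         # trigram language model log-probability
    "phrase_tm": -8.7,   # phrase translation model log-probability
    "distortion": -3.0,  # reordering cost
    "word_count": 9.0,   # word-length feature (number of target words)
}
weights = {"lm": 1.0, "phrase_tm": 0.9, "distortion": 0.6, "word_count": -0.1}

print(loglinear_score(log_features, weights))  # about -22.93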
To set weights on the components of the log-linear model, we implemented Och's algorithm (Och, 2003). This essentially involves generating, in an iterative process, a set of nbest translation hypotheses that are representative of the entire search space for a given set of source sentences. Once this is accomplished, a variant of Powell's algorithm is used to find weights that optimize the BLEU score (Papineni et al., 2002) over these hypotheses, compared to reference translations. Unfortunately, our implementation of this algorithm converged only very slowly to a satisfactory final nbest list, so we used two different ad hoc strategies for setting weights: choosing the best values encountered during the iterations of Och's algorithm (French-English), and a grid search (all other languages).

To perform the actual translation, we used our decoder, Canoe, which implements a dynamic-programming beam search algorithm based on that of Pharaoh (Koehn, 2004). Canoe is input-output compatible with Pharaoh, with the exception of a few extensions such as the ability to decode either backwards or forwards.

2.3 Rescoring

To improve the raw output from Canoe, we used a rescoring strategy: have Canoe generate a list of nbest translations rather than just one, then reorder the list using a model trained with Och's method to optimize the BLEU score. This is identical to the final pass of the algorithm described in the previous section, except for the use of a more powerful log-linear model than would have been feasible inside the decoder. In addition to the four basic features of the initial model, our rescoring model included IBM2 model probabilities in both directions (i.e., P(s|t) and P(t|s)), and an IBM1-based feature designed to detect whether any words in one language seemed to be left without satisfactory translations in the other language. This missing-word feature was also applied in both directions.
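As an illustration of this reranking step, the sketch below picks the best hypothesis from a toy nbest list under an extended feature set. The feature vectors, their ordering, and the weights are all invented for illustration; this is a minimal sketch of nbest rescoring under assumed data structures, not the Portage implementation.

def rescore(nbest, weights):
    # Rerank an nbest list under a log-linear rescoring model and
    # return the highest-scoring translation. Each entry pairs a
    # translation string with its log-domain feature vector.
    def score(feats):
        return sum(w * f for w, f in zip(weights, feats))
    return max(nbest, key=lambda entry: score(entry[1]))[0]

# Toy 3-best list; the feature order here is hypothetical:
# [lm, phrase_tm, ibm2 P(s|t), ibm2 P(t|s),
#  missing-word s->t, missing-word t->s]
nbest = [
    ("the chamber agrees", [-11.2, -7.9, -6.1, -5.8,  0.0, -1.0]),
    ("the house agrees",   [-10.8, -8.3, -5.2, -5.5,  0.0,  0.0]),
    ("house agree the",    [-15.0, -8.0, -5.9, -6.0, -1.0,  0.0]),
]
weights = [1.0, 0.8, 0.4, 0.4, 2.0, 2.0]

print(rescore(nbest, weights))  # "the house agrees"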