<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3118"> <Title>PORTAGE: with Smoothed Phrase Tables and Segment Choice Models</Title> <Section position="3" start_page="0" end_page="135" type="intro"> <SectionTitle> 2 Portage </SectionTitle>
<Paragraph position="0"> Because this is the second participation of Portage in such a shared task, a description of the base system can be found elsewhere (Sadat et al., 2005). Briefly, Portage is a research vehicle and development prototype system exploiting the state of the art in statistical machine translation (SMT). It uses a custom-built decoder followed by a rescoring module that adjusts weights based on a number of features defined on the source sentence. We will devote this space to discussing the changes made since the 2005 shared task.</Paragraph>
<Section position="1" start_page="0" end_page="134" type="sub_section"> <SectionTitle> 2.1 Phrase-Table Smoothing </SectionTitle>
<Paragraph position="0"> Phrase-based SMT relies on conditional distributions p(s|t) and p(t|s) that are derived from the joint frequencies c(s,t) of source/target phrase pairs observed in an aligned parallel corpus. Traditionally, relative-frequency estimation is used to derive the conditional distributions, i.e. p(s|t) = c(s,t) / sum_s c(s,t). However, relative-frequency estimation has the well-known problem of favouring rare events. For instance, any phrase pair whose constituents occur only once in the corpus will be assigned a probability of 1, almost certainly higher than the probabilities of pairs for which much more evidence exists.</Paragraph>
<Paragraph position="1"> During translation, rare pairs can directly compete with overlapping frequent pairs, so overestimating their probabilities can significantly degrade performance. To address this problem, we implemented two simple smoothing strategies. The first is based on the Good-Turing technique as described in (Church and Gale, 1991). This replaces each observed joint frequency c with c_g = (c + 1) n_{c+1} / n_c, where n_c is the number of distinct pairs with frequency c (smoothed for large c). It also assigns a total count mass of n_1 to unseen pairs, which we distributed in proportion to the frequency of each conditioning phrase. The resulting estimates are:</Paragraph>
<Paragraph position="2"> p_g(s|t) = c_g(s,t) / (sum_s c_g(s,t) + p(t) n_1), where p(t) = c(t) / sum_t c(t);</Paragraph>
<Paragraph position="3"> p_g(t|s) is analogous.</Paragraph>
<Paragraph position="4"> The second strategy is Kneser-Ney smoothing (Kneser and Ney, 1995), using the interpolated variant described in (Chen and Goodman, 1998) [1]:</Paragraph>
<Paragraph position="5"> p_k(s|t) = max(c(s,t) - D, 0) / sum_s c(s,t) + (D n_{1+}(*,t) / sum_s c(s,t)) p_k(s),</Paragraph>
<Paragraph position="6"> where D = n_1 / (n_1 + 2 n_2), n_{1+}(*,t) is the number of distinct phrases s with which t co-occurs, and p_k(s) = n_{1+}(s,*) / sum_s n_{1+}(s,*), with n_{1+}(s,*) analogous to n_{1+}(*,t). [1: As for Good-Turing smoothing, this formula applies only to pairs (s,t) for which c(s,t) > 0, since these are the only ones considered by the decoder.]</Paragraph>
<Paragraph position="7"> Our approach to phrase-table smoothing contrasts with previous work (Zens and Ney, 2004), in which smoothed phrase probabilities are constructed from word-pair probabilities and combined in a log-linear model with an unsmoothed phrase table. We believe the two approaches are complementary, so a combination of both would be worth exploring in future work.</Paragraph> </Section>
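To make the two estimators above concrete, here is a small Python sketch (an illustration, not the Portage implementation: the function names and toy data are invented, integer counts are assumed, and falling back to the raw count when n_{c+1} = 0 stands in for the unspecified "smoothed for large c" step). It computes Good-Turing and interpolated Kneser-Ney estimates of p(s|t) from a dictionary of joint phrase-pair counts, following the formulas of Section 2.1.

from collections import Counter, defaultdict

def good_turing_smooth(pair_counts):
    """Good-Turing-smoothed estimates p_g(s|t) for observed phrase pairs.

    pair_counts maps (source_phrase, target_phrase) -> integer joint count c(s,t).
    Unseen pairs receive a total mass of n_1, spread over conditioning phrases t
    in proportion to p(t); only observed pairs are returned, since only those
    are ever proposed by the decoder.
    """
    n = Counter(pair_counts.values())  # n[c] = number of distinct pairs with count c

    def adjusted(c):
        # c_g = (c + 1) n_{c+1} / n_c; fall back to the raw count when n_{c+1}
        # or n_c is zero (a crude stand-in for "smoothed for large c").
        return (c + 1) * n[c + 1] / n[c] if n[c] and n[c + 1] else float(c)

    t_count = defaultdict(float)
    for (s, t), c in pair_counts.items():
        t_count[t] += c
    total = sum(t_count.values())
    p_t = {t: ct / total for t, ct in t_count.items()}  # p(t) = c(t) / sum_t c(t)

    cg = {(s, t): adjusted(c) for (s, t), c in pair_counts.items()}
    denom = defaultdict(float)  # sum_s c_g(s,t), per conditioning phrase t
    for (s, t), c in cg.items():
        denom[t] += c
    n1 = n[1]
    return {(s, t): cg[(s, t)] / (denom[t] + p_t[t] * n1) for (s, t) in pair_counts}


def kneser_ney_smooth(pair_counts):
    """Interpolated Kneser-Ney estimates p_k(s|t) for observed phrase pairs."""
    n = Counter(pair_counts.values())
    n1, n2 = n[1], n[2]
    D = n1 / (n1 + 2 * n2) if (n1 + 2 * n2) else 0.5  # guard for degenerate toy data

    c_t = defaultdict(float)      # sum_s c(s,t)
    n1p_dot_t = defaultdict(int)  # n_{1+}(*,t): distinct s seen with t
    n1p_s_dot = defaultdict(int)  # n_{1+}(s,*): distinct t seen with s
    for (s, t), c in pair_counts.items():
        c_t[t] += c
        n1p_dot_t[t] += 1
        n1p_s_dot[s] += 1
    total_types = sum(n1p_s_dot.values())
    p_s = {s: k / total_types for s, k in n1p_s_dot.items()}  # lower-order distribution

    return {(s, t): (max(c - D, 0.0) + D * n1p_dot_t[t] * p_s[s]) / c_t[t]
            for (s, t), c in pair_counts.items()}

# Toy usage with invented counts.
pairs = {("la maison", "the house"): 3, ("maison", "house"): 2, ("maison bleue", "blue house"): 1}
print(good_turing_smooth(pairs))
print(kneser_ney_smooth(pairs))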
<Section position="2" start_page="134" end_page="134" type="sub_section"> <SectionTitle> 2.2 Feature-Rich DT-based distortion </SectionTitle>
<Paragraph position="0"> In a recent paper (Kuhn et al., 2006), we presented a new class of probabilistic "Segment Choice Models" (SCMs) for distortion in phrase-based systems. In some situations, SCMs will assign a better distortion score to a drastic reordering of the source sentence than to no reordering; in this, SCMs differ from conventional penalty-based distortion, which always favours less rather than more distortion.</Paragraph>
<Paragraph position="1"> We developed a particular kind of SCM based on decision trees (DTs) containing both questions of a positional type (e.g., questions about the distance of a given phrase from the beginning of the source sentence or from the previously translated phrase) and word-based questions (e.g., questions about the presence or absence of given words in a specified phrase).</Paragraph>
<Paragraph position="2"> The DTs are grown on a corpus consisting of segment-aligned bilingual sentence pairs.</Paragraph>
<Paragraph position="3"> This segment-aligned corpus is obtained by training a phrase translation model on a large bilingual corpus and then using it (in conjunction with a distortion penalty) to carry out alignments between the phrases in the source-language sentence and those in the corresponding target-language sentence in a second bilingual corpus. Typically, the first corpus (on which the phrase translation model is trained) is the same as the second corpus (on which alignment is carried out). To avoid overfitting, the alignment algorithm is leave-one-out: statistics derived from a particular sentence pair are not used to align that sentence pair.</Paragraph>
<Paragraph position="4"> Note that the experiments reported in (Kuhn et al., 2006) focused on translation of Chinese into English. The interest of the experiments reported here on WMT data was to see whether the feature-rich DT-based distortion model could be useful for MT between other language pairs.</Paragraph>
<Paragraph position="5"> 3 Application to the Shared Task: Methods</Paragraph> </Section>
<Section position="3" start_page="134" end_page="135" type="sub_section"> <SectionTitle> 3.1 Restricted Resource Exercise </SectionTitle>
<Paragraph position="0"> The first exercise was to replicate the conditions of 2005 as closely as possible, to see the effects of one year of research and development.</Paragraph>
<Paragraph position="1"> The second exercise was to repeat all three of these translation exercises using the 2006 language model, and to add the three exercises of translating out of English into French, Spanish, and German. This was our baseline for the other studies. A third exercise involved modifying the generation of the phrase-table to incorporate our Good-Turing smoothing; all six language pairs were re-processed with these phrase-tables. The improvement in the results on the devtest set was compelling, and this became the baseline for further work. A fourth exercise involved replacing penalty-based distortion modelling with the feature-rich decision-tree-based distortion modelling described above. A fifth exercise involved the use of Kneser-Ney phrase-table smoothing as an alternative to Good-Turing. For all of these exercises, 1-best results after decoding were calculated, as well as rescoring results on 1000-best lists using 12 feature functions (13 in the case of decision-tree-based distortion modelling). The results submitted for the shared task were the results of the third and fourth exercises, with rescoring applied.</Paragraph> </Section>
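As a concrete illustration of the rescoring step described above, the following sketch (illustrative only: the hypothesis strings, feature values, and weights are invented, and the actual system uses 12 or 13 feature functions with weights tuned on a development set) reranks an n-best list by the dot product of a weight vector with each hypothesis's feature vector and returns the new 1-best.

from typing import List, Sequence

def rescore_nbest(hypotheses: List[str],
                  feature_vectors: List[Sequence[float]],
                  weights: Sequence[float]) -> str:
    """Rerank an n-best list with a log-linear model.

    Each hypothesis i carries a feature vector h_i (e.g. log-probabilities of
    the various models); its rescored score is the dot product w . h_i, and
    the new 1-best is simply the argmax over the list.
    """
    assert len(hypotheses) == len(feature_vectors)
    scores = [sum(w * h for w, h in zip(weights, feats)) for feats in feature_vectors]
    best = max(range(len(hypotheses)), key=scores.__getitem__)
    return hypotheses[best]

# Toy usage: 3 hypotheses, 2 features (say, a translation-model and a
# language-model log-probability); values and weights are made up.
print(rescore_nbest(["hyp a", "hyp b", "hyp c"],
                    [(-10.2, -7.1), (-9.8, -8.0), (-11.0, -6.5)],
                    (1.0, 0.8)))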
<Section position="4" start_page="135" end_page="135" type="sub_section"> <SectionTitle> 3.2 Open Resource Exercise </SectionTitle>
<Paragraph position="0"> Our goal in this exercise was to conduct a comparative study using additional training data for the French-English shared task. Results of WPT 2005 showed an improvement of at least 0.3 BLEU points when exploiting different resources for the French-English language pair. In addition to the training resources used in WPT 2005 for the French-English task, i.e. Europarl and Hansard, we used a bilingual dictionary, Le Grand Dictionnaire Terminologique (GDT), to train translation models, and the English side of the UN parallel corpus (LDC2004E13) to train an English language model. Integrating terminological lexicons into a statistical machine translation engine is not a straightforward operation, since we cannot expect them to come with attached probabilities. The approach we took consists in viewing all translation candidates of each source term or phrase as equiprobable (Sadat et al., 2006).</Paragraph>
<Paragraph position="1"> In total, the data used in this second part of our contribution to WMT 2006 is described as follows: (1) A set of 688,031 sentences in French and English extracted from the Europarl parallel corpus.</Paragraph>
<Paragraph position="3"> (2) A set of 6,056,014 sentences in French and English extracted from the Hansard parallel corpus, the official record of Canada's parliamentary debates. (3) A set of 701,709 sentences in French and English extracted from the bilingual dictionary GDT. (4) Language models were trained on the French and English parts of the Europarl and Hansard corpora. We used the provided Europarl corpus while omitting data from Q4/2000 (October-December), since it is reserved for development and test data. (5) An additional English language model was trained on 128 million words of the UN parallel corpus.</Paragraph>
<Paragraph position="4"> For the supplied Europarl corpora, we relied on the existing segmentation and tokenization, except for French, which we manipulated slightly to bring it into line with our existing conventions (e.g., converting l ' an into l' an, aujourd ' hui into aujourd'hui). For the Hansard corpus used to supplement our French-English resources, we used our own alignment and tokenization procedures. English preprocessing simply included lower-casing, separating punctuation from words, and splitting off 's.</Paragraph> </Section> </Section> </Paper>