File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/n06-1004_metho.xml
Size: 14,439 bytes
Last Modified: 2025-10-06 14:10:06
<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1004"> <Title>Segment Choice Models: Feature-Rich Models for Global Distortion in Statistical Machine Translation</Title> <Section position="5" start_page="26" end_page="27" type="metho"> <SectionTitle> 3 A Trainable Decision Tree SCM </SectionTitle> <Paragraph position="0"> Almost any machine learning technique could be used to create a trainable SCM. We implemented one based on decision trees (DTs), not because DTs necessarily yield the best results but for software engineering reasons: DTs are a quick way to explore a variety of features, and they are easily interpreted once grown (so that examining them can suggest further features). We grew N DTs, each handling a particular number of choices available at a given moment. The highest-numbered DT has a &quot;+&quot; in its name to show that it handles N+1 or more choices. E.g., if we set N=4, we grow a &quot;2-choice&quot; tree, a &quot;3-choice&quot; tree, a &quot;4-choice&quot; tree, and a &quot;5+-choice&quot; tree. The 2-choice tree handles cases where there are 2 segments in the RS, assigning a probability to each; the 3-choice tree handles cases where there are 3 segments in the RS, etc. The 5+-choice tree is different from the others: it handles cases where there are exactly 5 segments in the RS to choose from, and also cases where there are more than 5. The value of N is arbitrary; e.g., for N=8, the trees go from &quot;2-choice&quot; up to &quot;9+-choice&quot;.</Paragraph> <Paragraph position="1"> Suppose a left-to-right decoder with an N=4 SCM is translating a sentence with seven phrases.</Paragraph> <Paragraph position="2"> Initially, when the DSH is empty, the 5+-choice tree assigns probabilities to each of these seven phrases. The decoder will use the 5+-choice tree twice more, to assign probabilities to the six remaining RS segments, then to the five remaining ones. To extend the hypothesis further, it will then use the 4-choice tree, the 3-choice tree, and finally the 2-choice tree. Disperps for this SCM are calculated on test-corpus DSHs in the same left-to-right way, using the tree for the number of choices in the RS to find the probability of each segment choice.</Paragraph> <Paragraph position="3"> Segments need labels, so that the N-choice DT can assign probabilities to the N segments in the RS.</Paragraph> <Paragraph position="4"> We currently use a &quot;following&quot; labeling scheme. Let X be the original source position of the last word put into the DSH, plus 1. In Figure 2, this was word 7, so X=8. In our scheme, the RS segment whose first word is closest to X is labeled &quot;A&quot;; the second-closest segment is labeled &quot;B&quot;, etc. Thus, segments are labeled in order of the (Koehn, 2004) penalty; the &quot;A&quot; segment gets the lowest penalty. A tie between a segment on the right of X and one on the left is broken by labeling the right-hand segment first. In Figure 2, the labels for the RS are &quot;A&quot; = [8 9], &quot;B&quot; = [6], &quot;C&quot; = [4], &quot;D&quot; = [2 3].
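As a concrete illustration of the labeling scheme just described, here is a minimal Python sketch that reproduces the Figure 2 example and evaluates one position feature of the kind used in the tree-growing questions below; the segment representation and function names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the "following" labeling scheme: segments are lists of original
# source word positions, and X is the source position of the last word placed
# in the DSH plus 1. Labels A, B, C, ... go to segments whose first word is
# closest to X, with ties broken in favour of the segment to the right of X.
import string

def label_rs(rs_segments, x):
    def key(seg):
        first = seg[0]
        return (abs(first - x), 0 if first >= x else 1)  # right-of-X wins ties
    ordered = sorted(rs_segments, key=key)
    return {string.ascii_uppercase[i]: seg for i, seg in enumerate(ordered)}

def pos_feature(labels, label, x):
    """Illustrative position feature: pos(label) - pos(X)."""
    return labels[label][0] - x

# Figure 2 example: the last DSH word was source word 7, so X = 8.
labels = label_rs([[2, 3], [4], [6], [8, 9]], x=8)
print(labels)                            # {'A': [8, 9], 'B': [6], 'C': [4], 'D': [2, 3]}
print(pos_feature(labels, "A", 8) == 0)  # the question "pos(A)-pos(X)=0?" is answered yes
```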
[Figure 3. Some question types for choice DTs: 1. Position Questions; 2. Word-Based Questions, e.g., &quot;and ∈ DSH?&quot;, &quot;November ∈ B?&quot;.] Figure 3 shows the main types of questions used for tree-growing, comprising position questions and word-based questions. Position questions pertain to the location, length, and ordering of segments. Some position questions ask about the distance between the first word of a segment and the &quot;following&quot; position X: e.g., if the answer to &quot;pos(A)-pos(X)=0?&quot; is yes, then segment A comes immediately after the last DSH segment in the source, and is thus highly likely to be chosen.</Paragraph> <Paragraph position="5"> There are also questions relating to the &quot;leftmost&quot; and &quot;parallel&quot; predictors (above, sec. 2.2). The fseg() and bseg() functions count segments in the RS from left to right and from right to left respectively, allowing, e.g., a question asking whether a given segment is the second-last segment in the RS. The only word-based questions currently implemented ask whether a given word is contained in a given segment (or anywhere in the DSH, or anywhere in the RS). This type could be made richer by allowing questions about the position of a given word in a given segment, questions about syntax, etc.</Paragraph> <Paragraph position="6"> Figure 4 shows an example of a 5+-choice DT.</Paragraph> <Paragraph position="7"> The &quot;+&quot; in its name indicates that it will handle cases where there are 5 or more segments in the RS. The counts stored in the leaves of this DT represent the number of training data items that ended up there; the counts are used to estimate probabilities. Some smoothing will be done to avoid zero probabilities, e.g., for class C in node 3.</Paragraph> <Paragraph position="8"> For &quot;+&quot; DTs, the label closest to the end of the alphabet (&quot;E&quot; in Figure 4) stands for a class that can include more than one segment. E.g., if this DT is applied when there are seven segments in the RS, the segment closest to X is labeled &quot;A&quot;, the next closest &quot;B&quot;, the third closest &quot;C&quot;, and the fourth closest &quot;D&quot;. That leaves 3 segments, all labeled &quot;E&quot;. The DT shown yields the probability Pr(E) that one of these three will be chosen. Currently, we apply a uniform distribution within this &quot;furthest from X&quot; class, so the probability of any one of the three &quot;E&quot; segments is estimated as Pr(E)/3.</Paragraph> <Paragraph position="9"> To train the DTs, we generate data items from the second-pass DSH corpus. Each DSH generates several data items. E.g., moving across a seven-segment DSH from left to right, there is an example of the seven-choice case, then one of the six-choice case, etc. Thus, this DSH provides three items for training the 5+-choice DT and one item each for training the 4-choice, 3-choice, and 2-choice DTs. The DT training method was based on Gelfand-Ravishankar-Delp expansion-pruning (Gelfand et al., 1991), for DTs whose nodes contain probability distributions (Lazarides et al., 1996).</Paragraph> </Section> <Section position="7" start_page="28" end_page="29" type="metho"> <SectionTitle> 4 Disperp Experiments </SectionTitle> <Paragraph position="0"> We carried out SCM disperp experiments for the English-Chinese task, in both directions. That is, we trained and tested models both for the distortion of English into Chinese-like phrase order, and for the distortion of Chinese into English-like phrase order.
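To show how disperp numbers of the kind reported in this section could be computed from such trees, here is a small Python sketch; the left-to-right traversal and tree selection follow the description in Section 3, while the model interface (scm.prob) and the perplexity-style formula (exponentiated average negative log probability) are assumptions made for illustration, not taken from the paper.

```python
import math

def disperp(dsh_corpus, scm, n_trees=4):
    """dsh_corpus: list of DSHs, each an ordered list of segments.
    scm.prob(tree_id, remaining, chosen): assumed interface returning the
    probability that the tree assigns to 'chosen' among 'remaining'."""
    total_log_prob = 0.0
    num_choices = 0
    for dsh in dsh_corpus:
        remaining = list(dsh)
        while len(remaining) >= 2:                      # a real choice remains
            tree_id = min(len(remaining), n_trees + 1)  # e.g. 5 stands for "5+"
            chosen = remaining[0]                       # segment actually taken next
            total_log_prob += math.log(scm.prob(tree_id, remaining, chosen))
            num_choices += 1
            remaining.pop(0)
    # Assumed perplexity-style definition: geometric mean of inverse probabilities.
    return math.exp(-total_log_prob / num_choices)
```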
For reasons of space, details about the &quot;distorted English&quot; experiments won't be given here. Training and development data for the distorted Chinese experiments were taken from the NIST 2005 release of the FBIS corpus of Xinhua news stories. The training corpus comprised 62,000 FBIS segment alignments, and the development &quot;dev&quot; corpus comprised a disjoint set of 2,306 segment alignments from the same FBIS corpus.</Paragraph> <Paragraph position="1"> All disperp results are obtained by testing on the &quot;dev&quot; corpus.</Paragraph> <Paragraph position="2"> Figure 5 shows disperp results for the models described earlier. The y axis begins at 1.0 (the minimum possible value of disperp). The x axis shows the number of alignments (DSHs) used to train the DTs, on a log scale. Models A-D are fixed in advance; Model P's single parameter α was optimized once on the entire training set of 62K FBIS alignments (to 0.77) rather than separately for each amount of training data. Model P, the normalized version of Koehn's distortion penalty, is superior to Models A-D, and the DT-based SCM is superior to Model P.</Paragraph> <Paragraph position="3"> The Figure 5 DT-based SCM had four trees (2-choice, 3-choice, 4-choice, and 5+-choice) with position-based and word-based questions. The word-based questions involved only the 100 most frequent Chinese words in the training corpus. The system's disperp drops from 3.1 to 2.8 as the number of alignments goes from 500 to 62K.</Paragraph> <Paragraph position="4"> Figure 6 examines the effect of allowing word-based questions. These questions provide a significant disperp improvement, which grows with the amount of training data.</Paragraph> <Paragraph position="5"> [Figure 6. Distorted Chinese: effect of allowing word questions.] In the &quot;four-DT&quot; results above, examples with five or more segments are handled by the same &quot;5+-choice&quot; tree. Increasing the number of trees allows finer modeling of multi-segment cases while spreading the training data more thinly.</Paragraph> <Paragraph position="6"> Thus, the optimal number of trees depends on the amount of training data. Fixing this amount at 32K alignments, we varied the number of trees. Figure 7 shows that this parameter has a significant impact on disperp, and that questions based on the 100 most frequent Chinese words help performance for any number of trees.</Paragraph> <Paragraph position="7"> In Figure 8, the number of most frequent Chinese words available for questions is varied (for a 13-DT system trained on 32K alignments). Most of the improvement came from the 8 most frequent words, especially from the most frequent, the comma &quot;,&quot;. This behaviour seems to be specific to Chinese. In our &quot;distorted English&quot; experiments, questions about the 8 most frequent words also gave a significant improvement, but each of the 8 words had a fairly equal share in the improvement. [Figure 8. Distorted Chinese: disperp vs. number of words (all trees grown on 32K alignments).] Finally, we grew the DT system used for the MT experiments: one with 13 trees and questions about the 25 most frequent Chinese words, grown on 88K alignments. Its disperp on the &quot;dev&quot; used for the MT experiments (a different &quot;dev&quot; from the one above; see Sec. 5.2) was 2.42 vs.
3.48 for the baseline Model P system: a 30% drop.</Paragraph> </Section> <Section position="8" start_page="29" end_page="29" type="metho"> <SectionTitle> 5 Machine Translation Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="29" end_page="29" type="sub_section"> <SectionTitle> 5.1 SCMs for Decoding </SectionTitle> <Paragraph position="0"> SCMs assume that the source sentence is fully segmented throughout decoding. Thus, the system must guess the segmentation of the unconsumed part of the source (the &quot;remaining source&quot;: RS). For the results below, we used a simple heuristic: the RS is broken into one-word segments. In the future, we will apply a more realistic segmentation model to the RS (or modify DT training to accurately reflect the treatment of the RS during decoding).</Paragraph> </Section> <Section position="2" start_page="29" end_page="29" type="sub_section"> <SectionTitle> 5.2 Chinese-to-English MT Experiments </SectionTitle> <Paragraph position="0"> The training corpus for the MT system's phrase tables consists of all parallel text available for the NIST MT05 Chinese-English evaluation, except the Xinhua corpora and part 3 of LDC's &quot;Multiple-Translation Chinese Corpus&quot; (MTCCp3). The English language model was trained on the same corpora, plus 250M words from Gigaword. The DT-based SCM was trained and tuned on a subset of this same training corpus (above). The dev corpus for optimizing component weights is MTCCp3.</Paragraph> <Paragraph position="1"> The experimental results below were obtained by testing on the evaluation set for MTeval NIST04.</Paragraph> <Paragraph position="2"> Phrase tables were learned from the training corpus using the &quot;diag-and&quot; method (Koehn et al., 2003), with IBM model 2 used to produce the initial word alignments (these authors found this worked as well as IBM model 4). Phrase probabilities were based on unsmoothed relative frequencies. The model used by the decoder was a log-linear combination of a phrase translation model (only in the P(source|target) direction), a trigram language model, a word penalty (lexical weighting), an optional segmentation model (in the form of a phrase penalty), and a distortion model. Weights on the components were assigned using the (Och, 2003) method for max-BLEU training on the development set. The decoder uses a dynamic-programming beam search, like the one in (Koehn, 2004). Future-cost estimates for all distortion models are assigned using the baseline penalty model.</Paragraph> </Section> <Section position="3" start_page="29" end_page="29" type="sub_section"> <SectionTitle> 5.3 Decoding Results </SectionTitle> <Paragraph position="0"> &quot;DP&quot; systems use the distortion penalty in (Koehn, 2004) with α optimized on &quot;dev&quot;, while &quot;DT&quot; systems use the DT-based SCM. &quot;1x&quot; is the default beam width, while &quot;4x&quot; is a wider beam (our notation reflects decoding time, so &quot;4x&quot; takes four times as long as &quot;1x&quot;). &quot;PP&quot; denotes the presence of the phrase penalty component. The advantage of DTs, as measured by the difference between the score of the best DT system and that of the best DP system, is 0.75 BLEU at 1x and 0.5 BLEU at 4x. With a 95% bootstrap confidence interval of ±0.7 BLEU (based on 1000-fold resampling), the resolution of these results is too coarse to draw firm conclusions.</Paragraph> <Paragraph position="1"> Thus, we carried out another 1000-fold bootstrap resampling test on NIST04, this time for pairwise system comparison.
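For readers unfamiliar with the pairwise test, the following short Python sketch illustrates 1000-fold bootstrap resampling for comparing two systems; the bleu() scoring function and the data layout are placeholders, and the sketch shows the general technique rather than the authors' exact procedure.

```python
import random

def pairwise_bootstrap(refs, hyps_a, hyps_b, bleu, folds=1000, seed=0):
    """Estimate how often system A outscores system B when the test set is
    resampled with replacement. refs, hyps_a, hyps_b are parallel lists of
    segments; bleu(hyps, refs) is any corpus-level metric (placeholder)."""
    rng = random.Random(seed)
    n = len(refs)
    a_wins = 0
    for _ in range(folds):
        idx = [rng.randrange(n) for _ in range(n)]   # one bootstrap sample
        sample_refs = [refs[i] for i in idx]
        score_a = bleu([hyps_a[i] for i in idx], sample_refs)
        score_b = bleu([hyps_b[i] for i in idx], sample_refs)
        if score_a > score_b:
            a_wins += 1
    return a_wins / folds    # fraction of folds in which A beat B
```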
Table 1 shows results for BLEU comparisons between the systems with the default (1x) beam. The entries show how often the A system (columns) had a better score than the B system (rows). The table shows that both DT-based 1x systems performed better than either of the DP systems more than 99% of the time (underlined results).</Paragraph> <Paragraph position="2"> Though not shown in the table, the same was true with 4x beam search. The DT 1x system with a phrase penalty had a higher score than the DT 1x system without one about 66% of the time.</Paragraph> </Section> </Section> </Paper>