<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1077"> <Title>Tree-to-String Alignment Template for Statistical Machine Translation</Title> <Section position="4" start_page="609" end_page="611" type="metho"> <SectionTitle> 2 Tree-to-String Alignment Template </SectionTitle> <Paragraph position="0"> A tree-to-string alignment template z is a triple <~T, ~S, ~A>, which describes the alignment ~A between a source parse tree ~T = T(F_1^J') 2 and a target string ~S = E_1^I'. A source string F_1^J', which is the sequence of leaf nodes of T(F_1^J'), consists of both terminals (source words) and non-terminals (phrasal categories). A target string E_1^I' is also composed of both terminals (target words) and non-terminals (placeholders). An alignment ~A is defined as a subset of the Cartesian product of source and target symbol positions.</Paragraph> <Paragraph position="2"> 2 We use T(·) to denote a parse tree. To reduce notational overhead, we use T(z) to represent the parse tree in z. Similarly, S(z) denotes the string in z.</Paragraph> <Paragraph position="3"> Figure 1 shows three TATs automatically learned from training data. Note that when demonstrating a TAT graphically, we represent non-terminals in the target strings by blanks.</Paragraph> <Paragraph position="4"> In the following, we formally describe how to introduce tree-to-string alignment templates into probabilistic dependencies to model Pr(e_1^I|f_1^J) 3. In a first step, we introduce the hidden variable T(f_1^J) that denotes a parse tree of the source sentence.</Paragraph> <Paragraph position="6"> Next, another hidden variable D is introduced to detach the source parse tree T(f_1^J) into a sequence of K subtrees ~T_1^K with a preorder traversal. We assume that each subtree ~T_k produces a target string ~S_k. As a result, the sequence of subtrees ~T_1^K produces a sequence of target strings ~S_1^K, which can be combined serially to generate the target sentence e_1^I. We assume that the target sentence e_1^I is actually generated by the derivation of ~S_1^K. Note that we omit an explicit dependence on the detachment D to avoid notational overhead.</Paragraph> <Paragraph position="11"> 3 We use the symbol Pr(·) to denote general probability distributions with no specific assumptions. In contrast, for model-based probability distributions, we use the generic symbol p(·).</Paragraph> <Paragraph position="12"> To further decompose Pr(~S|~T), the tree-to-string alignment template, denoted by the variable z, is introduced as a hidden variable.</Paragraph> <Paragraph position="14"> Therefore, the TAT-based translation model can be decomposed into four sub-models:
1. parse model: Pr(T(f_1^J)|f_1^J)
2. detachment model: Pr(D|T(f_1^J), f_1^J)
3. TAT selection model: Pr(z|~T)
4. TAT application model: Pr(~S|z, ~T)</Paragraph> <Paragraph position="15"> Figure 2 shows how TATs work to perform translation. First, the input source sentence is parsed. Next, the parse tree is detached into five subtrees with a preorder traversal. For each subtree, a TAT is selected and applied to produce a string. Finally, these strings are combined serially to generate the translation (we use X to denote the non-terminal).</Paragraph> <Paragraph position="16"> Following Och and Ney (2002), we base our model on a log-linear framework. Hence, all knowledge sources are described as feature functions that include the given source string f_1^J, the target string e_1^I, and hidden variables.</Paragraph>
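To make the definitions above concrete, here is a minimal sketch in Python. It is not from the paper: the class and function names and the example values are our own illustration of a TAT as a data object and of the unnormalized log-linear score that combines feature functions h_m with weights lambda_m.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ParseTree:
    label: str                                 # phrasal category or source word
    children: List["ParseTree"] = field(default_factory=list)

@dataclass
class TAT:
    # a tree-to-string alignment template <~T, ~S, ~A>
    tree: ParseTree                            # source parse tree ~T (leaves mix words and categories)
    target: List[str]                          # target string ~S (words and "X" placeholders)
    alignment: List[Tuple[int, int]]           # ~A: (source leaf position, target position) pairs

def log_linear_score(features: List[float], weights: List[float]) -> float:
    # unnormalized log-linear score: sum over m of lambda_m * h_m(f, e, hidden variables)
    return sum(lam * h for lam, h in zip(weights, features))

# Hypothetical example: a TAT whose source tree is NP(NR) and whose target side is "President Bush"
example = TAT(tree=ParseTree("NP", [ParseTree("NR")]),
              target=["President", "Bush"],
              alignment=[(1, 1), (1, 2)])

In this reading, the sub-models listed above contribute feature values that the decoder accumulates for each candidate translation and combines with log_linear_score.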
<Paragraph position="17"> The hidden variable T(f_1^J) is omitted because we usually make use of only the single best output of a parser. As we assume that all detachments have the same probability, the hidden variable D is also omitted. As a result, the model we actually adopt for experiments is limited because the parse, detachment, and TAT application sub-models are simplified.</Paragraph> <Paragraph position="18"> For our experiments we use the following seven feature functions 4 that are analogous to the default feature set of Pharaoh (Koehn, 2004). To simplify the notation, we omit the dependence on the hidden variables of the model.</Paragraph> <Paragraph position="20"> 4 When computing lexical weighting features (Koehn et al., 2003), we take only terminals into account. If there are no terminals, we set the feature value to 1. We use lex(·) to denote lexical weighting. We denote the number of TATs used for decoding by K and the length of the target string by I.</Paragraph> </Section> <Section position="5" start_page="611" end_page="612" type="metho"> <SectionTitle> 3 Training </SectionTitle> <Paragraph position="0"> To extract tree-to-string alignment templates from a word-aligned, source-side parsed sentence pair <T(f_1^J), e_1^I, A>, we first need to identify TSAs (Tree-String Alignments) using a criterion similar to that suggested in (Och and Ney, 2004). A TSA is a triple <T(f_{j1}^{j2}), e_{i1}^{i2}, Ā> that is in accordance with the following constraints:
1. ∀(i, j) ∈ A : i1 ≤ i ≤ i2 ↔ j1 ≤ j ≤ j2
2. T(f_{j1}^{j2}) is a subtree of T(f_1^J)
Given a TSA <T(f_{j1}^{j2}), e_{i1}^{i2}, Ā>, a triple <T(f_{j3}^{j4}), e_{i3}^{i4}, Â> is its sub-TSA if and only if:
1. <T(f_{j3}^{j4}), e_{i3}^{i4}, Â> is a TSA
2. T(f_{j3}^{j4}) is rooted at a direct descendant of the root node of T(f_{j1}^{j2})
3. i1 ≤ i3 ≤ i4 ≤ i2
4. ∀(i, j) ∈ Ā : i3 ≤ i ≤ i4 ↔ j3 ≤ j ≤ j4
Basically, we extract TATs from a TSA <T(f_{j1}^{j2}), e_{i1}^{i2}, Ā> using the following two rules:
1. If T(f_{j1}^{j2}) contains only one node, then <T(f_{j1}^{j2}), e_{i1}^{i2}, Ā> is a TAT.
2. If the height of T(f_{j1}^{j2}) is greater than one, then build TATs using those extracted from the sub-TSAs of <T(f_{j1}^{j2}), e_{i1}^{i2}, Ā>.</Paragraph> <Paragraph position="1"> Usually, we can extract a very large number of TATs from training data using the above rules, making both training and decoding very slow. Therefore, we impose three restrictions to reduce the number of extracted TATs:
1. A third constraint is added to the definition of TSA: both the first and the last symbols in the target string must be aligned to some source symbols.
2. The height of T(z) is limited to no greater than h.
3. The number of direct descendants of a node of T(z) is limited to no greater than c.
Table 1 shows the TATs extracted from the TSA in Figure 3 with h = 2 and c = 2.</Paragraph> <Paragraph position="6"> As we restrict that T(f_{j1}^{j2}) must be a subtree of T(f_1^J), TATs may be treated as syntactic hierarchical phrase pairs (Chiang, 2005) with tree structure on the source side. At the same time, we face the risk of losing some useful non-syntactic phrase pairs. For example, the phrase pair whose English side is "President Bush made" can never be obtained in the form of a TAT from the TSA in Figure 3 because there is no subtree for that source string.</Paragraph> </Section>
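As a rough illustration of the extraction procedure just described, a Python sketch might look as follows. This is not the authors' code: it assumes that TSA identification has already been carried out and recorded in each node's span attribute, it simplifies rule 2 by emitting a single abstracted template per admissible internal node, and it omits the alignments of the resulting templates.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    span: Optional[Tuple[int, int]] = None     # consistent target span (i1, i2), or None

def height(n: Node) -> int:
    return 1 if not n.children else 1 + max(height(c) for c in n.children)

def extract_tats(node: Node, words: List[str], h: int = 3, c: int = 5):
    # Returns (source root label, target string) pairs extracted from the
    # subtree rooted at `node`; alignments are omitted to keep the sketch short.
    tats = []
    if node.span is not None:
        i1, i2 = node.span
        if not node.children:
            # Rule 1: a one-node tree and its aligned target words form a TAT.
            tats.append((node.label, words[i1:i2 + 1]))
        elif height(node) <= h and len(node.children) <= c:
            # Rule 2 (simplified): abstract each consistently aligned child
            # subtree into a placeholder "X", keeping the remaining target words.
            covered = sorted(ch.span for ch in node.children if ch.span is not None)
            target, pos = [], i1
            for a, b in covered:
                target.extend(words[pos:a])
                target.append("X")
                pos = b + 1
            target.extend(words[pos:i2 + 1])
            tats.append((node.label, target))
    # TATs are also extracted from the sub-TSAs rooted at the children.
    for ch in node.children:
        tats.extend(extract_tats(ch, words, h, c))
    return tats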
<Section position="6" start_page="612" end_page="612" type="metho"> <SectionTitle> 4 Decoding </SectionTitle> <Paragraph position="0"> We approach the decoding problem as a bottom-up beam search.</Paragraph> <Paragraph position="1"> To translate a source sentence, we employ a parser to produce a parse tree. Moving bottom-up through the source parse tree, we compute a list of candidate translations for the input subtree rooted at each node with a postorder traversal. Candidate translations of subtrees are placed in stacks according to the root index assigned by the postorder traversal; Figure 4 shows the organization of the candidate translation stacks.</Paragraph> <Paragraph position="2"> A candidate translation contains the following information:
1. the partial translation
2. the accumulated feature values
3. the accumulated probability
A TAT z is usable for a parse tree T if and only if T(z) is rooted at the root of T and covers a portion of the nodes of T. Given a parse tree T, we find all usable TATs. Given a usable TAT z, if T(z) is equal to T, then S(z) is a candidate translation of T. If T(z) covers only a portion of T, we have to compute a list of candidate translations for T by replacing the non-terminals of S(z) with candidate translations of the corresponding uncovered subtrees.</Paragraph> <Paragraph position="3"> For example, when computing the candidate translations for the tree rooted at node 8, the TAT used in Figure 5 covers only a portion of the parse tree in Figure 4. There are two uncovered subtrees, rooted at node 2 and node 7 respectively. Hence, we replace the third symbol with the candidate translations in stack 2 and the first symbol with the candidate translations in stack 7. At the same time, the feature values and probabilities are also accumulated for the new candidate translations.</Paragraph> <Paragraph position="4"> To speed up the decoder, we limit the search space by reducing the number of TATs used for each input node. There are two ways to limit the TAT table size: by a fixed limit (tatTable-limit) on how many TATs are retrieved for each input node, and by a probability threshold (tatTable-threshold) that specifies that the TAT probability has to be above some value. On the other hand, instead of keeping the full list of candidates for a given node, we keep a top-scoring subset of the candidates. This can also be done by a fixed limit (stack-limit) or a threshold (stack-threshold). To perform recombination, we combine candidate translations that share the same leading and trailing bigrams in each stack.</Paragraph> </Section>
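The bottom-up beam search over candidate stacks can be sketched as follows. This is not the Lynx implementation: each node is assumed to carry a postorder index node.index and a list node.children, usable_tats and combine are hypothetical helpers supplied by the caller, and TAT-table pruning and bigram-based recombination are left out for brevity.

import heapq
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Candidate:
    translation: List[str]     # the partial translation
    score: float               # the accumulated (log-linear) score

def decode(root, usable_tats, combine: Callable[[float, float], float],
           stack_limit: int = 100):
    # Bottom-up beam search over the source parse tree. usable_tats(node) is
    # assumed to yield (target, uncovered, tat_score) triples, where `target`
    # mixes target words with the placeholder "X" and `uncovered` lists the
    # uncovered subtrees in the order their placeholders must be filled.
    stacks: Dict[int, List[Candidate]] = {}

    def visit(node):
        for child in node.children:
            visit(child)
        cands: List[Candidate] = []
        for target, uncovered, tat_score in usable_tats(node):
            partial = [Candidate([], tat_score)]
            fill = iter(uncovered)
            for sym in target:
                if sym == "X":
                    sub = next(fill)      # fill the placeholder from the stack
                    partial = [           # of the corresponding uncovered subtree
                        Candidate(p.translation + c.translation,
                                  combine(p.score, c.score))
                        for p in partial for c in stacks[sub.index]
                    ]
                else:
                    partial = [Candidate(p.translation + [sym], p.score)
                               for p in partial]
            cands.extend(partial)
        # stack pruning: keep only a top-scoring subset of the candidates
        stacks[node.index] = heapq.nlargest(stack_limit, cands, key=lambda c: c.score)

    visit(root)
    return stacks[root.index][0] if stacks[root.index] else None

Recombination of candidates that share the same leading and trailing bigrams, and the tatTable-limit and tatTable-threshold cutoffs, would be applied on top of this skeleton.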
<Section position="7" start_page="612" end_page="614" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> Our experiments were on Chinese-to-English translation. The training corpus consists of 31,149 sentence pairs with 843,256 Chinese words and 949,583 English words. For the language model, we used the SRI Language Modeling Toolkit (Stolcke, 2002) to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the 31,149 English sentences. We selected 571 short sentences from the 2002 NIST MT Evaluation test set as our development corpus, and used the 2005 NIST MT Evaluation test set as our test corpus. We evaluated the translation quality using the BLEU metric (Papineni et al., 2002), as calculated by mteval-v11b.pl with its default settings except that we used case-sensitive matching of n-grams.</Paragraph> <Section position="1" start_page="613" end_page="613" type="sub_section"> <SectionTitle> 5.1 Pharaoh </SectionTitle> <Paragraph position="0"> The baseline system we used for comparison was Pharaoh (Koehn et al., 2003; Koehn, 2004), a freely available decoder for phrase-based translation models.</Paragraph> <Paragraph position="2"> We ran GIZA++ (Och and Ney, 2000) on the training corpus in both directions using its default settings, and then applied the refinement rule "diag-and" described in (Koehn et al., 2003) to obtain a single many-to-many word alignment for each sentence pair. After that, we used some heuristics, including rule-based translation of numbers, dates, and person names, to further improve the alignment accuracy.</Paragraph> <Paragraph position="3"> Given the word-aligned bilingual corpus, we obtained 1,231,959 bilingual phrases (221,453 used on the test corpus) using the training toolkits publicly released by Philipp Koehn with their default settings.</Paragraph> <Paragraph position="4"> We performed minimum error rate training (Och, 2003) to tune the feature weights to maximize the system's BLEU score on the development set.</Paragraph> </Section> <Section position="2" start_page="613" end_page="613" type="sub_section"> <SectionTitle> 5.2 Lynx </SectionTitle> <Paragraph position="0"> On the same word-aligned training data, it took us about one month to parse all the 31,149 Chinese sentences using a Chinese parser written by Deyi Xiong (Xiong et al., 2005). The parser was trained on articles 1-270 of Penn Chinese Treebank version 1.0 and achieved 79.4% (F1 measure), as well as a 4.4% relative decrease in error rate. Then, we performed the TAT extraction described in section 3 with h = 3 and c = 5 and obtained 350,575 TATs (88,066 used on the test corpus). To run our decoder Lynx on the development and test corpora, we set tatTable-limit = 20, tatTable-threshold = 0, stack-limit = 100, and stack-threshold = 0.00001.</Paragraph> </Section>
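Purely for illustration, the settings reported above could be collected in a small dictionary; the key names below are hypothetical, since the paper does not describe Lynx's actual configuration interface.

# Hypothetical settings mirroring the values reported in this section;
# Lynx's real configuration format is not described in the paper.
lynx_settings = {
    "tatTable-limit": 20,       # at most 20 TATs retrieved per input node
    "tatTable-threshold": 0.0,  # no probability cutoff when retrieving TATs
    "stack-limit": 100,         # keep at most 100 candidates per node
    "stack-threshold": 1e-5,    # probability threshold for stack pruning
    "h": 3,                     # maximum TAT tree height used during extraction
    "c": 5,                     # maximum number of direct descendants per TAT node
}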
<Section position="3" start_page="613" end_page="614" type="sub_section"> <SectionTitle> 5.3 Results </SectionTitle> <Paragraph position="0"> Table 2 shows the results on the test set using Pharaoh and Lynx with different feature settings. The 95% confidence intervals were computed using Zhang's significance tester (Zhang et al., 2004). We modified it to conform to NIST's current definition of the BLEU brevity penalty. For Pharaoh, eight features were used: distortion model d, a trigram language model lm, phrase translation probabilities ph(f|e) and ph(e|f), lexical weightings lex(f|e) and lex(e|f), phrase penalty pp, and word penalty wp. For Lynx, the seven features described in section 2 were used. We find that Lynx outperforms Pharaoh with all feature settings. With full features, Lynx achieves an absolute improvement of 0.006 over Pharaoh (3.1% relative). This difference is statistically significant (p < 0.01). Note that Lynx made use of only 88,066 TATs on the test corpus, while 221,453 bilingual phrases were used for Pharaoh.</Paragraph> <Paragraph position="1"> The feature weights obtained by minimum error rate training for both Pharaoh and Lynx are shown in Table 3. We find that ph(f|e) (i.e. h2) is not a helpful feature for Lynx. The reason is that we use only a single non-terminal symbol instead of assigning phrasal categories to the target string. In addition, we allow the target string to consist of only non-terminals, so translation decisions are not always based on lexical evidence.</Paragraph> </Section> <Section position="4" start_page="614" end_page="614" type="sub_section"> <SectionTitle> 5.4 Using bilingual phrases </SectionTitle> <Paragraph position="0"> It is interesting to use bilingual phrases to strengthen the TAT-based model. As we mentioned before, some useful non-syntactic phrase pairs can never be obtained in the form of a TAT because we require that there be a corresponding parse tree for the source phrase. Moreover, it takes more time to obtain TATs than bilingual phrases on the same training data because parsing is usually very time-consuming.</Paragraph> <Paragraph position="1"> Given an input subtree T(F_{j1}^{j2}), if F_{j1}^{j2} is a string of terminals, we find all bilingual phrases whose source phrase is equal to F_{j1}^{j2}. Then we build a TAT for each such bilingual phrase <f_1^J', e_1^I', Â>: the tree of the TAT is T(F_{j1}^{j2}), the string is e_1^I', and the alignment is Â. If a TAT built from a bilingual phrase is the same as a TAT in the TAT table, we prefer the one with the greater translation probabilities.</Paragraph> <Paragraph position="2"> Table 4 shows the effect of using bilingual phrases for Lynx. Note that these bilingual phrases are the same as those used for Pharaoh.</Paragraph> </Section> </Section> </Paper>